Keith's Assignment 14

SemaCS: A Semantic Classification Service and Framework for Digital Objects
Keith E. Maull
Dissertation Abstract

The explosion of digital libraries and repositories for digital objects has brought access and exposure to a wide array of digital material, from etexts to historical photos to scientific data. While it may be argued that the quality of materials decreases as their quantity increases, a larger problem is that such quality can never be measured if 1) objects in the repository are not classified according to a meaningful classification scheme, 2) the objects cannot be made useful in the end-user contexts that would benefit from access to them, and 3) the objects cannot be discovered or repurposed for contexts other than end-user contexts, for example as inputs to autonomous service agents that build new contexts outside the repository. This dissertation proposes that ontological classification is a powerful mechanism for contextualizing digital resources.

This dissertation proposes an approach to gaining control over digital objects and broadening their scope, so that these resources are classified in ways that make their discovery easier and their end use more powerful. A framework, the Semantic Classification Service or SemaCS, is proposed as a mechanism to guide digital repositories toward deeper contextualization of resources. The framework is broken into three components: analysis, classification, and contextualization of digital resources. Analysis is performed first through traditional metadata analysis and then through deep content analysis based on resource type. In this study, text documents are the focus of such deep analysis. Once analyzed, each resource is classified by being assigned an initial ontology, using a combination of text analysis methods and text similarity based on ontological hierarchy. This initial ontology is made available as the primary context for the resource, but within the SemaCS framework multiple ontologies may be attached to a single resource. Resources are then contextualized further by being analyzed against other resources with similar ontological affinity.
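The three components described above can be illustrated with a minimal sketch. This is not the SemaCS implementation: the ontology vocabularies, the function names, and the use of Jaccard similarity as the text-similarity measure are all assumptions made here for illustration, and the abstract does not specify any of them.

```python
import re
from collections import defaultdict

# Hypothetical toy ontologies: each is reduced to a flat set of terms.
# A real ontology would carry hierarchy; that is omitted in this sketch.
ONTOLOGIES = {
    "atmospheric_science": {"cloud", "aerosol", "climate", "precipitation", "wind"},
    "library_science": {"metadata", "catalog", "repository", "archive", "library"},
}


def analyze(resource):
    """Analysis: gather terms from the metadata fields, then from the text body."""
    parts = [
        resource.get("title", ""),
        " ".join(resource.get("keywords", [])),
        resource.get("body", ""),
    ]
    return set(re.findall(r"[a-z]+", " ".join(parts).lower()))


def classify(terms, threshold=0.0):
    """Classification: rank ontologies by Jaccard similarity to the terms.

    The first entry plays the role of the initial (primary) ontology;
    further entries show that several ontologies may be attached.
    """
    scores = {}
    for name, vocab in ONTOLOGIES.items():
        union = terms | vocab
        scores[name] = len(terms & vocab) / len(union) if union else 0.0
    matched = [name for name in scores if scores[name] > threshold]
    return sorted(matched, key=lambda name: -scores[name])


def contextualize(classified):
    """Contextualization: group resource ids that share an ontology."""
    groups = defaultdict(list)
    for resource_id, ontologies in classified.items():
        for ontology in ontologies:
            groups[ontology].append(resource_id)
    return dict(groups)
```

In this sketch, a resource would pass through `analyze`, then `classify`, and the resulting mapping of resource ids to ontology lists would be handed to `contextualize` to find resources with shared ontological affinity.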

The framework is exercised on three data sets. The first is a scientific data repository containing scientific papers, electronic notes, textual data sets, and archived online discussions on atmospheric research performed at the National Center for Atmospheric Research. The second is extracted from a large institutional repository at the University of Colorado and comprises heterogeneous content such as academic papers, student projects, departmental reports, images, and data sets. The third aggregates institutional repositories whose content is held within the Colorado Alliance of Research Libraries. Objects from these three sets are processed and classified through SemaCS. It is shown that metadata alone is not enough to provide deep classification of resources. Furthermore, resource similarity classification is improved dramatically through the use of SemaCS, which in turn improves the use of such resources in a wider array of contexts. Finally, it is shown that SemaCS provides a mechanism for a new class of harvesting agents to bring new life to digital resources through the classification power of the framework.

Last modified 11 December 2007 at 1:29 am by K:M