About MyLibrary@OCKHAM
MyLibrary@OCKHAM is a set of searchable databases (indexes) created from the content of the "hidden Web" for the purposes of facilitating teaching, learning, and research. This page describes the hows and whys of the system, and it is divided into the following parts: How to search, Purpose, How the system was created, and Downloads.
How to search
Enter a word or phrase. For best results, enclose phrases in double quotes, but for simple searches, this is not necessary. Examples include:
- origami
- "pulmonary disease"
Search results will be returned in relevance-ranked order, and each result includes the item's title, creator, URL, narrative description, and a list of keywords.
Each URL is associated with a link to Google, which will try to identify similar pages found on the Internet. Each individual keyword is linked to a pre-created search within the current index. A combination of all the keywords is linked to a second pre-created search, also against the current index. Search results may include sets of alternative spellings and possible synonyms. Each of these items is a hyperlink, allowing you to search the current index for the selected term. Through the use of these features you should be able to begin to address the perennial problem of "finding more like this one."
You can do wildcard searching through the use of the asterisk (*) character. An example includes:
- librar*
This will locate records with the words librarian, libraries, and/or librarianship.
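The effect of a trailing wildcard can be illustrated with a minimal Python sketch. It uses the standard library's fnmatch module purely for illustration; the index itself handles wildcards in its own way, and the word list and the function name expand_wildcard are made up for this example.

```python
from fnmatch import fnmatch

# Hypothetical sample of indexed words; not taken from the real index.
words = ["librarian", "libraries", "librarianship", "library", "liberty"]

def expand_wildcard(pattern, vocabulary):
    """Return the words in vocabulary that match a pattern such as 'librar*'."""
    return [word for word in vocabulary if fnmatch(word, pattern)]

matches = expand_wildcard("librar*", words)
print(matches)
```

Here "liberty" is not matched because it does not begin with the stem "librar".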
By default, searches are applied to entire records, but you can limit your searches to particular parts of records through the use of field searches. Records are divided into titles, creators, and descriptions. If you want to limit your search to one of these parts, then you could enter queries like these:
- title = salmon
- creator = hocking
- description = "terrestrial ecosystems"
The system also supports standard Boolean logic and nesting. Consequently, queries like this are valid:
- simulation and measurements
- experiments or tests
- music not beginner
- sepsis and creator=Gerlach
- "routine practice" and title=energy
- (experiments or tests) and (dna or "desoxyribonucleic acid")
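One way to think about Boolean queries is as set operations over the records that contain each word: "and" is an intersection, "or" is a union, and "not" is a difference. The following Python sketch shows this idea under simplifying assumptions. The records are invented, the helper names build_index and ids are hypothetical, and phrase and field searching are left out; it is not how the real indexer is implemented.

```python
# Hypothetical records (id -> text); not real MyLibrary data.
records = {
    1: "dna experiments in yeast",
    2: "tests of acid extraction",
    3: "music for the beginner",
    4: "music theory experiments",
}

def build_index(docs):
    """Map each word to the set of record ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

index = build_index(records)

def ids(word):
    """Return the set of record ids containing word (empty if unknown)."""
    return index.get(word, set())

# (experiments or tests) and dna
q1 = (ids("experiments") | ids("tests")) & ids("dna")

# music not beginner
q2 = ids("music") - ids("beginner")

print(sorted(q1), sorted(q2))
```

Nesting with parentheses simply controls the order in which the set operations are applied.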
Purpose
The purpose of MyLibrary@OCKHAM is to demonstrate ways a Find More Like This One service can be implemented through the use of "light-weight" protocols and open source software.
In other words, using sets of well-established computing techniques and freely available software, this system strives to enable people to get more out of scholarly research by proactively offering "intelligent" suggestions for improving their searches and identifying similar items.
How the system was created
The system was created and is currently maintained through a process divided into the following steps:
- identify items for the collection
- create a controlled vocabulary for organization
- fill the collection
- index content
- provide services against the index (search, alternative spellings, possible synonyms, etc.)
- evaluate
- go to Step #1
1. The first step is deciding what content to collect.
For these purposes the OAI content (part of the "hidden Web") available from the NSDL OAI Repository is the basis of the collection. The Repository is comprised of many OAI sets, and the content of each of these sets is selectively identified for addition, but the system is flexible enough to accommodate the content of any freely-accessible OAI data repository. OAI is considered to be the first of the "light-weight" protocols implemented here.
2. The next step is to create a structure for organizing the content.
MyLibrary is used for this purpose, and it is essentially a set of object-oriented Perl modules providing input/output against a relational database. The database itself is composed of many tables, but the most important are the resources table, the facets table, and the terms table. The resources table is composed essentially of Dublin Core elements. The facets and terms tables contain sets of controlled vocabulary terms. The system allows for the creation of an unlimited number of values for facets and terms, and there is a many-to-many relationship between resource records and these controlled vocabulary term combinations. Consequently, each resource record can be described with any number of words/phrases.
Thus, Step #2 is to create a set of terms to be used to classify and catalog each of the items from each of the sets of the NSDL OAI Repository. For these purposes terms such as articles, technical reports, the names of each of the OAI sets, mathematics, medicine, biological science, computer science, etc. were created. A complete list of the facets and terms used to classify each of the OAI repositories, as well as the number of records associated with each facet/term combination, is available at http://mylibrary.ockham.org/?cmd=facets.
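The many-to-many relationship between resources and facet/term combinations can be sketched with a small in-memory SQLite database in Python. This is only an illustration: the table layout, column names, the resource_terms join table, and the sample rows are assumptions made for this example and are not the actual MyLibrary schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema; the real MyLibrary tables differ.
conn.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, title TEXT, creator TEXT, description TEXT);
CREATE TABLE facets (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE terms (id INTEGER PRIMARY KEY, facet_id INTEGER REFERENCES facets(id), name TEXT);
CREATE TABLE resource_terms (
    resource_id INTEGER REFERENCES resources(id),
    term_id INTEGER REFERENCES terms(id),
    PRIMARY KEY (resource_id, term_id)
);
""")

# Invented sample data.
conn.executemany("INSERT INTO facets VALUES (?, ?)", [(1, "Subjects"), (2, "Formats")])
conn.executemany("INSERT INTO terms VALUES (?, ?, ?)",
                 [(1, 1, "mathematics"), (2, 2, "articles")])
conn.execute("INSERT INTO resources VALUES (1, 'A sample title', 'A. Author', 'A sample description')")

# One resource described by two facet/term combinations.
conn.executemany("INSERT INTO resource_terms VALUES (?, ?)", [(1, 1), (1, 2)])

rows = conn.execute("""
SELECT f.name, t.name
FROM resource_terms rt
JOIN terms t ON t.id = rt.term_id
JOIN facets f ON f.id = t.facet_id
WHERE rt.resource_id = ?
ORDER BY f.name
""", (1,)).fetchall()
print(rows)
```

Because the join table holds only pairs of ids, any resource can carry any number of terms, and any term can describe any number of resources.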
3. To fill the collection, Step #3, we wrote an OAI harvesting program.
The harvesting program takes an OAI repository, a set name, and a number of MyLibrary terms as input. It harvests each of the items from the given set and saves the resulting OAI/Dublin Core data to the resources table of MyLibrary. Each of these resources is also classified/cataloged with the given MyLibrary terms. For example, each record from the Project Euclid set is saved to the MyLibrary resources table and each record is classified/cataloged as mathematics, articles, and Project Euclid. Similarly, BioOne can be harvested, and its items can be classified/cataloged as biological science, articles, and BioOne.
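The general shape of such a harvester can be sketched in Python. The sketch below builds a standard OAI-PMH ListRecords request and extracts a few Dublin Core fields from a response, attaching the given terms to each record. It is not the harvesting program itself: the base URL, the set name, the sample XML, and the function names are all assumptions for illustration, and a real harvester would also fetch the URL over HTTP and follow resumption tokens.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, set_name):
    """Build an OAI-PMH ListRecords request for one set, asking for Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_name}
    return base_url + "?" + urlencode(params)

def first(record, tag):
    """Return the text of the first Dublin Core element named tag, or None."""
    element = record.find(".//" + DC_NS + tag)
    return element.text if element is not None else None

def parse_records(xml_text, terms):
    """Turn an OAI-PMH response into resource dictionaries tagged with terms."""
    root = ET.fromstring(xml_text)
    resources = []
    for record in root.iter(OAI_NS + "record"):
        resources.append({
            "title": first(record, "title"),
            "creator": first(record, "creator"),
            "description": first(record, "description"),
            "url": first(record, "identifier"),
            "terms": list(terms),
        })
    return resources

# Invented response, trimmed to the parts used here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample title</dc:title>
          <dc:creator>A. Author</dc:creator>
          <dc:identifier>http://example.org/item/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://example.org/oai", "euclid")
harvested = parse_records(SAMPLE, ["mathematics", "articles", "Project Euclid"])
print(request_url)
print(harvested[0]["title"], harvested[0]["terms"])
```

In the real system the resulting dictionaries would be written to the resources table rather than printed.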
4. To index the content, a report is written against the MyLibrary database and passed on to an indexer called Plucene. It is Plucene that provides all the searching functionality.
5. A number of services are then provided against the index:
- search - A Search/Retrieve via URL (SRU) interface provides function-rich access to the index, and SRU is considered to be another one of the "light-weight" protocols implemented here. (A simpler SRU interface, useful for testing purposes, is available at http://mylibrary.ockham.org/simple/.)
- alternative spellings - Each word in the index(es) is added to a set of ASPELL dictionaries. Incoming queries are parsed and "spell checked" against the content of a dictionary. Based on the output additional queries are returned in the search results.
- possible synonyms - Incoming queries are parsed and searched against a local copy of the WordNet database. Like the alternative spellings, the output of this process is converted into additional searches that can be applied to the current index.
- keywords - Each word from each record of the MyLibrary database is compared to the total number of times that word appears in the entire index. The result is a numeric relevancy score associated with each word. The five most relevant words from each record are then saved to the database as keywords, and as records are returned in search results the keywords are marked up and made automatically searchable against the index.
- Google searches - Each record is associated with a URL. Google allows users to find similar items through a specific "related" query syntax. As search results are returned the associated URLs are marked-up with this syntax providing another way of finding similar items on the Internet.
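Several of the services above amount to small computations or URL constructions, and a rough Python sketch may make them more concrete. The scoring formula (a word's count in the record divided by its count in the whole index) is an assumption chosen to match the description of the keywords service, not the system's actual formula. The function names, sample counts, and base URLs are likewise hypothetical. The SRU parameters (operation, version, query) come from the SRU 1.1 specification, and the Google link relies on the "related:" query prefix.

```python
from collections import Counter
from urllib.parse import urlencode

def top_keywords(record_text, index_counts, n=5):
    """Rank a record's words by (count in record) / (count in whole index)."""
    counts = Counter(record_text.lower().split())
    scores = {
        word: count / index_counts[word]
        for word, count in counts.items()
        if index_counts.get(word)  # skip words missing from the index counts
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda word: (-scores[word], word))[:n]

def sru_url(base_url, query):
    """Build an SRU 1.1 searchRetrieve request for a query."""
    params = {"operation": "searchRetrieve", "version": "1.1", "query": query}
    return base_url + "?" + urlencode(params)

def related_link(url):
    """Build a Google search link using the 'related:' query prefix."""
    return "https://www.google.com/search?" + urlencode({"q": "related:" + url})

# Invented whole-index word counts.
index_counts = {"the": 1000, "of": 900, "river": 10, "salmon": 4}

keywords = top_keywords("the salmon of the river", index_counts, n=2)
print(keywords)
print(sru_url("http://example.org/sru", " and ".join(keywords)))
print(related_link("http://example.org/item/1"))
```

Common words such as "the" score low because they appear so often across the whole index, which is why rarer, more distinctive words rise to the top.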
6. The evaluation step includes asking and then answering difficult questions.
How well is the system working? Is the system being used enough to justify continued maintenance? Does the system meet user expectations? Is the system outputting valid data? Is the system providing a useful service and truly facilitating learning, teaching, and research? Are there other types of collections, services, "light-weight" protocols, or open source software applications that can be applied to the problem? The answers to these and other questions inform next steps.
7. This process is never really finished; return to Step #1.
Downloads
You can download the source code of this project locally (mylibrary.ockham.org/src/ockham-mylibrary.tar.gz), or mirrored at Google Code (code.google.com/p/ockham-mylibrary/).
Author: Eric Lease Morgan <emorgan@nd.edu>
Date created: 2005-03-15
Date updated: 2007-03-01
URL: http://mylibrary.ockham.org/