High Energy Physics Libraries Webzine
Issue 4 / June 2001
The OAIMH protocol was designed as a simple, low-barrier way to achieve interoperability through metadata harvesting. It is still an open question as to exactly how useful metadata sharing will be. However, there is certainly considerable interest in OAI and experience with early OAIMH implementations is encouraging.
This tutorial is organized in four main sections. In section 2, I hope to clear up some common misconceptions about what OAIMH is. In section 3, I review some of the concepts and assumptions that underlie the OAIMH protocol. Then, in the remaining two sections, sections 4 and 5, I consider implementation of the data-provider and service-provider sides of the OAIMH protocol. Perl code examples are given to implement bare-bones versions of these two interfaces.
It is not my intention to offer a complete description of the OAIMH protocol but instead to describe its use in very practical terms, and to highlight common practice among implementers. A copy of the OAIMH protocol specification [2] should be at hand while reading this tutorial. I will refer to sections within the protocol specification as [2 section2.1] (for section 2.1).
OAIMH is not about direct interoperability between archives. It is based on a model which puts a very clean divide between data-providers (entities which expose metadata) and service-providers (entities which harvest metadata, presumably with the intention of providing some service).
While the model has a clear divide between data-providers and service-providers, there is nothing to say that one entity cannot be both; Cite Base [3] is one example. The model has an obvious scalability problem if every service-provider is expected to harvest data from every data-provider. It may be that this is not an issue if service-providers are specific to a particular community and thus harvest only from a subset of data-providers. We may also see the creation of aggregators which harvest from a number of data-providers and then re-export this data.
OAIMH is not limited to Dublin Core (DC) [4] metadata. However, since OAI aims to promote interoperability, DC metadata has been adopted as a lowest-common-denominator metadata format which all data-providers should support. It is not intended that the requirement to export DC metadata should preclude the use of other metadata sets that may be more appropriate within particular communities. The OAI encourages the development of community-specific standards that provide the functionalities required by specific communities.
Metadata is disseminated via the GetRecord and ListRecords verbs. These requests result in zero or more records being returned. A record consists of 2 or 3 parts: a <header> container, a <metadata> container, and possibly an <about> container [2 section2.2].
The metadata for each item has a unique identifier which, when combined with the metadataPrefix, acts as a key to extract a metadata record. Note that although all metadata types for an item share the same identifier, the identifier is explicitly not an identifier for the item [2 section2.3]. Identifiers may be any valid URI [7] but an optional OAI identifier syntax [2 sectionA2] has been adopted widely. The OAI identifier syntax divides the identifier into three parts separated by colons (:), e.g. oai:arXiv:hep-lat/0008015 where `oai' is the scheme, `arXiv' identifies the repository, and `hep-lat/0008015' is the identifier within the particular repository.
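The three-part identifier syntax described above can be illustrated with a short Python sketch. This is only an illustration, not part of the example code that accompanies this tutorial: the names OAIIdentifier and parse_oai_identifier are hypothetical, and allowing further colons inside the local part is an assumption rather than something taken from the specification.

```python
from typing import NamedTuple


class OAIIdentifier(NamedTuple):
    scheme: str      # "oai" in the optional OAI identifier syntax
    repository: str  # identifies the repository, e.g. "arXiv"
    local_id: str    # identifier within that repository


def parse_oai_identifier(identifier: str) -> OAIIdentifier:
    """Split an identifier that follows the optional OAI syntax into three parts."""
    # Split on the first two colons only (assumption: any later colons
    # belong to the local part).
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError(f"not an OAI-syntax identifier: {identifier!r}")
    return OAIIdentifier(*parts)


print(parse_oai_identifier("oai:arXiv:hep-lat/0008015"))
```

Identifiers that are valid URIs but do not follow this optional syntax are rejected by the sketch; a real harvester would simply treat them as opaque strings.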
The datestamps have the granularity of a day: they are in YYYY-MM-DD format and no time is specified. This simple date format actually creates some additional complexity because the service-provider and data-provider may not be in the same time zone. This is considered further in section 4.2.
Typically, a service-provider would initially harvest all metadata records from a repository by issuing a ListRecords request without from or until restrictions. Subsequently, the service-provider would issue ListRecords requests with a from parameter equal to the date of the last harvest.
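The two kinds of request described above differ only in the from argument. A minimal Python sketch of building the query strings follows; the function name list_records_query is hypothetical, and the date shown is just an example value.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_query(metadata_prefix: str, from_date: Optional[str] = None) -> str:
    """Build the query string for a ListRecords request.

    With no from_date the request asks for every record (initial harvest);
    with a from_date (YYYY-MM-DD) it asks only for records with that
    datestamp or later (incremental harvest).
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        args["from"] = from_date
    return urlencode(args)


print(list_records_query("oai_dc"))                # initial, complete harvest
print(list_records_query("oai_dc", "2001-06-05"))  # later, incremental harvest
```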
The ListMetadataFormats request will return the metadataPrefix, schema, and optionally a metadataNamespace, for either a particular record or for the whole repository (if no identifier is specified). In the case of the whole repository, all metadata formats supported by the repository are returned. It is not implied that all records are available in all formats.
In an environment where one of a set of servers may handle a request, the server may dynamically redirect a request using the HTTP 302 response. To date this has been implemented only by the NACA repository [8].
The algorithm for oai1.pl is simply:
    read GET, POST or command line request
    check syntax of request
    if syntax correct
        return XML reply to request
    else
        return HTTP 400 error code and message
An example of an invalid request is:
    simeon@fff>./oai1.pl -r 'bad-request'
    Status: 400 Malformed request
    Content-Type: text/plain

    No verb specified!
OAIServer.pm exports two subroutines, one (OAICheckRequest) to check the request against a grammar stored in a data structure, and another (OAISatisfyRequest) which calls the appropriate routine to implement the required OAI verb. I will consider each verb in turn.
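The idea of checking a request against a grammar held in a data structure can be sketched in Python as below. This is not the data structure used in OAIServer.pm: the table of required and optional arguments is my reading of the v1.0 verbs and should be checked against the specification, the rule that a resumptionToken excludes other arguments is an assumption, and apart from "No verb specified!" the error messages are invented for illustration.

```python
from typing import Tuple
from urllib.parse import parse_qsl

# verb -> (required arguments, optional arguments); an assumed reading of
# the v1.0 verbs, not the grammar stored in OAIServer.pm.
GRAMMAR = {
    "Identify": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListIdentifiers": (set(), {"from", "until", "set"}),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListSets": (set(), set()),
}
# Verbs whose replies may be continued with a resumptionToken (assumed).
LIST_VERBS = {"ListIdentifiers", "ListRecords", "ListSets"}


def check_request(query: str) -> Tuple[int, str]:
    """Return (HTTP status, message) for a raw query string."""
    # Repeated arguments collapse to the last value in this sketch.
    args = dict(parse_qsl(query, keep_blank_values=True))
    verb = args.pop("verb", None)
    if verb is None:
        return 400, "No verb specified!"
    if verb not in GRAMMAR:
        return 400, f"Unknown verb '{verb}'"
    names = set(args)
    if "resumptionToken" in names and verb in LIST_VERBS:
        # Assumed rule: a resumptionToken must be the only argument.
        if names != {"resumptionToken"}:
            return 400, "resumptionToken must be the only argument"
        return 200, "OK"
    required, optional = GRAMMAR[verb]
    missing = required - names
    if missing:
        return 400, "Missing argument(s): " + ", ".join(sorted(missing))
    illegal = names - required - optional
    if illegal:
        return 400, "Illegal argument(s): " + ", ".join(sorted(illegal))
    return 200, "OK"


print(check_request("bad-request"))
print(check_request("verb=GetRecord&identifier=record2&metadataPrefix=oai_dc"))
```

Keeping the grammar in a table means that adding or changing a verb is a one-line edit, which is presumably the attraction of the data-structure approach.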
Database.pm is a dummy database interface with a `database' of three records: record1, record2 and record3. Metadata for record1 and record2 is available in DC format; metadata for record1 is also available in another format with the metadataPrefix `wibble'; and record3 is a `deleted' record so no metadata is available.
An example request and response is:
    simeon@fff>./oai1.pl -r 'verb=ListMetadataFormats&identifier=record1'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <ListMetadataFormats
      xmlns="http://www.openarchives.org/OAI/OAI_ListMetadataFormats"
      xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats
                          http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats.xsd">
      <responseDate>2001-05-05T12:27:36-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=ListMetadataFormats&identifier=record1&verb=ListMetadataFormats</requestURL>
      <metadataFormat>
        <metadataPrefix>wibble</metadataPrefix>
        <schema>http://wibble.org/wibble.xsd</schema>
      </metadataFormat>
      <metadataFormat>
        <metadataPrefix>oai_dc</metadataPrefix>
        <schema>http://www.openarchives.org/OAI/dc.xsd</schema>
      </metadataFormat>
    </ListMetadataFormats>

The response indicates that the record record1 may be disseminated in either oai_dc or wibble formats.
The record returned consists of two parts if the record is not deleted: a <header> block which contains the identifier and the datestamp (the information required for harvesting) and a <metadata> block which contains the XML metadata record in the requested format. The <metadata> block will be missing if the record is deleted or if the requested metadata format is not available.
For example, a request for oai_dc for record2 would be:
    simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=oai_dc'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <GetRecord
      xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
      xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                          http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
      <responseDate>2001-05-05T12:50:23-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=GetRecord&identifier=record2&metadataPrefix=oai_dc&verb=GetRecord</requestURL>
      <record>
        <header>
          <identifier>record2</identifier>
          <datestamp>1999-02-12</datestamp>
        </header>
        <metadata>
          <oai_dc
            xsi:schemaLocation="http://purl.org/dc/elements/1.1/
                                http://www.openarchives.org/OAI/dc.xsd"
            xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
            xmlns="http://purl.org/dc/elements/1.1/">
            <Title>Item 2</Title>
            <Creator>A N Other</Creator>
          </oai_dc>
        </metadata>
      </record>
    </GetRecord>

but a request for the unavailable format wibble would be:
    simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=wibble'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <GetRecord
      xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
      xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                          http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
      <responseDate>2001-05-05T12:52:13-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=GetRecord&identifier=record2&metadataPrefix=wibble&verb=GetRecord</requestURL>
      <record>
        <header>
          <identifier>record2</identifier>
          <datestamp>1999-02-12</datestamp>
        </header>
      </record>
    </GetRecord>

which includes a <header> block but no <metadata> block.
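On the harvesting side, the presence or absence of the <metadata> block is easy to test for. The Python sketch below uses the standard xml.etree.ElementTree module; the function names are hypothetical, and the sample document is a cut-down version of the reply above (the schemaLocation attribute and the XML declaration are omitted to keep the sample short and well-formed).

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_records(xml_text: str):
    """Return (identifier, datestamp, has_metadata) for each <record>."""
    summaries = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "record":
            continue
        parts = {local_name(child.tag): child for child in element}
        if "header" not in parts:
            continue  # not an OAI record container
        header = {local_name(c.tag): (c.text or "").strip() for c in parts["header"]}
        summaries.append(
            (header.get("identifier"), header.get("datestamp"), "metadata" in parts)
        )
    return summaries


# Cut-down version of the reply to the request for the wibble format.
SAMPLE = """<GetRecord xmlns="http://www.openarchives.org/OAI/OAI_GetRecord">
  <record>
    <header>
      <identifier>record2</identifier>
      <datestamp>1999-02-12</datestamp>
    </header>
  </record>
</GetRecord>"""

print(summarize_records(SAMPLE))
```

Matching on local names rather than on fully qualified names is a shortcut taken for brevity; a stricter client would check the namespace as well.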
The protocol also permits the addition of an <about> container [2 section2.2] for each record. This is provided as a hook for additional information such as rights or terms information. It is not currently used by any of the registered OAI data-providers and is not implemented in the example code.
In the case of ListIdentifiers the output consists of <identifier> elements which may include the attribute status="deleted" if the record is deleted. An example request without date restriction is:
    simeon@fff>./oai1.pl -r 'verb=ListIdentifiers'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <ListIdentifiers
      xmlns="http://www.openarchives.org/OAI/OAI_ListIdentifiers"
      xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers
                          http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers.xsd">
      <responseDate>2001-05-05T12:59:30-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=ListIdentifiers&verb=ListIdentifiers</requestURL>
      <identifier>record1</identifier>
      <identifier>record2</identifier>
      <identifier status="deleted">record3</identifier>
    </ListIdentifiers>

The response lists the identifiers of the three records in the repository and indicates that record3 is deleted. If the parameter until=2000-01-01 were added then only the first two identifiers would be returned since the datestamp of record3 is 2000-03-13.
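The date selection just described amounts to a simple comparison of datestamps. The Python sketch below is not the code in Database.pm or OAIServer.pm: the dictionary layout and function name are hypothetical, the datestamp given for record1 is a placeholder (it is not shown in the responses above), and treating until as inclusive is an assumption to check against the specification.

```python
from typing import List, Optional

# identifier -> (datestamp, deleted?)
DATABASE = {
    "record1": ("1998-01-01", False),  # placeholder datestamp, not from the tutorial
    "record2": ("1999-02-12", False),
    "record3": ("2000-03-13", True),   # deleted record
}


def list_identifiers(from_date: Optional[str] = None,
                     until: Optional[str] = None) -> List[str]:
    """Return the <identifier> elements selected by the optional date range."""
    elements = []
    for identifier, (datestamp, deleted) in sorted(DATABASE.items()):
        # YYYY-MM-DD strings sort chronologically, so plain string
        # comparison is enough; both bounds are treated as inclusive.
        if from_date is not None and datestamp < from_date:
            continue
        if until is not None and datestamp > until:
            continue
        status = ' status="deleted"' if deleted else ""
        elements.append(f"<identifier{status}>{identifier}</identifier>")
    return elements


print("\n".join(list_identifiers()))
print("\n".join(list_identifiers(until="2000-01-01")))
```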
In the case of ListRecords the output consists of <record> blocks similar to those obtained from GetRecord requests. ListRecords requests must include a metadataPrefix parameter.
As one of the maintainers of a heavily used archive I am painfully aware of the importance of avoiding inadvertent denial-of-service attacks created by badly written harvesting software. Automated agents should always include a useful user-agent string and a valid e-mail contact address in their HTTP requests. The flow-control elements of the protocol must be respected and careful testing is essential.
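As an illustration of identifying a harvester, the Python sketch below builds (but does not send) a POST request carrying a user-agent string and a contact address. The agent name and e-mail address are placeholders, and carrying the contact address in the HTTP From header is an assumption about how one might do this, not a description of how oaiharvest.pl does it.

```python
from urllib.request import Request

USER_AGENT = "example-oai-harvester/0.1"  # placeholder agent name
CONTACT = "harvester-admin@example.org"   # placeholder contact address


def build_request(base_url: str, query: str) -> Request:
    """Build a POST request that identifies the harvester and its operator."""
    return Request(
        base_url,
        data=query.encode("ascii"),  # supplying data makes this a POST
        headers={"User-Agent": USER_AGENT, "From": CONTACT},
    )


request = build_request("http://localhost/oai1", "verb=Identify")
print(request.get_method(), request.full_url)
print(request.header_items())
```

The request would be sent with urllib.request.urlopen(request); it is left unsent here so that the sketch can be run without contacting any server.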
To detect changes other than the addition and deletion of records which are part of normal repository operation, one can compare the response to OAI requests that describe the repository between successive harvests. These requests are Identify and probably ListSets and ListMetadataFormats (for the whole repository as opposed to any single record). For all of the requests we expect the <responseDate> to change with each request but for these requests we expect the rest of the response to be unchanged. Note that to do the test correctly one should compare the XML data in such a way that valid transformations, say re-ordering elements, are ignored. However, in practice it is likely to be sufficient (if over sensitive) to do a string comparison of the responses so long as changes in the <responseDate> are ignored.
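The simple string comparison mentioned above might look like the following Python sketch, in which the <responseDate> element is blanked out before the two responses are compared. The function names and the sample responses are hypothetical.

```python
import re

# Matches the whole <responseDate> element, whatever date it contains.
RESPONSE_DATE = re.compile(r"<responseDate>.*?</responseDate>", re.DOTALL)


def normalize(response: str) -> str:
    """Replace the <responseDate> element with a fixed placeholder."""
    return RESPONSE_DATE.sub("<responseDate/>", response)


def unchanged_except_date(old: str, new: str) -> bool:
    """True if two responses differ at most in their <responseDate>."""
    return normalize(old) == normalize(new)


old = "<Identify><responseDate>2001-06-05T09:00:00-06:00</responseDate><x>1</x></Identify>"
new = "<Identify><responseDate>2001-06-06T09:00:00-06:00</responseDate><x>1</x></Identify>"
print(unchanged_except_date(old, new))
```

As noted above, this test is over-sensitive: a harmless re-ordering of elements or a change in whitespace would be reported as a change.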
In the example harvester I have included the facility to specify a file containing the Identify response from the previous harvest. This is used both to extract the date of the last harvest and to check for changes in that response. I have not implemented a test for changes in the ListSets and ListMetadataFormats responses.
The one-day granularity of the datestamp and the possibility of data-providers and service-providers being in different time zones mean that there must be some overlap between the date ranges of successive requests [10]. If the service-provider and data-provider share the same time zone then a one-day overlap is sufficient to ensure that updates are not missed; records might be updated after the harvest on the day of the last harvest, but provided records that have changed on that day are reharvested then no changes will be missed. To cope with different time zones it is necessary to extend this to a two-day overlap if the harvester works with dates local to itself. An alternative strategy, which I prefer, is to use only the dates returned by the repository and thus, by working in the local time zone of the repository, reduce the required overlap to one day.
In the example harvester I implement this last strategy by taking the date of the last harvest from the <responseDate> of the stored Identify response (the <responseDate> must be specified in the local time zone of the repository [2 section3.1.2.1]). This date may then be used as the from date (inclusive) for the next ListRecords or ListIdentifiers request.
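A Python sketch of extracting the from date in this way is given below. It is not the code used in oaiharvest.pl: the function name and the sample response (including its time of day) are hypothetical, and the date is taken as the first ten characters of the <responseDate> so that it stays in the repository's own time zone.

```python
import re


def from_date_of(identify_response: str) -> str:
    """Return the YYYY-MM-DD part of the <responseDate> in a stored response."""
    match = re.search(r"<responseDate>\s*(\d{4}-\d{2}-\d{2})", identify_response)
    if match is None:
        raise ValueError("no <responseDate> found in stored response")
    # No time zone conversion is done: the date is kept exactly as the
    # repository reported it.
    return match.group(1)


stored = "<Identify><responseDate>2001-06-05T10:15:00-06:00</responseDate></Identify>"
print(from_date_of(stored))
```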
Perhaps the neatest way to implement a harvester would be to have it recombine partial responses into a complete reply. The example code does not do this but does parse all list requests to look for a <resumptionToken> so that further requests can be used to complete the original request.
    read command line arguments
    check options and parameters
    issue Identify request
    compare response with previous Identify response if given
    extract `from' date from command line, previous Identify response or do complete harvest
    LOOP: issue ListRecords or ListIdentifiers request
          check for resumptionToken, LOOP if present
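The LOOP step can be sketched in Python as a generator that keeps issuing requests while the reply contains a <resumptionToken>. This is an illustration only: fetch stands for whatever function actually sends the request, the replies used here are invented fragments rather than complete OAI responses, and sending the resumptionToken as the only argument besides the verb follows the arXiv excerpt shown later rather than a reading of the specification.

```python
import re
from typing import Callable, Dict, Iterator, Optional

TOKEN = re.compile(r"<resumptionToken>\s*([^<\s]+)\s*</resumptionToken>")


def harvest(fetch: Callable[[Dict[str, str]], str], verb: str,
            args: Optional[Dict[str, str]] = None) -> Iterator[str]:
    """Yield each partial response until no resumptionToken is returned."""
    request = {"verb": verb, **(args or {})}
    while True:
        response = fetch(request)
        yield response
        match = TOKEN.search(response)
        if match is None:
            return  # no resumptionToken, request complete
        # Continue the same list request using only the token.
        request = {"verb": verb, "resumptionToken": match.group(1)}


# Invented two-part reply, keyed by the token that requests each part.
PAGES = {
    None: "<identifier>a</identifier><resumptionToken>page2</resumptionToken>",
    "page2": "<identifier>b</identifier>",
}


def fake_fetch(request: Dict[str, str]) -> str:
    return PAGES[request.get("resumptionToken")]


for part in harvest(fake_fetch, "ListIdentifiers", {"from": "2001-06-05"}):
    print(part)
```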
The subroutine OAIGet in OAIGet.pm is used to issue the OAIMH requests and this handles any retry-after or redirect replies. XML parsing is handled by the OAIParser.pm module which extends XML-Parser, which itself is based on the expat parser.
Let us take as an example harvesting metadata from the example data-provider code, which has been set up at the URL http://localhost/oai1. First we would issue a harvest command without any time restriction (to harvest all records). In the examples, I harvest just the identifiers using ListIdentifiers requests; the flags -r and -m metadataPrefix can be used to instruct oaiharvest.pl to issue ListRecords requests and to specify a metadataPrefix other than oai_dc.
    simeon@fff>mkdir harvest1
    simeon@fff>./oaiharvest.pl -d harvest1 http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Doing complete harvest.
    OAIGet: Doing POST to http://localhost/oai1 args: verb=ListIdentifiers
    OAIGet: Got 200 OK (537bytes)
    oaiharvest.pl: Got 3 identifiers (running total: 3)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.
    simeon@fff>ls harvest1
    Identify ListIdentifiers.1
If we then do an incremental harvest specifying the file name of the last Identify response, harvest1/Identify, the harvester checks against this response for changes (none except date) and extracts the date of the last harvest (2001-06-05) to be used as the from date for the new harvest.
    simeon@fff>mkdir harvest2
    simeon@fff>./oaiharvest.pl -d harvest2 -i harvest1/Identify http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Identify response unchanged from reference (except date)
    oaiharvest.pl: Reading harvest1/Identify to get from date
    oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest1/Identify)
    OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
    OAIGet: Got 200 OK (444bytes)
    oaiharvest.pl: Got 0 identifiers (running total: 0)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.

Since there have been no changes in the database this harvest results in no identifiers being returned.
To extend this example, I then edited the database (Database.pm) to add a new record (record4) with datestamp 2001-06-05 which simulates the addition of a record after the last harvest but on the same day. I then ran another harvest command.
    simeon@fff>diff Database.pm~ Database.pm
    24c24,26
    < 'record3' => [ '2000-03-13', undef ] #deleted
    ---
    > 'record3' => [ '2000-03-13', undef ], #deleted
    > 'record4' => [ '2001-06-05', {
    > 'oai_dc' => ['Title','Item 4', 'Creator','Someone Else'] } ]
    simeon@fff>mkdir harvest3
    simeon@fff>./oaiharvest.pl -d harvest3 -i harvest2/Identify http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Identify response unchanged from reference (except date)
    oaiharvest.pl: Reading harvest2/Identify to get from date
    oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest2/Identify)
    OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
    OAIGet: Got 200 OK (478bytes)
    oaiharvest.pl: Got 1 identifiers (running total: 1)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.

This harvest results in one additional identifier, record4, being returned as expected.
Below are two excerpts from harvests from real repositories which illustrate the flow-control features of the protocol. The first is from arXiv, which uses 503 retry-after replies to enforce a delay between requests. The second is from NACA, which uses 302 redirect replies to demonstrate a load-sharing scheme.
    ...
    OAIGet: Doing POST to http://arXiv.org/oai1 args: verb=ListIdentifiers
    OAIGet: Got 503, sleeping for 60 seconds...
    OAIGet: Woken again, retrying...
    OAIGet: Got 200 OK (27398bytes)
    oaiharvest.pl: Got 502 identifiers (running total: 502)
    oaiharvest.pl: Got resumptionToken: `1997-02-10___'
    OAIGet: Doing POST to http://arXiv.org/oai1 args: resumptionToken=1997-02-10___&verb=ListIdentifiers
    OAIGet: Got 503, sleeping for 60 seconds...
    OAIGet: Woken again, retrying...
    OAIGet: Got 200 OK (28330bytes)
    oaiharvest.pl: Got 520 identifiers (running total: 1022)
    oaiharvest.pl: Got resumptionToken: `1997-03-06___'
    ...

    ...
    OAIGet: Doing POST to http://naca.larc.nasa.gov/oai/ args: verb=ListIdentifiers
    OAIGet: Got 302, redirecting to http://buckets.dsi.internet2.edu/naca/oai/?...
    OAIGet: Doing POST to http://buckets.dsi.internet2.edu/naca/oai/ args: verb=ListIdentifiers
    OAIGet: Got 200 OK (336705bytes)
    oaiharvest.pl: Got 6352 identifiers (running total: 6352)
    ...
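The behaviour seen in these excerpts (sleep and retry on 503, follow the new location on 302) can be sketched in Python as follows. This is not the OAIGet subroutine: send is a stand-in for the function that performs the HTTP request and returns a (status, headers, body) tuple, the limit on attempts and the 60-second default are arbitrary choices, and the Retry-After value is assumed to be a number of seconds.

```python
import time
from typing import Callable, Dict, Tuple

Reply = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def oai_get(send: Callable[[str], Reply], url: str, max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Issue a request, honouring 503 retry-after and 302 redirect replies."""
    for _ in range(max_attempts):
        status, headers, body = send(url)
        if status == 200:
            return body
        if status == 503:
            # Wait as long as the server asks before trying again.
            sleep(int(headers.get("Retry-After", "60")))
        elif status == 302:
            # Repeat the request at the location the server points to.
            url = headers["Location"]
        else:
            raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError("too many retries or redirects, giving up")


# Scripted replies standing in for a real server (invented for illustration).
replies = iter([
    (503, {"Retry-After": "60"}, ""),
    (302, {"Location": "http://mirror.example.org/oai"}, ""),
    (200, {}, "<ListIdentifiers/>"),
])
# The sleep is replaced by a print so that the demonstration runs instantly.
print(oai_get(lambda url: next(replies), "http://repo.example.org/oai",
              sleep=lambda seconds: print(f"would sleep {seconds} seconds")))
```

Passing the sleep function as a parameter is a convenience for testing; a real harvester would leave the default so that the requested delay is actually observed.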
I hope the examples above provide a useful demonstration of some of the features of OAIMH metadata harvesting. Be sure to exercise caution and restraint when running tests against registered repositories. There is some cost associated with answering OAIMH requests, and recklessly downloading large amounts of data for no good reason is not helpful.
The uptake of OAI is very encouraging and it is feedback from the current implementers which will shape the next version of the OAIMH protocol. Anyone implementing, or interested in implementing, either side of the OAIMH protocol should subscribe to the oai-implementers [15] mailing list. It is a helpful and friendly forum.
In order to run the example programs, you will require Perl 5.004 or later and the following modules (the precise version I used is given in parentheses). For the server:
Before running oaiharvest.pl you should first edit the line that defines the variable $contact and insert your e-mail address. This will then be specified as the contact address for all HTTP requests and will enable the server maintainer to contact you if there are problems. The example code has been tested only on a Linux system and with the Apache server. While I hope that it will work on other systems this has not been verified.
address: T-8, Los Alamos National Lab., Los Alamos, NM 87545, USA
e-mail: simeon@lanl.gov
[1] Open Archives Initiative (OAI),
URL: <http://www.openarchives.org/>
[2] OAI metadata harvesting protocol v1.0, released on 21 January 2001, revised 24 April 2001,
URL: <http://www.openarchives.org/OAI/openarchivesprotocol.htm>
[3] Cite Base at the University of Southampton,
a prototype Open Archives federating service which extracts and re-exports citation information in addition to providing a search facility,
URL: <http://cite-base.ecs.soton.ac.uk/>
[4] Dublin Core Metadata Element Set, Version 1.1: Reference Description (2 July 1999),
URL: <http://purl.org/DC/documents/rec-dces-19990702.htm>
[5] HTTP - Hypertext Transfer Protocol v1.1,
URL: <http://www.ietf.org/rfc/rfc2616.txt>
[6] XML - Extensible Markup Language,
OAI uses XML schemas to specify responses,
URL: <http://www.w3.org/XML/>
URL: <http://www.w3.org/TR/xmlschema-0/>
[7] Uniform Resource Identifiers (URI): Generic Syntax,
URL: <http://www.ietf.org/rfc/rfc2396.txt>
[8] NACA - National Advisory Committee for Aeronautics Technical Report Server,
URL: <http://naca.larc.nasa.gov/>
[9] The Apache web server,
URL: <http://www.apache.org/>
[10] Harvesting strategies have been discussed on the oai-implementers list [15]; I have drawn from the comments of Hussein Suleman in particular.
[11] `arc' Cross Archive Searching Service,
an OAI service provider developed at Old Dominion University,
URL: <http://arc.cs.odu.edu/help/archives.htm>
[12] The OAI Repository Explorer,
an interface to interactively test archives for compliance with the OAIMH protocol,
Hussein Suleman (Digital Libraries Research Laboratory, Virginia Tech.),
URL: <http://rocky.dlib.vt.edu/ oai/cgi-bin/Explorer/oai1.0/testoai>
[13] Open Archives Initiative, Tools for Implementers list,
URL: <http://www.openarchives.org/tools/tools.htm>
[14] Perl class library that allows the rapid deployment of an OAI compatible interface to an existing web server/database for OAI server and harvester implementation,
URL: <http://www.ecs.soton.ac.uk/~tdb198/oai/frontend.html>
[15] oai-implementers,
a mailing list (and archive) for discussing the implementation of the OAIMH protocol,
URL: <http://oaisrv.nsdl.cornell.edu/mailman/listinfo/OAI-implementers>
For citation purposes:
Simeon Warner, "Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial", High Energy Physics Libraries Webzine, issue 4, June 2001
URL: <http://webzine.web.cern.ch/webzine/4/papers/3/>