Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial

*Maintained by: HEPLW Team* · High Energy Physics Libraries Webzine

Identify
GetRecord
ListIdentifiers
ListRecords
ListSets
ListMetadataFormats
oai:arXiv:hep-lat/0008015
ScriptAlias /oai1 /some/directory/oai1.pl
./oai1.pl -r 'verb=Identify'
read GET, POST or command line request
check syntax of request
if syntax correct 
  return XML reply to request
else
  return HTTP 400 error code and message 

simeon@fff>./oai1.pl -r 'bad-request'
Status: 400 Malformed request
Content-Type: text/plain

No verb specified!
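
The request handling outlined in the pseudocode above can be sketched as follows. This is a minimal Python sketch rather than the Perl of oai1.pl: the helper name check_request and every message except "No verb specified!" are assumptions made for illustration, and only the verb is checked, not the arguments that each verb accepts.

```python
from urllib.parse import parse_qs

# The six protocol verbs named in this tutorial.
VERBS = {"Identify", "GetRecord", "ListIdentifiers",
         "ListRecords", "ListSets", "ListMetadataFormats"}


def check_request(query):
    """Return (http_status, message) for a raw query string.

    Hypothetical helper: a real server would go on to validate the
    arguments of the verb before building the XML reply.
    """
    args = parse_qs(query, keep_blank_values=True)
    verbs = args.get("verb", [])
    if not verbs:
        return 400, "No verb specified!"
    if len(set(verbs)) > 1:
        # The same verb repeated is tolerated; different verbs are not.
        return 400, "Conflicting verbs: " + ", ".join(verbs)
    if verbs[0] not in VERBS:
        return 400, "Unknown verb: " + verbs[0]
    return 200, verbs[0]


print(check_request("bad-request"))
print(check_request("verb=Identify"))
```

With the malformed request from the session above, the sketch returns status 400 together with the "No verb specified!" message; a request such as verb=Identify passes the check.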

simeon@fff>./oai1.pl -r 'verb=ListMetadataFormats&identifier=record1'
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>

<ListMetadataFormats xmlns="http://www.openarchives.org/OAI/OAI_ListMetadataFormats"
    xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats
                        http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats.xsd">
 <responseDate>2001-05-05T12:27:36-06:00</responseDate>
 <requestURL>http://localhost/oai1?verb=ListMetadataFormats&amp;
             identifier=record1&amp;verb=ListMetadataFormats</requestURL>
 <metadataFormat>
  <metadataPrefix>wibble</metadataPrefix>
  <schema>http://wibble.org/wibble.xsd</schema>
 </metadataFormat>
 <metadataFormat>
  <metadataPrefix>oai_dc</metadataPrefix>
  <schema>http://www.openarchives.org/OAI/dc.xsd</schema>
 </metadataFormat>
</ListMetadataFormats>
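
A harvester would typically read such a reply to learn which metadataPrefix values it may request. The following is a minimal Python sketch of that step, not part of the example programs: the function names are assumptions, the sample is a cut-down copy of the reply above with the namespace declarations left out, and element names are matched with any namespace stripped.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def metadata_formats(xml_text):
    """Map each metadataPrefix in a ListMetadataFormats reply to its schema URL."""
    formats = {}
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "metadataFormat":
            continue
        fields = {local(child.tag): (child.text or "").strip() for child in elem}
        formats[fields["metadataPrefix"]] = fields.get("schema")
    return formats


# Cut-down sample modelled on the reply above (assumption: namespaces omitted).
SAMPLE = """<ListMetadataFormats>
 <metadataFormat>
  <metadataPrefix>wibble</metadataPrefix>
  <schema>http://wibble.org/wibble.xsd</schema>
 </metadataFormat>
 <metadataFormat>
  <metadataPrefix>oai_dc</metadataPrefix>
  <schema>http://www.openarchives.org/OAI/dc.xsd</schema>
 </metadataFormat>
</ListMetadataFormats>"""

for prefix, schema in metadata_formats(SAMPLE).items():
    print(prefix, schema)
```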

simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=oai_dc'
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>

<GetRecord xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                      http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
 <responseDate>2001-05-05T12:50:23-06:00</responseDate>
 <requestURL>http://localhost/oai1?verb=GetRecord&amp;identifier=record2&amp;
             metadataPrefix=oai_dc&amp;verb=GetRecord</requestURL>
 <record>
  <header>
   <identifier>record2</identifier>
   <datestamp>1999-02-12</datestamp>
  </header>
  <metadata>
   <oai_dc xsi:schemaLocation="http://purl.org/dc/elements/1.1/
                               http://www.openarchives.org/OAI/dc.xsd"
           xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
           xmlns="http://purl.org/dc/elements/1.1/">
    <Title>Item 2</Title>
    <Creator>A N Other</Creator>
   </oai_dc>
  </metadata>
 </record>
</GetRecord>

simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=wibble'
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>

<GetRecord xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
 xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                     http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
 <responseDate>2001-05-05T12:52:13-06:00</responseDate>
 <requestURL>http://localhost/oai1?verb=GetRecord&amp;
   identifier=record2&amp;metadataPrefix=wibble&amp;verb=GetRecord</requestURL>
 <record>
  <header>
   <identifier>record2</identifier>
   <datestamp>1999-02-12</datestamp>
  </header>
 </record>
</GetRecord>
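
The two GetRecord replies above differ only in whether the record carries a <metadata> element: the oai_dc request returns one, the wibble request for record2 returns a header alone. A harvester has to allow for both cases. The following Python sketch shows one way to do so; the function names are assumptions, and the samples are shortened copies of the replies above.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_get_record(xml_text):
    """Return (header, metadata) from a GetRecord reply.

    header is a dict of the header fields; metadata is a list of
    (element name, text) pairs, or None if no <metadata> is present.
    """
    header = metadata = None
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) == "header" and header is None:
            header = {local(c.tag): c.text for c in elem}
        elif local(elem.tag) == "metadata" and metadata is None:
            # The child of <metadata> is the format's container, e.g. <oai_dc>.
            metadata = [(local(f.tag), f.text) for c in elem for f in c]
    return header, metadata


HEADER = """<header>
   <identifier>record2</identifier>
   <datestamp>1999-02-12</datestamp>
  </header>"""
WITH_DC = ("<GetRecord><record>" + HEADER + """<metadata>
   <oai_dc xmlns="http://purl.org/dc/elements/1.1/">
    <Title>Item 2</Title>
    <Creator>A N Other</Creator>
   </oai_dc>
  </metadata></record></GetRecord>""")
HEADER_ONLY = "<GetRecord><record>" + HEADER + "</record></GetRecord>"

print(parse_get_record(WITH_DC))
print(parse_get_record(HEADER_ONLY))
```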

simeon@fff>./oai1.pl -r 'verb=ListIdentifiers'
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>

<ListIdentifiers xmlns="http://www.openarchives.org/OAI/OAI_ListIdentifiers"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers
                      http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers.xsd">
 <responseDate>2001-05-05T12:59:30-06:00</responseDate>
 <requestURL>http://localhost/oai1?verb=ListIdentifiers&amp;verb=ListIdentifiers</requestURL>
 <identifier>record1</identifier>
 <identifier>record2</identifier>
 <identifier status="deleted">record3</identifier>
</ListIdentifiers>
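
In the reply above, record3 is marked with status="deleted", so a harvester should treat it differently from record1 and record2. A minimal Python sketch of that separation follows; the function name split_identifiers is an assumption, and the sample is the reply above without its namespace declarations.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def split_identifiers(xml_text):
    """Return (live, deleted) identifier lists from a ListIdentifiers reply."""
    live, deleted = [], []
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "identifier":
            continue
        target = deleted if elem.get("status") == "deleted" else live
        target.append(elem.text)
    return live, deleted


SAMPLE = """<ListIdentifiers>
 <identifier>record1</identifier>
 <identifier>record2</identifier>
 <identifier status="deleted">record3</identifier>
</ListIdentifiers>"""

live, deleted = split_identifiers(SAMPLE)
print("live:", live)
print("deleted:", deleted)
```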

oaiharvest.pl -h
read command line arguments
check options and parameters
issue Identify request
compare response with previous Identify response if given
take `from' date from command line or previous Identify response, else do complete harvest
LOOP:
  issue ListRecords or ListIdentifiers request
  check for resumptionToken, LOOP if present
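
The loop at the end of this outline can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of oaiharvest.pl: fetch is a hypothetical caller-supplied function that sends one request and returns the XML reply as text, and the demonstration replaces the network with a small table of canned replies whose token value is made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def local(tag):
    """Strip any '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def harvest_identifiers(fetch, from_date=None):
    """Collect identifiers, repeating the request while a resumptionToken is returned."""
    args = {"verb": "ListIdentifiers"}
    if from_date:
        args["from"] = from_date
    identifiers = []
    while True:
        root = ET.fromstring(fetch(urlencode(args)))
        token = None
        for elem in root.iter():
            if local(elem.tag) == "identifier":
                identifiers.append(elem.text)
            elif local(elem.tag) == "resumptionToken":
                token = (elem.text or "").strip()
        if not token:
            return identifiers  # no resumptionToken: request complete
        # The follow-up request carries the verb and the token only.
        args = {"verb": "ListIdentifiers", "resumptionToken": token}


# Canned replies standing in for a repository (assumption: token value "t1").
PAGES = {
    "verb=ListIdentifiers":
        "<ListIdentifiers><identifier>record1</identifier>"
        "<resumptionToken>t1</resumptionToken></ListIdentifiers>",
    "verb=ListIdentifiers&resumptionToken=t1":
        "<ListIdentifiers><identifier>record2</identifier></ListIdentifiers>",
}

print(harvest_identifiers(PAGES.__getitem__))
```

Passing the fetch function in as a parameter keeps the loop independent of the HTTP layer, which is also where retries and redirects would be handled.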

simeon@fff>mkdir harvest1
simeon@fff>./oaiharvest.pl -d harvest1 http://localhost/oai1

oaiharvest.pl: Harvest from http://localhost/oai1 using POST
OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
OAIGet: Got 200 OK (479bytes)
oaiharvest.pl: Doing complete harvest.
OAIGet: Doing POST to http://localhost/oai1 args: verb=ListIdentifiers
OAIGet: Got 200 OK (537bytes)
oaiharvest.pl: Got 3 identifiers (running total: 3)
oaiharvest.pl: No resumptionToken, request complete.
oaiharvest.pl: Done.

simeon@fff>ls harvest1
Identify  ListIdentifiers.1

simeon@fff>mkdir harvest2
simeon@fff>./oaiharvest.pl -d harvest2 -i harvest1/Identify  http://localhost/oai1

oaiharvest.pl: Harvest from http://localhost/oai1 using POST
OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
OAIGet: Got 200 OK (479bytes)
oaiharvest.pl: Identify response unchanged from reference (except date)
oaiharvest.pl: Reading harvest1/Identify to get from date
oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest1/Identify)
OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
OAIGet: Got 200 OK (444bytes)
oaiharvest.pl: Got 0 identifiers (running total: 0)
oaiharvest.pl: No resumptionToken, request complete.
oaiharvest.pl: Done.
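
The log above shows the from date being read from a saved Identify response. One plausible way to do this, assuming the date is taken from the <responseDate> element of that response, is sketched below in Python. The function name from_date_of and the sample timestamp are assumptions for illustration, and the sketch keeps the date exactly as the repository wrote it, without any time-zone adjustment.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def from_date_of(identify_xml):
    """Return the YYYY-MM-DD part of <responseDate>, or None if it is absent."""
    for elem in ET.fromstring(identify_xml).iter():
        if local(elem.tag) == "responseDate" and elem.text:
            return elem.text.strip()[:10]
    return None


# Hypothetical saved Identify response, reduced to the one element used here.
SAVED = """<Identify>
 <responseDate>2001-06-05T10:15:00-06:00</responseDate>
</Identify>"""

print("from=" + from_date_of(SAVED))
```

Harvesting from the date of the previous request, rather than the day after, means that records changed later on that same day are not missed, at the cost of occasionally fetching a record twice.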

simeon@fff>diff Database.pm~ Database.pm
24c24,26
<   'record3' => [ '2000-03-13', undef ]  #deleted
---
>   'record3' => [ '2000-03-13', undef ],  #deleted
>   'record4' => [ '2001-06-05', {
>     'oai_dc' => ['Title','Item 4', 'Creator','Someone Else'] } ] 

simeon@fff>mkdir harvest3
simeon@fff>./oaiharvest.pl -d harvest3 -i harvest2/Identify http://localhost/oai1

oaiharvest.pl: Harvest from http://localhost/oai1 using POST
OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
OAIGet: Got 200 OK (479bytes)
oaiharvest.pl: Identify response unchanged from reference (except date)
oaiharvest.pl: Reading harvest2/Identify to get from date
oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest2/Identify)
OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
OAIGet: Got 200 OK (478bytes)
oaiharvest.pl: Got 1 identifiers (running total: 1)
oaiharvest.pl: No resumptionToken, request complete.
oaiharvest.pl: Done.

...
OAIGet: Doing POST to http://arXiv.org/oai1 args: verb=ListIdentifiers
OAIGet: Got 503, sleeping for 60 seconds...
OAIGet: Woken again, retrying...
OAIGet: Got 200 OK (27398bytes)
oaiharvest.pl: Got 502 identifiers (running total: 502)
oaiharvest.pl: Got resumptionToken: `1997-02-10___'
OAIGet: Doing POST to http://arXiv.org/oai1 args: resumptionToken=1997-02-10___&verb=ListIdentifiers
OAIGet: Got 503, sleeping for 60 seconds...
OAIGet: Woken again, retrying...
OAIGet: Got 200 OK (28330bytes)
oaiharvest.pl: Got 520 identifiers (running total: 1022)
oaiharvest.pl: Got resumptionToken: `1997-03-06___'
...
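
The 503 handling visible in this log (sleep, then retry the same request) can be sketched as follows. This Python sketch is not the code of OAIGet.pm: do_request is a hypothetical caller-supplied function returning (status, headers, body), and reading the wait from a Retry-After header given in seconds is an assumption; the log itself shows only the 60-second sleep.

```python
import time


def get_with_retry(do_request, max_tries=5, default_wait=60, sleep=time.sleep):
    """Repeat do_request() while the server answers 503.

    Returns (status, body) for the first reply that is not a 503.
    Assumes any Retry-After header holds a number of seconds.
    """
    for _ in range(max_tries):
        status, headers, body = do_request()
        if status != 503:
            return status, body
        sleep(int(headers.get("Retry-After", default_wait)))
    raise RuntimeError("still getting 503 after %d tries" % max_tries)


# Demonstration with canned replies; waits are recorded instead of slept.
replies = iter([(503, {"Retry-After": "60"}, ""),
                (200, {}, "<ListIdentifiers/>")])
waits = []
print(get_with_retry(lambda: next(replies), sleep=waits.append))
print("waited:", waits)
```

Capping the number of attempts keeps an unattended harvest from looping forever against a repository that never recovers.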

...
OAIGet: Doing POST to http://naca.larc.nasa.gov/oai/ args: verb=ListIdentifiers
OAIGet: Got 302, redirecting to http://buckets.dsi.internet2.edu/naca/oai/?...
OAIGet: Doing POST to http://buckets.dsi.internet2.edu/naca/oai/ args: verb=ListIdentifiers
OAIGet: Got 200 OK (336705bytes)
oaiharvest.pl: Got 6352 identifiers (running total: 6352)
...

oai1.pl
OAIServer.pm
oaiharvest.pl
OAIGet.pm
OAIParser.pm
examples.tar.gz
examples.zip
expat
http://www.cpan.org/
perl Makefile.PL; make; make test; make install
expat
XML-Parser
http://sourceforge.net/projects/expat/

[1]	Open Archives Initiative (OAI), URL: <http://www.openarchives.org/>
[2]	OAI metadata harvesting protocol v1.0, released on 21 January 2001, revised 24 April 2001, URL: <http://www.openarchives.org/OAI/openarchivesprotocol.htm>
[3]	Cite Base at the University of Southampton, a prototype Open Archives federating service which extracts and re-exports citation information in addition to providing a search facility, URL: <http://cite-base.ecs.soton.ac.uk/>
[4]	Dublin Core Metadata Element Set, Version 1.1: Reference Description (2 July 1999), URL: <http://purl.org/DC/documents/rec-dces-19990702.htm>
[5]	HTTP - Hypertext Transfer Protocol v1.1, URL: <http://www.ietf.org/rfc/rfc2616.txt>
[6]	XML - Extensible Markup Language, OAI uses XML schemas to specify responses, URL: <http://www.w3.org/XML/> URL: <http://www.w3.org/TR/xmlschema-0/>
[7]	Uniform Resource Identifiers (URI): Generic Syntax, URL: <http://www.ietf.org/rfc/rfc2396.txt>
[8]	NACA - National Advisory Committee for Aeronautics Technical Report Server, URL: <http://naca.larc.nasa.gov/>
[9]	The Apache web server, URL: <http://www.apache.org/>
[10]	Harvesting strategies have been discussed on the `oai-implementers` list [15]; I have drawn on the comments of Hussein Suleman in particular.
[11]	`arc` Cross Archive Searching Service, an OAI service provider developed at Old Dominion University, URL: <http://arc.cs.odu.edu/help/archives.htm>
[12]	The OAI Repository Explorer, an interface to interactively test archives for compliance with the OAIMH protocol, Hussein Suleman (Digital Libraries Research Laboratory, Virginia Tech.), URL: <http://rocky.dlib.vt.edu/ oai/cgi-bin/Explorer/oai1.0/testoai>
[13]	Open Archives Initiative, Tools for Implementers list, URL: <http://www.openarchives.org/tools/tools.htm>
[14]	Perl class library that allows the rapid deployment of an OAI-compatible interface to an existing web server/database, for OAI server and harvester implementation, URL: <http://www.ecs.soton.ac.uk/~tdb198/oai/frontend.html>
[15]	`oai-implementers`, a mailing list (and archive) for discussing the implementation of the OAIMH protocol, URL: <http://oaisrv.nsdl.cornell.edu/mailman/listinfo/OAI-implementers>


Simeon Warner

Abstract

1. Introduction

2. What OAIMH Is Not

3. OAIMH Concepts

3.1. Pull-Only Interaction via HTTP Using XML

3.2. Verbs

3.3. Items, Records and Identifiers

3.4. Datestamps

3.5. Sets

3.6. Metadata Formats

3.7. Exception Conditions

3.8. Flow Control, Load Balancing and Redirection

4. Exposing Metadata

4.1. Minimal Server Implementation

4.2. Identify

4.3. ListSets

4.4. ListMetadataFormats

4.5. GetRecord

4.6. ListIdentifiers and ListRecords

4.7. Partial Responses

5. Harvesting Metadata

5.1. Detecting Changes That Require Manual Intervention

5.2. Incremental Harvesting

5.3. Flow Control and Redirection

5.4. Parsing Replies

5.5. An Example Harvester

6. Conclusions

Appendix: Example Programs

About the Author

References

Reader Response