OAI‑PMH
Introduction
This tutorial walks you through how to retrieve metadata from the Rijksmuseum data services using OAI‑PMH.
The documentation page explains the available endpoints, parameters, and technical specifications. This tutorial focuses on something different:
How do you actually use OAI‑PMH in a real workflow?
After completing this tutorial, you will understand:
- when OAI‑PMH is the right choice
- how to make your first request
- how to harvest records
- how to interpret XML responses
- how pagination works with resumptionToken
- how to retrieve incremental updates
- how to build a simple harvesting workflow
When should you use OAI‑PMH?
OAI‑PMH is designed for systematically retrieving metadata.
Use OAI‑PMH when you want to:
- download complete datasets
- synchronize metadata regularly
- build a local copy of collection data
- collect datasets for research
- retrieve new or updated records periodically
OAI‑PMH is not the best choice when you need to:
- perform interactive searches
- apply filters or search queries
- retrieve individual objects directly
In those cases, the search API may be more appropriate.
How OAI‑PMH works
OAI‑PMH works differently from a search API. You do not send search queries. Instead, you harvest metadata in batches.
A typical workflow looks like this:
- Identify → discover repository information
- ListSets → explore available datasets
- ListMetadataFormats → discover available metadata formats
- ListRecords → harvest records
- resumptionToken → retrieve next batch
- from/until → request updates
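Each step in this workflow is an ordinary HTTP GET request whose query string carries the verb and its arguments. As a minimal sketch, the hypothetical helper `build_url` below (not part of any Rijksmuseum library) assembles such URLs with the standard library, using the endpoint that appears throughout this tutorial:

```python
from urllib.parse import urlencode

BASE_URL = "https://data.rijksmuseum.nl/oai"  # endpoint used throughout this tutorial


def build_url(verb, **params):
    """Return an OAI-PMH request URL for a verb and its optional arguments.

    Hypothetical helper for illustration only.
    """
    # Skip arguments that were not supplied so they stay out of the query string
    query = {"verb": verb, **{k: v for k, v in params.items() if v is not None}}
    return f"{BASE_URL}?{urlencode(query)}"


identify_url = build_url("Identify")
records_url = build_url("ListRecords", metadataPrefix="edm")
# "from" is a Python keyword, so it is passed through a dict
updates_url = build_url("ListRecords", metadataPrefix="edm", **{"from": "2026-04-01T00:00:00Z"})

print(identify_url)
print(records_url)
print(updates_url)
```

The colons in the timestamp come out percent-encoded (`%3A`), which is equivalent to the unencoded form in a query string.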
Step-by-step tutorial
Step 1 - Verify the Endpoint
Start with a simple request:
https://data.rijksmuseum.nl/oai?verb=Identify
The Identify verb returns general information about the OAI-PMH repository. You will see details such as:
- repository name
- administrator information
- timestamp granularity: the required format for all timestamps, e.g. YYYY-MM-DDThh:mm:ssZ
- earliestDatestamp: the date of the oldest record available in the repository
This is a useful first step to confirm that the endpoint is available and to understand the capabilities of the repository.
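If you want to read these fields in a script, the sketch below shows one possible way to parse an Identify response with the standard library. The sample XML is a placeholder written for this example, not real repository output, and `parse_identify` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response only: the values are placeholders, not real repository output
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>https://data.rijksmuseum.nl/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2000-01-01T00:00:00Z</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return selected fields of an Identify response as a dict."""
    identify = ET.fromstring(xml_text).find("oai:Identify", OAI_NS)
    fields = ("repositoryName", "adminEmail", "granularity", "earliestDatestamp")
    return {name: identify.findtext(f"oai:{name}", namespaces=OAI_NS) for name in fields}


info = parse_identify(SAMPLE_IDENTIFY)
print(info["granularity"])
print(info["earliestDatestamp"])
```

In a real harvest you would pass the body of the Identify request to `parse_identify` instead of the sample string.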
Step 2 - Explore Available Sets
OAI-PMH often uses sets. Sets allow you to harvest subsets of data instead of retrieving the entire repository.
Use ListSets to retrieve available sets:
https://data.rijksmuseum.nl/oai?verb=ListSets
The response contains available collections or categories, for example:
<set>
<setSpec>2619</setSpec>
<setName>Drawings by Rembrandt and his School in the Rijksmuseum</setName>
</set>
In this example:
- setSpec is the identifier used in future requests
- setName is the human-readable title of the dataset
Sets are useful when you only want to harvest a specific subset of the collection, rather than the entire repository.
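To pick a set in code, you can turn the ListSets response into a lookup table. The sketch below wraps the example set shown above in a minimal OAI-PMH envelope (the envelope is an assumption about the response shape, based on the protocol's namespace). `parse_sets` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# The <set> element from this tutorial inside a minimal, illustrative envelope
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>2619</setSpec>
      <setName>Drawings by Rembrandt and his School in the Rijksmuseum</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Map each setSpec to its human-readable setName."""
    root = ET.fromstring(xml_text)
    return {
        s.findtext("oai:setSpec", namespaces=OAI_NS): s.findtext("oai:setName", namespaces=OAI_NS)
        for s in root.iterfind(".//oai:set", OAI_NS)
    }


sets = parse_sets(SAMPLE_LIST_SETS)
for spec, name in sets.items():
    print(f"{spec}: {name}")
```

The setSpec keys are the values you would later pass as the set parameter.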
Step 3 - Explore available metadata formats
Before harvesting records, you need to know which metadata formats are supported by the repository.
Use ListMetadataFormats to retrieve the available formats:
https://data.rijksmuseum.nl/oai?verb=ListMetadataFormats
The repository supports the following metadata formats:
| metadataPrefix | Format |
|---|---|
| edm | Europeana Data Model |
| oai_dc | Dublin Core |
Use edm if you need rich, structured metadata including rights statements, aggregation information, and links to related resources. Use oai_dc if you need a simpler, more widely compatible format with basic fields only.
The metadataPrefix value from this response will be used in the next step.
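A harvester can also check the supported formats at run time rather than hard-coding one. The following sketch assumes a trimmed response that contains only the metadataPrefix elements (a real response carries more fields per format). `pick_prefix` is a hypothetical helper that returns the first preferred format the repository offers.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, illustrative response: real responses include further elements per format
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>edm</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def pick_prefix(xml_text, preferred=("edm", "oai_dc")):
    """Return the first preferred metadataPrefix that the repository offers."""
    root = ET.fromstring(xml_text)
    available = {el.text for el in root.iterfind(".//oai:metadataPrefix", OAI_NS)}
    for prefix in preferred:
        if prefix in available:
            return prefix
    raise ValueError(f"None of {preferred} offered; available: {sorted(available)}")


print(pick_prefix(SAMPLE_FORMATS))
print(pick_prefix(SAMPLE_FORMATS, preferred=("oai_dc",)))
```

Raising an error when no preferred format is available makes a misconfigured harvest fail early instead of silently requesting an unsupported format.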
Step 4 - Harvest Records
Use ListRecords to retrieve metadata records from the repository.
Start with the following request:
https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm
This request returns records in the selected metadata format. Each record contains two main sections:
- header: technical information used for harvesting
- metadata: descriptive information about the object
To limit results to a specific set, add the set parameter:
https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm&set=26021
Step 5 - XML structure
OAI-PMH responses are XML documents. You do not need to understand the full schema immediately, but a few key sections appear in every record.
The examples below use EDM. If you selected oai_dc as your metadata format, the metadata section will contain simpler Dublin Core fields such as dc:title, dc:creator, and dc:date.
A simplified EDM record looks like this:
<record>
  <header>
    <identifier>https://id.rijksmuseum.nl/200107928</identifier>
    <datestamp>2024-09-27T07:20:23Z</datestamp>
    <setSpec>26021</setSpec>
  </header>
  <metadata>
    <rdf:RDF>
      <ore:Aggregation>
        <edm:aggregatedCHO>
          <edm:ProvidedCHO rdf:about="https://id.rijksmuseum.nl/200107928">
            <dc:title xml:lang="en">The Night Watch</dc:title>
            <dc:creator rdf:resource="https://id.rijksmuseum.nl/2103429"/>
            <dc:description xml:lang="en">Rembrandt's largest, most famous canvas...</dc:description>
          </edm:ProvidedCHO>
        </edm:aggregatedCHO>
        <edm:rights rdf:resource="http://creativecommons.org/publicdomain/mark/1.0/"/>
      </ore:Aggregation>
    </rdf:RDF>
  </metadata>
</record>
Header fields:
- identifier: unique URI for this record
- datestamp: date the record was last created, modified, or deleted
- setSpec: the set(s) this record belongs to
Metadata fields:
- dc:title: title of the object, may include an xml:lang attribute
- dc:creator: links to the Rijksmuseum URI for this person, rather than plain text
- dc:description: description, may include an xml:lang attribute
- edm:rights: rights statement, expressed as a URI
URIs in rdf:resource attributes can be resolved separately.
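Reading these fields in code mostly comes down to handling XML namespaces and namespaced attributes. The sketch below parses a shortened copy of the record above. The namespace declarations and the second, Dutch title are assumptions added for illustration, and `title_in` and `resource_uris` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "ore": "http://www.openarchives.org/ore/terms/",
}
# ElementTree stores namespaced attributes under their fully qualified names
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
RDF_RESOURCE = f"{{{NS['rdf']}}}resource"

# Shortened, illustrative record; the Dutch title is added for this example
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:edm="http://www.europeana.eu/schemas/edm/"
    xmlns:ore="http://www.openarchives.org/ore/terms/">
  <header>
    <identifier>https://id.rijksmuseum.nl/200107928</identifier>
    <datestamp>2024-09-27T07:20:23Z</datestamp>
  </header>
  <metadata>
    <rdf:RDF>
      <ore:Aggregation>
        <edm:aggregatedCHO>
          <edm:ProvidedCHO rdf:about="https://id.rijksmuseum.nl/200107928">
            <dc:title xml:lang="nl">De Nachtwacht</dc:title>
            <dc:title xml:lang="en">The Night Watch</dc:title>
            <dc:creator rdf:resource="https://id.rijksmuseum.nl/2103429"/>
          </edm:ProvidedCHO>
        </edm:aggregatedCHO>
        <edm:rights rdf:resource="http://creativecommons.org/publicdomain/mark/1.0/"/>
      </ore:Aggregation>
    </rdf:RDF>
  </metadata>
</record>"""


def title_in(record, lang):
    """Return the first dc:title whose xml:lang equals `lang`, or None."""
    for title in record.iterfind(".//dc:title", NS):
        if title.get(XML_LANG) == lang:
            return title.text
    return None


def resource_uris(record, path):
    """Return the rdf:resource URI of every element matching `path`."""
    return [el.get(RDF_RESOURCE) for el in record.iterfind(path, NS)]


record = ET.fromstring(SAMPLE_RECORD)
print(title_in(record, "en"))
print(resource_uris(record, ".//dc:creator"))
print(resource_uris(record, ".//edm:rights"))
```

Selecting a title by language avoids depending on the order in which titles happen to appear in the record.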
Step 6 - Pagination
OAI-PMH returns results in batches. At the end of each response, a resumptionToken indicates that more records are available:
<resumptionToken completeListSize="839762">bWV0YWmaXg...</resumptionToken>
The completeListSize attribute shows the total number of records in the complete result list, across all batches. Use the token to retrieve the next batch:
https://data.rijksmuseum.nl/oai?verb=ListRecords&resumptionToken=bWV0YWmaXg...
When using a resumptionToken, include only the verb and the token in the request. The token already encodes the original parameters, such as metadataPrefix and set. Continue until the response contains no resumptionToken or an empty one.
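The pagination loop can be written independently of the HTTP client, which also makes it easy to try out without touching the network. In the sketch below, `harvest`, `page`, and `fake_fetch` are hypothetical names. `fetch` stands for any callable that takes a dict of query parameters and returns the response body as XML text, and the two fake pages only imitate the shape of a ListRecords response.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest(fetch, first_params):
    """Yield <record> elements from every page of a ListRecords harvest."""
    params = dict(first_params)
    while params:
        root = ET.fromstring(fetch(params))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is not None and token.text:
            # Follow-up requests carry only the verb and the token
            params = {"verb": "ListRecords", "resumptionToken": token.text}
        else:
            params = None  # missing or empty token: last page reached


def page(identifiers, token=""):
    """Build a tiny fake ListRecords response for the demonstration."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>" for i in identifiers
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"{records}<resumptionToken>{token}</resumptionToken>"
        "</ListRecords></OAI-PMH>"
    )


# Two fake pages keyed by the token that requests them (None = first request)
FAKE_PAGES = {None: page(["a", "b"], token="t1"), "t1": page(["c"])}


def fake_fetch(params):
    return FAKE_PAGES[params.get("resumptionToken")]


ids = [
    r.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    for r in harvest(fake_fetch, {"verb": "ListRecords", "metadataPrefix": "edm"})
]
print(ids)  # ['a', 'b', 'c']
```

Because `harvest` is a generator, records can be processed as they arrive instead of being held in memory until the harvest finishes.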
Step 7 - Incremental Harvesting
OAI-PMH supports incremental harvesting using the from and until parameters. This means you do not need to re-download the entire dataset every time — only records that have changed since your last harvest.
https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm&from=2026-04-01T00:00:00Z
Timestamps must follow the format YYYY-MM-DDThh:mm:ssZ, as indicated by the granularity field in the Identify response.
A typical synchronisation workflow looks like this:
- Perform an initial full harvest
- Store the most recent datestamp among the harvested records
- On subsequent runs, use that datestamp as the from parameter
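The bookkeeping for this workflow can be sketched as follows. `latest_datestamp` and `incremental_params` are hypothetical helpers, and how you persist the stored datestamp between runs (a file, a database) is left open. The datestamps are parsed rather than compared as strings, so the result does not depend on the order in which records arrive.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches YYYY-MM-DDThh:mm:ssZ


def latest_datestamp(datestamps):
    """Return the most recent datestamp from an iterable of strings, or None if empty."""
    parsed = [datetime.strptime(s, STAMP_FORMAT) for s in datestamps]
    return max(parsed).strftime(STAMP_FORMAT) if parsed else None


def incremental_params(last_stamp, metadata_prefix="edm"):
    """Build ListRecords parameters; without a stored datestamp this is a full harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_stamp:
        params["from"] = last_stamp
    return params


stamps = ["2024-09-27T07:20:23Z", "2025-01-02T00:00:00Z", "2023-12-31T23:59:59Z"]
last = latest_datestamp(stamps)
print(last)
print(incremental_params(last))
print(incremental_params(None))
```

In the OAI-PMH specification the from bound is inclusive, so records carrying exactly the stored datestamp are returned again on the next run. A harvester should therefore overwrite records by identifier rather than append them. The specification also defines a noRecordsMatch error for requests that match nothing, which is the expected response when nothing has changed since the last harvest.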
Step 8 - Python example
Basic example
Before running this script, make sure the required library is installed:
pip install requests
The example below shows a basic way to retrieve records and extract key fields.
import requests
import xml.etree.ElementTree as ET

# Define the XML namespaces
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}
# Fully qualified name of the rdf:resource attribute
RDF_RESOURCE = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource'

# Fetch and parse the response
url = "https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm"
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early on HTTP errors
root = ET.fromstring(response.content)

# Iterate through records and print key fields
for record in root.findall('.//oai:record', ns):
    identifier = record.find('.//oai:identifier', ns)
    objectnumber = record.find('.//dc:identifier', ns)
    title = record.find('.//dc:title', ns)
    creator = record.find('.//dc:creator', ns)

    # Fields can be missing, so fall back to a placeholder instead of failing
    print(objectnumber.text if objectnumber is not None else 'No object number')
    print(f" Title: {title.text if title is not None else 'No title'}")
    # For simplicity we only show the first creator (if present)
    if creator is not None:
        uri = creator.attrib.get(RDF_RESOURCE)
        print(f" Creator: {uri}")
    print(f" Identifier: {identifier.text}")
    print()
This will produce output like:
RP-T-1888-A-1510
Title: Monkey on a Chain, seated
Creator: https://id.rijksmuseum.nl/2102549
Identifier: https://id.rijksmuseum.nl/200117613
RP-P-1958-599
Title: Five Cranes
Creator: https://id.rijksmuseum.nl/21081369
Identifier: https://id.rijksmuseum.nl/200134354
Note: records may contain multiple titles or creators (e.g. different languages or multiple agents/roles). This basic version only returns the first occurrence.
Harvesting all records
To retrieve all records across multiple pages, wrap the previous script in a function and follow the resumptionToken until none is returned.
The example below limits the harvest to one set. Without the set parameter, the full collection is harvested.
import requests
import xml.etree.ElementTree as ET

# Define the XML namespaces
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}
# Fully qualified name of the rdf:resource attribute
RDF_RESOURCE = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource'
BASE_URL = "https://data.rijksmuseum.nl/oai"

def fetch_records(params):
    # Fetch and parse the response; requests URL-encodes the parameters,
    # including any special characters in the resumptionToken
    response = requests.get(BASE_URL, params=params, timeout=60)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Iterate through records and print key fields
    for record in root.findall('.//oai:record', ns):
        identifier = record.find('.//oai:identifier', ns)
        objectnumber = record.find('.//dc:identifier', ns)
        title = record.find('.//dc:title', ns)
        creator = record.find('.//dc:creator', ns)

        print(objectnumber.text if objectnumber is not None else 'No object number')
        print(f" Title: {title.text if title is not None else 'No title'}")
        if creator is not None:
            uri = creator.attrib.get(RDF_RESOURCE)
            print(f" Creator: {uri}")
        print(f" Identifier: {identifier.text}")
        print()

    # Return the resumptionToken if there are more records
    token = root.find('.//oai:resumptionToken', ns)
    return token.text if token is not None and token.text else None

# Continue fetching until no resumptionToken is returned
params = {"verb": "ListRecords", "metadataPrefix": "edm", "set": "260216"}
while params:
    token = fetch_records(params)
    # Follow-up requests carry only the verb and the token
    params = {"verb": "ListRecords", "resumptionToken": token} if token else None
Resolving creator URIs
The dc:creator field contains a URI rather than a name. You can resolve this URI to retrieve additional information about the creator.
This step is added inside the record loop, directly after extracting the creator URI.
Only the preferred creator name is extracted from the response; the full Linked Art structure is not displayed.
Note: this sends a separate request per record. For large datasets this can become slow quickly. If you need creator information at scale, the search API may offer a more efficient alternative.
for record in root.findall('.//oai:record', ns):
    identifier = record.find('.//oai:identifier', ns)
    objectnumber = record.find('.//dc:identifier', ns)
    title = record.find('.//dc:title', ns)
    creator = record.find('.//dc:creator', ns)

    print(objectnumber.text if objectnumber is not None else 'No object number')
    print(f" Title: {title.text if title is not None else 'No title'}")

    # Resolve creator URI (added step)
    uri = None
    creator_name = None
    if creator is not None:
        uri = creator.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
    if uri:
        response = requests.get(uri, headers={"Accept": "application/ld+json"}, timeout=60)
        data = response.json()
        # Extract preferred creator name from Linked Art response
        for item in data.get('identified_by', []):
            for classification in item.get('classified_as', []):
                if classification.get('id') == 'http://vocab.getty.edu/aat/300404672':
                    creator_name = item.get('content')
                    break
            if creator_name:
                break

    if creator_name:
        print(f" Creator: {creator_name}")
    elif uri:
        # Fallback: print the URI if the name could not be resolved
        print(f" Creator: {uri}")
    print(f" Identifier: {identifier.text}")
    print()
After adding the creator resolution step, the output will include the resolved creator name instead of the URI:
RP-T-1888-A-1510
Title: Monkey on a Chain, seated
Creator: Goltzius, Hendrick
Identifier: https://id.rijksmuseum.nl/200117613
RP-P-1958-599
Title: Five Cranes
Creator: Shunman, Kubota
Identifier: https://id.rijksmuseum.nl/200134354
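If the same creator appears on many records, one way to reduce the number of extra requests is to remember names that were already resolved. The sketch below is one possible approach, not part of the Rijksmuseum services. `CreatorResolver`, `preferred_name`, and `fake_fetch_json` are hypothetical names, and the Linked Art document is a trimmed stand-in so the example runs without network access.

```python
PREFERRED_NAME = "http://vocab.getty.edu/aat/300404672"  # classification URI checked in this tutorial


def preferred_name(data):
    """Return the first name classified as preferred in a Linked Art document, else None."""
    for item in data.get("identified_by", []):
        if any(c.get("id") == PREFERRED_NAME for c in item.get("classified_as", [])):
            return item.get("content")
    return None


class CreatorResolver:
    """Resolve creator URIs to names, fetching each distinct URI only once."""

    def __init__(self, fetch_json):
        # fetch_json: any callable that takes a URI and returns the parsed JSON document
        self._fetch_json = fetch_json
        self._names = {}

    def name_for(self, uri):
        if uri not in self._names:
            self._names[uri] = preferred_name(self._fetch_json(uri))
        return self._names[uri]


# Stand-in for the network: returns a trimmed, illustrative Linked Art document
calls = []


def fake_fetch_json(uri):
    calls.append(uri)
    return {
        "identified_by": [
            {"content": "Goltzius, Hendrick", "classified_as": [{"id": PREFERRED_NAME}]}
        ]
    }


resolver = CreatorResolver(fake_fetch_json)
uri = "https://id.rijksmuseum.nl/2102549"
print(resolver.name_for(uri))
print(resolver.name_for(uri))  # second lookup is served from the cache
print(len(calls))  # 1
```

In the harvesting script, `fetch_json` could be a small wrapper such as `lambda uri: requests.get(uri, headers={"Accept": "application/ld+json"}, timeout=60).json()`.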
Summary
In this tutorial, you learned how to retrieve metadata from the Rijksmuseum data services using OAI-PMH. You explored how the protocol works, how to interpret XML responses, and how to harvest records using Python.
You also learned how to work with OAI-PMH features such as sets, metadata formats, pagination using resumptionToken, and incremental harvesting using from and until.
Finally, you extended the basic workflow by resolving creator URIs into readable names using Linked Data, demonstrating how OAI-PMH can be combined with external APIs to enrich metadata.