OAI‑PMH

Introduction

This tutorial shows you how to retrieve metadata from the Rijksmuseum data services using OAI‑PMH.

The documentation page explains the available endpoints, parameters, and technical specifications. This tutorial focuses on something different:

How do you actually use OAI‑PMH in a real workflow?

After completing this tutorial, you will understand:

when OAI‑PMH is the right choice
how to make your first request
how to harvest records
how to interpret XML responses
how pagination works with resumptionToken
how to retrieve incremental updates
how to build a simple harvesting workflow

When should you use OAI‑PMH?

OAI‑PMH is designed for systematically retrieving metadata.

Use OAI‑PMH when you want to:

download complete datasets
synchronize metadata regularly
build a local copy of collection data
collect datasets for research
retrieve new or updated records periodically

OAI‑PMH is not the best choice when you need to:

perform interactive searches
apply filters or search queries
retrieve individual objects directly

In those cases, the search API may be more appropriate.

How OAI‑PMH works

OAI‑PMH works differently from a search API. You do not send search queries. Instead, you harvest metadata in batches.

A typical workflow looks like this:

Identify → discover repository information
ListSets → explore available datasets
ListMetadataFormats → discover available metadata formats
ListRecords → harvest records
resumptionToken → retrieve next batch
from/until → request updates
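Every step in this workflow is a plain HTTP GET request whose parameters select the verb and its options. As a minimal sketch, the helper below assembles such request URLs. The name build_url and the from_ spelling (needed because from is a reserved word in Python) are illustrative choices, not part of the Rijksmuseum documentation.

```python
from urllib.parse import urlencode

# Endpoint used throughout this tutorial
BASE_URL = "https://data.rijksmuseum.nl/oai"


def build_url(verb, **params):
    """Return an OAI-PMH request URL for a verb and its optional parameters.

    Hypothetical helper: a trailing underscore is stripped from parameter
    names so that callers can write from_=... for the 'from' parameter.
    """
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            query[key.rstrip("_")] = value
    # urlencode percent-encodes reserved characters such as ':' in timestamps
    return f"{BASE_URL}?{urlencode(query)}"


print(build_url("Identify"))
print(build_url("ListRecords", metadataPrefix="edm", set="26021"))
print(build_url("ListRecords", metadataPrefix="edm", from_="2026-04-01T00:00:00Z"))
```

Building URLs through urlencode rather than string concatenation keeps timestamps and tokens correctly escaped.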

Step-by-step tutorial

Step 1 - Verify the Endpoint

Start with a simple request:

https://data.rijksmuseum.nl/oai?verb=Identify

The Identify verb returns general information about the OAI-PMH repository. You will see details such as:

repository name
administrator information
granularity: the timestamp precision the repository supports, e.g. YYYY-MM-DDThh:mm:ssZ
earliestDatestamp: the lower limit of all record datestamps in the repository

This is a useful first step to confirm that the endpoint is available and to understand the capabilities of the repository.
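The fields above can be read programmatically. The sketch below parses a hand-written, trimmed Identify response; the values in SAMPLE_IDENTIFY are placeholders chosen for illustration and are not real output from the Rijksmuseum endpoint. Only the element names come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative sample only: placeholder values, not a real response
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2000-01-01T00:00:00Z</earliestDatestamp>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return selected Identify fields as a dict (None for missing fields)."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        return {}
    fields = ("repositoryName", "adminEmail", "earliestDatestamp", "granularity")
    return {
        name: identify.findtext(f"oai:{name}", default=None, namespaces=OAI_NS)
        for name in fields
    }


info = parse_identify(SAMPLE_IDENTIFY)
print(info["granularity"])
```

To run this against the live endpoint, pass the body of the Identify request to parse_identify instead of the sample string.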

Step 2 - Explore Available Sets

OAI-PMH often uses sets. Sets allow you to harvest subsets of data instead of retrieving the entire repository.

Use ListSets to retrieve available sets:

https://data.rijksmuseum.nl/oai?verb=ListSets

The response contains available collections or categories, for example:

<set>
<setSpec>2619</setSpec>
<setName>Drawings by Rembrandt and his School in the Rijksmuseum</setName>
</set>

In this example:

setSpec is the identifier used in future requests
setName is the human-readable title of the dataset

Sets are useful when you only want to harvest a specific subset of the collection, rather than the entire repository.
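To pick a set programmatically, you can turn the ListSets response into a lookup table. The sketch below wraps the set element shown above in a minimal OAI-PMH envelope (the envelope is reconstructed for illustration) and maps each setSpec to its setName. The function name parse_sets is an illustrative choice.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# The <set> element from the example above, wrapped in a minimal envelope
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>2619</setSpec>
      <setName>Drawings by Rembrandt and his School in the Rijksmuseum</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return a dict mapping each setSpec to its human-readable setName."""
    root = ET.fromstring(xml_text)
    return {
        s.findtext("oai:setSpec", namespaces=OAI_NS): s.findtext("oai:setName", namespaces=OAI_NS)
        for s in root.iterfind(".//oai:set", OAI_NS)
    }


sets = parse_sets(SAMPLE_LIST_SETS)
for spec, name in sets.items():
    print(spec, name)
```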

Step 3 - Explore Available Metadata Formats

Before harvesting records, you need to know which metadata formats are supported by the repository.

Use ListMetadataFormats to retrieve the available formats:

https://data.rijksmuseum.nl/oai?verb=ListMetadataFormats

The repository supports the following metadata formats:

metadataPrefix	Format
edm	Europeana Data Model
oai_dc	Dublin Core

Use edm if you need rich, structured metadata including rights statements, aggregation information, and links to related resources. Use oai_dc if you need a simpler, more widely compatible format with basic fields only.

The metadataPrefix value from this response will be used in the next step.

Step 4 - Harvest Records

Use ListRecords to retrieve metadata records from the repository.

Start with the following request:

https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm

This request returns records in the selected metadata format. Each record contains two main sections:

header: technical information used for harvesting
metadata: descriptive information about the object

To limit results to a specific set, add the set parameter:

https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm&set=26021

Step 5 - XML structure

OAI-PMH responses are XML documents. You do not need to understand the full schema immediately, but a few key sections appear in every record.

The examples below use EDM. If you selected oai_dc as your metadata format, the metadata section will contain simpler Dublin Core fields such as dc:title, dc:creator, and dc:date.

A simplified EDM record looks like this:

<record>
  <header>
    <identifier>https://id.rijksmuseum.nl/200107928</identifier>
    <datestamp>2024-09-27T07:20:23Z</datestamp>
    <setSpec>26021</setSpec>
  </header>
  <metadata>
    <rdf:RDF>
      <ore:Aggregation>
        <edm:aggregatedCHO>
          <edm:ProvidedCHO rdf:about="https://id.rijksmuseum.nl/200107928">
            <dc:title xml:lang="en">The Night Watch</dc:title>
            <dc:creator rdf:resource="https://id.rijksmuseum.nl/2103429"/>
            <dc:description xml:lang="en">Rembrandt's largest, most famous canvas...</dc:description>
          </edm:ProvidedCHO>
        </edm:aggregatedCHO>
        <edm:rights rdf:resource="http://creativecommons.org/publicdomain/mark/1.0/"/>
      </ore:Aggregation>
    </rdf:RDF>
  </metadata>
</record>

Header fields

identifier: unique URI for this record
datestamp: date the record was created, last modified, or deleted
setSpec: the set(s) this record belongs to

Metadata fields

dc:title: title of the object, may include an xml:lang attribute
dc:creator: links to the Rijksmuseum URI for this person, rather than plain text
dc:description: description, may include an xml:lang attribute
edm:rights: rights statement, expressed as a URI

URIs in rdf:resource attributes, such as the dc:creator value, can be resolved with a separate request to retrieve more detail.
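The header and metadata fields described above can be extracted with Python's standard XML parser. The sketch below parses the simplified record from this step. The namespace declarations on the record element are added here so that the fragment parses on its own; in a real response they are declared by the server. The function name parse_record and the shape of the returned dict are illustrative choices.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "ore": "http://www.openarchives.org/ore/terms/",
}
RDF_RESOURCE = f"{{{NS['rdf']}}}resource"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The simplified record from above, with namespace declarations added
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:edm="http://www.europeana.eu/schemas/edm/"
    xmlns:ore="http://www.openarchives.org/ore/terms/">
  <header>
    <identifier>https://id.rijksmuseum.nl/200107928</identifier>
    <datestamp>2024-09-27T07:20:23Z</datestamp>
    <setSpec>26021</setSpec>
  </header>
  <metadata>
    <rdf:RDF>
      <ore:Aggregation>
        <edm:aggregatedCHO>
          <edm:ProvidedCHO rdf:about="https://id.rijksmuseum.nl/200107928">
            <dc:title xml:lang="en">The Night Watch</dc:title>
            <dc:creator rdf:resource="https://id.rijksmuseum.nl/2103429"/>
          </edm:ProvidedCHO>
        </edm:aggregatedCHO>
        <edm:rights rdf:resource="http://creativecommons.org/publicdomain/mark/1.0/"/>
      </ore:Aggregation>
    </rdf:RDF>
  </metadata>
</record>"""


def parse_record(record):
    """Collect header fields, all titles by language, and all creator URIs."""
    header = record.find("oai:header", NS)
    rights = record.find(".//edm:rights", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        # OAI-PMH marks deleted records with status="deleted" on the header
        "deleted": header.get("status") == "deleted",
        "titles": {t.get(XML_LANG): t.text for t in record.iterfind(".//dc:title", NS)},
        "creators": [c.get(RDF_RESOURCE) for c in record.iterfind(".//dc:creator", NS)],
        "rights": rights.get(RDF_RESOURCE) if rights is not None else None,
    }


parsed = parse_record(ET.fromstring(SAMPLE_RECORD))
print(parsed["titles"], parsed["creators"])
```

Collecting titles by language and creators as a list keeps every occurrence, which matters for records with multiple titles or agents.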

Step 6 - Pagination

OAI-PMH returns results in batches. At the end of each response, a resumptionToken indicates that more records are available:

<resumptionToken completeListSize="839762">bWV0YWmaXg...</resumptionToken>

The completeListSize attribute shows the total number of records that match the request, not the size of the current batch. Use the token to retrieve the next batch:

https://data.rijksmuseum.nl/oai?verb=ListRecords&resumptionToken=bWV0YWmaXg...

When using a resumptionToken, send only the verb and the token, with no other parameters, because the token already encodes the original request. Continue until the response contains no resumptionToken or an empty one.
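The pagination loop can be written as a generator that keeps requesting batches until the token runs out. The sketch below takes the download function as an argument, so the loop can be demonstrated offline with two hand-written pages. The names harvest and fake_fetch and the token value page2 are illustrative assumptions, not real server output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://data.rijksmuseum.nl/oai"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest(fetch, **first_request):
    """Yield every <record> element, following resumptionTokens.

    fetch is any callable that takes a URL and returns the response body.
    """
    params = {"verb": "ListRecords", **first_request}
    while params:
        root = ET.fromstring(fetch(f"{BASE_URL}?{urlencode(params)}"))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS).strip()
        # Follow-up requests carry only the verb and the token;
        # a missing or empty token ends the loop
        params = {"verb": "ListRecords", "resumptionToken": token} if token else None


# Offline stand-in for the server: two hand-written pages (illustrative only)
_PAGE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>{id}</identifier></header></record>"
    "{token}</ListRecords></OAI-PMH>"
)


def fake_fetch(url):
    if "resumptionToken=page2" in url:
        return _PAGE.format(id="b", token="<resumptionToken/>")
    return _PAGE.format(id="a", token="<resumptionToken>page2</resumptionToken>")


harvested_ids = [
    record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    for record in harvest(fake_fetch, metadataPrefix="edm")
]
print(harvested_ids)  # ['a', 'b']
```

For a live harvest, one option is to pass a function such as lambda url: requests.get(url, timeout=30).content as fetch.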

Step 7 - Incremental Harvesting

OAI-PMH supports incremental harvesting using the from and until parameters. Instead of re-downloading the entire dataset every time, you request only the records that have changed since your last harvest.

https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm&from=2026-04-01T00:00:00Z

Timestamps must follow the format YYYY-MM-DDThh:mm:ssZ, as indicated by the granularity field in the Identify response.

A typical synchronisation workflow looks like this:

Perform an initial full harvest
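Store the most recent datestamp among the harvested records (records are not guaranteed to arrive in date order), or the time at which the harvest started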
On subsequent runs, use that datestamp as the from parameter
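The synchronisation workflow above needs a small piece of persistent state. The sketch below stores the timestamp of the last harvest in a JSON file and uses it as the from parameter on the next run. The file name harvest_state.json, the key last_harvest, and all function names are hypothetical choices for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def utc_now():
    """Current UTC time in the YYYY-MM-DDThh:mm:ssZ format used by from/until."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def load_last_harvest(state_file):
    """Return the stored timestamp, or None if no harvest has run yet."""
    if state_file.exists():
        return json.loads(state_file.read_text())["last_harvest"]
    return None


def save_last_harvest(timestamp, state_file):
    state_file.write_text(json.dumps({"last_harvest": timestamp}))


def next_request_params(state_file):
    """Build ListRecords parameters: full harvest first, incremental afterwards."""
    params = {"verb": "ListRecords", "metadataPrefix": "edm"}
    last = load_last_harvest(state_file)
    if last:
        params["from"] = last
    return params


# Demonstration in a temporary directory so that no real file is left behind
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "harvest_state.json"  # hypothetical file name
    first = next_request_params(state)   # no state yet: full harvest
    save_last_harvest("2026-04-01T00:00:00Z", state)
    second = next_request_params(state)  # state present: incremental harvest

print(first)
print(second)
```

One reasonable design is to call utc_now() before the harvest starts and save that value only after the harvest completes, so that records changed during a long harvest are picked up on the next run.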

Step 8 - Python examples

Basic example

Before running this script, make sure the required library is installed:

pip install requests

The example below shows a basic way to retrieve records and extract key fields.

import requests
import xml.etree.ElementTree as ET

# Define the XML namespaces
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Fetch and parse the response
url = "https://data.rijksmuseum.nl/oai?verb=ListRecords&metadataPrefix=edm"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
root = ET.fromstring(response.content)

# Iterate through records and print key fields
for record in root.findall('.//oai:record', ns):
    identifier = record.find('.//oai:identifier', ns)
    objectnumber = record.find('.//dc:identifier', ns)
    title = record.find('.//dc:title', ns)
    creator = record.find('.//dc:creator', ns)

    # Deleted records carry no metadata, so the object number may be missing
    print(objectnumber.text if objectnumber is not None else 'No object number')
    print(f"  Title:      {title.text if title is not None else 'No title'}")
    # For simplicity we only show the first creator (if present)
    if creator is not None:
        uri = creator.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
        print(f"  Creator:    {uri}")
    print(f"  Identifier: {identifier.text}")
    print()

This will produce output like:

RP-T-1888-A-1510
  Title:      Monkey on a Chain, seated
  Creator:    https://id.rijksmuseum.nl/2102549
  Identifier: https://id.rijksmuseum.nl/200117613

RP-P-1958-599
  Title:      Five Cranes
  Creator:    https://id.rijksmuseum.nl/21081369
  Identifier: https://id.rijksmuseum.nl/200134354

Note: records may contain multiple titles or creators (e.g. different languages or multiple agents/roles). This basic version only returns the first occurrence.

Harvesting all records

To retrieve all records across multiple pages, wrap the previous script in a function that returns the resumptionToken, and keep requesting until no token is returned.

The example restricts the harvest to a single set to limit the number of records; without the set parameter, the full collection is harvested.

import requests
import xml.etree.ElementTree as ET

# Define the XML namespaces
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

BASE_URL = "https://data.rijksmuseum.nl/oai"

def fetch_records(params):
    # Fetch and parse the response; requests URL-encodes the parameters,
    # including any special characters in the resumptionToken
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Iterate through records and print key fields
    for record in root.findall('.//oai:record', ns):
        identifier = record.find('.//oai:identifier', ns)
        objectnumber = record.find('.//dc:identifier', ns)
        title = record.find('.//dc:title', ns)
        creator = record.find('.//dc:creator', ns)

        # Deleted records carry no metadata, so the object number may be missing
        print(objectnumber.text if objectnumber is not None else 'No object number')
        print(f"  Title:      {title.text if title is not None else 'No title'}")
        if creator is not None:
            uri = creator.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
            print(f"  Creator:    {uri}")
        print(f"  Identifier: {identifier.text}")
        print()

    # Return the resumptionToken if there are more records
    token = root.find('.//oai:resumptionToken', ns)
    return token.text if token is not None and token.text else None

# Continue fetching until no resumptionToken is returned
params = {"verb": "ListRecords", "metadataPrefix": "edm", "set": "260216"}
while params:
    token = fetch_records(params)
    # Follow-up requests carry only the verb and the token
    params = {"verb": "ListRecords", "resumptionToken": token} if token else None

Resolving creator URIs

The dc:creator field contains a URI rather than a name. You can resolve this URI to retrieve additional information about the creator.

This step is added inside the record loop, directly after extracting the creator URI.

Only the preferred creator name is extracted from the response; the full Linked Art structure is not displayed.

Note: this sends a separate request per record. For large datasets this can become slow quickly. If you need creator information at scale, the search API may offer a more efficient alternative.

for record in root.findall('.//oai:record', ns):
    identifier = record.find('.//oai:identifier', ns)
    objectnumber = record.find('.//dc:identifier', ns)
    title = record.find('.//dc:title', ns)
    creator = record.find('.//dc:creator', ns)

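    # Deleted records carry no metadata, so the object number may be missing
    print(objectnumber.text if objectnumber is not None else 'No object number')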
    print(f"  Title:      {title.text if title is not None else 'No title'}")

    # Resolve creator URI (added step)
    creator_name = None

    if creator is not None:
        uri = creator.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')

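        if uri:
            response = requests.get(uri, headers={"Accept": "application/ld+json"}, timeout=30)
            # On an HTTP error, skip name resolution; the URI fallback below still prints
            data = response.json() if response.ok else {}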

            # Extract preferred creator name from Linked Art response
            for item in data.get('identified_by', []):
                for classification in item.get('classified_as', []):
                    if classification.get('id') == 'http://vocab.getty.edu/aat/300404672':
                        creator_name = item.get('content')
                        break
                if creator_name:
                    break

    if creator_name:
        print(f"  Creator:    {creator_name}")
    elif creator is not None:
        # fallback: print URI if name not resolved
        uri = creator.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
        print(f"  Creator:    {uri}")

    print(f"  Identifier: {identifier.text}")
    print()

After adding the creator resolution step, the output will include the resolved creator name instead of the URI:

RP-T-1888-A-1510
  Title:      Monkey on a Chain, seated
  Creator:    Goltzius, Hendrick
  Identifier: https://id.rijksmuseum.nl/200117613

RP-P-1958-599
  Title:      Five Cranes
  Creator:    Shunman, Kubota
  Identifier: https://id.rijksmuseum.nl/200134354
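Because many records share the same creator, one way to reduce the number of extra requests is to cache each resolved name. The sketch below wraps the name extraction in functools.lru_cache and takes the download function as an argument, so it can be demonstrated offline. The names make_resolver and fake_fetch_json are illustrative, and the sample JSON is hand-written to match only the fields that the tutorial code reads; it is not a full Linked Art response.

```python
from functools import lru_cache

# Classification URI that the tutorial code uses to pick the preferred name
PREFERRED_NAME = "http://vocab.getty.edu/aat/300404672"


def extract_preferred_name(data):
    """Return the preferred name from a Linked Art document, or None."""
    for item in data.get("identified_by", []):
        for classification in item.get("classified_as", []):
            if classification.get("id") == PREFERRED_NAME:
                return item.get("content")
    return None


def make_resolver(fetch_json):
    """Return a resolve(uri) function that downloads each URI at most once.

    fetch_json is any callable that takes a URI and returns parsed JSON.
    """
    @lru_cache(maxsize=None)
    def resolve(uri):
        return extract_preferred_name(fetch_json(uri))
    return resolve


# Offline stand-in for the HTTP request (hand-written sample, illustrative only)
calls = []


def fake_fetch_json(uri):
    calls.append(uri)
    return {
        "identified_by": [
            {"content": "Goltzius, Hendrick", "classified_as": [{"id": PREFERRED_NAME}]}
        ]
    }


resolve = make_resolver(fake_fetch_json)
names = [resolve("https://id.rijksmuseum.nl/2102549") for _ in range(3)]
print(names[0], "resolved with", len(calls), "request(s)")
```

For live use, fetch_json could be a function such as lambda uri: requests.get(uri, headers={"Accept": "application/ld+json"}, timeout=30).json().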

Summary

In this tutorial, you learned how to retrieve metadata from the Rijksmuseum data services using OAI-PMH. You explored how the protocol works, how to interpret XML responses, and how to harvest records using Python.

You also learned how to work with OAI-PMH features such as sets, metadata formats, pagination using resumptionToken, and incremental harvesting using from and until.

Finally, you extended the basic workflow by resolving creator URIs into readable names using Linked Data, demonstrating how OAI-PMH can be combined with external APIs to enrich metadata.
