The DPLA is launching an open-source tool for fast, large-scale data harvests from OAI repositories. The tool uses the Apache Spark distributed processing engine to speed up and scale up harvesting, and to perform complex analysis of the harvested data. It is helping us improve our internal workflows and provide better service to our hubs. The Spark OAI Harvester is freely available, and we hope that others working with interoperable cultural heritage or science data will find uses for it in their own projects.
What’s valuable and challenging about OAI harvests
OAI is short for the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI is a set of rules that allows a digital repository to share its metadata records over the web. Following the rules of OAI, another application can ask the repository for data and interpret the response. A “harvest” is the process of collecting data from an OAI repository.
DPLA is not alone in its wish for better and faster OAI harvests. In the worlds of cultural heritage and science data, OAI is a common way to share and collect data. It’s used by CONTENTdm, DSpace, LibGuides, WorldCat, Omeka, MarcEdit, bepress, NASA, the Department of Energy, and Wikimedia, to name a few. It’s also used by many of the hubs and contributors that share data with DPLA. We built the Spark OAI Harvester in an effort to serve this diverse community.
The Spark OAI Harvester addresses two main challenges that DPLA often encountered: size and speed. OAI harvests can contain a very large amount of data, which means that some methods of processing and examining this data can require a lot of memory. They can also take a long time to complete. An OAI harvest consists of a series of HTTP requests, each returning a partial list of records along with a token for requesting the next part. A single harvest can consist of hundreds, even thousands, of HTTP requests. A single HTTP request may take only a fraction of a second to complete, but these add up quickly. At DPLA, we would often have to wait hours for harvests to complete.
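To make the request-by-request nature of a harvest concrete, here is a minimal sketch in Python of a sequential OAI-PMH harvest loop. It is not the Spark OAI Harvester's code. The `harvest` and `fetch_page` names and the pluggable `fetch` parameter are assumptions made for illustration; the `ListRecords` verb, `metadataPrefix` argument, and `resumptionToken` element come from the OAI-PMH specification.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_page(url):
    """Default fetcher: perform one HTTP GET and return the response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=fetch_page):
    """Yield every <record> element, following resumption tokens page by page.

    `fetch` is a hypothetical hook so the loop can be exercised without a network.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        yield from root.iter(OAI_NS + "record")

        # A missing or empty resumptionToken marks the last page.
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Each request depends on the token returned by the previous one, which is why a single list of this kind has to be fetched one page after another.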
How Spark helps with speed, scale, and analysis
Spark helps us address the problems of both size and speed. Spark is a distributed processing engine. With Spark, you can divide the work of a single task across multiple processors running in parallel. At DPLA, we use the processors on our laptops for smaller jobs, and processors in the cloud for larger jobs. More processors working on a problem means more memory. It also means that in some cases, the HTTP requests that comprise an OAI harvest can be divided up among the processors and run concurrently (see the technical documentation for more details). This makes ingestions run much faster.
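One case where requests can be divided up is when a repository organizes its records into independent sets, since each set can be harvested on its own. The sketch below illustrates the idea in Python, with a thread pool standing in for Spark's distributed workers. It is not how the Spark OAI Harvester is implemented, and `harvest_set` is a hypothetical placeholder that returns made-up records instead of issuing real HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor


def harvest_set(set_name):
    """Hypothetical placeholder for harvesting one OAI set sequentially.

    A real version would page through the set's records over HTTP.
    """
    return [f"{set_name}-record-{i}" for i in range(3)]


def harvest_sets_concurrently(set_names, harvest_one=harvest_set, max_workers=4):
    """Harvest several independent sets at once and flatten the results.

    Results come back in the same order as `set_names`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_set = pool.map(harvest_one, set_names)
        return [record for records in per_set for record in records]
```

Calling `harvest_sets_concurrently(["maps", "photographs"])` runs both set harvests side by side; the total wait is then closer to the slowest set than to the sum of all sets.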
Another benefit of Spark is that we’re able to work with harvested data in a structured, distributed format. This makes it easier to examine, analyze, and manipulate the data after it’s been harvested. With Spark’s built-in libraries, you can run SQL queries over your data, apply machine learning algorithms, or perform graph computations. The Spark OAI Harvester returns data that’s ready to work with these libraries.
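As a rough illustration of the kind of question structured data makes easy to ask, the sketch below counts harvested records per set with a SQL query. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, and the sample records, column names, and `count_records_by_set` function are all hypothetical.

```python
import sqlite3

# Hypothetical harvested records: (identifier, set name, datestamp).
records = [
    ("oai:example.org:1", "maps", "2017-01-05"),
    ("oai:example.org:2", "maps", "2017-02-11"),
    ("oai:example.org:3", "photographs", "2017-02-12"),
]


def count_records_by_set(rows):
    """Load rows into an in-memory table and count records per set."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE records (id TEXT, set_name TEXT, datestamp TEXT)")
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        query = (
            "SELECT set_name, COUNT(*) FROM records "
            "GROUP BY set_name ORDER BY set_name"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The same `GROUP BY` query could in principle be run over a distributed table in Spark, where the work would be spread across processors rather than handled by a single in-memory database.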
At DPLA, the Spark OAI Harvester is helping us to be more iterative in our workflow. The entire ingestion process includes harvesting data from our hubs, mapping it to the DPLA metadata application profile, enriching it, and publishing it to our API. With OAI harvests running in a fraction of the time, we can be more responsive to any problems or unexpected challenges. With the help of Spark’s libraries, we can also analyze and manipulate data, and make changes earlier in the ingestion process.
We hope that the Spark OAI Harvester will prove useful for others in the cultural heritage and science communities. If you’re interested in learning more about the technical aspects, using the Spark OAI Harvester in your project, or collaborating with us, check out the technical documentation or view the code on GitHub. We welcome your feedback!
DPLA Spark OAI Harvester
- OAI-PMH: Basics and Resources by James M. Day
- OAI for Beginners from the Open Archives Forum
- Open Archives Initiative Protocol for Metadata Harvesting (official documentation)