DPLA Prototype Platform Overview
Introduction to this document
This overview is an initial attempt to lay out provisional directions for the development of the DPLA platform. It assumes a milestone of April 27, 2012 (the San Francisco DPLA Plenary), another major milestone in October 2012 (another plenary), and the release of a public beta in April 2013. (We will refer to these as Phase 1, Phase 2, and Phase 3.) This document at times looks further ahead than that, while understanding that construction will be incremental.
This document is a slightly "friendlier" version of the overview section of the DPLA Platform Overview, which will be posted shortly.
Who is the DPLA platform for?
The DPLA prototype platform is a technical infrastructure enabling developers to build applications that use the metadata and content aggregated by the DPLA. With this platform, 1) the community of developers will be able to create innovative applications; 2) the information and services of the DPLA will be integrated into existing sites; and 3) the services and content the DPLA chooses to provide will be supported. Local libraries will benefit from the tools the community of developers will create by being able to integrate DPLA data and services into their own sites, and by using a downloadable version of the platform software both for their services and as a way to participate in the broader ecology.
The DPLA platform will have no direct user interface. Rather, it enables developers to create applications that end users can use. It is middleware that provides centralized access to content that is distributed across the Internet and a set of services that will facilitate the development of applications that use that content.
The DPLA platform has four key clients:
- Libraries, archives, museums—physical and online. The value of the DPLA platform comes from the organizations and institutions that provide access to our cultural heritage. The platform therefore needs to provide both direct and indirect value to those organizations, including libraries, archives, museums, and curators and aggregators of online collections.
- The community of developers. Developers will want to build applications that take advantage of the metadata made available through the DPLA platform. This means the platform should manage the content's metadata in ways that make it increasingly useful over time, and should provide a set of services that make development easier and thereby encourage developers to take up the task. While the aim is to enable the development of unanticipated applications, some seem fairly likely, including end-user tools to create and browse collections, library analytics tools, recommendation engines, and OPAC enhancers.
- Other sites and services. The DPLA platform should make it easy for the DPLA's content, metadata, and services to be integrated into the web ecosystem. For example, it is easy to imagine ways Wikipedia could integrate information from the DPLA platform about cultural heritage and library items.
- The DPLA itself. The platform needs to make it possible for the DPLA to develop a home page or portal for end users. The platform will also support a DPLA repository that contains material digitized under the auspices of the DPLA.
Overview of how it will work
The DPLA prototype platform is a metadata server. It will aggregate and manage metadata about content available online and in local physical institutions (libraries, archives, museums, etc.). For example, it might know the photographers, dates, and URLs of all of the images in the San Francisco Public Library image archive, but it would not keep copies of the images themselves (other than thumbnails). It will provide links to online data so that applications can be written that allow interaction with that content served from its distributed hosts. (The platform will also provide links to content that results from DPLA digitization projects. In addition, the platform will provide hosting and management services for that content.)
The prototype platform's success depends to a large degree on how much metadata it can gather, and how useful it can make that metadata. It will compile what can be thought of as a meta-catalog, that is, a catalog of catalogs. But it is not enough simply to have separate listings for the thousands and tens of thousands of catalogs (and tens of millions of items) the DPLA may gather rather quickly. The distributed materials are of little value if they cannot be searched and queried. This requires mapping the catalog structures to one another. For example, a developer might want to write an application that lets a user search for everything about whales in the DPLA distributed collection. To do this, the platform needs to know that what one collection refers to as "description," another might refer to as "summary." Mapping these different collection structures ("schemas") will be a big part of the DPLA platform's value, and a daunting task. We do not expect to get beyond incremental improvements, but imperfect mappings can still be highly useful.
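As an illustration of what such mapping involves, here is a minimal sketch in Python. All collection and field names are invented for this example and do not reflect actual DPLA schemas; a real mapping would also have to handle split, merged, and repeated fields.

```python
# Minimal sketch of cross-collection field mapping (names hypothetical).
# Each contributing collection declares how its fields map onto a shared
# core vocabulary, so a query for "description" also reaches records
# whose source collection calls the same field "summary" or "abstract".

CORE_FIELD_MAP = {
    "sfpl_images": {"summary": "description", "taken_by": "creator"},
    "state_archive": {"abstract": "description", "photographer": "creator"},
}

def to_core(record, collection):
    """Rewrite a raw record's field names into the shared core schema."""
    mapping = CORE_FIELD_MAP.get(collection, {})
    return {mapping.get(field, field): value for field, value in record.items()}

raw = {"summary": "Whaling ships at dock", "taken_by": "Unknown"}
core = to_core(raw, "sfpl_images")
# core now exposes the shared "description" and "creator" field names
```

Even a per-collection lookup table this simple would let a cross-collection "whales" query reach every record with a description, whatever the source called that field.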
Once the metadata is in and at least roughly mapped, it will be made available to the clients of the DPLA platform. Typically, this will be done through APIs (application programming interfaces), which allow one program to request information from another. The DPLA platform's API will aim to provide responses to complex queries, including ones that perform commonly requested tasks on the platform's servers.
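As a rough illustration of a commonly requested task performed server-side, the sketch below computes facet counts alongside search results, so that client applications need not fetch every record to tally creators themselves. All records, field names, and the function itself are invented for this example.

```python
# Hypothetical server-side "common task": facet search results by a field.
from collections import Counter

RECORDS = [
    {"title": "Moby-Dick", "creator": "Melville, Herman", "subject": "Whales"},
    {"title": "Whale Hunt", "creator": "Unknown", "subject": "Whales"},
    {"title": "Walden", "creator": "Thoreau, Henry David", "subject": "Nature"},
]

def search(records, subject, facet=None):
    """Return matching records, optionally with counts grouped by a facet."""
    hits = [r for r in records if r["subject"] == subject]
    response = {"count": len(hits), "items": hits}
    if facet:
        response["facets"] = dict(Counter(r[facet] for r in hits))
    return response

resp = search(RECORDS, "Whales", facet="creator")
# resp["count"] == 2, with per-creator facet counts included
```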
The value it brings, and the trade-offs
The platform will lower the hurdle to importing many different types of metadata from libraries, archives, museums, and other curating institutions. Whenever feasible, the metadata will be put into a highly simplified schema, possibly based on Dublin Core with Schema.org extensions. This will enable cross-collection and cross-type querying, although in many cases over what may be just a few shreds of metadata.
The API will provide direct and easy access to the values captured by the simplified core schema. The metadata that does not fit into that core schema will be maintained by the platform, and will be accessible to developers who know the particularities of the original schema. Those schemas will themselves be available via the API.
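One way to picture this arrangement: each stored record could carry both the simplified core values and the untouched original. The structure below is purely a hypothetical sketch, not a committed design; the MARC-style fields are shown only to suggest what "original" metadata might look like.

```python
# Hypothetical stored record: a simplified Dublin Core-style "core" view
# for easy cross-collection querying, plus the raw original record for
# developers who know its native schema.

record = {
    "core": {                        # simplified core schema (DC-style)
        "title": "Moby-Dick",
        "creator": "Melville, Herman",
        "type": "text",
    },
    "original_schema": "marc21",     # tells developers how to read "original"
    "original": {                    # raw source metadata, preserved as-is
        "245": {"a": "Moby-Dick,", "c": "by Herman Melville."},
        "100": {"a": "Melville, Herman,", "d": "1819-1891."},
    },
}

def core_value(rec, field):
    """Easy path: read a field from the simplified core schema."""
    return rec["core"].get(field)

def native_value(rec, tag, subfield):
    """Expert path: read a field from the preserved original schema."""
    return rec["original"].get(tag, {}).get(subfield)
```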
The platform’s data will be unevenly distributed. For example, library items in the MARC21 format typically are fairly rich in human-curated metadata, including subject headings. Collection information in the EAD format tends to be rather sparse, and typically contains no information about individual items. The platform will include it all, in its raw state, on the grounds that some developer might find it useful. But this disparity in richness of metadata can create some issues, such as skewing query results in unproductive ways.
At the same time, the prototype platform will do what it can to enhance the utility of the metadata it has ingested. It will take as a continuing project the use of algorithms to associate its content semantically. This might involve more inclusive normalizing techniques, vocabulary matching, and using linked data to build “metadata clouds” around materials that have little native metadata associated with them. These techniques will, we hope, incrementally enrich the metadata so that queries through the API return more useful results.
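A toy sketch of this kind of enrichment: a local authority lookup (standing in here for a real linked-data service such as VIAF or LCSH) adds a creator URI and subject terms to a sparse record. All identifiers and values in this example are placeholders, not real authority data.

```python
# Hypothetical enrichment via a linked-data-style authority lookup.
# AUTHORITY stands in for a real service; identifiers are placeholders.

AUTHORITY = {
    "Melville, Herman": {
        "creator_uri": "viaf:placeholder-id",
        "subjects": ["Whaling", "Sea stories"],
    },
}

def enrich(record):
    """Attach authority-derived fields to a record, without overwriting
    any values the record already has."""
    extra = AUTHORITY.get(record.get("creator"), {})
    enriched = dict(record)
    enriched.setdefault("creator_uri", extra.get("creator_uri"))
    enriched.setdefault("subjects", extra.get("subjects", []))
    return enriched
```

A record arriving with nothing but a title and creator could thus pick up subject terms it never carried natively, which is the point of the “metadata cloud” idea.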
How will it get built?
The scope of this project is, of course, gigantic. Here is how we propose it be built:
- Incrementally and continuously. Many items in this scope document are long-term.
- With multiple teams, with the core dev team as a reliable resource.
- Using existing software whenever possible.
As re-deployable open source software
The DPLA platform will be developed as open source software so that anyone can download a copy for her or his own use.
This local version of the DPLA platform will have all of the central platform's functionality (plus some additional capabilities required by local institutions), but not its metadata or content. The local institution would instead typically use its own metadata and content, as well as accessing whatever the DPLA platform makes available through its APIs. For example, a state library system might use it to pull together metadata from across its multiple libraries, creating local services and performing cross-library analytics. The infrastructure of the DPLA platform will support content nodes that communicate with the DPLA main node via software provided by the DPLA.
Any institution or node using the DPLA platform locally will find it easy to keep its own metadata updated at the central version of the platform and to take advantage of the metadata and services offered by that central platform.
We have several design aims when thinking about the types of metadata to ingest and how best to handle it:
- We want to lower the hurdle to ingesting metadata so that many organizations will contribute.
- We assume that metadata will cover a wide variety of object types, both as items and as collections that will range from the unstructured to the highly sophisticated.
- We assume that only some of the metadata will conform to well-known standards.
- To the extent possible, and incrementally, we want to add value to the mix of metadata we expect to come in.
- We want to make as much of the incoming metadata as possible available to developers.
- The DPLA platform is on a short deadline.
Thus, the platform intends to provide access to a wide and rich range of metadata. To accomplish this in the short time available, we plan to assimilate incoming records to a modestly extended version of standard schemas, especially Dublin Core (DC) and Schema.org. At the same time the platform will preserve the original records for access via the API or as datafile downloads. The platform in Phase 1 will natively support MARC21 records. Other library, collection, and museum standards (for example, METS, MODS, MARCXML, CDWA, etc.) will be folded in throughout later phases. (The priority of the standards to support is to be determined; we look forward to guidance from the community.)
Collections with unique schema will be supported via a schema-free database. The API will allow calls using those unique schemas, but at least initially the platform will make no effort to integrate them with one another. This means that developers will have access to the core attributes of items and collections via a simple, supported set of calls, but will be able to access the more complex and perhaps idiosyncratic attributes of particular collections via those collections' particular schemas.
The types of metadata will include:
- Items: metadata about particular items in collections. This includes books, photos, multi-media, etc. It also includes any Web content cataloged by the platform, such as open courseware, item-related Wikipedia pages, TED talks, curated YouTube channels, etc.
- Collections: metadata about collections, including user-generated ones. (Collections and items are part of the same core schema.)
- Events: Usage data for the contents of DPLA collections and of Web content, plus circulation and other usage information from affiliated libraries
- Creators: Pointers to authority files such as VIAF, to identify the creators of works
- Citations: Pointers to works cited, or cited by. We are likely to use existing citation services.
- Contributors: Contributing libraries, archives, museums, nodes, Web services, etc.
- Users: Possibly retain information about the preferences, activities, and created content (e.g. reviews, ratings) of users of DPLA platform-based services, for users who opt in. Possibly track social graphs as well.
The DPLA prototype platform will provide the following services for contributors to the DPLA's distributed collection, to the community of developers, and to other sites and services.
The API will logically isolate the data store from the ingestion process, and will allow mapping of unsupported formats to the simple core schema. The ingestion tools will provide help in importing standard library and museum formats.
We expect the growth of aggregating nodes to make it easier to harvest and maintain sets of metadata, particularly from smaller and more local institutions.
The DPLA can predict some but not all of the types of metadata it will need to take in. We propose to address this challenge as follows:
- File-based upload of known schemas: provide automated mappings from known formats, such as those used in libraries and museums, into the DPLA core schemas.
- File-based upload of unknown schemas: Allow for the upload and processing of unknown schemas provided that a schema mapping file is also provided. We will over time provide tools to help automate the process, including the tools already contributed by the Metadata Interoperability Services (MINT) team.
- Direct access to the API. Allow direct access to the DPLA API so that data can be uploaded in the DPLA core schema directly.
- Web harvester. Many collections are expressed only as Web sites. A harvester will help automate the creation of ingestible collections from these sites. The details of the functions of the harvester remain to be determined.
The metadata the platform aggregates gains significant value when it is processed so that queries return more signal and less noise. This immediately raises, however, some of the hardest challenges known to libraries: identifying works as the same, identifying their many relationships, and making semantic associations that are implicit in the works and their context. The platform will provide some basic and imperfect services. Eventually (out of the scope of this plan), librarians and users can perhaps be enlisted in the effort by creating their own relationships and "debugging" existing ones.
- Entity rationalization: Best-effort de-duping, edition clustering, and (perhaps out of the scope of this project) some awareness of FRBR relationships
- Best-effort semantic associating of entities
- ID resolution: IDs from multiple standard identifier schemes and services (such as OpenLibrary IDs, OCLC numbers, and ISBNs) will be identified as pointing to the same entities. This will be offered as an API service in Phase 3.
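In spirit, ID resolution might work like the toy resolver below, which groups external identifiers under a single internal entity ID as links between them are asserted. All identifiers shown are invented for illustration, and a production resolver would need merging logic and provenance tracking that this sketch omits.

```python
# Toy sketch of best-effort ID resolution (identifiers invented).

class IdResolver:
    def __init__(self):
        self._entity_of = {}    # external ID -> internal DPLA entity ID
        self._next = 0

    def link(self, *external_ids):
        """Record that these external IDs all point at one entity."""
        # Reuse an existing entity ID if any of the IDs is already known.
        entity = next(
            (self._entity_of[i] for i in external_ids if i in self._entity_of),
            None,
        )
        if entity is None:
            entity = f"dpla:{self._next}"
            self._next += 1
        for i in external_ids:
            self._entity_of[i] = entity
        return entity

    def resolve(self, external_id):
        return self._entity_of.get(external_id)

r = IdResolver()
r.link("isbn:example-1", "oclc:example-1")
r.link("ol:example-1", "isbn:example-1")
# all three IDs now resolve to the same internal entity
```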
Because the aim of the platform is to facilitate the development of applications that take advantage of the wealth of metadata, data, and content contributed by libraries, archives, museums, and other curators and aggregators, the platform will create a union search index. This will ensure performance, provide sophisticated search and ranking capabilities, and enable the platform to perform computationally intense operations on the aggregated metadata. It does, however, require the creation of update and maintenance processes that do not impose too big a burden on contributors. We welcome discussion of techniques and trade-offs.
In addition, there may also be some time-sensitive data that requires federated search. The platform as an Open Source project that can be used by local library systems will need federated search in any case, at least for information about the availability of items.
Results generated by queries to the API will be in standard formats, including but certainly not limited to JSON and Linked Open Data.
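For example, a JSON response might look something like the sketch below; the field names and values are hypothetical, since the actual response format is still to be determined.

```python
import json

# Hypothetical shape of an API result serialized as JSON.

result = {
    "count": 1,
    "items": [
        {
            "id": "dpla:item:0001",
            "title": "Whaling ships at dock",
            "creator": "Unknown",
            "contributor": "San Francisco Public Library",
            "object_url": "http://example.org/images/0001.jpg",
        }
    ],
}

payload = json.dumps(result, indent=2)
# payload is the JSON text a client application would receive
```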
- API Query Builder: interactive tool to show developers how to create API queries
- Program stubs: templates to help developers get started with projects
- Collaborative and communicative tools: wiki, blog, mailing lists, GitHub repositories, more as needed...
- The DPLA platform will provide standard maintenance services, such as keeping itself backed up
- The DPLA platform will manage the content of a special DPLA repository that contains digitizations and other content created directly by the DPLA
In accord with the DPLA's commitment to openness, all software developed by the core dev team will be Open Source. We have already started posting software to GitHub. In addition, we plan on using Open Source software in our infrastructure, unless there is no alternative.
We have begun a discussion of what license to publish the platform software under, and look forward to continued community advice.
The development process will be open in terms of its communication. It will be open source (available at GitHub) so the development community can get maximum value from the code. We intend to have weekly releases, open to all to try. Weekly releases will add new services, but also whenever possible new collections and new apps that run on top of the platform. We intend to simultaneously develop the technical specifications on this wiki, where it is open to community and public participation and collaboration.
By the April 27, 2012 milestone
There are some dependencies that are out of our control - for example, the willingness of organizations to make their collections' metadata public - but here is where we hope to be by the April 27, 2012 Plenary Meeting in San Francisco.
- Metadata about items in public and research library collections. (The number of records is subject to too many variables to be predictable at this time.)
- Metadata about items in online collections of images, scans, audio and video files, etc.
- Metadata about items from the open Web, possibly including Wikipedia pages about books
- Event data for some of the library items
- Access to core schema, and to individual schemas
- Manual ingestion of new content
- Uniform title clustering (partial)
- Functioning API for querying the aggregated metadata
- Basic union search
- Documentation for the existing API
- API Query Tool
- Sample apps
- GitHub repository
- Collaborative and communication tools: wiki, blog, mailing lists
- List of the apps developed
The following list is necessarily provisional:
- Library analytics
- Multi-collection search: Search across collections
- Stackview: interactive book list visualizer
- Data widget: Display DPLA info about an item within a listing on a local library's page (or on any other page)