The content on this wiki is being preserved for historical purposes, but is not being maintained and is probably no longer accurate.
For current information about DPLA Development, see the Development Portal
For latest API documentation, see the API documentation
Technical Overview
DPLA Platform Overview
This overview attempts to lay out provisional directions for the development of the Digital Public Library of America (DPLA) platform. It assumes a milestone of April 27, 2012 (the date of DPLA West, taking place in San Francisco, CA), another major milestone in October 2012, and a release date of a public beta in April 2013. (We will refer to these as Phase 1, Phase 2, and Phase 3.) This document at times looks further than that, while understanding that construction will be incremental. This document is linked to a set of tasks managed in Redmine.
- Technical Overview: frozen version (pdf), April 2, 2012
- Technical Overview for comments and edits: the Google Docs version
What is the DPLA Platform?
The DPLA prototype platform is a technical infrastructure enabling developers to build applications that use the metadata and content aggregated by the DPLA. With this platform, (i) the community of developers will be able to create innovative applications, (ii) the information and services of the DPLA will be integrated into existing sites, and (iii) the services and content the DPLA chooses to provide will be supported. Local libraries will benefit from the tools the community of developers will create by being able to integrate DPLA data and services into their own sites, and by using a downloadable version of the platform software both for its services and as a way to participate in the broader ecosystem.
The DPLA platform will have no direct user interface. Rather, it will be a repository of data, tools, and functions that support the goals of the DPLA. It will aggregate and manage metadata about content available online and in local physical institutions (libraries, archives, museums, etc.). For example, it might know the photographers, dates, and URLs of all of the images in the San Francisco Public Library (SFPL) online image archive, but it will not keep copies of the images themselves (other than thumbnails). It could also know about the SFPL’s catalog of physical items. Additionally, the platform will host some content, including scans created by DPLA digitization projects, and possibly from other organizations as well. It will provide a rich set of functionality for its hosted metadata.
Initially, the DPLA platform will be built to host metadata from existing digitization projects that have already aggregated content, and from nodes that aggregate aggregations, with a view to supporting the development of applications that are useful to the public, to libraries, and to online sites and services.
The platform's success depends to a large degree on how much metadata it can gather and how useful it can make that metadata. It will compile what can be thought of as a meta-catalog, that is, a catalog of catalogs. But it is not enough simply to have separate listings for what might eventually be tens of thousands of catalogs and tens of millions of items: the distributed materials are of little value if they cannot be searched and queried. This requires mapping the catalog structures to one another. For example, a developer might want to write an application that lets a user search for everything about whales in the DPLA-distributed collection. To do this, the platform needs to know that what one collection refers to as "description," another might refer to as "summary." Mapping these different collection structures (schemas) will be a big part of the DPLA platform's value and a very significant, albeit gradual, task, undertaken with the knowledge that such mappings are always imperfect.
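The "description" versus "summary" example can be sketched in a few lines. This is only an illustration: the collection names, the field names other than "description" and "summary", and the helper functions are hypothetical, not part of any DPLA design.

```python
# Hypothetical per-collection mappings from local field names to shared names.
FIELD_MAPPINGS = {
    "collection_a": {"title": "title", "description": "description"},
    "collection_b": {"name": "title", "summary": "description"},
}


def to_common(collection_id, record):
    """Split a source record into mapped (common) fields and unmapped leftovers."""
    mapping = FIELD_MAPPINGS[collection_id]
    common, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            common[mapping[field]] = value
        else:
            # Keep what we cannot map instead of discarding it.
            unmapped[field] = value
    return common, unmapped


def search(records, field, term):
    """Case-insensitive substring search over one common field."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]
```

Once both collections are mapped, a single query such as `search(records, "description", "whale")` reaches items from either source, which is the point of the mapping effort described above.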
Once the metadata has been aggregated and is at least roughly mapped, it will be made available to the clients of the DPLA platform. Typically, this will be done through APIs (application programming interfaces) that allow one program to request information from another. The DPLA platform's API will aim to provide responses to complex queries, including ones that perform commonly requested tasks on the platform's servers.
Who is the DPLA platform for?
The DPLA platform has four key clients:
- Libraries, archives, museums—physical and online. The value of the DPLA platform comes from the organizations and institutions that provide access to our cultural heritage. The platform therefore needs to provide both direct and indirect value to those organizations, including libraries, archives, museums, and curators and aggregators of online collections.
- The community of developers. Developers will want to build applications that take advantage of the metadata made available through the DPLA platform. This means the platform should manage the content's metadata in ways that over time make it more and more useful and provide a set of services that will encourage developers to take up the task by making it easier. While the aim is to enable the development of unanticipated applications, some seem fairly likely, such as end-user tools to create and browse collections, library analytics tools, recommendation engines, and OPAC enhancers.
- Other sites and services. The DPLA platform should make it easy for the DPLA's content, metadata, and services to be integrated into the web ecosystem. For example, it is easy to imagine ways Wikipedia could integrate information from the DPLA platform about cultural heritage and library items.
- The DPLA itself. The platform needs to make it possible for the DPLA to develop a home page or portal for end users. The platform will also support a DPLA repository that contains material digitized under the auspices of the DPLA.
How will it get built?
The scope of this project is, of course, gigantic. Here is how we propose it be built:
- Incrementally and continuously. Many items in this document are long-term projects.
- With multiple teams, with the core dev team as a reliable resource.
- Using existing software and design patterns whenever possible.
As re-deployable open source software
The DPLA platform will be developed as open source software so that anyone can download a copy for her or his own use.
This local version of the DPLA platform will have all the functionality (plus some additional capabilities required by local institutions), but not the metadata or content. The local institution would instead typically use its own metadata and content, as well as accessing whatever the DPLA platform makes available through its APIs. For example, a state library system might use it to pull together metadata from across its multiple libraries, creating local services and performing cross-library analytics. The infrastructure of the DPLA platform will support content nodes that communicate with the DPLA main node via software provided by the DPLA.
Any institution or node using the DPLA platform locally will find it easy to keep its own metadata updated at the central version of the platform and to take advantage of the metadata and services offered by that central platform.
Components of the DPLA Platform
There are three main components of the DPLA Platform: (1) ingestion, (2) an API, and (3) metadata management. At the simplest level, (1) ingestion brings metadata and content into the DPLA data repositories, (2) the API executes functions on and provides access to that metadata and content, and (3) metadata management makes the metadata more usable and useful.
In this section we make proposals about how these components should be thought about, how they interact, and how we think they could be built.
The DPLA API
Two of the guiding principles of the DPLA platform are: (1) It is a platform on which people can build interesting things, and (2) it is an open platform. To support these principles, we propose exposing all DPLA functionality through a set of public APIs that have common design characteristics:
- Intuitive. We want many developers to use it, so the API needs to be understandable.
- Scalable. The platform should enable projects that involve lots of data or serve many people.
- Secure. We hope the platform becomes a highly visible part of the Internet ecosystem, which means it has to be designed with security in mind.
- Versatile. People will want to use the API in many ways.
- Versioned. We will not get it right the first time, and needs will change over time.
Why an API?
The principal reason for using an API, rather than simply exchanging data with developers, is that an API encapsulates both functionality and data.
- Functionality. The initial versions of the DPLA platform will encapsulate simple functionality, but as the platform evolves, and as client developers create new and interesting applications, it is likely to evolve to support much more sophisticated functions. Indeed, some of those functions are likely to be developed initially by external developers working with the API.
- Data. An API can encapsulate access to the data. An API provides a “contract” with developers that their programs’ access to the data will work as stipulated (although it may be extended) even as the data requirements, tools, and technologies change.
Public APIs
Publicly accessible APIs have the same design issues that private APIs have, but they also have to accommodate the fact that people unknown to the creator of the API, and over whom the author has no control, are trying to integrate with it.
The key areas to consider are:
- Versioning. All APIs should be versioned (or otherwise provide backwards compatibility). Public APIs, in particular, bear this burden because there is often only one instance (meaning that people cannot choose not to “upgrade” when a new version is released), and the development timelines of the client applications integrating with the API are not under the control of the API author.
- Security. The hope is that a large number of applications will be built against the API, and there will be little or no control over how these applications will behave. While it is obvious that write access to the API needs to be authorized, it is tempting to say that read-only access to the API does not. But permitting unlimited read access is not a good idea. Unauthorized access could allow unintentionally rogue applications to overload the DPLA servers, could encourage attempts to bring down those servers by malicious third parties, could have an effect on caching and throttling strategies, and might limit the information that the DPLA can maintain about its client applications (which would harm the DPLA’s ability to collaborate with the client application community).
- Ease of use. All APIs should be intuitive. A public API serves a broader community with less access to its authors, and thus needs to meet a higher standard. The design of the API should follow best practices used in other publicly facing APIs, should be consistent across its functions, and should be well encapsulated and documented.
Scalability
If the DPLA platform is to meet its ambition of handling a large number of users, client applications, and downstream nodes, it needs to be scalable. The architecture will support clustering and sharding at the data storage layer, using powerful open source engines (e.g., MongoDB and Elasticsearch). This will allow horizontal scaling and the fulfillment of complicated or large queries via a larger number of computational nodes.
The application server layer will support caching internally for particularly expensive computational tasks, caching "fragments" of data to ensure requests can be fulfilled quickly without hitting the data storage layer. The application will send cache control headers to the delivery layer, allowing responses to be cached depending on the client, the request, or any number of other criteria.
The delivery layer—the front-end HTTP server—will support all content acceleration best practices: expires headers, content compression, front-end proxy caching, and load balancing across multiple application server nodes. The front-end server will be able to fulfill many responses without ever needing to involve the application or data storage layer. It will enforce API limits and ensure a protection against "greedy" clients and malicious third parties.
Security
The platform will use standard authentication methods and techniques. We are proposing using OAuth as our standard security protocol because of its strength, openness, and broad acceptance.
The DPLA Ingestion Process
To offer useful functions to client applications and the public, the DPLA platform needs data. The ingestion process is the means by which data gets into the platform.
Supported Schemas
The DPLA can predict some but not all of the types of metadata it will need to take in. We propose to address this challenge as follows:
- File-based upload of known schemas: provide automated mappings from known formats, such as those used in libraries and museums, into the DPLA core schemas.
- File-based upload of unknown schemas: Allow for the upload and processing of unknown schemas provided that a schema mapping file is also provided. We will over time provide tools to help automate the process, including the tools already contributed by the Metadata Interoperability Services (MINT) team.
- Direct access to the API: Allow direct access to the DPLA API so that data can be uploaded in the DPLA core schema directly.
As part of the import process the entire existing data set will be preserved. This is discussed in more detail in the schema section later in this document.
Supported Data Ingestion Formats
The DPLA platform ingestion tools will support data serialized in standard formats, implemented over time.
File Ingestion Engine
The file ingestion engine will be the de facto ingestion mechanism for collections, at least initially, primarily because it provides the lowest barrier to entry for organizations wishing to contribute their metadata. Also, this engine will provide the long-term validation process for API bulk uploads.
Once a file has been uploaded to the file ingestion engine, it will be validated, and then uploaded to the DPLA data stores via the DPLA API.
Upload to the file ingestion engine will be supported via (S)FTP, and via HTTP/S through the bulk upload function of the API. Support for message queuing may also be included at some point.
Like any other write operation to the DPLA data stores, access to the file ingestion engine will have to be secured and authorized. Ideally a single authentication approach would be sufficient for the DPLA; however, since FTP access will be required, and OAuth (the presumed target authentication system for the API) is not supported by FTP or SFTP, a secondary authentication approach will be required.
Direct API Ingestion
The API will allow contributors to map their formats to the internally supported format and to upload their data in bulk or incrementally. In addition, the API will logically isolate the data stores from the file ingestion process.
Contributors can address the API directly through their own custom development effort, or over time through plugins to existing data systems, such as common ILSs. The creation of plugins will allow for better integration of the publishing workflow of those existing data systems with the publishing of content to the DPLA, and would likely be a better mechanism for incremental and timely updates of data.
Over time, the platform will provide a simple web form by which contributing organizations and aggregation nodes can provide metadata about their collections prior to ingestion. In addition, it will provide web-based tracking of the ingestion process.
DPLA Web Harvester
The need for a web harvester to gather records from partner institutions has been identified, but the harvester's functions have not been worked out in detail in this initial effort to describe the DPLA platform. We expect that such a harvester will:
- Spider partner institution websites and data stores
- Leverage Z39.50 and OAI-PMH
- Integrate directly with the API
Web harvesting is a “pull” ingestion process and, as such, is likely to be a less efficient process than the “push” processes of file upload or API access; this is true in general of pull processes because the harvester cannot determine whether it makes sense to pull content from the contributor, whereas the contributor can determine whether, and what, content has changed and should be pushed. In addition some of the harvesting approaches, e.g., crawling and parsing websites, may not result in high-quality data.
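One way a pull harvester can reduce wasted work is to ask only for records changed since its last visit. The sketch below builds an OAI-PMH ListRecords request using the protocol's standard `verb`, `metadataPrefix`, `from`, and `resumptionToken` arguments. The function name and the example endpoint are hypothetical, and nothing here reflects a decided DPLA harvester design.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for a contributor endpoint."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # alone (with the verb) to fetch the next page of a previous request.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed on or after this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("http://example.org/oai", from_date="2012-04-01")` requests only records changed since April 1, 2012. Even so, the harvester still has to poll on a schedule, whereas a contributor pushing changes knows exactly when something has changed.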
The Ingestion Process
Metadata and content from existing collections will almost certainly be ingested, at least initially, using file-based upload. Over time, as applications are developed that make use of the API and as integration from existing collections becomes deeper, it is likely that in the vast majority of cases the API will replace FTP-based file uploads, and incremental updates will replace bulk updates.
Schemas and Metadata
If the platform is going to provide useful programmatic access to a wide and rich range of metadata, it needs schemas that make that metadata easily findable and that help discover relationships among that information. Our schema strategy is informed by the following considerations:
- We want to lower the hurdle to ingesting metadata so that many organizations will contribute.
- We assume that metadata will cover a wide variety of object types, both as items and as collections that will range from the unstructured to the highly sophisticated.
- We assume that only some of the metadata will conform to well-known standards.
- To the extent possible, and incrementally, we want to add value to the mix of metadata we expect to come in.
- We want to make as much of the incoming metadata as possible available to developers.
- The DPLA platform is on a short deadline.
We are proposing a schema strategy that we hope will keep the platform robust and simple enough to be built by April 2013 while supporting the above considerations:
- Adopt and extend a simple core schema that supports the minimal functional needs of the DPLA API.
- Preserve all original records so that, when there is a need to extend that schema to support new functions, the schema can be extended without needing to go back to the source collection(s).
- Provide direct access to the data elements that do not map to the core schema, as well as enabling access to collections’ unique schemas for developers who want to work with that data.
The DPLA recognizes that there will be needs and uses for the metadata collected that are a superset of the functions that are provided by the API; the creation of a schema to support those needs is not an initial focus of the development efforts. Rather, the DPLA will provide access to the raw original records.
Importing Metadata
The platform will phase in support for standard formats. Initial support will process MARC21 records into the simple core schema (while preserving the fields that that schema doesn’t capture), as well as the manual import of individual collections in a variety of formats dictated by the collections that are contributed. Initial formats may include instances of standards such as EAD and RDF. Other library, collection, and museum standards (for example, METS, MODS, MARCXML, and CDWA) will be folded in throughout later phases. Collections with unique schemas will be supported. Contributors will be expected to upload two additional files: 1) a mapping between their schema and the DPLA core schemas, and 2) documentation of their schema. (If we extend the core schema(s), a new mapping file will need to be provided by the schema contributor.)
The known and unknown schemas will be treated in much the same way. API calls will be possible across the data that have been mapped to the DPLA core schema, and developers who have consulted the documentation of the unique schemas will have full access to all the contributed data.
The quality of the platform’s data will be unevenly distributed. For example, library items in MARC21 typically are fairly rich in human-curated metadata, including subject headings, while collection information in the EAD format tends to be rather sparse, and typically contains no information about individual items. The platform will include it all, in its raw state, on the grounds that some developer might find it useful. But this disparity in richness of metadata creates issues, including skewing query results in unproductive ways, which we discuss in the next section.
Enriching Metadata
The platform will do what it can to enhance the utility of the metadata it has ingested. It will take as a continuing project the use of algorithms to semantically associate the platform’s content so that queries through the API return more and more of that content’s value. This might involve more inclusive normalizing techniques, vocabulary matching, title clustering, and using linked data to build “metadata clouds” around materials that have little native metadata associated with them. These techniques will, we hope, incrementally enrich the metadata so that queries through the API will return more useful results. This is obviously already a rich area of research and development for the entire community. This enrichment of contributed metadata presents its own problems: how do we handle updates of enriched metadata, and how do we push back this enrichment to the contributors?
Record Identification
At a bare minimum all data imported into the DPLA will need to have a unique record identifier as part of the dataset provided by the contributor. Lack of a record identifier would mean that updating metadata and content would be unnecessarily difficult.
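To illustrate why the identifier matters, here is a minimal sketch of an update-or-create ("upsert") operation keyed on the contributor and the contributor's own record identifier. The `RecordStore` class and its in-memory dictionary are hypothetical stand-ins for whatever data store the platform uses.

```python
class RecordStore:
    """Toy in-memory store keyed by (contributor, contributor's record id)."""

    def __init__(self):
        self._records = {}

    def upsert(self, contributor_id, record_id, metadata):
        """Create the record if it is new, otherwise replace it in place."""
        if not record_id:
            # Without an identifier there is no reliable way to tell an
            # update apart from a brand-new record.
            raise ValueError("a record identifier is required")
        key = (contributor_id, record_id)
        is_new = key not in self._records
        self._records[key] = metadata
        return "created" if is_new else "updated"

    def get(self, contributor_id, record_id):
        return self._records.get((contributor_id, record_id))
```

Keying on the pair rather than on the record identifier alone is an assumption made here so that two contributors can reuse the same local identifier without colliding.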
Types of Data
The sources of metadata will include (pending decisions by the DPLA workstreams and Steering Committee):
- Information about collections of online books, music, videos, documents, photos, manuscripts, etc. from libraries, museums, archives, aggregation nodes, and other curators
- Select web content collections (e.g., TED Talks, OpenCourseWare)
- Metadata (including bibliographic) about holdings and usage data from participating libraries, archives, and museums
- Digitized material sponsored by the DPLA
The types of metadata will include:
- Items: Metadata about particular items in collections. This includes books, photos, multimedia, etc. It also includes any web content cataloged by the platform, such as OpenCourseWare, item-related Wikipedia pages, TED talks, curated YouTube channels, etc.
- Collections: Metadata about collections, including user-generated ones.
- Events: Usage of the contents of DPLA collections and of web content, circulation and other usage information from affiliated libraries
- Creators: Pointers to authority files (such as VIAF) to identify the creators of works
- Citations: Which works have been cited, and which are doing the citing. (We are likely to use existing citation services.)
- Contributors: Contributing libraries, archives, museums, cultural societies, persons, web services, etc.
- Users: Information about the preferences, activities, and created content (e.g., reviews, ratings) of users of DPLA platform-based services, for users who opt in. This may also include data about the interactions among users (the “social graph”).
Main Platform
The main platform for the DPLA will likely be a complex installation served from multiple redundant servers, possibly geographically dispersed, and possibly serving different functions. This longer term design for the DPLA main platform is not examined in this document since these features are not yet needed.
Local Platform
The DPLA platform will operate as an open source project. That is, it can be downloaded by anyone, and installed locally. This openness can be used to support many activities including:
- Creation of a local development version of the DPLA against which developers can build applications
- Collaborative development on the source code that can be contributed back to the DPLA
In addition, part of the vision of the DPLA is to provide institutions with the ability to install local nodes of the DPLA platform so that they can take advantage of the functionalities of the platform and, eventually, so that these nodes can participate in the distributed architecture vision of the DPLA.
This local platform will be a version of the software that constitutes the DPLA platform itself, with some additions and deletions of functionality as appropriate for local usage. For example, we do not currently propose that the central DPLA platform have real-time access to the availability of items in holding institutions, but that would likely be very useful in many uses of the local platform; this might require adding federated search capabilities to the local platform.
The local platform software will be architected so that it can adopt a nodal structure, connecting institutions within a geographic or interest-based area.
Tools and Applications
The DPLA platform will only have value insofar as developers make use of its data and services. To encourage and enable this, the platform will provide applications and tools. We currently envisage three types of applications and tools:
- Publicly facing applications that allow any developer to access the functions and data of the DPLA. (Over time, developers will build tools that make it easier for end users to use the platform productively; some of those tools might become officially supported by the platform.)
- Integration tools and plugins that make integrating with existing tools and collections easier. For example, plugins to ILSs would vastly lower the barrier to under-resourced local libraries (which is to say: local libraries) participating in the DPLA.
- Research applications and databases: there will be functions that the DPLA API does not support, such as analysis over data that is not included in the DPLA core schema. Applications, and databases with versions of the data, are likely to be created to support these needs.
Appendix 1: A set of scenarios explores the different ways in which the DPLA platform might be used.
Scope
This scope section provides some details about what will be built, and the expected sequencing of that functionality. An initial scope, phase 1, is envisaged for release in beta format prior to the DPLA West, to be held in San Francisco, CA on April 27, 2012.
All other scope is currently unscheduled pending discussion and approval of the technical approach outlined in this document, prioritization, and the creation of a project and resource plan.
API Scope
The phase 1 scope for the API is a beta scope.
Authorization and Registration
Neither authorization nor registration is part of the phase 1 beta scope, but both will need to be implemented very soon thereafter.
Registration is the process whereby an application developer or content partner (an API consumer) requests access to the API and is granted a token, or tokens, with which to access the API.
Once registered, users will need to generate a variety of types of tokens depending on the application usage.
- For publicly available websites with read-only access to the DPLA it would make sense for the application developer to be able to create a long-lived read-only token.
- For applications or content partners who need write access to the API, a shorter lived read-write token may be more appropriate. The applications and content partners may want to authenticate their users using their own credential store.
The OAuth 2.0 specification provides for precisely this kind of distributed and granular access to applications, and it is our recommendation that it should be used for the DPLA.
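The two token types described above can be sketched as follows. This is not an OAuth 2.0 implementation; it only illustrates the idea of scope plus lifetime. The scope names, the lifetimes, and the `Token` and `issue_token` names are all assumptions made for the example.

```python
import secrets
import time
from dataclasses import dataclass

# Assumed lifetimes; the document does not specify any.
READ_ONLY_TTL_SECONDS = 365 * 24 * 3600   # long-lived read-only token
READ_WRITE_TTL_SECONDS = 3600             # shorter-lived read-write token


@dataclass(frozen=True)
class Token:
    value: str
    scope: str          # "read" or "read-write" (hypothetical scope names)
    expires_at: float   # seconds since the epoch

    def allows(self, method, now=None):
        """Return True if this token may be used for the given HTTP method."""
        now = time.time() if now is None else now
        if now >= self.expires_at:
            return False
        if method == "GET":
            return True
        # PUT/POST/DELETE require write access.
        return self.scope == "read-write"


def issue_token(scope, now=None):
    """Issue a random token whose lifetime depends on its scope."""
    if scope not in ("read", "read-write"):
        raise ValueError(f"unknown scope: {scope}")
    now = time.time() if now is None else now
    ttl = READ_ONLY_TTL_SECONDS if scope == "read" else READ_WRITE_TTL_SECONDS
    return Token(secrets.token_urlsafe(32), scope, now + ttl)
```

A public website would hold a `"read"` token for a long time, while a content partner would request a fresh `"read-write"` token when it needs to publish.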
Throttling
It seems likely that a throttling mechanism will need to be put in place to limit the number of requests a given application or user can make to the API. This is a requirement to handle unintentionally chatty applications, as well as malicious apps.
This is not part of the phase 1 beta scope.
Content Negotiation and Content Types
Content negotiation is a mechanism by which a given API request can specify the content type it wants returned. Related to this is support for sending specific content types to the API for processing. Support for GET will be provided in the phase 1 beta. Support for PUT/POST/DELETE will be provided in a later phase. The API will support, in its phase 1 beta, the following content types:
- JSON
- XML
- RDF XML (linked open data)
- Turtle (linked open data)
For GETs, the Accept header of the request will be populated with the appropriate MIME type. An alternate method for specifying the content type to return would be to specify an extension on the URL. This will be considered for scope in a later phase.
For PUT/POST/DELETE (supported in a later phase) the content type of the request will be specified.
In the event that a content type is not specified on a GET, JSON will be returned. A content type must always be specified on a PUT/POST/DELETE.
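The GET rules above (honor the Accept header, fall back to JSON when nothing is specified) can be sketched as below. The MIME types chosen for the four content types are the commonly registered ones and are an assumption here, as is the decision to return `None` when nothing acceptable is supported. Partial wildcards such as `application/*` are deliberately ignored to keep the sketch short.

```python
# Assumed MIME types for JSON, XML, RDF XML, and Turtle.
SUPPORTED = ["application/json", "application/xml",
             "application/rdf+xml", "text/turtle"]
DEFAULT = "application/json"


def negotiate(accept_header):
    """Pick a response content type from an Accept header value."""
    if not accept_header or not accept_header.strip():
        return DEFAULT  # nothing specified on a GET: return JSON
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        quality = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means the client refuses this type
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return DEFAULT
    return None  # caller could answer with 406 Not Acceptable
```

For instance, `negotiate("application/xml;q=0.5, application/rdf+xml")` prefers RDF XML because it carries the higher quality value.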
Functions and Resources
Resources
The phase 1 beta will include all functions detailed below except as noted, but only for the item/collections and contributors resources. Later phases will include searching and retrieving events, users, and jobs, and will also include PUT/POST/DELETE for both individual and bulk resources.
URL conventions
In general the URLs for the API will follow this convention:
- http://api.dp.la/{version}/{resource}/{id}, for CRUD operations, where {resource} is item, collection, etc. and {id} is the specific item, collection, etc. for single resources; and http://api.dp.la/{version}/{resource}/{action}, where the resource is a list of resources.
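Because `{id}` and `{action}` occupy the same position in the path, a router has to tell them apart somehow. The sketch below assumes a fixed set of reserved action names; only `search` appears in this document, and the function name and returned dictionary shape are hypothetical.

```python
from urllib.parse import urlparse

# Reserved action names (assumption: only "search" is named in the document).
KNOWN_ACTIONS = {"search"}


def parse_api_path(url):
    """Split an API URL into version, resource, and either an id or an action."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3:
        raise ValueError(f"expected /version/resource/id-or-action, got {url!r}")
    version, resource, last = parts
    if last in KNOWN_ACTIONS:
        return {"version": version, "resource": resource, "action": last}
    return {"version": version, "resource": resource, "id": last}
```

One consequence of this design is that no resource identifier may equal a reserved action name, which is worth settling before identifiers are assigned.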
Search
Search will be a union search across all data sets, e.g., all items regardless of whether they come from multiple collections, limited by the attributes of the DPLA common schema. All lists of resources will be searchable by attribute with the convention:
- http://api.dp.la/{version}/{resource}/search?filter={key}:{value}, e.g., http://api.dp.la/dev/item/search?filter=language:English;subject:Canada;subject:Whales, where {key} can be any attribute of the resource being queried
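A minimal sketch of how the `filter` value might be interpreted follows. The document does not say how repeated keys combine, so this example assumes every clause must match (AND semantics); the function names are hypothetical.

```python
from collections import defaultdict


def parse_filter(filter_value):
    """Parse 'key:value;key:value' into a dict mapping each key to its values."""
    criteria = defaultdict(list)
    for clause in filter_value.split(";"):
        if not clause:
            continue  # tolerate a trailing semicolon
        key, separator, value = clause.partition(":")
        if not separator or not key or not value:
            raise ValueError(f"malformed filter clause: {clause!r}")
        criteria[key].append(value)
    return dict(criteria)


def matches(item, criteria):
    """Return True if the item satisfies every clause (assumed AND semantics)."""
    for key, wanted_values in criteria.items():
        actual = item.get(key, [])
        if isinstance(actual, str):
            actual = [actual]  # treat single-valued attributes as one-item lists
        if not all(value in actual for value in wanted_values):
            return False
    return True
```

With the example URL above, `parse_filter("language:English;subject:Canada;subject:Whales")` yields one language value and two subject values, and an item must carry all three to match.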
Faceted Search
Facets will be generated for all attributes of the DPLA common schema. A list of facets for a given result set can be returned with the result set with the convention:
- http://api.dp.la/{version}/{resource}/search?filter={key}:{value}&facet={facet_key}, e.g., http://api.dp.la/dev/item/search?filter=language:English;subject:Canada&facet=subject, where {facet_key} can be any attribute of the resource being queried
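Conceptually, a facet is a count of how many items in the result set carry each value of an attribute. The sketch below computes this in memory for illustration only; a search engine would normally do it as part of the query, and the function name and item shape are assumptions.

```python
from collections import Counter


def facet_counts(items, facet_key):
    """Count how many items in a result set carry each value of one attribute."""
    counts = Counter()
    for item in items:
        values = item.get(facet_key, [])
        if isinstance(values, str):
            values = [values]
        # Use a set so an item listing the same value twice is counted once.
        counts.update(set(values))
    return dict(counts)
```

For the example request with `facet=subject`, the response would pair the matching items with a table like `{"Canada": 2, "Whales": 1}`, which a client can use to offer further narrowing.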
Full text search
Full text search will not be a part of the phase 1 beta.
Federated Search
Federated search will not be a part of the phase 1 beta.
Identity Resolution
Identity resolution and translation provides a service by which unique identifiers within multiple namespaces are returned by querying on an identifier within another name space. For example, querying on an Open Library identifier returns the item’s OCLC number and ISBN.
The DPLA platform will not offer this as part of the phase 1 beta, but in later phases will explore both the need for such a service and whether existing services meet that need.
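If such a service were built, its core could be as simple as an index from any known identifier to the full set of identifiers for the same item. The sketch below is a hypothetical illustration; the class name, namespace labels, and placeholder identifier values are invented for the example.

```python
class IdentityResolver:
    """Toy cross-namespace identifier index."""

    def __init__(self):
        # (namespace, identifier) -> dict of all identifiers for the same item
        self._index = {}

    def register(self, identifiers):
        """Record that all the given namespace/identifier pairs name one item."""
        record = dict(identifiers)
        for namespace, value in record.items():
            self._index[(namespace, value)] = record

    def resolve(self, namespace, identifier):
        """Return the item's identifiers in every other namespace, if known."""
        record = self._index.get((namespace, identifier))
        if record is None:
            return {}
        return {ns: value for ns, value in record.items() if ns != namespace}
```

Usage mirrors the Open Library example above: after registering an item's Open Library, OCLC, and ISBN identifiers together, resolving on any one of them returns the other two.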
Versioning
There will always be a dev version (http://api.dp.la/dev) of the API available. Other versions will be published on the wiki as they are released along with release notes.
Ingestion Scope
File Ingestion Engine
A very basic and highly manual ingestion engine comprising a set of scripts is part of the phase 1 beta. The goal of this version of the engine is to support the manual ingestion of the data sources detailed in the data acquisition section. This version of the ingestion engine does not support either incremental or bulk updates.
File Receipt
The delivery of metadata and content files is not part of the phase 1 beta. When implemented, it will need to support the secure, authenticated delivery of very large files, probably via FTP and HTTP.
File Acquisition
The harvesting of metadata and content files from known sources is not part of the phase 1 beta, at least not in an automated way. In later phases, this acquisition will be done using the DPLA web harvester.
Validation
There is no validation in the phase 1 beta. The validation process for files describing multiple resources, particularly large or very large numbers of resources, will differ from the validation process for individual resources. On the presumption that individual resources are uploaded to the DPLA using the API, it seems likely that validation results will form part of the API error response.
Large files will need an asynchronous validation process, which will require a notification mechanism for the contributor (probably email and/or a web page response) and some form of retry mechanism.
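The difference between the two validation paths can be sketched as follows. This is a minimal sketch, not a description of the planned implementation: the function names are hypothetical, the required fields are taken from the item data model in Appendix II (omitting `dpla_id`, which is generated internally), and the error structure is an assumption.

```python
# Fields marked "required" in Appendix II that a contributor must supply.
REQUIRED_FIELDS = ("contributor_record_id", "contributor", "title", "format")


def validate_record(record):
    """Validate a single resource; the result could feed an API error response."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append({"field": field, "error": "required field is missing or empty"})
    return errors


def validate_batch(records):
    """Validate many resources and collect errors by record position.

    A report like this could be produced asynchronously for a large file
    and sent to the contributor once validation finishes.
    """
    report = {}
    for index, record in enumerate(records):
        errors = validate_record(record)
        if errors:
            report[index] = errors
    return report


single = validate_record({"title": ["Example title"]})
print([e["field"] for e in single])
```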
Error Handling
In phase 1, error handling will be manual. Error messages will consist of personal emails to the contributing institution. In later phases, when errors are thrown as part of the ingestion process, notifications will be automatically generated and sent to the contributing institution, or returned via the API.
Ingestion API
The ingestion API is separately detailed under the API scope, and is not part of the phase 1 beta.
DPLA Web Harvester
The DPLA web harvester is not part of the phase 1 beta.
Schemas and Metadata Scope
Data Models
Item and collection data model
See Appendix II for the detailed item data model. The item and collection data models are combined in the phase 1 beta.
The proposed schema supports basic discovery use cases. It is available on the Dev Core wiki: http://dp.la/dev/wiki.
We propose using a simplified core schema based on existing standards (Dublin Core and Schema.org are under consideration). The core schema will be designed to capture objects from a wide variety of data domains within the cultural heritage space, including datasets typical of library catalogs, archives, museums, and the web. This should keep the metadata mapping task manageable for DPLA data providers.
We will extend the schema as needed when terms emerge as important enablers of API functionality. We will always supplement this core schema with the full descriptive vocabulary (local metadata schema) supplied with the original record. The core schema will also be used for the top-level description of collections, supplying basic discovery functionality and interoperability with item-level description.
Contributor data model
A contributor data model will form part of the phase 1 beta. This will provide full description of any actual or prospective institutional or personal contributor to the DPLA. A basic descriptive vocabulary will be supplemented with the full local metadata structure of the original records.
Event data model
An event data model will form part of the phase 1 beta. This will provide basic description of any event documenting the “use” of an item or collection. “Use” is employed here in a broad sense, to capture any documentable “touch” of an item or collection by a user—from circulation checkouts, recalls, reserve placements, or acquisition decisions in libraries to web page views to reviews. A basic descriptive vocabulary will be supplemented with the full local metadata structure of the original records.
User data model
The user data model is not part of the phase 1 beta.
Job data model
The job data model is not part of the phase 1 beta.
Title Clustering
Title clustering will form part of phase 1, enabling the linkage of all editions or versions of the same work or “Uniform Title,” as defined by current library cataloging practice, following industry-standard AACR2 guidelines. As AACR2 Uniform Title cataloging rules operate on records structured via MARC21 tags, title clustering in this initial phase will be limited to MARC21 datasets.
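A simplified sketch of the idea: records that carry the same uniform title are grouped into one cluster. The sketch assumes each record has already been reduced to a dictionary keyed by MARC21 tag (240 for Uniform Title, 130 for Main Entry-Uniform Title); real MARC parsing, subfield handling, and the normalization rules are outside its scope, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def cluster_by_uniform_title(records):
    """Group record ids by normalized uniform title (MARC21 240, else 130)."""
    clusters = defaultdict(list)
    for record in records:
        uniform = record.get("240") or record.get("130")
        if uniform:
            clusters[normalize(uniform)].append(record["id"])
    return dict(clusters)


# Hypothetical records; the third has no uniform title and is left unclustered.
records = [
    {"id": "r1", "240": "Hamlet."},
    {"id": "r2", "240": "hamlet"},
    {"id": "r3", "245": "Collected plays"},
]
print(cluster_by_uniform_title(records))
```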
Data Acquisition
Although metadata will be ingested manually in phase 1, the platform will ultimately need to automate the processes by which data is acquired.
The bulk of the data is expected to come from organizations that step forward to offer it to the DPLA. Nodes that serve as aggregation points for multiple organizations are especially important to the platform’s data acquisition strategy because of the richness of their data and because they already have services for aggregating, maintaining, and sharing data.
We anticipate that in later phases contributors will be able to share their data by filling in a simple web form and providing a mapping file that maps their data to the platform’s simple schema appropriate to their data type.
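The format of the mapping file is not defined in this document. As a sketch of the idea only, the example below assumes the mapping is a simple table from a contributor's field names to core schema field names, and that the original record is kept alongside the mapped fields. The Dublin Core-style source field names, the `original_record` key, and the function name are all hypothetical.

```python
def apply_mapping(source_record, mapping):
    """Map a contributor's record onto core schema fields.

    Unmapped source fields are not lost: the full original record is
    kept under a separate key.
    """
    core = {}
    for source_field, core_field in mapping.items():
        if source_field in source_record:
            # Core fields are lists so that repeatable fields can accumulate.
            core.setdefault(core_field, []).append(source_record[source_field])
    core["original_record"] = source_record
    return core


# Hypothetical mapping and source record.
mapping = {"dc:title": "title", "dc:creator": "creator", "dc:language": "language"}
source = {"dc:title": "Example title", "dc:language": "English", "local:shelf": "A1"}

mapped = apply_mapping(source, mapping)
print(mapped["title"], mapped["language"])
```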
The DPLA may decide to pull some data in. For example, in phase 1 the platform will have acquired some curated content from the open web, such as TED Talks and NPR archives.
Main Platform
The main platform release for the phase 1 beta will contain the functionality detailed in this scope section.
The installation will be a single instance spread across multiple computing nodes hosted at the Berkman Center for Internet & Society. It will serve as a reference implementation to ensure the delivery, application, and data storage layers work well together and will use common open source software to ensure maximum deployment flexibility. All configuration files and tweaks will be made available.
Local Platform
The local platform is available along with supporting documentation at the development team’s GitHub repository, https://github.com/dpla/platform.
The development branch of the GitHub repository contains the bleeding-edge code that is currently being worked on. The master branch always reflects the latest stable release. Versions are always tagged and can be downloaded in whole at https://github.com/dpla/platform/tags.
Local installation requires some systems administration work:
- Installation of Apache
- Installation of memcached
- Installation of Solr
- Installation of supporting PHP modules
Installation details can be found at https://github.com/dpla/platform/blob/master/INSTALL.md
Tools and Applications
The platform will include tools and applications that anticipate some of the most pressing needs of developers in order to help fulfill the platform’s mission of encouraging and facilitating the creation of useful and innovative applications that use its data. Some of those tools will be created by the platform’s developers, but many will be created by independent developers.
The phase 1 beta will include at least one tool: the API Query Builder. It will also include integration (at prototype level) with the MINT metadata mapping engine. Other tools are being developed by other external groups, some of which may be ready for the phase 1 beta.
A note on physical architecture
This document has not focused on the details of the proposed physical architecture of the DPLA platform. In part this is because there is already a wealth of technical writing about how to build scalable, secure web applications, but mainly it’s because, at this point in time, the detailed physical architecture is not nearly as important as the conceptual architecture and technical strategy. The approach proposed in this document should allow any part of the technical implementation or physical architecture to be switched out without disruption to any clients of the platform.
Appendix I
Scenarios List
The DPLA prototype platform is designed primarily to enable external developers to create applications using the rich metadata it hopes to gather. Because the aim is to maximize opportunities for innovation, scenario planning is necessarily sketchier than when creating a software application. So, these scenarios are aimed at addressing a wide range of possible uses others may make of the platform. These are not applications the DPLA platform team itself plans on developing, but rather, they are likely uses of the metadata the prototype platform is gathering. The platform needs to support those likely uses, as well as the unlikely ones we cannot yet imagine. The “phase” refers to when work on these might plausibly begin: Phase 1: now through April 27, 2012; Phase 2: April-October 2012; Phase 3: October 2012-April 2013; Phase n: Beyond mortal ken.
- This list is available in editable form on the DPLA platform dev wiki Scenarios page.
Appendix II
Item and Collection Data Model
- dpla_id
  - description
    - non-repeatable, required
    - the internally generated DPLA id
  - use cases
    - DPLA needs to have unique identifiers for all its objects
  - data type: GUID
- contributor_record_id
  - description
    - non-repeatable, required
    - unique record id within contributor’s dataset
  - use cases
    - DPLA needs to be able to de-dup records
    - user needs to be able to update own records
  - data type: text, required
- contributor
  - description
    - non-repeatable, required
    - a set of contributor id’s and names. The names are used for human readability and search and the id’s are used to look up contributor via the contributor API. It seems unnecessary to return the full contributor object
  - use cases
    - search, filter or browse by contributor name or id
  - properties
    - id
      - data type: text, required
    - name
      - data type: text, required
- identifier
  - description
    - repeatable, optional
    - a set of id’s and id types for well-known identifiers associated with this item
  - use cases
    - search, filter or browse by identifier source or id
  - properties
    - id
      - data type: text, required
    - source
      - data type: text from enumerated list, required
        - enumeration
          - ISBN
          - LCCN
          - OCLC
          - OpenLibrary
- title
  - description
    - repeatable, required
    - the official title of the object (item or collection), with an optional link to an authority record
  - use cases
    - user searches for an object by title name
  - data type: text
- uniform_title
  - description
    - non-repeatable, optional
    - an inferred or curated parent collection that contains all the variations/editions of this object
  - use cases
    - user wants to fetch all the variations/editions of a given title
  - properties
    - name
      - data type: text, required
        - example: “Hamlet : Shakespeare”
    - id
      - data type: GUID, required
    - authority_record
      - data type: url, optional
- creator
  - description
    - repeatable, optional
  - use cases
    - user wants to search books by author
  - properties
    - name
      - data type: text, required
    - type
      - data type: text from enumerated list, required
        - enumeration
          - author
          - editor
          - translator
          - interviewer
          - director
    - authority_record
      - data type: url, optional
- publisher
  - description
    - repeatable, optional
    - the entity that publishes or distributes the object
  - use cases
    - user wants to search books by publisher
  - data type: text
- location
  - description
    - repeatable, optional
    - location information for places associated in some essential way with the item (e.g., publisher location)
  - use cases
    - user wants to be able to interact with map visualization of geographically themed books
  - properties
    - name
      - description: city, state/province, country
      - data type: text, required
    - geo_coordinates
      - lat
        - data type: text, optional
      - lon
        - data type: text, optional
    - relationship
      - data type: text from enumerated list, required
        - enumeration
          - production location (publisher, etc.)
          - topic (item treats history of Boston)
- date
  - description
    - repeatable, optional
    - a set of events in the life of the object
  - use cases
    - search, filter or browse by published date
  - properties
    - value
      - data type: text, required
    - type
      - data type: text from enumerated list, required
        - enumeration
          - published_date
- format
  - description
    - repeatable, required
    - describes physical format of item
  - use cases
    - developer needs to know what the format type is before offering options to the user
    - user wants to find a specific digital format for an ebook, e.g., they want the kindle version
    - user wants to see if local library has book/dvd in its holdings
  - properties
    - e_ness
      - whether item is primarily represented in digital or analog format
      - data type: text from enumerated list, required
        - enumeration
          - analog
          - digital
    - medium
      - data type: text, optional
      - examples
        - mp3
        - tiff
        - paper
        - dvd
        - blu-ray
        - cd
        - epub
    - conceptual
      - data type: text from enumerated list, required
        - enumeration
          - book
          - periodical
          - video
          - audio
          - Web page
- language
  - description
    - repeatable, optional
    - records language(s) in which object represented
  - use cases
    - user wants to read book in particular language
  - data type: text
- dimension
  - description
    - repeatable, optional
  - use cases
    - user wants only video clips not running longer than 2 minutes
    - length
  - properties
    - type
      - data type: text from enumerated list, required
        - enumeration
          - duration
          - page-count
          - height
          - width
    - quantity
      - data type: text, required
- subject
  - description
    - repeatable, optional
    - controlled terms, such as Library of Congress Subject Headings
  - use cases
    - user wants to search for items relating to a particular well-defined subject
  - data type: text
- description
  - description
    - repeatable, optional
    - free-form descriptive content, such as tags, introductions, tables of contents
  - use cases
    - user wants to search for items relating to crowd-sourced tags
  - data type: text
- call_number
  - description
    - repeatable, optional
    - local text-based code for sequencing physical books on library shelves
  - use cases
    - user wants to find location of a book in a local library
  - properties
    - type
      - data type: text, required
      - examples
        - Library of Congress
        - Dewey Decimal
        - National Library of Medicine
    - value
      - data type: text, required
- content_link
  - description
    - repeatable, optional
    - URL to a digital version of this item
  - use cases
    - user wants to read e-book version of item online
  - data type: url
- related_resource
  - description
    - repeatable, optional
    - URL to digital resource related to this item
  - use cases
    - user wants to check Wikipedia article about this item
  - data type: url
- rights
  - description
    - repeatable, optional
    - description of legal restrictions on redistribution of item
  - use cases
    - dpla wants to know if they can make available a given e-book
  - data type
  - properties
    - rights_holder
      - data type: text, optional
    - license_type
      - data type: text, optional
    - date
      - properties
        - began
          - data type: date, optional
        - expires
          - data type: date, optional
    - description
      - data type: text, required
    - link
      - data type: url, required
- parent
  - description
    - repeatable, optional
    - immediate parent collection of current item or collection, if there is one
  - use cases
    - user wants to see which collections this item is a member of
    - user wants to see all items/collections whose parent this is
  - properties
    - id: text, DPLA internal UUID, required
    - name: text, optional
- resource_type
  - description: distinguishes between item- and collection-level records
  - use cases
    - user only wants to query for collection-level results
    - user only wants to query for item-level results
  - data type: text from enumerated list, required
    - enumeration
      - item
      - collection
- content_rating
  - description
    - repeatable, optional
    - age-related rating for viewing this item’s content
  - use cases
    - user wants to find children’s books
  - data type: text
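To make the data model above more concrete, here is a sketch of what a record might look like as a Python dictionary, together with a small check of the enumerated fields. The serialization is an assumption (this appendix defines fields, not a wire format), all values are placeholders, and the function name `check_enumerations` is hypothetical. The enumerations are copied from the lists above.

```python
# Enumerations copied from the data model above.
IDENTIFIER_SOURCES = {"ISBN", "LCCN", "OCLC", "OpenLibrary"}
CREATOR_TYPES = {"author", "editor", "translator", "interviewer", "director"}
E_NESS = {"analog", "digital"}
RESOURCE_TYPES = {"item", "collection"}

# Hypothetical record; repeatable fields are lists, all values are placeholders.
item = {
    "dpla_id": "00000000-0000-0000-0000-000000000000",
    "contributor_record_id": "example-record-1",
    "contributor": {"id": "example-contributor", "name": "Example Library"},
    "identifier": [{"id": "example-isbn", "source": "ISBN"}],
    "title": ["Example title"],
    "creator": [{"name": "Example Author", "type": "author"}],
    "format": [{"e_ness": "digital", "medium": "epub", "conceptual": "book"}],
    "language": ["English"],
    "resource_type": "item",
}


def check_enumerations(record):
    """Return the names of enumerated fields whose values are not allowed."""
    problems = []
    if record.get("resource_type") not in RESOURCE_TYPES:
        problems.append("resource_type")
    for identifier in record.get("identifier", []):
        if identifier.get("source") not in IDENTIFIER_SOURCES:
            problems.append("identifier.source")
    for creator in record.get("creator", []):
        if creator.get("type") not in CREATOR_TYPES:
            problems.append("creator.type")
    for fmt in record.get("format", []):
        if fmt.get("e_ness") not in E_NESS:
            problems.append("format.e_ness")
    return problems


print(check_enumerations(item))
```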
