Metadata Aggregation Webinar Video and Extended Q&A
Posted by Gretchen Gueguen on January 28, 2015.
Thanks to all of you who attended our webinar. We had a great turnout and hope you found it interesting and informative.
As promised, you can now find the video for our recent Metadata Aggregation webinar below or over at our Vimeo account. Links to download each presenter’s slides are included in this post as well. Unfortunately, we didn’t have enough time to get to all of the questions that came up during the webinar. However, our presenters agreed to answer a few more in writing for our blog. You can find them below in the Extended Q&A section.
Download the presentation slides from:
- Lisa Gregory and Stephanie Williams of the North Carolina Digital Heritage Center
- Heather Gilbert and Tyler Mobley of the South Carolina Digital Library
- Gretchen Gueguen of DPLA
How do you prepare potential data contributors to ensure metadata quality? For example, do you provide any training or work closely with them in the first few months?
NCDHC: We haven’t yet had a request for formal training, but we do work closely with potential contributors, providing feedback on mappings and any quirks we find with the structure of their data. We require so few fields and don’t check metadata quality very far, which may be part of why little training has been needed or sought.
How many programmers does your operation require? FTE and PTE?
NCDHC: Stephanie is our programmer. For starting as a hub, she spent around .25 FTE for two months. Now, for maintenance, it varies, but it’s about 1 hour per new partner, and 4-5 hours if we start taking in a new metadata format (for instance, when we started taking in METS).
DPLA: Our tech team has four full time members. Not all devote their entire day to what might be described as “programming” though. We have a Director of Technology, Mark Matienzo, who still finds some time to develop code in between administrative and executive duties. Our Metadata and Platform Architect, Tom Johnson, also develops code but spends a lot of his time designing the overall system. Our Technology Specialists Audrey Altman and Mark Breedlove work on developing our codebase, but also work on server administration, web development, and support for our API.
SCDL: We currently run SCDL’s aggregation technology with the use of one full time Digital Services Librarian who is an experienced programmer. His time commitment to the project varies as needed but I would estimate that it is less than 25% of his time.
I’d be interested in hearing from all of the presenters about where they think we are on the adoption curve. Are “most,” or “some” or “few” of the potential partners in each state already contributing to DPLA through the hubs?
NCDHC: We currently have 14 data providers, and have made contact with 17 others in the state. Those 17 have responded with varying degrees of interest. Right now, we figure we may add 1-3 additional data providers per year.
SCDL: I think I can safely say that most of the potential partners in the state are either contributing to DPLA through the hubs or are in the process of getting ready to contribute. This “most” figure does not include potential partners who aren’t interested in participating. For the most part, those who want to participate and reach out to SCDL eventually get to participate.
Are contributing partners responsible for fixing the mapping/missing field issues you identify? (rather than you updating the data after aggregating)? How is that process managed with contributors? Is there a timeline/turnaround time they are given to remediate their data?
NCDHC: We generally give our data providers about 10 days notice before a harvest. That leaves us a good bit of time to identify any outstanding issues. Generally, changing mappings is so quick and easy that data providers can fix it within minutes. It’s the other issues (fixing rights statements, for example) that often take longer.
SCDL: If the partner is still under review and their data has not yet been harvested at all, we usually give them a much longer window to correct mismappings and missing fields. However, if they are a regular partner and they’ve added new collections that, post-harvest, are noticed to be incorrectly formatted (mismapping, etc.), then we usually request a quick correction (under a week) and then we reharvest. This quick turnaround time is usually ok because at this point the problem is isolated to one or more new collections. Reharvesting is not a problem because we are dealing with relatively small numbers from our partners. Our smallest partner has just under 1,000 records and our largest has just over 40,000 records.
If you want to wind up with MODS, why not start with MODS/METS? If an institution was just starting their local repository, would DPLA suggest they adopt MODS as their metadata standard, or should they choose Dublin Core?
NCDHC: We take what our institutions provide from their systems, and by a large majority that’s Dublin Core. We don’t ask them to change their metadata schema in order to participate. I’m almost positive that would mean little to no participation! CONTENTdm doesn’t accommodate MODS or METS, and most of the digital collections in content management systems in North Carolina are in CONTENTdm.
DPLA: We try not to recommend any metadata standard. It all really depends on the data you are receiving from your partners and what makes sense. Yes, MODS is very robust and provides a lot of granularity, but its complexity can also be a burden, leading to complicated crosswalks. That said, we have successfully worked with MODS, with both simple and qualified Dublin Core, and with MARC data. In the next month we will be announcing an updated metadata application profile along with updated documentation. We will be providing a generic crosswalk from both MODS and Dublin Core to provide some guidance for institutions starting from scratch who want to ensure their data can be easily harvested by DPLA. Stay tuned to our blog for details.
For SCDL – What are the benefits of mapping Date Digitized to dc:date for DPLA? I have seen items mapped like this showing up on the timeline in ways that may be confusing to users – e.g. a photo of a prehistoric item is showing up under 2015 on the timeline.
SCDL: We actually don’t do that. We map date.digital to None and date.created (which is the date of the analog object) to dc:date. Sorry for any confusion.
DPLA: What you’ve found in the timeline is an instance where inconsistent metadata within a data feed has caused an error. This is why consistency in metadata application is so important. We can write the crosswalk to map any appropriate field to DPLA’s date field. However, if that field is mistakenly used in some records for the digitization date rather than the creation date, there isn’t really anything we can do about that until after the mapping, when the error is noticed. This is an example of the type of quality control we try to do in the final steps of aggregation. The severity of the problem determines how quickly we look for a solution. Some of the misapplied dates, unfortunately, won’t be fixed until the next harvest.
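The mapping SCDL describes above (date.created to dc:date, date.digital to nothing) can be pictured as a small lookup table. The following is only a minimal sketch under stated assumptions: the `CROSSWALK` dictionary, the `apply_crosswalk` function, and the sample record are hypothetical illustrations, not the actual SCDL or DPLA crosswalk code.

```python
# Hypothetical crosswalk: source field name -> target field.
# A target of None means the field is deliberately not mapped.
CROSSWALK = {
    "date.created": "dc:date",  # date of the analog object
    "date.digital": None,       # digitization date is left out of the output
}


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Return a new record holding only the fields that have a target.

    Fields mapped to None, and fields absent from the crosswalk,
    are both dropped.
    """
    mapped = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is not None:
            mapped[target] = value
    return mapped


# Made-up sample record, for illustration only.
record = {"date.created": "1862", "date.digital": "2015-01-12"}
result = apply_crosswalk(record)
```

A table like this only helps when the source fields are used consistently. If some records put the digitization date in date.created, the crosswalk will carry that error through unchanged, which is the situation the DPLA answer describes.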
Are there notable aggregation issues with disparate instances (e.g. CONTENTdm vs. BePress)?
SCDL: In our experience, BePress users had a lot less control over their feed and that created problems when it came time to adjust their feed for aggregation. We are actively working with a BePress feed now and it has been under review longer than any other feed. But, we are making progress.
A question for Heather in SC: could she talk about the process of getting her clean metadata back into CDM after cleaning in Google Refine? We haven’t figured out a process for getting it back in easily.
SCDL: I know I answered this during the webinar, but I wanted to clarify further that we primarily use Open Refine as a step before adding content to a repository: Metadata Creation -> Check MD in Open Refine -> Ingest to Repository (CONTENTdm, DSpace, etc.). When I use it to assist in creating metadata QA/QC sheets, I don’t pass corrected metadata back to the supplying repository. I use Open Refine to quickly identify common errors across collections and then use that data to create the QA/QC spreadsheet. It is not often that I would delete and re-import a collection into CONTENTdm after passing the data through Open Refine. Once the collection is in CONTENTdm, I might use Open Refine to identify problems, but I’d use CONTENTdm’s find and replace to fix the problems.
Why doesn’t DPLA or the aggregators pass the cleaned data back to the source?
SCDL: This would probably be better answered by DPLA but I know that a big issue for us would be getting the clean data back into our individual repositories without having to rebuild all of the collections.
DPLA: There are really two issues. The first for us is the question of how best to provide that data back to Hubs. We now store the data as JSON-LD, and in the near future will begin storing it as RDF triples. These data formats are typically not what is used by Hubs. If we were to provide data back to them, reformatting it would lead to another workflow for mapping and quality assurance. Secondly, in the case of Service Hubs we are dealing with the metadata aggregator in most cases, not the creator. We could provide them with updated records to load into their system, but as soon as they re-harvest from their partners, those would be eliminated, or at the very least could lead to problems with versioning between the Hub and its partners. In short, the problem with providing clean or enriched data back to providers is a logistical one, not a technical one. It is an issue that DPLA is interested in solving for the benefit of all, but how that work can be done is yet to be determined.
Is the Copyright work being coordinated with the parallel work being done by DPLA/Europeana?
SCDL: We’ve been working with the DPLA on our forthcoming copyright materials, yes.
DPLA: For more information on DPLA’s rights work see our blog post on the Knight News Challenge-winning project, Getting it Right on Rights (http://dp.la/info/2014/06/23/dpla-wins-knight-news-challenge-award/). The project whitepaper should be released soon.
How are you preparing DPLA for the Semantic Web?
DPLA: We have been going through the process of updating our application profile since last fall to accommodate more linked data in our records. Version 4.0 of the MAP will have expanded classes for things like agents and concepts in addition to the place and collection classes that already exist. We will be adding properties for URI matches to linked data endpoints and will begin by incorporating these for geographic entities to start. In addition, the new MAP will incorporate several more namespaces to increase its interoperability in the Linked Open Data / Semantic Web world. These changes will open a path for DPLA to think about how LOD can be used by us and our partners in a sensible and beneficial way.
Is DPLA (SCDL, NC DHC) considering the use/implementation of ResourceSync as the NISO standardized successor of OAI-PMH?
DPLA: Yes. We are very interested in ResourceSync. It has a lot of benefits over OAI-PMH, particularly in the area of syncing collections without having to entirely rewrite. However, use of ResourceSync would require that we also have Hubs that use it, and as of yet we do not have a Hub adopting it. This would be a great opportunity for our Hub network to work together, however, and it could easily become a feature integrated with “Aggregation-in-a-box.”
Does DPLA have a preferred format for geographic locations?
DPLA: Our only preference really is for consistent, discretely identifiable geographic locations. That means, first of all, that all geographic properties in the Hub’s feed are expressed in the same way (all look like “Erie (Pa.)” for example, or “Erie; Pennsylvania” and so on). It also means that if one collection uses, say, county names and the others do not, we have a way to detect that. For example, if the place names are just semicolon-separated values and some have “Country; State; County; City” but another has “Region; State; City” that can be very confusing. A better method in that case would be to have the place names in separate elements that indicate what they are (such as MODS <hierarchicalGeographic> elements). This way we only have to have one set of logic for the enrichment module’s parser.
The second requirement is that the parts of the place be discretely identifiable. This means that we can differentiate the city name from the county name or the country name. This is achieved either by separating the place names with punctuation (preferably semicolons, although in the case of LC geographic terms, we can use the parentheses) or by breaking them up into separate elements.
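To make the two requirements concrete, here is a rough sketch of how a parser might split place strings and spot feeds whose values have different shapes. It is an illustration only, not DPLA’s enrichment module: the names `split_place` and `part_counts`, the regular expression, and the sample values are all assumptions made for this example.

```python
import re

# LC-style heading such as "Erie (Pa.)": a name followed by one
# parenthetical qualifier at the end of the string.
LC_PATTERN = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_place(value):
    """Split a place string into its discrete parts.

    Semicolons take priority; otherwise an LC-style parenthetical
    qualifier is split off; otherwise the value is returned whole.
    """
    value = value.strip()
    if ";" in value:
        return [part.strip() for part in value.split(";") if part.strip()]
    match = LC_PATTERN.match(value)
    if match:
        return [match.group("name").strip(), match.group("qualifier").strip()]
    return [value]


def part_counts(values):
    """Return the set of distinct part counts seen across place values.

    More than one count suggests the feed mixes hierarchies, for
    example "Country; State; County; City" next to "Region; State; City".
    """
    return {len(split_place(v)) for v in values}


# Made-up sample values, for illustration only.
feed = [
    "United States; Pennsylvania; Erie County; Erie",
    "Northeast; Pennsylvania; Erie",
]
counts = part_counts(feed)
```

Counting parts only catches some inconsistencies. Two values with the same number of parts can still use different hierarchies, which is why separate, labeled elements are the more reliable option.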