Supporting curatorial work with machine learning

By Audrey Altman, March 30, 2022.

Audrey Altman is a senior software engineer at DPLA.  In this post, she discusses an experimental machine learning tool, designed for curators of the Black Women’s Suffrage Digital Collection.

Curatorial work with large archival collections is challenging.  In order to find artifacts related to a chosen topic, curators have to sift through thousands – sometimes millions – of materials and determine which are truly relevant.  Machine learning can help streamline curatorial workflows by analyzing complex datasets and making some of their underlying patterns legible.  Yet, computers do not understand the underlying meaning of the patterns they uncover, and the results are imperfect and sometimes nonsensical.  Curators can use their expertise to make sense of machine learning outputs, and utilize them within a larger decision-making process about which materials belong in a curated compilation, and which do not.

I recently co-developed an experimental machine learning tool for curators of large library or archival collections.  The tool uses market basket analysis to identify topically-related artifacts.  Pairing human expertise with machine learning tools allows curators to work more efficiently and make large, complex archival collections legible to the public.  Ultimately, the partnership between human and machine intelligence helped curators at the Digital Public Library of America uncover stories of people and communities that would have otherwise been lost or obscured.

The problem of curatorial overload

The motivation for this project originated in my experience assisting curators of the Black Women’s Suffrage Digital Collection.  Over several months, I worked with curators Shaneé Yvette Murrain and Kat Williams to explore the DPLA’s aggregation of over 40 million cultural heritage materials.  We identified artifacts related to the history of Black women’s participation in voting rights movements and included them in a curated compilation with its own discovery interface.

The curators and I developed an automated query to select materials with relevant keywords and phrases in their metadata records (brief descriptions that archivists give to each artifact).  Since the DPLA aggregation grows and changes on a regular basis, the query is executed every other week to get the latest materials and add them to the Black Women’s Suffrage collection.  

The process of developing the final list of keywords and phrases for the biweekly query of DPLA’s aggregation proved much more difficult than we expected.  The curators started with a list of about one hundred terms related to their topic.  We soon found that there was so much variation in the ways that records were described that we would need to expand that list significantly in order to capture a comprehensive set of relevant materials.  The problem of inconsistent description is especially pronounced for materials about historically under-represented communities, which creates a risk that the curated collection could be biased if all appropriate subject terms are not found.  The process took months, and the current list includes over twelve hundred subject terms.[1]  

The laborious nature of manually building a comprehensive list of subject terms is a barrier for curators who want to build compilations similar to the Black Women’s Suffrage Digital Collection.  It also makes the process of maintaining the collection over time rather daunting, as new materials are constantly being added to DPLA’s aggregation, and these new materials may include new subject terms that curators need a way to discover.

Can machine learning help?

I wanted to try using machine learning to help curators do their work faster and more efficiently.  If an algorithm could identify subject terms that are likely to be topically similar to those already selected by the curators, then these new terms could be presented to curators as suggestions that may help them find more materials related to their area of interest.  For example, if a curator started with a subject term like “Antislavery movements,” the algorithm might suggest “Underground railroad” or “Sojourner Truth.”  If successful, the algorithm could help curators to efficiently and comprehensively identify subject terms related to any given topic.  

I chose an algorithm called market basket analysis or affinity analysis. This algorithm identifies items in a dataset that frequently co-occur.  It then performs a statistical evaluation to determine which frequent itemsets represent an interesting or meaningful correlation.[2]  It is often used by retailers to find items that shoppers commonly purchase together.  For example, the algorithm might discover that shoppers who purchase dog treats are likely to also purchase tennis balls.  An online store might use that insight to automatically display advertisements for tennis balls to any shopper who puts dog treats in their cart.

Shopping cart filled with groceries.
The model of finding retail items frequently bought together can be applied to the problem of finding subject terms that frequently co-occur in archival records. Courtesy of OKHub and Oklahoma Historical Society.

In our model, each metadata record is like a market basket that contains several subject terms.  Just as the algorithm might notice that dog treats and tennis balls often appear in the same shopping carts, it could discover that certain subject terms tend to co-occur in metadata records with statistically significant frequency, such as “Voter registration” and “Segregation,” or “Coretta Scott King” and “Civil Rights Leaders.”  It can then suggest that curators who “like” certain subject terms may also want to look for records with related subject terms.

One benefit of market basket analysis is that it resembles the work curators were already doing.  They would run a query using subject terms they had already chosen, select relevant artifacts from the search results, and see if those artifacts contained any additional subject terms to add to their query.  In other words, they were already relying on the co-occurrences of subject terms within metadata records to aid discovery.

The nature of subject terms also lends itself to market basket analysis.  Subject terms are drawn from controlled vocabularies, which means that artifacts will be described using the exact same words and phrases as other artifacts with the same topics.[3] Seeing the same subject terms in different combinations in metadata records is like seeing the same retail items in different combinations in shoppers’ carts. 

Market basket analysis ultimately helped uncover more stories of Black women who had been absent from the Black Women’s Suffrage Digital Collection, including Maria Weems, who escaped slavery on the Underground Railroad; Amanda Smith, an African American Methodist evangelical preacher; and Ceola Millner, who took her five children to Washington, D.C. in 1968 for the Poor People’s Campaign.

Ceola Millner and her five children of Watts, Calif., waiting to board a bus to Washington, D.C. for the Poor People’s Campaign, 1968. Courtesy of California Digital Library and UCLA, Library Special Collections, Charles E. Young Research Center.

If you’re interested in technical details, read on – otherwise, skip to “Testing and results.”

Pre-processing the dataset

I implemented the market basket analysis in collaboration with three graduate students in the University of Michigan-Dearborn’s Computer and Information Science Department: Tom Settle, Mike Derkowski, and Joe Chaaban.

The graduate student team retrieved all of DPLA’s records with subject terms and normalized them.  Records following the Library of Congress standard often include compound subjects, in which individual subject terms are combined to express more granular concepts, e.g. “African Americans–Civil Rights” or “Boycotts–Alabama–Montgomery.”  Compound subjects were split into their component terms to enable the algorithm to find more similarities between artifacts.  Subject terms were lowercased, and any records with fewer than two subjects were discarded.
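
The normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the separator pattern, the function names, and the choice to count subjects after splitting are all assumptions.

```python
import re

# Assumption: compound subjects are joined with "--" or an en dash, as in the
# examples above; real records may use other separators.
SEPARATOR = re.compile(r"\s*(?:--|–)\s*")


def normalize_subjects(raw_subjects):
    """Split compound subjects, lowercase each term, and drop duplicates."""
    terms = set()
    for subject in raw_subjects:
        for part in SEPARATOR.split(subject):
            part = part.strip().lower()
            if part:
                terms.add(part)
    return terms


def preprocess(records):
    """Return one set of normalized terms per record, keeping only records
    that have at least two subjects (counted after splitting, an assumption)."""
    baskets = []
    for raw_subjects in records:
        terms = normalize_subjects(raw_subjects)
        if len(terms) >= 2:
            baskets.append(terms)
    return baskets


records = [
    ["African Americans--Civil Rights", "Boycotts--Alabama--Montgomery"],
    ["Voting"],  # only one subject, so this record is discarded
]
baskets = preprocess(records)
```

Here the first record becomes a single basket of five lowercase terms, and the second record is dropped.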

Since subject terms can be quite long, we mapped each unique term to a number (a 64-bit integer), which takes up less memory than the average subject word or phrase.  In later processing, using the more efficient numeric representation reduced the amount of overhead associated with reading and writing data to and from disk, minimized the amount of memory needed to process the data, and reduced data-processing costs.

Sample entries from the subject dictionary.
The subject dictionary maps each normalized subject term to a unique number.
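
A subject dictionary of this kind might be built as follows. The function names are hypothetical, and plain Python integers stand in for the 64-bit integers mentioned above.

```python
def build_subject_dictionary(baskets):
    """Assign each distinct term a small integer id in first-seen order."""
    term_to_id = {}
    for basket in baskets:
        for term in sorted(basket):  # sorted only to make ids reproducible
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id)
    return term_to_id


def encode(baskets, term_to_id):
    """Replace every term with its integer id."""
    return [frozenset(term_to_id[t] for t in basket) for basket in baskets]


baskets = [{"civil rights", "voting"}, {"voting", "segregation"}]
term_to_id = build_subject_dictionary(baskets)
encoded = encode(baskets, term_to_id)
```

Keeping the dictionary around makes it possible to translate the algorithm's numeric output back into readable subject terms for curators.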

We also had to consider groups of artifacts with exactly the same subject terms.  Archival artifacts are typically organized into collections, and sometimes an archivist will assign the same generic set of subject labels to every item in a collection, regardless of each item’s individual features.  There are collections in the DPLA consisting of hundreds or even thousands of artifacts with the exact same subject labels, which have the potential to skew the market basket algorithm by making the co-occurrence of certain subject terms seem much more significant than is actually merited.  To limit the influence of collection-level description, we only allowed up to 100 items with identical subject headings from any individual archives.
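
One way to apply such a cap is to count how many times each identical subject set has already been kept for a given collection. The sketch below assumes each record is a `(collection_id, basket)` tuple; that layout and the names are illustrative only.

```python
from collections import Counter

MAX_IDENTICAL = 100  # cap described above


def cap_identical_baskets(records, limit=MAX_IDENTICAL):
    """Keep at most `limit` records per (collection, subject set) pair.

    `records` is an iterable of (collection_id, frozenset_of_term_ids) tuples;
    the tuple layout and the name `collection_id` are assumptions.
    """
    seen = Counter()
    kept = []
    for collection_id, basket in records:
        key = (collection_id, basket)
        if seen[key] < limit:
            seen[key] += 1
            kept.append((collection_id, basket))
    return kept


records = [("coll-a", frozenset({1, 2}))] * 250 + [("coll-b", frozenset({1, 2}))] * 3
kept = cap_identical_baskets(records)
```

In this toy input, 100 of the 250 identical records from the first collection survive, along with all 3 from the second.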

The final preprocessed dataset included about 14 million records and about 3 million normalized subject terms.  Each subject term was represented by a unique number.

The market basket algorithm

The first step of the market basket analysis is to identify subject terms that commonly co-occur in metadata records.  Finding frequent pairs within the dataset requires the most computational resources of any step in the algorithm.  Given that there are about 3 million unique subject terms, that means that there are about 4.5 trillion possible pairs of terms.  

Granted, not all 4.5 trillion possible pairs would actually occur in the dataset, but it is still a large upper bound on the computation, even with big data tools like Apache Spark and AWS cloud clusters at our disposal.  As you iterate through the dataset, you need to hold in memory a list of all the pairs you’ve encountered, and how many times you’ve seen them.  Some back-of-the-napkin calculations show that with our dataset, it would take about 90 terabytes of memory to count 4.5 trillion pairs.
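
The arithmetic behind these figures can be checked quickly. The pair count follows directly from 3 million terms; the memory estimate depends on how each pair is stored, and the 20 bytes per pair used here (two 8-byte ids plus a 4-byte count) is an assumption chosen to show how a figure of that order could arise.

```python
from math import comb

unique_terms = 3_000_000  # approximate figure from the text
possible_pairs = comb(unique_terms, 2)  # n * (n - 1) / 2 unordered pairs

# Assumption: about 20 bytes per pair (two 8-byte ids plus a 4-byte count).
BYTES_PER_PAIR = 20
terabytes = possible_pairs * BYTES_PER_PAIR / 1e12

print(f"{possible_pairs:,} pairs, roughly {terabytes:.0f} TB")
```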

The graduate student team’s approach to finding frequent pairs is based on the Apriori algorithm.  This algorithm starts by finding all items in a dataset that occur frequently as singletons (e.g. subject terms that appear in many records), and then uses this information to streamline the search for frequent pairs.

If a pair of items occurs frequently within a dataset, then it follows that each item in the pair also occurs frequently.  Say the subjects “Civil rights” and “Voting” occur together 100 times.  That means that “Civil rights” appears at least 100 times in the dataset, as does “Voting.”  By the same logic, if an item is infrequent as a singleton, it cannot be part of a frequent pair.  In the search for frequent pairs, you can therefore ignore all pairs that contain an infrequent singleton.  This reduces the memory required for computation – any pair that you ignore doesn’t need to be held in memory.
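
The pruning idea can be shown with a small in-memory sketch of the two passes. It is a simplified illustration of the Apriori approach, not the Spark implementation the team built, and it uses a single `min_support` threshold for both singletons and pairs.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Two-pass Apriori-style search for frequent pairs.

    `baskets` is a list of sets of term ids; `min_support` is the minimum
    number of baskets an item or pair must appear in to count as frequent.
    """
    # Pass 1: count singletons and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose members are both frequent singletons.
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(basket & frequent_items)
        pair_counts.update(combinations(candidates, 2))

    pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
    return item_counts, pairs


baskets = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}]
item_counts, pairs = frequent_pairs(baskets, min_support=3)
```

With a threshold of 3, terms 3 and 4 are dropped after the first pass, so the second pass only ever counts the pair (1, 2).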

How many times must a term or pair of terms occur in order to be considered frequent?  It depends on the dataset, and the use case.  In general, a higher frequency threshold means that the algorithm executes more efficiently, and it will make fewer suggestions.  A lower frequency threshold will result in more suggestions, but the quality of the suggestions may be lower.  During the testing phase, we experimented with many different frequency thresholds to find one that suited our curators’ needs.

In market basket analysis, it is not enough for a pair to be frequent – it must have a meaningful correlation.  For example, the subject terms “African Americans” and “Education” are both very frequent in the aggregation, each appearing in over 70 thousand records.  They also co-occur in about 5 thousand records.  How can you tell if the presence of “African Americans” indicates that “Education” is also likely, or if the two terms are just so prevalent their co-occurrence isn’t actually that significant?  

The interest measurement helps us make this distinction.  It tells us whether “Education” is significantly more likely to appear in a record that contains “African Americans” than in a record chosen at random from the dataset.  The market basket algorithm would only suggest “Education” to a curator who likes the term “African Americans” if the interest level was high.

The formula for interest.  A and B represent two unique subject terms in the dataset.
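
For reference, the textbook cited in note 2 defines the interest of a rule A → B as the rule's confidence minus the overall share of records that contain B. Assuming the implementation follows that definition, it can be written as below, where supp(X) is the number of records containing every term in X and N is the total number of records (both symbols are notation introduced here):

```latex
\[
\operatorname{interest}(A \rightarrow B)
  = \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)}
  - \frac{\operatorname{supp}(B)}{N}
\]
```

The first fraction is the share of records with A that also have B; the second is the share of all records that have B. Since both lie between 0 and 1, their difference lies between -1 and +1.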

As with the frequency threshold, the interest threshold can also be adjusted to optimize the algorithm’s performance to fit your specific use case.  Interest ranges from -1 to +1, and general wisdom suggests that anything greater than +0.5 is considered interesting.  In our case, setting the threshold at +0.5 worked well.
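
As a rough worked example, the sketch below applies the textbook definition of interest (confidence minus the baseline rate of the second term) to rounded figures from the “African Americans” and “Education” example. The function name and the rounding are assumptions, so the result is illustrative only.

```python
def interest(pair_count, count_a, count_b, total_records):
    """Interest of the rule A -> B: confidence minus the baseline rate of B."""
    confidence = pair_count / count_a
    baseline = count_b / total_records
    return confidence - baseline


# Rounded figures from the example above; the total is the approximate size
# of the preprocessed dataset.
score = interest(pair_count=5_000, count_a=70_000, count_b=70_000,
                 total_records=14_000_000)

INTEREST_THRESHOLD = 0.5
is_interesting = score > INTEREST_THRESHOLD
```

With these rounded inputs the score comes out at roughly 0.07, well below the +0.5 threshold, so under these assumptions the pair would not produce a suggestion even though both terms are very common.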

Testing and results

We put the market basket algorithm to work for the Black Women’s Suffrage Digital Collection.  The final result was the addition of about 5 thousand new items to the collection, covering new topics, individuals, and communities.  

We first tested the algorithm’s efficacy by inputting the curators’ initial list of about one hundred subject terms for the Black Women’s Suffrage Digital Collection and counting how many of the computer-suggested terms matched their current, expanded list of about twelve hundred subject terms.

If the algorithm learned nothing, it ought to be about as effective as suggesting a set of random subject terms.  In that case, about 0.04% of the suggestions would match the curators’ current list of terms.  If it learned to perfectly mimic the work of the curators, then theoretically 100% of its suggestions would match the curators’ current list.

We tuned the algorithm by adjusting the thresholds for pair frequency and interest (see “The market basket algorithm” section for details).  When optimized, about 25% of the computer-suggested subject terms matched the curators’ current list.  This is significantly better than random guessing, but not good enough to be trusted to make decisions without human oversight.  It could, however, be helpful to increase curator efficiency.  Instead of paging through artifacts by hand and looking for relevant subject terms, the algorithm could take the list of terms that curators had already chosen and use it to generate recommendations for additional subject terms.  Based on our tests, curators would be expected to accept about 1 in 4 of the recommended terms.
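
The evaluation described above boils down to a simple match rate. The sketch below shows one way to compute it, along with the random baseline; the function name and the toy terms are made up for illustration.

```python
def match_rate(suggested_terms, curated_terms):
    """Fraction of suggested terms that also appear in the curators' list."""
    suggested = set(suggested_terms)
    if not suggested:
        return 0.0
    return len(suggested & set(curated_terms)) / len(suggested)


# Random baseline: chance that a term drawn at random from ~3 million
# is one of the ~1,200 terms on the curators' list.
random_baseline = 1_200 / 3_000_000

# Toy example with made-up terms.
rate = match_rate(
    ["underground railroad", "sojourner truth", "railroads", "quilts"],
    ["underground railroad", "antislavery movements"],
)
```

The baseline works out to 0.0004, or 0.04%, and the toy example yields a match rate of 1 in 4.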

Finally, I took the list of about twelve hundred subject terms that the curators had already chosen, and submitted it to the market basket algorithm.  The algorithm returned about 350 suggestions for new terms.  The curators and I reviewed the suggestions and accepted about 1 in 6.  The review process also sparked new ideas for themes and keyword terms that were added to the query. 
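
The suggestion step itself might look something like the following, assuming the earlier stages have produced an interest score for each frequent pair. The `scored_pairs` layout, the function name, and the scores are all hypothetical.

```python
def suggest_terms(chosen_terms, scored_pairs, min_interest=0.5):
    """Suggest terms that pair with a chosen term above `min_interest`.

    `scored_pairs` maps (term_a, term_b) to the interest of the rule
    term_a -> term_b; this layout is an assumption for the sketch.
    """
    chosen = set(chosen_terms)
    suggestions = {}
    for (term_a, term_b), score in scored_pairs.items():
        if term_a in chosen and term_b not in chosen and score > min_interest:
            # Keep the best score seen for each suggested term.
            suggestions[term_b] = max(score, suggestions.get(term_b, score))
    # Highest-scoring suggestions first; ties broken alphabetically.
    return sorted(suggestions, key=lambda t: (-suggestions[t], t))


# Made-up scores for illustration.
scored_pairs = {
    ("antislavery movements", "underground railroad"): 0.8,
    ("antislavery movements", "sojourner truth"): 0.6,
    ("antislavery movements", "education"): 0.1,
    ("voting", "antislavery movements"): 0.9,
}
suggestions = suggest_terms(["antislavery movements", "voting"], scored_pairs)
```

Terms the curators have already chosen are never suggested back to them, and the output is only a ranked list for human review.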

The algorithm has limitations.  It won’t help curators find items with rare subject terms, or those without any subject terms.  Its effectiveness at identifying novel and highly relevant subject terms wanes as the curator’s existing term list becomes more comprehensive.  But used as a part of a larger workflow, it can help streamline the curatorial process and inspire new ideas for relevant subtopics.

This experiment shows that machine learning can augment the curatorial process by mechanizing tedious and time-consuming parts of a larger workflow.  For tasks that involve finding patterns in datasets, such as identifying subject terms that frequently and meaningfully co-occur in metadata records, computers can save time by doing an initial pass over the data.  Coarse, somewhat messy results are still helpful to an expert, who can interpret and selectively apply them to user-facing compilations, tools, and services.

As DPLA develops new ways to share and access information, we will continue to explore systems that utilize both human and machine intelligence.

If you are interested in doing computational analysis of library and archival metadata, DPLA’s aggregation is freely available and openly licensed for reuse.

  1. The algorithm that selects materials for the Black Women’s Suffrage Digital Collection involves several metadata fields, including subject, title, creator, date, and collection.  For the purposes of this study, we focused only on subject terms.
  2. Our implementation of the market basket algorithm is based on: Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2020. “Frequent Itemsets” in Mining of Massive Datasets, 3rd edition. 206-242. New York, NY: Cambridge University Press.
  3. Artifacts in the DPLA are described using multiple controlled vocabularies, including Library of Congress Subject Headings, Getty Art and Architecture Thesaurus, and Faceted Application of Subject Terminology.  Since there is more than one vocabulary in use, the aggregation’s subjects are only quasi-standardized. Furthermore, even within a controlled vocabulary, there can be variation in how artifacts are described. Despite these inconsistencies, there is enough similarity for algorithms like market basket to work reasonably well.