Searching for Black women in the archives: Part 4
This is the fourth and final post in a series from DPLA’s Audrey Altman about the curatorial and technological challenges involved in the development of the Black Women’s Suffrage Digital Collection. As a data engineer, Audrey worked alongside the curators Shaneé Yvette Murrain and Kathleen Williams to address underlying biases in the collection and surface representative stories about Black women’s contributions to voting rights movements.
An intentional algorithm
The Black Women’s Suffrage Digital Collection brings together thousands of artifacts about the often-neglected history of Black women’s contributions to voting rights movements. It includes photographs, newspapers, diaries, artworks, letters, pamphlets, and more. In order to find and collect these materials, curators mined the Digital Public Library of America, an aggregation of over 40 million cultural heritage materials from libraries, archives, and museums across the United States. Since humans cannot examine all of these materials individually, we developed an algorithm to analyze DPLA’s corpus and select materials for inclusion in the Black Women’s Suffrage collection. I worked with DPLA’s Director of Technology, Michael Della Bitta, to build an algorithm that would embody and reflect three of the core values of the Black Women’s Suffrage project: diversity, equity, and inclusion (DEI).
DPLA contains a rich trove of historical artifacts that tell the stories of Black women’s leadership in the Suffrage Movement. Yet, like many libraries and archives in the United States, the majority of materials in DPLA are by and about White people and men. Surfacing materials about Black women is no small challenge. In part 1 of this series, I explored the reasons behind the relative scarcity of materials about Black women and the difficulty it creates for curators. In part 2, I discussed the challenge of finding materials that genuinely represent Black women’s perspectives. Part 3 focused on the many different ways archivists label and organize materials about Black women, and how this can complicate the search process. Understanding all of these nuances helped us build an effective and accountable algorithm.
There was a danger that the selection algorithm would replicate the same patterns of under-representing and misrepresenting Black experiences that exist in DPLA’s aggregation. As information scientists Batya Friedman and Helen Nissenbaum observed, “Preexisting bias has its roots in social institutions, practices, and attitudes. When computer systems embody biases that exist independently, and usually prior to the creation of the system, then the system exemplifies preexisting bias.” We needed to account for all of the variety and skew in the underlying data, and create a selection algorithm that would correct bias rather than replicate it.
When developing technologies for the Black Women’s Suffrage Digital Collection, we considered their social implications and how their treatment of racial representation would impact our users.
The functional goal of the Black Women’s Suffrage selection algorithm is to select materials from DPLA to include in the collection. The act of selecting what will be included in — and excluded from — a library collection is not neutral. It is one of the archival practices that sociologists Richard Harvey Brown and Beth Davis-Brown describe as “shap[ing] the collective national public memory and, with this, people’s shared conception of their origins, nature, and destiny.” Communications scholar Tarleton Gillespie compared the act of selecting information for a public datastore to the types of information gatekeeping performed by newspapers and other commercial media. Both practices “can shape the diversity and character of public discourse.” In the context of the Black Women’s Suffrage project, the selection algorithm is part of a curatorial process that ultimately makes a value judgment about whose stories are important in our public memory of voting rights activism and attempts to correct a social imbalance by uplifting historically marginalized voices. In order to build DEI into our algorithm, we focused on four objectives: transparency, accuracy, responsiveness, and collaboration.
We needed our selection algorithm to be transparent, meaning that all project collaborators could understand why the algorithm selected certain items and how to make changes to the selection process. Rather than using a “black box” algorithm, in which a computer makes decisions that humans cannot readily comprehend or easily alter, we designed a system in which curators and programmers could follow the computational reasoning and exercise fine-grained control over the outcomes. Transparency is especially important in a project centered around DEI because humans are able to understand complex social concepts like racism and representation that computers simply cannot.
The selection algorithm uses keyword-based rules to determine whether each of the 40+ million items in the DPLA aggregation belongs in the Black Women’s Suffrage collection. Each DPLA item is reduced to a set of keyword tokens, or simplified lexical units that are useful in computational processing. For example, the book title Scenes from the Life of Harriet Tubman could be broken down into the tokens scene, life, harriet, and tubman. In this case, upper-case letters are replaced with lower-case letters; the plural “scenes” is changed to its singular form, scene; and the words “from,” “the,” and “of” are ignored because they don’t help us understand what the document is about. Curators and programmers worked together to generate a set of rules that specify which tokens must be present in (or absent from) a document in order for it to be selected for the collection. For example, a rule might state that if an item’s title, subject, or creator includes the tokens harriet and tubman, it should be added to the collection. In all, the selection algorithm contains hundreds of individual rules. These rules are relatively easy to create, understand, and change.
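To make the idea concrete, here is a minimal Python sketch of tokenization and a single keyword rule. It is an illustration only, not DPLA's actual code: the stop-word list, the naive singularization, the `Rule` class, and the choice to pool tokens from the title, subject, and creator fields are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Assumed stop-word list; a real system would use a much longer one.
STOP_WORDS = {"from", "the", "of", "a", "an", "and", "in", "to"}


def singularize(word):
    """Naive singularization (assumption): drop one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Lower-case the text, drop stop words, and singularize what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {singularize(w) for w in words if w not in STOP_WORDS}


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: tokens that must be present and tokens that must be absent."""
    required: frozenset
    excluded: frozenset = frozenset()
    fields: tuple = ("title", "subject", "creator")

    def matches(self, item):
        # Pool the tokens from every field the rule looks at (an assumption).
        tokens = set()
        for name in self.fields:
            tokens |= tokenize(item.get(name, ""))
        return self.required <= tokens and not (self.excluded & tokens)


item = {"title": "Scenes from the Life of Harriet Tubman"}
rule = Rule(required=frozenset({"harriet", "tubman"}))
print(sorted(tokenize(item["title"])))
print(rule.matches(item))
```

Because each rule is just two small sets of tokens, a curator can read it, and a change amounts to adding or removing a word.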
The selection algorithm must faithfully reflect curators’ intentions by including and excluding appropriate artifacts. This means that keyword-based rules need to identify as much relevant material as possible while effectively filtering out irrelevant material. This was a difficult task due to the underlying biases and nuances in DPLA’s dataset — under-representation of materials about Black women, over-representation of materials that reflect a White worldview, and significant variety in the ways in which relevant materials are labeled and described. Over several months of evaluating and refining the algorithm, the team learned about the particular characteristics of the DPLA dataset, and how to account for different types of bias and skew.
The DEI focus of the project changed the way we thought about algorithmic accuracy. Selection algorithms are often judged by their ability to optimize precision (retrieving only relevant documents) and recall (retrieving all relevant documents). But these two measures alone do not capture the DEI goals for the algorithm. For example, the curators determined that in order to provide important historical context, the collection should include materials about White suffragists as well as Black suffragists. The DPLA aggregation has an over-abundance of materials about the former, and relatively few about the latter. If we built an algorithm with perfect precision and recall, the result would be a collection in which Black voices were lost among the sheer volume of technically relevant materials about White people. Therefore, we needed our algorithm to select a limited number of artifacts about White subjects from the pool of all relevant materials, and to perform exceptionally well at finding all relevant artifacts about Black subjects. We achieved this through an iterative process of fine-tuning and revision.
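The gap between conventional accuracy and the project's goals can be shown with a small Python sketch. The numbers are invented for illustration (90 relevant items about White suffragists, 10 about Black suffragists), and the `share` helper is a hypothetical measure introduced here, not one the team reports using.

```python
def precision(selected, relevant):
    """Fraction of selected items that are relevant."""
    return len(selected & relevant) / len(selected) if selected else 0.0


def recall(selected, relevant):
    """Fraction of relevant items that were selected."""
    return len(selected & relevant) / len(relevant) if relevant else 0.0


def share(selected, group):
    """Fraction of the selection that belongs to a given group (hypothetical measure)."""
    return len(selected & group) / len(selected) if selected else 0.0


# Invented toy corpus: item ids 0-89 are about White suffragists, 90-99 about Black suffragists.
white = set(range(0, 90))
black = set(range(90, 100))
relevant = white | black

# Selection A: everything relevant. Precision and recall are perfect,
# yet only a tenth of the collection is about Black suffragists.
everything = set(relevant)

# Selection B: every item about Black suffragists plus a limited set of context items.
curated = black | set(range(0, 20))

for name, selection in (("everything", everything), ("curated", curated)):
    print(
        name,
        round(precision(selection, relevant), 2),
        round(recall(selection, relevant), 2),
        round(recall(selection, black), 2),
        round(share(selection, black), 2),
    )
```

In this toy case the curated selection scores worse on overall recall but keeps full recall for materials about Black subjects and triples their share of the collection.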
At the time of this writing, the Black Women’s Suffrage Digital Collection contains about 200,000 artifacts, and is continually growing and changing as new materials are added and the selection algorithm is refined. With such a large and malleable collection, it is possible that our users will find inappropriate materials that need to be filtered out, or that we will discover new ways to capture relevant materials. Our technical infrastructure allows us to be responsive to these eventualities. Each week, the selection algorithm is run against the entire DPLA corpus, re-constructing the Black Women’s Suffrage collection, adding new materials, and applying any new keyword-based rules the curators have submitted. With this system in place, we can continually improve the selection algorithm, and can make changes to the collection quickly and effectively.
The success of the Black Women’s Suffrage project depends on people with specialities in technology, metadata, and history sharing information and working together to solve problems. The transparency of the algorithm ensures that all members of the team can make meaningful contributions to the selection process, and the responsiveness of the system allows us to iterate through revisions at an effective pace. To further facilitate collaboration, after each weekly run of the selection algorithm, we generate a set of automated reports to give insight into the performance of the collection and the algorithm. These reports include information about any new materials that have been identified for inclusion, which subject terms are appearing most frequently in the collection, and the number of documents retrieved by each keyword-based rule. Through several months of development, I met regularly with curators Shaneé Yvette Murrain and Kathleen Williams to review these reports and discuss ways to improve the selection algorithm. Now that the project has launched, we continue to generate and review reports. We also provide users of the Black Women’s Suffrage website with a means of contacting us if they find inappropriate or irrelevant content in the collection.
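A report of this kind could be assembled along the following lines. This Python sketch is an assumption about the general shape, not the project's real reporting code: the `weekly_report` function, the item layout (`id`, `tokens`, `subjects`), and the rule names are all hypothetical.

```python
from collections import Counter


def weekly_report(items, rules, previous_ids):
    """Summarize one run: new items, frequent subject terms, and hits per rule."""
    per_rule = {name: 0 for name in rules}  # rules with no hits still show up as 0
    selected = {}
    for item in items:
        hits = [name for name, required in rules.items() if required <= item["tokens"]]
        for name in hits:
            per_rule[name] += 1
        if hits:
            selected[item["id"]] = item
    subjects = Counter(s for item in selected.values() for s in item["subjects"])
    return {
        "new_ids": sorted(set(selected) - previous_ids),
        "top_subjects": subjects.most_common(10),
        "per_rule": per_rule,
    }


# Invented sample data for illustration.
items = [
    {"id": "a", "tokens": {"harriet", "tubman", "life"}, "subjects": ["Tubman, Harriet"]},
    {"id": "b", "tokens": {"ida", "well", "suffrage"}, "subjects": ["Suffrage"]},
    {"id": "c", "tokens": {"railroad"}, "subjects": ["Railroads"]},
]
rules = {
    "tubman": {"harriet", "tubman"},
    "wells": {"ida", "well"},
    "terrell": {"mary", "terrell"},
}
report = weekly_report(items, rules, previous_ids={"a"})
print(report)
```

A rule that retrieves nothing, or one that retrieves far more than expected, is an obvious candidate for discussion at the next review.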
Throughout the Black Women’s Suffrage project, our team learned valuable lessons about how racial inequity is reflected in the DPLA’s aggregation, and in libraries and archives across the United States. We also identified some strategies to counter these inequities through intentional curatorial and technological practices. Our work is not done, but we are excited to apply these new insights toward continued innovation around DEI in digital libraries.
- Batya Friedman and Helen Nissenbaum, “Bias in Computer Systems,” ACM Transactions on Information Systems 14, no. 3 (July 1996): 330-347.
- Richard Harvey Brown and Beth Davis-Brown, “The Making of Memory: The Politics of Archives, Libraries, and Museums in the Construction of National Consciousness,” History of the Human Sciences 11, no. 4 (1998): 17-32.
- Tarleton Gillespie, “The Relevance of Algorithms,” in Media Technologies: Essays on Communication, Materiality, and Society, edited by Tarleton Gillespie, Pablo J. Boczkowski, and Kirsten A. Foot (Cambridge: MIT Press, 2014), 167-193.