Wikimedia Project update

By Dominic Byrd-McDevitt, May 19, 2022.

At the start of 2020, the Digital Public Library of America embarked on an ambitious program to assist our network in providing their digital collections to Wikipedia and in realizing the resulting increase in access to those images. Since that time, DPLA has added over 2.5 million files to Wikimedia Commons—with over 200 million pieces of metadata from about 1 million items—and these have already received over 100 million page views. As we wrap up the second year of this initiative, we’d like to share some of our outcomes so far, and discuss the new phase we will soon enter.

Year One: Building a digital asset pipeline

In our first year, with funding from the Alfred P. Sloan Foundation, the DPLA Tech Team launched our digital asset pipeline to Wikimedia Commons and began onboarding partners to the project. Our initial tranche of first-year participating hubs included the North Carolina Digital Heritage Center, the National Archives and Records Administration, Indiana Memory, the Ohio Digital Network, the Plains to Peaks Collective, Digital Commonwealth, The Portal to Texas History, and the Digital Library of Georgia. Each participating hub received training about editing Wikipedia, with a special focus on the specific task of adding their DPLA-uploaded materials to appropriate articles. We then began to track the page view traffic to Wikipedia articles using the images we had provided.

Challenges encountered

Along the way, we learned a lot about the pain points and the challenges for our partners. For example, only institutions implementing standardized rights statements or Creative Commons licenses can participate, since we need to be able to filter out ineligible items from upload. As a small organization, DPLA committed to providing training and support, along with the service of uploads to Wikimedia Commons, but has counted on the institutions themselves to perform the actual editorial work on Wikipedia for their own images. This led to a lot of variability of outcomes, with some institutions being highly engaged in proactively editing Wikipedia, and others leaving the editing to the Wikipedia community.
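The eligibility filter described above could look something like the following minimal sketch. The allowlist, the `rights` field name, and the helper names are assumptions made for illustration; they are not DPLA's actual rules or schema.

```python
# Hypothetical sketch: keep only items whose rights URI is on an allowlist.
# The allowlist is illustrative only, not DPLA's actual eligibility list.
ELIGIBLE_RIGHTS_PREFIXES = (
    "http://rightsstatements.org/vocab/NoC-US/",
    "http://creativecommons.org/publicdomain/zero/",
    "http://creativecommons.org/publicdomain/mark/",
    "http://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
)


def normalize(uri):
    """Trim whitespace and treat https and http forms of a URI alike."""
    return uri.strip().replace("https://", "http://", 1)


def is_eligible(item):
    """Return True if the item carries a rights URI on the allowlist.

    `item` is assumed to be a dict with an optional "rights" key.
    Items with no machine-readable rights value are never eligible.
    """
    rights = item.get("rights")
    if not rights:
        return False
    return normalize(rights).startswith(ELIGIBLE_RIGHTS_PREFIXES)


def filter_eligible(items):
    """Return only the items that pass the rights check."""
    return [item for item in items if is_eligible(item)]
```

An item with free-text rights, or with no rights value at all, fails the check, which is why standardized statements are a precondition for participating.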

Our pipeline also requires one crucial piece of data that DPLA had never asked its providers for before: a link to download their items’ full images. This was unavoidable, since there is no way we can upload a file without access to the file itself, but it introduces a bottleneck in the process, since hubs must implement this new data element (usually by including a IIIF manifest URL) before we can work with their items. Finally, participants have sometimes encountered resistance from other editors on Wikipedia, when their contributions were questioned, and this social component of the project is one of the most complex, since Wikipedia is a large online community composed of individual actors. We are learning from all of these difficulties, and are working on ways for partners to more easily provide their digital assets, and for our contributors to have better experiences on Wikipedia.

Delivering impact for our partners

Despite the challenges faced, the project has already resulted in major successes, recording huge impact for the DPLA network since its inception. By the second half of 2020, DPLA had already become the single largest contributor in Wikimedia Commons history, and it now has over 2.5 million uploads. These uploaded images come from 10 different DPLA hubs, with over 200 contributing institutions represented, ranging from some of our smallest public library systems all the way up to our very largest institution. Images uploaded by DPLA are now included in nearly 2,500 articles across 78 language versions of Wikipedia. DPLA recorded nearly 8 million page views to all of our media across Wikipedia in our most recent single month of web traffic data (March 2022), up from an average of around 3 million per month at the end of 2020 and 5 million per month by the middle of 2021. With Wikipedia, usage tends to accumulate: the initial investment of time to add an image continues to reap benefits, since those images typically remain in their articles without any additional effort. In total, by the end of this month, the DPLA network as a whole will have received over 100 million page views via Wikipedia since we launched 2 years ago.

Aside from all of these aggregate numbers, we’ve recorded a number of success stories that show the value of the project for individual institutions:

  • As part of an effort to increase women’s biographies on Wikipedia, a new article was written in February 2022 for “Vanessa Bryant” and the Wikipedia editor was able to use an image we uploaded in 2020, before the subject had an article. The article and image from the National Archives were subsequently featured on Wikipedia’s main page, and received 1.4 million page views in March 2022 alone.
  • After former White House Chief of Staff Ken Duberstein passed away on March 2, 2022, a single photograph of him from the National Archives was viewed over half a million times just on the Russian-language Wikipedia, when it was included in the “Deaths in 2022” article linked from the site’s main page.
  • Prior to May 2021, staff from Digital Library of Georgia member institutions placed historical images in the Wikipedia articles on Memorial Day and May Day. These two timely articles were viewed nearly a million times in that month.
  • Not all successes are the result of timely or featured articles on Wikipedia; sometimes DPLA is adding value to foundational articles on Wikipedia with regular viewership. For example, after adding a historical photo of coal transport to the “coal” article in January 2021, the Toledo-Lucas County Public Library’s image has steadily recorded about 50,000 views a month for the last 16 months.

We have also seen other benefits beyond usage, such as inspiring improvements to our data. As mentioned, providing standardized rights statements can be a challenge for some, but, on the other hand, this project is an excellent argument for the value of machine-readable rights. Some providers, who had not implemented rights statements before we launched this practical application, were incentivized to adopt them in order to meet the Wikimedia program’s data requirements. The result was that the number of rights statements has more than quadrupled since mid-2020, from 5 million to over 20 million, led by the National Archives. All these new machine-readable rights statements also allowed us to develop a rights facet in the DPLA search portal for the first time, benefiting all users. (To learn more, view the Rights Statements User Group’s “Rights Statements 101” webinar, which covered the topic.)

Year Two: Data improvements and synchronization

With new funding from the Wikimedia Foundation, DPLA was able to undertake a second year of the project with a new focus on making improvements to the data we provide with our uploads to Wikimedia Commons. DPLA moved to implement Structured Data on Commons, a new way of representing our metadata as linked data using Wikidata-based properties and entities. This approach also let us introduce synchronization to our Wikimedia workflow for the first time, and allowed us to experiment with using that data in new ways.

Synchronization

Starting in the summer of 2021, DPLA began work on a data model in Structured Data on Commons, becoming one of the first cultural institutions to model its item-level descriptive metadata for Wikimedia Commons. Because this had not really been tried before, our proposal was developed in public, on a wiki page, and we invite continual community feedback to help us improve on it.

Initially, DPLA began to use the model to migrate existing text from all of our prior uploads on Wikimedia Commons to structured data. But one of the main values of structured data, and an important goal of our project, was to use this crosswalk to detect changes in the metadata and update the record. And so we have begun to periodically compare the version of each uploaded file’s data in Wikimedia with the item record in DPLA’s API. Since the start of the project, DPLA has added structured data statements to all of its 2.4 million uploads. With data in 14 different properties in DPLA’s data model, as of April 2022, this work represents the addition of 37,986,743 statements. Or, considering that each top-level statement is added with qualifiers and references pointing to DPLA as the source, we have ultimately added 209,374,628 triples. This makes DPLA the largest institutional contributor of structured data to Wikimedia Commons by far.

By continually synchronizing the data, we accomplish two things:

  1. We maintain our data across time on this third-party platform, so they are not just a frozen snapshot of the record at upload time;
  2. We can iterate on the format of our data—e.g. mapping new elements, reconciling new entities, or responding to community feedback on our modeling—and the synchronization process will simply update all the records to the new approach. 
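The comparison at the heart of synchronization can be pictured with a minimal sketch like the one below. It assumes each record has already been reduced to a mapping of property name to a list of values; the function name and the property keys are hypothetical, and the real pipeline works against the DPLA API and Wikimedia Commons rather than in-memory dictionaries.

```python
# Hypothetical sketch: compare the statements derived from the current
# source record with the statements already present on the uploaded file.
def diff_statements(source, on_commons):
    """Return (to_add, to_remove) as sets of (property, value) pairs.

    Both arguments are assumed to map a property name to a list of values.
    """
    desired = {(prop, value) for prop, values in source.items() for value in values}
    current = {(prop, value) for prop, values in on_commons.items() for value in values}
    # Statements in the source but missing from the file must be added;
    # statements on the file that the source no longer supports are stale.
    return desired - current, current - desired


# Example: the description was corrected upstream and a subject was added.
source_record = {
    "title": ["Coal barges on the river"],
    "subject": ["Coal", "Barges"],
}
commons_record = {
    "title": ["Coal barge on the river"],
    "subject": ["Coal"],
}
to_add, to_remove = diff_statements(source_record, commons_record)
```

Because the diff is computed from the full desired state each time, a change to the mapping itself (a new property, say) shows up as ordinary additions on the next run, which is the second benefit listed above.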

Ultimately, all of this complex technical work is for a very simple purpose: ensuring the ongoing integrity of our data. Our institutions expend a lot of intellectual labor describing their items and improving those descriptions. Just as they can expect any changes to be reflected in DPLA once we are provided with the update, so too can they now expect those changes to be made in Wikimedia Commons. This is a novel approach that no other cultural institution with collections in Wikimedia Commons has yet developed.

Experimentation

The final component of our second year on this project was developing new approaches that leverage our work on our files’ structured data. While Structured Data on Commons is still relatively new and not all features are complete, DPLA envisions that this work will have additional benefits beyond the metadata records themselves, by improving discoverability of our items and the quality of contextual information accompanying our images in Wikipedia.

Using the structured data, DPLA developed a prototype for a template on Wikipedia that would automatically populate image captions with citations generated from the metadata. The mockup below shows one example of how image captions could be generated in this way from the data stored in Wikimedia Commons added by DPLA. Another way to express this data would be for the user to be able to use any free-text caption appropriate to the article, while the metadata is used to generate a citation for the image in a footnote. In the future, we would imagine this being used to allow for a one-click function to make it even easier for new and infrequent Wikipedia editors from our institutions to get involved in adding their content to Wikipedia articles. The more images our network can add to Wikipedia articles, the greater access and impact we will see for our work.
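As a rough illustration of the idea, a caption could be assembled from a few metadata fields along these lines. This is a plain Python sketch, not the Wikipedia template itself; the field names and the caption layout are assumptions, and the sample values are invented.

```python
# Hypothetical sketch: build a caption with a source credit from metadata.
def build_caption(metadata):
    """Return a caption string, skipping any fields that are missing.

    `metadata` is assumed to be a dict with optional "title", "date",
    "institution", and "hub" keys.
    """
    title = metadata.get("title", "Untitled")
    date = metadata.get("date")
    # Credit the contributing institution first, then the hub it came through.
    source = [metadata[key] for key in ("institution", "hub") if metadata.get(key)]
    caption = f"{title} ({date})" if date else title
    if source:
        caption += ". " + ", via ".join(source)
    return caption + "."


example = build_caption({
    "title": "Coal barges on the river",
    "date": "1910",
    "institution": "Example Public Library",
    "hub": "Example Hub",
})
# example == "Coal barges on the river (1910). Example Public Library, via Example Hub."
```

The footnote variant described above would use the same fields, but emit them as a citation while leaving the visible caption to the editor.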

DPLA recently hosted an open meeting with about 30 attendees to demonstrate these prototypes and solicit feedback on the design. We have also been working with the Wikimedia Foundation on articulating user stories and priorities for future development of the Structured Data on Commons project to serve these needs.

Coming soon

From the start, DPLA realized that the success of the program hinged on being able to continue offering the upload service to new partners beyond the initial grant term, supporting our participating institutions in their Wikipedia engagement, and maintaining our assets and data on Wikimedia Commons on an ongoing basis. This was why we designed the program, even in our initial pilot year, as a service to the DPLA network, rather than a single batch upload.

Stay tuned for our next blog post, as we discuss where the project is going in our next phase. As always, if you are interested in your hub or institution getting involved in the program, please contact DPLA Data Fellow Dominic Byrd-McDevitt.

Starting on June 1, DPLA will be kicking off regular monthly Wikimedia office hours, from 2–3 pm ET the first Wednesday of every month. Join us for a low-key conversation with the DPLA team, whether you are an existing participant with a question/concern, or just interested in learning more. Come any time in the hour, no need to pre-register. Find out more and get the link to join Office Hours.


On Thursday, June 16, at 1 pm ET, Dominic Byrd-McDevitt, DPLA Director of Community Engagement Shaneé Yvette Murrain, and the DPLA tech team will host a DPLA Member Meet Up to share an update on our Wikimedia Project. Please register here to join us.