Technology Spotlight: Member Analytics Dashboards

By Audrey Altman, August 28, 2018.
This post is the first in our fall technology spotlight series. In this and each of the following posts in this series, we’ll offer an inside look at the development of our recent technology initiatives, including the tools we used, obstacles we encountered, and our thinking behind how and why we have built each new product.

The Digital Public Library of America (DPLA) has just released a new Analytics Dashboard, a web application that provides our Member Hubs with detailed data about how their materials are being used and about the quality of their metadata records. Our launch announcement earlier this month outlined the new capabilities and data made available to our Member Hubs through the dashboards. In this post, we want to share the tools and services powering this new product behind the scenes.

To support the collection, analysis, and reporting of metadata quality and usage data, DPLA applications interface with two key external services, Google Analytics and Amazon S3 cloud storage. Thanks to readily available libraries and packages in a variety of coding languages, we can manage requests to and responses from Google Analytics and Amazon S3 APIs across DPLA’s diverse technology stack. We use Apache Spark to analyze large quantities of metadata records and produce insights into their quality.

This diagram shows interactions between several DPLA applications, Google Analytics, and Amazon S3.

Google Analytics

DPLA has been using Google Analytics for some time to track traffic to its website. For this project, we enhanced the data we were already collecting about website usage, and added data about use of the public DPLA API. While Google Analytics provides a web interface to explore usage data, we created a custom dashboard to better serve the needs of our Member Hubs.

The DPLA website tracks traffic with the Google Analytics global site tag (gtag.js), a JavaScript library. We supplement the standard, out-of-the-box metrics with custom events. These events collect data about use of our partners’ content in different contexts, such as our digital library catalog and curated resources like Exhibitions and Primary Source Sets.

The DPLA API also tracks usage with Google Analytics. A client-side solution such as gtag.js is impractical for the DPLA API, since most of our users do not engage with it through a web browser. Instead, usage tracking is handled server-side. The DPLA API application uses Google’s Measurement Protocol to POST hit and event data to Google Analytics over HTTP.
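As a rough illustration of server-side tracking, the sketch below builds and sends a single event hit using the Universal Analytics version of the Measurement Protocol, which was current when this post was written. The function names, the tracking ID, and the category, action, and label values are placeholders chosen for this example, not the ones the DPLA API actually uses.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Measurement Protocol (Universal Analytics) collection endpoint.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_event_payload(tracking_id, client_id, category, action, label=None):
    """Encode one event hit as a form-encoded Measurement Protocol body."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # Google Analytics property ID (placeholder here)
        "cid": client_id,    # anonymous client identifier
        "t": "event",        # hit type
        "ec": category,      # event category (hypothetical value in the usage note)
        "ea": action,        # event action (hypothetical value in the usage note)
    }
    if label is not None:
        params["el"] = label  # optional event label
    return urlencode(params).encode("utf-8")


def send_event(payload, timeout=5):
    """POST the hit to Google Analytics and return the HTTP status code."""
    request = Request(COLLECT_URL, data=payload, method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A call such as `send_event(build_event_payload("UA-XXXXX-Y", "555", "API", "View Item", "Hub A"))` would record one event. Keeping the payload builder separate from the network call makes the encoding easy to check without contacting Google Analytics.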

The Member Analytics Dashboard interfaces with two of Google’s read APIs. On the server side, Google’s Reporting API facilitates queries over DPLA’s usage data. This API can process a variety of complex queries, allowing us to tailor the dashboard to the specific needs of our members. On the client side, the Google Analytics Embed API transforms usage data into visualizations, such as maps and timelines. Authentication through OAuth2 produces a key that can be shared between server-side and client-side requests.
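To give a sense of what a server-side query might look like, here is a minimal sketch of a request body for the Reporting API v4 `reports:batchGet` method. It assumes, purely for illustration, that a hub's name is stored in the event label; the helper name, the view ID, and the choice of metric and dimensions are assumptions, not DPLA's actual queries.

```python
import json


def build_usage_report_request(view_id, hub_name, start_date, end_date):
    """Build a Reporting API v4 batchGet body counting events per category
    for a single hub (assumed here to be recorded as the event label)."""
    return {
        "reportRequests": [
            {
                "viewId": view_id,
                "dateRanges": [{"startDate": start_date, "endDate": end_date}],
                "metrics": [{"expression": "ga:totalEvents"}],
                "dimensions": [{"name": "ga:eventCategory"}],
                "dimensionFilterClauses": [
                    {
                        "filters": [
                            {
                                "dimensionName": "ga:eventLabel",
                                "operator": "EXACT",
                                "expressions": [hub_name],
                            }
                        ]
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder view ID and hub name, shown only to print the JSON shape.
    body = build_usage_report_request("12345678", "Hub A", "2018-07-01", "2018-07-31")
    print(json.dumps(body, indent=2))
```

The resulting JSON would be sent with an OAuth2-authorized POST request. Adding more dimensions or filter clauses to the same structure is one way to express the more complex queries mentioned above.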

This screenshot from the Member Analytics Dashboard shows a visualization rendered using the Google Embed API.

Amazon S3 and Apache Spark

Over the past two years, the DPLA tech team has been building out our capacity to work with our data using Amazon S3 and Apache Spark, in support of both ETL (extract, transform, load) and data analysis. The Member Analytics Dashboard is one example of how that investment is paying off.

Pre-processed metadata quality reports are derived from analysis of all the records in the DPLA’s index. This is a relatively large dataset (roughly 30 million records at the time of this writing), which is available for bulk download on Amazon S3. The metadata QA application uses Apache Spark to load the dataset into a distributed processing engine, run analytical queries, and produce lightweight CSV summary reports. These reports are then written back to S3, and accessed by the Member Analytics Dashboard.
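The kind of summary such a report might contain can be sketched on a small scale. The example below is plain Python rather than Spark, so it only illustrates the shape of the aggregation, not the distributed implementation. The record layout, the field names, and the choice of "share of records with a non-empty field" as the quality measure are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata fields to check for completeness.
FIELDS = ["title", "description", "rights"]


def completeness_report(records, fields=FIELDS):
    """Return one row per provider with the share of records that have
    a non-empty value for each field."""
    totals = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for record in records:
        provider = record.get("provider", "unknown")
        totals[provider] += 1
        for field in fields:
            if record.get(field):
                present[provider][field] += 1
    rows = []
    for provider in sorted(totals):
        row = {"provider": provider, "records": totals[provider]}
        for field in fields:
            row[field] = round(present[provider][field] / totals[provider], 2)
        rows.append(row)
    return rows


def to_csv(rows, fields=FIELDS):
    """Serialize the report rows as a lightweight CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["provider", "records", *fields], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a Spark job the same grouping and counting would be expressed as a distributed aggregation over the full dataset, with the resulting CSV written back to S3 instead of returned as a string.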

Handling the challenges of third-party APIs

Relying on third-party services does create some challenges.  One downside is that we cannot control their performance. In the Member Analytics Dashboard, this is especially noticeable in the time it takes Google Analytics to process some of our more complex queries. Asynchronous page rendering helps improve user experience in this context by allowing the rest of the page to render while waiting on the slower requests to complete.
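The general pattern can be sketched with Python's `asyncio`, independent of how the dashboard itself implements it. Both fetch functions below are hypothetical stand-ins that simulate a fast local lookup and a slow third-party query with sleeps; the returned values are placeholders, not real data.

```python
import asyncio


async def fetch_item_count():
    """Stand-in for a fast lookup (simulated with a short sleep)."""
    await asyncio.sleep(0.01)
    return ("item_count", 0)  # placeholder value


async def fetch_api_usage():
    """Stand-in for a slow third-party query (simulated with a longer sleep)."""
    await asyncio.sleep(0.2)
    return ("api_usage", 0)  # placeholder value


async def render_dashboard(render):
    """Start both requests at once and render each section as soon as
    its data arrives, rather than waiting for the slowest one."""
    tasks = [
        asyncio.create_task(fetch_api_usage()),
        asyncio.create_task(fetch_item_count()),
    ]
    for finished in asyncio.as_completed(tasks):
        name, value = await finished
        render(name, value)


def main():
    rendered = []
    asyncio.run(render_dashboard(lambda name, value: rendered.append(name)))
    return rendered
```

Even though the slow request is started first, the fast section is rendered first. The total wait is bounded by the slowest request rather than the sum of all of them.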

Despite the challenges, incorporating Google Analytics, Amazon S3, and Apache Spark has allowed us to improve our data collection and reporting for Member Hubs.

These before and after screenshots of the Member Analytics Dashboard show the benefit of asynchronous loading when relying on external API requests. The item count appears to the user almost instantaneously, while the more costly request for API use data is still completing.

Next steps

During the next phase of the project, we will be collecting feedback from Member Hubs with the help of the Member Network Assessment Working Group. We are especially interested to hear how the dashboard fits into our Members’ workflows, impacts strategic decision-making, and helps them communicate their value to stakeholders. We will scope future development cycles based on our findings.  With the infrastructure now in place to support more sophisticated data collection and reporting, DPLA can be more responsive to the needs of our Member Hubs going forward.

If you’d like to get in touch with us concerning this or other tech projects, email tech@dp.la.

This work is funded by the Andrew W. Mellon Foundation.