Challenges Tackled and Opportunities Ahead: An Update from the DPLA Tech Team
Posted by Michael Della Bitta on February 13, 2018.
I arrived at DPLA in September of 2016 in the midst of multiple changes for the organization. Several long-time staff members with pivotal roles were moving on to other opportunities, and remaining staff faced the challenge of adapting to these departures, both in our everyday work and in terms of how these changes would impact the future of DPLA.
In spite of the changes underway and challenges at hand at DPLA, I came with a fairly different mindset. I had been working at my previous company for six years and was grateful to have the opportunity to leave some of the difficulties at that workplace behind me. Having a chance to make an impact by working on some new problems felt invigorating.
The staff on the Technology Team at DPLA was mired in more than one technological purgatory. We had an ingestion system that even after years of investment wasn’t keeping up with demand. Our frontend setup required doing four deployments of entirely different pieces of software just to add a link to our menu navigation. Making small changes or upgrades to the common open source software we ran like WordPress or Omeka could easily blow up in your face because they were so heavily customized. It was hard to make progress.
In spite of these woes, I felt lucky. I had joined a team that was eager to take control of the situation and work through these challenges. It is perhaps a more exciting time to join an organization when things are in flux than when things are moving perfectly smoothly. Lessons learned in previous positions that were taken for granted in those settings might have huge impacts in a new one. Also, I would much rather work with a team with ambitions for bettering things even when the chips are down, because that drive for improvement will continue to propel them to new heights when things get better.
Sometime in December of 2016, the team started to discuss the topic of what we might talk about at Code4Lib 2017. At that time, all of us dreaded the prospect because we didn’t feel like we had anything to teach our community. We knew we’d turn things around, but we were leery of making a big presentation about what we thought we were going to do only to have to correct the record later when some of the ideas didn’t pan out. So DPLA was mostly in a read-only mode for Code4Lib 2017.
Now that it’s time for Code4Lib 2018, it’s a good time to look back over the previous year. I can say that our team did a great job turning things around, and I expect things to progress smoothly throughout this year.
DPLA’s technology is continuing to grow and improve. We have a new frontend currently in beta that we’ll roll out sometime over the next month, and we expect to ship ongoing improvements to our site regularly now. We’re updating our search and our API for better performance and more capabilities. Our ingestion system is keeping pace with our providers and performing better than ever. For more technical information about what we have been up to this year and our plans for the coming months, keep reading!
Previously, our frontend consisted of four web applications stuck together with NGINX as a proxy: we ran WordPress for our blog and information pages; Omeka ran our exhibitions section with customized code for the theme and to interface with our public API; we built a customized CMS in Rails to power Primary Source Sets; and finally, the search and discovery layer for DPLA was another custom in-house Rails app.
There were some big drawbacks to this approach. On the security side, our attack surface was essentially as big as all four of those applications strung together. But possibly even worse was that any change to the navigation or the global template for DPLA had to be implemented four times, and then four deployments needed to happen to get it out. Each of these triggered a code review and version control release process, and each could go wrong as dependencies went out of date. This was a significant drain on resources, and likewise a significant source of friction keeping us from contemplating and executing on changes. Not what you want to have to deal with as a public-facing Internet site!
Our new frontend is based on a React/Node.js/Next.js stack. We’ve taken Omeka, WordPress, and Primary Source Sets and redeployed them as REST API endpoints on our internal network so they’re not exposed to the open Internet. Much of the navigation and content on the information pages of the site is now editable in the backend WordPress instance, so now many content updates don’t require us to redeploy one Web application, let alone four.
Since the services hosting the remaining content are hidden, we’re able to transparently migrate those over to WordPress if we wish, or we can migrate to another content management system if that works for us better. Using these CMSes as merely REST-enabled content repositories rather than web applications means we don’t need to reimplement the frontend in whatever language the CMS uses for templating, and we can mix and match them.
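To make the "CMS as content repository" idea concrete, here is a minimal sketch of how a frontend might pull a page from a WordPress instance over its REST API and keep only the rendered title and body. It is written in Python purely for illustration (the real frontend is React/Node.js/Next.js), the host name is a made-up placeholder, and the helper names are hypothetical. The `/wp-json/wp/v2/pages` route and the `title.rendered` / `content.rendered` fields are part of the standard WordPress REST API.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical internal host name; the post does not give the real one.
CMS_BASE = "http://wordpress.internal.example"


def page_url(slug: str, base: str = CMS_BASE) -> str:
    """Build the WordPress REST API URL for the page with the given slug."""
    return f"{base}/wp-json/wp/v2/pages?{urlencode({'slug': slug})}"


def extract_page(payload: list) -> Optional[dict]:
    """Keep only the rendered title and body from a pages response.

    WordPress returns a JSON list; an empty list means no page matched.
    """
    if not payload:
        return None
    page = payload[0]
    return {
        "title": page["title"]["rendered"],
        "html": page["content"]["rendered"],
    }


def fetch_page(slug: str) -> Optional[dict]:
    """Fetch a page from the CMS. Needs network access to the internal host."""
    with urlopen(page_url(slug)) as response:
        return extract_page(json.load(response))
```

Because the frontend only depends on the small dictionary returned by `extract_page`, swapping in a different CMS would mean rewriting this one adapter rather than the templates that consume it.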
We worked with an outside vendor over a good portion of last year to develop a new design and get started on this new frontend code for https://dp.la. We took their work, expanded on it and refined it, and put a big chunk of it online at https://beta.dp.la to let people see where we’re headed. We’ve spent a bunch of time improving accessibility and search engine optimization of this new site as well.
During the first half of 2017, we started getting aggressively harvested by some new bots. These scripts were not behaving according to the rules; they’d ignore our robots.txt directives and would do nasty things like run searches on common terms and crawl all the facets. Traffic like this ended up knocking our search cluster offline and wreaking havoc on the services that depended on it. We started suffering through outages at random times throughout the day.
We were able to make short work of this problem thanks to some new options in Amazon Web Services, most notably: CloudFront, their Content Delivery Network solution; Web Application Firewall, which provides IP blacklist, XSS and SQL Injection blocking; and AWS Shield, which provides protection against Distributed Denial of Service attacks.
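For readers unfamiliar with IP blacklisting, the underlying check is simple: a request is rejected when its source address falls inside any blocked range. The sketch below shows that logic in plain Python using the standard `ipaddress` module. It is not the AWS WAF API, and the ranges are reserved documentation addresses chosen only as examples.

```python
import ipaddress

# Example ranges only (reserved documentation blocks), written in CIDR notation.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),    # an entire misbehaving subnet
    ipaddress.ip_network("198.51.100.17/32"),  # a single misbehaving host
]


def is_blocked(client_ip: str) -> bool:
    """Return True when the client address falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_RANGES)
```

With these example ranges, `is_blocked("203.0.113.42")` is true and `is_blocked("192.0.2.1")` is false. A managed firewall applies the same kind of test at the network edge, before a request ever reaches the search cluster.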
Ingestion and Backend
In terms of our backend systems, we’ve made a lot of progress toward transforming them to a flexible architecture that allows us to move faster and provide better service.
We moved from our ever-growing ElasticSearch index as our de facto system of record to storing versioned data in Amazon Web Services’ S3. This makes a number of things possible. We’re now testing an Apache Spark-driven indexing system based on this pool of data that lets us rebuild our search index from scratch in about a half hour, which means schema migrations, upgrades, and disaster recovery are all very easy to do. If we ever make a major mistake while ingesting a new provider, it’s easy to roll back to an older version since the data is right there in S3. And we can throw as much load as we need at S3 without risking taking down our production infrastructure, so we can do analytics and other experiments without worry. These capabilities are going to be key to getting some of our work done in 2018.
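The rollback idea can be sketched in a few lines. The example below assumes a purely hypothetical key layout of `provider/timestamp/file`, which is not necessarily how DPLA organizes its bucket, and it works on a plain list of key names rather than calling S3. The provider names and file names are invented for illustration.

```python
from typing import Iterable, List, Optional


def versions_for(keys: Iterable[str], provider: str) -> List[str]:
    """Return the sorted snapshot timestamps found for one provider.

    Zero-padded timestamps sort chronologically when sorted as strings.
    """
    prefix = f"{provider}/"
    found = set()
    for key in keys:
        if key.startswith(prefix):
            parts = key[len(prefix):].split("/", 1)
            if len(parts) == 2 and parts[0]:
                found.add(parts[0])
    return sorted(found)


def current_version(keys: Iterable[str], provider: str) -> Optional[str]:
    """The newest snapshot, which is what an index rebuild would read."""
    versions = versions_for(keys, provider)
    return versions[-1] if versions else None


def rollback_version(keys: Iterable[str], provider: str) -> Optional[str]:
    """The snapshot just before the newest one, if there is one."""
    versions = versions_for(keys, provider)
    return versions[-2] if len(versions) >= 2 else None


# Hypothetical listing of object keys for two providers.
KEYS = [
    "nara/20171101_120000/part-00000.json",
    "nara/20171215_093000/part-00000.json",
    "nara/20171215_093000/part-00001.json",
    "other/20171201_000000/part-00000.json",
]
```

Since every ingest writes a new snapshot instead of overwriting the old one, rolling back amounts to pointing the indexer at the value of `rollback_version` and rebuilding.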
DPLA’s new ingestion system, Ingestion 3, was extensively tested and put into production for a small number of providers in December of 2017. We will continue to migrate providers over to the new system throughout 2018, as well as onboard new providers using this system on an ongoing basis.
We based this system on Apache Spark as the computing framework, and continue to see great performance gains from this choice. An early goal was to be able to operate at the scale of the current dataset from the National Archives. We were able to ingest and index a data dump from NARA of nearly 4 million objects at the end of 2017, which was the largest individual ingest that DPLA has ever accomplished by a margin of about 40%. Unlike previous ingests, which generally took the better part of a week, this process took around a day to accomplish.
This year, you can expect to see DPLA make significant gains on the UX of our site in terms of refinements and new features. We’re working on a new version of our API that will be powered by a modern version of ElasticSearch, and should offer some new capabilities. We’ll be working on some services and features directed at our new Member Network. And we’ll continue work on our Ingestion 3 project, with a focus on making more of it useful for other organizations.
In conclusion, I’d like to thank the members of the technology team at DPLA for making all this possible, and the rest of the staff and greater DPLA community for making this a great place to work!