How will the integrity of the digitized documents be maintained

From Digital Library of America Project

I'm not sure how data in the DPLA will be made available, but I'd like to start a discussion on a topic that is often overlooked and needs serious attention: data integrity.

It is not enough to categorize data and let people have access to it. Nor is it enough to secure the server where the data resides and ensure that important files and processes haven't been tampered with. It is of the utmost importance that the data can be trusted to be original, and that it is managed from the time it is created to the time it is destroyed. That is data provenance.

What value is well organized data when the data has been altered and there is no way to determine what changed or even which files were changed? Who would trust the well organized data? How long would it take to determine which of the thousands or millions of files have changed?

Data integrity should be one of the foundations of the DPLA. Whether the DPLA simply provides links to the content archives or stores the content on its own servers, integrity should underpin how the material is made available from beginning to end. The integrity of the data should be managed from the moment it is digitized.

A checksum is a mathematical computation that creates a digital fingerprint of a document. The fingerprint is meant to be unique to that document. Strictly speaking, two different documents can produce the same checksum (a collision), but with a well-designed algorithm, finding such a pair is computationally infeasible. Once the document is created and in its final form, a checksum can be computed for it. In practice this works well, provided the algorithm that creates the checksum is well designed and open to public review and scrutiny. The checksum can accompany the document when it is made available, and automated mechanisms can be put in place so that users can determine whether their copy is the original.
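As a rough illustration of the checksum approach, here is a minimal sketch. It assumes GNU coreutils (`sha256sum`, `sort -z`) and findutils are installed, and it picks SHA-256 as the algorithm, which is an assumption since no algorithm is named at this point. The function names are hypothetical.

```shell
# Sketch only: assumes GNU coreutils and findutils. Function names are hypothetical.

# Print just the hex SHA-256 digest of one file.
checksum_file() {
    sha256sum < "$1" | cut -d' ' -f1
}

# Write a manifest of checksums for every regular file under a directory.
# Paths are recorded relative to that directory. Keep the manifest itself
# outside the directory, otherwise it would be listed in its own manifest.
write_manifest() {
    local dir="$1" manifest="$2"
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Re-check every file listed in the manifest.
# Returns non-zero if any listed file is missing or has changed.
verify_manifest() {
    local dir="$1" manifest="$2"
    ( cd "$dir" && sha256sum --check --quiet ) < "$manifest"
}
```

A library could publish the manifest next to the documents, and a user could run `verify_manifest` against a downloaded copy. Note that a bare checksum only shows that a file matches the manifest; it says nothing about who produced the manifest.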

Another method is using digital signatures. In this case, the publisher of the data generates a fingerprint of a document when it is finalized and signs it with a private key. Users can then download the publisher's public key and check the signature to ensure not only that the integrity is intact, but that a particular publisher created the document. This option is more trustworthy and provides some level of audit trail, going back to data provenance, since the signature can be tied to the name of the person or company that created the document and to the date and time it was signed. It doesn't keep a database of changes, but that could easily be solved.

The former CIO of NESDIS (National Environmental Satellite and Data Information Service), Zach Goldstein, and I discussed a proposal for a "Deep Archive (DA)." A DA is unique in that the servers and databases where all the original data is stored are disconnected from the Internet, which dramatically limits the exposure of the system and the data to attacks from the Internet. Outbound access is allowed only on a scheduled basis to receive system updates (e.g. operating system enhancements, security patches, etc.). Any original data that needs to be made public would first have a digital signature created and stored in a database both on-site and off-site. That data is then pushed out and made available on a separate server for the public to access (and to check the digital signature). If the data became corrupt or was altered by an unauthorized person, the DA can refresh the data in its original form. This may be a great model for providers of material in the DPLA.

Ask yourself and your organization this question: if the digitized data that is available on our public site is compromised or becomes corrupt, can we quickly refresh it from an "original" copy? And without a checksum or digital fingerprint to check against, what level of certainty is there that the original data was not modified?


UPDATED 10/12/2011

Below are some benchmark results from research I've been conducting. I work in a data center where we ingest 5 million+ files a day. Not only do we need to generate a lot of signatures on files, but it needs to be done quickly to keep up with the volume of data.

These results show how fast a digital signature can be generated along with an original document. In the case of the DPLA, it can be an E-book, video, audio, or an image.

I'll explain how this is applicable to the DPLA shortly, bear with me.

Commands were executed with OpenSSL on Ubuntu 10.04, on a dual-core Intel Xeon 3.0GHz machine with 16GB RAM.


File Size    Time in seconds
2.1GB        5
337MB        1.9
10MB         0.231

This is the range of file sizes we have to deal with on a daily basis. I didn't include every file size, since that would have been boring and redundant.
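To put the benchmark table above in terms of throughput, the short sketch below divides each file size by its signing time. It assumes 1 GB = 1000 MB, since the convention used for the measurements is not stated, and the function name is hypothetical.

```shell
# Throughput implied by the benchmark table (sizes in MB, times in seconds).
# Assumption: 2.1GB is taken as 2100 MB.
throughput_table() {
    awk 'BEGIN {
        n = split("2100:5 337:1.9 10:0.231", rows, " ")
        for (i = 1; i <= n; i++) {
            split(rows[i], f, ":")
            printf "%6.0f MB in %5.3f s  ->  %3.0f MB/s\n", f[1], f[2], f[1] / f[2]
        }
    }'
}

throughput_table
```

Under that assumption the three rows work out to roughly 420, 177, and 43 MB/s. The smaller the file, the lower the apparent rate, which suggests that per-file overhead rather than hashing speed dominates for small files.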

To make it more applicable to the DPLA, I took the Project Gutenberg (PG) "April 2010 (Dual Layer) DVD" available here: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project and created a digital signature for each of its 64,162 files, which took just under 7 minutes. I created signatures on every file, not just the zip files. The zip files ranged in size from just over 600 bytes to just over 58MB.
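A batch run like this could be scripted along the following lines. This is a sketch under assumptions, not the exact procedure used for the benchmark: the function name and the layout of the signature directory are hypothetical, and it uses `-sha256` rather than the SHA1 used for the benchmark, so timings may differ. It assumes `openssl` is on the PATH and that file names contain no newlines.

```shell
# sign_tree KEY SRC_DIR SIG_DIR   (hypothetical helper)
# Create a detached signature for every regular file under SRC_DIR,
# mirroring the directory layout under SIG_DIR with a ".sig" suffix.
sign_tree() {
    local key="$1" src="$2" sigs="$3"
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        mkdir -p "$sigs/$(dirname "$rel")"
        # Stop at the first file that cannot be signed.
        openssl dgst -sha256 -sign "$key" -out "$sigs/$rel.sig" "$src/$rel" || return 1
    done
}

# Example (paths are placeholders):
#   sign_tree secp384r1.pem /srv/archive/books /srv/archive/signatures
```

Keeping the signatures in a parallel tree leaves the original files untouched and makes it easy to publish or back up the signatures separately.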

What does this mean?

Management:

Implementing digital signatures *may* not require a lot of resources (people, time, or money). I say "may" because it depends on the size of the archive. The signatures for the PG DVD archive were created with OpenSSL, software that is already installed by default on most operating systems except Windows. It is, more than likely, the same software that allows you to order from Amazon.com and transmit your credit card number over a secure connection.

In your organization you would need, at a minimum, a server to store the original files and their digital signatures (the archive server; in this case, all the PG e-books) and a public server to distribute the e-books and their respective signatures.

The two are separate so that the original e-book and signature can be refreshed on the public server if they become corrupted, if the public server dies, or if the public server is compromised and you want to ensure your customers have an unmodified copy of the e-book. This is the concept of the "Deep Archive." It also works for the cloud. If all the e-books from PG were published on servers within an Amazon cloud and a server was compromised, all the e-books, along with their digital signatures, could be refreshed in the cloud from the archive server. The PG archive server would NOT be located within the Amazon cloud, but within the local network of the Project Gutenberg main office.

Archive server:

This server does nothing but hold the original e-books and the signature for each one. No one surfs the web from this server, logs in to test anything on it, or does any other non-administrative work. The only logins are for software or security updates (or other administrative tasks). Only a limited number of users have access to this server.

Backup server:

It is good practice to back up all critical servers within an organization, so that a copy is there in case the archive or public server needs to be restored.

Public server:

Serves e-books and provides signatures for verification by customers. IMPORTANT! This server cannot connect directly to the archive server. However, the archive server can connect to the public server, ONLY to refresh e-books and signatures.
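The one-way refresh from the archive server to the public server might look something like the sketch below. To keep it self-contained it treats both servers as local directories, which is an assumption; in a real deployment the archive server would push over a network connection that it initiates. The function name is hypothetical, `sha256sum` from GNU coreutils is assumed, and file names are assumed to contain no newlines.

```shell
# refresh_public ARCHIVE_DIR PUBLIC_DIR   (hypothetical helper)
# For each file in the archive, compare its checksum with the public copy and
# re-copy from the archive when the public copy is missing or different.
# Prints the relative path of every file it restores.
refresh_public() {
    local archive="$1" public="$2"
    ( cd "$archive" && find . -type f ) | while IFS= read -r rel; do
        src="$archive/$rel"
        dst="$public/$rel"
        if [ ! -f "$dst" ] || [ "$(sha256sum < "$src")" != "$(sha256sum < "$dst")" ]; then
            mkdir -p "$(dirname "$dst")"
            cp -p -- "$src" "$dst"
            printf '%s\n' "$rel"
        fi
    done
}
```

The sketch only restores files that exist in the archive. It does not remove extra files that someone may have planted on the public side, which a fuller version would also want to report.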

In summary, adding digital signatures to a digital library is possible. Some archives will take longer than others to create signatures, but the technology is available to make that process manageable. Additionally, for smaller archives, it doesn't require a lot of resources to implement. For existing large archives, it will require a change that can be incorporated into the creation process of digital documents, whatever it may be, but creating signatures for existing data may be time-consuming. However, I do believe it will be well worth it when the *integrity* of a digital document is in question.


Technical notes:

The signatures above were created using a 384-bit Elliptic Curve key, with SHA1 for the checksum. The process computes the SHA1 checksum of the file and then signs that checksum with the private key; the result is the digital signature.

KEY GENERATION:

You can test this by generating ECC keys with the latest version of OpenSSL:

openssl ecparam -name secp384r1 -genkey -out secp384r1.pem
openssl ec -in secp384r1.pem -pubout -out secp384r1-public.pem

Signing a file:

openssl dgst -sha1 -sign secp384r1.pem -out filename.sig <file-to-sign>

Verify a signature and file using the public key:

openssl dgst -sha1 -verify secp384r1-public.pem -signature filename.sig <file-that-was-signed>


Verify a signature and file using the private key:

openssl dgst -sha1 -prverify secp384r1.pem -signature filename.sig <file-that-was-signed>
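The verify command exits with a non-zero status when a signature does not match, so it can be wrapped in a loop to answer the earlier question of which files have changed. The following is a sketch under assumptions: the function name is hypothetical, signatures are assumed to live in a parallel directory as `<relative path>.sig`, and `-sha256` is used here, so the signatures must have been created with the same digest.

```shell
# verify_tree PUBKEY SRC_DIR SIG_DIR   (hypothetical helper)
# Check every regular file under SRC_DIR against SIG_DIR/<relative path>.sig.
# Prints "UNSIGNED <path>" or "MODIFIED <path>" for each problem found and
# returns non-zero if there was at least one.
verify_tree() {
    local pub="$1" src="$2" sigs="$3" report
    report=$( ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        if [ ! -f "$sigs/$rel.sig" ]; then
            printf 'UNSIGNED %s\n' "$rel"
        elif ! openssl dgst -sha256 -verify "$pub" \
                -signature "$sigs/$rel.sig" "$src/$rel" >/dev/null 2>&1; then
            printf 'MODIFIED %s\n' "$rel"
        fi
    done )
    [ -z "$report" ] && return 0
    printf '%s\n' "$report"
    return 1
}
```

Only the public key is needed for this check, so it can run on a user's machine or on the public server without exposing the private key.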

I installed incrond to monitor a directory and compute a digital signature whenever a new file was created in it. According to sysstat, after running these tests numerous times, system resource utilization was very low.
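A setup of this kind could be wired together roughly as follows. This is only a sketch: the incrontab entry, the paths, and the handler name are hypothetical, and it watches for IN_CLOSE_WRITE (an assumption) so that a file is signed only after it has been fully written. Signatures go to a separate directory so that writing them does not trigger the watch again. It uses `-sha256` rather than the SHA1 from the commands above.

```shell
# Hypothetical incrontab entry (added with `incrontab -e`); incron expands
# $@ to the watched directory and $# to the name of the file that changed:
#
#   /data/incoming IN_CLOSE_WRITE /usr/local/bin/sign-new-file.sh $@/$#
#
# The script named in that entry would call a handler like this one.

# sign_new_file KEY FILE SIG_DIR   (hypothetical helper)
# Sign one newly arrived file and store the signature outside the watched
# directory as SIG_DIR/<file name>.sig.
sign_new_file() {
    local key="$1" file="$2" sigdir="$3"
    [ -f "$file" ] || return 0      # ignore directories and vanished files
    mkdir -p "$sigdir"
    openssl dgst -sha256 -sign "$key" -out "$sigdir/$(basename "$file").sig" "$file"
}
```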

System statistics for the few hours before the benchmark, which ran from 03:11 to 03:17 PM:

Time CPU  %user  %nice  %system  %iowait  %steal  %idle
11:35:01 AM all 0.06 0.00 0.05 0.00 0.00 99.90
11:45:01 AM all 0.01 0.00 0.03 0.00 0.00 99.95
11:55:01 AM all 0.01 0.00 0.03 0.00 0.00 99.96
12:05:01 PM all 0.05 0.00 0.06 0.00 0.00 99.89
12:15:01 PM all 0.02 0.00 0.01 0.00 0.00 99.97
12:25:01 PM all 0.04 0.00 0.02 0.00 0.00 99.94
12:35:01 PM all 0.08 0.00 0.02 0.00 0.00 99.89
12:45:01 PM all 0.03 0.00 0.04 0.00 0.00 99.93
12:55:01 PM all 0.02 0.00 0.02 0.00 0.00 99.96
01:05:01 PM all 0.09 0.00 0.01 0.00 0.00 99.89
01:15:01 PM all 0.03 0.00 0.02 0.00 0.00 99.95
01:25:01 PM all 0.02 0.00 0.02 0.00 0.00 99.96
01:35:01 PM all 0.10 0.00 0.04 0.00 0.00 99.86
01:45:01 PM all 0.02 0.00 0.04 0.00 0.00 99.94
01:55:01 PM all 0.02 0.00 0.04 0.00 0.00 99.94
02:05:01 PM all 0.10 0.00 0.03 0.00 0.00 99.87
02:15:01 PM all 0.03 0.00 0.04 0.00 0.00 99.92
02:25:01 PM all 0.02 0.00 0.02 0.00 0.00 99.95
02:35:01 PM all 0.82 0.00 1.10 0.24 0.00 97.84
02:45:01 PM all 0.83 0.00 1.48 0.25 0.00 97.44
02:55:01 PM all 0.10 0.00 0.80 0.56 0.00 98.54
03:05:01 PM all 0.18 0.00 0.31 0.89 0.00 98.62
03:15:02 PM all 1.99 0.00 2.64 6.70 0.00 88.67
03:25:01 PM all 2.63 0.00 3.84 2.26 0.00 91.27