NDSA:Harvard IQSS - DVN/Murray Archive

From DLF Wiki

These responses pertain to Harvard's Digital Repository Service (DRS). See [1].

1. What is the particular preservation goal or challenge you need to accomplish? (for example, re-use, public access, internal access, legal mandate, etc.)

The Dataverse Network (DVN) provides a virtual archiving service for Harvard affiliates and for the wider research community. It addresses both long-term preservation and access needs and short-term dissemination needs. Much of the content is curated by others, so a major challenge is to provide rich preservation and access services for content deposited in a diverse set of formats, potentially with minimal metadata.

2. What large scale storage or cloud technologies are you using to meet that challenge? Further, why did you choose these particular technologies?

Internally we use NetApp NAS for storage. We also deploy applications for access to the collection on AWS (as a prototype), and we use the SafeArchive system with LOCKSS to replicate content across the Data-PASS partner storage network.


3. Specifically, what kind of materials are you preserving? (text, data sets, images, moving images, web pages, etc.)

Numeric data sets, qualitative text-data collections, audio and video (qualitative data/interviews).

4. How big is your collection? (In terms of number of objects and storage space required)

Approximately 500K files totaling 80 TB. Most of the storage volume comes from a few large video collections; most of the files are numeric data.

5. What are your performance requirements? Further, why are these your particular requirements?

All content needs to be online for delivery without delay. Online analysis is offered on numeric data, so low latency is required.


6. What storage media have you elected to use? (Disk, Tape, etc) Further, why did you choose these particular media?

A combination of disk & tape.

7. What do you think are the key advantages of the system you use?

The Dataverse system automates much of the deposit workflow for diverse contributors and provides metadata extraction, format conversion, and identifier assignment. This leads to a much more preservable collection at minimal effort. LOCKSS is a robust system for automatic replication of the content. SafeArchive provides a policy-based auditing layer that automatically reports whether requirements for freshness, number of copies, distribution, and integrity are met. Universal Numeric Fingerprints (UNFs) allow integrity to be verified across format migrations.
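
To illustrate why a fingerprint computed over data values, rather than over file bytes, can survive a format migration, here is a minimal sketch. It is not the UNF algorithm and should not be used to reproduce UNF values; the names `canonical` and `semantic_fingerprint`, the 7-significant-digit rounding, and the 128-bit truncation are all assumptions made for the example.

```python
import base64
import hashlib


def canonical(value, digits=7):
    """Render one value in a form that does not depend on the file format.

    Hypothetical normalization rules, chosen only for illustration.
    """
    if value is None:
        return "m:"  # missing value
    if isinstance(value, (int, float)):
        # Exponential notation with a fixed number of significant digits,
        # so 1, 1.0 and "1.50"-style spellings of a number normalize alike.
        return "n:" + format(float(value), f"+.{digits - 1}e")
    return "s:" + str(value)


def semantic_fingerprint(values, digits=7):
    """Hash the canonical form of each value in order."""
    h = hashlib.sha256()
    for v in values:
        h.update(canonical(v, digits).encode("utf-8"))
        h.update(b"\n\0")  # terminator so adjacent values cannot run together
    # Truncate to 128 bits and encode as text (an illustrative choice).
    return base64.b64encode(h.digest()[:16]).decode("ascii")


# The same column read from two different serializations.
from_text = [float(x) for x in "1.50,2.25,3".split(",")]
from_native = [1.5, 2.25, 3]
same = semantic_fingerprint(from_text) == semantic_fingerprint(from_native)
```

Because the hash depends only on the normalized values, re-encoding the data set in a new file format leaves the fingerprint unchanged, while any change to a value within the retained precision changes it.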

8. What do you think are the key problems or disadvantages your system presents?

Scaling up to petascale science. Automatic provisioning of the replication network. Semantic fingerprints (UNFs) are not available for audio or video.

9. What important principles informed your decision about the particular tool or service you chose to use?

Transparency, maintainability, workflow support for contributors.

10. How frequently do you migrate from one system to another? Further, what is it that prompts you to make these migrations?

Storage back ends evolve over time, but this rests on institute enterprise storage and is essentially transparent from the archivist/curator point of view.

11. What characteristics of the storage system(s) you use do you feel are particularly well-suited to long-term digital preservation? (High levels of redundancy/resiliency, internal checksumming capabilities, automated tape refresh, etc)

Non-proprietary storage formats, semantic digital fingerprints, distributed replication, TRAC-based auditing.
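
As a rough illustration of how auditing over a replication network might be expressed, the sketch below checks a set of replicas against a policy covering integrity, number of copies, distribution, and freshness. This is not SafeArchive's or LOCKSS's API; the `Replica` and `Policy` classes, the `audit` function, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Replica:
    host: str
    region: str
    digest: str              # digest reported by the replica
    last_verified: datetime  # when that digest was last recomputed


@dataclass
class Policy:
    # Hypothetical thresholds, chosen only for the example.
    min_copies: int = 3
    min_regions: int = 2
    max_age: timedelta = timedelta(days=30)


def audit(replicas, expected_digest, policy, now):
    """Return a pass/fail result for each policy dimension."""
    intact = [r for r in replicas if r.digest == expected_digest]
    fresh = [r for r in intact if now - r.last_verified <= policy.max_age]
    return {
        # Every replica reports the expected digest.
        "integrity": len(intact) == len(replicas),
        # Every intact replica was verified recently enough.
        "freshness": len(fresh) == len(intact),
        # Enough intact, recently verified copies exist.
        "copies": len(fresh) >= policy.min_copies,
        # Those copies are spread over enough distinct regions.
        "distribution": len({r.region for r in fresh}) >= policy.min_regions,
    }
```

A report built this way states which policy dimension failed, which is more useful to a curator than a single pass/fail flag.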

12. What functionality or processes have you developed to augment your storage systems in order to meet preservation goals? (Periodic checksum validation, limited human access or novel use of permissions schemes)

Continuous integrity checking, digital fingerprints, auditing, workflow.
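
A periodic integrity check of the kind listed above can be sketched as a manifest of checksums that is recomputed and compared on a schedule. The function names, the choice of SHA-256, and the manifest layout are assumptions for illustration, not a description of the archive's implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading in chunks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root, manifest):
    """Return files recorded in the manifest that are now missing or changed."""
    root = Path(root)
    problems = {}
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            problems[rel] = "missing"
        elif sha256_of(p) != expected:
            problems[rel] = "changed"
    return problems
```

In this sketch the manifest is built once at deposit time and `check_fixity` is rerun on a schedule; an empty result means every recorded file is still present and unchanged.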

13. Are there tough requirements for digital preservation, e.g. TRAC certification, that you wish were more readily handled by your storage system?

Integration with data management plan policies. Confidential data management.