NDSA:HathiTrust

From DLF Wiki
Revision as of 12:16, 16 May 2011 by Csnavely (talk | contribs)

HathiTrust Response to Implementations of Large Scale Storage Architectures

  1. What is the particular preservation goal or challenge you need to accomplish? (for example, re-use, public access, internal access, legal mandate, etc.)
    • HathiTrust's mission is jointly the long-term, multi-institutional preservation of digitized library materials and access to them.
  2. What large scale storage or cloud technologies are you using to meet that challenge? Further, which service providers or tools did you consider and how did you make your choice?
    • HathiTrust does not use a cloud storage provider. At least within the scope of digital preservation of library materials, we would consider ourselves to be similar to a cloud storage provider. The large scale storage system we use is from Isilon.
  3. Specifically, what kind of materials are you preserving? (text, data sets, images, moving images, web pages, etc.)
    • Currently, print volumes comprise the bulk of preserved content. Pilot projects for continuous-tone images and audio are well underway.
  4. How big is your collection? (In terms of number of objects and storage space required)
    • As of May 2011, HathiTrust is preserving 8.7 million print volumes, which consume approximately 400TB of storage.
  5. What are your performance requirements?
    • Checksum validation requires all data to be read on an approximately 90-day cycle, which, at the current repository size, translates to about 54MB/s of sustained read activity, a rate attainable through parallelization. The most recent peak ingest rate was ~500,000 volumes per month, which translates to an average write rate of ~7MB/s.
  6. What storage media have you elected to use? (Disk, Tape, etc)
    • HathiTrust uses two instances of disk in different cities (several hundred miles apart) for primary storage and two instances of tape in one city (several miles apart) for backup and 6 months of previous-version retention.
  7. What do you think are the key advantages of the system you use?
    • Because it is NAS-based, the storage environment presents a simple filesystem, which offers easy integration and movement between vendor platforms. The Isilon system has a number of advanced data integrity features that are well-suited to digital preservation, including internal checksums and inline data correction to protect against bit rot, misdirected/torn writes, and similar data storage risks.
  8. What do you think are the key problems or disadvantages your system presents?
    • The simplicity of filesystems can be seen as a risk for the same reason it is beneficial; without imposed structure, filesystems can accommodate informal and unrigorous access to data. When using a filesystem, the price paid for simplicity is that greater attention is required to permissions and formalized processes for data management.
  9. What important principles informed your decision about the particular tool or service you chose to use?
    • The storage solution was chosen for its ability to scale in capacity and performance while keeping maintenance requirements constant and low.
  10. How frequently do you migrate from one system to another?
    • HathiTrust migrated to Isilon storage in 2007 from temporary DAS and has not needed another migration since. The first cycle of hardware replacement (100TB at both locations) was completed using the built-in capability for removing and adding storage nodes, with no direct handling of data required.
  11. What characteristics of the storage system(s) you use do you feel are particularly well-suited to long-term digital preservation? (High levels of redundancy/resiliency, internal checksumming capabilities, automated tape refresh, etc)
    • The Isilon system enables us to have greater parity redundancy than conventional storage systems: we use N+3, as opposed to the conventional RAID 5 (N+1) or the more recent RAID 6 (N+2). The latest release of Isilon system software computes checksums on write and validates checksums on read, correcting data as needed. The cluster architecture scales the compute capacity required to do this work, unlike a conventional head/tray design. The system's built-in data migration features eliminate the need for manual data handling, or even downtime, during hardware replacement.
  12. What functionality or processes have you developed to augment your storage systems in order to meet preservation goals? (Periodic checksum validation, limited human access or novel use of permissions schemes)
  13. Are there tough requirements for digital preservation, e.g. TRAC certification, that you wish were more readily handled by your storage system?
    • There are no issues or shortcomings with respect to TRAC requirements.
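
The performance figures in answer 5 can be roughly sanity-checked with a little arithmetic. The sketch below is illustrative only and is not part of HathiTrust's response. It assumes decimal terabytes, a 30-day month, and newly ingested volumes that match the repository-wide average size (400TB spread over 8.7 million volumes); the function and variable names are hypothetical. Under these assumptions the read rate comes out near 51MB/s and the peak write rate near 9MB/s. Both fall in the same range as the quoted 54MB/s and ~7MB/s, and the gaps could plausibly come from rounding, binary versus decimal units, or the actual size of ingested volumes.

```python
SECONDS_PER_DAY = 86_400


def sustained_rate_mb_per_s(total_bytes: float, days: float) -> float:
    """Average throughput in decimal MB/s needed to move total_bytes in `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


# Figures quoted in the answers above (May 2011).
repository_bytes = 400e12          # ~400TB; assumption: decimal terabytes
volume_count = 8.7e6               # print volumes preserved
validation_cycle_days = 90         # approximate checksum validation cycle
peak_ingest_per_month = 500_000    # volumes per month at the most recent peak
days_per_month = 30                # assumption: 30-day month

# Reading the whole repository once per validation cycle.
read_rate = sustained_rate_mb_per_s(repository_bytes, validation_cycle_days)

# Assumption: ingested volumes match the repository-wide average size.
avg_volume_bytes = repository_bytes / volume_count
write_rate = sustained_rate_mb_per_s(
    peak_ingest_per_month * avg_volume_bytes, days_per_month
)

print(f"validation read rate:   {read_rate:.1f} MB/s")
print(f"average volume size:    {avg_volume_bytes / 1e6:.1f} MB")
print(f"peak ingest write rate: {write_rate:.1f} MB/s")
```

Because the required read rate grows linearly with repository size for a fixed validation cycle, the same function can be used to estimate how much parallel read capacity a larger repository would need.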