NDSA:HathiTrust: Difference between revisions

From DLF Wiki
Csnavely (talk | contribs)
m 7 revisions imported: Migrate NDSA content from Library of Congress
 
Latest revision as of 14:18, 11 February 2016

HathiTrust Response to Implementations of Large Scale Storage Architectures

  1. What is the particular preservation goal or challenge you need to accomplish? (for example, re-use, public access, internal access, legal mandate, etc.)
    • HathiTrust's mission is (jointly) multi-institutional long-term preservation and access of digitized library materials.
  2. What large scale storage or cloud technologies are you using to meet that challenge? Further, which service providers or tools did you consider and how did you make your choice?
    • HathiTrust does not use cloud storage providers because of potential legal issues with copyrighted content. At least within the scope of digital preservation of library materials, we would consider ourselves to be a cloud storage provider, or similar to one. The large scale storage system we use is from Isilon, which was chosen for its ability to scale in capacity and performance while keeping maintenance requirements constant and low. All well-known storage vendors (about 15) were considered.
  3. Specifically, what kind of materials are you preserving (text, data sets, images, moving images, web pages, etc.)
    • Currently, print volumes comprise the bulk of preserved content. Pilot projects for continuous tone images and audio are well underway.
  4. How big is your collection? (In terms of number of objects and storage space required)
    • As of May 2011, HathiTrust is preserving 8.7 million print volumes, which consume approximately 400TB of storage.
  5. What are your performance requirements? Further, why are these your particular requirements?
    • Fixity checking (checksum validation) requires all data to be read approximately once every 90 days, which, at the current repository size, translates to 54MB/s of continuous read activity over 90 days (or a higher rate over fewer days). This has, so far, been easily attainable through parallelization. The most recent peak in the ingest rate was ~500,000 volumes per month, which translates to an average of ~7MB/s of write activity, also easily attainable.
  6. What storage media have you elected to use? (Disk, Tape, etc) Further, why did you choose these particular media?
    • HathiTrust uses two instances of disk in different cities (several hundred miles apart) for primary storage and two instances of tape in one city (several miles apart) providing 6 months of previous-version retention (backups). Disk was chosen for primary storage because the archive is "light" (materials are continuously accessible) and because repository activities such as full-text search indexing and fixity checking require frequent low-latency access to content. Tape was chosen for backups on conventional grounds: previous-version retention inflates storage requirements (in our case, by roughly 25%) but does not demand low-latency access, which makes tape the most cost-effective medium.
  7. What do you think are the key advantages of the system you use?
    • Being NAS-based, the storage environment is a simple filesystem, which offers easy integration with applications and easy movement between vendor platforms. The Isilon system has a number of advanced data integrity features that are well-suited to digital preservation, including internal checksums and inline data correction to protect against bit rot, misdirected/torn writes, and similar data storage risks.
  8. What do you think are the key problems or disadvantages your system presents?
    • The simplicity of filesystems can be seen as a risk for the same reason it is beneficial; without imposed structure, filesystems can accommodate informal and unrigorous access to data. When using a filesystem, the price paid for simplicity is that greater attention is required to permissions and formalized processes for data management.
  9. What important principles informed your decision about the particular tool or service you chose to use?
    • The need to use an in-house storage system was heavily influenced by the sensitive nature of copyrighted materials that are the subject of an ongoing lawsuit.
  10. How frequently do you migrate from one system to another? Further, what is it that prompts you to make these migrations?
    • HathiTrust migrated to Isilon storage in 2007 from temporary DAS and has not needed another whole migration since. The first cycle of hardware replacement (100TB at both locations) was completed using the built-in capability for removing and adding storage nodes; no direct handling of data was required. The initial move was prompted simply by the initial hardware acquisition, and hardware replacement is an annual cycle to maintain hardware currency; the equipment lifetime is 3-4 years.
  11. What characteristics of the storage system(s) you use do you feel are particularly well-suited to long-term digital preservation? (High levels of redundancy/resiliency, internal checksumming capabilities, automated tape refresh, etc)
    • The Isilon system enables us to have greater parity redundancy than conventional storage systems: we use N+3, as opposed to the conventional RAID 5 (N+1) or slightly novel RAID 6 (N+2). The latest release of Isilon system software computes checksums on write and validates checksums on read, correcting data as needed, and the cluster architecture scales the compute capacity required to do this work, unlike a conventional head/tray design. The system provides built-in data migration features for hardware replacement that eliminate the need for manual data handling, or even downtime, during the replacement.
  12. What functionality or processes have you developed to augment your storage systems in order to meet preservation goals? (Periodic checksum validation, limited human access or novel use of permissions schemes)
    • We have developed fixity checking and other repository auditing tools as a check on our use of the storage system, ensuring that our processes have stored data correctly. This check doubles as bit rot detection in addition to that performed by the system. We have developed a small diagnostic script to isolate problems in the data synchronization from one site to another if the storage system reports an error.
  13. Are there tough requirements for digital preservation, e.g. TRAC certification, that you wish were more readily handled by your storage system?
    • There are no issues or shortcomings of the system with respect to TRAC requirements.
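
The throughput figures in answer 5 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "TB" and "MB" are binary units and that a month has 30 days, neither of which is stated in the response. Under those assumptions the 400TB repository read once per 90 days gives roughly the quoted 54MB/s, and the quoted ~7MB/s ingest rate implies an average size per ingested volume.

```python
# Back-of-the-envelope check of the figures in answers 4 and 5.
# Assumption: "TB" and "MB" are binary units (2**40 and 2**20 bytes).

MIB = 2**20
TIB = 2**40
SECONDS_PER_DAY = 86_400


def sustained_rate_mib_s(total_bytes: float, days: float) -> float:
    """Average rate in MiB/s needed to move total_bytes within `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY) / MIB


repository_bytes = 400 * TIB  # repository size from answer 4
fixity_cycle_days = 90        # fixity cycle from answer 5

read_rate = sustained_rate_mib_s(repository_bytes, fixity_cycle_days)
print(f"fixity read rate: {read_rate:.1f} MiB/s")

# Average volume size implied by ~7 MB/s at 500,000 volumes per month.
# Assumption: a 30-day month.
ingest_rate_mib_s = 7
volumes_per_month = 500_000
implied_volume_mib = ingest_rate_mib_s * 30 * SECONDS_PER_DAY / volumes_per_month
print(f"implied average ingested volume: {implied_volume_mib:.0f} MiB")
```

Shortening the fixity cycle raises the required read rate proportionally, which is the "higher rate over fewer days" noted in the answer.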
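
The parity levels compared in answer 11 trade usable capacity for fault tolerance. The sketch below illustrates that trade-off with the generic N+M relationship (M parity units tolerate M simultaneous failures per stripe). The stripe width of 12 data units is a hypothetical value chosen for illustration; the response does not state the width HathiTrust uses, and this is not a model of Isilon's internal layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParityScheme:
    name: str
    data_units: int    # N: data units per stripe (hypothetical width)
    parity_units: int  # M: parity units per stripe

    @property
    def failures_tolerated(self) -> int:
        # An N+M scheme survives the loss of up to M units per stripe.
        return self.parity_units

    @property
    def usable_fraction(self) -> float:
        # Share of raw capacity that holds data rather than parity.
        return self.data_units / (self.data_units + self.parity_units)


N = 12  # hypothetical stripe width, not taken from the response
schemes = [
    ParityScheme("RAID 5 (N+1)", N, 1),
    ParityScheme("RAID 6 (N+2)", N, 2),
    ParityScheme("N+3", N, 3),
]

for s in schemes:
    print(f"{s.name}: tolerates {s.failures_tolerated} failure(s), "
          f"{s.usable_fraction:.0%} of raw capacity usable")
```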