NDSA:Discussions on Preservation Storage Topics

Statement of Purpose

The Infrastructure Working Group, in February 2012, initiated a series of open conversations on detailed aspects of preservation storage. These conversations are conducted over the listserv, and each topic is discussed over the course of 2-3 weeks. A list of potential future discussion topics is maintained at the bottom of this page and can be augmented by group members. This page captures the content of those conversations for further elaboration by group members.

Topic 2: Compression

Questions

Data compression can decrease the cost of long-term preservation by reducing the amount of storage required. There are at least three types of compression to consider:

  • file compression, using a file compression algorithm suited to the file type
  • hardware compression, which usually means compression done by a tape drive as the data is written to tape
  • disk compression, which is performed by many new storage appliances and uses a combination of compression and de-duplication

If there are other kinds of compression, please add them to this discussion.

For each of these types of compression:

  1. Are you currently using this type of compression in your own archival storage (in the OAIS sense of long-term preservation storage)?
  2. How do you feel about using this type of compression for archival storage? Is this legitimate or something that best practice would discourage?
  3. What are the particular risks of this type of compression, if any, for preservation?
  4. Are there any advantages to using this type of compression beyond reducing storage costs?
  5. How do you trade off cost vs. risk?

Compressed formats are in general much more sensitive to data corruption than uncompressed formats. Due to the 'amplification' effect that compression has on data corruption, the percentage saving in storage space is often much less than the percentage increase in the amount of information that is affected by data corruption.
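
The sketch below illustrates both points in Python, using gzip purely as an example of a lossless codec (the sample content and flipped bit positions are hypothetical): a fixity check confirms that lossless compression round-trips exactly, while a single flipped bit damages one byte of the uncompressed copy but typically renders the whole compressed stream unreadable.

  import gzip

  # Illustrative, highly compressible sample content.
  data = b"preservation payload " * 5000
  compressed = gzip.compress(data)

  # Lossless compression is reversible: the round trip is bit-identical.
  assert gzip.decompress(compressed) == data
  print(f"original {len(data)} bytes, compressed {len(compressed)} bytes")

  def flip_bit(buf: bytes, index: int) -> bytes:
      """Simulate a single-bit storage error at byte 'index'."""
      out = bytearray(buf)
      out[index] ^= 0x01
      return bytes(out)

  # One flipped bit in the uncompressed copy damages exactly one byte...
  damaged = flip_bit(data, len(data) // 2)
  print(sum(a != b for a, b in zip(data, damaged)), "byte(s) affected")

  # ...while the same single-bit error inside the compressed stream
  # typically makes everything from that point on unrecoverable.
  try:
      gzip.decompress(flip_bit(compressed, len(compressed) // 2))
  except Exception as exc:
      print("decompression failed:", exc)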

A Selected Bibliography

(Priscilla: I looked for best practices or other documents that addressed compression in the preservation context. Below are some snippets of what I found, addressing mainly file and hardware compression.)

  • Case Western Reserve University Archives: Compression adds complexity to long-term preservation. Some compression techniques shed "redundant" information. For example, JPEG removes information to reduce file size. The image might look fine on your current monitor, but as monitors improve, the lower quality of the image will become more obvious.

Not encoding, in particular not using compression, typically results in files that have minimal sensitivity to corruption. In this way, the choice not to use compression is a way to mitigate against loss.

It is recommended that algorithms should only be used in the circumstances for which they are most efficient. It is also strongly recommended that archival master versions of images should only be created and stored using lossless algorithms. The Intellectual Property Rights status of a compression algorithm is primarily an issue for developers of format specifications, and software encoders/decoders. However, the use of open, non-proprietary compression techniques is recommended for the purposes of sustainability.
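
As a practical illustration of the lossless-only recommendation, a Python sketch like the one below can audit the compression tag of TIFF masters. It assumes the third-party Pillow library is available; the filename is a hypothetical placeholder.

  from PIL import Image  # third-party Pillow library, assumed installed

  # TIFF tag 259 ("Compression"): 1 = uncompressed, 5 = LZW (lossless),
  # 6/7 = JPEG variants (lossy); lossy codes should never appear in masters.
  LOSSY_CODES = {6, 7}

  def check_master(path: str) -> None:
      with Image.open(path) as img:
          code = img.tag_v2.get(259, 1)  # tag_v2 is TIFF-specific
          if code in LOSSY_CODES:
              print(f"{path}: lossy compression (code {code}) - not archival")
          else:
              print(f"{path}: compression code {code} is lossless or none")

  check_master("archival_master.tif")  # hypothetical path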

Data is often compressed or "scrambled" to assist in its storage and/or to protect its intellectual content. These compression and encryption algorithms are often developed by private organisations that will one day cease to support them. If this happens you are stuck between a rock and a hard place: if you do not want to get into legal trouble, you are no longer able to read your data; and if you go ahead and "do the unwrapping yourself", it is quite possible you are breaking copyright law.

A similar obsolescence problem will have to be addressed with the file formats and compression techniques you choose. Do not rely on proprietary file formats and compression techniques, which may not be supported in the future as the companies which produce them merge, go out of business or move on to new products. In the cultural heritage community, the de facto standard formats are uncompressed TIFF for images and PDF, ASCII (SGML/XML markup) and RTF for text. Migration to future versions of these formats is likely to be well-supported, since they are so widely used. Digital objects for preservation should not be stored in compressed or encrypted formats.

Topic 1: Encryption

Questions

Do you have any opinions on it? What are your reasons for your opinions (gut feelings are OK)?

  • The majority of the preservation data we deliver to our clients is stored on LTO data tapes - without encryption. We do use WORM capability if the client is OK with it. Our reasons are mainly based on the assumption that we do not have any control over who can access the tape, now or in the future, and staffing changes might stifle the client's ability to recover the preservation files ("now where did the last person put the list of encryption keys?").
  • Pros:
    • Strong encryption eliminates worries associated with unauthorized access to preservation copies of materials (such as copyrighted data).
    • Encryption doubles as an authenticity check; in fact, some encryption methods involve the creation of a digital signature that can be used for provenance or bit rot detection (see the sketch after this list).
  • Cons:
    • Encryption causes file size bloat to the tune of 20-30%.
    • For light archives, encryption imparts a performance penalty for systems that need to extract the content from the preservation archive for access purposes.
  • DuraCloud's approach to encryption is in response to what consumers of cloud storage are requesting. The number one concern is unauthorized access.
  • There is plenty of advice promoting a natural wariness towards encryption, though no known studies have addressed it specifically. This conventional wisdom of avoidance is most likely driven by the security risk of losing keys (as mentioned above) and the challenge encryption poses to access, especially for those who may have a legitimate reason (or authorization) to access the data.
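
As a concrete version of the signature point above, here is a minimal sketch using a detached Ed25519 signature. It assumes the third-party 'cryptography' package; the filename is a hypothetical placeholder, and a real deployment would also need to preserve the public key and signature as metadata.

  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

  private_key = Ed25519PrivateKey.generate()
  public_key = private_key.public_key()

  with open("preservation_copy.bin", "rb") as f:  # hypothetical object
      content = f.read()

  # Sign at ingest; store the signature alongside the object.
  signature = private_key.sign(content)

  # Later audit: any bit rot (or tampering) invalidates the signature.
  try:
      public_key.verify(signature, content)
      print("content verified: integrity and provenance intact")
  except InvalidSignature:
      print("verification failed: content altered or signature mismatched")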

What kinds of problems do you think it might create in the future?

  • See above. We're most concerned that staffing issues, combined with the object-based vault management infrastructures in place, could lead to problems. Certainly not saying that is the best rationale, but it is based on current reality.
  • As mentioned, preservation of the encryption keys themselves is typically raised as a long-term concern (see below).
  • Format obsolescence and the need for migration is of equal concern with encryption formats as it is with data storage formats themselves.
  • Similar to the Tivoli Storage Manager example below, many cloud storage consumers want to separate the responsibility for data security from that for storage by uploading already-encrypted content. However, the burden of client-side encryption poses a barrier to some.
  • There is the potential for encryption requirements to force a revision of the architectural designs of preservation repositories. If ingesting and preserving content with potential HRCI (high-risk confidential information), PII (personally identifiable information), or other sensitive/private information, institutional policies or legal requirements may dictate security policies. These can determine encryption requirements which, in turn, can necessitate the use of specific storage media and architecture (see note below on encrypting disk vs. encrypting tape).
  • The aforementioned wariness is also likely caused by uncertainty about how encryption complicates format migrations, data mining, and other automated preservation functions. That added layer of complexity is itself an additional preservation risk.
  • A key concern is that encryption will overcomplicate legitimate access to content.

Do you have any current requirements to do this (laws, policies)? What are the conditions under which you need to encrypt? Do you know of any upcoming requirements for you to do this?

  • What started us having to consider encryption in our preservation repository was our email archiving project. We'll be preserving email with permanent scholarly value. Email is the first type of content we are preserving that could potentially have HRCI (high risk confidential information) or even just sensitive/private information. Because of its potential privacy issues and sheer quantity, we are treating all email that comes into the repository as potentially sensitive. Its range of potential sensitive/private information means that we are subject to the university security policy regarding HRCI/personal/private information as well as all relevant state and federal laws (HIPAA, FERPA, MA state encryption law, etc.) for this content.
  • While the above requirements caused us to revise our policies and architectural design, they mean that we will be able to accept sensitive content of any type (beyond email) when we are done.
  • We received mixed advice regarding software vs. hardware encryption. We were told software encryption solutions were immature (performance problems and worse) and that hardware encryption was the way to go. Some of our system administrators looked at the encryption offerings and found some big drawbacks even before considering the effect on preservation (mainly expense, but also having to manage a couple of encryption key management devices).
  • We have since come to the conclusion that we are not required to encrypt this content on storage disks, because we are taking other measures (private network address space, local firewalls, periodic penetration tests, encryption on transport, etc.). But if we use tape as part of the storage solution, we will have to encrypt the tapes. We are replacing the DRS storage system this year, so, in part because of this encryption requirement, we are considering an all-disk solution (up to now we have always included 2 tape copies along with disk storage).
  • No (other respondents)

If you do it what technique(s)/strategies do you use? Do you isolate encrypted content from non-encrypted content?

  • DuraCloud, as a provider, is developing tools to accommodate a number of scenarios:
    • client does all encryption and key management themselves
    • client manages keys, but provides them to upload tooling to encrypt content prior to transit (a minimal sketch of this scenario follows this list)
    • client wants persisted content to be encrypted, but would rather not deal with key management or the encryption process
    • layered on top of these scenarios is the consideration of the contents' usability within the storage system, such as indexing or metadata extraction
  • We use Tivoli Storage Manager (TSM) client-side encryption for our tape backups. We do not operate the tape backup system we use, so we want to isolate our data security from the tape backup system environment.
  • Encryption is done for security purposes. Tracking is done with barcodes entered into the preservation metadata database, but the preservation and vault system databases do not usually "refresh" with each other, so the two are not in sync. Again, not saying by any stretch that this should or could be considered "best practice"; it's just what typically happens. Refresh cycles (while another topic) are also a problem in this environment, as there is typically less interest in subsequent updates of data tapes that have not "recouped" their initial cost of creation. I guess I'm trying to convey that there is more interest in keeping the 10-20% of backups that have been profitable than in a consistent policy that treats all digital preservation files equally.
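
A minimal sketch of the second DuraCloud scenario above (client-held keys, encryption before transit), assuming the third-party 'cryptography' package; the filename and the upload step are hypothetical placeholders.

  from cryptography.fernet import Fernet

  key = Fernet.generate_key()  # generated and retained by the client
  fernet = Fernet(key)

  with open("object_to_preserve.bin", "rb") as f:  # hypothetical object
      plaintext = f.read()

  # Only ciphertext ever leaves the client.
  ciphertext = fernet.encrypt(plaintext)
  # upload_to_storage(ciphertext)  # placeholder for the actual transfer

  # On retrieval, only a holder of the key can recover the content.
  assert fernet.decrypt(ciphertext) == plaintext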

On decryption keys

  • Long-term secure preservation of the decryption keys themselves is typically raised as a concern, although personally I feel that solutions to this problem are straightforward, albeit complex. I view this as a compound problem that requires a combination of preservation storage principles and security principles to solve:
    • Preservation storage - There have to be multiple copies of the keys (existing framework of geographic distribution should facilitate this)
    • Security storage - The keys themselves obviously have to be secured in some way. This can be done with additional encryption, physical security (i.e., a locking safe), or both. The key point is that this chain ultimately ends in human knowledge, i.e., people have to know secrets. The trick is ensuring that enough people know enough secrets to eventually lead to the encryption keys. For example, office staff at multiple sites might hold the combinations to safes containing encrypted encryption keys, while a more privileged group of repository administrators knows the secret that decrypts them; this adds multiple layers to the scheme (see the key-wrapping sketch after this list).
    • Geographic redundancy addresses disaster planning for the decryption keys themselves
  • Security risk can never be zero, but the risk can be brought into an acceptable range with a well-thought-out scheme grounded in existing digital preservation technology and policy frameworks.
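
A minimal key-wrapping sketch of the layered scheme described above, assuming the third-party 'cryptography' package: the content key is itself encrypted with a key-encrypting key (KEK), so copies of the wrapped key can be replicated to multiple sites while the KEK stays under tighter control.

  from cryptography.fernet import Fernet

  content_key = Fernet.generate_key()  # encrypts the preserved objects
  kek = Fernet.generate_key()          # held by privileged administrators

  # Wrap the content key; the wrapped copy is safe to replicate widely,
  # e.g. placed in safes at multiple sites.
  wrapped = Fernet(kek).encrypt(content_key)

  # Recovery: unwrap with the KEK, then decrypt content as usual.
  assert Fernet(kek).decrypt(wrapped) == content_key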

Do you know of any relevant studies/papers, etc. about this topic?

  • No (all respondents)... I think we've identified a real gap in the literature.

Attempt at Decision Analysis

1) Theoretically, maintenance of keys is straightforward although complex; for example:

  • use a distributed PKI -- either (i) hierarchical or (ii) PGP-style
  • maintain physically protected copies off-line
  • escrow keys (possibly divided) with multiple (partially) trusted third parties (a minimal key-splitting sketch follows this list)
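
A minimal sketch of dividing a key among escrow parties, using a simple n-of-n XOR split (all shares are required to reconstruct; a threshold scheme such as Shamir's secret sharing would allow k-of-n recovery instead). Standard library only; the share count is illustrative.

  import secrets

  def split_key(key: bytes, n: int) -> list[bytes]:
      """Split 'key' into n shares; any n-1 of them reveal nothing."""
      shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
      last = key
      for share in shares:
          last = bytes(a ^ b for a, b in zip(last, share))
      shares.append(last)
      return shares

  def reconstruct_key(shares: list[bytes]) -> bytes:
      key = bytes(len(shares[0]))  # all-zero accumulator
      for share in shares:
          key = bytes(a ^ b for a, b in zip(key, share))
      return key

  key = secrets.token_bytes(32)
  shares = split_key(key, 3)  # e.g. one share per escrow agent
  assert reconstruct_key(shares) == key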

2) In practice, few organizations have good enterprise key management, and it has been unexpectedly difficult to maintain even over normal business timescales. Some commonly encountered challenges include:

  • managing key revocation and possible re-encryption of content encrypted using a revoked key
  • enterprise scaling of key management (especially for PGP-style PKI, off-line copies, and escrowed keys)
  • enterprise scaling of performance over encrypted content (e.g. barriers to deduplication of virtualized storage; overhead for computing on encrypted content; performance issues with encrypted filesystems; barriers to standard storage maintenance/recovery/integrity auditing )
  • enabling collaboration with encrypted content (managing appropriate group access to content, while maintaining desired security properties)
  • potential catastrophic single point security failures for PKI ( esp. certificate issuing, checking, revocation architecture)
  • proprietary encryption algorithms (esp. hardware embedded)

3) A key issue for preservation is managing risks to long-term access. The use of encryption creates additional risks of catastrophic, correlated, single-point long-term access failure, such as:

  • undetected corruption of content due to defects in encryption hardware/software
  • " " due to increased barriers to auditing
  • increased risk of content corruption due to barriers to filesystem maintenance/recovery of encrypted content
  • loss of access to content because of loss of proprietary encryption technology/knowledge (try reading a hardware-encrypted tape from 10 years ago :-( )
  • " " because of unintentional loss of key
  • " " because of unintentional/incorrect revocation/key destruction (e.g. "self destruct" mechanisms on encrypted hardware, such as IronKeys)
  • Financial risks to access because of increased costs of maintaining encryption
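
One possible mitigation for the auditing barrier noted above is to record fixity values for both the ciphertext and the plaintext at ingest: routine audits can then verify the stored ciphertext without access to the key, reserving keyed verification of the plaintext for occasions when decryption is already justified. A minimal sketch, assuming the third-party 'cryptography' package and illustrative content:

  import hashlib
  from cryptography.fernet import Fernet

  key = Fernet.generate_key()
  plaintext = b"archival object payload"  # illustrative content
  ciphertext = Fernet(key).encrypt(plaintext)

  # Recorded at ingest and stored in preservation metadata.
  plain_digest = hashlib.sha256(plaintext).hexdigest()
  cipher_digest = hashlib.sha256(ciphertext).hexdigest()

  # Keyless routine audit: detects storage-level corruption of ciphertext.
  assert hashlib.sha256(ciphertext).hexdigest() == cipher_digest

  # Keyed deep audit: confirms the decrypted content itself is intact.
  recovered = Fernet(key).decrypt(ciphertext)
  assert hashlib.sha256(recovered).hexdigest() == plain_digest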


None of these are necessarily show-stoppers -- in a particular environment one could weigh possible ways to mitigate these risks, project the costs, compare them to the benefits of encryption, and make a decision either way... Unless security experts are involved, risks from misimplementation or defects in the security software/hardware/protocols (etc.) are usually not on the radar; and additional risks to long-term access are generally not on the radar even where a trained security expert is engaged. So it's important to make sure that we, as preservation experts, communicate these additional long-term access risks and costs ...

--Micah altman 19:35, 28 February 2012 (UTC)

Addendum: In addition to loss of access to content because of loss of proprietary encryption technology/knowledge, would it not also be a possible threat that the entire PKI infrastructure could change over a long enough period of time? For example, current mechanisms for obtaining keys, revocation lists, etc. could become unused and unavailable.

A selected bibliography

(From Micah Altman)

While PKI isn't my area of research here are some relevant articles from a shallow dive:

"Strategies for Ensuring Data Accessibility When Cryptographic Keys Are Lost" http://www.giac.org/paper/gsec/783/strategies-ensuring-data-accessibility-cryptographic-keys-lost/101698

The Risks of Key Recovery, Key Escrow, and Trusted Third-Party Encryption http://www.schneier.com/paper-key-escrow.html

Crypto Backup and Key Escrow http://dl.acm.org/citation.cfm?id=227241

A taxonomy for key escrow encryption systems : http://faculty.nps.edu/dedennin/publications/Taxonomy-CACM.pdf

(And 150+ citing articles: http://scholar.google.com/scholar?hl=en&lr=&cites=49628270739110236&um=1&ie=UTF-8&ei=zjRNT_K8BYuD0QGlzon3Ag&sa=X&oi=science_links&ct=sl-citedby&resnum=3&ved=0CD4QzgIwAg)

Although weighted towards "key escrow" solutions, they seem to have substantial relevance to long-term access.

Potential Future Discussion Topics

Preservation Policies

  • number of copies
  • bit integrity check frequency
  • storage hierarchy

Emerging Storage Technology

  • data reduction/de-duplication
  • device encryption
  • cloud providers
  • WORM devices
  • federated clusters

Decision Factors

  • collection size
  • budget
  • development resources

Other Ideas