NDSA:Discussions on Preservation Storage Topics

From DLF Wiki

Statement of Purpose

In February 2012, the Infrastructure Working Group initiated a series of open conversations on detailed aspects of preservation storage. These conversations are conducted over the listserv, and each topic is discussed over the course of 2-3 weeks. A list of potential future discussion topics is maintained at the bottom of this page and can be augmented by group members. This page serves to capture the content of those conversations for further elaboration by group members.

Topic 1: Encryption

Do you have any opinions on it? What are your reasons for your opinions (gut feelings are OK)?

  • The majority of the preservation data we deliver to our clients is stored on LTO data tapes - without encryption. We do use WORM capability if the client is OK with it. Our reasons are mainly based on the assumption that we do not have any control over who can access the tape, now or in the future, and staffing changes might stifle the client's ability to recover the preservation files ("now where did the last person put the list of encryption keys?").
  • Pros:
    • Strong encryption eliminates worries associated with unauthorized access to preservation copies of materials (such as copyrighted data).
    • Encryption doubles as an authenticity check, and in fact, some encryption methods involve the creation of a digital signature that can be used for provenance or bit rot detection.
  • Cons:
    • Encryption causes file size bloat to the tune of 20-30%.
    • For light archives, encryption imparts a performance penalty for systems that need to extract the content from the preservation archive for access purposes.
  • Duracloud's approach to encryption is in response to what consumers of cloud storage are requesting. The number one concern is over unauthorized access.
  • There is a surfeit of advice promoting a natural wariness towards encryption, though no known studies have addressed it specifically. This conventional wisdom of avoidance is most likely driven by the security risk of losing keys (as mentioned above) and the challenge it poses to access, especially to those who may have a legitimate reason (or authorization) to access the data.
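
The point above about encryption doubling as an authenticity check can be illustrated with a minimal sketch. It does not reproduce any respondent's system: it swaps in a keyed HMAC-SHA256 tag from the Python standard library as a stand-in for the digital signature mentioned in the discussion (an HMAC is a shared-key check, not a public-key signature). The function names, the key, and the sample content are all hypothetical.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, content: bytes) -> str:
    """Return a keyed HMAC-SHA256 tag (hex) for the given content."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, content: bytes, expected_tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, content), expected_tag)


# Hypothetical key and content, for illustration only.
key = secrets.token_bytes(32)
original = b"preservation master file contents"
tag = make_tag(key, original)

# An unchanged copy verifies against the stored tag.
intact = verify_tag(key, original, tag)

# Flip a single bit to simulate bit rot; verification should now fail.
damaged = bytearray(original)
damaged[0] ^= 0x01
damaged_verifies = verify_tag(key, bytes(damaged), tag)
```

Note that the tag can only be checked by someone who still holds the key, so this kind of check inherits the same key preservation concern raised elsewhere in this discussion.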

What kinds of problems do you think it might create in the future?

  • See above. We're most concerned that staffing issues combined with object-based vault management infrastructures in place could lead to problems. Certainly not saying that is the best rationale, but it is based on current reality.
  • As mentioned, preservation of the encryption keys is typically raised as a long-term concern (see below).
  • Format obsolescence and the need for migration are as much a concern for encryption formats as they are for data storage formats themselves.
  • Similar to the Tivoli Storage Manager example, many cloud storage consumers want to separate the responsibilities of data security from that of storage by uploading already encrypted content. However, the burden of client-side encryption poses a barrier to some.
  • There is the potential for encryption requirements to force a revision of the architectural designs of preservation repositories. If ingesting and preserving content with potential HRCI (high risk confidential information), PII (personally identifiable information), or other sensitive/private information, institutional policies or legal requirements may dictate security policies. This can determine encryption requirements which, in turn, can necessitate the use of specific storage media and architecture (see note below on encrypting disk vs. encrypting tape).
  • The aforementioned wariness is also likely caused by the uncertainty regarding its complication of format migrations, data mining, and other automated preservation functions. That added layer of complexity is itself an additional preservation risk.
  • A key concern is that encryption will overcomplicate legitimate access to content.

Do you have any current requirements to do this (laws, policies)? What are the conditions under which you need to encrypt? Do you know of any upcoming requirements for you to do this?

  • What started us having to consider encryption in our preservation repository was our email archiving project. We'll be preserving email with permanent scholarly value. Email is the first type of content we are preserving that could potentially have HRCI (high risk confidential information) or even just sensitive/private information. Because of its potential privacy issues and sheer quantity, we are treating all email that comes into the repository as potentially sensitive. Its range of potential sensitive/private information means that we are subject to the university security policy regarding HRCI/personal/private information as well as all relevant state and federal laws (HIPAA, FERPA, MA state encryption law, etc.) for this content.
  • While the above requirements caused us to revise our policies & architectural design, it means that we will be able to accept sensitive content of any type (beyond email) when we are done.
  • We received mixed advice regarding software vs. hardware encryption. We were told software encryption solutions were immature (performance problems and worse) and that hardware encryption was the way to go. Some of our system administrators looked at the encryption offerings and found some big drawbacks not even considering effect on preservation (expense mainly but also having to manage a couple of encryption key management devices).
  • We have since come to the conclusion that we are not required to encrypt this content on storage disks, because we are taking other measures (private network address space, local firewalls, periodic penetration tests, encryption on transport, etc.). But if we use tape as part of the storage solution, we will have to encrypt the tapes. We are replacing the DRS storage system this year so, in part because of this encryption requirement, we are considering an all-disk solution (up to now we have always included 2 tape copies along with disk storage).
  • No (other respondents)

If you do it what technique(s)/strategies do you use? Do you isolate encrypted content from non-encrypted content?

  • Duracloud, as a provider, is developing tools to accommodate a number of scenarios:
    • client does all encryption and key management themselves
    • client manages keys, but provides them to upload tooling to encrypt content prior to transit
    • client wants persisted content to be encrypted, but would rather not deal with key management or the encryption process
    • layered on top of these scenarios is the consideration of the contents' usability within the storage system; such as indexing or metadata extraction.
  • We use Tivoli Storage Manager (TSM) client-side encryption for our tape backups. We do not operate the tape backup system we use, so we want to isolate our data security from the tape backup system environment.
  • Encryption is done for security purposes - tracking is done with barcodes entered into the preservation metadata database, and vault system databases do not usually "refresh" with each other, so the 2 are not in sync. Again, not saying by any stretch this should or could be considered "best practices" - it's just what typically happens. Refresh cycles (while another topic) are also a problem in this environment, as there is typically less interest in subsequent updates of the data tapes that have not "recouped" their initial cost of creation. I guess I'm trying to convey that there is more interest in keeping the 10-20% of backups that have been profitable than in a consistent policy that treats all digital preservation files equally.

On decryption keys

  • Long-term secure preservation of the decryption keys themselves is typically raised as a concern, although personally I feel that solutions to this problem are straightforward, albeit complex. I view this as a compound problem that requires a combination of preservation storage principles and security principles to solve:
    • Preservation storage - There have to be multiple copies of the keys (existing framework of geographic distribution should facilitate this)
    • Security storage - The keys themselves obviously have to be secured in some way. This can be done with either additional encryption or physical security, i.e., a locking safe, or both. The key point is that this chain ultimately ends in human knowledge, i.e., people have to know secrets. The trick is ensuring that enough people know enough secrets to eventually lead to the encryption keys. One example of adding multiple layers to the scheme: office staff at multiple sites hold the combinations to safes containing the encrypted encryption keys, while a more privileged group of repository administrators knows the secret that decrypts them.
    • Geographic redundancy mitigates the risk of losing the decryption keys in a disaster.
  • Security risk can never be zero, but it can be brought into an acceptable range with a well-thought-out scheme built on existing digital preservation technology and policy frameworks.
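
The idea above that several people must each hold part of the secret can be sketched in code. This is a minimal illustration and not the layered safe-and-passphrase scheme described by the respondent: it swaps in a simple XOR split, where every share is required to rebuild the key. All names (split_key, combine_shares, data_key) are hypothetical.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key: bytes, holders: int) -> list[bytes]:
    """Split a key into `holders` shares; ALL shares are needed to rebuild it."""
    if holders < 2:
        raise ValueError("need at least two holders")
    # All but the last share are purely random.
    shares = [secrets.token_bytes(len(key)) for _ in range(holders - 1)]
    # The last share is the key XORed with every random share.
    last = reduce(xor_bytes, shares, key)
    return shares + [last]


def combine_shares(shares: list[bytes]) -> bytes:
    """Rebuild the key by XORing every share together."""
    return reduce(xor_bytes, shares)


# Hypothetical 256-bit decryption key, split among three key holders.
data_key = secrets.token_bytes(32)
shares = split_key(data_key, holders=3)
recovered = combine_shares(shares)
```

Because this split is all-or-nothing, losing any single share loses the key. Each share would therefore itself need multiple, geographically distributed copies, which is the preservation storage principle listed above. A threshold scheme, in which any k of n holders suffice, would relax that constraint but is not shown here.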

Do you know of any relevant studies/papers, etc. about this topic?

  • No (all respondents)... I think we've identified a real gap in the literature.

Potential Future Discussion Topics

Preservation Policies

  • number of copies
  • bit integrity check frequency
  • storage hierarchy

Emerging Storage Technology

  • data reduction/de-duplication
  • device encryption
  • cloud providers
  • WORM devices
  • federated clusters

Decision Factors

  • collection size
  • budget
  • development resources

Other Ideas