NDSA:Preservation Storage Topic 2: Compression
Statement of Purpose
The Infrastructure Working Group, in February 2012, initiated a series of open conversations on detailed aspects of preservation storage. These conversations are conducted over the listserv and each topic is discussed over the course of 2-3 weeks.
Topic 2: Compression
Questions
Data compression can decrease the cost of long-term preservation by reducing the amount of storage required. There are at least three types of compression to consider:
- file compression, using a file compression algorithm suited to the file type
- hardware compression, which usually means compression done by a tape drive as the data is written to tape
- disk compression, which is performed by many new storage appliances and uses a combination of compression and de-duplication
If there are other kinds of compression, please add them to this discussion.
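To make the third type above concrete, here is a minimal sketch of how de-duplication and compression can be combined. It is an illustration only, not how any particular storage appliance works: the fixed 4096-byte chunk size, the use of SHA-256 digests as chunk identifiers, and the function names `store` and `restore` are all assumptions made for this example.

```python
import hashlib
import zlib


def store(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one compressed copy per unique chunk."""
    chunks = {}   # SHA-256 digest -> zlib-compressed chunk (stored once)
    recipe = []   # ordered list of digests needed to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in chunks:
            # First time this content is seen: compress and store it.
            chunks[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return chunks, recipe


def restore(chunks, recipe) -> bytes:
    """Rebuild the original bytes by decompressing chunks in recipe order."""
    return b"".join(zlib.decompress(chunks[digest]) for digest in recipe)
```

One consequence worth noting for preservation: in a scheme like this, a single stored chunk may be shared by many objects, so damage to that one chunk affects every object whose recipe refers to it.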
For each of these types of compression:
- 1. Are you currently using this type of compression in your own archival storage (in the OAIS sense of long-term preservation storage)?
- 2. How do you feel about using this type of compression for archival storage? Is it a legitimate practice, or something that best practice would discourage?
- 3. What are the particular risks of this type of compression, if any, for preservation?
- 4. Are there any advantages to using this type of compression beyond reducing storage costs?
- 5. How do you trade off cost vs. risk?
Compressed formats are in general much more sensitive to data corruption than uncompressed formats. Due to the 'amplification' effect that compression has on data corruption, the percentage saving in storage space is often much less than the percentage increase in the amount of information that is affected by data corruption.
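The amplification effect can be explored with a small experiment. The sketch below flips a single bit in an uncompressed byte string and in its zlib-compressed form, then counts how many bytes of the original can no longer be recovered at the correct position. Everything here is an assumption chosen for illustration: zlib stands in for compression in general, the sample text is built from a made-up vocabulary, and the helper names (`make_sample`, `flip_bit`, `lenient_decompress`, `mean_damage`) are hypothetical. The exact numbers depend on the data and on where the corruption lands.

```python
import random
import zlib


def make_sample(n_words: int = 10000, seed: int = 0) -> bytes:
    """Build compressible, non-periodic text from a small made-up vocabulary."""
    vocab = ["archive", "bit", "checksum", "disk", "fixity", "format",
             "ingest", "media", "object", "replica", "storage", "tape"]
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


def flip_bit(data: bytes, index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit inverted in the byte at index."""
    out = bytearray(data)
    out[index] ^= 1 << bit
    return bytes(out)


def damaged_bytes(original: bytes, recovered: bytes) -> int:
    """Count positions of original that were not recovered correctly."""
    matching = sum(a == b for a, b in zip(original, recovered))
    return len(original) - matching


def lenient_decompress(stream: bytes) -> bytes:
    """Decompress one byte at a time, keeping whatever was produced before any error."""
    decoder = zlib.decompressobj()
    out = bytearray()
    try:
        for i in range(len(stream)):
            out += decoder.decompress(stream[i:i + 1])
        out += decoder.flush()
    except zlib.error:
        pass  # the stream became undecodable; keep the partial output
    return bytes(out)


def mean_damage(original: bytes, trials: int = 20):
    """Average damage from one flipped bit, uncompressed versus zlib-compressed."""
    packed = zlib.compress(original)
    raw_total = 0
    packed_total = 0
    for k in range(1, trials + 1):
        raw_pos = k * len(original) // (trials + 1)
        # Stay inside the deflate data: skip the 2-byte zlib header
        # and the 4-byte Adler-32 trailer.
        packed_pos = 2 + k * (len(packed) - 6) // (trials + 1)
        raw_total += damaged_bytes(original, flip_bit(original, raw_pos))
        recovered = lenient_decompress(flip_bit(packed, packed_pos))
        packed_total += damaged_bytes(original, recovered)
    return raw_total / trials, packed_total / trials


if __name__ == "__main__":
    sample = make_sample()
    raw_mean, packed_mean = mean_damage(sample)
    print("bytes saved by compression:", len(sample) - len(zlib.compress(sample)))
    print("mean bytes damaged, uncompressed:", raw_mean)
    print("mean bytes damaged, compressed:", packed_mean)
```

A flipped bit in the uncompressed copy damages exactly one byte. In the compressed copy, everything decoded after the damaged point is at risk, and a strict decoder such as `zlib.decompress` will normally reject the whole stream because the checksum no longer matches. This decoder is deliberately lenient so that the partial output can be measured.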
A Selected Bibliography
(Priscilla: I looked for best practices or other documents that addressed compression in the preservation context. Below are some snippets of what I found, addressing mainly file and hardware compression.)
- Case Western University Archives: Compression adds complexity to long-term preservation. Some compression techniques shed "redundant" information. As an example, JPEG removes information to reduce file size. The image might look fine on your current monitor, but as monitors improve, the lower quality of the image will be more obvious.
- Wright, Miller and Addis in http://www.prestoprime.org/docs/training/Cost_of_risk_RW.pdf
Not encoding, in particular not using compression, typically results in files that have minimal sensitivity to corruption. In this way, the choice not to use compression is a way to mitigate against loss.
- TNA on Image Compression in http://www.nationalarchives.gov.uk/documents/image_compression.pdf
It is recommended that algorithms should only be used in the circumstances for which they are most efficient. It is also strongly recommended that archival master versions of images should only be created and stored using lossless algorithms. The Intellectual Property Rights status of a compression algorithm is primarily an issue for developers of format specifications, and software encoders/decoders. However, the use of open, non-proprietary compression techniques is recommended for the purposes of sustainability.
- Howard Besser, quoted in http://digitalpreservationstrategies.blogspot.com/
Data is often compressed or "scrambled" to assist in its storage and/or protect its intellectual content. These compression and encryption algorithms are often developed by private organisations who will one day cease to support them. If this happens you're stuck between a rock and a hard place. If you don't want to get into legal trouble you are no longer able to read your data; and if you go ahead and "do the unwrapping yourself" it's quite possible you're breaking copyright law.
- NINCH Guide to Good Practice: http://www.nyu.edu/its/humanities/ninchguide/XIV/
A similar obsolescence problem will have to be addressed with the file formats and compression techniques you choose. Do not rely on proprietary file formats and compression techniques, which may not be supported in the future as the companies which produce them merge, go out of business or move on to new products. In the cultural heritage community, the de facto standard formats are uncompressed TIFF for images and PDF, ASCII (SGML/XML markup) and RTF for text. Migration to future versions of these formats is likely to be well-supported, since they are so widely used. Digital objects for preservation should not be stored in compressed or encrypted formats.
- PRESTO Centre, Threats to Data Integrity from Large-Scale Management Environments: http://www.prestocentre.org/library/resources/threats-data-integrity-use-large-scale-management-environments
Related Format Issues: JPEG2000
The LC format standards did an environmental scan of institutional policies around JPEG2000. While not exactly related to storage compression, format standards and their compression are a consideration in preservation infrastructure, so I (Jefferson) thought it would be good information to add here. The page is located here