NDSA:Preservation Storage Topic 2: Compression
Statement of Purpose
In February 2012, the Infrastructure Working Group initiated a series of open conversations on detailed aspects of preservation storage. These conversations are conducted over the listserv, with each topic discussed over the course of 2-3 weeks.
Topic 2: Compression
Questions
Data compression can decrease the cost of long-term preservation by reducing the amount of storage required. There are at least three types of compression to consider:
- file compression, using a file compression algorithm suited to the file type
- hardware compression, which usually means compression done by a tape drive as the data is written to tape
- disk compression, which is performed by many new storage appliances and uses a combination of compression and de-duplication (a toy sketch of the de-duplication idea follows below)
If there are other kinds of compression, please add them to this discussion.
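To make the de-duplication half of that combination concrete, here is a toy sketch in Python (all names invented) of content-addressed block storage: data is split into blocks, identical blocks are stored only once, and files are rebuilt from a recipe of block hashes. Real appliances pair this with compression and use far more sophisticated chunking; this only illustrates the principle.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking; real appliances often use variable-size chunks


def dedup_store(data, store):
    """Split data into blocks, keep each unique block once, return the recipe of hashes."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # an already-seen block costs no new space
        recipe.append(digest)
    return recipe


def rebuild(recipe, store):
    """Reassemble the original bytes from the stored blocks."""
    return b"".join(store[digest] for digest in recipe)


store = {}
original = b"A" * 20000 + b"B" * 20000 + b"A" * 20000  # repetitive data de-duplicates well
recipe = dedup_store(original, store)
assert rebuild(recipe, store) == original
print(f"{len(original)} logical bytes held in {sum(len(b) for b in store.values())} stored bytes")
```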
For each of these types of compression:
- 1. Are you currently using this type of compression in your own archival storage (in the OAIS sense of long-term preservation storage)?
- 2. How do you feel about using this type of compression for archival storage? Is this legitimate or something that Best Practice would discourage?
- 3. What are the particular risks of this type of compression, if any, for preservation?
- 4. Are there any advantages to using this type of compression beyond reducing storage costs?
- 5. How do you trade off cost vs. risk?
Compressed formats are in general much more sensitive to data corruption than uncompressed formats. Due to the 'amplification' effect that compression has on data corruption, the percentage saving in storage space is often much less than the percentage increase in the amount of information that is affected by data corruption.
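The amplification effect is easy to demonstrate with Python's standard zlib module (a minimal sketch; any deflate-based format behaves similarly): flipping one byte in uncompressed data damages exactly one byte, while the same flip inside the compressed stream typically renders the whole stream unrecoverable.

```python
import zlib

text = b"Uncompressed data localizes damage: one flipped byte garbles one byte.\n" * 100

# Flip a single byte in the raw copy: everything else stays readable.
raw = bytearray(text)
raw[50] ^= 0xFF
print(sum(a != b for a, b in zip(raw, text)), "byte(s) damaged in the raw copy")

# Flip a single byte in the middle of the compressed copy.
compressed = bytearray(zlib.compress(text))
compressed[len(compressed) // 2] ^= 0xFF
try:
    zlib.decompress(bytes(compressed))
except zlib.error as exc:
    print(f"compressed copy unrecoverable after one flipped byte: {exc}")
else:
    print("compressed copy decompressed, but its contents may be silently wrong")
```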
Response 1, Andrea Goethals
I'm not sure this question is easier than the encryption one ;)
I'll answer the easier part for us first. We don't use any compression on disk or tape. One of the reasons we chose our current storage technology is that, under the hood, it aggregates the content into tar files, a format we are relatively comfortable with. File compression is another story...
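As a rough picture of that kind of aggregation (an illustrative sketch only, not the actual storage system Andrea describes), Python's tarfile module can bundle files into a single uncompressed tar container that any tar implementation can list and unpack, with no decompression step to go wrong:

```python
import io
import tarfile

# Invented stand-ins for deposited files.
payloads = {"page-001.tif": b"fake image data", "metadata.xml": b"<mets/>"}

# Mode "w" writes a plain, uncompressed tar; "w:gz" would add gzip on top.
with tarfile.open("aggregate.tar", mode="w") as tar:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# The container remains byte-transparent: members are plainly listable.
with tarfile.open("aggregate.tar") as tar:
    print(tar.getnames())
```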
I think it's helpful to discuss the digitized content we're preserving separately from the born-digital content. I'll start with the digitized content. Within my department we don't digitize any content ourselves, but we do provide format guidelines and some format requirements. It boils down to the fact that we will accept content in any format, but we tell our customers that we have greater confidence in preserving uncompressed formats (I'm referring only to preservation copies, not access copies). There is only one compressed digitized format that we receive in quantity (JPEG2000 JP2), even though we still recommend uncompressed TIFF over JP2. The collection managers who choose to create JP2 rather than TIFF do it largely to reduce their storage fee (because they can use the JP2 as both the preservation and the access format). We know that this is a risky area for us: there still isn't a lot of software support for JPEG2000, and because it's a fairly complicated format we have seen many incorrect interpretations of the spec by software developers, leading to invalid files. That said, the future of JP2 is still unknown at this point, so there is a chance it could move toward wider acceptance and support. We are participating in an informal network of institutions and individuals who have a stake in seeing JPEG2000 succeed, to lessen our risk there.
For born-digital content we can't be as prescriptive as for digitized content (for obvious reasons). We're collecting Web content in any and all formats, many of which use compression. We are also collecting PDFs, which can contain compressed images, and our email archiving project will introduce email attachments in all kinds of (potentially compressed) formats. We are also receiving images in compressed formats that come straight from digital cameras. We are more likely to have compressed formats for born-digital content because often the only copies are the access copies, which tend to be compressed.
Those are the cases where we receive compressed files. We also intentionally introduce lossless compression for some of our content. As is common in Web archiving, we store our Web content in our repository in compressed form (using gzipped ARC files). In that case, we are doing what we think is the community standard for Web archives (except that the new standard is compressed WARC files instead of compressed ARC files). The tools that have been built, or are being built, for these types of files support the compression. We also compress our books digitized for the Google book project (ZIP files contain all the images and text for a particular book). In both the Web archive and the Google book case we compress because the content would take up too much storage space otherwise. Both GZIP and ZIP are well-known, widely supported lossless formats, so we are relatively comfortable with them.

There is one more case where we intentionally compress content: "opaque containers". We give our customers the option to deposit content in any format zipped up into what we call opaque containers. This is an option for people who want to save storage fees, only want bit-level preservation for now, and are willing to accept that the content won't receive the same technical characterization and won't have all the management functionality available for other content. The opaque container option has not seen much use because, thankfully, our customers really do want all the preservation and management services, but it is available.
Having described the three cases above where we intentionally compress content losslessly to save space and storage fees, there are some real disadvantages to having done so from a management perspective. If you're familiar with the object portion of the PREMIS data model (http://www.loc.gov/standards/premis/v2/premis-report-2-1.pdf), it includes objects, files and bitstreams. Currently our repository only describes content at the object and file level, so at the most granular level we can only characterize these ARC/GZIP or ZIP files at the container level, and not at the arguably more important level of the bitstreams within them. This is a real barrier to preservation planning and reporting. We do record a count of each MIME type within these containers, but this is a very crude description compared to what we have for other content. We plan to add bitstream support to our repository next year so that we can fully describe and manage the bitstreams, but I know there will always be a lag in the tools and infrastructure for what we can do with this content. I would not recommend this as a strategy for all of your content, but I think it makes sense for certain categories of content.
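For illustration, here is one plausible way to produce that kind of crude MIME-type count for a ZIP container (a sketch, not the repository's actual tooling; it guesses types from file extensions, whereas real characterization tools inspect content):

```python
import mimetypes
import zipfile
from collections import Counter


def mime_profile(container_path):
    """Count the members of a ZIP container by guessed MIME type."""
    counts = Counter()
    with zipfile.ZipFile(container_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            mime, _ = mimetypes.guess_type(info.filename)
            counts[mime or "application/octet-stream"] += 1
    return counts

# Hypothetical output for a digitized book container:
# Counter({'image/tiff': 412, 'text/plain': 412, 'application/xml': 1})
```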
A Selected Bibliography
(Priscilla: I looked for best practices and other documents that address compression in the preservation context. Below are some snippets of what I found, addressing mainly file and hardware compression; a short code sketch after the snippets illustrates what the "lossless" recommendations guarantee in practice.)
- Case Western Reserve University Archives: Compression adds complexity to long-term preservation. Some compression techniques shed "redundant" information. As an example, JPEG removes information to reduce file size. The image might look fine on your current monitor, but as monitors improve, the lower quality of the image will become more obvious.
- Wright, Miller and Addis in http://www.prestoprime.org/docs/training/Cost_of_risk_RW.pdf
Not encoding, in particular not using compression, typically results in files that have minimal sensitivity to corruption. In this way, the choice not to use compression is a way to mitigate loss.
- TNA on Image Compression in http://www.nationalarchives.gov.uk/documents/image_compression.pdf
It is recommended that algorithms should only be used in the circumstances for which they are most efficient. It is also strongly recommended that archival master versions of images should only be created and stored using lossless algorithms. The Intellectual Property Rights status of a compression algorithm is primarily an issue for developers of format specifications, and software encoders/decoders. However, the use of open, non-proprietary compression techniques is recommended for the purposes of sustainability.
- Howard Besser, quoted in http://digitalpreservationstrategies.blogspot.com/
Data is often compressed or "scrambled" to assist in its storage and/or to protect its intellectual content. These compression and encryption algorithms are often developed by private organisations who will one day cease to support them. If this happens you're stuck between a rock and a hard place: if you don't want to get into legal trouble, you are no longer able to read your data; and if you go ahead and "do the unwrapping yourself", it's quite possible you're breaking copyright law.
- NINCH Guide to Good Practice: http://www.nyu.edu/its/humanities/ninchguide/XIV/
A similar obsolescence problem will have to be addressed with the file formats and compression techniques you choose. Do not rely on proprietary file formats and compression techniques, which may not be supported in the future as the companies that produce them merge, go out of business or move on to new products. In the cultural heritage community, the de facto standard formats are uncompressed TIFF for images and PDF, ASCII (with SGML/XML markup) and RTF for text. Migration to future versions of these formats is likely to be well supported, since they are so widely used. Digital objects for preservation should not be stored in compressed or encrypted formats.
- PRESTO Centre, Threats to Data Integrity from Large-Scale Management Environments: http://www.prestocentre.org/library/resources/threats-data-integrity-use-large-scale-management-environments
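To make the repeated "lossless only" recommendation concrete, here is a minimal sketch (gzip stands in for any lossless codec; the data is invented) showing that a lossless round trip is bit-exact, so recorded fixity values still match after decompression:

```python
import gzip
import hashlib

# Stand-in bytes for an archival master; in practice these would be read from disk.
original = bytes(range(256)) * 4096

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# "Lossless" means the round trip is exact: the checksum is unchanged,
# which is what lets an archive keep verifying fixity after compression.
assert hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
print(f"original {len(original)} bytes, compressed {len(compressed)} bytes, fixity intact")
```

A lossy codec such as JPEG would fail this check by design, which is why the sources above reserve lossy compression for access copies at most.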
Related Format Issues: JPEG2000
The LC format standards group did an environmental scan of institutional policies around JPEG2000. While not exactly a storage compression issue, format standards and their compression are a consideration in preservation infrastructure, so I (Jefferson) thought it would be good information to add to this discussion.
NDSA:Preservation Storage Topic JPEG2000
Other Issues and Ideas around Compression
- "Vendor Dependence" with some algorithmic methods of compression
- Compression in dark archives vs. "active" archives
- Compression on online storage vs. offline storage