What is the particular preservation goal or challenge you need to accomplish? (for example, re-use, public access, internal access, legal mandate, etc.)
The mission of the Florida Digital Archive is to provide a cost-effective, long-term preservation repository for digital materials in support of teaching and learning, scholarship, and research in the state of Florida.
What large scale storage or cloud technologies are you using to meet that challenge? Further, why did you choose these particular technologies?
We are using IBM DS4800 disk storage (a mid-range enterprise storage system) and Tivoli Storage Manager for tape. We are mainly using these technologies because our sysadmin staff are comfortable with them. We do not feel we have enough control over cloud storage to use it for the digital archive, although we would consider using it for other purposes such as access to digital collections. We would be comfortable using storage specifically managed to a high preservation standard, such as Chronopolis.
Specifically, what kind of materials are you preserving (text, data sets, images, moving images, web pages, etc.)?
We have all sorts of materials, but the FDA is best suited for text, image, audio and video.
How big is your collection? (In terms of number of objects and storage space required)
We have about 300,000 stored AIPs, comprising just over 30 million files. One copy of the archival store is 82 TB, growing by 2-4 TB per month.
What are your performance requirements? Further, why are these your particular requirements?
The FDA is a dark archive; there is no public access to it, and no real-time access. Performance requirements are based on being able to process the amount of incoming materials through Ingest without developing a backlog. At present this requires us to be able to ingest peaks of 5-6 TB in a month, although normally we run about 3 TB.
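As a rough, back-of-the-envelope illustration (our own arithmetic, not a figure from FDA documentation), a peak month of about 6 TB corresponds to a fairly modest sustained data rate, which underlines that the real constraint is the processing pipeline rather than raw throughput:

```python
# Illustrative only: what sustained throughput does a peak ingest
# month of ~6 TB imply? The 6 TB figure comes from the answer above;
# the rest is simple arithmetic.
peak_tb_per_month = 6
seconds_per_month = 30 * 24 * 3600      # ~2.59 million seconds
bytes_per_tb = 10**12

sustained_mb_per_s = peak_tb_per_month * bytes_per_tb / seconds_per_month / 10**6
print(f"Sustained ingest rate: {sustained_mb_per_s:.1f} MB/s")   # ~2.3 MB/s
```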
What storage media have you elected to use? (Disk, Tape, etc) Further, why did you choose these particular media?
We have changed this several times. When we first brought up the FDA in 2005, we used tape exclusively. We wrote one copy via Tivoli to a robotic tape library in Gainesville and another copy over the Florida Lambda Rail to a robotic library in Tallahassee. Tape media is much cheaper than disk for the same amount of storage, and tape uses less electricity. However, the logistical problems were constant and used a lot of staff time. We found we could not do regular fixity checks on the tapes without pushing them dangerously close to their MTF. Writes were very slow, tapes jammed in the drives, and we had other problems too numerous to mention. We then moved to all disk, again writing one master to Gainesville and one to Tallahassee. Each of the disk copies was backed up to tape by Tivoli. That was lovely, but as the archive grew it became prohibitively expensive to keep adding disk. We finally switched to one tape master in Gainesville and a disk master in Tallahassee. We have an elaborate process for using the tapes, staging both reads and writes to temporary disk pools to prevent wearing out the media.
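As a minimal sketch of that staging idea (our own illustration, not FDA code; the pool path, flush threshold, and tape-writer call are hypothetical stand-ins), writes accumulate on a disk pool and go to tape in large batches, so the drives see a few long sequential passes rather than many small mounts:

```python
import shutil
from pathlib import Path

# Hypothetical write-staging sketch: AIPs land in a disk pool and are
# migrated to tape in bulk, keeping tape mounts infrequent and sequential.
STAGING_POOL = Path("/staging/tape-pool")     # assumed pool location
FLUSH_THRESHOLD_BYTES = 500 * 10**9           # assumed 500 GB batch size

def write_to_tape(path: Path) -> None:
    """Placeholder for the site's actual tape-manager write call."""
    ...

def pool_size() -> int:
    return sum(f.stat().st_size for f in STAGING_POOL.iterdir() if f.is_file())

def flush_pool_to_tape() -> None:
    """Hand the accumulated batch to the tape manager in one pass."""
    for f in sorted(STAGING_POOL.iterdir()):
        write_to_tape(f)
        f.unlink()

def stage_write(aip_path: Path) -> None:
    """Copy an AIP into the disk pool instead of writing tape directly."""
    shutil.copy2(aip_path, STAGING_POOL / aip_path.name)
    if pool_size() >= FLUSH_THRESHOLD_BYTES:
        flush_pool_to_tape()
```

Reads can be staged the same way in reverse: recall a batch of AIPs from tape to the disk pool once, then serve all requests from disk.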
What do you think are the key advantages of the system you use?
It balances the ease of use of disk with the relative economy of tape, and it still gives us total control over our own storage.
What do you think are the key problems or disadvantages your system presents?
Storage management is probably the most labor-intensive part of running the Florida Digital Archive. It takes up the time of FDA operators, FDA programmers, and FCLA system administrators.
What important principles informed your decision about the particular tool or service you chose to use?
We felt we had to know exactly what device an AIP was stored on, so we know the age of the device and when to replace it. We felt we had to have every archival master be directly and independently addressable (which made us jump through hoops to subvert Tivoli, which wants to keep track of this for you).
How frequently do you migrate from one system to another? Further, what is it that prompts you to make these migrations?
See the answer to #6, which is about our storage migrations. We have also migrated from our first generation repository system (DAITSS) to a second-generation system (DAITSS 2). DAITSS 2 provides a higher level of abstraction for accessing storage, handles fixity and integrity checking, and optimizes use of space.
What characteristics of the storage system(s) you use do you feel are particularly well-suited to long-term digital preservation? (High levels of redundancy/resiliency, internal checksumming capabilities, automated tape refresh, etc)
The DS4800 is no longer being sold, so we will eventually have to move to something else. In general these mid-range enterprise-level storage systems offer high availability (full redundancy of controllers, power, RAID 5, pre-failure analysis, etc.). The disk and tape drives are both dually attached, so if anything in the middle fails the system still works. In Gainesville we have a Tivoli copypool backup of the data.
What functionality or processes have you developed to augment your storage systems in order to meet preservation goals? (Periodic checksum validation, limited human access or novel use of permissions schemes)
Our preservation repository system, DAITSS, does periodic fixity and integrity checks. Each storage silo keeps a database of checksum information for the packages stored on that silo, and package fixity is checked on an ongoing basis. An independent program gathers the most recent silo-maintained checksums for all master copies of a package and compares them to each other and to the package checksum stored in the DAITSS database; if all copies match each other and the database checksum, a timestamped fixity event is written. We also have many safeguards to prevent unauthorized access to storage, and to limit authorized access to the DAITSS system itself, so an individual (even a DAITSS programmer or operator) can't go "under the covers" and alter storage without going through the application.
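A minimal sketch of that comparison logic is below (our own illustration, not the DAITSS code; the silo checksum map and database value are stand-ins for the real data sources):

```python
from datetime import datetime, timezone
from typing import Dict, Optional

def check_package_fixity(package_id: str,
                         silo_checksums: Dict[str, str],
                         db_checksum: str) -> Optional[dict]:
    """Compare the latest silo-reported checksums for every master copy of a
    package against each other and against the checksum recorded at ingest.

    silo_checksums maps silo name -> most recent checksum reported by that
    silo; db_checksum is the value stored in the repository database. Returns
    a timestamped fixity event on success, or None so a mismatch can be
    flagged for investigation.
    """
    values = set(silo_checksums.values()) | {db_checksum}
    if len(values) != 1:
        return None   # at least one copy disagrees with the others
    return {
        "package": package_id,
        "event": "fixity check",
        "outcome": "success",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```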
Are there tough requirements for digital preservation, e.g. TRAC certification, that you wish were more readily handled by your storage system?
I think that the way we handle storage would pass a TRAC-type audit. However, I wish the TRAC requirements were clearer. For example, can a trustworthy repository outsource storage management to a cloud or centrally run data center, and if so, what information must the provider supply to demonstrate that preservation requirements are met?
Regarding storage systems, I would wish for a tape manager that was more transparent (one that didn't try to hide all the details and do everything itself) and a disk system that required less human intervention to manage and expand. Just doing FSCKs on this quantity of disk is a production.