NDSA:Penn State
Penn State Response to Implementations of Large Scale Storage Architectures
- What is the particular preservation goal or challenge you need to accomplish? (for example, re-use, public access, internal access, legal mandate, etc.)
- Digital Library Technologies, a unit of Information Technology Services at Penn State, provides IT systems and services to support the teaching, research, and outreach mission of the Penn State University Libraries and the institution. A core goal of Digital Library Technologies is advancing a joint initiative with the Penn State University Libraries to build an institutional digital stewardship program. The program addresses extant and emerging digital content and asset management needs in areas such as digital library collections, scholarly communications, electronic records archiving, and e-science/e-research data management. Accordingly, we have numerous preservation goals including all of the examples listed.
- What large scale storage or cloud technologies are you using to meet that challenge? Further, which service providers or tools did you consider and how did you make your choice?
- We are not currently working with cloud providers. Our storage infrastructure is largely built on HP StorageWorks Enterprise Virtual Arrays (e.g., the 8400), which are virtualized Fibre Channel storage area networks.
- Specifically, what kind of materials are you preserving (text, data sets, images, moving images, web pages, etc.)
- Most of our materials now are image- and text-based, but we have bits and pieces of everything. We have initiatives under way that may bring in a large amount of electronic records (largely text-based but also web archives) and research data sets.
- How big is your collection? (In terms of number of objects and storage space required)
- Since our collections are currently spread across a minimum of four legacy delivery applications, we lack a good way to keep track of the number of objects. We estimate that our collection is a few dozen terabytes, including archival masters and delivery copies.
- What are your performance requirements?
- Not yet identified.
- What storage media have you elected to use? (Disk, Tape, etc)
- Tape is used only for backups, at the moment; we use hard disks for all of our storage.
- What do you think are the key advantages of the system you use?
- Very controllable architecture from the block level up; this has proven useful in the current design of our server configurations. Benefits include fine-grained control over access to data, RAID levels, and carving up of space; a homogeneous storage design; and high speeds/low latencies.
- What do you think are the key problems or disadvantages your system presents?
- Scalability: we are generally limited by hardware, OS, and filesystem constraints, which usually leads to silos. Scaling is possible in certain circumstances, but it then creates other problems, most notably with backups.
- Migration: mostly between storage hardware life cycles. Before we adopted some internal best practices, most systems had to be shut down during migration to new hardware.
- Transport mechanism: we are locked into Fibre Channel-based transport and protocols; this is only a problem when dark fiber cannot be run and/or FCIP isn't viable.
- What important principles informed your decision about the particular tool or service you chose to use?
- The initial setup and design was inherited from the previous generation of IT personnel.
- How frequently do you migrate from one system to another?
- Seemingly, every 6-12 months for a given server. With the introduction of SVSP and standard OS/filesystem choices (e.g., ZFS), we are reducing downtime for migrations to almost zero.
- What characteristics of the storage system(s) you use do you feel are particularly well-suited to long-term digital preservation? (High levels of redundancy/resiliency, internal checksumming capabilities, automated tape refresh, etc)
- Storage is only one layer of our approach to long-term digital preservation, and we tend to treat storage as something that should be relatively dumb, sufficiently fast, and swappable/replaceable.
- What functionality or processes have you developed to augment your storage systems in order to meet preservation goals? (Periodic checksum validation, limited human access or novel use of permissions schemes)
- We have just built a proof-of-concept services layer above the storage layer that provides functionality useful in the long-term preservation context, such as on-demand checksum validation, provenance event logging, and version control. We plan to build our prototype out soon, and it will include some other functionality (e.g. replication).
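The on-demand checksum validation described above could be sketched roughly as follows. This is a hypothetical illustration, not Penn State's actual implementation: it assumes a stored manifest mapping relative file paths to SHA-256 digests, and reports any files that are missing or whose current digest no longer matches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths in the manifest that fail validation.

    A path fails if the file is missing under `root` or its current
    SHA-256 digest differs from the recorded one.
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

In a production services layer, each validation run would presumably also write a provenance event (what was checked, when, and with what outcome) to the event log mentioned above.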
- Are there tough requirements for digital preservation, e.g. TRAC certification, that you wish were more readily handled by your storage system?
- None come to mind, though we will likely build out a production-ready version of our prototype with some TRAC guidelines in mind.