NDSA:IRODS

What sort of use cases is your system designed to support? What doesn't this support?
- Share Data
- Build Digital Libraries
- Build a Preservation Environment
- Any group that needs to manage distributed data or to migrate data should consider iRODS.
Is this a system or a prototype?
- It is definitely in production, although there is a separate prototype for NARA.
Who is using it for a preservation use case?
- CDR and the Taiwan National Archive
What preservation strategies would your system support?
- The principal strategy is the instantiation of a standard set of enforceable policies in a preservation archive.
- 120 policies have been identified to date. In identifying and reviewing the policies at an SAA workshop, there was a subset of 20 that at least 50% wanted. There is a long tail of policies that at least one organization wanted.
What issues are there of mismatched semantics across systems?
- An example is NCAR, with mass storage from the 1960s that understood tape get/put. A disk cache had to be put in place on top of tape to interact with. It's the same with cloud services, which also deal in get/put and need a cache on top.
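
The cache-on-top-of-get/put arrangement can be sketched roughly as below. This is an illustration only, not iRODS code: `GetPutStore` and `DiskCache` are hypothetical names, the in-memory store stands in for a tape archive or cloud bucket, and object names are assumed to be flat (no directories).

```python
import os
import tempfile


class GetPutStore:
    """Hypothetical stand-in for tape or cloud storage: whole-object get/put only."""

    def __init__(self):
        self._objects = {}
        self.get_calls = 0  # counts how often the slow back end is hit

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        self.get_calls += 1
        return self._objects[name]


class DiskCache:
    """Stages whole objects on local disk so callers can do partial reads."""

    def __init__(self, store, cache_dir):
        self.store = store
        self.cache_dir = cache_dir

    def _stage(self, name):
        # Fetch the whole object once; later reads are served from disk.
        path = os.path.join(self.cache_dir, name)
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(self.store.get(name))
        return path

    def read(self, name, offset, length):
        with open(self._stage(name), "rb") as f:
            f.seek(offset)
            return f.read(length)


store = GetPutStore()
store.put("obs.dat", b"0123456789")
cache = DiskCache(store, tempfile.mkdtemp())
first = cache.read("obs.dat", 2, 3)
second = cache.read("obs.dat", 5, 2)
```

Both reads are served from the staged copy, so the back end sees a single get even though the caller performs two partial reads.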
What is the base level of functionality to be a part of iRODS?
- This varies. There are specific functions for each local environment. What data processing needs are there? Where must they be run? etc.
When managing large data collections, is distributed data integrity checking built into the system?
- Yes, at whichever locations the data is stored. You can create procedures for independent checking.
What infrastructure do you rely on? AND What resources are required to support a solution implemented in your environment?
- Any operating system
- Up to 1 million files can run in a standalone instance
- Over 1 million files, a distributed system is needed.
- The number of files is the primary gating factor for the database. There is use with Postgres, MySQL, and Oracle, but most use Postgres.
What is the largest current installation?
- NASA, with 700 TB, 65-85 million files. A Particle data project in France has 2-3 PB.
Why is the catalog in one central location?
- Efficiency. There are three models in use in various installations. NIH required a central catalog, which can have slaves. NAO required multiple chained into a catalog. A project in the UK is using multiple Grids chained together.
Are there any moving image projects?
- Yes, Cinegrid. Also an ocean observation project with video observation files.
How can the cloud environment impact digital preservation activities?
- Cloud services make no assertions about integrity or any other properties. The system has to record and assert these independently.
- Once a project reaches a certain amount of data, it should really work locally, not in the cloud. It can be more cost efficient/predictable if you expect to be running within local capacity.
If we put data in your system today what systems and processes are in place so that we can get it back 50 years from now? (Take for granted a sophisticated audience that knows about multiple copies etc.)
- In 50 years, NONE of our current infrastructure components will still be in place.
- We need infrastructure independence: migration to a new infrastructure has to be driven by policies, not tied to a specific infrastructure.
- To do that, we must know all previous versions of policies, and which are applied to which objects. That is potentially easiest with a policy-based system like iRODS.
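
As a rough illustration of what "knowing all previous versions of policies, and which are applied to which objects" could involve, here is a minimal sketch. It does not reflect how iRODS actually stores rules; `PolicyLedger`, `PolicyVersion`, and the example policy text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PolicyVersion:
    name: str
    version: int
    rule: str


@dataclass
class PolicyLedger:
    # (name, version) -> PolicyVersion; old versions are never overwritten
    versions: dict = field(default_factory=dict)
    # object_id -> ordered list of (name, version) keys applied to it
    applied: dict = field(default_factory=dict)

    def register(self, name, rule):
        """Store a new version of a policy and return its version number."""
        latest = max((v for (n, v) in self.versions if n == name), default=0)
        version = latest + 1
        self.versions[(name, version)] = PolicyVersion(name, version, rule)
        return version

    def apply(self, object_id, name, version):
        """Record that a specific policy version was applied to an object."""
        if (name, version) not in self.versions:
            raise KeyError(f"unknown policy {name} v{version}")
        self.applied.setdefault(object_id, []).append((name, version))

    def history(self, object_id):
        """Return every policy version ever applied to the object, in order."""
        return [self.versions[key] for key in self.applied.get(object_id, [])]


ledger = PolicyLedger()
v1 = ledger.register("replication", "keep 2 copies")
ledger.apply("obj-1", "replication", v1)
v2 = ledger.register("replication", "keep 3 copies")
ledger.apply("obj-1", "replication", v2)
ledger.apply("obj-2", "replication", v2)
```

Because superseded versions stay in the ledger, a later migration can replay or audit exactly which rules governed each object, independent of the storage underneath.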
What cloud services are supported so far?
- S3 and EC2. It can also be run in a virtualized environment, such as the VCL project at NCSU.
What about distributed checksums?
- Checksums can be verified at each location and/or compared across multiple copies. In a local environment, the check is made against the central catalog.
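
The two kinds of check described above (each copy against the catalog value, and copies against one another) might look roughly like this. It is a sketch under assumptions: the function names and site labels are hypothetical, replicas are held in memory for brevity, and SHA-256 is used only as an example digest, not necessarily the one iRODS uses.

```python
import hashlib


def checksum(data):
    """Hex digest of a replica's bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(catalog_checksum, replicas):
    """Check each replica against the value recorded in the catalog."""
    return {
        location: checksum(data) == catalog_checksum
        for location, data in replicas.items()
    }


def replicas_agree(replicas):
    """Compare replicas with one another, without consulting the catalog."""
    return len({checksum(data) for data in replicas.values()}) <= 1


original = b"archival record"
catalog_value = checksum(original)
replicas = {
    "site-a": original,
    "site-b": original,
    "site-c": b"archival recorD",  # simulated corruption
}
report = verify_replicas(catalog_value, replicas)
```

Here `report` flags only the corrupted copy at `site-c`, and `replicas_agree(replicas)` returns `False` because the copies no longer match each other.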
Are there any privacy use cases to be aware of?
- They have worked with a group on IRB issues, and can implement policies against a local IRB catalog. That data is NOT stored in the central catalog.
Anything else?
- The next release is February 2011.