NDSA:IRODS
- What sort of use cases is your system designed to support? What doesn't this support?
- Share Data
- Build Digital Libraries
- Build a Preservation Environment
- Any group that needs to manage distributed data or to migrate data should consider iRODS.
- Is this a system or a prototype?
- It is definitely in production, although there is a separate prototype for NARA.
- Who is using it for a preservation use case?
- CDR and the Taiwan National Archive
- What preservation strategies would your system support?
- The principal strategy is the instantiation of a standard set of enforceable policies in a preservation archive.
- 120 policies have been identified to date. When the policies were identified and reviewed at an SAA workshop, there was a subset of 20 that at least 50% wanted. There is also a long tail of policies that at least one organization wanted.
- What issues are there of mismatched semantics across systems?
- An example is NCAR, whose mass storage from the 1960s understood only tape get/put. A disk cache had to be put in place on top of the tape for clients to interact with. The same applies to cloud services, which also deal in get/put and need a cache on top.
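The cache-on-top arrangement described above can be sketched roughly as follows. This is a minimal illustration, not iRODS code: `GetPutStore`, `InMemoryStore`, and `CachedStore` are hypothetical names, the cache is a plain dictionary standing in for a disk cache, and byte-range reads are assumed as the richer operation that a get/put back end cannot offer directly.

```python
from typing import Dict, Protocol


class GetPutStore(Protocol):
    """Hypothetical back end that only moves whole objects (tape, cloud bucket)."""

    def get(self, key: str) -> bytes: ...

    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a tape or cloud back end; counts how often it is read."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.get_calls = 0

    def get(self, key: str) -> bytes:
        self.get_calls += 1
        return self._objects[key]

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data


class CachedStore:
    """Write-through cache that offers byte-range reads on top of get/put."""

    def __init__(self, backend: GetPutStore) -> None:
        self._backend = backend
        self._cache: Dict[str, bytes] = {}  # assumption: a dict stands in for disk

    def put(self, key: str, data: bytes) -> None:
        # Write to the back end first, then keep a local copy.
        self._backend.put(key, data)
        self._cache[key] = data

    def read(self, key: str, offset: int, length: int) -> bytes:
        # Stage the whole object into the cache on first access only.
        if key not in self._cache:
            self._cache[key] = self._backend.get(key)
        return self._cache[key][offset:offset + length]


if __name__ == "__main__":
    tape = InMemoryStore()
    tape.put("run-001", b"hello world")
    store = CachedStore(tape)
    print(store.read("run-001", 0, 5), store.read("run-001", 6, 5))
    print("back-end reads:", tape.get_calls)
```

The point of the sketch is that the back end is read once per object, while clients work against the cache with whatever access pattern they need.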
- What is the base level of functionality to be a part of iRODS?
- This varies. There are specific functions for each local environment. What data processing needs are there? Where must they be run? etc.
- When managing large data collections, is distributed data integrity checking built into the system?
- Yes, at whichever locations the data is stored. You can create procedures for independent checking.
- What infrastructure do you rely on? AND What resources are required to support a solution implemented in your environment?
- Any operating system
- Up to 1 million files can run in a standalone instance.
- Over 1 million files, a distributed system is needed.
- The number of files is the primary gating factor for the database. There is use with Postgres, MySQL, and Oracle, but most use Postgres.
- What is the largest current installation?
- NASA, with 700 TB, 65-85 million files. A Particle data project in France has 2-3 PB.
- Why is the catalog in one central location?
- Efficiency. There are three models in use in various installations. NIH required a central catalog, which can have slaves. NAO required multiple chained into a catalog. A project in the UK is using multiple Grids chained together.
- Are there any moving image projects?
- Yes, Cinegrid. Also an ocean observation project with video observation files.
- How can the cloud environment impact digital preservation activities?
- There are no assertions about integrity or any properties. The system has to independently record and assert.
- Once a project reaches a certain amount of data, it should really work locally, not in the cloud. It can be more cost efficient/predictable if you expect to be running within local capacity.
- If we put data in your system today what systems and processes are in place so that we can get it back 50 years from now? (Take for granted a sophisticated audience that knows about multiple copies etc.)
- In 50 years, NONE of our current infrastructure components will still be in place.
- We need infrastructure independence - we have to be able to migrate based on policies, not a specific infrastructure, to a new infrastructure.
- To do that, we must know all previous versions of policies, and which are applied to which objects. That is potentially easiest with a policy-based system like iRODS.
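One way to picture the bookkeeping this implies is a ledger that keeps every policy version and records which versions were applied to which objects. The sketch below is illustrative only and does not reflect how iRODS stores its rules; `PolicyVersion`, `PolicyLedger`, and the example policy text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PolicyKey = Tuple[str, int]  # (policy name, version number)


@dataclass(frozen=True)
class PolicyVersion:
    name: str
    version: int
    rule: str  # free-text description of the rule (hypothetical)


@dataclass
class PolicyLedger:
    """Keeps all policy versions and which ones were applied to each object."""

    versions: Dict[PolicyKey, PolicyVersion] = field(default_factory=dict)
    applied: Dict[str, List[PolicyKey]] = field(default_factory=dict)

    def register(self, policy: PolicyVersion) -> None:
        key = (policy.name, policy.version)
        if key in self.versions:
            # Old versions are never overwritten, only superseded.
            raise ValueError(f"policy version already registered: {key}")
        self.versions[key] = policy

    def record_application(self, object_id: str, name: str, version: int) -> None:
        key = (name, version)
        if key not in self.versions:
            raise KeyError(f"unknown policy version: {key}")
        self.applied.setdefault(object_id, []).append(key)

    def history(self, object_id: str) -> List[PolicyVersion]:
        """Return the policy versions applied to an object, in order."""
        return [self.versions[key] for key in self.applied.get(object_id, [])]


if __name__ == "__main__":
    ledger = PolicyLedger()
    ledger.register(PolicyVersion("replication", 1, "keep 2 copies"))
    ledger.register(PolicyVersion("replication", 2, "keep 3 copies"))
    ledger.record_application("obj-1", "replication", 1)
    ledger.record_application("obj-1", "replication", 2)
    for p in ledger.history("obj-1"):
        print(p.name, p.version, p.rule)
```

With a record like this, a migration to new infrastructure can be driven by the policies that governed each object rather than by the details of the old storage system.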
- What cloud services are supported so far?
- S3 and EC2. It can also be run in a virtualized environment, such as the VCL project at NCSU.
- What about distributed checksums?
- Checksums can be verified at each location and/or compared across multiple copies. In a local environment, the check is made against the central catalog.
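The two kinds of check mentioned above (each copy against the catalog, and copies against each other) can be sketched as below. This is a generic illustration rather than the iRODS implementation: the function names are hypothetical, replicas are held in memory as bytes, and SHA-256 is an assumed choice of digest.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Hex digest of the data (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(catalog_checksum: str, replicas: Dict[str, bytes]) -> List[str]:
    """Return the locations whose copy does not match the catalog value."""
    return sorted(
        location
        for location, data in replicas.items()
        if checksum(data) != catalog_checksum
    )


def replicas_agree(replicas: Dict[str, bytes]) -> bool:
    """Compare copies with each other, without consulting the catalog."""
    return len({checksum(data) for data in replicas.values()}) <= 1


if __name__ == "__main__":
    original = b"payload"
    catalog_value = checksum(original)  # value recorded at ingest
    copies = {"siteA": original, "siteB": original, "siteC": b"payl0ad"}
    print("mismatched:", verify_replicas(catalog_value, copies))
    print("all copies agree:", replicas_agree(copies))
```

Checking against the catalog identifies which copy is bad; comparing copies with each other only shows that they differ, which is why the catalog value is the more useful reference when it is available.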
- Are there any privacy use cases to be aware of?
- They have worked with a group on IRB issues, and can implement policies against a local IRB catalog. That data is NOT stored in the central catalog.
- Anything else?
- The next release is February 2011.