NDSA:IRODS
Revision as of 15:04, 5 April 2011
- Question: What sort of use cases is your system designed to support? What doesn't this support?
- Share Data
- Build Digital Libraries
- Build a Preservation Environment
- Any group that needs to manage distributed data or to migrate data should consider iRODS.
- Question: Is this a system or a prototype?
- It is definitely in production, although there is a separate prototype for NARA.
- Question: Who is using it for a preservation use case?
- CDR and the Taiwan National Archive
- What preservation strategies would your system support?
- The principal strategy is the instantiation of a standard set of enforceable policies in a preservation archive.
- 120 policies have been identified to date. In identifying and reviewing the policies at an SAA workshop, there was a subset of 20 that at least 50% wanted. There is a long tail of policies that at least 1 organization wanted.
- What issues are there of mismatched semantics across systems?
- An example is NCAR, with mass storage from the 1960s that understood tape get/put. A disk cache had to be put in place on top of tape to interact with. It's the same with cloud services, which also deal in get/put and need a cache on top.
- Question: What is the base level of functionality to be a part of iRODS?
- This varies. There are specific functions for each local environment. What data processing needs are there? Where must they be run? etc.
- Question: When managing large data collections, is distributed data integrity checking built into the system?
- Yes, at whichever locations the data is stored. You can create procedures for independent checking.
- What infrastructure do you rely on? AND What resources are required to support a solution implemented in your environment?
- Any operating system
- Up to 1 million files can run in a standalone instance.
- Over 1 million files, a distributed system is needed.
- The number of files is the primary gating factor for the database. There is use with Postgres, MySQL, and Oracle, but most use Postgres.
- What is the largest current installation?
- NASA, with 700 TB and 65-85 million files. A particle data project in France has 2-3 PB.
- Why is the catalog in one central location?
- Efficiency. There are three models in use in various installations. NIH required a central catalog, which can have slaves. NAO required multiple chained into a catalog. A project in the UK is using multiple Grids chained together.
- Are there any moving image projects?
- Yes, Cinegrid. Also an ocean observation project with video observation files.
- How can the cloud environment impact digital preservation activities?
- Cloud services make no assertions about integrity or any other properties. The system has to record and assert them independently.
- Once a project reaches a certain amount of data, it should really work locally, not in the cloud. It can be more cost-efficient and predictable if you expect to be running within local capacity.
- If we put data in your system today what systems and processes are in place so that we can get it back 50 years from now? (Take for granted a sophisticated audience that knows about multiple copies etc.)
- In 50 years, NONE of our current infrastructure components will still be in place.
- We need infrastructure independence - we have to be able to migrate based on policies, not a specific infrastructure, to a new infrastructure.
- To do that, we must know all previous versions of policies, and which are applied to which objects. That is potentially easiest with a policy-based system like iRODS.
- What cloud services are supported so far?
- S3 and EC2. It can also be run in a virtualized environment, such as the VCL project at NCSU.
- What about distributed checksums?
- Checksums can be checked at each location and/or compared across multiple copies. In a local environment, it checks against the central catalog.
- Are there any privacy use cases to be aware of?
- They have worked with a group on IRB issues, and can implement policies against a local IRB catalog. That data is NOT stored in the central catalog.
- Anything else?
- The next release is February 2011.
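
The distributed checksum answer above (check each copy, compare across copies, and check against the central catalog) can be illustrated with a minimal sketch. This is not iRODS code or its API: the function names, the site labels, and the use of SHA-256 are all assumptions made for illustration, and the replicas are plain in-memory byte strings rather than stored objects.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a replica's bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(catalog_checksum: str, replicas: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Check every replica against the checksum recorded in the catalog.

    `replicas` maps a hypothetical site name to the bytes held at that site.
    """
    computed = {site: sha256_hex(content) for site, content in replicas.items()}
    return {
        "ok": sorted(s for s, c in computed.items() if c == catalog_checksum),
        "corrupt": sorted(s for s, c in computed.items() if c != catalog_checksum),
    }


def replicas_agree(replicas: Dict[str, bytes]) -> bool:
    """Compare copies with each other, without consulting any catalog."""
    return len({sha256_hex(content) for content in replicas.values()}) <= 1


if __name__ == "__main__":
    original = b"preserved object"
    catalog_checksum = sha256_hex(original)  # value recorded at ingest
    copies = {
        "site_a": original,
        "site_b": original,
        "site_c": b"preserved 0bject",  # simulated silent corruption
    }
    print(audit_replicas(catalog_checksum, copies))
    print(replicas_agree(copies))
```

The two functions mirror the two modes in the notes: `audit_replicas` checks against a catalog value, while `replicas_agree` only compares copies with one another, which detects disagreement but cannot say which copy is the good one.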