NDSA 2014 National Agenda Outline
Draft Outline
Introduction
a. Description of the National Agenda for Digital Stewardship
i. The document is intended to inspire the planning of digital preservation work and reflects the observations of the joint leadership group. It also evaluates the state of digital preservation activity and identifies key emerging issues for the year
ii. The document is not intended to be prescriptive or a directive to working groups, nor is it intended to replace any organizational efforts, planning, goals, or opinions.
iii. Hoped-for impact
b. Description of the NDSA, NDSA goals, and how the 2014 Agenda furthers those goals (i.e., informs and inspires individual, working group, and organizational work plans)
c. Intended audience: NDSA members and the wider digital preservation community
d. Authored by the joint leadership group
Section topics
Trends in Digital Content
Electronic Records
Electronic records, and the loss of the underlying information they contain, pose a significant threat to the American memory. Whether it is an electronic diary, email correspondence, or documentation of government transactions, all of these records are at risk of disappearing without thoughtful action to preserve important information. Preserving electronic records efficiently and cost-effectively remains a tremendous challenge. Culling through the volume of records generated and held by individuals and institutions in electronic format requires changes to traditional paper-based procedures. Rather than relying on file clerks to organize and store information, the information creator, each of us, will be responsible for properly managing his or her own electronic records. Education and proper infrastructure will be critical in teaching the public about the difficulties of long-term electronic preservation and how to properly save important materials.
Research Data
Curating digital research data illustrates some of the most acute challenges of digital content. The sheer scale of research data represents a daunting curation task. With new scientific instrumentation being developed and the growing use of computer simulations, a research team can generate many terabytes of data per day. Data curators face managing at the petabyte scale (a petabyte equals 1,000 terabytes) and well beyond. Scientific fields such as particle physics, with its collider data, and astronomy, with its sky surveys, as well as research fields and methods such as bioinformatics, crystallography, and engineering design, generate massive amounts of digital data. Large-scale digitized content created by initiatives like the Google Books project poses similar challenges.

Digital research data are complex objects to curate. They are very heterogeneous, ranging from numeric and image-based data to text, geospatial, and other forms. Many different information standards are used (and not used), along with many different approaches to information structure (e.g., XML-structured documents vs. fixed image and textual file formats). The research communities that produce data are equally diverse; their data management practices vary greatly within a discipline as well as between disciplines. There can also be commercial interests in the data and associated data practices.

Perhaps the overriding challenge across all aspects of digital research data is cost. Domain researchers, technologists, information scientists, and policymakers are searching for sustainable economic models with the ability to accurately predict costs and to balance them across the lifecycle (e.g., costs for ingest, archival management, and dissemination) and through federated inter-institutional repository systems. There is no one-size-fits-all approach to resolving the management challenges of research data. A path forward would be to galvanize the digital preservation/curation community around these four data challenges (scale, complexity, research communities' practices, and costs), study the issues in more depth, and begin recommending new solutions.
Web and Social Media
While cultural heritage organizations and others have been preserving web content since 1996, challenges continue as websites become more complex and the scale of the web continues to grow. Crawlers used to collect content, as well as access tools used to render web archives, struggle to keep up with the explosion of ever more complex technologies: multimedia, mashups, the deep web, databases, and the increasing prevalence of heavily scripted navigation that does not prevent collection of the data but makes replay nearly impossible without changes to the browser configuration of visitors to the archive. More and more content published and created on the web cannot be preserved using available tools.
The International Internet Preservation Consortium (netpreserve.org) developed the Heritrix web crawler and is working to build a community to stabilize, improve, and support this open source tool in the future. Broader participation by web archivists outside the IIPC is critical. Development and exploration of improved access tools, including data mining tools for large datasets of web archives, are also needed. Full-text indexing of web archives continues to challenge researchers and the community of web archivists, particularly as archives expand to multi-terabyte and petabyte scale.
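Web archives produced by crawlers such as Heritrix are typically stored as WARC container files, and much of the indexing and data mining work described above begins by iterating over those records. The following is a minimal, illustrative sketch, not a production tool and not part of the Agenda itself, of walking the records of an uncompressed WARC file in Python and listing the captured URIs; the file name is a placeholder, and real workflows would rely on dedicated WARC libraries and CDX or full-text indexes.

```python
# Minimal sketch: iterate over records in an uncompressed WARC file and list
# captured URIs. "example.warc" is a placeholder path.

def iter_warc_records(path):
    """Yield (headers, payload) pairs from an uncompressed WARC file."""
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break                          # end of file
            if not line.startswith(b"WARC/"):
                continue                       # skip blank separator lines
            headers = {}
            while True:
                header_line = f.readline().rstrip(b"\r\n")
                if not header_line:            # blank line ends the headers
                    break
                name, _, value = header_line.partition(b":")
                headers[name.strip().decode()] = value.strip().decode()
            payload = f.read(int(headers.get("Content-Length", 0)))
            yield headers, payload

if __name__ == "__main__":
    for headers, _ in iter_warc_records("example.warc"):
        if headers.get("WARC-Type") == "response":
            print(headers.get("WARC-Date"), headers.get("WARC-Target-URI"))
```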
Content that organizations and individuals increasingly publish on social media is also challenging to preserve: the services hosting this content do not have preservation as part of their business model, and changes in how they serve up content can disrupt the preservation process. Large social media aggregation sites change their implementations, on average, every three to six weeks. This makes it virtually impossible to keep pace with archival-quality capture of these resources without direct access to the site feeds, which are not available to most cultural heritage institutions, even for a fee.
Tools developed in recent years, primarily to meet business compliance requirements, can capture more of this type of material at small scale. While they represent exciting advances for web archiving, these technologies have not yet translated into open source tools that scale to the needs of cultural heritage institutions and others collecting large amounts of data.
Motion Picture Film and Video
[rough notes from Carolyn - need a writeup] The challenges are outlined here: http://www.digitalpreservation.gov/series/challenge/DigitizationGuidelinesPart3.html
Digital preservation and stewardship of motion picture film and video presents a multitude of challenges. There is a lack of standards for preservation-quality reformatting, and a host of issues arise from producing such large files: not only storing them, but also being able to play them back. There is also the clash, or potential synergy, between the movie industry and cultural heritage institutions.
Digital Content Stored on Obsolete or Deteriorating Media
As more archives and repositories come to terms with managing born digital content, there will be a growing need for the management of disk images. The first rule for managing born digital content is to remove a copy of all the digital files from the physical media. This means transferring content from hard drives, CDs, DVDs, floppies, zip disks, etc., often working with media that is obsolete (i.e., older and no longer in use). The transferred materials can be either forensic or logical disk images of the physical media. The main challenge in this first step is that the organization may not have the means to adequately process the digital files right away, which then requires rights-sensitive and potentially very large amounts of storage until the materials can be appropriately treated and made available. As hard drives get larger and the storage capacity of an individual can easily reach terabytes, this will pose a significant long-term challenge for digital preservation, particularly at smaller organizations. Additionally, with obsolete media it may be difficult to identify and obtain the working drives or drivers needed to access the data, and with older files it can be difficult to obtain the software needed to open and view files as originally written.
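As an illustration of the first step described above, the sketch below copies the raw bytes of a physical medium into a disk image and records a fixity checksum alongside it. It assumes a Unix-like system where the medium appears as a device file; the device and file names are placeholders, and real workflows would typically use forensic imaging tools (such as those packaged with BitCurator) rather than a hand-rolled script.

```python
# Minimal sketch: make a raw (logical) image of an obsolete medium and record
# a SHA-256 fixity value so future audits can verify the copy.

import hashlib

SOURCE_DEVICE = "/dev/sdb"          # placeholder: the obsolete medium
IMAGE_PATH = "floppy-001.img"       # placeholder: disk image to create
CHUNK = 1024 * 1024                 # read 1 MiB at a time

sha256 = hashlib.sha256()
with open(SOURCE_DEVICE, "rb") as src, open(IMAGE_PATH, "wb") as img:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        img.write(block)
        sha256.update(block)

# Store the checksum alongside the image for later fixity checks.
with open(IMAGE_PATH + ".sha256", "w") as f:
    f.write(f"{sha256.hexdigest()}  {IMAGE_PATH}\n")
```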
Start input from S&P Working Group on trends in digital content
- Web archiving
- Research data
- Big data
- Computational consumption of archives
- How do you connect annotations to content? Should we preserve those connections?
- How do we provide access with appropriate limits (government classification, copyright restrictions, donor agreements, licenses, human subject research restrictions)? Rights metadata standards?
- Compound, complex objects
- Dynamic content, integrating resources
- Not just documents (video, digital art / new media, etc.)
- Preservation of social media
- How to connect related publications (within and between repositories)
- Findability and discoverability of content
- Accessibility of digital content (e.g., usable via screen reader)
- Accessibility of data sets
- In the context of open access requirements / mandates / etc.
End input from S&P Working Group on trends in digital content
Research Priorities
The Research Priorities section focuses on two distinct aspects of research: the long-term preservation of research data (e-science outputs, data sets, and so forth), and the need for research on digital preservation activities themselves.
Research Data
[EXAMPLE] Education Workforce Development Research: Sentence to paragraph description with rationale for including the topic in the 2014 National Agenda. Recommendation for action included if relevant.
Research Related to Digital Preservation Practices
Applied Research
In the near term, there are specific areas of applied research around digital preservation lifecycle issues that need attention. Currently there are limited models for estimating the cost of ongoing storage of digital content. Cost estimation models need to be robust and flexible; different approaches should be explored and existing models compared, with emphasis on reproducibility of results. Auditing models also need to be strengthened and further developed. The SafeArchive system and other bit-level auditing practices could be connected to the NDSA Levels of Preservation work to help organizations determine and validate the costs of scaling different auditing schemes. For both topics, research needs to address multiple storage models: locally stored data, distributed preservation networks, data cooperatives, cloud storage, brokered cloud storage systems, and hybrid systems must all be covered by cost models and auditing practices so that organizations can make informed, cost-effective digital preservation decisions.
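As a concrete illustration of the kind of cost estimation model called for above, the sketch below projects replicated storage costs over several years under assumed parameters for collection growth, price per terabyte, and annual price decline. All numbers and the replication factor are hypothetical; a real model would also account for ingest, staffing, auditing, and dissemination costs.

```python
# Illustrative sketch of a simple storage cost projection. Every parameter
# below is a hypothetical assumption, not a recommended value.

def projected_storage_cost(start_tb, growth_tb_per_year, cost_per_tb,
                           annual_cost_decline, years, copies=3):
    """Return (year, terabytes held, annual cost of storing `copies` replicas)."""
    results = []
    tb, unit_cost = start_tb, cost_per_tb
    for year in range(1, years + 1):
        results.append((year, tb, tb * copies * unit_cost))
        tb += growth_tb_per_year                 # collection grows each year
        unit_cost *= 1 - annual_cost_decline     # unit storage price declines
    return results

# Hypothetical scenario: 100 TB today, growing 50 TB/year, $120 per TB-year,
# prices falling 15% annually, three replicas, five-year horizon.
for year, tb, cost in projected_storage_cost(100, 50, 120.0, 0.15, 5):
    print(f"Year {year}: {tb:>4.0f} TB held, ~${cost:,.0f} per year for 3 copies")
```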
Research in Curriculum Development
Theoretical Framework
(3-6 year horizon) [Helen]
1. Information valuation/selection: models for estimating future private and public value of information
2. Models for estimating future risks
Information Equivalence
(3-6 year) [Jefferson]: Significant properties, fingerprints, authenticity
Preservation at Scale
(3-6 year) [Jefferson]
1. Preserving 'big data': storage scale
2. Preserving high-velocity/dynamic data
3. Scalable models for information provenance, equivalence, and quality
4. Information valuation and portfolio management
5. Privacy and confidentiality at scale
Policy Research
(3-6 year) [Micah]: Trust engineering, trust frameworks
Education Workforce Development Research
(3-6 year) [Helen]
Evidence Base for Preservation Methodologies & Policies
(Cross-Cutting / 10 years / Grand Challenge) [Micah]
1. Experimental: labs, testbeds, field experiments
- Methodologies for digital preservation research that can provide useful results with simulation of long time periods (see the sketch after this list)
- Methodologies for digital preservation research that provide reliable test plans
- Methodologies that combine aspects of different research areas (e.g., computer science, materials science)
2. Observational: random sampling, systematic trend/coverage studies
3. Computational: replicable, theoretically grounded computer models
4. Research in a lab or test-bed environment, with a focus on methods to test research results and implement effective strategies from the research lab or test-bed; frameworks that allow people to apply their specialized knowledge and skills to specific problems
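To make the simulation idea above concrete, the toy Monte Carlo sketch below estimates the chance of losing a digital object over a long horizon given an assumed annual per-copy failure rate, a replication factor, and periodic audits that restore lost copies. Every parameter is hypothetical and the model is deliberately simplistic; it only illustrates how simulation could compare preservation strategies over time periods that cannot be observed directly.

```python
# Toy Monte Carlo sketch: probability of total loss under assumed parameters.

import random

def probability_of_loss(copies=3, annual_failure_rate=0.01,
                        audit_interval_years=1, years=100, trials=10000):
    losses = 0
    for _ in range(trials):
        surviving = copies
        for year in range(years):
            # Each surviving copy may fail independently this year.
            surviving -= sum(1 for _ in range(surviving)
                             if random.random() < annual_failure_rate)
            if surviving == 0:
                losses += 1            # all copies gone before a repair
                break
            # An audit restores the full replica count from survivors.
            if (year + 1) % audit_interval_years == 0:
                surviving = copies
    return losses / trials

# Compare two hypothetical strategies: fewer copies audited rarely vs.
# more copies audited annually.
print(probability_of_loss(copies=2, audit_interval_years=5))
print(probability_of_loss(copies=3, audit_interval_years=1))
```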
Start input from S&P Working Group on research
- Findability and discoverability of content
- Large scale integration of emulation into delivery (connect to work done internationally)
- Format migration testing
- Integration of emulation and migration (hybrid approach)
- How do we leverage tools and practices in the digital forensics community (and other fields)?
End input from S&P Working Group on research
Research Priorities References:
- www.safearchive.org
- http://blogs.loc.gov/digitalpreservation/2012/11/ndsa-levels-of-digital-preservation-release-candidate-one/
Infrastructure Development
i. Infrastructure can be broadly defined as the set of interconnected structural elements that provide the framework supporting an entire structure of development. This includes both physical and institutional elements. [Micah Altman]
ii. Examples:
• Trends in data protection standards
• Best practices for using cloud concepts within a digital preservation strategy
• Cost-benefit analysis techniques for infrastructure planning
Integration of Digital Forensics Tools into Production Workflows for Collections of Born Digital Materials
- Building on exploratory work on using digital forensics (CLIR report, recent DPC report)
- Leveraging tools under development (like BitCurator) and implementing workflows like those laid out in the AIMS report.
- Mention OCLC SWAT project as a great example of a potential way forward.
- CLEAR NEED: Considerable progress has been made on preservation, but access remains problematic. Advances here in infrastructure development suggest the need for further development of both policies and tools that work based on those policies.
Implementation of tools and services for ongoing File Format Action Plans
As organizations amass considerable, and in many cases diverse and heterogeneous, collections of digital files, there is both a need and an opportunity to begin mining and monitoring this material. We now have arrays of digital files of various vintages under stewardship, and there is a clear need for organizations to survey their digital content and develop techniques to identify threats and risks to this material. We would like to suggest that there is clear value in organizations beginning to document what kinds of files they hold and sharing this information, in order to prioritize the development of format actions based on clear current needs.
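As a starting point for the kind of collection survey suggested above, the sketch below walks a storage directory and tallies files by extension and size so that the most common formats can be prioritized for format action plans. The path is a placeholder, and extension counting is only a rough proxy; a real survey would use format identification tools (for example, DROID or FITS) rather than file extensions.

```python
# Minimal sketch: tally files by extension and total size under a root path.

import os
from collections import Counter

def survey_formats(root):
    counts = Counter()
    total_bytes = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            counts[ext] += 1
            try:
                total_bytes[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass                      # unreadable file; skip its size
    return counts, total_bytes

if __name__ == "__main__":
    counts, sizes = survey_formats("/data/collections")   # placeholder path
    for ext, n in counts.most_common(20):
        print(f"{ext:15s} {n:8d} files {sizes[ext] / 1e9:10.2f} GB")
```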
Need for targeted help to move organizations up through the NDSA Levels of Digital Preservation
The work of the NDSA Levels of Digital Preservation team has produced a useful chart for helping to prioritize digital preservation work at organizations. At this point, it would be beneficial for the community to use this chart to identify the low-level infrastructure requirements that many member organizations are not currently meeting, and to focus time and energy on making it easier for organizations to move up the chart.
Start input from Digital Content group
- File system: Linear Tape File System (transport between tape, into the cloud)
We were thinking about passing it to you for consideration in your section, as it doesn't feel content-related; it feels more infrastructure-related (to us at least, who admittedly don't fully understand what the issue is).
Gail Truman was the one who brought it up, and she forwarded some additional details, below. I think Bradley Daigle also discussed this on our call, but he has not yet responded to a request for more details.
End input from Digital Content group
Start input from S&P Working Group on infrastructure
- Development of commercial products for digital preservation; creating and maintaining relationships with the private sector
- Consolidating and keeping alive the palette of tools we need to do our work of digital preservation, and for rendering in the future
- Shared tool development or reusing tools developed by other communities
- Common packaging (general and specialized)
- In a perfect world, record-keeping systems in federal agencies would all know how to create a package, so that all sorts of systems become interoperable; would achieve huge economies for the government
- Use and access – tends to be divorced from preservation, but needs to be more integrated
- Preservation is ensuring access over time
- Need to involve researchers more
- “Archlive” – shouldn’t be places of storage, but of dynamic activities
- Have yet to pursue the other end of the OAIS model – the consumer archive
- New demands for API and federated access to our content coming out of initiatives like DPLA, edX, jdarchive
- What tools are available to do things like package and annotate content (i.e., in lieu of PDF/A-3)
- Storage concerns at scale.
- Tools for risk assessment or other archive management tasks (e.g. preservation planning)
End input from S&P Working Group on infrastructure
Organizational Roles, Policies, and Practices
Issues: What is the critical organizational problem facing digital preservation work today?
- Organizational pressures
- Increased scope of responsibilities (data management, education of content creators, etc.)
- Financial pressures; increased costs
- Lack of adequate staff (refer to the staffing survey)
- Volume and increased complexity of data (see the comments in the content section)
- Changing context (compliance rules, etc.)
- Lack of prioritization of digital preservation by higher administration and those controlling budgets
- Continued preservation mandates
What potential solutions could address this challenge in a practical way?
- Work together as a community to raise the profile of digital preservation, campaign for more resources, and secure higher priority for this work
and/or
- Increased organizational cooperation and division of labor to multiply the breadth of impact of investments made within individual institutions
- If it is impractical for every institution to develop expertise in every aspect of the digital preservation challenge, different institutions could specialize in different aspects and rely on each other for some functions
- If each institution does not have the resources to fully fund all the digital preservation responsibilities and activities, having each institution spend on something different and sharing capabilities with each other would help address the cost
- If each institution cannot hire the number of staff and the variety of types of expertise, collaborative hiring and sharing of staff and skills could help
and/or
- Identification of more cost-efficient methods of preservation
and/or
- Digital preservation training and staffing resources (training materials, hiring materials, etc.)
What are the barriers to realizing collaborations and cooperative efforts on a large scale?
- Lack of knowledge on where natural collaborations could occur
- Immature certification and trust frameworks
- Lack of widely accepted standards and certifications that would allow organizations to rely on each other more easily for predictable and equivalent outcomes
- Lack of assurance that the digital preservation community is participating in all relevant standards bodies so that institutions can trust that their digital preservation interests are being represented by someone in the community when it matters. We need comprehensive coverage on all critically relevant standards bodies, and coordination so that it is clear who has taken responsibility for what.
In order to address these barriers, we recommend focused work in the following areas:
- Build and strengthen interdependent [??] preservation networks regionally, nationally and internationally
- Identify preservation functions that could be outsourced (staffing survey revealed some functions) and functions that each organization prefers to or must do for itself (planning, alignment with parent organization’s goals, and alignment with designated communities might be examples in this category)
- Create greater visibility into the different services offered, areas of expertise, and standards activities of organizations active in the digital preservation community
- Use that visibility to analyze gaps where something necessary is not getting done and find opportunities where multiple organizations could benefit from a division of labor.
- Identify potential specializations, then publicize commitments of organizations to specialize in a particular function so others can begin to rely on it.
Raw notes from S&P Working Group:
- Sustainable budgetary models for long-term preservation
- Articulating the compendium of best practices
- Continuum of policies ranging from high-level organizational policies to lower-level rules
- Role of national efforts, e.g. DPN, Academic Preservation Trust
- International efforts and leveraging other preservation groups
- Aligning National Approaches to Digital Preservation publication as a reference
- Need for creation of more dedicated FTEs to staff digital preservation initiatives
- Findings from the staffing survey (needs gaps, characteristics of needed staff)
- What are the barriers to hiring qualified staff? Is it training? Budget? Finding people?
- Collection of position descriptions that people could use as models.
- How do we convince management that digital preservation is important and deserves resources?
- Audit and certification
- Scope of what we’re responsible for as practitioners has been broadening (data management, ...). Also at different levels (department, institution, community)
- Role of disciplinary repositories (how does our organization’s repository fit into the network of repositories?)
- Changing rules for compliance
Conclusion
a. Possible ways to engage with the topics and issues detailed in the agenda