NDSA:2014 National Agenda Outline

From DLF Wiki

Draft Outline

Introduction

a. Description of the National Agenda for Digital Stewardship

i. The document is intended to inspire the planning of digital preservation work and to record the observations of the joint leadership group. It is also an evaluation of the state of digital preservation activity and of key emerging issues for the year.

ii. The document is not intended to be prescriptive or a directive to working groups, nor is it intended to replace any organizational efforts, planning, goals, or opinions.

iii. Hoped-for impact

b. Description of the NDSA, NDSA goals, and how the 2014 Agenda furthers those goals (i.e., to inform and inspire individual, working group, and organizational work plans)

c. Intended audience: NDSA members and the wider digital preservation community

d. Authored by the joint leadership group

Section topics

Trends in Digital Content

Electronic Records

The loss of electronic records, and of the underlying information they contain, poses a significant threat to the American memory. Whether it is an electronic diary, email correspondence, or documentation of government transactions, all of these records are at risk of disappearing without thoughtful action to preserve important information. Preserving electronic records efficiently and cost-effectively remains a tremendous challenge. Culling through the volume of records generated and held by individuals and institutions in electronic format requires changes to traditional paper-based procedures. Rather than relying on file clerks to organize and store information, the information creator, each of us, will be responsible for properly managing his or her own electronic records. Education and a proper infrastructure will be critical factors in teaching the public about the deficiencies of long-term electronic preservation and how to properly save important materials.

Research Data

Curating digital research data illustrates some of the most acute challenges with digital content. The sheer scale of research data represents a daunting curation task. With new scientific instrumentation being developed and the growing use of computer simulations, a research team can generate many terabytes of data per day. Data curators face managing data at the petabyte scale (a petabyte equals 1,000 terabytes) and well beyond. Scientific fields such as particle physics, with its collider data, and astronomy, with its sky surveys, as well as research fields and methods such as bioinformatics, crystallography, and engineering design, generate massive amounts of digital data. Large-scale digitized content created by initiatives like the Google Books project poses similar challenges.

Digital research data are complex objects to curate. They are very heterogeneous, ranging from numeric and image-based to text, geospatial, and other forms. There are many different information standards used (and not used), as well as many different approaches to information structure (e.g., XML-structured documents vs. fixed image and textual file formats). The research communities that produce data are equally diverse; their data management practices vary greatly within a discipline as well as between disciplines. There can also be commercial interests in the data and associated data practices.

Perhaps the overriding challenge with digital research data is the associated cost. Domain researchers, technologists, information scientists, and policymakers are searching for sustainable economic models that can accurately predict costs and balance them across the lifecycle (e.g., costs for ingest, archival management, and dissemination) and across federated inter-institutional repository systems. There is no “one size fits all” approach to resolving the management challenges of research data.
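To make the scale and lifecycle-cost points above concrete, here is a minimal sketch in Python. It is illustrative only: the function names, the `LifecycleRates` structure, and every cost rate are hypothetical assumptions, not figures or models from the NDSA or any repository. It shows how quickly a daily acquisition rate reaches the petabyte scale, and one very simple way to split a cost estimate across ingest, archival management, and dissemination.

```python
from dataclasses import dataclass

TB_PER_PB = 1_000  # decimal convention used in the text: 1 PB = 1,000 TB


def days_to_reach(target_pb: float, tb_per_day: float) -> float:
    """Days of continuous acquisition needed to accumulate target_pb petabytes."""
    if tb_per_day <= 0:
        raise ValueError("tb_per_day must be positive")
    return target_pb * TB_PER_PB / tb_per_day


@dataclass
class LifecycleRates:
    """Hypothetical per-terabyte cost rates; real figures vary widely."""
    ingest_per_tb: float              # one-time cost when data enters the archive
    management_per_tb_year: float     # recurring archival management cost
    dissemination_per_tb_year: float  # recurring cost of serving the data


def lifecycle_cost(volume_tb: float, years: int, rates: LifecycleRates) -> dict:
    """Split a flat-rate cost estimate across the three lifecycle stages."""
    ingest = volume_tb * rates.ingest_per_tb
    management = volume_tb * rates.management_per_tb_year * years
    dissemination = volume_tb * rates.dissemination_per_tb_year * years
    return {
        "ingest": ingest,
        "management": management,
        "dissemination": dissemination,
        "total": ingest + management + dissemination,
    }


if __name__ == "__main__":
    # A team producing 5 TB per day reaches one petabyte in 200 days.
    print(days_to_reach(target_pb=1, tb_per_day=5))
    # Assumed rates, chosen only to make the arithmetic easy to follow.
    print(lifecycle_cost(100, 10, LifecycleRates(20.0, 5.0, 1.0)))
```

A flat per-terabyte rate is the simplest possible assumption. A more realistic model would likely account for falling storage prices, data growth over time, and differences between institutions, which is part of why no single model fits all cases.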

Web and Social Media

DRAFT DRAFT

While cultural heritage organizations and others have been preserving web content since 1996, challenges continue in preserving born-digital web content as websites become more complex and the scale of the web continues to grow. Crawlers used to collect content, as well as access tools used to render the web archives, increasingly struggle to keep up with the explosion of ever more complex technologies: multimedia, mashups, deep-web, databases [others to list?]. More and more of the content published and created on the web cannot be preserved using available tools.

The International Internet Preservation Consortium (netpreserve.org) developed the Heritrix web crawler and is working to develop a community to stabilize, improve, and support this open source tool in the future. Broader involvement by web archivists not directly involved in IIPC is critical. Development and exploration of improvements to access tools, including data mining tools for large datasets of web archives, are also needed. Full-text indexing of web archives continues to challenge researchers and the community of web archivists, particularly as archives expand to multi-terabyte and petabyte sizes.

Social media, increasingly used by organizations and individuals, can also be a challenge to preserve, as the services hosting this content do not have preservation as part of their business model, and changes they make in how they serve content can disrupt the preservation process. Tools developed in recent years, primarily to meet the needs of business compliance regulations, are able to capture more of this type of material on a small scale. While they represent exciting advances in web archiving, these technologies have not yet translated into open source tools that scale to the needs of cultural heritage institutions and others collecting large amounts of data.

Motion Picture Film and Video

[rough notes from Carolyn - need a writeup] The challenges are outlined here: http://www.digitalpreservation.gov/series/challenge/DigitizationGuidelinesPart3.html

Motion picture film and video: challenges of (lack of) standards for preservation-quality reformatting, and a slew of issues that come from producing such large files, including not only storage of these monster files but also the ability to play back such files, etc. Also, the clash or the potential synergy between the movie industry and cultural heritage institutions.

Digital content stored on obsolete or deteriorating media

(our interpretation of what was meant by "Disc images - come off physical media images, optical media, magnetic storage media")


Start input from S&P Working Group on trends in digital content

  • Web archiving
  • Research data
  • Big data
    • Computational consumption of archives
  • How do you connect annotations to content? Should we preserve those connections?
  • How do we provide access with appropriate limits?
    • (government classification, copyright restrictions, donor agreements, licenses, human subject research restrictions). ** Rights metadata standards?
  • Compound, complex objects
    • Dynamic content, integrating resources
    • Not just documents (video, digital art / new media, etc.)
  • Preservation of social media
  • How to connect related publications (within and between repositories)
  • Findability and discoverability of content
  • Accessibility of digital content (e.g., usable via screen reader)
    • Accessibility of data sets
    • In the context of open access requirements / mandates / etc.

End input from S&P Working Group on trends in digital content

Research Priorities

The Research Priorities section focuses on two distinct aspects of research: the long-term preservation of research data such as e-science, data sets, and so forth; and the need for research on digital preservation activities.

Research Data

[EXAMPLE] Education Workforce Development Research: Sentence to paragraph description with rationale for including the topic in the 2014 National Agenda. Recommendation for action included if relevant.

Research Related to Digital Preservation Practices

Applied Research

In the near term, there are specific areas of applied research around digital preservation lifecycle issues that need attention. Currently there are limited models for estimating the cost of ongoing storage of digital content. Cost estimation models need to be robust and flexible. Different approaches to cost estimation should be explored, and existing models compared, with emphasis on reproducibility of results. Auditing models also need to be strengthened and further developed. The SafeArchive system and other bit-level auditing practices could be connected to the NDSA Levels of Preservation work to help organizations determine and validate the costs of scaling different auditing schemes. For both topics, research needs to address multiple storage models (locally stored data, distributed preservation networks, data cooperatives, cloud storage, brokered cloud storage systems, and hybrid systems) so that organizations can make informed, cost-effective digital preservation decisions.
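As a rough illustration of what bit-level auditing involves, the following Python sketch records a SHA-256 checksum for every file under a directory and later compares the stored checksums against the files as they currently exist. This is a generic fixity check written for this outline under stated assumptions; it is not how SafeArchive works, and the function names (`sha256_of`, `build_manifest`, `audit`) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Recorded in the manifest but no longer present on disk.
        "missing": sorted(set(manifest) - set(current)),
        # Present in both, but the checksum no longer matches.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # On disk but never recorded in the manifest.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

Every audit pass rereads every byte, so the cost of an auditing scheme grows with both collection size and audit frequency. That is one reason the cost of scaling different schemes across storage models is worth studying.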

Research in Curriculum Development

Theoretical Framework

(3-6 year horizon) [Helen] 1. Information valuation/selection: models for estimating the future private and public value of information. 2. Models for estimating future risks.

Information Equivalence

(3-6 year) [Jefferson]: Significant properties, fingerprints, authenticity

Preservation at Scale

(3-6 year) [Jefferson]: 1. Preserving 'big data' (storage scale) 2. Preserving high-velocity/dynamic content 3. Scalable models for information provenance, equivalence, and quality 4. Information valuation and portfolio management 5. Privacy and confidentiality at scale

Policy Research

(3-6 year) [Micah]: 1. Trust engineering, trust frameworks

Education Workforce Development Research

(3-6 year) [Helen]

Evidence Base for Preservation Methodologies & Policies

(Cross-cutting/10 years/Grand Challenge) [Micah]

1. Experimental: labs/testbeds/field experiments
  • Methodologies for digital preservation research that can provide useful results with simulation of long time periods.
  • Methodologies for digital preservation research that provide reliable test plans.
  • Methodologies that combine aspects of different research areas (e.g., computer science, materials science).
2. Observational: random sampling/systematic trend/coverage
3. Computational: replicable, theoretically grounded computer models
4. Research in a lab or test-bed environment, with a focus on methods to test research results and implement effective strategies from the research lab or test-bed. Frameworks that allow people to apply their specialized knowledge and skills to specific problems.


Start input from S&P Working Group on research

  • Findability and discoverability of content
  • Large scale integration of emulation into delivery (connect to work done internationally)
  • Format migration testing
  • Integration of emulation and migration (hybrid approach)
  • How do we leverage tools and practices in the digital forensics community (and other fields)?

End input from S&P Working Group on research


Research Priorities References:

Infrastructure Development

i. Infrastructure can be generally defined as the set of interconnected structural elements that provide the framework supporting an entire structure of development. This includes both physical and institutional elements. --Micah altman 18:31, 13 February 2013 (UTC)

ii. Examples:

• Trends in data protection standards

• Best practices for using cloud concepts within a digital preservation strategy

• Cost-benefit analysis techniques for infrastructure planning

Start input from Digital Content group

>> File system: Linear Tape File System (transport between tape, into cloud)

We were thinking about passing it to you for consideration in your section as it doesn't feel contenty, it feels more infrastructurey (to us at least, who admittedly don't fully understand what the issue is :)

Gail Truman was the one who'd brought it up, and she forwarded some additional details, below. I think Bradley Daigle also discussed this on our call, but he's not responded to a request yet for more details.

End input from Digital Content group

Start input from S&P Working Group on infrastructure

  • Development of commercial products for digital preservation; creating and maintaining relationships with the private sector
  • Consolidating and keeping alive the palette of tools we need to do our work of digital preservation, and for rendering in the future
    • Shared tool development or reusing tools developed by other communities
  • Common packaging (general and specialized)
    • In a perfect world, record-keeping systems in federal agencies would all know how to create a package, so that all sorts of systems become interoperable; would achieve huge economies for the government
  • Use and access – tends to be divorced from preservation, but needs to be more integrated
    • Preservation is ensuring access over time
    • Need to involve researchers more
    • “Archlive” – shouldn’t be places of storage, but of dynamic activities
    • Have yet to pursue the other end of the OAIS model – the consumer archive
    • New demands for API and federated access to our content coming out of initiatives like DPLA, edX, jdarchive
  • What tools are available to do things like package and annotate content (i.e., in lieu of PDF/A-3)?
  • Storage concerns at scale.
  • Tools for risk assessment or other archive management tasks (e.g. preservation planning)

End input from S&P Working Group on infrastructure

Organizational Roles, Policies, and Practices

i. Preservation happens through the work of individuals and institutions. Just as it is critical to refine and develop infrastructure and basic research, it is critical to refine and develop workflows, practices, roles, and responsibilities, both inside institutions and within networks of institutions, to ensure long-term access to digital content.

ii. Examples:

• Need for models for licensing old software for long-term virtualization

• Need for creation of more dedicated FTEs to staff digital preservation initiatives

• Development of policies around crowdsourcing as part of the digital preservation lifecycle

• Expanded use of machine-readable licensing for data under long-term preservation

Raw notes from S&P Working Group:

  • Sustainable budgetary models for long-term preservation
  • Articulating the compendium of best practices
  • Continuum of policies ranging from high-level organizational policies to lower-level rules
  • Role of national efforts, e.g. DPN, Academic Preservation Trust
  • International efforts and leveraging other preservation groups
  • Aligning National Approaches to Digital Preservation publication as a reference
  • Need for creation of more dedicated FTEs to staff digital preservation initiatives
    • Findings from the staffing survey (needs gaps, characteristics of needed staff)
  • What are the barriers to hiring qualified staff? Is it training? Budget? Finding people?
  • Collection of position descriptions that people could use as models.
  • How do we convince management that digital preservation is important and deserves resources?
  • Audit and certification
  • Scope of what we’re responsible for as practitioners has been broadening (data management,...), and at different levels (department, institution, community)
  • Role of disciplinary repositories (how does our organization’s repository fit into the network of repositories?)
  • Changing rules for compliance

Conclusion

a. Possible ways to engage with the topics and issues detailed in the agenda