NDSA:2014 National Agenda Outline

From DLF Wiki
m (97 revisions imported: Migrate NDSA content from Library of Congress)
 

Latest revision as of 15:19, 11 February 2016

Draft Outline

Introduction

The 2014 National Agenda for Digital Stewardship identifies significant developments in preserving and providing long-term access to digital content. In a field of frequent change, this position piece evaluates the current state of digital stewardship activity and highlights emerging trends, current challenges, opportunities, and gaps. The Agenda is intended to be of use to National Digital Stewardship Alliance (NDSA) members and to the digital preservation community as a whole. It is not intended to replace any individual organization's efforts, planning, goals, or opinions; it offers inspiration and potential direction for future work in digital preservation.

The NDSA establishes, maintains, and advances the capacity to preserve our nation's digital resources for the benefit of present and future generations. Members work together to make sustained contributions to digital stewardship through action-focused working groups. The NDSA broadens access to the expanding digital resources of the United States of America; develops and coordinates sustainable infrastructures for the preservation of digital content; advocates standards for the stewardship of digital objects; builds a community of practice; promotes innovation; facilitates cooperation between various sectors; and raises awareness of the enduring value of digital resources and the need for active stewardship. With its national focus, the NDSA is in a unique position to identify and communicate the challenges, opportunities, and priorities for digital preservation activities in the United States.

This report was created by the NDSA joint leadership group, whose members are experts in digital preservation in libraries, archives, technology, and the commercial sector. The group engaged in discussions to identify significant trends and challenges. This dialog was enriched by an extensive range of resources, current research, and suggestions from the membership of the NDSA. The joint leadership group is made up of the Coordinating Committee members, the Working Group co-chairs, and the NDSA facilitator.


• Micah Altman • Jefferson Bailey • Karen Cariani • Jim Corridan • Blane Dessy • Michelle Gallinger • Andrea Goethals • Abbie Grotke • Cathy Hartman • Butch Lazorchak • Jane Mandelbaum • Carol Minton-Morris • Trevor Owens • Meg Phillips • John Spencer • Helen Tibbo • Tyler Walters • Kate Wittenberg • Kate Zwaard

Section topics

Trends in Digital Content

Electronic Records

The loss of electronic records, and of the underlying information they contain, poses a significant threat to the American memory. Whether it is an electronic diary, email correspondence, or documentation of government transactions, all of these records are at risk of disappearing without thoughtful action to preserve important information. Preserving electronic records efficiently and cost-effectively remains a tremendous challenge. Culling through the volume of records generated and held by individuals and institutions in electronic format requires changes to traditional paper-based procedures. Rather than relying on file clerks to organize and store information, the information creator (each of us) will be responsible for properly managing his or her own electronic records. Education and a proper infrastructure will be critical in teaching the public about the deficiencies of long-term electronic preservation and how to properly save important materials.

Research Data

Curating digital research data illustrates some of the most acute challenges with digital content. The sheer scale of research data represents a daunting curation task. With new scientific instrumentation being developed and the growing use of computer simulations, a research team can generate many terabytes of data per day. Data curators face managing data at the petabyte scale (a petabyte equals 1,000 terabytes) and well beyond. Scientific fields such as particle physics with its collider data and astronomy with its sky surveys, as well as research fields and methods such as bioinformatics, crystallography, and engineering design, generate massive amounts of digital data. Large-scale digitized content created by initiatives like the Google Books project poses similar challenges.

Digital research data are complex objects to curate. They are very heterogeneous, ranging from numeric and image-based to text, geospatial, and other forms. Many different information standards are used (and not used), and there are many different approaches to information structure (e.g., XML-structured documents vs. fixed image and textual file formats). The research communities that produce data are equally diverse; their data management practices vary greatly within a discipline as well as between disciplines. There can also be commercial interests in the data and associated data practices.

Perhaps the overriding challenge with respect to digital research data is the associated cost. Domain researchers, technologists, information scientists, and policymakers are searching for sustainable economic models that can accurately predict costs and balance them across the lifecycle (e.g., costs for ingest, archival management, and dissemination) and across federated inter-institutional repository systems. There is no "one size fits all" approach to resolving the management challenges of research data.

A path forward would be to galvanize digital preservation/curation community members around these four data challenges (scale, complexity, research communities' practices, and costs), study the issues in more depth, and begin recommending new solutions.

Web and Social Media

While cultural heritage organizations and others have been preserving web content since 1996, preserving born-digital web content remains challenging as websites become more complex and the scale of the web continues to grow. Crawlers used to collect content, and the access tools used to render web archives, struggle to keep up with an explosion of increasingly complex technologies: multimedia, mashups, the deep web, databases, and heavily scripted site navigation. Such scripting does not prevent the collection of data, but it makes replay nearly impossible without changes to the browser configuration of a visitor to the archive. More and more content published and created on the web cannot be preserved using available tools.

The International Internet Preservation Consortium (netpreserve.org) developed the Heritrix web crawler and is working to develop a community to stabilize, improve, and support this open source tool in the future. Broader involvement by web archivists not directly involved in the IIPC is critical. Development and exploration of improvements to access tools, including data mining tools for large datasets of web archives, are also needed. Full-text indexing of web archives continues to challenge researchers and the community of web archivists, particularly as archives expand to multi-terabyte and petabyte sizes.

The increasing use of social media by organizations and individuals also poses a preservation challenge: the services hosting this content do not have preservation as a business model, and changes they make in how they serve content can upset the preservation process. Large social media aggregation sites change their publication and site implementations, on average, every 3-6 weeks. This makes it virtually impossible to keep pace with archival-quality capture of these resources without direct access to the site feeds, which are not available to most cultural heritage institutions even for a fee.

Tools developed in recent years, primarily to meet the needs of business compliance regulations, are able to capture more of this type of material on a small scale. While they represent exciting advances in web archiving, these technologies have not yet translated into open source tools that scale to the needs of cultural heritage institutions and others collecting large amounts of data.

Motion Picture Film and Video

[rough notes from Carolyn - need a writeup] The challenges are outlined here: http://www.digitalpreservation.gov/series/challenge/DigitizationGuidelinesPart3.html


Digital preservation and stewardship of motion picture film and video present a multitude of challenges. There is a lack of standards for preservation-quality reformatting, and producing such large files raises a range of issues: not only storing the files, but also being able to play them back. There is also the clash, or the potential synergy, between the movie industry and cultural heritage institutions.

Digital content stored on obsolete or deteriorating media

As more archives and repositories come to terms with managing born-digital content, there will be a growing need to manage disk images. The first rule for managing born-digital content is to copy all the digital files off the physical media. This means transferring content from hard drives, CDs, DVDs, floppies, Zip disks, etc., often working with media that is obsolete (i.e., older and no longer in use). The resulting copies can be either forensic or logical disk images of the physical media. The main challenge in this first step is that the organization may not have the means to adequately process the digital files right away. The images will then require a rights-sensitive and potentially huge amount of storage until they can be appropriately treated and made available. As hard drives get larger, and the storage used by a single individual can easily reach terabytes, this will pose a significant long-term challenge for digital preservation, particularly at smaller organizations. Additionally, with obsolete media it may be difficult to identify and obtain the working drives or drivers needed to access the data, and with older files it could be difficult to obtain the software needed to open and view files as originally written.




Start input from S&P Working Group on trends in digital content

  • Web archiving
  • Research data
  • Big data
    • Computational consumption of archives
  • How do you connect annotations to content? Should we preserve those connections?
  • How do we provide access with appropriate limits?
    • (government classification, copyright restrictions, donor agreements, licenses, human subject research restrictions). Rights metadata standards?
  • Compound, complex objects
    • Dynamic content, integrating resources
    • Not just documents (video, digital art / new media, etc.)
  • Preservation of social media
  • How to connect related publications (within and between repositories)
  • Findability and discoverability of content
  • Accessibility of digital content (e.g., usable via screen reader)
    • Accessibility of data sets
    • In the context of open access requirements / mandates / etc.

End input from S&P Working Group on trends in digital content

Research Priorities

The Research Priorities section focuses on two distinct aspects of research: the long term preservation of research data such as e-science, data sets, and so forth; and the need for research on digital preservation activities.

Research Data

[EXAMPLE] Education Workforce Development Research: Sentence to paragraph description with rationale for including the topic in the 2014 National Agenda. Recommendation for action included if relevant.

Research Related to Digital Preservation Practices

Applied Research

In the near term, there are specific areas of applied research around digital preservation lifecycle issues that need attention. Currently there are limited models for estimating the cost of ongoing storage of digital content. Cost estimation models need to be robust and flexible. Different approaches to cost estimation should be explored, and existing models compared, with emphasis on reproducibility of results. Auditing models also need to be strengthened and further developed. The SafeArchive system and other bit-level auditing practices could be connected to the NDSA Levels of Preservation work to help organizations determine and validate the costs of scaling different auditing schemes. For both topics, research needs to address multiple storage models (locally stored data, distributed preservation networks, data cooperatives, cloud storage, brokered cloud storage systems, and hybrid systems) in cost models and auditing practices, so that organizations can make informed, cost-effective digital preservation decisions.
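As a minimal sketch of what a flexible and reproducible storage cost model might look like, the following projects yearly storage cost from a handful of inputs. The function name, its parameters, and the simple compounding assumptions (steady collection growth, steady decline in unit cost, a fixed number of copies) are all hypothetical and are not drawn from any existing model; a real model would also need to cover staffing, auditing, and migration costs.

```python
def projected_storage_cost(initial_tb, annual_growth, cost_per_tb_year,
                           annual_cost_decline, copies, years):
    """Return a list of projected storage costs, one entry per year.

    All parameters are illustrative assumptions:
      initial_tb          -- collection size in terabytes in year 0
      annual_growth       -- fractional growth per year (0.2 means 20%)
      cost_per_tb_year    -- cost of storing one terabyte for one year
      annual_cost_decline -- fractional drop in unit cost per year
      copies              -- number of replicas kept
      years               -- number of years to project
    """
    costs = []
    volume = initial_tb
    unit_cost = cost_per_tb_year
    for _ in range(years):
        # Cost for this year: every copy of every terabyte at the current rate.
        costs.append(volume * copies * unit_cost)
        # Apply growth and price decline before the next year.
        volume *= 1 + annual_growth
        unit_cost *= 1 - annual_cost_decline
    return costs
```

Because every input is an explicit parameter, two organizations could run the same function with their own figures and compare results, which is one way to approach the reproducibility concern raised above.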

Research in Curriculum Development

  • Issue: As the stewardship of digital materials becomes a responsibility for an increasing number and variety of institutions, education, training, and workforce development are seen as key elements in supporting the expertise necessary for building a qualified base of current and future digital stewards.
  • Discussion: Key research issues in this area include exploring more practical, immersive internships and fellowships for undergraduate and graduate students, the need for greater fluency with technologies across the field, more robust and affordable professional development opportunities, the economics and efficacy of online learning, better understanding of career paths and organizational roles for digital curators and preservationists, affiliations with data management and preservation in non-humanities disciplines, and exploring collaborative opportunities between educational programs, students, and employers in the digital preservation community.
  • Recommendation: Continued exploration of new curriculum models for graduate education as well as innovative training and professional development mechanisms for those currently in the field.
  • Resources:

Theoretical Framework

(3-6 year horizon) ["helen"]

1. Information valuation/selection: models for estimating the future private and public value of information.
2. Models for estimating future risks.

Information Equivalence

Significant properties, fingerprints, authenticity (3-6 year research horizon)

  • Issue: The ongoing proliferation of digital content types, file formats, and modes of access, along with continued and emerging preservation strategies of migration and emulation, has foregrounded the need for a better understanding of significant properties, file identification, and the authenticity of digital objects as they are preserved and made accessible into the future.
  • Discussion: Changes in technology, software, and hardware will increase the need for evolving standards, frameworks, and tools to help digital stewards evaluate the essential characteristics of digital objects in their care. Similarly, better understanding is needed of how the management of “digital fingerprint” fixity information impacts storage, processing, and preservation planning. Ensuring the authenticity of digital objects remains a community priority.
  • Recommendation: Organizations must identify the significant properties of digital objects under their stewardship, preserve their integrity information, and ensure their authenticity through time.
  • Resources:

Preservation at Scale (3-6 year research horizon)

  • Issue: Growing volumes of digital materials will test the financial and operational capabilities of organizations to accomplish preservation activities. Institutional responsibilities to serve and preserve big data will also be influenced by user and content creator expectations regarding its maintenance and accessibility. Storage, intellectual and administrative control, and access will all be redefined by the demands of big data.
  • Discussion: Currently, many organizations lack the expertise or economies of scale to store and process petabytes of data. Lack of infrastructure and expertise will require collaborative solutions involving greater automation, scalable processes, and modular, adaptive frameworks.
  • Recommendation: Community-driven, scalable solutions to a wide range of unique and independent preservation activities must be developed. In addition, shared infrastructure and open-source solutions will make it more efficient and economically feasible to preserve the growing volume of digital content. Research is needed into how the stewardship of "big data" impacts privacy, confidentiality, and personally identifiable information.
  • Resources:

Policy Research

(3-6 year) ["Micah"]

1. Trust engineering, trust frameworks

Education Workforce Development Research

(3-6 year) [Helen]

Evidence Base for Preservation Methodologies & Policies

(Cross-Cutting/10 years/Grand Challenge) ["Micah"]

1. Experimental: labs/testbeds/field experiments
  • Methodologies for digital preservation research that can provide useful results with simulation of long time periods.
  • Methodologies for digital preservation research that provide reliable test plans.
  • Methodologies that combine aspects of different research areas (e.g., computer science, materials science).
2. Observational: random sampling/systematic trend/coverage
3. Computational: replicable, theoretically grounded computer models
4. Research in a lab or test-bed environment, with a focus on methods to test research results and implement effective strategies from the research lab or test-bed. Frameworks that allow people to apply their specialized knowledge and skills to specific problems.


Start input from S&P Working Group on research

  • Findability and discoverability of content
  • Large scale integration of emulation into delivery (connect to work done internationally)
  • Format migration testing
  • Integration of emulation and migration (hybrid approach)
  • How do we leverage tools and practices in the digital forensics community (and other fields)?

End input from S&P Working Group on research


Research Priorities References:


Infrastructure Development

Infrastructure can be generally defined as the set of interconnected technical elements that provide a framework for supporting an entire structure of design, development, deployment and documentation in service of applications, systems and tools for digital preservation. This includes hardware, software and systems. While organizational policies, practice and regulation are relevant, these elements will be covered in the following section.

Integration of Digital Forensics Tools into Production Workflows for Collections of Born Digital Materials

Implementation of services and tools for ongoing implementation of File Format Action Plans

  • Raising awareness that organizations are amassing considerable collections of digital files, which are heterogeneous in format and size.
  • Articulating how organizations need to have the capability to mine and monitor diverse collections of digital material.
  • Exposing how organizations are surveying digital content repositories, and developing techniques to identify threats and risks to this material.
  • Advocating that there is value in organizations documenting the diversity of files they have in their collections, and sharing this information to assist in prioritizing the development of approaches for file format actions based on current needs.
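One simple way to start documenting the diversity of files in a collection, as the points above suggest, is to tally files by extension. The sketch below is illustrative only: the function name `survey_formats` is hypothetical, and a file extension is only a rough proxy for format, so a production survey would more likely rely on a dedicated format identification tool.

```python
from collections import Counter
from pathlib import Path


def survey_formats(root):
    """Count files and total bytes per lowercase file extension under root."""
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files with no extension are grouped under "(none)".
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes
```

Calling `counts.most_common()` on the first returned value lists extensions from most to least frequent, which gives a shareable first picture of where file format action planning might be prioritized.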

Interoperability between storage devices, hardware, data tape, and file systems software

  • Developing use cases, e.g., I want to be able to send my materials to a vendor to digitize and write to an LTO tape, and then plug the tape into my library system without having to re-ingest everything. How do I know if there are any general standards a vendor could adhere to that would make this easier?

Best practices for fixity checks based on size of collection and institution

  • Developing use cases, e.g., I want some guidance on how often it makes sense to run fixity checks on large collections, so that I'm not constantly fixity checking and putting data storage at risk. I'm aware that the scenario is very different if you're using spinning disk versus tape. Are there any guidelines?
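For readers less familiar with what a fixity check involves, the following is a minimal sketch: record a checksum for every file, then later recompute the checksums and compare. The function names and the choice of SHA-256 are assumptions made for illustration, not a recommended practice from the Agenda.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root, manifest):
    """Compare current checksums against a previously stored manifest."""
    current = build_manifest(root)
    report = {"ok": [], "changed": [], "missing": [], "new": []}
    for name, expected in sorted(manifest.items()):
        if name not in current:
            report["missing"].append(name)
        elif current[name] != expected:
            report["changed"].append(name)
        else:
            report["ok"].append(name)
    # Files present now that were never recorded in the manifest.
    report["new"] = sorted(set(current) - set(manifest))
    return report
```

Every check reads each file in full, which is why the frequency question above matters: the work grows with the number of bytes stored, so a large collection may need to check a rotating subset rather than everything at once.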

Issues around migration

  • Articulating the need for an interoperable back-end infrastructure.
  • Raising awareness of end-to-end data integrity across systems migration, and of automated validation strategies at various steps in the life cycle.
  • Identifying challenges to migrating individual components, or entire tiers, within complex storage environments while maintaining seamless integration.
  • Identifying challenges to data mobility.

Issues of Scalability

  • Raising awareness that digital collections are growing exponentially, and that keeping track of everything in a collection is a big challenge.
  • Advocating, on behalf of the digital stewardship community, that vendors develop systems capable of finding information in a timely manner.
  • Creating use cases that demonstrate the need for smart systems that can:
    • return corrupt files, not only error messages
    • rebuild parts of indexes when an error occurs to avoid having to stage everything again, and
    • manage metadata in a way that reflects the current state of a system.


Start input from Digital Content group

>>file system - linear tape file system (transport between tape, into cloud)

We were thinking about passing it to you for consideration in your section as it doesn't feel contenty, it feels more infrastructurey (to us at least, who admittedly don't fully understand what the issue is :)

Gail Truman was the one who'd brought it up, and she forwarded some additional details, below. I think Bradley Daigle also discussed this on our call, but he's not responded to a request yet for more details.

End input from Digital Content group

Start input from S&P Working Group on infrastructure

  • Development of commercial products for digital preservation; creating and maintaining relationships with the private sector
  • Consolidating and keeping alive the palette of tools we need to do our work of digital preservation, and for rendering in the future
    • Shared tool development or reusing tools developed by other communities
  • Common packaging (general and specialized)
    • In a perfect world, record-keeping systems in federal agencies would all know how to create a package, so that all sorts of systems become interoperable; would achieve huge economies for the government
  • Use and access – tends to be divorced from preservation, but needs to be more integrated
    • Preservation is ensuring access over time
    • Need to involve researchers more
    • “Archlive” – shouldn’t be places of storage, but of dynamic activities
    • Have yet to pursue the other end of the OAIS model – the consumer archive
    • New demands for API and federated access to our content coming out of initiatives like DPLA, edX, jdarchive
  • What tools are available to do things like package and annotate content (i.e., in lieu of PDF/A-3)
  • Storage concerns at scale.
  • Tools for risk assessment or other archive management tasks (e.g. preservation planning)

End input from S&P Working Group on infrastructure

Organizational Roles, Policies, and Practices

A. The Issue

What is the most critical organizational problem affecting digital preservation work today?

Despite continued preservation mandates, it has become increasingly difficult to adequately preserve digital content because of a complex set of interrelated societal, technological, financial and organizational pressures:

  • Increased scope of responsibilities (data management, education of content creators, etc.)
  • Growing financial pressures - increased costs and decreasing resources
  • Lack of adequate staff, in numbers and expertise (refer to the staffing survey)
  • Increased complexity and volume of data (see the comments in the content section)
  • Rapidly accelerating technological change
  • Evolving data management, security and compliance policies
  • Lack of prioritization of digital preservation by higher administration and those controlling budgets

B. Solutions and Recommendations

What potential solutions could address this challenge in a practical way? Because the pressures listed above are interrelated, the most effective solutions will address multiple factors; we need to address the whole suite of problems together for the most effective change.

1. Work together as a community to raise the profile of digital preservation and campaign for more resources and a higher priority for digital preservation

Recommendations:

  • Increase outreach activities and education about the importance and real cost of digital preservation

2. Dramatically increase cross-organizational cooperation and division of labor to multiply the breadth of impact and investments made within individual institutions

Rationale:

  • If it is impractical for every institution to develop expertise in every aspect of the digital preservation challenge, different institutions could specialize in different aspects and rely on each other for some functions.
  • If each institution does not have the resources to fully fund all of its digital preservation responsibilities and activities, having each institution spend on something different and share capabilities with the others would place investments where they could make a real impact.
  • If each institution cannot hire the number of staff and the variety of expertise it needs, collaborative hiring and sharing of staff and skills could help.

Recommendations:

  • Identify preservation functions that could be outsourced (the staffing survey revealed some functions) versus the functions that each organization prefers to or must do for itself (e.g. planning, alignment with parent organization’s goals and designated communities)
  • Establish a network of preservation service providers who can provide different specialized services so every participant does not need to provide all the services it needs for itself.
    • Make visible the different services offered, areas of expertise, and standards activities of organizations active in the digital preservation community
    • Use that visibility to find opportunities where multiple organizations could benefit from a division of labor and identify gaps where something necessary is not getting done
    • Identify potential specializations, then publicize commitments of organizations to specialize in a particular function so others can begin to rely on it.
  • Develop mature certification and trust frameworks
  • Encourage wide adoption of interoperability standards that would allow organizations to rely on each other more easily for predictable and equivalent outcomes
  • Establish a method of providing assurance that the digital preservation community is participating in all relevant standards bodies, so that institutions can trust that their digital preservation interests are being represented by someone in the community when it matters. We need comprehensive coverage of all critically relevant standards bodies, and coordination so that it is clear who has taken responsibility for what. [Butch - please explain relation to network]

3. Identify more cost-efficient methods of preservation

Recommendations:

  • Conduct research on cost-efficient but effective preservation, and sustainable financing/billing models

4. Develop and share digital preservation training and staffing resources

Recommendations:

  • Develop and share resources for training or hiring digital preservation staff (e.g., curricula, training materials, position descriptions)

Conclusion

a. Possible ways to engage with the topics and issues detailed in the agenda