NDSA:February 24, 2014 Standards and Practices Working Group Notes
Attendees:
- Barrie Howard
- Amy Kirchhoff
- Andrea Goethals
- Butch Lazorchak
- Carol Kussmann
- Carolyn Campbell
- David Lake
- Deborah Kempe
- Dina Sokolova
- Felicity Dykas
- Kate Murray
- Mariella Soprano
- Mary Vardigan
- Michelle Paolillo
- Midge Coates
- Rosie Storey
- Winston Atkins
Rosie Storey is a new attendee. She was sent by Kate Zwaard, who is out on maternity leave. She is a software developer.
Project 1 Review: Digital Video conversation a couple of weeks ago. Went well. Page on Wiki. Going to put together plans on next steps. Doodle poll for scheduling of next call to look at suggestions on Wiki and think about next steps. Infrastructure working group also interested, so the call was opened up to their participants. WebEx will be posted to whole group, whether you opt to answer the poll or not.
Project 2 Review: PDF/A-3 document published last week. Butch Lazorchak led the effort. http://blogs.loc.gov/digitalpreservation/2014/02/new-ndsa-report-the-benefits-and-risks-of-the-pdfa-3-file-format-for-archival-institutions/. Also on the byline were Sheila Morrissey and Caroline Arms. Lots of good feedback on Twitter. Recommend we all read it and comment to Butch or on the blog. Kudos to all authors.
Project 3 Review: Fixity document. An informal document at this point. Posted by Trevor Owens to Signal Blog on 2/7. http://blogs.loc.gov/digitalpreservation/2014/02/check-yourself-how-and-when-to-check-fixity/ Next steps are to encourage folks to share the document, get comments, refine, and post as a more formal document to NDSA website.
As a follow-up to the fixity document, will be posting a blog post about the different roles and placement of fixity data in media files. A little different than with "documents". Would love to hear comments.
Last update: The Signal Blog post about the Wikipedia project. Andrea talks about it a little. http://blogs.loc.gov/digitalpreservation/2014/01/wikipedia-the-go-to-source-for-information-about-digital-preservation/. The post discusses the project, its challenges and achievements, and the folks who contributed a lot of effort, especially our colleagues at Columbia. The blog post asks if anyone wants to take over heading the project. If interested, contact Steven Paul Davis.
Last Last Update: Self-assessment project. Archivematica is hosting a Drupal-based app that institutions can download. It helps institutions work through a TRAC self-assessment. https://www.archivematica.org/wiki/Internal_audit_tool Next focus of project will be getting back to guidance and examples of how to go through the process. Want to have something to share by July meeting.
Another Update: Agenda for NDSA. This year there is an opportunity to tap into the wealth of info in working groups. A concerted effort to get feedback from working group members and working groups. Read through 2014 document and provide feedback by mid-March on different sections addressed. Standards is a bit buried (infrastructure, on the other hand, has its own section within the document). Contact Andrea, Barrie, Kate, or coordinating committee or working group chairs with comments. Input would be useful, including exemplars.
A couple of calls ago, discussed interests of attendees in the group. Looked for overlap. First was video, which led to those discussions. Next up was metadata packaging.
Rosie from LC: Using BagIt. Where possible, try to get content providers to use BagIt. Usually providers are happy to do so. In the BagIt spec, there is a bag-info.txt file where you can put MD. There is a manifest with fixity info. When a bag is received at LC, it goes into an inventory DB. The bag-info.txt file feeds into the inventory DB. Users can browse that DB via a web application to find and get to the files. Keep the bag-info.txt file on the system in case the DB gets lost. Keeping the two in sync is one of the problems. Working on a related project to address BagIt spec version 2 ("baguette"). Enhanced ability to keep MD at root level of bag and across the entirety of bag. Also trying to come up with a less intrusive way to keep the content on disk with fixities and MD embedded. Maybe over next 6 months to a year. Will be posting to Google Digital Curation Group. Amy K. asked if it would be possible to get into the BagIt redesign during the design phase.
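For readers less familiar with BagIt, the layout described above (a payload directory, a bag info tag file for MD, named bag-info.txt in the BagIt specification, and a manifest with fixity info) can be sketched roughly as follows. This is only an illustration, not LC's tooling: the function names, the flat payload, the choice of SHA-256, and the sample label values are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir, payload, info):
    """Write a minimal bag: data/ payload, bagit.txt, bag-info.txt, manifest.

    Hypothetical helper: payload maps flat file names to bytes, info maps
    bag-info.txt labels to values. Nested payload paths are not handled.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
    # Bag declaration; 0.97 is assumed here as the spec version.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Descriptive MD as "Label: value" lines.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )
    # Fixity manifest: one "checksum  path" line per payload file.
    lines = [f"{sha256_of(p)}  data/{p.name}\n" for p in sorted(data_dir.iterdir())]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def read_bag_info(bag_dir):
    """Parse bag-info.txt into a dict, e.g. to feed an inventory DB."""
    info = {}
    text = (bag_dir / "bag-info.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        label, _, value = line.partition(": ")
        info[label] = value
    return info


def verify_bag(bag_dir):
    """Return payload paths that are missing or whose checksum differs."""
    failures = []
    text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        expected, _, rel_path = line.partition("  ")
        target = bag_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example-bag"
        make_bag(bag, {"report.txt": b"sample content"},
                 {"Source-Organization": "Example Library"})
        print(read_bag_info(bag))
        print(verify_bag(bag))
```

Because bag-info.txt stays on disk next to the content, a lost inventory DB could in principle be rebuilt by re-reading it from each bag, which is the backup role described above.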
Andrea from Harvard: In the past, had a mixture of some content modeled using METS (mainly page-turn objects) and lots of other content they don't use METS for. Putting second-generation DB in place and thinking about how to treat MD. Making it a first-class citizen of the repository. Migrating all MD this year. Serializing MD to METS files. There were no standards around the first time they built their DB. Changing all schemas now, and changed data model to be more consistent with PREMIS data model (objects that have files, rather than just files). MD coming in from lots of data sources: old files, catalog records, and the FITS tool, which is being run against all files and provides technical and format MD. Challenge is to figure out where to draw the line between what to do now and what to put off to the future. Can't do everything right now.
David Lake from NARA: Much of what Andrea discussed resonates with him. Currently in the process of developing a new SIP specification for the ERA project. The original SIP was created back in 2005/6. Quite limited in functionality, but it did much of the basics (captured fixity, created manifest, etc.). Homegrown. Limited in ability to provide MD at various levels. An opportunity to re-examine what they are doing in this area. Doing a lot of work to take processing capabilities out of the DRA system and put them in a more flexible environment. Need a multitude of tools to process different datasets as they come in. Doing some refactoring. Repository has an XML file of MD for every object in the repository. It is based on PREMIS. Limitations with that. Thinking about SIP spec right now. Using METS heavily in construction of SIPs. In the SIP, assuming they'll be making changes on the backend with the repository schema. Biggest challenge is accommodating the massive collections that are expected to come in; volume will be a challenge. Another is being able to take in MD across the different types of formats they receive, especially records in hierarchical format: how to model those complex types of records in the SIP in a way that the repository will be able to parse out and manage MD and relationships between the files once they get into the repository. Planning to have a draft produced for a pilot project later this year. May take it to wider community for constructive review.
Carolyn Campbell: Uses METS now. Wants to use METS in a DSpace repository. Current project. Trying to get METS integrated into the DSpace repository and not sure how to do it.
Amy: Portico takes in content packaged in many different ways and normalizes everything to its content model, which was informed by PREMIS, DIDL and METS. The Portico MD closely follows the six-part content model: content type, content set, archival unit, content unit, functional unit, and storage unit. Portico prefers to export content in BagIt, using rsync where possible (to leverage its built-in fixity check functionality). Portico imposes a specific directory structure on the payload of the bags. Amy will send round some pictures of the Portico content model.
A question was asked about whether, in Portico's experience, it was possible to put all the DMD into the bag-info.txt file. Amy explained that Portico does not try to. Portico expects its content recipients to read the NLM/JATS XML files and/or the PMD file for DMD and article structural information. Portico does impose a structure on the payload and provides an XML file with each journal directory that includes some business information, such as journal title, which does not always come in the NLM/JATS files.
Question: Andrea asks whether we want to write a blog post about this, keep it as an internal discussion, or discuss it further. Folks who spoke will write up a paragraph.
Wrap Up: Moving from lists.digitalpreservation.gov to lists.gov. They will send out reminders, in case we've set up rules in our email services. This WebEx will be changing, too. Not sure what it'll be yet, evaluating choices.
Action Items:
- Amy from ITHAKA will send round a picture of the Portico content model
- Working group members who spoke about their experiences with MD and packaging will write up a paragraph