NDSA:March 29, 2013 Call
Revision as of 04:01, 1 April 2013
Agenda
Discussion with Stephen Levenson, an IT Specialist for Policy and Planning at the Office of the US Courts and the chair of the PDF/A working group.
Participants
Don Chalfant, Kate Murray, Sheila Morrissey, Kevin DeVorsey, Chris Dietrich, Carl Fleischhauer, Stephen Levenson, Butch Lazorchak
Meeting Notes
- A rough transcription of our conversation*
Stephen Levenson discussing how the PDF/A family of standards is being implemented by the ISO community.
Steve: PDF/A-3 doesn't necessarily replace A-1 or A-2. Should be able to use a PDF/A-1 file 30 years from now. The methodology used should not change for rendering these files in the future.
PDF/A movement highly influenced by manufacturers, now PDF/A center, dominated by the Germans. Had many use cases for instances where creators wanted to include the original files within a PDF/A document.
Brazilian government wanted to preserve their material as XML but XML wasn't trusted by users because of complexity. Wanted to make a more presentation-ready format but didn't want to throw away the XML.
U.S. Courts, bankruptcy court, claims: when an individual goes into court, and the claims are laid out, in order for someone to assert the claim, we print a document for them. That's what they bring back to court to assert their claim. Ginnie Mae started this and Mastercard is also moving on this. We get the PDF but then have to reenter the data from this document in their case management systems.
We're putting an XML output of what the claim represents inside the PDF document. Ginnie Mae's automated processes work on this XML.
With PDF/A-3 we have downstream functions that leverage the inner materials. Adobe's server product does not currently output A-3 files.
Chris: How will hidden content be protected from certain readers of the document?
Steve: For A-3 you can still include the information as "private data" that would make it hidden. Conforming reader would recognize it as A-3 and set up an additional dialog.
Caroline: We're in the position of having to preserve files that somebody else created. We need tools to characterize files. Some PDF/A-3 files may not have any embedded content, so they'd actually behave like a PDF/A-2.
Steve: We'd have to talk to the developers about that. There is a vendor that is looking to create an independent service to validate the writers to ensure that they are actually complying with the standards. The software would have to get certified that it works. Then we'd actually have validators at ingestion. We have to get somebody interested in creating this software as a business and we think we finally have somebody who will do this.
DOD standard 5015.2 says that if you're going to be a document management system you have to do certain things. The system has to be sent to Fort Huachuca in AZ to a testing center to ensure that it conforms to DOD 5015.2. And this validator would do the same kind of thing.
Right now we have no validator that says a Word document is actually a Word document. There are a lot of bad writers out there.
Kevin: We're working on the policy and guidance side. We ask people to keep temporary, permanent and non-record material separate from each other. Does PDF/A-3 run counter to that? Might it encourage people to mix record and non-record material together in the same file?
Steve: If there's a relationship, don't you need that for provenance information?
Kevin: we need to educate our folks.
Steve: We've been dealing with current technologies on these things and who knows what technology will afford us in the future. We'll be able to hedge our bets.
Sheila: The question to me is what is the relationship between the PDF/A-3 container and the embedded XML, that is, in the Brazilian example, which one has the force of law? And how do you ensure that they say the same thing? In Germany they're planning on using this for commerce and the validation of invoices, but processing things en masse pragmatically means that you're going to look at one or the other. What warrant is there that they're going to stay the same? The Germans said that the "embedded content" has no standing. Only the archival version has standing.
Steve: I can talk about legal. We would assume that the entire document was the legal evidence. In the case of dual content, the "best evidence" rule would apply. If the company put a courtesy copy inside, then it's the main document that is their record. For example, if you requested an invoice and you received what you might see in a PDF versus XML data, then that's the evidence.
Folks coming in to a reading room. We may have to set up rules at ingestion, they could either strip it out and put the file on a diet and store the other content. We, in the committee, didn't want to dictate to preservationists how to do their job.
If NARA said they didn't want a part of A-3 then the agencies shouldn't store in A-3. In our PACER system you can pull down a PDF document but stored inside is an MP3 file that allows you to understand the provenance a little more. The PDF is the metadata around the MP3 file. These are temporary records, so it's not the same issues.
Chris: So the PDF is acting as a manifest for the MP3 file. There are no validators: if someone is processing a bunch of files into PDF and embedding an XML version, something could go wrong and you embed the wrong versions of the XML. There's no way to validate that the content is correctly coordinated.
Steve: Archivists are going to have to get more involved in advising creators on the types of files they create.
PDF/E and A are essentially the same, they just couldn't find an open codec for rendering 3D documents.
PDF/A is pretty much asleep now, making sure we keep up with the latest version of the PDF specification.
Caroline:
Chris Dietrich's Notes: PDF/A-3 Use Cases 20130329
NDSA PDF/A-3 work group discusses (positive) use cases with Stephen Levenson of the U.S. Courts
PDF/A has 3 versions. No version (A-1, A-2, or A-3) will be replaced by subsequent versions, reducing the need for migration
• PDF/A-1 = ???? [didn’t capture what makes a PDF/A-1 different from a PDF]
• PDF/A-2 = PDF/A-1 + ability to embed a second PDF/A-2 copy of the same document(?) (recursive?)
• PDF/A-3 = PDF/A-2 + any file embedded
• Genesis for A-3 was to include other types like XML along with the PDF (not just PDF/A-2)
• Private Data section in A-2:
• Use Case: Brazil uses to embed other data like XML version of the file (legal docs)
• Is/can be protected from view
• Was the genesis for creating A-3 to store other data not protected from view (?)
• Allows for human- and machine-readable versions of a document in the same container
• Viewing software (e.g. Adobe Reader et al.) will not render additional content but will notify consumer that additional content exists
All three versions are PDFs, no need to distinguish between the three versions except by the presentation interface which should be designed to detect and display embedded additional content
Currently no validators to ensure that any particular file (PDF, .doc, etc.) is actually what the extension indicates or is high-quality (e.g. some PDF generators are better than others)
• There is a vendor looking to create an ISO-compliant PDF validator which could be used by repositories to validate incoming PDF files (still years away)
Ability to embed XML allows both human- and machine-readable versions of a document in the same container
• Useful for automated ingest and presentation by repositories
Question arises which of the two versions (PDF or XML) is the authoritative version
• Use-case: PACER system for courtroom audio stored as MP3 and packaged in PDF/A-3
• PDF/A-3 acts as the metadata/manifest for the embedded recording
• Recordings are access copies, not high-quality originals for long-term preservation/archiving
• These are temp records, disposed of after 5 years
Caroline Arms' thoughts after the 3/29/2013 call:
The call with Stephen Levenson confirms me in thinking that we ought to present the risks first.
Here are some thoughts I've tried to turn into words since Friday – words that might fit into the report.
The PDF/A-3 specification allows any sort of file to be embedded in a PDF/A-3 document. There is a mechanism for expressing a relationship between the embedded file and the document, but there is no way to verify that the stated relationship holds. Thus, PDF/A-3 is not appropriate for long-term preservation in a cultural heritage institution without a prior agreement between a depositor and the archival repository. Such an agreement must clarify what formats are acceptable as embedded files and how the depositor's workflow guarantees that the relationships between the main PDF document and any embedded files are fully understood by the archival institution, so that appropriate rules are applied on ingest. The agreement might be based on following a community best practice or a formal, legal regulation.
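As an illustration only, the kind of agreement described above could be written down in machine-checkable form. The sketch below assumes the embedded files have already been listed by some extraction tool (not shown); the class and function names are hypothetical. The relationship values are those that PDF/A-3 associates with embedded files through its AFRelationship key. Note that the check only compares what the depositor declared against what was agreed; it cannot confirm that a declared relationship is true.

```python
from dataclasses import dataclass

# Relationship values PDF/A-3 allows for an embedded file (AFRelationship key).
KNOWN_RELATIONSHIPS = {"Source", "Data", "Alternative", "Supplement", "Unspecified"}


@dataclass(frozen=True)
class EmbeddedFile:
    """Hypothetical record describing one file embedded in a PDF/A-3 document."""
    name: str
    mime_type: str
    relationship: str  # as declared by the depositor; not verifiable from the file


@dataclass(frozen=True)
class DepositAgreement:
    """Hypothetical depositor/repository agreement on embedded content."""
    accepted_mime_types: frozenset
    accepted_relationships: frozenset


def violations(agreement, embedded_files):
    """Return human-readable problems; an empty list means the deposit conforms."""
    problems = []
    for f in embedded_files:
        if f.relationship not in KNOWN_RELATIONSHIPS:
            problems.append(f"{f.name}: unknown relationship {f.relationship!r}")
        elif f.relationship not in agreement.accepted_relationships:
            problems.append(f"{f.name}: relationship {f.relationship!r} not covered by agreement")
        if f.mime_type not in agreement.accepted_mime_types:
            problems.append(f"{f.name}: format {f.mime_type!r} not covered by agreement")
    return problems


if __name__ == "__main__":
    agreement = DepositAgreement(
        accepted_mime_types=frozenset({"application/xml"}),
        accepted_relationships=frozenset({"Data", "Alternative"}),
    )
    deposit = [
        EmbeddedFile("invoice.xml", "application/xml", "Data"),
        EmbeddedFile("draft.docx", "application/msword", "Source"),
    ]
    for problem in violations(agreement, deposit):
        print(problem)
```

A repository might run a check like this at ingest and route any deposit with violations to manual review rather than rejecting it outright.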
Failing an appropriate agreement, the default ingest policy for an archival repository might be to treat all embedded files in PDF/A-3 files as outside its preservation responsibility. A repository should establish its own rules for dealing with embedded files in PDF/A-3 documents in general, just as it needs to establish rules for dealing with files in a generic aggregation format, such as ZIP. Some repositories might disallow embedded files in PDF/A documents; some might immediately extract the embedded files and make individual preservation commitment decisions; some might accept PDF/A-3 files but make no commitment with respect to any embedded files.
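The three repository stances listed above can be sketched as a small decision function. This is a minimal illustration, not a recommended implementation: the policy names, the IngestDecision fields, and the assumption that the names of embedded files are already known are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class EmbeddedFilePolicy(Enum):
    """Hypothetical names for the three stances a repository might take."""
    DISALLOW = "disallow"                    # refuse PDF/A files with embedded files
    EXTRACT_AND_APPRAISE = "extract"         # pull files out, decide on each one
    ACCEPT_NO_COMMITMENT = "no_commitment"   # keep the PDF, promise nothing for contents


@dataclass
class IngestDecision:
    accept_document: bool
    to_appraise: list = field(default_factory=list)    # files needing their own decision
    uncommitted: list = field(default_factory=list)    # files kept without any commitment


def decide(policy, embedded_names):
    """Apply a repository policy to the names of files embedded in one document."""
    names = list(embedded_names)
    if not names:
        # Nothing embedded: every policy accepts the document as-is.
        return IngestDecision(accept_document=True)
    if policy is EmbeddedFilePolicy.DISALLOW:
        return IngestDecision(accept_document=False)
    if policy is EmbeddedFilePolicy.EXTRACT_AND_APPRAISE:
        return IngestDecision(accept_document=True, to_appraise=names)
    return IngestDecision(accept_document=True, uncommitted=names)


if __name__ == "__main__":
    decision = decide(EmbeddedFilePolicy.ACCEPT_NO_COMMITMENT, ["claim.xml"])
    print(decision)
```

Under these assumptions, ACCEPT_NO_COMMITMENT corresponds to the default suggested above when no depositor agreement exists.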
We recognize that there are scenarios where PDF/A-3 makes excellent sense. One scenario is that of "hybrid archiving" in which a document originates and is edited in a word-processor, but is managed in a document management system as a PDF/A-3, with the source word-processor file embedded within the PDF document. In this scenario, the document is considered archive-ready throughout its lifecycle, and it is reasonable to consider the embedded source file as non-archival.
Other scenarios where use of PDF/A-3 seems appropriate are where a document is essentially a record of a transaction between two stages in an information workflow. Examples of such transactions are invoices and court filings. Such a transaction requires a format that preserves the visual content of a document across devices and operating systems (a primary objective of PDF/A), but may also benefit from a machine-processable representation of part or all of the content. This could be an XML document using a standard or well-known schema. {Examples: German invoices, US Bankruptcy Court filings}
Yet another scenario is from the Brazilian government, where the XML version is the official version, but one potential dissemination format is PDF/A-3 with the XML master embedded.