NDSA:March 29, 2013 Call

Revision as of 18:19, 29 March 2013


Agenda

Discussion with Stephen Levenson, an IT Specialist for Policy and Planning at the Office of the US Courts and the chair of the PDF/A working group.

Participants

Don Chalfant, Kate Murray, Sheila Morrissey, Kevin DeVorsey, Chris Dietrich, Carl Fleischhauer, Stephen Levenson, Butch Lazorchak

Meeting Notes

A rough transcription of our conversation.

Stephen Levenson discussing how the PDF/A family of standards is being implemented by the ISO community.

Steve: PDF/A-3 doesn't necessarily replace A-1 or A-2. Should be able to use a PDF/A-1 file 30 years from now. The methodology used should not change for rendering these files in the future.

The PDF/A movement was highly influenced by manufacturers; now there is the PDF/A center, dominated by the Germans. They had many use cases for instances where creators wanted to include the original files within a PDF/A document.

Brazilian government wanted to preserve their material as XML but XML wasn't trusted by users because of complexity. Wanted to make a more presentation-ready format but didn't want to throw away the XML.

U.S. Courts, bankruptcy court, claims: when an individual goes into court and the claims are laid out, in order for someone to assert the claim, we print a document for them. That's what they bring back to court to assert their claim. Ginnie Mae started this and Mastercard is also moving on this. We get the PDF but then have to reenter the data from this document in their case management systems.

We're putting an XML output of what the claim represents inside the PDF document. Ginnie Mae's automated processes work on this XML.

With PDF/A-3 we have downstream functions that leverage the inner materials. Adobe's server product does not currently output A-3 files.

Chris: How will hidden content be protected from certain readers of the document?

Steve: For A-3 you can still include the information as "private data" that would make it hidden. A conforming reader would recognize it as A-3 and set up an additional dialog.

Caroline: We're in the position of having to preserve files that somebody else created. We need tools to characterize files. Some PDF/A-3 files may not have any embedded content so they'd actually behave like a PDF/A-2.

Steve: We'd have to talk to the developers about that. There is a vendor that is looking to create an independent service to validate the writers to ensure that they are actually complying with the standards. The software would have to get certified that it works. Then we'd actually have validators at ingestion. We have to get somebody interested in creating this software as a business and we think we finally have somebody who will do this.
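Caroline's characterization point can be illustrated with a rough sketch. It assumes the file declares its PDF/A part in uncompressed XMP metadata (the `pdfaid:part` property) and that the `/EmbeddedFiles` key is visible in the raw bytes; the function name `characterize_pdfa` is hypothetical. This only reports what a file claims. It is not a validator, and `/EmbeddedFiles` can be hidden inside compressed object streams, so a negative result is only a hint.

```python
import re

# XMP can carry pdfaid:part as an element (<pdfaid:part>3</pdfaid:part>)
# or as an attribute (pdfaid:part="3"); this pattern accepts both forms.
PART_PATTERN = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*([1-9])")


def characterize_pdfa(data: bytes) -> dict:
    """Report the PDF/A part a file claims and whether it mentions embedded files.

    Heuristic byte scan only: nothing here checks conformance.
    """
    match = PART_PATTERN.search(data)
    claimed_part = int(match.group(1)) if match else None
    return {
        "claimed_part": claimed_part,
        "mentions_embedded_files": b"/EmbeddedFiles" in data,
    }
```

A file that reports `claimed_part` 3 with no mention of embedded files is a candidate for the case Caroline describes: an A-3 file that behaves like an A-2.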

There is a DOD standard, 5015.2, that says if you're going to be a document management system you have to do certain things. The system has to be sent to a testing center at Fort Huachuca in AZ to ensure that it conforms to DOD 5015.2. This validator would do the same kind of thing.

Right now we have no validator that says a Word document is actually a Word document. There are a lot of bad writers out there.
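The weakest form of the check Steve describes, whether a file is at least the type its extension claims, can be sketched by comparing leading bytes against well-known signatures. This is an illustration, not the certified validator discussed on the call: the `SIGNATURES` table is deliberately small and incomplete (MP3 files, for example, can start with other frame headers), the function name is hypothetical, and a matching signature says nothing about whether the writer produced a well-formed file.

```python
from pathlib import Path

# Leading-byte signatures for a few formats; an incomplete, illustrative table.
SIGNATURES = {
    ".pdf": [b"%PDF-"],
    ".doc": [b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"],  # OLE2 compound file
    ".docx": [b"PK\x03\x04"],                        # ZIP container
    ".mp3": [b"ID3", b"\xff\xfb"],                   # ID3 tag or one common frame header
}


def extension_matches_content(path):
    """Return True or False if the leading bytes agree or disagree with the
    extension, or None when the extension is not in the table."""
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None
    with path.open("rb") as handle:
        head = handle.read(16)
    return any(head.startswith(signature) for signature in expected)
```

A repository could run a check like this at ingest to flag obvious mismatches for review, while leaving real conformance checking to a proper validator.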

Kevin: We're working on the policy and guidance side. We ask people to keep temporary, permanent and non-record material separate from each other. Does PDF/A-3 run counter to that? Might it encourage people to mix record and non-record material together in the same file?

Steve: If there's a relationship, don't you need that for provenance information?

Kevin: We need to educate our folks.

Steve: We've been dealing with current technologies on these things and who knows what technology will afford us in the future. We'll be able to hedge our bets.

Sheila: The question to me is what is the relationship between the PDF/A-3 container and the embedded XML. That is, in the Brazilian example, which one has the force of law? And how do you ensure that they say the same thing? In Germany they're planning on using this for commerce and the validation of invoices, but processing things en masse pragmatically means that you're going to look at one or the other. What warrant is there that they're going to stay the same? The Germans said that the "embedded content" has no standing. Only the archival version has standing.

Steve: I can talk about legal. We would assume that the entire document was the legal evidence. In the case of dual content, the "best evidence" rule applies: if the company put a courtesy copy inside, then it's the main document that is their record. For example, if you requested an invoice and you received what you might see in a PDF versus XML data, then that's the evidence.

Folks coming in to a reading room. We may have to set up rules at ingestion, they could either strip it out and put the file on a diet and store the other content. We, in the committee, didn't want to dictate to preservationists how to do their job.

If NARA said they didn't want a part of A-3 then the agencies shouldn't store in A-3. In our PACER system you can pull down a PDF document but stored inside is an MP3 file that allows you to understand the provenance a little more. The PDF is the metadata around the MP3 file. These are temporary records, so it's not the same issues.

Chris: So the PDF is acting as a manifest for the MP3 file. There are no validators. If someone is processing a bunch of files into PDF and embedding an XML version, something could go wrong and you embed the wrong version of the XML. There's no way to validate that the content is correctly coordinated.
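The coordination problem Chris raises can be sketched as a field-by-field comparison between the embedded XML and the values shown on the visible page. Everything specific here is an assumption for illustration: the element names (`claimant`, `amount`), the flat XML layout, the function name, and the premise that the visible values have already been extracted from the PDF by some other tool.

```python
import xml.etree.ElementTree as ET


def find_mismatches(embedded_xml: str, visible_fields: dict) -> dict:
    """Return {field: (xml_value, visible_value)} for each field that disagrees.

    Assumes a flat XML document whose child element names match the keys
    of visible_fields (a hypothetical layout).
    """
    root = ET.fromstring(embedded_xml)
    mismatches = {}
    for name, visible_value in visible_fields.items():
        element = root.find(name)
        has_text = element is not None and element.text
        xml_value = element.text.strip() if has_text else None
        if xml_value != visible_value:
            mismatches[name] = (xml_value, visible_value)
    return mismatches


# Hypothetical claim record: the XML amount differs from the printed amount.
xml_copy = "<claim><claimant>Jane Doe</claimant><amount>1500.00</amount></claim>"
page_values = {"claimant": "Jane Doe", "amount": "1800.00"}
print(find_mismatches(xml_copy, page_values))
```

An empty result would only show that the compared fields agree. It would not settle which copy is authoritative, which is the question Sheila raises.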

Steve: Archivists are going to have to get more involved in advising creators on the types of files they create.

PDF/E and A are essentially the same; they just couldn't find an open codec for rendering 3D documents.

PDF/A is pretty much asleep now, making sure we keep up with the latest version of the PDF specification.

Caroline:


Chris Dietrich's Notes: PDF/A-3 Use Cases 20130329

NDSA PDF/A-3 work group discusses (positive) use cases with Stephen Levenson of the U.S. Courts

PDF/A has 3 versions: No versions (A-1, A-2, or A-3) will be replaced by subsequent versions, reducing the need for migration
-PDF/A-1 = ???? [didn’t capture what makes a PDF/A-1 different from a PDF]
-PDF/A-2 = PDF/A-1 + ability to embed a second PDF/A-2 copy of the same document(?) (recursive?)
-PDF/A-3 = PDF/A-2 + any file embedded
--Genesis for A-3 was to include other types like XML along with the PDF (not just PDF/A-2)
--Private Data section in A-2:
--Use Case: Brazil uses to embed other data like XML version of the file (legal docs)
--Is/can be protected from view
--Was the genesis for creating A-3 to store other data not protected from view (?)
--Allows for human- and machine-readable versions of a document in the same container
--Viewing software (i.e. Adobe Reader et al.) will not render additional content but will notify consumer that additional content exists

All three versions are PDFs, no need to distinguish between the three versions except by the presentation interface which should be designed to detect and display embedded additional content

Currently no validators to ensure that any particular file (PDF, .doc, etc.) is actually what the extension indicates or is high-quality (e.g. some PDF generators are better than others)

There is a vendor looking to create an ISO-compliant PDF validator which could be used by repositories to validate incoming PDF files (still years away)

Ability to embed XML allows both human- and machine-readable versions of a document in the same container

  Useful for automated ingest and presentation by repositories

Question arises which of the two versions (PDF or XML) is the authoritative version
-Use-case: PACER system for courtroom audio stored as MP3 and packaged in PDF/A-3
--PDF/A-3 acts as the metadata/manifest for the embedded recording
--Recordings are access copies, not high-quality originals for long-term preservation/archiving
--These are temp records, disposed of after 5 years

I think I missed another use case here…