NDSA: August 21, 2013 Meeting Minutes

Attendees

Bailey, Jefferson, Metropolitan NY Library Council
Grotke, Abbie | Web Archiving Team Lead, Library of Congress, and Co-Chair of the NDSA Content Working Group | abgr@LOC.GOV | 202-707-2833 | @agrotke
Hartman, Cathy | Associate Dean of Libraries, University of North Texas/ Co-Chair of the NDSA Content Working Group | cathy.hartman@UNT.EDU
McCain, Edward | University of Missouri | mccaing@missouri.edu
McAninch, Glen | Kentucky Department for Libraries and Archives | Glen.McAninch@ky.gov
McMillan, Gail | Virginia Polytechnic Institute and State University | gailmac@vt.edu
Moffatt, Christie | National Library of Medicine | moffattc@mail.nlm.nih.gov
Rudersdorf, Amy | Digital Public Library of America | amy@dp.la
Stoller, Michael | New York University | Michael.stoller@NYU.EDU
Taylor, Nicholas | Stanford Univ. Libraries
Wurl, Joel | National Endowment for the Humanities | jwurl@neh.gov

Agenda

Brainstorm of:

what questions we'd like to repeat from the 2011 survey
what topics/issues that were brought up in the survey that we might want to delve deeper into
new questions we might ask

Discussion Notes

Kristine Hanna (IA) couldn't join us today but submitted these comments to Abbie, which she shared with group:

I think it would be extremely helpful to see the progress organizations have made in the last two years. Sort of a "are you better or worse off than in 2011" type of polling.
I keep hearing over and over again the need to understand internal work flows, skill sets required, resources needed to initiate and sustain a web archiving program. And the answers might be different for a one person shop than with a five person team. Or they may be the same. Perhaps we could have more in depth questions around this area.

Michael found it comforting, in reading the original survey, that others were struggling with the same things they were.

Nicholas suggested we standardize language in survey to say "Collecting" rather than "capture".

The more sample policies we can gather for those starting out to refer to, the better. Gathering these and sharing them more broadly would be very useful.

What to repeat

We didn't get too much into this but talked about problematic questions that we might drop or restructure for this year. Jefferson reported that in analyzing results, respondents seemed to have trouble with questions 13-15 (those about "what subjects in your archives"). We discussed possibly reformulating these questions, but no solutions have been proposed yet. There was interest in keeping some questions about news, media and journalism. Maybe we can tie those with policy questions regarding certain types of content?

Possible New Areas To Explore

WORKFLOWS - would be helpful to learn more about what workflows people have in place for acquisition of web content. Multiple choice might be hard; descriptive open-ended comment field? (see additional workflow comment in metadata below)

STAFFING/SKILLSETS - All agreed that questions about staffing would be useful. Jefferson suggested we look at the Infrastructure WG's staffing survey to see how they asked questions about this (full time vs part time? who is doing what? what skills do they need? who is selecting what is being captured?).

METADATA - In Q24 we asked whether people create catalog records, but didn't get into more detail. Would be good to inquire about how you structure descriptions, what fields are used, what formats (MODS, Dublin Core, etc.). What data is auto-generated/extracted from archive vs. manually created or edited? What percent of auto-generated data needs to be corrected? What is the workflow for metadata creation? Are there difficulties resolving descriptions with existing standards?

METRICS - how are people counting their data? how much do they have/volume/size of data? TB, URLs archived, other metrics in use?

DATA EXTRACTION: Are people using web harvesting to extract PDFs or other content for other purposes? If so, how? We talked about the CINCH tool and UNT's PDF extraction of EOT content as examples.

QR/Post-processing: Some tools support removing data or deindexing/suppressing data that you don't want shown or is a "mistake" - are people doing this, if so, when/why? Is this a quality review workflow? Cathy will check with Brenda from UNT to see latest on QR survey. Do we want to add QR questions or look at what she's asked and responses separately? (or both?)

MISC: What policies are in place that affect social media archiving? How are people approaching it? Are your collections memento-enabled?

Expand on topics from 2011 Survey

TOOLS (Besides crawling and access): Curator tools, supplemental capture tools, other non-crawling modes of acquiring content. Full text indexing tools? Other technologies that people are using with/for their archives?

PRESERVATION METHODS - Glen found the section on downloading copies/transferring data (Q20) interesting, would like this asked again but also go into what preservation methods are used: checksums, validation of files, etc. How long do people intend to keep their web archive content? Forever, or are there other retention schedules being followed?

ACCESS: expand on questions - what are add-ons to access or UI issues. Are people linking to external collections or related non-web archive content in their displays?

COLLECTION/SELECTION POLICIES: We asked about policies before in Q8-9-10 (links to #9 responses only available to NDSA members, not in final report). How are you making decisions about frequency and depth of collections? Is it budgetary or curatorial? Are your collections related to existing (print?) collections or are they new and different things?

ROBOTS POLICIES: expand Q28 re: robots.txt to get more details. Nicholas said open-ended comments about this hint at some possible checkbox options if not respecting robots: 1) organizations own copyright 2) we seek permissions so we ignore robots 3) discretionary (if this, ask for details - why would be ignored or not).

PERMISSIONS POLICIES: Get more granularity on permissions policies - has something changed in your policies since ARL guidance issued? Are people relying on embargoes (time frame for those)? Are there other external policies that influence your approach?

RESEARCH USE/USAGE STATISTICS - a bit along the lines of researcher use in the original survey, we wonder about usage statistics that people might be gathering, how they are doing this, and what types of "hits" they are getting. We did note that Archive-It doesn't currently track this but a future release will make that easier. We discussed how Q25 was open-ended and many of the answers were "we're not sure, too soon to tell". We also talked about a question about citations - are people tracking citations to their web archives?

Next Steps

Abbie offered to compile minutes and share with listserv. Joel pointed out there are a lot of questions and we haven't agreed on dropping any old ones yet, so we'll need to figure out how best to shape this. We discussed making more questions multiple choice and fewer open-ended, which will help with compiling results at the end. Nicholas and Edward offered to help draft questions. We'll figure out where to go from here after everyone reviews these minutes. Survey will take place in October, so we have a month to prepare it.

-end-