NDSA:August 21, 2013 Meeting Minutes: Difference between revisions

From DLF Wiki

Revision as of 16:45, 21 August 2013

Attendees

  • Bailey, Jefferson, Metropolitan NY Library Council
  • Grotke, Abbie | Web Archiving Team Lead, Library of Congress, and Co-Chair of the NDSA Content Working Group | abgr@LOC.GOV | 202-707-2833 | @agrotke
  • Hartman, Cathy | Associate Dean of Libraries, University of North Texas/ Co-Chair of the NDSA Content Working Group | cathy.hartman@UNT.EDU
  • McCain, Edward | University of Missouri | mccaing@missouri.edu
  • McAninch, Glen | Kentucky Department for Libraries and Archives | Glen.McAninch@ky.gov
  • McMillan, Gail | Virginia Polytechnic Institute and State University | gailmac@vt.edu
  • Moffatt, Christie | National Library of Medicine | moffattc@mail.nlm.nih.gov
  • Rudersdorf, Amy | Digital Public Library of America | amy@dp.la
  • Stoller, Michael | New York University | Michael.stoller@NYU.EDU
  • Taylor, Nicholas | Stanford Univ. Libraries
  • Wurl, Joel | National Endowment for the Humanities | jwurl@neh.gov

Agenda

Brainstorm of:

  • what questions we'd like to repeat from the 2011 survey
  • what topics/issues were brought up in the survey that we might want to delve deeper into
  • new questions we might ask

Discussion Notes

Kristine Hanna (IA) couldn't join us today but submitted these comments to Abbie, which she shared with group:

1) I think it would be extremely helpful to see the progress organizations have made in the last two years. Sort of a "are you better or worse off than in 2011" type of polling.

2) I keep hearing over and over again the need to understand internal workflows, skill sets required, and resources needed to initiate and sustain a web archiving program. And the answers might be different for a one-person shop than for a five-person team. Or they may be the same. Perhaps we could have more in-depth questions around this area.

Michael found it comforting, in reading the original survey, that things they were struggling with, others were struggling with too.

Nicholas suggested we standardize language in survey to say "Collecting" rather than "capture".

The more sample policies we can gather for those starting out to refer to, the better. Gathering these and sharing them more broadly would be very useful.

What to repeat

We didn't get too much into this but talked about problematic questions that we might drop or restructure for this year. Jefferson reported that in analyzing results, respondents seemed to have trouble with questions 13-15 (those about "what subjects in your archives"). We discussed possibly reformulating these questions, but no solutions were proposed yet. There was interest in keeping some questions about news, media and journalism. Maybe we can tie those with policy questions regarding certain types of content?


Possible New Areas To Explore

WORKFLOWS - would be helpful to learn more about what workflows people have in place for acquisition of web content. Multiple choice might be hard; descriptive open-ended comment field? (see additional workflow comment in metadata below)

STAFFING/SKILLSETS - All agreed that questions about staffing would be useful. Jefferson suggested we look at the infrastructure WG's staffing survey to see how they asked questions about this (full time vs part time, who is doing what? what skills do they need? Who is selecting what is being captured?).

METADATA - In Q24 we asked whether people create catalog records, but didn't get into more detail. Would be good to inquire about how you structure descriptions, what fields are used, what formats (MODS, Dublin Core, etc.). What data is auto-generated/extracted from the archive vs. manually created or edited? What percent of auto-generated data needs to be corrected? What is the workflow for metadata creation? Are there difficulties reconciling descriptions with existing standards?

METRICS - how are people counting their data? how much do they have/volume/size of data? TB, URLs archived, other metrics in use?

DATA EXTRACTION: Are people using web harvesting to extract PDFs or other content for other purposes? If so, how? We talked about the CINCH tool and UNT's PDF extraction of EOT content as examples.

QR/Post-processing: Some tools support removing data or deindexing/suppressing data that you don't want shown or is a "mistake" - are people doing this, if so, when/why? Is this a quality review workflow? Cathy will check with Brenda from UNT to see latest on QR survey. Do we want to add QR questions or look at what she's asked and responses separately? (or both?)

MISC: What policies are in place that affect social media archiving? How are people approaching it? Are your collections memento-enabled?

Expand on topics from 2011 Survey

TOOLS (Besides crawling and access): Curator tools, supplemental capture tools, other non-crawling modes of acquiring content. Full text indexing tools? Other technologies that people are using with/for their archives?

PRESERVATION METHODS - Glen found the section on downloading copies/transferring data (Q20) interesting, would like this asked again but also go into what preservation methods are used: checksums, validation of files, etc. How long do people intend to keep their web archive content? Forever, or are there other retention schedules being followed?

ACCESS: expand on questions

COLLECTION/SELECTION POLICIES: We asked about policies before in Q8-9-10 (links to #9 responses only available to NDSA members, not in final report). How are you making decisions about frequency and depth of collections? Is it budgetary or curatorial? Are your collections related to existing (print?) collections or are they new and different things?

ROBOTS POLICIES: expand Q28 re: robots.txt to get more details. Nicholas said open-ended comments about this hint at some possible checkbox options if not respecting robots: 1) organizations own copyright 2) we seek permissions so we ignore robots 3) discretionary (if this, ask for details - why would be ignored or not).

PERMISSIONS POLICIES: Get more granularity on permissions policies - has something changed in your policies since ARL guidance issued? Are people relying on embargoes (time frame for those)? Are there other external policies that influence your approach?

RESEARCH USE/USAGE STATISTICS - a bit along the lines of researcher use in the original survey, we wonder about usage statistics that people might be gathering, how they are doing this, and what types of "hits" they are getting. We did note that Archive-It doesn't currently track this but a future release will make that easier. We discussed how Q25 was open-ended and many of the answers were "we're not sure, too soon to tell". We also talked about a question about citations - are people tracking citations to their web archives?