Data Collection and Contribution, Digitization Cost Calculator
Digitization is a costly business -- estimating expenses for a given digitization project, a fiscal year, or a grant application can feel disconnected from the reality of staffing, timelines, and true project costs. The Digital Library Federation’s Assessment Interest Group is developing a Digitization Cost Calculator that runs on data contributed by participating institutions. The calculator will help with digitization project planning by using contributed data to produce average estimates of costs and time for various aspects of the digitization process. You can view the wireframes for the calculator here, or use a beta version (simpler than the final version) here.
But we need your help! We can’t build the calculator unless the community contributes data. You can contribute data at any time for one or more of the data fields in the calculator.
How to collect data
Before you begin, read the definitions of processes carefully and make sure that your process conforms to the guidelines for the fields in which you want to submit data. Remember that we cannot accept data that:
- Does not conform with the DLF-AIG definitions of processes
- Does not provide both time data and the number of scans to which time data relates
- Combines time data that cannot be separated for multiple processes
- Is just a broad/loose estimate of time data rather than true data collection
Understand that all data is submitted in increments of “per 100 scans.” This is necessary for the calculator to make data comparable between institutions. Even if you are submitting data for a process that is not performed “by scan” (e.g., fastener removal, disbinding), you’ll need to note how many digital images are captured from the quantity of material on which staff were timed, so that the result can be calculated as “x time per 100 scans.”
Additionally, for everything other than image capture, you’ll need to submit data on the percentage of materials on which a process was performed. See example 2 below.
Example 1: A student is assigned to scan glass plate negatives on a flatbed scanner. After one hour, the student reported that they had scanned 20 plates.
Divide 60 (minutes) by 20 (scans) = 3 minutes per scan.
In this example, to determine per 100 scans:
3 (minutes per scan) * 100 (scans) = 300 minutes to complete 100 scans at the rate determined by your session.
Example 2: A staff person is removing staples. In 2 hours of removal, the staff person removes 40 fasteners from papers in a single box of materials. Once the box is scanned, the materials produce 150 scans.
Turn 2 hours into minutes: 2*60 = 120
Divide 120 (minutes) by 150 (scans) = 0.8 minutes per scan
In this example, to determine per 100 scans:
0.8 (minutes per scan) * 100 (scans) = 80 minutes to complete fastener removal for 100 scans at the rate determined by your session.
Additionally, you’ll report that the process was performed on 27% of materials (40 staples divided by 150 scans works out to about 27% of materials, when “scans” are used as the base material).
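The arithmetic in Examples 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not part of the calculator itself; the function names `minutes_per_100_scans` and `percent_of_materials` are made up for this sketch.

```python
def minutes_per_100_scans(total_minutes, total_scans):
    """Normalize a timed session to minutes per 100 scans."""
    if total_scans <= 0:
        raise ValueError("total_scans must be positive")
    return total_minutes * 100 / total_scans


def percent_of_materials(times_performed, total_scans):
    """Percentage of materials the process was performed on, using scans as the base."""
    if total_scans <= 0:
        raise ValueError("total_scans must be positive")
    return times_performed * 100 / total_scans


# Example 1: 20 glass plates scanned in one hour (60 minutes).
example_1 = minutes_per_100_scans(60, 20)

# Example 2: 2 hours of staple removal, 40 fasteners, 150 resulting scans.
example_2 = minutes_per_100_scans(2 * 60, 150)
example_2_percent = percent_of_materials(40, 150)

print(example_1)                  # 300.0 minutes per 100 scans
print(example_2)                  # 80.0 minutes per 100 scans
print(round(example_2_percent))   # 27 (percent of materials)
```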
Time processes deliberately, and do not include time for overhead or other activities. People using the Cost Calculator will build in their own overhead time and costs -- these shouldn’t be included in the data submissions. We would rather have you accurately time a student’s scanning pace for a single 3-hour shift than use a larger quantity of unspecific data to make an inaccurate estimate. For example, if you know that a student scanned 1,000 pages during 50 hours of shifts and they were mostly scanning when they worked, you might be inclined to submit scanning data as “20 scans per hour.” In reality this student may have spent 25% of their time doing other tasks, being out sick, answering emails, eating lunch, or going to the restroom. Submitting such inaccurate data from rough estimates can have a huge negative impact on the ability of the Digitization Cost Calculator to help others estimate time and money for projects.
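To see how much a rough estimate can distort the numbers, here is a small Python sketch of the student example above. It assumes, purely for illustration, that exactly 25% of the 50 hours went to non-scanning activities; the function name and the `non_scanning_fraction` parameter are hypothetical.

```python
def scanning_minutes_per_100(total_scans, shift_hours, non_scanning_fraction=0.0):
    """Minutes per 100 scans after removing time not spent on the process."""
    scanning_hours = shift_hours * (1 - non_scanning_fraction)
    return scanning_hours * 60 * 100 / total_scans


# Rough estimate: treat all 50 hours as scanning time (20 scans per hour).
rough = scanning_minutes_per_100(1000, 50)

# Assumed reality: 25% of the time went to other activities.
timed = scanning_minutes_per_100(1000, 50, non_scanning_fraction=0.25)

print(rough)  # 300.0 minutes per 100 scans
print(timed)  # 225.0 minutes per 100 scans
```

Under that assumption, the rough figure overstates the time needed for 100 scans by 75 minutes.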
Methods of timing
- Have staff write down a beginning and end time for a shift or stretch of time that includes only one process (e.g., intellectual property review). They will also write down the number of times the process was performed and the number of scans the process was performed on -- or, if it is a pre-scanning process, the quantity of material. The “number of scans” is then determined later from the number of scans this material produces. For pre-scanning processes, it is easiest to time based on concrete quantities, e.g., an entire collection or a set of boxes.
- Give staff stopwatches (or use stopwatch apps on smartphones) to collect data for short periods of defined work. This is particularly useful for processes that are performed intermittently, e.g. disbinding, supporting, flattening, quality control, alignment/rotation -- these may be performed in between bouts of image capture. In such cases, the ability to easily stop and start a stopwatch instead of keeping track of many start and stop times is very useful. You’ll just need to also make sure you’re marking down the number of times you conduct the process and how many total scans were produced per that number of occurrences.
- Time a process 20 times, then create an average. From then on, just count the number of times you perform the process and at the end, multiply the number by the average time you previously calculated. This method lends itself well to a process that is difficult to time, or that once timed, typically takes the same amount of time again. For example, if you typically perform image cropping after each scan instead of reviewing and cropping a large number of scans at a single time, you might time yourself with a stopwatch 20 times, see that the times are all pretty similar, take the average, and then just mark how many times you perform the process. If it takes an average of 10 seconds to crop an image and you performed it 500 times over a month on a collection of 700 scans, you’ll get your time data by multiplying 500*10, converting it to minutes and then dividing it by 7 to get minutes per 100 scans.
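The averaging method in the last bullet can be sketched as follows. The 20 sample timings are invented for illustration (they average 10 seconds, matching the cropping example above), and the function name is hypothetical.

```python
from statistics import mean


def minutes_per_100_from_samples(sample_seconds, times_performed, total_scans):
    """Estimate minutes per 100 scans from a timed sample and a simple count."""
    average_seconds = mean(sample_seconds)
    total_minutes = average_seconds * times_performed / 60
    return total_minutes * 100 / total_scans


# Hypothetical stopwatch readings (in seconds) for 20 timed crops.
samples = [9, 11, 10, 10, 12, 8, 10, 10, 9, 11,
           10, 10, 11, 9, 10, 10, 12, 8, 10, 10]

# Cropping performed 500 times on a collection of 700 scans.
result = minutes_per_100_from_samples(samples, times_performed=500, total_scans=700)

print(round(result, 1))  # 11.9 minutes per 100 scans
```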
How are other people doing it?!
Some folks who have already contributed data have provided some helpful tips.
- Watch this short video (coming soon) from Northwestern University about how they collect data
- ... we hope to have more examples soon, let us know if you'd like to contribute something!
How to submit data
Data is submitted via an online form, which you can preview before beginning the data submission process. If you have questions, you can contact firstname.lastname@example.org. In order to submit data, you’ll need to know: which of the calculator fields you are submitting data for (e.g., fastener removal, image stitching, image capture); your time data and the number of scans it pertains to (e.g., 200 minutes to perform process X for 100 scans); and what percentage of materials the process was performed on (for everything other than image capture).
Get your data ready for the submission form: We highly recommend reviewing the preview of the data submission form before beginning data submission.
For descriptive metadata creation and quality control you will have to determine which sub-category your data falls into: level 1, 2, or 3. Read the field definitions for guidelines.
For image capture data, you’ll need to be able to specify the image capture device.
For every data field except image capture, you’ll need to estimate what percentage of the overall material a process was performed on; e.g., if you are submitting data in the image “clean up/dust removal” category and it took you an hour to do dust removal required on 20 of the 100 scans, you’ll need to submit the number of minutes that process took (60), the total number of scans (100), and the percentage of scans the process was performed on (20%, or 20/100). Similarly, if you performed fastener removal and you estimate there was a staple every 5 pages or so, you’d provide the percentage as either 20% if the material is one-sided (1 staple for an average of every 5 pages, 5 scans) or 10% if the material is two-sided (1 staple for an average of every 5 pages, 10 scans).
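The percentage calculations in the paragraph above can be sketched like this. The function names are hypothetical, and the sketch assumes every page produces the same number of scans (1 for one-sided material, 2 for two-sided).

```python
def percent_performed(scans_with_process, total_scans):
    """Percentage of scans on which a process was performed."""
    return scans_with_process * 100 / total_scans


def fastener_percent(pages_per_fastener, scans_per_page):
    """Percentage for fasteners occurring once every `pages_per_fastener` pages."""
    scans_per_fastener = pages_per_fastener * scans_per_page
    return 100 / scans_per_fastener


# Dust removal needed on 20 of 100 scans.
print(percent_performed(20, 100))   # 20.0

# A staple about every 5 pages.
print(fastener_percent(5, 1))       # 20.0 for one-sided material
print(fastener_percent(5, 2))       # 10.0 for two-sided material
```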
For every data submission, you’ll also be asked to submit:
- Your name and contact information
- The dates during which data was collected (“March 2016,” “Apr 15-Sept 20 2015,” etc.)
- The type of material (manuscript, photograph, etc.)
- The time period of the materials (19th century, 20th century, etc.)
What will I need to do? You will time yourself, staff, or students as you/they perform digitization processes during a period of time of your choosing -- a week, a month, the duration of a specific project, etc. The areas in which you can contribute time data include image capture, descriptive metadata creation, quality control, various preparation processes such as condition review, rebinding, formatting, and various post-processing processes such as alignment/rotation, image cropping, and stitching.
Do I have to give you salary/benefits data? No! You only submit time data (how many minutes it took to perform a specific task per 100 scans). “Costs” are estimated using salary/benefits data that people enter when using the calculator. The calculator does not track or store cost data anywhere.
What if we don’t perform all of the tasks mentioned? That’s fine because the calculations are broken down by task – you only submit data for the specific tasks that you choose. Contributing whatever pieces of your process are trackable, in whatever increments you can track them, is still incredibly helpful!
When will I need to do it? Submissions are ongoing.
What if we track(ed) time data for process X and process Y smushed together in one number? Unfortunately, we cannot use data that combines multiple processes -- your time data contributions will have to be separate for each process you contribute data for. Alternatively, you can collect sample data for one of the processes and then use it to break the aggregate data out into separate pieces.
How will my institution benefit? You will have contributed to the creation of a freely-available tool (the Digitization Cost Calculator) that allows users to input their institution’s salary and benefits data, the amount of material being digitized, select which processes they will be undertaking, and then outputs cost and time data based on all aggregate contributed data. This tool will help many organizations in planning future projects and in articulating the true costs of digitization projects.
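As a rough illustration of the idea, the sketch below averages contributed times for one process and scales the result to a project. It is not the calculator's actual implementation: the contributed values, the hourly cost, and the function name are all hypothetical, and the sketch ignores the percentage-of-materials figure.

```python
from statistics import mean


def estimate_time_and_cost(contributed_minutes_per_100, project_scans, hourly_cost):
    """Estimate hours and cost for one process from averaged contributed data."""
    average_minutes = mean(contributed_minutes_per_100)
    hours = average_minutes * project_scans / 100 / 60
    return hours, hours * hourly_cost


# Hypothetical contributions (minutes per 100 scans) from three institutions.
contributions = [300, 240, 360]

# Hypothetical project: 2,000 scans at 20.00 per hour in salary and benefits.
hours, cost = estimate_time_and_cost(contributions, project_scans=2000, hourly_cost=20.0)

print(hours)  # 100.0 hours
print(cost)   # 2000.0
```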
Will the information I contribute be associated with my institution? Sort of. The data you submit will be aggregated by the calculator with all other data submissions and displayed as part of an average on the results screen when people use the calculator; no individual institution’s information will be discernible in the calculator. However, individual institutional data will be shown on the Notes About Data webpage, another part of the Digitization Cost Calculator website. This allows calculator users to get a feel for the wide variation in time and in practice from institution to institution and project to project. Seeing the data apart from the aggregate average can also be helpful if a user feels their institution is more similar to one or more other institutions in the list, and allows them to calculate custom time estimates. The time period over which the data contribution was collected will also be displayed on the Notes About Data page.
What if I have some historical digitization data to contribute now? Great, we’d love to have your historical data! Submit your data using the Data Submission Form, or send an email to email@example.com, subject line Cost Calculator, to learn more.
What if my historical data is in a different format than the cost calculator data submission requirement? Email firstname.lastname@example.org, and we will help migrate the data into the right format. You will have to know how many scans are associated with each block of time; if there is no way to go back and get that information, we cannot accept the time data, because we would not know what quantity of material it covers.
What if we track(ed) only a part of the data you are looking for in our digitization workflow? That’s fine, and still very valuable. You can contribute just one piece of data -- you don’t have to have all the fields represented in the calculator.
I have more questions! Please feel free to contact Joyce Chapman with any additional questions about the project, being a contributor, or using the calculator: email@example.com, 919-660-5889.