NDSA:David Rosenthal

I'm excited to launch a new series, Insights, for the blog. Insights will feature interviews and conversations between members of the National Digital Stewardship Alliance Innovation Working Group and individuals working on projects related to the preservation, access, and stewardship of digital information. We are thrilled to kick off this series with an interview with eminent computer scientist and engineer David Rosenthal.

What do you think are one or two of the toughest problems we have already solved in digital stewardship? Further, what do you see as the most important aspects of those solutions?

I'm an engineer. Engineers never really get to completely solve a problem the way mathematicians do. There are never enough resources; engineers are trained to look for high-leverage opportunities. We work on something until it is good enough that improving it further isn't the best use of resources right now. Maybe, after we have spent a while improving whatever became the new best use of resources, the first thing will again be the best use of resources, and we go back to it.

Much of the resources of digital preservation have been devoted to preparing for formats to go obsolete and require migration. I've been arguing for a long time that this isn't a good use of resources. Before the web, formats used to go obsolete quickly, but since the advent of the web in 1995, it is very hard to find any widely used format that has gone obsolete. The techniques we have had for a long time, such as open source renderers and emulation, work well enough to cope with format obsolescence if and when it eventually happens; further work in this low-leverage area is a waste of resources.

The most important aspect of the way that format obsolescence became not worth working on is that digital preservation had nothing to do with it. Formats stopped going obsolete for very fundamental reasons, not because digital preservation prevented it. Virtual machines and open source became part of mainstream IT for reasons that had nothing to do with preservation.

Do you think there are lessons from those solutions that we should be applying to solve current pressing problems?

We need to look for high-leverage areas and devote our limited resources to them.

For example, very few of the resources of digital preservation have been devoted to the high-leverage area of reducing the cost of achieving the storage reliability we need. People seem to think that, left to its own devices, the industry will produce what we need, but they are wrong. Current technologies are not nearly reliable enough and they're way too expensive. Long-term digital preservation is a very small part of the total storage market; industry's focus is on reducing the cost of the kind of storage that addresses much bigger markets, and those markets have much lower reliability needs.
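
To make that concrete, here is the sort of back-of-envelope calculation (my illustration, with assumed numbers, not taken from the interview) that shows how far mass-market reliability falls short: how reliable would bits have to be to keep a petabyte for a century with even odds of losing nothing at all, assuming independent bit failures at a constant rate?

    import math

    # Illustrative assumptions: every bit fails independently at a constant
    # hazard rate, so each bit survives time t with probability exp(-lam * t).
    PETABYTE_BITS = 8e15      # 10^15 bytes * 8 bits per byte
    MISSION_YEARS = 100.0     # keep a petabyte for a century
    TARGET_SURVIVAL = 0.5     # even odds that not a single bit is lost

    # P(no bit lost) = (exp(-lam * t)) ** n_bits >= TARGET_SURVIVAL;
    # solve for the largest tolerable per-bit failure rate lam.
    max_lam = -math.log(TARGET_SURVIVAL) / (PETABYTE_BITS * MISSION_YEARS)

    # Express it as a "bit half-life" to compare with familiar time scales.
    bit_half_life_years = math.log(2) / max_lam

    print(f"max per-bit failure rate: {max_lam:.2e} per year")
    print(f"required bit half-life:   {bit_half_life_years:.2e} years")
    # Roughly 8e17 years, tens of millions of times the age of the universe;
    # no vendor can certify anything like that by testing.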

Digital preservation resources should be devoted to systems engineering research into better ways of combining mass-market components into extremely reliable storage systems, and into better ways of modeling these systems so this research can produce results in useful time-frames.
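
As one small example of the kind of modeling meant here (a toy sketch with invented parameters, not a description of any real system), the snippet below estimates how the chance of losing an object over a long mission falls as replicas are added, when failed replicas are repaired from the survivors at each audit:

    # Toy model, illustrative assumptions only: k replicas of an object sit on
    # independent mass-market drives; each replica fails within a year with
    # probability p_annual; every audit_years the system checks all replicas
    # and re-creates any failed ones from the survivors. The object is lost
    # only if every replica fails within the same audit interval.

    def loss_probability(k_replicas, p_annual, audit_years, mission_years):
        # Per-replica failure probability within one audit interval,
        # assuming a constant hazard rate.
        p_interval = 1.0 - (1.0 - p_annual) ** audit_years
        # All k replicas fail in the same interval: unrecoverable loss.
        p_loss_per_interval = p_interval ** k_replicas
        n_intervals = int(mission_years / audit_years)
        # Either survive every interval, or lose the object at some point.
        return 1.0 - (1.0 - p_loss_per_interval) ** n_intervals

    # Example: 4% annual drive failure rate, quarterly audits, 100-year mission.
    for k in (2, 3, 4, 5):
        p = loss_probability(k, p_annual=0.04, audit_years=0.25, mission_years=100)
        print(f"{k} replicas: P(loss over 100 years) ~ {p:.2e}")

Adding replicas or auditing more often both buy reliability, and both cost money; finding the cheapest combination that meets a reliability target is exactly the systems engineering and modeling work described above.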

What do you think are currently the most pressing problems that need to be addressed in digital preservation? Further, to what extent do you think we are addressing these problems?

Each of the vast range of types of digital content that we should be preserving has its own set of problems. There are no one-size-fits-all approaches.

For the types of content I work with, e-journals, government documents and other web-published content that used to be preserved on paper, the pressing problems result from the intersection of copyright law, diversity, and risk. What we are urged to preserve is large, uniform and thus technically easy, and not at any significant risk of loss. We are told preserving Elsevier's journals is a top priority, but the probability of access to them being lost in the foreseeable future is minuscule. The content that is actually being lost right now, such as small humanities journals, is small, diverse and thus hard. It is hard first because high-quality preservation requires permission from the publisher, so we must track publishers down and convince them to grant it, and second because their web sites tend to lack consistent structure and to change rapidly. Both mean that the staff time per byte preserved is high. But that cost should be measured against the risk of losing the content. This is an area we're hardly addressing at all, because we're all measured by the size of our collections, irrespective of the risks we're protecting them from.

At the other extreme, preserving huge amounts of data for a long time, as for example the Internet Archive or the Protein Data Bank, poses a different dilemma. It is very difficult to achieve the necessary level of reliability cheaply enough to be funded adequately over the long haul. This is an area that is getting a lot of attention, with a Blue Ribbon Task Force. But the attention seems to take the cost as a given and focus on how to raise the money, instead of looking at cost/reliability trade-offs.

LOCKSS is one of the most successful tools in digital preservation infrastructure. What would you say are a few of the key lessons, from both the history of the software's development and the history of the LOCKSS Alliance, for anyone working on building tools to support digital stewardship?

The LOCKSS program started by focusing on a small, well-defined part of the overall problem of digital preservation. We analyzed that part carefully, including its legal, organizational and user interface aspects, designed a system that aimed to be as cheap as possible, prototyped the entire system, and then tried it with real content, real libraries and real publishers. If we had tried to solve the whole problem we would have failed.

The next stage was to throw the prototype away, then design and implement a real system based on the lessons from the prototype. At this stage we knew which functions were essential; everything else could be left out or postponed. One important thing we did at this stage was to take the security of the system seriously, spending time wearing black hats and figuring out how to attack it. This turned out to be a significant research problem, requiring several years' work at Sun Labs and then at Stanford's CS department.

Our focus on squeezing cost out of the system was critical. It enabled the LOCKSS program to achieve economic sustainability soon after going into production, and to maintain it during difficult economic times. Once we had a working system that was cost-effective, people found uses for it that we'd never thought of.

If you had one piece of advice or word of warning for someone getting involved in developing tools for digital preservation, what would it be?

Focus on the total cost of ownership through time. No one has enough money to preserve everything that should be preserved.