DIGITAL documents can quickly become unreadable, as anyone who has tried to open an old WordStar file or postponed transferring data from a 5 1/4-inch floppy knows.
Disks decay, or the required software changes, or the necessary hardware and operating systems no longer exist.
During the last decade, a growing number of librarians, archivists and researchers have turned to the challenge of long-term preservation of digital documents, debating ways to conserve the information embedded in them so that it can be understood in the future just as it is understood today.
At present, the basic, imperfect approach is to update documents constantly, converting them from their original versions into newer ones while it is still possible to run the old software.
But this is a labor-intensive process and in many cases gradually leads to corrupted documents, because each time the files are updated they may lose some of the stored information.
The alternative, also widely practiced, is to keep old files and hope that some software in the future will be able to decipher them. Chances are, though, that in 2040 there won't be a way to understand, for example, those old PDF documents. Acrobat Reader, the present means of reading them, will probably no longer be in use, and even if you save a 2002 version, it will be unlikely to run on computers of 2040.
What is needed, some archivists argue, is a kind of computer Esperanto — a common preservation system that can read and present today's formats and the thousands that will follow in a simple, standard way that can be emulated or mimicked on whatever computers lie ahead.
Now, Dr. Raymond Lorie, a researcher at the I.B.M. Almaden Research Center in San Jose, Calif., has proposed a system that he hopes will become that lingua franca. He has developed a prototype for a "universal virtual computer" — a system with architecture and language designed to be so logical and accessible that computer developers of the future will be able to write instructions to emulate it on their machines.
Dr. Lorie defined and described his universal virtual computer in a series of technical papers in the last few years and demonstrated the system for the National Library of the Netherlands.
For the universal computer to work, it would first have to be adopted as a standard throughout the computer industry. Developers of new software with new file formats would need to write additional software that could read and display the files in the language of the universal computer. At the same time, descriptions of the universal virtual computer would need to be widely available for future computer developers.
Then, assuming that the universal computer is simple and logical enough, people 100 years from now using different computer architectures would face only one relatively basic task to read old formats on new machines — write a set of instructions so the universal virtual computer could be emulated on whatever machines exist then.
Emulation is a common computer technique in which one computer acts like another — for instance, code is written for a Mac that mimics in every detail the operations of a PC so that programs written for a PC will run on a Mac.
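To make the idea of emulation concrete, here is a minimal sketch in Python of a program that imitates a very small imaginary machine. The four-instruction set (LOAD, ADD, OUT, HALT) and the function name `run` are invented for this illustration and are not taken from Dr. Lorie's design. The point is only that a simple, fully specified machine can be imitated with a short loop.

```python
# Hypothetical toy instruction set, invented for illustration only.
# It is NOT the instruction set of the universal virtual computer.
def run(program):
    """Interpret a list of instruction tuples and return the values output."""
    registers = [0] * 4   # four general-purpose registers
    output = []
    pc = 0                # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":          # LOAD reg, value: put a constant in a register
            registers[args[0]] = args[1]
        elif op == "ADD":         # ADD dest, src: dest = dest + src
            registers[args[0]] += registers[args[1]]
        elif op == "OUT":         # OUT reg: record the register's value
            output.append(registers[args[0]])
        elif op == "HALT":        # HALT: stop the machine
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1
    return output


program = [
    ("LOAD", 0, 2),
    ("LOAD", 1, 40),
    ("ADD", 0, 1),
    ("OUT", 0),
    ("HALT",),
]
print(run(program))  # [42]
```

Any future computer that can run an equivalent of this loop can run every program written for the toy machine, which is the property the universal virtual computer is meant to have on a larger scale.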
In his approach, Dr. Lorie said, a program written for the universal virtual computer extracts all the data stored in a file, for instance, the data in a PDF file. This program does not try to reproduce the full range of services offered by Acrobat Reader.
"I don't need to recreate Acrobat Reader with all its buttons and colors," he said. "That would be overkill." Users of the future, he said, will want to see the document and have access to the data. "They will take the data and store it, probably in a completely different way."
Dr. Lorie's program reads and displays the contents of the PDF file using tags, extra semantic information designed to reduce the confusion of people in 2040 who may at first be unsure of what they are viewing. These semantic tags might say, for instance, "There is text in this document and it is organized like this," he explained.
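As a rough illustration of what tagged output might look like, the Python sketch below labels each piece of extracted content with a tag that says what it is. The tag names, the function `tag_document` and the sample text are all made up for this example; the article does not describe the actual tags Dr. Lorie uses.

```python
# Hypothetical tag names and sample data, invented for illustration only.
def tag_document(title, paragraphs):
    """Return (tag, value) pairs describing the logical content of a document."""
    elements = [
        # A first element tells the reader how the rest is organized.
        ("structure", "a title followed by numbered paragraphs of text"),
        ("title", title),
    ]
    for number, text in enumerate(paragraphs, start=1):
        elements.append((f"paragraph-{number}", text))
    return elements


extracted = tag_document(
    "Annual Report",
    ["Revenue grew this year.", "Costs were flat."],
)
for tag, value in extracted:
    print(f"<{tag}> {value}")
```

A reader in 2040 would see the data together with a plain description of its organization, without needing the program that originally displayed the file.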
Dr. Lorie has successfully tested the key parts of his universal computer, proving that it will work in the future, said Dr. Robin Williams, associate director of research at Almaden. To do this, Dr. Lorie first wrote a program in the universal computer language that could read and display the contents of a PDF file. Then he wrote programs to show how his universal computer system could work on computers with different architectures, Dr. Williams said.
Johan Steenbakkers, director of information technology for the Dutch national library, which hired I.B.M. to investigate a way to preserve electronic publications, said Dr. Lorie's virtual computer had been successfully demonstrated there. "We have seen a proof of concept," he said. "If the universal virtual computer became a standard for digital archiving, it would be a major step forward," he added, because it would offer a controlled, one-time migration to a specific preservation format.
Meanwhile, Jeff Rothenberg, a senior computer scientist at the RAND Corporation in Santa Monica, Calif., who raised the problem of long-term preservation of digital documents in an influential Scientific American article in 1995, takes a different approach to preservation.
Mr. Rothenberg wants archivists to preserve the original software — meaning, for instance, all of the functions of Adobe Acrobat — rather than adopting the data extraction program that Dr. Lorie proposes.
"I would prefer to store documents in their original forms and formats — with all of the software that created them and is typically required to view them," he said. The original software would be run under emulation on future computers. "This is the only reliable way to recreate a digital document's original function, look and feel," he said.
Data extraction, in contrast, is too limited, he said. "It will give you the contents — or rather, what someone thought were the meaningful core contents — in some future form," he said. "But it won't preserve the original."