
Feature: Open Source

Linux to help the Library of Congress save American history

By Michael Stutz on March 28, 2007 (8:00:00 AM)


The Library of Congress, where thousands of rare public domain documents relating to America's history are stored and slowly decaying, is about to begin an ambitious project to digitize these fragile documents using Linux-based systems and publish the results online in multiple formats.

Thanks to a $2 million grant from the Sloan Foundation, "Digitizing American Imprints at the Library of Congress" will begin the task of digitizing these rare materials -- including Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin. According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an "absolutely critical" role in getting the job done.

The main component is Scribe, a combination of hardware and free software. "Scribe is a book-scanning system that takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable," says Kahle.

While previous versions ran on both Linux and Windows, the Internet Archive has since migrated Scribe entirely to Linux and dropped Windows support. Kahle says the project now uses Ubuntu.

When asked why the Library of Congress chose Scribe for this project, Dr. Jeremy E. A. Adamson, the library's director for collections and services, replies that the Internet Archive has already demonstrated "the efficient production of high-quality images" with it.

Kahle says that a Linux-based Scribe workstation at the Library of Congress will hold the material to be scanned in a V-shaped cradle -- it doesn't crack books all the way open -- while two cameras take images of it. A human operator performs quality assurance, then Scribe sends the digital images across the breadth of the country to the Internet Archive in San Francisco, where they are processed and eventually posted online in various formats. Free software is used almost every step of the way.

"[It's a] Linux-based station out there in the field. It rsyncs the files up to the servers, [and then] it goes and does the processing on a Linux cluster of over 1,000 machines, and then posts it online -- also on Linux machines," Kahle says.

Image processing for an average book takes about 10 hours on the cluster, and while the project still uses proprietary optical character recognition (OCR) software, Kahle says that many open source applications come into play, including the netpbm utilities and ImageMagick, and the software performs "a lot of image manipulation, cropping, deskewing, correcting color to normalize it -- [it] does compression, optical character recognition, and packaging into a searchable, downloadable PDF; searchable, downloadable DjVu files; and an on-screen representation we call the Flip Book."
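A rough sense of that cleanup pass can be had from ImageMagick alone. The sketch below is not the Archive's pipeline -- the file names are invented, and the first command fabricates a stand-in "scan" so the example runs anywhere -- but deskewing, cropping, contrast normalization, and compression are all standard ImageMagick operations:

```shell
# Fabricate a small stand-in "page scan": a dark block on a white
# background (hypothetical file names throughout).
convert -size 200x300 xc:white -fill gray20 \
    -draw "rectangle 40,40 160,260" page-0001.tif

# One cleanup pass: -deskew straightens a slightly rotated scan,
# -fuzz/-trim crop away the background border (+repage resets the
# canvas afterward), -normalize stretches the tonal range, and
# -quality 85 sets the JPEG compression level.
convert page-0001.tif -deskew 40% -fuzz 10% -trim +repage \
    -normalize -quality 85 page-0001.jpg
```

(With ImageMagick 7 the entry point is `magick` rather than `convert`, though most distributions keep the old name as an alias.)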

The Flip Book is used at The Open Library, a charmingly retro Web interface for online books that mimics old technologies (clicking "Details" for a title brings up a yellowed card catalog entry), which the Internet Archive says was "inspired by a British Library kiosk."

The books are stored in the PetaBox, which is the Internet Archive's massive million-gigabyte storage system -- a system that Kahle says is "all built on open source software."

Caring for brittle books

A good number of the historic materials in question are old, fragile, and in such rough shape that placing them in Scribe's cradle, or even attempting to read them, could irreparably damage them. Some of the books, for example, have pages "that have become brittle with age," Adamson says; while the materials are in a broad range of conditions that limit their physical handling, he uses the general term "brittle books" to describe them. No list of such brittle materials at the Library of Congress has been made, but Adamson says that "they comprise a percentage of virtually every collection." The project's objectives, he adds, include developing a more formal classification and description of these "brittle" materials and using it to "establish digitization workflows based on that classification of condition."

If scanning the brittle materials demands new software and digitization techniques, the Library of Congress will work in conjunction with the Internet Archive to make the innovations available to the public. But there's no way to know at this point what they may be, because the project is only getting underway.

"The project proposal calls for months of planning before any scanning or engineering is to begin," Adamson says. And the planning, he says, is "significant": "Space needs to be prepared to accommodate the physical scanning of books, server storage allocated, project plans need to be written, project team members briefed, along with myriad other details required for a project of this magnitude and complexity."

Eventually, Adamson says, when the scanning and processing of materials has been completed, the high-quality digitized versions of these historic documents (and metadata associated with them, such as indices and contents) will be freely accessible online -- which Kahle says is a "huge step" in broadening the reach of the ever-too-small public domain.

"There may be public domain books that are sitting on shelves, but if you can't get access to [something], what good does it do to be in the public domain?" says Kahle. "The Library of Congress is dedicated to keeping [these digitized holdings] public domain, which I think is a great step that's not being followed by everybody else."

The program is part of larger efforts both at the Library of Congress, which is working to preserve old media and records, and at the Internet Archive, which is already scanning public domain materials through its Open Content Alliance, a consortium of about 40 libraries. Kahle says the alliance currently operates in five cities, using Scribe, at a brisk clip of 12,000 books a month.

"We're part of the 'open world' through and through -- we use open source software, we generate open source software, we generate open content," says Kahle. "We're trying to take this open source idea to the next level, which is open content and open access to cultural materials, which means 'publicly downloadable in bulk.' I think we're really seeing the next level up of this whole movement -- we had the open network, then open source software, now we're starting to see open source content."



Comments on Linux to help the Library of Congress save American history

Note: Comments are owned by the poster. We are not responsible for their content.

LOL @ the dramatic title

Posted by: Anonymous Coward on March 29, 2007 07:12 AM
Sorry but that's a stretch!



Posted by: Anonymous Coward on March 30, 2007 06:06 AM
The Constitution badly needs to be saved; it seems it has been forgotten. :(


That's all very sweet, but what about the data?

Posted by: Anonymous Coward on March 30, 2007 07:13 PM

It's sweet of them to use Linux, but what about the data? Is it being stored in open formats with freely available specifications that anybody can implement without having to pay royalties? I followed a few links, and while I found some PR speak that sounds pretty good, I found no specifics regarding this question. If they were a private institution working with their own data, they could do as they like. Because they're paid with tax dollars and archiving what is, sooner or later, the public's data, we've a right to have that data in an open format.

I hope they fulfill their 'publicly downloadable in bulk' promise. The Open Archives Initiative FAQ (an affiliated organization) says "any advocate of “free” information recognize that it is eminently reasonable to restrict ... defamatory misuse of information". That seems a strange thing to put into a FAQ. Let's hope they leave it to the courts to decide what's "defamatory misuse".

Karl O. Pinc


Re:That's all very sweet, but what about the data?

Posted by: Anonymous Coward on March 30, 2007 09:47 PM
DjVu is an open format -- the specification is publicly documented.


Re:That's all very sweet, but what about the data?

Posted by: Anonymous Coward on March 31, 2007 01:16 AM
Go to the Scribe project's site and you will find samples of the output. Scribe outputs PDF, PS, and TXT, so of course there's nothing proprietary there.


What about google

Posted by: Administrator on May 17, 2007 11:47 PM
I thought Google was digitizing so much stuff -- I think they can take care of this ;-)


Gutenberg Project

Posted by: Anonymous [ip:] on September 07, 2007 07:08 PM
I'm hoping the Gutenberg Project picks up all of the public domain books that come out of this. I like that site.


This story has been archived. Comments can no longer be posted.
