
Linux.com

Feature

Google's Tesseract OCR engine is a quantum leap forward

By Nathan Willis on September 28, 2006 (8:00:00 AM)


The open source optical character recognition (OCR) landscape got dramatically better recently when Google released the Tesseract OCR engine as open source software.

The Tesseract code was written at Hewlett-Packard in the 1980s and '90s. In 1995, it was one of the top-tier performers at UNLV's OCR competition, but when HP withdrew from the OCR software marketplace, the code languished. Then in 2005, HP handed off the code to UNLV's Information Science Research Institute (ISRI), an academic center doing ongoing research into OCR and related topics. ISRI discovered that original Tesseract developer Ray Smith was now an employee at Google, and asked the search engine giant if it was interested in the code. Google spent a few months updating the code to compile on modern operating systems, and released it on SourceForge.net.

You can download the latest tarball, a bugfix release numbered 1.0.1, from the Tesseract OCR project page. The only compilation instructions are those in the release notes section of the SourceForge.net download page. Instructions are listed for Windows, Mac OS X, and Linux, all for the same source code. Compilation under Linux is straightforward -- run ./configure followed by make -- but there is no make install step. In fact, you must move the resulting tesseract binary into its parent directory, where it expects to find a support directory called tessdata. Make sure the directory is writable, because Tesseract generates temporary files there while processing an image.
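Because there is no install step, it is easy to end up with the binary in the wrong place. The following Python sketch checks the layout described above. The function name is hypothetical, and it assumes that "the directory" which must be writable is the one holding the binary; adjust it if your build behaves differently.

```python
import os
from pathlib import Path


def check_tesseract_layout(root):
    """Return a list of problems found in a Tesseract 1.0.1 build directory.

    `root` is the directory the tesseract binary was moved into.
    An empty list means the layout matches the description above.
    """
    root = Path(root)
    problems = []
    if not (root / "tesseract").is_file():
        problems.append("tesseract binary not found in " + str(root))
    if not (root / "tessdata").is_dir():
        problems.append("tessdata directory not found in " + str(root))
    # Assumption: temporary files are written next to the binary.
    if not os.access(root, os.W_OK):
        problems.append(str(root) + " is not writable")
    return problems
```

Calling check_tesseract_layout(".") from the top of the unpacked source tree should return an empty list once the binary has been moved into place.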

The usage instructions are concise -- Tesseract has no switches and does exactly one thing. Just execute tesseract example.tiff outputfilename, and Tesseract will generate an ASCII file named outputfilename.txt containing the text recognized from example.tiff.
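If you want to call Tesseract from a script, a thin wrapper is enough. This is a minimal sketch, not part of the Tesseract distribution: the function names are hypothetical, and the default path ./tesseract assumes you run it from the directory that holds the binary and tessdata.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, binary="./tesseract"):
    """Build the argument list: tesseract <image> <output name>."""
    return [binary, str(image_path), str(output_base)]


def recognize(image_path, output_base, binary="./tesseract"):
    """Run Tesseract on a TIFF file and return the recognized text.

    Tesseract adds the .txt suffix to the output name on its own.
    """
    subprocess.run(tesseract_command(image_path, output_base, binary), check=True)
    text_file = Path(str(output_base) + ".txt")
    # The output is US-ASCII; replace anything unexpected rather than fail.
    return text_file.read_text(encoding="ascii", errors="replace")
```

For example, recognize("example.tiff", "outputfilename") runs the same command as above and returns the contents of outputfilename.txt.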

Currently, Tesseract recognizes only English and works only on TIFF files (black and white, 8-bit greyscale, and 24-bit color; no compression). Also, it can generate output only in the US-ASCII character set, so glyphs with accent marks or other unsupported attributes will probably be reproduced incorrectly.

Proof text

In January, I wrote an overview of free software OCR engines, concluding pessimistically that if you needed to OCR text of any length, you should consider paying a typist to transcribe it. Now, the question is: how well does Tesseract compare to the open source competition?

The difference is night and day. This is an OCR engine that actually works. For the sake of comparison, this listing is Tesseract's output on the exact same file with which I tested Kooka in the earlier review. I count seven mistakes (five spelling, one capitalization, and one punctuation) in 266 words. Discarding the punctuation mistake, because I did not count punctuation errors in my previous review, Tesseract correctly recognized 97.74% of the text.
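The percentage follows from the counts given above: six counted errors (seven minus the punctuation mistake) out of 266 words. A short Python sketch of the arithmetic, with a hypothetical helper name:

```python
def word_accuracy(total_words, errors):
    """Percentage of words recognized correctly."""
    return 100.0 * (total_words - errors) / total_words


# Six counted errors in 266 words.
print(f"{word_accuracy(266, 6):.2f}%")  # prints 97.74%
```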

Of course, unlike Kooka, Tesseract does not recognize page layout (e.g., multi-column text), so it combined the two columns into one. Nor does it share Kooka's ability to mask out non-text parts of an image, so I had to remove an illustration from the page with the GIMP.

All things considered, though, it is the success rate of the text-recognition engine that matters most. The rest is just gravy. Even without a GUI, Tesseract is more useful today than Kooka.

Google has confirmed its intention to continue developing the Tesseract code, although it does not have concrete plans. Currently the Tesseract OCR project on SourceForge has only two members, the Google engineers who work on the project part-time. I spoke with Google's Luc Vincent about the future of the project, and he listed the known shortcomings -- limited file type support, no additional languages, no page layout recognition -- as the targets for future development.

But, he said, the direction that the project takes will largely depend on the interest shown by outside programmers. Google does not have plans to develop Tesseract into a full-fledged application like Picasa or Google Earth.

Read 'em and weep

The company clearly deals in areas where OCR is important (such as the Book Search and Image Search programs), but it doesn't need a GTK or Qt app for them. Where Tesseract goes for Linux users will depend on who gets involved and actually works on the code.

Luckily, some activity has already begun. For instance, a couple of enterprising early adopters have worked up a simple script that uses ImageMagick to seamlessly convert other image formats and pass them to Tesseract, overcoming one of the software's big limitations.
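The idea behind such a script is simple enough to sketch. The Python version below is not the early adopters' script, only an illustration of the same approach: it asks ImageMagick's convert for an uncompressed, 8-bit TIFF and hands the result to Tesseract. The function names, the ./tesseract path, and the intermediate .tiff file name are assumptions.

```python
import subprocess
from pathlib import Path


def convert_command(source, tiff_path):
    """Build an ImageMagick command that writes an uncompressed 8-bit TIFF."""
    return ["convert", str(source), "-depth", "8", "-compress", "none", str(tiff_path)]


def ocr_any_image(source, output_base, tesseract="./tesseract"):
    """Convert `source` to TIFF, run Tesseract on it, and return the text."""
    tiff_path = Path(str(output_base) + ".tiff")
    subprocess.run(convert_command(source, tiff_path), check=True)
    subprocess.run([tesseract, str(tiff_path), str(output_base)], check=True)
    # Tesseract writes its result to <output_base>.txt in US-ASCII.
    return Path(str(output_base) + ".txt").read_text(encoding="ascii", errors="replace")
```

For example, ocr_any_image("scan.png", "scan") would leave scan.tiff and scan.txt next to the original image.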

The Tesseract code is under the Apache 2.0 License, which the Free Software Foundation considers incompatible with the GPL, although the Apache Software Foundation disagrees. On the licensing front, it is worth noting that the tesseract-1.0.1 tarball contains a subdirectory with third-party code called Aspirin/MIGRAINES. This code is licensed separately (as the README and other documentation make clear), under a non-free software license, but the code is not actually used by the current version of Tesseract.

Even if Tesseract 1.0.1 were to be the only release ever made from this project, it has changed the landscape of OCR for free software dramatically. I'm confident it won't be the only release -- it's just that high-quality. Do yourself a favor and check it out -- it builds quickly on Linux, and it actually works as advertised. It is lucky for us that the best GUI OCR program (Kooka) uses pluggable OCR engines, so the Tesseract code could join the current arsenal (GOCR and OCRAD) in short order, and provide free software users with an all-in-one solution.


Comments

on Google's Tesseract OCR engine is a quantum leap forward

Note: Comments are owned by the poster. We are not responsible for their content.

One very small step for a sub-atomic particle

Posted by: Anonymous Coward on September 29, 2006 02:12 AM
Wow man, a quantum leap. That is so damn small, it is really the smallest thing next to nothing, so how is that news? Someone changed a typo in a comment in the source code and recompiled?

#

Needs a lot of work

Posted by: Anonymous Coward on October 01, 2006 05:19 PM
Tesseract in its present form is unusable. It needs to be able to handle formatting correctly, and the source looks like it needs some serious maintenance to sort out 64 bit compatibility issues. It looks like it might be better just to cherry pick the good parts, clean them up, and integrate them with gocr rather than try and make this usable.

#

Possibilities...

Posted by: Anonymous Coward on September 29, 2006 03:29 AM
It seems to be a good thing. Most documents that I have are actually scanned images, and this tool could make it possible to use Beagle to search through them.

Though I wonder what license it uses? It doesn't say on the sf.net page.

#

Not Revolutionary

Posted by: Anonymous Coward on September 29, 2006 10:35 AM
Hardly a quantum leap forward; it only works with a limited color depth and resolution in a specific image format, and is limited to the EASY set of linguistic glyphs.

Wow. How revolutionary.

I think I'll go back to my default OCR engine, which may produce occasional errors but which works with all color depths, multiple resolutions, any image format, and the full set of linguistic glyphs for multiple languages.

#

Re:Not Revolutionary

Posted by: Anonymous Coward on September 29, 2006 12:10 PM
And your default OCR engine is????

#

Re:Not Revolutionary

Posted by: Anonymous Coward on September 30, 2006 04:07 PM
Don't know about the easy set of glyphs. If you want to support Chinese or Arabic you surely need a lot more work, but any document can be converted to match what this engine needs. You can even write a script for it!

#

Proprietary

Posted by: Anonymous Coward on October 05, 2006 12:34 PM
You didn't name your default OCR "engine", but since this is the best option for FOSS apps, I'm willing to bet that your OCR app isn't FOSS. That means it costs a couple hundred bucks at a minimum, and maybe thousands.



As others have pointed out, Tesseract handles quite a few formats if converted. Thanks to the ease of the command line, ease of creating scripts, and ease of GNU/Linux in general, it may take an extra step or two, maybe not even that, to convert the target file, process it and dump the result into a new file. Thanks to the command line, bash, scripting, perl, and a few other tricks, it's probably the same number of steps or fewer compared to your proprietary app, and much faster.



A final point, especially for you naysayers. Where the app in question turns out to be a useful app, where it can replace a proprietary app, where it can replace an expensive app, where it can be incorporated into a project like KDE which has developers who like to incorporate everything useful for desktop computing, it and many other apps generally receive enough attention to the point where the app very quickly becomes much better and much more efficient than the proprietary apps that perform similar tasks.



You go on using your proprietary OCR app. And paying for the privilege. And paying for updates. And paying for bug fixes. And bothering with registration keys. And bothering with dongles and other prove-your-innocence tactics. And forced upgrades.



After all, without you, where would the proprietary industry be?

#

Re:Kooka

Posted by: Anonymous Coward on October 05, 2006 12:09 PM
One of the missing apps from the GNU/Linux domain is a good OCR app. It may be needed by a small group of users, out of a small group of GNU/Linux users, but it is needed. I suspect that if the app is as good as stated, soon enough someone will incorporate it into Kooka or another GTK or other type of GUI application. My bet would be that it is picked up by Kooka or some other Kapp, since KDE developers like to make as many apps as possible use KDE libraries and functions, working with resources already loaded for KDE instead of adding more resource requirements to a running system. I also hope that this app picks up mindshare and is picked up under KDE.

#

Kooka

Posted by: Administrator on September 29, 2006 04:47 PM
It would be nice if Kooka were updated with this technology, but the project seems dead? The last change on the Kooka website is from 2004 so that doesn't look very good unfortunately.

#

Seriously handy!!

Posted by: Anonymous [ip: 84.92.225.20] on August 30, 2007 04:02 PM
Along with imagemagick, grep, a pinch of bash, and a part list - I am using Tesseract to automatically index ~10,000 pages of pdf's containing vector drawings with vector drawn part number text, resulting in a mysqldump ready for import!

It might not be very efficient but it works and it won't cost a penny! FOSS pwnz.

#




 