
Linux.com

Feature

Optical character recognition is an uphill battle for open source

By Nathan Willis on January 03, 2006 (8:00:00 AM)


If you use Linux, or another free operating system, and need optical character recognition (OCR) software, be prepared for a challenge. OCR is a tricky problem on any computing platform -- both because it is conceptually hard, and because the task does not lend itself to simple, easy-to-use interfaces.

OCR is the use of visual pattern matching to extract text from an image -- usually a scanned paper document, but it could be a digital photo, a frame of video, or a screenshot just as easily.

Proprietary platforms like Mac OS X and Windows have a few commercial choices, but even the best of them can be headache-inducing. A few years ago, while working for a non-profit organization, I had to scan in and convert old documents using OCR -- out-of-print books, magazine articles, the occasional printout of a lost digital original. We tried several different products, and found that even the best of them required thorough proofreading, numerous corrections, manual tweaking of the input images, and a whole lot of time. So it was with low expectations that I approached the task of finding a suitable open source OCR program.

Extracting text with Kooka

The best of the lot available today is Kooka, the KDE environment's default scanning application. Kooka uses the GNU project's Ocrad as its OCR engine. It supports a wide variety of scanners via the SANE library, for when you are acquiring images as you work.

Figure: Scanning with Kooka

Its documentation suggests that Kooka can use other OCR engines in addition to Ocrad, but Kooka offers only one other option through its preferences, GOCR. Despite selecting GOCR and restarting Kooka as directed, the application persisted in using Ocrad. I did verify that GOCR was installed, functional, and located in /usr/bin as Kooka expected by testing its GTK front end, gtk-ocr. Kooka just would not use it.

Kooka's controls give you a lot of leeway over the resolution, brightness, and contrast of the document you are scanning. For the best results, you want to tweak the settings to give maximum contrast, eliminating if possible specks of dust and shadows on the paper. The downside is that excessive contrast in the scan can wash out the serifs, dots, and thin strokes of the letters, making them harder for OCR software to distinguish.
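The contrast trade-off described above can be illustrated with a minimal sketch of thresholding, the step that turns a grayscale scan into pure black and white. This is not how Kooka or Ocrad implement it; the pixel values, the `binarize` helper, and the thresholds are all hypothetical, chosen only to show that a setting aggressive enough to remove a dust speck can also erase a faint serif.

```python
def binarize(pixels, threshold):
    """Map each grayscale value (0 = black, 255 = white) to ink (1) or paper (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def ink_count(bitmap):
    """Count the ink pixels in a binarized image."""
    return sum(sum(row) for row in bitmap)


# Hypothetical 3x5 grayscale patch: a dark vertical stroke (20),
# a faint serif (150), a light dust speck (200), and white paper (255).
patch = [
    [255, 20, 150, 255, 255],
    [255, 20, 255, 255, 200],
    [255, 20, 255, 255, 255],
]

gentle = binarize(patch, threshold=220)    # keeps stroke, serif, and the speck
balanced = binarize(patch, threshold=180)  # keeps stroke and serif, drops the speck
harsh = binarize(patch, threshold=100)     # drops the speck, but the serif goes too

for name, bitmap in (("gentle", gentle), ("balanced", balanced), ("harsh", harsh)):
    print(name, ink_count(bitmap))
```

In this toy patch only the middle threshold keeps the serif while discarding the speck, which is why finding the right setting usually takes a few trial scans.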

Figure: OCR with Kooka

In my case, the images I needed to OCR were already-digitized scans of the user manual for an old camera system. I chose previously digitized images in order to test all of the OCR apps I found (some of which do not perform scanning) on an equal footing. The images were of good quality, and the text was perfectly readable to start with. To adjust brightness and contrast in these images I had to open them in an external image editor, which Kooka conveniently lets you do through the Image menu.

You initiate an OCR job with OCR Image from the Image menu. Kooka presents a dialog from which you can tweak several settings, including whether the program should attempt to guess the layout of the document, and whether to enable spell-checking on the result.

Listing 1 is the result of the first pass with Kooka's OCR, in which all settings were left at their defaults. Clearly there is much to be desired; among other things a plethora of stray pixels were picked up and interpreted as punctuation marks. Listing 2 is the result of the second pass, after having cleaned up the original image considerably with the GIMP. Listing 3 began with the original text from Listing 2, corrected by the spell checker.

Correcting spelling with Kooka

Spell checking in Kooka is a manual process; the program cycles through the OCRed text and (much as in any office app) asks you about every unrecognized word, suggesting replacements when possible. Kooka uses existing spell-checking engines, such as ispell, aspell, and hspell (for Hebrew).

The problem with this approach is that none of these spell-checking engines is aware that the checked text comes from an OCR source. Word processing spelling errors are very different from OCR spelling errors; the former are the results of mistyping, the latter are the results of machine vision problems.

OCR spell-checking needs to take into account that the mistakes it is looking for stem from similarities in letter shape, confusion between capital and lowercase letters, and misplaced word breaks. Ispell and Aspell can suggest replacements based on dictionary words, but because they are built to catch typing mistakes, they are of little help with these errors.
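As a rough illustration of what OCR-aware suggestions might look like, the sketch below tries look-alike substitutions instead of keyboard-style typos. It is not taken from Kooka, Ispell, or Aspell; the `CONFUSIONS` table, the function names, and the tiny dictionary are hypothetical, and a real tool would need a far larger table tuned to the fonts being scanned.

```python
# Hypothetical look-alike pairs: (what the OCR engine emitted, what was probably printed).
CONFUSIONS = [
    ("rn", "m"), ("m", "rn"), ("1", "l"), ("0", "o"),
    ("vv", "w"), ("cl", "d"), ("I", "l"),
]


def ocr_candidates(word):
    """Return every variant of `word` with exactly one look-alike substitution."""
    variants = set()
    for seen, meant in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            variants.add(word[:start] + meant + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return variants


def suggest(word, dictionary):
    """Suggest dictionary words reachable by one look-alike substitution."""
    if word.lower() in dictionary:
        return [word]  # already a known word, nothing to fix
    return sorted(c for c in ocr_candidates(word) if c.lower() in dictionary)


# Hypothetical word list standing in for a real spelling dictionary.
dictionary = {"camera", "manual", "lens", "modern", "flash"}

print(suggest("rnanual", dictionary))  # ['manual']
print(suggest("f1ash", dictionary))    # ['flash']
print(suggest("modem", dictionary))    # ['modern']
```

The sketch handles only single substitutions inside one word; misplaced word breaks, which the article also mentions, would need a separate pass that tries joining or splitting neighbouring tokens.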

Another feature often found in commercial OCR applications is the ability to mark or exclude areas of the page from scanning. In some cases, it is possible to indicate that two or more blocks of text are connected, which can be very helpful when scanning in magazine layouts. Kooka has the beginnings of automatic layout detection via the Ocrad engine -- you can tell it to interpret the document as multi-column or to guess at more complex layouts. But, in most cases, specifying the layout yourself would be much faster.

Figure: Using Clara

Everything else

By my count, Kooka was able to correctly identify just 123 of 270 words, or 45.56 percent -- not an encouraging number. But the pickings are slim for Linux users. I did examine other applications: gtk-ocr and Clara. Both are functioning GUI programs, but neither has close to the feature set of Kooka.
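A word-level accuracy figure like the one above can also be computed automatically when a correct transcription is on hand. The sketch below is one possible way to do it with Python's standard `difflib` module, not the method used for the count in this article; the sample sentences are made up, and `word_accuracy` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, ocr_output):
    """Return (correct, total): reference words that the OCR output reproduced in order."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct, len(ref_words)


def percent(correct, total):
    """Express a count as a percentage rounded to two decimal places."""
    return round(100 * correct / total, 2)


# Hypothetical sample: a correct line and a typical OCR rendering of it.
reference = "set the shutter speed dial to B"
ocr_output = "set tbe shutter speed dia1 to B"

correct, total = word_accuracy(reference, ocr_output)
print(correct, total, percent(correct, total))

# The figure quoted in the article: 123 correct words out of 270.
print(percent(123, 270))  # 45.56
```

Aligning the two word sequences, rather than comparing position by position, keeps one dropped or split word from marking everything after it as wrong.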

Gtk-ocr is sparse in its interface and controls. The command-line client may be more functional, but for a graphically intensive task such as this I would not recommend it. Clara has a more promising feature list than gtk-ocr, but the last update to the program was a long time ago, and it uses Xlib for its interface -- a fact that may frighten away younger users, but elicit feelings of nostalgia in others with more experience.

Figure: Using gtk-ocr

A number of other projects turn up on a Freshmeat or SourceForge search for "OCR," but most are academic in nature, and not suited for end users.

Given the inherent complexity of tasks involving natural language processing (OCR, speech recognition, and machine translation, to name a few), it should come as no surprise that (a) the available tools are just marginally useful, and (b) research into the subject continues.

I hope that some of the research being conducted will find its way into a usable open source OCR application for Linux. Right now, the best alternatives for those needing to extract text from an image all involve spending money. There are a few proprietary solutions advertising Linux support (VueScan and OCRShop, for example) and proprietary OCR software for proprietary OSes -- but don't underestimate the value of paying a typist to transcribe your text the old-fashioned way.


Comments

on Optical character recognition is an uphill battle for open source

Note: Comments are owned by the poster. We are not responsible for their content.

opensource OCR

Posted by: Anonymous Coward on January 04, 2006 03:07 AM
I see nothing has changed with respect to open source OCR in the past year. I had looked into it about a year ago. What I read at the time was so discouraging that I did not even test any open source OCR. I concluded that one of the Windows-based products was the only way to go. Thanks for the insight into why it is such a difficult task. It looks more grim than I had suspected. Your accuracy statistics are really discouraging. Some of the Windows products claim 98-99% accuracy. Even that is problematic with uncommon formatting.

I am curious about those book scanning projects, where books are scanned to be made available on line. Do they just make images available as the final product? Or, do any of them try to do OCR?

#

Re:opensource OCR

Posted by: Anonymous Coward on January 04, 2006 03:38 AM
Well, quite a bit of the work is also patented. And yes, most of the book-scanning projects release images (the human is still the best OCR engine).* And lastly, I believe that there are SDKs that you can buy if you want to incorporate OCR into your product.

*There's another reason for a book-scanning project to just release images. Piracy is a bit harder.

#

Re:opensource OCR

Posted by: walt-sjc on January 04, 2006 07:56 AM
Most commercial products that claim 98-99% accuracy are lying. They MAY be that accurate when dealing with a typewritten original from an IBM Selectric or laser printed courier font document, but they SUCK at real-world documents. They generally don't handle skew well at all, don't handle boxes around text, totally lose formatting, etc.

I found it cheaper, faster, and more accurate to send the work off to India and have it triple hand-entered.

Correcting text can frequently take more time than retyping as well depending on the speed of the typist.

#

Re:opensource OCR

Posted by: Anonymous Coward on January 04, 2006 10:42 AM
"triple hand entered"

Wow! How many three handed people are there in India?

#

Re:opensource OCR

Posted by: raindog on January 06, 2006 04:20 AM
In this case, six hands would be cheaper than three.

#

Little need for it in FOSS world

Posted by: Anonymous Coward on January 04, 2006 05:15 AM
Most OCR is used as a last ditch effort to translate between incompatible file formats. That is less of an issue in the FOSS world, so few people care enough to do some development...

#

Re:Little need for it in FOSS world

Posted by: Anonymous Coward on January 07, 2006 06:57 AM
Yeah, like a sheet of paper and a hard disk. Most incompatible.

#

Unrelated

Posted by: Anonymous Coward on January 04, 2006 06:03 AM
Thank you for your article, luckily I haven't had to try scanning in Linux, and the one time I did, I couldn't get the scanner detected anyway.

My unrelated question was what DE is that? Kooka being a KDE prog, I'd think KDE, but that looks like a Clearlooks type windeco? If it's KDE, what windeco is it?

Thanks again

#

Re:Unrelated

Posted by: Anonymous Coward on January 04, 2006 08:59 AM
Plastik

#

Actually there are recent advances in Hebrew

Posted by: Anonymous Coward on January 04, 2006 08:11 AM
Thanks for the summary.


There has been good progress on the Hebrew front recently with HOCR ("Hebrew OCR"). You can see the project's home page at http://hocr.berlios.de/index.html.


The project already provides GTK- and Qt-based GUIs, Perl and Python bindings, and most importantly, RPMs and Debian packages.

#

SimpleOCR under Wine works well

Posted by: Anonymous Coward on January 04, 2006 06:31 PM
I have contributed one book to Project Gutenberg and am working on others. This needs OCR (my typing skills are not THAT brilliant).

I went through what seems to be the common experience of searching and testing all the available Linux options, but I settled (finally) on using a Windows app (SimpleOCR - www.simpleocr.com) under Wine. Both installation and use were trivial; just run it.

SimpleOCR offers both printed text and handwriting recognition. I haven't tried the handwriting bit because I don't need that, but I can recommend the printed text. It seems to work very well - I don't have metrics, but I'm happy. It is definitely far better than OCRAD at the moment.

Reading this article has nudged my conscience; although it's freeware, I'm going to go back and make a contribution somehow.

OK, it's not FOSS, but it's a very useable, Linux runnable, solution to a practical problem.

#

Re:SimpleOCR under Wine works well

Posted by: Anonymous Coward on January 08, 2006 09:55 PM
I think you should make a contribution to wine as well

#

Not as bad as all that

Posted by: Anonymous Coward on January 04, 2006 11:02 PM
Higher-resolution scans make for better OCR. When we started transcribing Unix Text Processing (http://home.alltel.net/kollar/utp/) from PDF scans a couple of years back, several of us used GOCR to bring in text. It took no more time and less effort compared to typing it in by hand. As in your experience, it requires some spell-checking and proofreading, but our results weren't nearly as bad as the listings showed.


Playing around with GOCR, I got results similar to yours with 300dpi scans. Things get much better with higher resolutions (use 2400dpi if you can get it).

#

mentalix pixel!fx/OCR runs on linux

Posted by: Anonymous Coward on January 05, 2006 03:02 AM
We ended up purchasing this. AFAIK they are the only game in town that runs on Linux. You basically write Tcl scripts, which are processed by a built-in interpreter. I wrote a script (along with some external Perl support modules) to download multi-page fax TIFFs and read barcodes on cover sheets to route the attached documents. I used EAN13 barcodes and enlarged them pretty big on the cover sheets so that, after applying filters etc. in the script, I have about 99% accuracy. The software works OK, but we found a handful of bugs during our development with it. Now that all the workarounds are in place, it has been running for a couple of years without any issues.

#

werd to ya

Posted by: Anonymous Coward on January 05, 2006 11:24 AM
sup

#

Don't use jpg as the image type... PNG?

Posted by: Anonymous Coward on January 06, 2006 05:19 PM
I would expect better results with an image with fewer artifacts. You can see the damage in the screenshots in the article.
With a text document you should get good compression with PNG and far less damage to the image. I'd like to see what your success rate is with images of reasonable quality.

#

Re:Don't use jpg as the image type... PNG?

Posted by: Anonymous Coward on January 25, 2006 02:45 AM
The fact that the author of the review chose JPEG files to test OCR accuracy clearly shows he is inept... or he has an agenda.

With clear pnm, or appropriately converted png or ps files, ocrad can achieve 100% accuracy.

#



