[CLUE-Tech] OCR & spell-checking

Matt Gushee mgushee at havenrock.com
Tue Aug 10 17:02:05 MDT 2004


Hi, all--

Well, I am in the midst of a project I alluded to in my recent scanner
questions, wherein I am recovering about 500 pages of text from a
printed book. I've done all the scans and OCR'ed them (using ocrad from
the FSF, which I found gives better results than gocr, at least for the
current material). Then I did an initial cleanup of the most predictable
and frequent errors with sed. Now I'm in the next phase, which is a
semi-automatic cleanup using aspell.
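
For the curious, each chapter goes through roughly this pipeline (the
sed rules shown are only illustrative; the real list came from
eyeballing the raw output, and only the dead-certain substitutions went
in--a mid-word 1 might equally well be an i, which is exactly why sed
can only go so far):

    # OCR one scanned page, then apply the cleanup rules
    ocrad -o page042.txt page042.pbm
    sed -f cleanup.sed page042.txt > page042-clean.txt

    # cleanup.sed: the most predictable confusions, e.g. a digit
    # wedged between lowercase letters is nearly always a misread
    # letter (0 -> o, 5 -> s)
    s/\([a-z]\)0\([a-z]\)/\1o\2/g
    s/\([a-z]\)5\([a-z]\)/\1s\2/g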

The problem is just that this is taking a lot longer than I would like.
The book has 45 chapters, and I've gotten through 22; each chapter
takes, oh, probably 40-60 minutes. And after this phase there will
still be another round of cleanup. I think part of the problem is that
ordinary spell checkers are designed mainly to deal with human errors,
whereas OCR output contains a lot of garbage characters. The real
killer is numerals and punctuation marks showing up in the middle of
words, which split a single word into separate words for aspell's
purposes. So after a pass with the spell-checker, I'll have to go
through everything again and replace the garbage (and no, I don't think
the errors are predictable enough to do any more with sed).
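
What would at least speed things up is finding the garbage quickly.
Something like this lists every suspect token with its line number, so
I can jump straight to the trouble spots instead of paging through a
whole chapter (the character class is only a first stab at what counts
as garbage, and the filename is just an example):

    grep -n '[A-Za-z][0-9]\|[0-9][A-Za-z]' chapter22.txt

That locates the errors, of course, but doesn't fix them.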

So anyway, though it's not impossible to get through this project, I
would like to find a way to speed it up.

Has anyone done anything like this? I'd be interested to know:

 - if there's an interactive spell-checker (or something similar) that
   is better than aspell at handling garbage characters. Or is there
   a way to configure aspell to handle them? (The first sketch after
   this list shows the kind of pre-filtering I have in mind.)

 - should I expect better results from the OCR? I haven't done a lot of
   OCR work, but it seems to me that the scans are about as good as
   they're going to get. All pages are quite clear to the human eye;
   there are some specks, but not many; the font is Granjon (11pt), a
   traditional book font--so, though it's a bit on the elegant side,
   it's not at all idiosyncratic. I haven't seen any documentation for
   ocrad or gocr that gives specific guidelines for the input, but based
   on general OCR info I have read, I think I've done the graphics
   right: they are black & white PBMs at 300 dpi (the second sketch
   after this list shows one way of producing such input). I gather
   that 300 dpi usually gives good results, and anything higher gives
   rapidly diminishing returns. I did a cursory test myself, with
   samples at 200, 300, 400, and 600 dpi. 400 and 600 dpi seemed to
   produce different errors in the output than 300 did, but not really
   fewer.

 - or is there a decent proprietary OCR program for Linux that might be
   significantly better at a reasonable price? The 2 or 3 that I found
   seem to cost about $1000 and up. That's way beyond my budget, but I
   could spend a hundred or two if it would make a big difference.
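
On the first question: what I really want, I think, is pre-filtering.
Map the garbage back to plausible letters before the checker tokenizes
the text, and let aspell confirm each guess. Here is a rough sketch
using aspell's Ispell-style pipe mode (aspell -a); the digit-to-letter
map is only a guess to be tuned against real output:

    #!/bin/sh
    # For each token containing a digit, try a normalized form and
    # ask aspell whether the result is a real word.  aspell -a prints
    # a version banner first, then '*' for a known word, '&' or '#'
    # otherwise; a leading '^' marks an input line as data rather
    # than a pipe-mode command.
    tr -s '[:space:]' '\n' < "$1" | grep '[0-9]' | sort -u |
    while read word; do
        fixed=`echo "$word" | tr '0135' 'oles'`  # 0->o 1->l 3->e 5->s
        verdict=`echo "^$fixed" | aspell -a | sed -n 2p`
        case "$verdict" in
            \**) echo "$word -> $fixed" ;;
        esac
    done

Spawning one aspell per candidate is slow; feeding them all through a
single aspell -a process would be the obvious refinement. But even this
crude version would turn part of the manual pass into a yes/no review.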
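
And on the graphics question: for reference, getting that kind of
input can be as simple as the following, assuming ImageMagick is
available (the threshold value is worth experimenting with):

    # 300 dpi grayscale scan -> 1-bit black & white PBM for ocrad
    convert scan042.tiff -threshold 50% page042.pbm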

Thanks in advance for any hints.

-- 
Matt Gushee                 When a nation follows the Way,
Englewood, Colorado, USA    Horses bear manure through
mgushee at havenrock.com           its fields;
http://www.havenrock.com/   When a nation ignores the Way,
                            Horses bear soldiers through
                                its streets.
                                
                            --Lao Tzu (Peter Merel, trans.)


