[CLUE-Tech] OCR & spell-checking
Matt Gushee
mgushee at havenrock.com
Tue Aug 10 17:02:05 MDT 2004
Hi, all--
Well, I am in the midst of a project I alluded to in my recent scanner
questions: recovering about 500 pages of text from a printed book. I've
done all the scans and OCR'ed them (using ocrad from the FSF, which I
found to give better results than gocr, at least for this material).
Then I did an initial cleanup of the most predictable and frequent
errors with sed. Now I'm in the next phase, a semi-automatic cleanup
using aspell.
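For concreteness, the sed pass I mean is roughly the following (the
error patterns here are made up for illustration; the real rules depend
on the font and on ocrad's particular habits):

```shell
# Illustrative only: each rule maps one recurring misrecognition
# to its correction. The patterns "Tbe" and "hrown" are invented
# examples, not actual errors from this book.
printf 'Tbe quick hrown fox.\n' |
  sed -e 's/\bTbe\b/The/g' \
      -e 's/\bhrown\b/brown/g'
```

This works fine as long as an error is consistent enough to state as a
fixed rule; the trouble starts with the unpredictable garbage below.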
The problem is just that this is taking a lot longer than I would like.
The book has 45 chapters, I've gotten through 22; each chapter takes,
oh, probably 40-60 minutes. But after this phase there will still be
another phase of cleanup. I think part of the problem is that ordinary
spell checkers are designed mainly to deal with human errors, whereas
OCR output contains a lot of garbage characters. The real killer is
numerals and punctuation marks showing up in the middle of words, which
splits them into separate words for aspell's purposes. So after a pass
with the spell-checker, I'll have to go through everything again and
replace the garbage (and no, I don't think the errors are predictable
enough to do any more with sed).
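One workaround I may try for the mid-word garbage, assuming the stray
characters come from a smallish set (digits, pipes, and the like): delete
any such character sandwiched between letters before spell-checking. The
result is usually still misspelled, but aspell then sees a single word it
can flag and suggest for, instead of two innocent-looking fragments:

```shell
# Sketch: rejoin words split by embedded digits/punctuation.
# The character class [0-9|!] is an assumption; extend it to
# whatever garbage actually turns up. "pro8lem" becomes
# "prolem" -- still wrong, but now one checkable word rather
# than the fragments "pro" and "lem".
printf 'the pro8lem is s|mple\n' |
  sed -E 's/([A-Za-z])[0-9|!]+([A-Za-z])/\1\2/g'
```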
So anyway, though it's not impossible to get through this project, I
would like to find a way to speed it up.
Has anyone done anything like this? I'd be interested to know:
- if there's an interactive spell-checker (or something similar) that
is better than aspell at handling garbage characters. Or is there
a way to configure aspell to handle them?
- should I expect better results from the OCR? I haven't done a lot of
OCR work, but it seems to me that the scans are about as good as
they're going to get. All pages are quite clear to the human eye;
there are some specks, but not many; the font is Granjon (11pt), a
traditional book font--so, though it's a bit on the elegant side,
it's not at all idiosyncratic. I haven't seen any documentation for
ocrad or gocr that gives specific guidelines for the input, but based
on general OCR info I have read, I think I've done the graphics
right: they are black & white PBMs at 300 dpi. I gather that 300 dpi
usually gives good results, and anything higher gives rapidly
diminishing returns. I did a cursory test myself, with samples at
200, 300, 400, and 600 dpi; 400 and 600 produced different errors in
the output than 300 did, but not noticeably fewer.
- or is there a decent proprietary OCR program for Linux that might be
significantly better at a reasonable price? The 2 or 3 that I found
seem to cost about $1000 and up. That's way beyond my budget, but I
could spend a hundred or two if it would make a big difference.
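One other time-saver I'm considering, for what it's worth: before each
interactive pass, rank the suspect tokens by frequency and promote the
common ones to sed rules, so aspell never has to stop on them. A rough
sketch (the regex is an assumption about what the garbage looks like;
the sample input is invented):

```shell
# Sketch: frequency-rank tokens containing an embedded digit, so
# the most common errors can be fixed mechanically first. Widen
# the regex for other garbage characters as needed.
printf 'w0rd and w0rd and w1rd\n' |
  grep -oE '[A-Za-z]+[0-9]+[A-Za-z]+' |
  sort | uniq -c | sort -rn
```

If I remember aspell's modes right, `aspell list < chapter.txt`
(non-interactive; it just prints the unrecognized words) could feed the
same kind of ranking for the plain misspellings.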
Thanks in advance for any hints.
--
Matt Gushee                      When a nation follows the Way,
Englewood, Colorado, USA         Horses bear manure through
mgushee at havenrock.com             its fields;
http://www.havenrock.com/        When a nation ignores the Way,
                                 Horses bear soldiers through
                                     its streets.
                                        --Lao Tzu (Peter Merel, trans.)