ocr + fedora core and a big book..

Bill Rugolsky Jr. brugolsky at telemetry-investments.com
Fri Jan 13 20:24:03 UTC 2006


On Fri, Jan 13, 2006 at 10:47:02AM +0000, Paul F. Johnson wrote:
> Grab a copy of gocr, compile and install (it's not in FE, which is odd).
> When you scan, ensure it's at as high a resolution as possible (in my
> experience, a minimum of 300 dpi) and in greyscale.
> 
> Use either gimp or xsane to grab the scan and tell gocr to do its
> business.
> 
> OCR is not an exact science, and you will really need to sit down and go
> through the scanned text to ensure that the numbers scanned are correct
> (errors are very easy to spot: you may have @ instead of 0, l for 1, and
> the like). Save the generated file. You may then need to either write a
> script that splits on " " as the delimiter or load it into emacs and
> search and replace " " with ",", then save.
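
The clean-up script described in the quoted text could look roughly like
the sketch below.  It assumes the scanned data is columns of plain numbers
separated by spaces.  The names DIGIT_FIXES, fix_numeric_token and
line_to_csv are made up for illustration, and the substitution table holds
only the two confusions mentioned above; extend it after inspecting your
own scans.

```python
import re

# Confusions named in the message above: "@" read for 0, "l" read for 1.
# Assumption: any further entries must come from inspecting real output.
DIGIT_FIXES = str.maketrans({"@": "0", "l": "1"})

def fix_numeric_token(token):
    """Apply the digit fixes only if the result is a plain number."""
    candidate = token.translate(DIGIT_FIXES)
    # Accept digits with an optional decimal part; anything else is left
    # untouched so that it can be reviewed by hand.
    if re.fullmatch(r"\d+(\.\d+)?", candidate):
        return candidate
    return token

def line_to_csv(line):
    """Split on runs of whitespace, repair numeric tokens, join with commas."""
    return ",".join(fix_numeric_token(tok) for tok in line.split())

if __name__ == "__main__":
    print(line_to_csv("1@5  l2 3.l4"))
```

Tokens that still do not look numeric after substitution are passed through
unchanged, so a manual proofreading pass is still needed.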

Sadly, in my (limited) experience, none of the free software solutions
such as Gocr or Clara OCR is really up to the task.  The leading
proprietary packages are vastly superior.  Some of them have free 30-day
evaluations.

With a proper setup for lots of automated training, Clara might be able
to do the job, especially if you do some image morphology (using, e.g.,
GIMP) to clean up the scans.  But you'll have to do some serious work.
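
To make "image morphology" concrete, here is a minimal pure-Python sketch
of a morphological opening (erosion followed by dilation) on a tiny binary
image, which removes isolated specks.  This is not what GIMP does
internally; the function names and the plus-shaped neighbourhood are
assumptions chosen for illustration, and strokes thinner than the
neighbourhood are erased as well.

```python
def _ink(img, y, x):
    """True if (y, x) is inside the image and set to ink (1)."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] == 1

# Plus-shaped neighbourhood: the pixel itself and its 4 direct neighbours.
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def erode(img):
    """Keep a pixel only if its whole neighbourhood is ink."""
    return [[1 if all(_ink(img, y + dy, x + dx) for dy, dx in OFFSETS) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def dilate(img):
    """Set a pixel if any pixel in its neighbourhood is ink."""
    return [[1 if any(_ink(img, y + dy, x + dx) for dy, dx in OFFSETS) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def opening(img):
    """Erosion followed by dilation: drops specks smaller than the plus."""
    return dilate(erode(img))

if __name__ == "__main__":
    scan = [
        [0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 0],
        [0, 1, 1, 1, 0, 1, 0],   # the lone 1 at column 5 is a speck
        [0, 1, 1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
    ]
    for row in opening(scan):
        print("".join(str(p) for p in row))
```

On real scans you would threshold the greyscale image first and pick the
neighbourhood size to match the noise, which is where the serious work
comes in.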

A tried and true technique that avoids using proprietary software
is to simply pay multiple people to type the whole thing, and then
reconcile the differences (or use majority voting). :-)
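
The majority-voting step can be sketched in a few lines.  This assumes the
typists' copies contain the same number of whitespace-separated words, so
they line up position by position; real transcripts would need an
alignment pass first.  The function name reconcile and the "???" marker
are invented for the example.

```python
from collections import Counter

def reconcile(transcripts):
    """Majority-vote word by word; flag positions with no strict majority."""
    split = [t.split() for t in transcripts]
    # Assumption: all copies have the same word count and therefore align.
    if len({len(words) for words in split}) != 1:
        raise ValueError("transcripts differ in length; align them first")
    merged, disputed = [], []
    for pos, candidates in enumerate(zip(*split)):
        word, votes = Counter(candidates).most_common(1)[0]
        if votes * 2 > len(candidates):
            merged.append(word)        # more than half of the typists agree
        else:
            merged.append("???")       # no majority: needs a human decision
            disputed.append(pos)
    return " ".join(merged), disputed

if __name__ == "__main__":
    copies = ["10 25 31", "10 26 31", "10 25 81"]
    print(reconcile(copies))
```

With three typists a single typo at any position is outvoted; with only
two, every disagreement ends up in the disputed list.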

Regards,

	Bill Rugolsky



