OCR on linux

Tony Baechler tony at baechler.net
Fri Apr 25 08:31:41 UTC 2008

Daniel Dalton wrote:
> If I was to buy a new scanner what model is the easiest to set up and 
> the best supported?
> What one would you recommend?


Pretty much any scanner should work nowadays. You want one that's TWAIN 
compatible. That includes most Epson, Canon, HP, etc. You probably want 
a USB scanner. The only thing to watch is that some require their own 
Windows drivers, which of course won't work in Linux. This seems true of 
HP, but I had this with an Epson also. I don't yet do scanning in Linux, 
so I can't really give specific help besides that, but if in doubt, look 
for something like "best scanner" or "supported scanner models" at 

If you get one to work, I would be interested in your results. I am 
interested in trying to scan documents in Linux and have found the OCR 
thread interesting. I would also be interested in which engine produces 
the best text quality. I know from trying different ones under Windows 
that results can vary drastically depending on many factors.

You asked about page images with text. First, be aware that there are at 
least four different types of .tif images. One is compressed, one is for 
faxes, one is for multiple pages and one is the standard, old-fashioned, 
single page. You want the latter. You'll know that it's right because it 
will only support one page per document and the files will be very big, 
about 1 MB per file. I've had bad luck with the other .tif variations. 
Also, there are many sources of page images not mentioned so far. Just a 
few are these:

http://www.gutenberg.org/ now offers .jpg page images, but you have to do 
some hunting

http://www.archive.org/ look for text, American Libraries; all should be 
high-quality images

http://onlinebooks.library.upenn.edu/ look at serials; links to many 
magazines with page images, but the sites aren't very accessible

http://www.loc.gov/ and http://memories.loc.gov/ very comprehensive and 
will require searching; has many "Base Ball" guides; images are in .jpg
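
Going back to the .tif variations for a moment: if you aren't sure which 
kind a file is, something like the following rough Python sketch might 
help. It only reads the file header, counts the pages and reports the 
compression code for each page. The function name describe_tiff and the 
table of names are my own inventions for illustration, and I am assuming 
that the "standard" single-page type means one page with compression 
code 1 (uncompressed). It doesn't handle BigTIFF or damaged files.

```python
import struct

# Compression tag values as listed in the TIFF 6.0 specification.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman (fax-style)",
    3: "CCITT Group 3 (fax)",
    4: "CCITT Group 4 (fax)",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}


def describe_tiff(data):
    """Return (page_count, [compression code per page]) for classic TIFF bytes.

    Hypothetical helper for illustration only; not part of any library.
    """
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    compressions = []
    seen = set()  # guards against a looping chain of page directories
    while offset != 0 and offset not in seen:
        seen.add(offset)
        (count,) = struct.unpack_from(endian + "H", data, offset)
        compression = 1  # the spec's default when tag 259 is absent
        for i in range(count):
            entry = offset + 2 + 12 * i
            (tag,) = struct.unpack_from(endian + "H", data, entry)
            if tag == 259:  # Compression, a SHORT stored inline
                (compression,) = struct.unpack_from(
                    endian + "H", data, entry + 8)
        compressions.append(compression)
        # Each directory ends with the offset of the next page, or 0.
        (offset,) = struct.unpack_from(
            endian + "I", data, offset + 2 + 12 * count)
    return len(compressions), compressions
```

To try it, read a file in binary mode and pass the bytes in, for example 
pages, codes = describe_tiff(open("page.tif", "rb").read()), where 
page.tif is just a placeholder name. One page with code 1 would be the 
plain single-page kind under my assumption above; codes 3 or 4 suggest a 
fax-style file, and a page count above one means a multi-page file.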

I hope this is helpful to you. Have a good weekend.

More information about the Blinux-list mailing list