extracting text from png files

Linux for blind general discussion blinux-list at redhat.com
Mon Dec 17 15:03:23 UTC 2018


What you're looking for is Optical Character Recognition, or OCR for short.

I've never managed to figure out its command line syntax, but I
believe tesseract is considered the current standard option for Linux.
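
For reference, the basic form documented for tesseract is
"tesseract imagename outputbase", which writes the recognized text to
outputbase.txt. The Python wrapper below is only a rough sketch of that
usage, not a tested recipe: the function names and the default output
name are made up for illustration, and it assumes tesseract and its
English language data are installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_png(image_path, output_base="ocr_output", language="eng"):
    """Run tesseract on one image and return the recognized text.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    # tesseract appends ".txt" to the output base name it is given.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling ocr_png("page.png") would then return the text as a string,
which can be printed or piped to a screen reader. How good that text is
depends entirely on the image.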

There's also Cuneiform, which I have actually used with some success
in the past, but I believe it's either contrib or non-free under
Debian, so you might need to enable extra repositories depending on
how strict your distro is about sticking to FOSS principles.

I will warn you: in my experience, OCR is as likely to produce
gibberish as legible text. A scan of a page of prose typeset in a
standard font will probably OCR well, but the more the text is mixed
with graphics, the fancier the font, and the more complicated the page
layout, the more likely errors are. I've tried OCR'ing scanlated
manga (Japanese comics) in the past and have gotten results that
included unpredictable patterns of letters and numbers misidentified
as others (S and 5, P and D, I and 1, LI and U, and B and g were just
some of the common substitutions I encountered while trying to fix the
OCR'd text) and characters my screen reader couldn't identify or
identified as characters I'm unfamiliar with. Even when the text was
clear, paragraphs out of order weren't uncommon.
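
One way to speed up proofreading is to flag words that mix letters
with the digits involved in the substitutions mentioned above. The
sketch below is only an illustration: it covers just the S/5 and I/1
pairs, the names are hypothetical, and it will also flag legitimate
tokens such as "1st", so it is meant to point a human at suspect words
rather than to correct text automatically.

```python
import re

# Digit-for-letter confusions taken from the message above (S/5 and I/1).
DIGIT_TO_LETTER = {"5": "S", "1": "I"}

# A word-like token containing at least one letter and at least one 5 or 1.
MIXED_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[51])\w+\b")


def suspicious_tokens(text):
    """Return tokens that mix letters with a commonly confused digit."""
    return MIXED_TOKEN.findall(text)


def suggest_fix(token):
    """Replace each confusable digit with its letter look-alike."""
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)


if __name__ == "__main__":
    sample = "THI5 1S page 15 of 20"
    for token in suspicious_tokens(sample):
        print(token, "->", suggest_fix(token))
```

Pure numbers such as "15" are left alone because they contain no
letters, which keeps page numbers and dates out of the list.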

-- 
Sincerely,

Jeffery Wright
Bachelor of Computer Science
President Emeritus, Nu Nu Chapter, Phi Theta Kappa.



