ocr success

Sat Dec 20 22:11:19 UTC 2008

That's a good idea; I hadn't thought of that. I guess I should invest
some time in writing something like this.

On Sat, Dec 20, 2008 at 05:22:59AM -0800, marbux wrote:
> 2008/12/20 Daniel Dalton <d.dalton at iinet.net.au>:
> > The images seem to
> > place random code in the doc (that's ok, some quick editing with emacs,
> > nano, vi or your favourite editor will fix that).
> 
> I suggest making a mental note to examine the randomness of the
> unwanted code and other mistakes on an ongoing basis. To the extent
> that it is repetitive between different documents with different
> characteristics, a cleanup script can be written to handle it, which
> might make a good community project if there is not already such a
> project associated with one of the apps.
> 
> Seemingly random characters produced by the OCR process often have
> patterns that can be matched by a regex, e.g., an unusual Unicode
> special character in a "word." Reviewing the source of the document
> can point the way to, e.g., the symbol's Unicode number, which can be
> written as a plain-text character entity and processed by a script.
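
A minimal Python sketch of the regex idea in the paragraph above. It assumes the text is mostly plain English, so any character outside printable ASCII is treated as suspect; the names SUSPECT, suspect_chars and strip_suspects are made up for illustration and do not come from any existing OCR tool.

```python
import re
import unicodedata

# Assumption: mostly-English text, so anything that is neither printable
# ASCII nor whitespace is treated as a possible OCR artifact.
SUSPECT = re.compile(r"[^\x20-\x7E\s]")


def suspect_chars(text):
    """List each distinct suspect character with its code point and name."""
    seen = []
    for ch in SUSPECT.findall(text):
        if ch not in seen:
            seen.append(ch)
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in seen]


def strip_suspects(text):
    """Delete every suspect character from the text."""
    return SUSPECT.sub("", text)
```

Running suspect_chars over a few scanned documents first, before stripping anything, shows which code points actually recur and are worth a rule of their own. Text that legitimately contains accented letters would need a narrower pattern.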
> 
> Particular character combinations are also often handled poorly by OCR
> because the combination looks very similar to another character.
> E.g., "rn" is often misread as "m." Throw in variation in typefaces
> and in the quality of the source document, and you'll have the same
> errors occurring over and over again.
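
One possible way to handle look-alike combinations such as "rn" and "m" is sketched below in Python. It assumes a set of known-good words is available (for example, loaded from a system word list); CONFUSIONS, candidates and correct are hypothetical names, and the confusion pairs are only the one example mentioned above.

```python
# Hypothetical confusion pairs: (what the OCR produced, what was likely printed).
CONFUSIONS = [("m", "rn"), ("rn", "m")]


def candidates(word, confusions=CONFUSIONS):
    """Return every variant of word made by undoing one confusion at one position."""
    out = set()
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return out


def correct(word, known_words):
    """Fix word only if it is unknown and exactly one variant is a known word."""
    if word in known_words:
        return word
    hits = sorted(c for c in candidates(word) if c in known_words)
    return hits[0] if len(hits) == 1 else word
```

With known_words = {"government", "corner"}, correct("govemment", known_words) gives "government". The sketch leaves a word alone when it is already known or when more than one variant matches, since a wrong automatic fix is harder to spot than an obvious non-word.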
> 
> Building a quality list of recurring "words" that are not words and
> their correct equivalents can also provide the input for an automagic
> substitution routine in the clean-up script.
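
A rough Python sketch of such a substitution routine, assuming the list is kept as a simple mapping from each "not-word" to its correction. The entries in NOT_WORDS and the function name build_fixer are invented for illustration.

```python
import re

# Hypothetical sample entries; a real list would be collected from actual scans.
NOT_WORDS = {
    "tbe": "the",
    "wbich": "which",
    "govemment": "government",
}


def build_fixer(table):
    """Return a function that replaces whole-word occurrences of the table's keys."""
    if not table:
        return lambda text: text
    # Longest keys first, and \b on both sides so only whole words are replaced.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

    def fix(text):
        return pattern.sub(lambda m: table[m.group(1)], text)

    return fix
```

For example, build_fixer(NOT_WORDS)("tbe govemment wbich") gives "the government which". Matching is case-sensitive here, so a capitalised "Tbe" would need its own entry.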
> 
> Many OCR errors result from the variability in typefaces. For
> frequently read publications like a newspaper, it can be helpful to
> build a "not-word" list adapted to the particular typeface used in
> the publication, which can then be reused for other publications that
> share the same or a very similar typeface. Involving sighted people
> who have good typeface recognition skills could be of assistance
> here. Often it is unnecessary to identify the particular typeface so
> long as it can be recognized as falling within a certain classification
> of typefaces.
> 
> A database for typeface classifications used by particular frequently
> read publications can also play into the quality of clean-up scripts.
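
The typeface-class idea could be organised along these lines. This is only a sketch in Python: the publication names, class labels and table entries are all invented placeholders, and a real version would probably keep them in a file or small database rather than in the script.

```python
# All names and entries below are hypothetical placeholders.
PUBLICATION_CLASS = {
    "Example Daily": "old-style-serif",
    "Example Weekly": "old-style-serif",
    "Example Tech": "grotesque-sans",
}

# One "not-word" table per typeface class, shared by every publication in it.
CLASS_TABLES = {
    "old-style-serif": {"tbe": "the"},
    "grotesque-sans": {"Iist": "list"},
}

# Corrections that apply regardless of typeface.
COMMON = {"wbich": "which"}


def table_for(publication):
    """Merge the common table with the table for the publication's typeface class."""
    table = dict(COMMON)
    cls = PUBLICATION_CLASS.get(publication)
    table.update(CLASS_TABLES.get(cls, {}))
    return table
```

Keying the tables by class rather than by publication means a new publication only needs one entry in PUBLICATION_CLASS to reuse an existing list. An unknown publication falls back to the common table alone.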
> 
> Just some random thoughts from a sighted person who has struggled with
> OCR over the decades. I was a typographer in my first career.
> 
> Best regards,
> 
> Paul
> 
> -- 
> Universal Interoperability Council
> <http://www.universal-interop-council.org>
> 
> _______________________________________________
> Blinux-list mailing list
> Blinux-list at redhat.com
> https://www.redhat.com/mailman/listinfo/blinux-list