ocr success

Sat Dec 20 13:22:59 UTC 2008

2008/12/20 Daniel Dalton <d.dalton at iinet.net.au>:
> The images seem to
> place random code in the doc (that's ok, some quick editing with emacs,
> nano, vi or your favourite editor will fix that).

I suggest keeping a running note of how random the unwanted code and
other mistakes really are. To the extent that the same errors repeat
across documents with different characteristics, a cleanup script can
be written to handle them. That might make a good community project,
if no such project is already associated with one of the apps.
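
As a rough sketch of that ongoing tally (Python here is my choice, and
the function name, the punctuation set and the `known_words` word list
are all assumptions for illustration), something like this would report
the suspect tokens that turn up in more than one document:

```python
from collections import Counter


def recurring_nonwords(documents, known_words, min_docs=2):
    """Return {token: number_of_documents} for tokens that are missing
    from known_words and appear in at least min_docs documents."""
    doc_counts = Counter()
    for text in documents:
        # Lowercase, split on whitespace, trim common punctuation.
        tokens = {t.strip('.,;:!?"\'()') for t in text.lower().split()}
        tokens.discard("")
        # Count each suspect token once per document.
        doc_counts.update(t for t in tokens if t not in known_words)
    return {t: n for t, n in doc_counts.items() if n >= min_docs}
```

Tokens that only ever show up in one document are probably noise;
the ones that keep coming back are the candidates for a script.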

Seemingly random characters produced by the OCR process often have
patterns that can be matched by a regex, e.g., an unusual Unicode
special character in a "word." Reviewing the source of the document
can reveal, e.g., the symbol's Unicode number, which can be written as
a plain-text character entity that a script can process.
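
A minimal sketch of that idea, assuming the text is mostly plain
English so that any non-ASCII character is worth a look (that
assumption will also flag legitimate accented letters, and the
function name is made up for this example):

```python
import re
import unicodedata
from collections import Counter

# Assumption: anything outside the ASCII range is suspect.
SUSPECT = re.compile(r"[^\x00-\x7F]")


def suspect_characters(text):
    """List (code point, Unicode name, count) for each non-ASCII
    character in text, most frequent first."""
    counts = Counter(SUSPECT.findall(text))
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]
```

Once you know the code points that keep recurring, they can go
straight into the clean-up script's patterns.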

Particular character combinations are also often handled poorly by OCR
because together they look very similar to another character. E.g.,
"rn" is often misread as "m." Throw in variation in typefaces and in
the quality of the source document, and you'll have the same errors
occurring over and over again.
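
To illustrate, here is a small sketch that takes a token and tries
swapping one confusable sequence at a time, keeping only the results
found in a word list. The pair table and the `known_words` set are
placeholders of mine, not anything from a real OCR package:

```python
# Hypothetical confusion pairs: (what the OCR produced, what was printed).
CONFUSIONS = [("m", "rn"), ("rn", "m")]


def candidates(token, known_words):
    """Return known words reachable from token by replacing a single
    occurrence of one confusable sequence."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            candidate = token[:start] + right + token[start + len(wrong):]
            if candidate in known_words:
                found.add(candidate)
            start = token.find(wrong, start + 1)
    return sorted(found)
```

This is only safe to apply to tokens that are not already valid words,
since a misread can itself be a real word ("modern" read as "modem").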

Building a quality list of recurring "words" that are not words and
their correct equivalents can also provide the input for an automagic
substitution routine in the clean-up script.
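
Such a routine could be as simple as the following sketch. The
function name and the sample entries are invented for illustration; a
real list would come from the errors you actually collect:

```python
import re


def apply_corrections(text, corrections):
    """Replace whole-word occurrences of each not-word with its
    correct equivalent. Matching is case-sensitive."""
    if not corrections:
        return text
    # One alternation of all not-words, anchored at word boundaries
    # so that longer words containing a not-word are left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)
```

For example, with `{"tbe": "the", "cornpany": "company"}` as the list,
"tbe cornpany" comes out as "the company" while a longer word such as
"tbeory" is not touched.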

Many OCR errors result from the variability in typefaces. For
frequently read publications like a newspaper, it can be helpful to
build a "not-word" list adapted to the particular typeface used in
the publication, which can then be reused for other publications that
share the same or a very similar typeface. Involving sighted people
who have good typeface recognition skills could be of assistance
here. Often it is unnecessary to identify the particular typeface, so
long as it can be recognized as belonging to a certain classification
of typefaces.

A database of the typeface classifications used by particular
frequently read publications could also improve the quality of
clean-up scripts.
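
In its simplest form that database could be two lookup tables, as in
this sketch. Every publication name, classification and entry below is
made up purely to show the shape of the data:

```python
# Hypothetical data: publication -> typeface classification.
PUBLICATION_CLASS = {
    "Example Daily News": "old-style serif",
    "Example Weekly": "old-style serif",
    "Example Tech Bulletin": "geometric sans",
}

# Hypothetical data: classification -> not-word list for that class.
CLASS_CORRECTIONS = {
    "old-style serif": {"tbe": "the", "cornpany": "company"},
    "geometric sans": {"Iist": "list"},
}


def corrections_for(publication):
    """Return the not-word list for a publication's typeface class,
    or an empty dict if the publication is not in the database."""
    typeface_class = PUBLICATION_CLASS.get(publication)
    return CLASS_CORRECTIONS.get(typeface_class, {})
```

Keying the lists by classification rather than by publication is what
lets two papers set in similar type share one list.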

Just some random thoughts from a sighted person who has struggled with
OCR over the decades. I was a typographer in my first career.

Best regards,

Paul

-- 
Universal Interoperability Council
<http:www.universal-interop-council.org>