Convert PDF to Text?

Jonathan Underwood jonathan.underwood at gmail.com
Tue Apr 24 11:33:03 UTC 2007


On 21/04/07, Keith G. Robertson-Turner
<fedora-gmane.00003 at genesis-x.nildram.co.uk> wrote:
> I have some PDF documents that are photocopied text documents (embedded
> image, rather than text glyphs). When I open these with Evince, I am
> able to copy and paste the actual text. At first I thought this was some
> kind of OCR process, but then I realised it's actually the document
> itself, which has the original text embedded in it (OCRed and embedded
> during the original scan).
>
> Is there any command I can use to extract the text from these PDF
> documents in a batch? I have a couple of thousand documents that need
> converting.

Have you looked at pdftk? "If PDF is electronic paper, then pdftk is
an electronic staple-remover, hole-punch, binder, secret-decoder-ring,
and X-Ray-glasses. Pdftk is a command-line tool for doing everyday
things with PDF documents."

http://www.accesspdf.com/pdftk/
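
For the batch part of the question, a rough sketch follows. Note that it does not use pdftk: it assumes the separate pdftotext utility from poppler-utils is installed and on the PATH (Evince renders PDFs through the Poppler library, so pdftotext should see the same embedded text layer). The function names text_path_for and convert_all are made up for illustration, and the -layout option is just one reasonable choice. Treat this as a starting point under those assumptions, not a tested solution for your documents.

```python
import shutil
import subprocess
from pathlib import Path


def text_path_for(pdf: Path) -> Path:
    """Return the output path: same directory and stem, with a .txt suffix."""
    return pdf.with_suffix(".txt")


def convert_all(directory: Path) -> list:
    """Run pdftotext on every *.pdf in `directory`.

    Returns the list of PDFs for which pdftotext exited with an error,
    so a run over thousands of files does not stop at the first failure.
    """
    # Assumption: pdftotext (poppler-utils) is installed; it is not part of pdftk.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")

    failed = []
    for pdf in sorted(directory.glob("*.pdf")):
        result = subprocess.run(
            # -layout asks pdftotext to keep the physical page layout.
            ["pdftotext", "-layout", str(pdf), str(text_path_for(pdf))],
            capture_output=True,
        )
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling convert_all(Path("scans")) would write scans/foo.txt next to each scans/foo.pdf and return any files that could not be converted; "scans" is only a placeholder directory name.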



