Copying text from a protected pdf file
George White
aa056 at chebucto.ns.ca
Fri Sep 16 12:44:01 UTC 2005
Quoting Paul Smith <phhs80 at gmail.com>:
> I have got a pdf file, whose text I would like to copy to a word
> processor. However, it seems to be protected, as when I copy and paste
> a piece of text from there into a word processor, I only see garbage.
> Is there some way of getting clean text from the pdf file?
The PDF format has many ways to display text. To be able to extract text
you need a file that stores strings and uses font information to render them
in the viewer. You may be seeing images that were rasterized long ago.
You should provide the output of the "pdffonts" command, preferably for a
minimal document (a big document could combine sections that use fonts with
sections that are only images).
For example, the simplest case is a document that uses the PostScript Type 1
fonts provided by the viewer:
$ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Times-Roman                          Type 1       no  no  no       4  0
Helvetica                            Type 1       no  no  no       7  0
Helvetica-Bold                       Type 1       no  no  no       8  0
Times-Bold                           Type 1       no  no  no       5  0
Courier                              Type 1       no  no  no       3  0
Symbol                               Type 1       no  no  no       9  0
Times-Italic                         Type 1       no  no  no       6  0
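
If you want to check a listing like the one above from a script, here is a
rough Python sketch. It assumes the column layout shown in this message
(name, type, emb, sub, uni, object ID); other pdffonts versions may print
extra columns, and the parser would need adjusting for them. The function
names parse_pdffonts, extraction_hint and run_pdffonts are made up for this
example, and the "embedded font without a ToUnicode map" test is only a rule
of thumb, not a guarantee.

```python
import subprocess


def parse_pdffonts(output):
    """Parse pdffonts table text into a list of dicts (layout assumed as above)."""
    fonts = []
    in_body = False
    for line in output.splitlines():
        if line.startswith("---"):
            in_body = True          # rows start after the dashed separator
            continue
        parts = line.split()
        if not in_body or len(parts) < 7:
            continue
        # Read from the right, because the type column may contain spaces
        # ("Type 1") while the last five fields never do.
        fonts.append({
            "name": parts[0],
            "type": " ".join(parts[1:-5]),
            "emb": parts[-5] == "yes",
            "sub": parts[-4] == "yes",
            "uni": parts[-3] == "yes",
            "object_id": (int(parts[-2]), int(parts[-1])),
        })
    return fonts


def extraction_hint(fonts):
    """Rule-of-thumb guess at whether copy and paste will give clean text."""
    if not fonts:
        return "no fonts listed: the pages are probably images"
    # Heuristic (an assumption): an embedded font with no ToUnicode map
    # often uses a private encoding, which pastes as garbage.
    risky = [f["name"] for f in fonts if f["emb"] and not f["uni"]]
    if risky:
        return "may paste as garbage: " + ", ".join(risky)
    return "text should be extractable"


def run_pdffonts(path):
    """Run the pdffonts command on a file and return what it prints."""
    result = subprocess.run(["pdffonts", path], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

For the ssr.pdf listing above, every font is a non-embedded Type 1 font, so
extraction_hint(parse_pdffonts(...)) returns "text should be extractable".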
--
George N. White III
Head of St. Margarets Bay, Nova Scotia