ps to pdf and then to text editor

George N. White III aa056 at chebucto.ns.ca
Sun Apr 9 13:22:14 UTC 2006


On Sun, 9 Apr 2006, Paul Smith wrote:

> I print to a file, file.ps, a web-page with text. Then, I apply ps2pdf
> and I get file.pdf. However, I cannot copy (from file.pdf) the text to
> a text editor. Can one get a pdf file with copyable text?

Does this work with a really trivial web page?

What does "pdffonts file.pdf" show?

If the pdf file stores the text as strings, you stand a better chance of
being able to cut and paste from a pdf viewer to the editor, but you may
run into encoding issues that turn the pasted text into gibberish.

I get:

$ cat t.html
abc

Printing to ps from Firefox, converting to pdf, loading the result in
Adobe Reader, and then cutting and pasting gives "^Y^Z^[", so the
encoding is a problem.  Xpdf would not let me copy the text.  The
t.html.ps file has:

8 dict begin
/FontName /Nimbus_Roman_No9_L.Regular.0.0.Set0 def
/FontType 1 def
/FontMatrix [ 0.001 0 0 0.001 0 0 ]readonly def
/PaintType 0 def
/FontBBox [-168 -281 1031 1098]readonly def
/Encoding [
/.notdef
/uni0066/uni0069/uni006C/uni0065/uni003A/uni002F/uni0068/uni006F
/uni006D/uni0067/uni0077/uni0074/uni0057/uni0073/uni002E/uni0031
/uni0020/uni0030/uni0034/uni0039/uni0032/uni0036/uni0041/uni004D
/uni0061/uni0062/uni0063/

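This is the 'abc' --> '^Y^Z^[' encoding: counting /.notdef as position 0,
a, b and c sit at positions 25, 26 and 27 of the array, and those codes
are the control characters ^Y, ^Z and ^[.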
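
To make the mapping concrete, here is a rough Python sketch that rebuilds
the lookup from the entries visible in the excerpt above.  The array is cut
off in the listing, so only the entries shown are included.  The helper
names (glyph_to_char, decode, caret) are made up for illustration, and the
sketch assumes every glyph name has the simple uniXXXX form seen here.

```python
# Glyph names copied from the /Encoding array in t.html.ps.  The listing is
# truncated, so this holds only the entries that are visible in it.
ENCODING = [
    ".notdef",
    "uni0066", "uni0069", "uni006C", "uni0065", "uni003A", "uni002F", "uni0068", "uni006F",
    "uni006D", "uni0067", "uni0077", "uni0074", "uni0057", "uni0073", "uni002E", "uni0031",
    "uni0020", "uni0030", "uni0034", "uni0039", "uni0032", "uni0036", "uni0041", "uni004D",
    "uni0061", "uni0062", "uni0063",
]


def glyph_to_char(name):
    """Turn a 'uniXXXX' glyph name into its character (assumed name form)."""
    if name.startswith("uni") and len(name) == 7:
        return chr(int(name[3:], 16))
    return "\ufffd"  # .notdef or anything this sketch does not understand


def decode(codes):
    """Map raw character codes back to text through the encoding array."""
    return "".join(
        glyph_to_char(ENCODING[c]) if 0 <= c < len(ENCODING) else "\ufffd"
        for c in codes
    )


def caret(codes):
    """Show control codes in caret notation, as a terminal or editor would."""
    return "".join("^" + chr(c + 64) if c < 32 else chr(c) for c in codes)


# The codes the font assigns to 'a', 'b', 'c' are simply their array positions.
pasted = [ENCODING.index("uni%04X" % ord(ch)) for ch in "abc"]
print(pasted)          # [25, 26, 27]
print(caret(pasted))   # ^Y^Z^[
print(decode(pasted))  # abc
```

A viewer that has no such table for the font can only hand over the raw
codes, which is presumably what Adobe Reader pasted.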

$ pdffonts t.html.pdf
name                         type         emb sub uni object ID
---------------------------- ------------ --- --- --- ---------
YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0
                              Type 1C      yes yes no  9 0
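
The "uni" column says whether the font carries a ToUnicode map, and "no"
here fits the garbled paste.  If you have many files to check, something
like the following Python sketch could pick out the fonts that lack one.
It only parses text in the layout shown above (two header lines, then one
font per row, possibly with the name wrapped onto its own line); the
function name is made up, and how you capture the pdffonts output is left
to you.

```python
SAMPLE = """\
name                         type         emb sub uni object ID
---------------------------- ------------ --- --- --- ---------
YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0
                              Type 1C      yes yes no  9 0
"""


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column reads 'no'."""
    missing = []
    pending_name = None  # a long font name that wrapped onto its own line
    # Assumes the first two lines are the column header and the dashed rule.
    for line in pdffonts_output.splitlines()[2:]:
        tokens = line.split()
        if not tokens:
            continue
        # emb, sub and uni are the first run of three yes/no tokens in a row.
        idx = next(
            (i for i in range(len(tokens) - 2)
             if all(t in ("yes", "no") for t in tokens[i:i + 3])),
            None,
        )
        if idx is None:
            # No flags on this line: treat it as a wrapped font name.
            pending_name = tokens[0]
            continue
        name = pending_name or tokens[0]
        pending_name = None
        if tokens[idx + 2] == "no":
            missing.append(name)
    return missing


print(fonts_without_tounicode(SAMPLE))
# ['YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0']
```

A font listed by this check is a likely suspect when pasted text comes out
wrong, though a viewer can sometimes still recover text from standard
glyph names.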

If the pdf file uses images, you need to use an OCR tool to get the text.
I have seen cases where printing docs to PS on Win32 results in the
text being rasterized in the driver so the PS file has images.  This may 
happen with screen fonts and/or certain effects (transparency, text 
outlines filled with colored patterns).

-- 
George N. White III  <aa056 at chebucto.ns.ca>




More information about the fedora-list mailing list