ps to pdf and then to text editor
Paul Smith
phhs80 at gmail.com
Mon Apr 10 13:23:51 UTC 2006
On 4/9/06, George N. White III <aa056 at chebucto.ns.ca> wrote:
> > I print to a file, file.ps, a web-page with text. Then, I apply ps2pdf
> > and I get file.pdf. However, I cannot copy (from file.pdf) the text to
> > a text editor. Can one get a pdf file with copyable text?
>
> Does this work with a really trivial web page?
>
> What does "pdffonts file.pdf" show?
>
> If the pdf file uses strings, then you stand a better chance of being able
> to cut and paste from a pdf viewer to the editor, but you may run into
> encoding issues, so the pasted text is gibberish.
>
> I get:
>
> $ cat t.html
> abc
>
> Print to ps from Firefox, convert to pdf, load in Adobe Reader, and
> cut and paste gives: "^Y^Z^[", so the encoding is a problem. Xpdf
> would not let me copy the text. The t.html.ps file has:
>
> 8 dict begin
> /FontName /Nimbus_Roman_No9_L.Regular.0.0.Set0 def
> /FontType 1 def
> /FontMatrix [ 0.001 0 0 0.001 0 0 ]readonly def
> /PaintType 0 def
> /FontBBox [-168 -281 1031 1098]readonly def
> /Encoding [
> /.notdef
> /uni0066/uni0069/uni006C/uni0065/uni003A/uni002F/uni0068/uni006F
> /uni006D/uni0067/uni0077/uni0074/uni0057/uni0073/uni002E/uni0031
> /uni0020/uni0030/uni0034/uni0039/uni0032/uni0036/uni0041/uni004D
> /uni0061/uni0062/uni0063/
>
> This is the 'abc' --> '^Y^Z^[' encoding.
>
> $ pdffonts t.html.pdf
> name                                       type         emb sub uni object ID
> ------------------------------------------ ------------ --- --- --- ---------
> YNAHAD+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no       9  0
>
> If the pdf file uses images, you need to use an OCR tool to get the text.
> I have seen cases where printing docs to PS on Win32 results in the
> text being rasterized in the driver so the PS file has images. This may
> happen with screen fonts and/or certain effects (transparency, text
> outlines filled with colored patterns).
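
The 'abc' --> '^Y^Z^[' mapping in George's message can be checked by hand from the /Encoding array he quotes: 'a' sits in slot 25 (0x19, shown as ^Y), 'b' in slot 26, 'c' in slot 27. A rough Python sketch of that lookup follows. The helper names (glyph_to_char, decode, caret) are made up for illustration, and the table holds only the part of the array that was quoted, so this is a way to see the idea, not a general recovery tool.

```python
# Glyph names copied from the quoted /Encoding array; slot 0 is /.notdef.
# Only the quoted portion of the array is included here.
ENCODING = [".notdef"] + """
uni0066 uni0069 uni006C uni0065 uni003A uni002F uni0068 uni006F
uni006D uni0067 uni0077 uni0074 uni0057 uni0073 uni002E uni0031
uni0020 uni0030 uni0034 uni0039 uni0032 uni0036 uni0041 uni004D
uni0061 uni0062 uni0063
""".split()


def glyph_to_char(name):
    """Turn a 'uniXXXX' glyph name into the character it names."""
    if name.startswith("uni") and len(name) == 7:
        return chr(int(name[3:], 16))
    return ""  # /.notdef and anything unrecognised


def decode(raw):
    """Map byte codes from the pasted text back to readable characters."""
    return "".join(glyph_to_char(ENCODING[b]) for b in raw if b < len(ENCODING))


def caret(raw):
    """Show control bytes the way a terminal does, e.g. 0x19 -> ^Y."""
    return "".join("^" + chr(b + 64) if b < 32 else chr(b) for b in raw)


pasted = bytes([0x19, 0x1A, 0x1B])  # the three bytes Adobe Reader pasted
print(caret(pasted), "->", decode(pasted))
```

So the text is still in the file as strings, just numbered by the font's private encoding rather than by ASCII.
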
Thanks, George and Mike. After pstill, I get:
$ pdffonts file.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Nimbus_Roman_No9_L.Regular.0.0.Set0  Type 1       yes no  no      33  0
Verdana.Bold.0.0.Set0                Type 1       yes no  no      37  0
Verdana.Regular.0.0.Set0             Type 1       yes no  no      41  0
Lucida_Sans.Regular.0.0.Set0         Type 1       yes no  no      45  0
Arial.Regular.0.0.Set0               Type 1       yes no  no      49  0
Arial.Bold.0.0.Set0                  Type 1       yes no  no      53  0
Verdana.Italic.0.0.Set0              Type 1       yes no  no      57  0
[1]- Done acroread anselmo.pdf
[2]+ Done kwrite
$
After ps2pdf, I get:
$ pdffonts file.pdf
name                                       type         emb sub uni object ID
------------------------------------------ ------------ --- --- --- ---------
EOZSTF+Verdana.Regular.0.0.Set0            Type 1C      yes yes no      13  0
MQEXGW+Arial.Regular.0.0.Set0              Type 1C      yes yes no      19  0
DMCZLT+Lucida_Sans.Regular.0.0.Set0        Type 1C      yes yes no      17  0
YTBXNU+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no       8  0
GBGOAU+Verdana.Bold.0.0.Set0               Type 1C      yes yes no      10  0
GMTXSU+Arial.Bold.0.0.Set0                 Type 1C      yes yes no      23  0
AJKQFS+Verdana.Italic.0.0.Set0             Type 1C      yes yes no      26  0
$
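
In both listings the "uni" column is "no" for every font, i.e. none of them carries a ToUnicode map. As far as I understand, that map is what a viewer would need to turn these privately encoded glyph codes back into real characters, so neither pstill nor ps2pdf changes the copy-and-paste result. Below is a rough Python sketch for checking this automatically. parse_pdffonts is a hypothetical helper (not part of xpdf); it assumes the column layout shown in this thread (other pdffonts versions may print extra columns) and font names without spaces.

```python
def parse_pdffonts(text):
    """Parse pdffonts output into a list of dicts, one per font."""
    fonts = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed rule.
        if len(parts) < 7 or parts[0] == "name" or set(parts[0]) == {"-"}:
            continue
        # Read from the right: the type column ("Type 1C") contains a space.
        name = parts[0]
        font_type = " ".join(parts[1:-5])
        emb, sub, uni = parts[-5:-2]
        fonts.append({
            "name": name,
            "type": font_type,
            "emb": emb == "yes",
            "sub": sub == "yes",
            "uni": uni == "yes",
        })
    return fonts


# Two rows taken from the ps2pdf listing in this message.
SAMPLE = """\
name                                       type         emb sub uni object ID
------------------------------------------ ------------ --- --- --- ---------
EOZSTF+Verdana.Regular.0.0.Set0            Type 1C      yes yes no      13  0
YTBXNU+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no       8  0
"""

for font in parse_pdffonts(SAMPLE):
    if not font["uni"]:
        print(font["name"], "has no ToUnicode map")
```
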
Paul