Copying text from a protected pdf file
Antonio Olivares
olivares14031 at yahoo.com
Fri Sep 16 12:10:29 UTC 2005
--- Paul Smith <phhs80 at gmail.com> wrote:
> On 9/15/05, Deron Meranda <deron.meranda at gmail.com> wrote:
> > > > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > > > a piece of text from there into a word processor, I only see garbage.
> > ...
> > > Thanks, Leonard. I have just checked: the pdf file is not copy
> > > protected, but, even so, what I can copy into a word processor is
> > > garbage. It may be something related to encodings.
> >
> > It could be encodings. Text in PDF is really only in terms of glyphs,
> > not characters, which makes text extraction particularly difficult
> > and font-specific. Fortunately there are a few standard PDF encodings
> > defined by Adobe (these map "characters" to glyphs, and are not
> > quite the same things as you'd think of an "encoding" being), but
> > each PDF file can create its own custom encodings as well and
> > visually you'd see nothing different. There's also nothing to keep
> > the "text" in a PDF file from being written weird (such as writing
> > from right-to-left) since it's just graphics instructions; but most PDF
> > generating programs do it in the obvious way.
> >
> > You might want to look at the "pdftotext" program (which is part of
> > the xpdf package, obsoleted in FC4). It generally can do a good job
> > of extracting text.
> >
> > Just some more information... are your documents generally
> > written in English (or use the English alphabet)? And are they more
> > like plain prose (paragraphs of text), or fanciful like marketing materials
> > with lots of interspersed graphics, panels, and so forth?
>
> Thanks, Deron. My documents are not written in English, and they only
> have text and tables, apparently created with MS Windows. pdftotext
> and pdftohtml do not produce good or reasonable results.
>
> Paul
>
Have you tried converting your file to PostScript and
then using ps2ascii or something similar?
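
A rough sketch of that two-step route, in case it helps. It assumes the
Ghostscript tools pdf2ps and ps2ascii are installed and on your PATH; the
helper names build_commands and convert, and the output file names, are just
made up for illustration. If the PDF uses a custom font encoding, the text
may well still come out as garbage, so treat this as something to try rather
than a fix.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two command lines: PDF -> PostScript, then PostScript -> text.

    Hypothetical helper; output names are derived from the input name.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    txt = pdf.with_suffix(".txt")
    return [
        ["pdf2ps", str(pdf), str(ps)],    # Ghostscript: PDF to PostScript
        ["ps2ascii", str(ps), str(txt)],  # Ghostscript: PostScript to plain text
    ]


def convert(pdf_path):
    """Run both steps in order; raise if a tool is missing or a step fails."""
    commands = build_commands(pdf_path)
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
    return commands[-1][-1]  # path of the resulting text file
```

Calling convert("report.pdf") would leave report.ps and report.txt next to
the original file, assuming both tools run without errors.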
Best Regards,
Antonio