Re: extract text from PDF file
- From: Lee Sau Dan <danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 04 Aug 2006 00:00:27 +0800
"Ken" == Ken Sharp <ken@xxxxxxxxxxx> writes:
>> Yes, sorry, I've tested a little and pdftotext works whenever I
>> can copy&paste; when I can't convert the PDF to text, copy&paste
>> doesn't work either.
Ken> No surprise I guess, sounds like you have a re-encoded
Ken> font. The only way to deal with this is to go from the
Ken> encoding back to the glyph name, then try and figure out if
Ken> the glyph name has a matching ASCII character.
In an attempt to obfuscate a PDF file, I have tried reencoding a font.
But both xpdf and Acrobat Reader (Linux version) can do the cut&paste
correctly. (They can even recognize the "fi" ligature!) At first, I
was puzzled: the recoding was a random shuffling; how can these tools
figure out which charcode is which char? It must be the glyph
names...
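
A minimal sketch of that idea, in Python: even when the character codes
are shuffled, a viewer can go from code to glyph name to text. The
encoding dictionary, the codes, and the tiny name table below are
hypothetical illustrations (a real tool would use a full glyph-name
list), and this is not the actual logic of xpdf or Acrobat Reader.

```python
# Small excerpt of a glyph-name-to-text table; illustrative only.
GLYPH_TO_TEXT = {
    "H": "H", "e": "e", "l": "l", "o": "o",
    "fi": "fi",              # ligature glyph maps back to two characters
    "adieresis": "\u00e4",
    "space": " ",
}

def decode(codes, encoding):
    """Map (possibly shuffled) character codes to text via glyph names."""
    out = []
    for code in codes:
        name = encoding.get(code)
        # Unknown code or unknown glyph name: emit U+FFFD as a placeholder.
        out.append(GLYPH_TO_TEXT.get(name, "\ufffd"))
    return "".join(out)

# Hypothetical shuffled encoding: character code -> glyph name.
shuffled = {17: "H", 3: "e", 42: "l", 8: "o", 99: "fi"}

print(decode([17, 3, 42, 42, 8], shuffled))
```

The shuffled codes carry no meaning by themselves; the glyph names are
what give the text away.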
So, I went a step further and obfuscated the glyph names in the
fonts as well! After that, neither xpdf nor Acrobat Reader could work
out the original text anymore. ;)
Ken> Sadly, some subset fonts will use unhelpful glyph names; for
Ken> example instead of /adieresis, it might be /G00. There isn't
Ken> really any way to go from names like that back to ASCII.
That's what I did, too: I gave the glyphs generated, meaningless
names. And that achieved my goal: obfuscation. :)
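
A rough sketch of this kind of renaming, in Python. The function name,
the G00-style naming scheme, and the code range are assumptions made for
illustration; this is not the tool actually used, and it only builds the
mappings rather than rewriting a font.

```python
import random

def obfuscate(glyph_names, seed=0):
    """Assign shuffled codes and meaningless names (G00, G01, ...) to glyphs."""
    rng = random.Random(seed)
    # Pick distinct character codes at random from the single-byte range.
    codes = rng.sample(range(32, 256), len(glyph_names))
    encoding = {}  # code -> opaque name: what would end up in the file
    secret = {}    # opaque name -> original name: never written to the file
    for i, (code, name) in enumerate(zip(codes, glyph_names)):
        opaque = f"G{i:02X}"
        encoding[code] = opaque
        secret[opaque] = name
    return encoding, secret

encoding, secret = obfuscate(["a", "b", "fi", "adieresis"])
print(sorted(encoding.values()))
```

Without the `secret` table, a name such as G02 says nothing about which
character the glyph draws, so a name-based lookup has nothing to work
with.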
Ken> TBH, the 'right' way to do this is by extracting text from
Ken> the original application. If you can't do that, then the most
Ken> reliable method is to OCR the document.
I think so, too. But instead of printing a hard copy and then
scanning it back in (or converting the PS/PDF to an image file
directly), is there any OCR software smart enough to recognize
characters from the glyph information itself? E.g. the glyph metrics
may already be useful for guessing which characters the glyphs map to.
There is no need to re-derive these measurements from a raster
sub-image.
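
To illustrate how far metrics alone might go, here is a small Python
sketch that shortlists candidate characters by advance width. The
reference widths and the tolerance are made-up illustrative values, not
taken from any real font, and the function name is hypothetical.

```python
def candidates(width, reference, tolerance=10):
    """Return reference characters whose advance width is within tolerance."""
    return sorted(ch for ch, w in reference.items() if abs(w - width) <= tolerance)

# Illustrative advance widths in 1/1000 em; assumed values for the sketch.
REFERENCE_WIDTHS = {"i": 222, "l": 222, "a": 556, "w": 722, "m": 833}

print(candidates(835, REFERENCE_WIDTHS))
print(candidates(222, REFERENCE_WIDTHS))
```

A wide glyph narrows down to a single candidate here, but narrow glyphs
of equal width stay ambiguous, so widths would at best be one input
alongside the glyph outlines.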
Ken> I'm afraid neither PostScript nor PDF is intended as an
Ken> 'editable' format, and so there is no provision for doing
Ken> what you want.
In the worst case, the PS/PDF file could just draw each character as
a plain drawing, skipping the font machinery altogether! :(
(Yeah, I know that's very inefficient, since it can't take advantage
of the font caching mechanism.)
--
Lee Sau Dan 李守敦
E-mail: danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx
Home page: http://www.informatik.uni-freiburg.de/~danlee