Re: extract text from PDF file



"Ken" == Ken Sharp <ken@xxxxxxxxxxx> writes:

>> Yes, sorry. I've tested a little, and pdftotext works whenever I
>> can copy&paste; when I can't convert the PDF to text, copy&paste
>> doesn't work either.

Ken> No surprise I guess, sounds like you have a re-encoded
Ken> font. The only way to deal with this is to go from the
Ken> encoding back to the glyph name, then try and figure out if
Ken> the glyphname has a matching ASCII character.

In an attempt to obfuscate a PDF file, I have tried reencoding a font.
But both xpdf and Acrobat Reader (Linux version) can do the cut&paste
correctly. (They can even recognize the "fi" ligature!) At first, I
was puzzled: the recoding was a random shuffling; how can these tools
figure out which charcode is which char? It must be the glyph
names...
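A minimal sketch of how such a lookup could work, assuming the viewer
maps each char code to its glyph name through the font's encoding and
then looks the name up in a name-to-text table. The table below is a
tiny hypothetical slice, and decode() is an illustrative helper, not
code taken from xpdf or Acrobat Reader:

```python
# Hypothetical slice of a glyph-name-to-text table; a real table
# would have thousands of entries.
GLYPH_TO_TEXT = {
    "a": "a", "l": "l", "n": "n",
    "adieresis": "\u00e4",
    "fi": "fi",  # the ligature glyph expands to two characters
}

def decode(codes, encoding):
    """Map char codes to text via glyph names; '?' marks unknown names."""
    out = []
    for code in codes:
        name = encoding.get(code)
        out.append(GLYPH_TO_TEXT.get(name, "?"))
    return "".join(out)

# A shuffled encoding: the codes are arbitrary, but the names are
# still meaningful, so the text survives.
shuffled = {7: "fi", 19: "n", 3: "a", 42: "l"}
print(decode([7, 19, 3, 42], shuffled))  # final
```

The shuffle only changes the keys of the encoding; as long as the
values are standard names, the text comes back, ligature included.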

So, I went a step further: I also obfuscated the glyph names in the
fonts! After that, neither xpdf nor Acrobat Reader could work out the
original text anymore. ;)



Ken> Sadly, some subset fonts will use unhelpful glyph names; for
Ken> example instead of /adieresis, it might be /G00. There isn't
Ken> really any way to go from names like that back to ASCII.

That's also what I did. I gave the glyphs generated, meaningless
names. And that achieved my goal: obfuscation. :)
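Roughly, the renaming step might look like the sketch below. The
function names, the "G%02d" naming scheme and the KNOWN_NAMES table
are assumptions made for illustration; a real tool would also have to
rename the glyphs inside the font program itself so that the encoding
and the font stay consistent.

```python
# Illustrative subset of the standard names a viewer might recognize.
KNOWN_NAMES = {"a": "a", "l": "l", "n": "n", "fi": "fi"}

def obfuscate_names(encoding):
    """Replace every glyph name with a generated one (G00, G01, ...)."""
    names = sorted(set(encoding.values()))
    rename = {name: "G%02d" % i for i, name in enumerate(names)}
    return {code: rename[name] for code, name in encoding.items()}

def recoverable(encoding):
    """List the char codes whose glyph name is still a known name."""
    return [code for code, name in encoding.items() if name in KNOWN_NAMES]

encoding = {7: "fi", 19: "n", 3: "a", 42: "l"}
hidden = obfuscate_names(encoding)

print(recoverable(encoding))  # [7, 19, 3, 42]
print(recoverable(hidden))    # []
```

Once no name matches a known one, the name-based route back to the
text is closed.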


Ken> TBH, the 'right' way to do this is by extracting text from
Ken> the original application. If you can't do that, then the most
Ken> reliable method is to OCR the document.

I think so, too. But instead of printing a hard copy and then
scanning it back in (or converting the PS/PDF to an image file
directly), is there any OCR software smart enough to recognize
characters from the glyph information itself? E.g. the glyph metrics
may already be useful for guessing which characters the glyphs map
to. There would be no need to re-derive these measurements from a
raster sub-image.
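As a rough sketch of the metrics idea, one could compare each
obfuscated glyph's advance width against the widths of a reference
font and keep the characters that fall within a tolerance. The width
values, the glyph names and the candidates() helper below are all
illustrative assumptions, not data read from any real font file:

```python
# Illustrative advance widths in 1/1000 em for a reference font.
REFERENCE_WIDTHS = {"i": 222, "l": 222, "a": 556, "n": 556,
                    "m": 833, "w": 722}

def candidates(width, tolerance=5):
    """Return reference characters whose width is within tolerance."""
    return sorted(ch for ch, w in REFERENCE_WIDTHS.items()
                  if abs(w - width) <= tolerance)

# Hypothetical obfuscated glyphs and their advance widths.
obfuscated = {"G00": 833, "G01": 222, "G02": 720}
for name, width in sorted(obfuscated.items()):
    print(name, candidates(width))
```

Widths alone are ambiguous ("i" and "l" share a width here), so such
a guess would probably have to be combined with other evidence, such
as the glyph outlines or letter frequencies.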


Ken> I'm afraid neither PostScript nor PDF is intended as an
Ken> 'editable' format, and so there is no provision for doing
Ken> what you want.

In the worst case, the PS/PDF file could have just drawn each
character as a drawing, skipping the use of the font machinery! :(
(Yeah, I know that's very inefficient: it can't take advantage of the
font caching mechanism.)



--
Lee Sau Dan 李守敦 ~{@nJX6X~}

E-mail: danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx
Home page: http://www.informatik.uni-freiburg.de/~danlee