Re: extract text from PDF file



In article <87psfhdd2c.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxx>,
danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx says...

Ken> TBH, the 'right' way to do this is by extracting text from
Ken> the original application. If you can't do that, then the most
Ken> reliable method is to OCR the document.

I think so, too. But instead of printing a hard copy and then
scanning it back in (or converting the PS/PDF to an image file
directly), is there any OCR software smart enough to recognize
characters from the glyph information itself? E.g. the glyph metrics
may already be useful for guessing which characters they map to. There
is no need to rework these measures from a raster sub-image.

I don't think there's anything that will deal with a PDF file, extract
the font outlines and use those, no, but I could be wrong. In any event,
as you note below, this is insufficient in the general case, since
'text' can also be emitted as linework (draw a path matching the glyph
and fill, or stroke) or as images (type 3 fonts, especially logos).


Ken> I'm afraid neither PostScript nor PDF is intended as an
Ken> 'editable' format, and so there is no provision for doing
Ken> what you want.

In the worst case, the PS/PDF file could have just drawn each
character as a drawing, skipping the use of the font machinery! :(
(Yeah, I know that's very inefficient, it can't take advantage of the
font caching mechanism.)

It's also possible to get multiple instances of the same 'text', for
example by filling a glyph in one colour, and stroking in another. A
naive extraction might extract the text twice ;-)
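
As a rough illustration of that last point, here is a minimal sketch of
how an extractor might drop the second copy. It assumes the extractor
has already produced a list of text runs with page positions; the
TextRun type, the dedupe_runs helper and the 0.5 unit tolerance are all
hypothetical names and values made up for this example, not part of any
real PDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """Hypothetical record for one piece of extracted text."""
    text: str
    x: float   # origin of the run on the page
    y: float
    mode: str  # "fill" or "stroke"; illustrative label only


def dedupe_runs(runs, tol=0.5):
    """Keep the first run seen for each (text, position) pair.

    Two runs count as the same 'text' when their strings match and
    their origins differ by at most `tol` in both x and y. The
    tolerance is an assumption, not a value taken from any spec.
    """
    kept = []
    for run in runs:
        duplicate = any(
            run.text == k.text
            and abs(run.x - k.x) <= tol
            and abs(run.y - k.y) <= tol
            for k in kept
        )
        if not duplicate:
            kept.append(run)
    return kept


# The same word painted twice (filled, then stroked) plus one other run.
runs = [
    TextRun("Hello", 72.0, 700.0, "fill"),
    TextRun("Hello", 72.0, 700.0, "stroke"),
    TextRun("World", 72.0, 680.0, "fill"),
]
print(" ".join(r.text for r in dedupe_runs(runs)))
```

This only catches exact overprints. Text that is deliberately offset
(a drop shadow, say) would need a larger tolerance, at the risk of
merging runs that really are separate.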


Ken