Re: extract text from PDF file
- From: Ken Sharp <ken@xxxxxxxxxxx>
- Date: Thu, 3 Aug 2006 17:42:50 +0100
In article <87psfhdd2c.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxx>,
danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx says...
Ken> TBH, the 'right' way to do this is by extracting text from
Ken> the original application. If you can't do that, then the most
Ken> reliable method is to OCR the document.
I think so, too. But instead of printing a hard copy and then
scanning it back in (or converting the PS/PDF to an image file
directly), is there OCR software smart enough to recognize characters
from the glyph information? E.g., the glyph metrics may already be
useful for guessing which characters they map to. There is no need to
rework these measures from a raster sub-image.
I don't think there's anything that will deal with a PDF file, extract
the font outlines and use those, no, but I could be wrong. In any event,
as you note below, this is insufficient in the general case, since
'text' can also be emitted as linework (draw a path matching the glyph
and fill, or stroke) or as images (type 3 fonts, especially logos).
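
To illustrate the glyph-metrics idea and where it runs out, here is a
minimal sketch in Python. It assumes we have somehow read a glyph's advance
width from the embedded font and want to guess which characters it could
be. The width table and the function name are hypothetical, made up for
illustration; a real tool would need metrics for the actual typeface and
would also have to compare outlines.

```python
# Illustrative advance widths in 1/1000 em units. These values are an
# assumption for the example, not data taken from any particular PDF.
ASSUMED_WIDTHS = {
    "i": 222, "l": 222,
    "t": 278, "f": 278,
    "c": 500,
    "a": 556, "e": 556,
    "m": 833,
}


def candidates_for_width(width, table=ASSUMED_WIDTHS, tol=1):
    """Return every character whose reference width is within tol of width."""
    return sorted(ch for ch, w in table.items() if abs(w - width) <= tol)


wide = candidates_for_width(833)     # only one entry in the table is this wide
narrow = candidates_for_width(222)   # several narrow glyphs share a width
unknown = candidates_for_width(999)  # nothing in the table matches
```

With this table the wide glyph resolves to a single candidate, while the
narrow one stays ambiguous, so widths alone can at best shorten the list.
And none of this helps when the 'text' was drawn as paths or images with no
font involved at all.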
Ken> I'm afraid neither PostScript nor PDF is intended as an
Ken> 'editable' format, and so there is no provision for doing
Ken> what you want.
In the worst case, the PS/PDF file could have just drawn each
character as a drawing, skipping the use of the font machinery! :(
(Yeah, I know that's very inefficient, it can't take advantage of the
font caching mechanism.)
It's also possible to get multiple instances of the same 'text', for
example by filling a glyph in one colour, and stroking in another. A
naive extraction might extract the text twice ;-)
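
A rough sketch of how an extractor might guard against that, in Python. It
assumes the extractor has already produced a list of text runs with a
string, a position and the paint operation used; the TextRun record, the
drop_overpainted_runs name and the 0.5 unit tolerance are all hypothetical
choices for this example, not part of any real PDF library.

```python
from typing import NamedTuple


class TextRun(NamedTuple):
    text: str
    x: float      # position of the run origin, in page units
    y: float
    paint: str    # "fill" or "stroke" (kept for information only)


def drop_overpainted_runs(runs, tol=0.5):
    """Keep the first run seen for each (text, position) pair.

    A later run with the same string whose origin lies within tol of an
    already kept run is treated as the same text painted a second time.
    """
    kept = []
    for run in runs:
        duplicate = any(
            run.text == k.text
            and abs(run.x - k.x) <= tol
            and abs(run.y - k.y) <= tol
            for k in kept
        )
        if not duplicate:
            kept.append(run)
    return kept


runs = [
    TextRun("Hello", 72.0, 700.0, "fill"),
    TextRun("Hello", 72.0, 700.0, "stroke"),  # same glyphs, outlined in another colour
    TextRun("World", 72.0, 680.0, "fill"),
]
extracted = " ".join(r.text for r in drop_overpainted_runs(runs))
```

The catch is that the same check would also discard text that really is
printed twice at the same spot, and it would miss an overpaint that is
offset by more than the tolerance (a drop shadow, say), so treat it as a
heuristic.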
Ken