Re: how to tell if a file has been ocr'd



On May 11, 9:56 am, rpresser <rpres...@xxxxxxxxx> wrote:
On May 11, 4:34 am, q...@xxxxxxxxxxxxxx (Aandi Inston) wrote:

sco...@xxxxxxxxxxx wrote:
Does anyone know how I can tell if a PDF file has been OCR'd or not
without opening it? Let's say I have a directory with 100 PDFs in it
and half have been OCR'd. Is there anything in the registry or
anywhere else that would tell me? I can code something up to figure
this out; I just need to find a setting.

It's not a setting, and it can hardly be in the registry. A deep
examination of the file might tell you something: you can tell whether
the file contains any text by looking for Font resources in each
resources dictionary, with a recursive scan. Finding fonts doesn't
prove OCR, but a file that has them is unlikely to be a raw scan.
Clearly, any program doing this examination would have to open the
file.

The xpdf package has a tool called pdffonts that will do the font
scan:

C:\TEMP>y:\xpdf\pdffonts.exe pdfbook-ACU-sample.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Helvetica                            Type 1       no  no  no       2  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     69  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     70  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     71  0
Times-Roman                          Type 1       no  no  no       7  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     84  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     85  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     86  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     99  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    100  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    101  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     54  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     55  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     56  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes    144  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    145  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    146  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes    204  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    205  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    206  0

If you have only a few fonts that aren't subsetted, it's unlikely to
have been OCRed; lots of fonts would tip the scale the other way.
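
For anyone who wants to automate that font check, here is a rough Python sketch that parses pdffonts output of the shape shown above and counts distinct and subsetted fonts. The function names are made up for illustration, and it assumes the xpdf column layout from this post (name, type, emb, sub, uni, object ID) and font names without spaces; other builds of pdffonts may print extra columns, so treat it as a starting point.

```python
def parse_pdffonts(output):
    """Parse pdffonts text output into one dict per font row.

    Assumes the column layout shown in this thread:
    name, type, emb, sub, uni, object number, generation.
    """
    lines = output.splitlines()
    # Data rows start after the dashed separator line; no separator, no rows.
    start = next(
        (i + 1 for i, line in enumerate(lines) if line.startswith("---")),
        len(lines),
    )
    fonts = []
    for line in lines[start:]:
        parts = line.split()
        # Need at least name, type, three flags, and two ID numbers.
        if len(parts) < 7 or not parts[-2].isdigit():
            continue
        emb, sub, uni, obj, _gen = parts[-5:]
        fonts.append({
            "name": parts[0],
            # The type may contain a space ("Type 1"), so join the middle tokens.
            "type": " ".join(parts[1:-5]),
            "emb": emb == "yes",
            "sub": sub == "yes",
            "uni": uni == "yes",
            "object": int(obj),
        })
    return fonts


def font_summary(fonts):
    """Count rows, distinct font names, and subsetted rows."""
    return {
        "rows": len(fonts),
        "distinct_fonts": len({f["name"] for f in fonts}),
        "subsetted": sum(1 for f in fonts if f["sub"]),
    }
```

An empty result from parse_pdffonts means no fonts were reported, which is what you would expect from a raw scan with no text layer.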

Another choice is xpdf's pdfinfo:

C:\TEMP>pdfinfo pdfbook-ACU-sample.pdf
Producer:       iTextdotNET by ujihara.jp based on iText 1.3 by lowagie.com (based on itext-paulo-153)
CreationDate:   Thu Apr 19 18:09:00 2007
ModDate:        Thu Apr 19 18:09:00 2007
Tagged:         no
Pages:          12
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      9793869 bytes
Optimized:      no
PDF version:    1.4

If the Producer is a well-known OCR package like Omnipage, chances are
it's an OCRed document.
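
A minimal Python sketch of the Producer test, assuming you have already captured pdfinfo's text output as a string. The helper names and the hint list are hypothetical; "omnipage" is the only entry because it is the only package named here, and you would extend the list with whatever producers turn up in your own files.

```python
# Lowercase substrings that suggest an OCR producer. Only "omnipage" comes
# from this thread; anything else you add is your own assumption.
OCR_PRODUCER_HINTS = ("omnipage",)


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def looks_like_ocr_producer(producer, hints=OCR_PRODUCER_HINTS):
    """Return True if the Producer string contains any known OCR hint."""
    lowered = producer.lower()
    return any(hint in lowered for hint in hints)
```

For the sample above, parse_pdfinfo(output)["Producer"] starts with "iTextdotNET", which matches no hint, so the file would not be flagged on this test alone.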

Finally, if you're willing to spend some time and disk space,
pdftotext will actually try extracting the text, and the size of that
output is a strong hint. You could limit it to the first page
of a long document.
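
Here is one way the first-page test might look in Python. It is only a sketch: it assumes pdftotext is on the PATH, uses its -f and -l options to restrict extraction to page 1, and passes "-" as the output file so the text goes to stdout instead of disk. The function names are invented for this example.

```python
import subprocess


def first_page_command(pdf_path):
    """Build the pdftotext command line for page 1 only, output to stdout."""
    return ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"]


def visible_char_count(text):
    """Count non-whitespace characters (ignores newlines and form feeds)."""
    return sum(1 for ch in text if not ch.isspace())


def first_page_text_size(pdf_path):
    """Run pdftotext on the first page and return its visible character count.

    Raises CalledProcessError if pdftotext fails, for example on a
    damaged or password-protected file.
    """
    result = subprocess.run(
        first_page_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return visible_char_count(result.stdout)
```

A count near zero suggests an image-only page. A first page that is just a cover image could give the same result, so checking a second page may be worthwhile.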

A clever person could assign scores to these various tests and wrap it
up in a script which would give a percentage confidence that the pdf
was OCRed...

I decided to run pdfinfo and use the Producer tag to pull every
producer used to either create a file or OCR it. I then listed
every file for each Producer and let the documents team sort out which
is which. We have over 30,000 PDFs to go through and needed a quick
and dirty solution. I'd say this fits both those categories.
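
For anyone wanting to do the same, the grouping step could be sketched in Python roughly like this. It assumes you have already collected a mapping of file path to Producer string (for instance by running pdfinfo on each file); the function names and the "(no Producer)" label are made up for the example and this is not the script I actually used.

```python
from collections import defaultdict


def group_by_producer(producers_by_file):
    """Group file paths under their Producer string.

    producers_by_file maps a path to the Producer text, or to "" / None
    when pdfinfo reported none.
    """
    groups = defaultdict(list)
    for path, producer in sorted(producers_by_file.items()):
        groups[producer or "(no Producer)"].append(path)
    return dict(groups)


def format_report(groups):
    """Render one section per Producer, with its files indented below."""
    lines = []
    for producer in sorted(groups):
        lines.append(f"{producer} ({len(groups[producer])} files)")
        lines.extend(f"    {path}" for path in groups[producer])
    return "\n".join(lines)
```

The report gives one heading per Producer with the matching files listed underneath, which is the kind of list a documents team can work through by hand.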

Thanks!

My second approach was going to be to use the Adobe APIs to write a
C# app to find the info, but this is good enough.

Scott
