Re: how to tell if a file has been ocr'd



On May 11, 9:56 am, rpresser <rpres...@xxxxxxxxx> wrote:
On May 11, 4:34 am, q...@xxxxxxxxxxxxxx (Aandi Inston) wrote:

sco...@xxxxxxxxxxx wrote:
Does anyone know how I can tell whether a PDF file has been OCR'd
without opening it? Say I have a directory with 100 PDFs in it and
half have been OCR'd. Is there anything in the registry or anywhere
else that would tell me? I can code something up to figure this out;
I just need to find a setting.

It's not a setting, and it can hardly be in the registry. A deep
examination of the file might tell you something: you can tell whether
the file contains any text by looking for Font resources in each
resources dictionary, with a recursive scan. Finding fonts doesn't
prove OCR, but it makes a raw scan unlikely. Clearly, any program
doing this examination would have to open the file.

The xpdf package has a tool called pdffonts that will do the font
scan:

C:\TEMP>y:\xpdf\pdffonts.exe pdfbook-ACU-sample.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Helvetica                            Type 1       no  no  no       2  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     69  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     70  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     71  0
Times-Roman                          Type 1       no  no  no       7  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     84  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     85  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     86  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     99  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    100  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    101  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     54  0
ETKBCW+MicroFLF                      TrueType     yes yes yes     55  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes     56  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes    144  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    145  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    146  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes    204  0
ETKBCW+MicroFLF                      TrueType     yes yes yes    205  0
BCROIR+TrebuchetMS-Bold              TrueType     yes yes yes    206  0

If you have only a few fonts that aren't subsetted, it's unlikely to
have been OCRed; lots of fonts would tip the scale the other way.
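
For anyone who wants to script the font check, here is a minimal
Python sketch that parses pdffonts output in the column layout shown
above (name, type, emb, sub, uni, object ID). It assumes font names
contain no spaces, and other pdffonts builds may print extra columns,
so treat the parsing as an assumption. The function name
parse_pdffonts is made up for this example.

```python
def parse_pdffonts(output):
    """Parse pdffonts text output into a list of font records.

    Assumes the layout quoted in this thread:
    name, type, emb, sub, uni, object ID (two integers).
    """
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # Data rows end in two integers (the object ID); this skips
        # the header, the dashed separator, and blank lines.
        if len(tokens) < 7:
            continue
        if not (tokens[-1].isdigit() and tokens[-2].isdigit()):
            continue
        fonts.append({
            "name": tokens[0],
            # The type may contain a space ("Type 1"), so take
            # everything between the name and the last five columns.
            "type": " ".join(tokens[1:-5]),
            "embedded": tokens[-5] == "yes",
            "subset": tokens[-4] == "yes",
            "unicode": tokens[-3] == "yes",
            "object_id": (int(tokens[-2]), int(tokens[-1])),
        })
    return fonts


sample = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Helvetica                            Type 1       no  no  no       2  0
MDWSSN+TimesNewRomanPSMT             TrueType     yes yes yes     69  0
"""
fonts = parse_pdffonts(sample)
subset_count = sum(1 for f in fonts if f["subset"])
print(len(fonts), subset_count)
```

The font count and the subset count are the two numbers the rule of
thumb above cares about.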

Another choice is xpdf's pdfinfo:

C:\TEMP>pdfinfo pdfbook-ACU-sample.pdf
Producer:       iTextdotNET by ujihara.jp based on iText 1.3 by lowagie.com (based on itext-paulo-153)
CreationDate:   Thu Apr 19 18:09:00 2007
ModDate:        Thu Apr 19 18:09:00 2007
Tagged:         no
Pages:          12
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      9793869 bytes
Optimized:      no
PDF version:    1.4

If the Producer is a well-known OCR package like OmniPage, chances are
it's an OCRed document.
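
A small Python sketch of the Producer check, assuming pdfinfo prints
one "Key: value" pair per line as in the output above. The keyword
list is only illustrative: OmniPage is the one package named in this
thread, the other entries are my own additions, and the function names
are hypothetical.

```python
def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Illustrative keyword list (an assumption); extend it to taste.
OCR_PRODUCER_KEYWORDS = ("omnipage", "abbyy", "tesseract")


def producer_suggests_ocr(info, keywords=OCR_PRODUCER_KEYWORDS):
    """True if the Producer field mentions one of the keywords."""
    producer = info.get("Producer", "").lower()
    return any(k in producer for k in keywords)


sample = """\
Producer:       iTextdotNET by ujihara.jp based on iText 1.3 by lowagie.com (based on itext-paulo-153)
CreationDate:   Thu Apr 19 18:09:00 2007
Pages:          12
"""
info = parse_pdfinfo(sample)
print(info["Producer"])
print(producer_suggests_ocr(info))
```

A Producer that doesn't match proves nothing either way; it only means
the file was last written by something not on the list.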

Finally, if you're willing to spend some time and disk space,
pdftotext will actually try to extract the text, and the size of its
output is a strong hint as to whether the file has a text layer. You
could limit it to the first page of a long document.
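
Here is a rough Python sketch of the first-page test. It assumes
pdftotext is on the PATH; -f and -l select the first and last page,
and "-" as the output file sends the text to stdout instead of disk.
The helper names are made up for this example.

```python
import subprocess


def first_page_text(pdf_path):
    """Run pdftotext on page 1 only and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Lenient decoding: only the amount of text matters here,
    # not every character being exactly right.
    return result.stdout.decode("utf-8", errors="replace")


def visible_char_count(text):
    """Count non-whitespace characters (form feeds count as blank)."""
    return sum(1 for ch in text if not ch.isspace())


# pdftotext ends each page with a form feed, so a page with no text
# layer typically yields only whitespace and a count of zero.
print(visible_char_count("Hello OCR\n\f"))
print(visible_char_count(" \n\f"))
```

Calling visible_char_count(first_page_text("some.pdf")) on a real file
gives the number to compare against whatever threshold you settle on.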

A clever person could assign scores to these various tests and wrap it
up in a script which would give a percentage confidence that the pdf
was OCRed...
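
To make the scoring idea concrete, here is one possible shape for it
in Python. Every weight and threshold below is invented for
illustration and has not been tuned on real files; the inputs are
assumed to come from pdffonts (font count), pdfinfo (Producer) and
pdftotext (character count).

```python
def ocr_confidence(font_count, producer, text_chars,
                   few_fonts=3, ocr_keywords=("omnipage",)):
    """Return a rough 0-100 score that a PDF was OCRed.

    All weights and thresholds are assumptions, not measured values.
    """
    # No fonts and no extractable text: most likely a raw scan.
    if font_count == 0 and text_chars == 0:
        return 0
    score = 0
    if text_chars > 0:
        score += 20  # there is a text layer of some kind
    if font_count > few_fonts:
        score += 30  # "lots of fonts" tips the scale toward OCR
    if any(k in producer.lower() for k in ocr_keywords):
        score += 50  # Producer names a known OCR package
    return score


print(ocr_confidence(0, "", 0))
print(ocr_confidence(10, "OmniPage 15", 500))
print(ocr_confidence(2, "iText 1.3", 500))
```

The Producer match gets the largest weight here only because it is the
most direct of the three signals; a real script would want the weights
checked against files whose status is already known.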

I decided to run pdfinfo and use the Producer field to pull every
producer used to either create a file or OCR it. I then listed every
file under each Producer and let the documents team sort out which is
which. We have over 30,000 PDFs to go through and needed a quick and
dirty solution. I'd say this fits both those categories.
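
One way such a per-Producer listing could be built in Python (a
sketch, not necessarily how it was done here). It assumes the Producer
strings have already been collected from pdfinfo into a dict; the file
names and values in the example are hypothetical.

```python
from collections import defaultdict


def group_by_producer(producers):
    """Map each Producer string to the sorted files that carry it.

    `producers` is {filename: producer string}; files with an empty
    Producer are grouped under "(none)".
    """
    groups = defaultdict(list)
    for filename, producer in producers.items():
        groups[producer or "(none)"].append(filename)
    return {p: sorted(files) for p, files in sorted(groups.items())}


# Hypothetical file names and Producer values, for illustration only.
report = group_by_producer({
    "a.pdf": "OmniPage 15",
    "b.pdf": "iText 1.3",
    "c.pdf": "OmniPage 15",
    "d.pdf": "",
})
for producer, files in report.items():
    print(f"{producer}: {', '.join(files)}")
```

With 30,000 files the number of distinct Producers should be far
smaller than the number of files, which is what makes a manual sort
practical.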

Thanks!

My second approach was to use the Adobe APIs to write a C# app to
find the info, but this is good enough.

Scott
