Re: Scanner recommendation for decent scans



I did some more experimenting and found out that:

1. Scanning at 600 dpi black and white with a higher threshold (so that
more black dots are produced) yields much higher OCR accuracy,
possibly 99% or more.

2. I created a special OCR training bitmap file with the
single-character fractions ('1/4', '1/2') spaced far apart so that
there is a lot of white space around each character. Using a large
font such as 24 point and repeatedly OCR'ing each character allowed me
to train the OCR software to recognize those fractions.
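
To illustrate why a higher threshold produces more black dots (item 1),
here is a minimal Python sketch. It assumes the common convention that a
grayscale pixel darker than the threshold becomes black; the actual
scanner driver may use a different scale or direction. The pixel values
and the names binarize and count_black are made up for illustration.

```python
def binarize(gray_rows, threshold):
    """Map 8-bit grayscale pixels to 1 (black) or 0 (white).

    Assumed convention: a pixel becomes black when its value is below the
    threshold, so raising the threshold turns more pixels black.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


def count_black(bitmap):
    """Count the black pixels in a binarized bitmap."""
    return sum(sum(row) for row in bitmap)


# Hypothetical 3x4 patch: faint ink (about 120-150) on tan paper (about 200).
patch = [
    [200, 150, 140, 205],
    [198, 120, 135, 210],
    [202, 145, 128, 199],
]

low = binarize(patch, 130)   # low threshold: only the darkest ink survives
high = binarize(patch, 160)  # higher threshold: the faint strokes are kept too

print(count_black(low), count_black(high))  # prints "2 6"
```

With the low threshold most of the faint strokes drop out, which is
roughly what broken characters in a scan look like to the OCR engine.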

The extra 4 or 5 minutes per scan actually saves grunt work in
correcting OCR errors and text ordering problems (i.e., where the
layout was OCR'ed incorrectly).
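
A rough way to see how much correction work the higher accuracy saves:
if each character is recognized independently with the same accuracy,
the chance that a word has at least one error is 1 - accuracy^length.
This is only a sketch. The independence model and the average word
length of 5 are my assumptions, so it will not match measured word
error rates exactly.

```python
def word_error_rate(char_accuracy, avg_word_length):
    """Estimate the fraction of words with at least one wrong character.

    Assumes every character is recognized independently with the same
    accuracy, which is a simplification.
    """
    return 1.0 - char_accuracy ** avg_word_length


# 0.91 is the 300 dpi figure from the original post, 0.99 the 600 dpi one.
for acc in (0.91, 0.99):
    rate = word_error_rate(acc, 5)  # 5 letters per word is an assumed average
    print(f"{acc:.0%} per character -> {rate:.1%} of words need fixing")
```

Under these assumptions, going from 91% to 99% per character cuts the
words needing correction from over a third to about one in twenty.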

I do like the idea of photo-copying using this workflow:

1. Photocopy each 10 by 12 inch page, adjust the contrast, and output on
8.5 by 11 inch paper

2. Scan at 600 dpi or higher (this is a single scan, which saves the
time that was spent trying to join two scans together)

3. OCR

4. Correct, reformat text, etc.
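
One thing to keep in mind with this workflow is that the copier has to
reduce the page to fit, so a 600 dpi scan of the copy is less than
600 dpi relative to the original. A quick sketch of the arithmetic,
assuming the copier scales uniformly and can use the full sheet with no
margins (real copiers usually lose a little more); fit_scale is just an
illustrative helper name:

```python
def fit_scale(src_w, src_h, dst_w, dst_h):
    """Largest uniform scale that fits the source page on the target sheet."""
    return min(dst_w / src_w, dst_h / src_h)


scan_dpi = 600
scale = fit_scale(10.0, 12.0, 8.5, 11.0)   # journal page -> letter paper
effective_dpi = scan_dpi * scale           # resolution relative to the original

# Height in pixels of type of a given point size (72 points per inch).
for points in (8, 4):
    pixels = points / 72 * effective_dpi
    print(f"{points} pt type is about {pixels:.0f} pixels tall")

print(f"scale {scale:.2f}, effective resolution {effective_dpi:.0f} dpi")
```

The width is the limiting dimension (8.5 / 10 = 0.85), so the 600 dpi
scan works out to about 510 dpi on the original page, still well above
a direct 300 dpi scan.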

Advantages:

1. Handling the odd-sized, aged paper journals is easier since each
page is 'scanned' by the copier once instead of needing 2 scans with an
8.5 by 11 scanner.

2. Copier can do a better job of contrast, threshold, etc than the
scanner.

3. Copies could be *** fed into scanner (requires me buying a new
scanner)

4. Full scan, OCR, output cycle could be automated with *** feeder.




On Thu, 21 Jul 2005 14:17:26 GMT, "lostinspace"
<lostinspace@xxxxxxxxxxxxxxxx> wrote:

>----- Original Message -----
>From: <>
>Newsgroups: comp.periphs.scanners
>Sent: Thursday, July 21, 2005 12:05 AM
>Subject: REQ: Scanner recommendation for decent scans
>
>
>>I have a Visioneer OneTouch 8900 USB scanner and am very tired of the
>> poor scan results. Scans are blurry, OCR is mostly pointless at
>> anything below 300 dpi (slow scans at 300 dpi).
>>
>> Are there any better scanners priced less than $200 USA? Especially
>> with a much better OCR package (not TextBridge Pro 9.0).
>>
>> I use the scanner for:
>> 1. Scanning old printed journals (see below)
>> 2. Scanning old photographs for CDR archival / retouching /
>> reprinting
>>
>> The issues I have:
>> 1. Slow scanning times (45 seconds a scan or longer)
>> (2.4ghz P4, usb1, 128mb or more free memory, gigabytes
>> of free disk space)
>>
>> 2. Poor raw OCR (accuracy for 300dpi is 91% and the real accuracy is
>> that about 25% of the words are mis-OCRed - a big issue). I've tried
>> a 600dpi scan but that takes about 4-5 minutes per scan.
>>
>> Here are the details for the printed journals:
>> 1. Each page is 10 inches by 12 inches (I cannot cut them
>> and ***-feed them)
>> 2. Each page is slightly yellowed (about tan coloured)
>> 3. The text is 8 or 9 point with serifs
>> 4. The newspaper is usually in a 4 column format
>> 5. There is some mixing of text point sizes ( 7 point is the smallest,
>> with 8 point the usual font)
>> 6. There are a significant number of single character fractions such
>> as '1/4', '1/2', '2/3', '7/8' (i.e., each digit of the fraction is 4
>> or 5 point)
>>
>> My workflow:
>>
>> 1. Scan test page at 300 dpi (requires 2 scans since the page is
>> larger than the scanning area)
>> 2. Set scanner to scan in black and white (set threshold level to
>> compensate for the yellow paper)
>> 3. Scan
>> 4. Fix any page skew
>> 5. Crop if needed
>> 6. Save as windows monochrome bitmap file
>>
>> Repeat 1-6 for an entire journal issue (about 20 pages)
>>
>> 7. Bring the pages into TextBridge Pro 9.0 as poor quality newspaper
>> scans.
>> 8. Enable OCR training
>> 9. Process each page
>> 10. Send text to Notepad (removes all garbage formatting)
>> 11. Reorder paragraphs if necessary to be in the correct order
>> 12. Copy and paste into MS word
>> 13. Spell check in MS word - correct misspellings due to OCR process -
>> leave in misspellings in original source material
>>
>>
>
>Andy,
>
> This inquiry might be better served in:
>comp.doc.management
>or
>comp.ai.doc-analysis.ocr
>
>I do both daily and extensive (for over five years) scanning with a bottom
>line scanner ($50 Canon) utilizing Omnipage 9.0. Many folks suggest that 9.0
>is terrible quality, however I'm most pleased in the results as compared to
>my previous scanner. (My machine is an Athlon 2.0 with 512k, USB 2.0; none of
>this increases the actual scan speed, only the scanner may do that).
>
>The first place I suggest you start is in cleaning both sides of the scanner
>glass. It's a careful and tedious process to remove all the streaks,
>smudges and accumulated plastic. I often repeat the cleaning 3-4 times in
>each cleaning session for quality. (Caution; no paper towels, no windex)
>
>Any DPI less than 300 will not offer good quality for OCR.
>
>If you're working with either yellowed or aged documents, color is preferable
>over black and white.
>You'll be surprised how much improvement this will make.
>
>Small fonts and fractions will be a problem that you'll never resolve.
>Flatbed scanners just don't have enough depth-of-field or magnification
>options for small fonts. Many of the docs that I work with use fifths of
>seconds repeatedly, and those rarely scan correctly.
>
>The OCR corrections are best made in the OCR software rather than a spell
>checker. They do offer a "change all".
>
>As far as multi-column newspaper OCR?
>Your best results will be in using a quality copy machine to increase the
>page sizes.
>I've had good results doing so with some 100+YO newspapers (four column 11 x
>17) and the machines at Kinko's.
>
>In summary, I believe you're looking for an automated solution that just
>doesn't exist.
>You either accept crappy results and move on or do the manual editing to
>assure the desired goal, there doesn't seem to be any in-between.
>
>
