How to Improve Document Scanning and Zone OCR Accuracy
Zone Alignment
Often, the location of a zone moves from image to image. This is due to changes in the form types, or even border issues with scanning. Auto cropping can help to some degree, but too often zones shift just enough to make fixed zones impossible to use. Some tools incorporate "Anchor Words", which are high-accuracy words used as an anchor point. A general area of interest is defined on the document, and the anchor word is used to align the zones to it. As long as the anchor word appears consistently, your zone positions should improve greatly with these kinds of tools. Only high-confidence words should be used. In our example, high-confidence words are shown along with the selected anchor word (Part).
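To make the anchor word idea concrete, here is a minimal Python sketch. It assumes the OCR engine returns each word with a bounding-box position and a confidence value between 0 and 1. The `Word` and `Zone` types, the `find_anchor` and `align_zones` helpers, and the 0.9 confidence cutoff are all hypothetical names and values chosen for illustration, not the API of any particular product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: int             # left edge of the word's bounding box, in pixels
    y: int             # top edge of the word's bounding box, in pixels
    confidence: float  # assumed to range from 0.0 to 1.0


@dataclass
class Zone:
    name: str
    x: int
    y: int
    width: int
    height: int


def find_anchor(words: List[Word], anchor_text: str,
                min_confidence: float = 0.9) -> Optional[Word]:
    """Return the highest-confidence occurrence of the anchor word, if any."""
    candidates = [w for w in words
                  if w.text == anchor_text and w.confidence >= min_confidence]
    return max(candidates, key=lambda w: w.confidence) if candidates else None


def align_zones(zones: List[Zone], template_anchor: Word,
                found_anchor: Word) -> List[Zone]:
    """Shift every zone by the offset between the template and found anchor."""
    dx = found_anchor.x - template_anchor.x
    dy = found_anchor.y - template_anchor.y
    return [Zone(z.name, z.x + dx, z.y + dy, z.width, z.height) for z in zones]


# Anchor position and zones as defined on the template form.
template_anchor = Word("Part", 100, 50, 1.0)
zones = [Zone("part_number", 160, 48, 120, 24)]

# Words found on a scanned page that is shifted slightly from the template.
page_words = [Word("Part", 107, 46, 0.97), Word("Qty", 300, 46, 0.62)]

anchor = find_anchor(page_words, "Part")
aligned = align_zones(zones, template_anchor, anchor) if anchor else zones
print(aligned)
```

This sketch only corrects for a simple shift. A page that is also rotated or scaled would need a second anchor or a deskew step first.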
Zone Image Improvement
Since zone images are used only temporarily for text extraction, they can be processed further without affecting the saved file. Some image processing options that can affect the OCR results include:
- Line Removal will remove grid lines that sometimes interfere with the OCR or cause extra characters, such as an "I", to be recognized. More advanced systems can remove lines that intersect with text and repair the area where the characters cross the removed line.
- Edge smoothing can be used to deal with "spurs" and other edge issues on documents.
- Adaptive Thresholding, a two-dimensional image processing technique that incorporates neighboring pixels into its algorithms, can also be used to help smooth out the characters.
- Pixel Expansion or thickening is helpful for lightly scanned images. The pixels are expanded in 4 or 8 directions to help the OCR recognize the objects.
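Two of the options above, pixel expansion and line removal, can be sketched in plain Python on a small binary image. This is an illustrative sketch only: the image is assumed to be a list of rows where 1 is ink and 0 is background, and `expand_pixels` and `remove_horizontal_lines` are hypothetical helper names. The line removal shown is the naive kind, so it does not repair characters that cross the removed line the way the more advanced systems do.

```python
from typing import List

Image = List[List[int]]  # assumed format: 1 = ink, 0 = background

OFFSETS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
OFFSETS_8 = OFFSETS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def expand_pixels(image: Image, directions: int = 4) -> Image:
    """Thicken strokes by copying each ink pixel to its 4 or 8 neighbors."""
    if directions not in (4, 8):
        raise ValueError("directions must be 4 or 8")
    offsets = OFFSETS_4 if directions == 4 else OFFSETS_8
    height = len(image)
    width = len(image[0]) if image else 0
    result = [row[:] for row in image]
    for r in range(height):
        for c in range(width):
            if image[r][c] == 1:
                for dr, dc in offsets:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < height and 0 <= nc < width:
                        result[nr][nc] = 1
    return result


def remove_horizontal_lines(image: Image, min_length: int) -> Image:
    """Erase horizontal runs of ink that are at least min_length pixels long."""
    result = [row[:] for row in image]
    for r, row in enumerate(image):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                if c - start >= min_length:
                    for k in range(start, c):
                        result[r][k] = 0
            else:
                c += 1
    return result


dot = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
print(expand_pixels(dot, 4))  # the dot grows into a plus shape
print(expand_pixels(dot, 8))  # the dot grows into a full 3x3 square
```

The 4-direction variant thickens strokes more gently than the 8-direction one, which can matter when characters sit close together and risk merging.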
Using Multiple OCR Engines and Word Confidence
Some systems assign confidence scores to the words captured in an OCR zone. When combined with multiple OCR engines, you can take the best results to obtain higher accuracy. OCR engines use different techniques to match characters and deal with broken and disjointed image data. Passing images through each engine, scoring the confidence, then using the best-scoring engine can help improve accuracy and recover data when one engine fails on a zone.
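The selection step can be sketched as follows. The engines here are stand-in functions that return hard-coded (word, confidence) pairs, since real engines each have their own APIs. The names `zone_confidence` and `best_engine_result`, and the choice to score a zone by the average word confidence, are assumptions made for this example. Scores from different engines are not always directly comparable, so a real system may need to calibrate them first.

```python
from typing import Callable, Dict, List, Optional, Tuple

WordResult = Tuple[str, float]                  # (recognized text, confidence)
Engine = Callable[[bytes], List[WordResult]]    # hypothetical engine interface


def zone_confidence(words: List[WordResult]) -> float:
    """Score a zone as the average word confidence (0.0 if nothing was read)."""
    if not words:
        return 0.0
    return sum(conf for _, conf in words) / len(words)


def best_engine_result(zone_image: bytes, engines: Dict[str, Engine]
                       ) -> Tuple[Optional[str], str, float]:
    """Run every engine on the zone and keep the highest-scoring result."""
    best_name: Optional[str] = None
    best_words: List[WordResult] = []
    best_score = -1.0
    for name, engine in engines.items():
        try:
            words = engine(zone_image)
        except Exception:
            words = []  # treat an engine failure as an empty result
        score = zone_confidence(words)
        if score > best_score:
            best_name, best_words, best_score = name, words, score
    text = " ".join(word for word, _ in best_words)
    return best_name, text, best_score


# Stand-in engines with made-up output for the same zone image.
engines: Dict[str, Engine] = {
    "engine_a": lambda image: [("lnvoice", 0.55), ("1042", 0.80)],
    "engine_b": lambda image: [("Invoice", 0.93), ("1042", 0.91)],
    "engine_c": lambda image: [],  # this engine found nothing in the zone
}

name, text, score = best_engine_result(b"", engines)
print(name, text, round(score, 2))
```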
Using Regular Expressions
With regular expressions, you can expand the search region to look for keywords, or define rules that return only the exact character string. If we have a zone from which we want to extract a zip code, we can look for a sequence of 5 digits, or 5 digits followed by a dash and 4 more digits.
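As a small illustration of the zip code case, the Python sketch below pulls a US ZIP code out of the text recognized in a zone. It assumes the standard formats of five digits, optionally followed by a dash and four more digits (ZIP+4). The `extract_zip` helper is a hypothetical name used only for this example.

```python
import re
from typing import Optional

# Five digits, optionally followed by a dash and four digits.
# The lookarounds stop the pattern from matching inside a longer digit run.
ZIP_PATTERN = re.compile(r"(?<!\d)\d{5}(?:-\d{4})?(?!\d)")


def extract_zip(zone_text: str) -> Optional[str]:
    """Return the first ZIP code found in the zone text, or None."""
    match = ZIP_PATTERN.search(zone_text)
    return match.group(0) if match else None


print(extract_zip("Springfield, IL 62704-1234"))
print(extract_zip("Austin TX 78701"))
print(extract_zip("Phone 5551234567"))
```

Because the pattern ignores surrounding text, the zone can be drawn larger than the zip code itself, which makes it more tolerant of small shifts in zone position.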
To Wrap It Up
When extracting text from an image, OCR accuracy, zone placement, and post-processing are crucial. To optimize your scanning success:
- Use Good Pre-processing Techniques
Good pre-processing can be as important as the scanning technologies involved. Encourage accuracy by setting document procedures and guidelines to:
- Use adequate white space
- Limit lines and gridlines
- Limit the use of color
- Use OCR friendly fonts and sizes
- Scan at a minimum of 200 or 300 DPI
- Use an Intelligent Document and Data Capture Solution
Software such as ImageRamp uses advanced cleanup and validation technology with preview and testing mechanisms.