Comparison of OCR Tools for Mandarin

The LRC has a number of tools for running OCR, or optical character recognition. OCR is needed when a text has been scanned. A scanned document is composed of images: although it looks like text, it is actually a picture of the text. OCR converts those images into real text, which can be searched, copied and pasted, and converted into other formats such as Microsoft Word.

A recent request for help with OCR for a scanned Mandarin Chinese text became an opportunity to compare the accuracy of several OCR tools. Here's an excerpt from the original scan:

Excerpt from originally scanned text in Chinese

On the Mac platform, we have two software packages with OCR capabilities. The first is ABBYY FineReader, which offers OCR in 171 languages; unfortunately, Chinese is not available. The second is Adobe Acrobat. Here's the result of running OCR in Acrobat on the same excerpt:

Excerpt showing OCR of same text using Adobe Acrobat. 5 characters recognized inaccurately (out of 85) are highlighted.

The red highlights mark the characters that Acrobat recognized incorrectly: 5 out of 85, for an accuracy of about 94%.
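The accuracy figure here is simply the share of characters that were recognized correctly. As a minimal sketch of that arithmetic in Python (the function name `char_accuracy` is made up for illustration and is not part of any OCR tool):

```python
def char_accuracy(total_chars: int, errors: int) -> float:
    """Return the percentage of characters recognized correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    if not 0 <= errors <= total_chars:
        raise ValueError("errors must be between 0 and total_chars")
    return 100.0 * (total_chars - errors) / total_chars


# Acrobat on the excerpt: 5 characters wrong out of 85
acrobat_excerpt = char_accuracy(85, 5)
print(f"{acrobat_excerpt:.1f}%")  # prints "94.1%"
```

The error count itself still has to come from a manual comparison of the OCR output against the original scan, as was done for this excerpt.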

On the Windows platform, Adobe Acrobat is also an option for OCR. Curious as to whether the Mac and Windows versions of the software use the same OCR engine, I ran the OCR again in Adobe Acrobat for Windows. The results were identical to those from the Mac version.

The other OCR option available in the LRC on Windows is OmniPage 17. Here are the results of OCR on the same text excerpt using OmniPage:

In other words, OmniPage recognized this small sample with no errors, for 100% accuracy. In an additional comparison of Adobe Acrobat and OmniPage on Windows using a larger sample of text, OmniPage continued to outperform Adobe Acrobat, but the gap in accuracy narrowed. Out of 698 characters, Adobe Acrobat made 6 errors, for an accuracy of 99.1%. OmniPage made 4 errors, for an accuracy of 99.4%. Interestingly, the two engines struggled with different characters: Acrobat correctly recognized 3 characters that OmniPage failed on, while OmniPage succeeded with 5 characters that caused trouble for Acrobat.
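The numbers for the larger sample can be checked with a few lines of Python. This is only a sketch of the arithmetic using the counts reported above; the variable names are my own, and the "missed by both" figure is derived from those counts rather than something I tallied separately.

```python
TOTAL_CHARS = 698

acrobat_errors = 6    # total errors made by Adobe Acrobat
omnipage_errors = 4   # total errors made by OmniPage
acrobat_only = 5      # Acrobat got these wrong, OmniPage got them right
omnipage_only = 3     # OmniPage got these wrong, Acrobat got them right

# Errors not unique to one engine must have been missed by both.
shared_from_acrobat = acrobat_errors - acrobat_only
shared_from_omnipage = omnipage_errors - omnipage_only
if shared_from_acrobat != shared_from_omnipage:
    raise ValueError("the reported counts are inconsistent")
shared = shared_from_acrobat

acrobat_accuracy = 100 * (TOTAL_CHARS - acrobat_errors) / TOTAL_CHARS
omnipage_accuracy = 100 * (TOTAL_CHARS - omnipage_errors) / TOTAL_CHARS

print(f"Acrobat:  {acrobat_accuracy:.1f}%")   # prints "Acrobat:  99.1%"
print(f"OmniPage: {omnipage_accuracy:.1f}%")  # prints "OmniPage: 99.4%"
print(f"Missed by both engines: {shared}")    # prints "Missed by both engines: 1"
```

The counts are consistent with each other: each engine's total is its unique errors plus a single character that both engines got wrong.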

A more significant difference between the two programs is the interface and the degree of user control over the OCR process. Adobe Acrobat is very simple to use but gives the user fewer options. To run OCR in Acrobat, with a PDF file open, go to Tools -> Recognize Text -> In this file. Verify that the correct language is shown in the dialogue box (or click Edit Settings to change the language if needed), and then click OK to perform the OCR.

In OmniPage, on the other hand, running OCR involves several steps. The user can manually adjust which areas of each page are processed for OCR. For example, if the scan includes headers (such as running titles) or footers (such as page numbers), the user can choose whether or not to include those in the OCR. Depending on the scan quality, the pages might also include dark areas around the margins, smudges, or handwriting marked on the original, all of which could interfere with OCR but can be excluded in OmniPage. Finally, OmniPage gives an option for reviewing OCR "suspects": characters or words that it identifies as likely to be incorrect. Reviewing and correcting these during the OCR process should lead to an even higher overall accuracy rate.

(Note: Acrobat does also have a "suspects" function that can be used after performing OCR; however, it seems to be identifying all areas where OCR was performed, i.e., "suspected text", rather than suspected inaccurate characters. For each "suspect" in Acrobat, the user can review and edit the text or indicate that it is not text. Thus it could be used for ignoring areas of the page that should not have been OCRed but doesn't seem to help pinpoint characters that may have been recognized inaccurately.)

In summary, for most users doing occasional OCR, Adobe Acrobat is probably a good choice for its ease of use, especially as our school has a site license, making Acrobat readily available on any college computer. However, for users working on larger projects involving a great deal of OCR, it may be worthwhile to come to the LRC to use OmniPage. The accuracy seems to be slightly better, but more importantly, the user control over the OCR process may be of significant benefit, particularly if the text is to be output to another format, such as a Microsoft Word or text document.

No matter which tool is used, the best OCR results start with a high-quality scan. Our AccessAbility office recommends scanning using the "black-and-white" setting (not grayscale or color) and at a resolution of at least 300 ppi.

