Converting Scanned Document to Text File by Using SVM Classifier

Abstract
This project presents a procedure for converting scanned documents into a digital format (a text file) that can be used on any personal computer. The conversion preserves the original document structure, including lines, words, and basic punctuation. The conversion procedure uses a Support Vector Machine (SVM) classifier.

This project extends an earlier project by adding features that let the converter handle documents scanned at an angle, and by adding support for additional characters.

The problem
There is often a need to convert hard-copy documents into a digital format. Manual conversion is time-consuming and prone to human error. An automatic conversion method can save a considerable amount of time and is useful in many applications.

The solution
The solution is based on an SVM (Support Vector Machine) classifier. The following steps are carried out:

Gathering a sufficient number of examples for each character
Using a GUI to classify the examples, attach a label to each example, extract its graphical properties, and collect information about the spaces between letters and words
Training a classifier on the collected database. The training parameters and the optimal properties (features) used for training were selected using the 5-fold cross-validation error and a feature selection algorithm
Classifying new documents using the final classifier
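
The selection step above relies on a 5-fold cross-validation error. The sketch below shows how such an error estimate can be computed; it is an illustration, not the project's code. All function names are hypothetical, the toy data is invented, and a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_error(train_fn, features, labels, k=5, seed=0):
    """Average fraction of misclassified held-out samples over k folds.

    train_fn(train_features, train_labels) must return a function that
    maps one feature vector to a predicted label.
    """
    fold_errors = []
    for held_out in k_fold_indices(len(features), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held_out_set]
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        wrong = sum(predict(features[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k

def train_nearest_centroid(train_features, train_labels):
    """Stand-in classifier (NOT an SVM): predict the class with the closest mean."""
    sums, counts = {}, {}
    for vec, label in zip(train_features, train_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(vec):
        # Squared Euclidean distance to each class centroid
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2
                                       for a, b in zip(vec, centroids[lab])))
    return predict

# Invented toy data: two well-separated "characters" with two features each
toy_features = ([[0.0 + 0.01 * i, 1.0] for i in range(10)] +
                [[5.0 + 0.01 * i, 0.0] for i in range(10)])
toy_labels = ["a"] * 10 + ["b"] * 10
error = cross_val_error(train_nearest_centroid, toy_features, toy_labels, k=5)
```

The same function can score each candidate parameter setting or feature subset, after which the candidate with the lowest cross-validation error is kept. To evaluate a real SVM, train_fn would be replaced by a wrapper around an SVM trainer.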

The classifier is created according to the following diagram:

Example
The following example demonstrates the conversion of a scanned, rotated document to a text file that can be opened in any text editor. Classification errors are marked in red.
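
The project summary does not say how the scan angle is estimated. One common approach, shown here purely as an assumption, is a projection-profile search: try a range of candidate angles, undo each one, and keep the angle for which the ink pixels collapse into the fewest, densest rows. The function names, the angle range, and the synthetic "text lines" below are all hypothetical.

```python
import math

def rotate(points, angle_deg):
    """Rotate (x, y) points counterclockwise by angle_deg about the origin."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]

def row_profile_score(points, angle_deg):
    """Sum of squared row counts after undoing a rotation of angle_deg.

    The score is largest when the ink pixels line up in a few dense rows,
    which is what straight, horizontal text lines look like.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    counts = {}
    for x, y in points:
        # Row index of the point once the candidate rotation is undone
        row = round(-x * sin_a + y * cos_a)
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest row profile."""
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda angle: row_profile_score(points, angle))

# Synthetic page: three horizontal "text lines" of ink pixels, skewed by 5 degrees
text_lines = [(float(x), float(y)) for y in (0, 20, 40) for x in range(200)]
skewed = rotate(text_lines, 5.0)
estimated = estimate_skew(skewed)

# Undo the estimated skew before segmenting and classifying characters
deskewed = rotate(skewed, -estimated)
```

The sign of the angle follows the usual mathematical convention for (x, y) coordinates; in image coordinates, where y grows downward, the direction is mirrored. A real scan would also need a finer angle step and some tolerance for noise.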

Conclusions

During this project I fulfilled my three project goals: learning and using an SVM-based classifier, adding a rotation tool to the algorithm, and adding support for additional characters.

This project shows that adding 11 characters to the 29 characters of the previous project raises the probability of a classification error from 2.7 to 5.6 percent.

Rotating the text before scanning also raises the error probability. For example, a five-degree rotation raises the error probability from 2.7 to 5.8 percent.

In future applications, secondary classifiers will be needed to reduce the probability of a classification error.

Acknowledgment

I thank my project supervisor, Dori Peleg, for his help, patience, and understanding.

Special thanks to my wife and kids.

I am also grateful to the Ollendorf Minerva Center Fund for supporting this project.