Handwriting Conversion to Text Files Using an SVM Classifier

Abstract
This project presents a procedure for converting scanned documents, either handwritten or typescript, into a digital format (a text file) that can be used on any personal computer. The conversion preserves the original document structure, including lines, words and basic punctuation. The procedure uses a Support Vector Machine (SVM) classifier that is adapted to the font being classified. One advantage of the SVM is its ability to handle non-linear classification problems, in which the classes cannot be separated by linear methods. The results show that handwriting can be classified with an error rate of 2-3%, and typescript with an error rate of 0.07%.

The problem
There is often a need to convert hard-copy documents (books, handwritten notes, etc.) into a digital format. Manual conversion is time-consuming and prone to human error. An automatic conversion method can save a considerable amount of time and is useful for many applications.
The main difficulty for an automatic solution is the wide variability of human handwriting. Handwriting changes with the writer’s mood, environment, writing implement and more. Typescript can also vary, due to differences in printing angle and ink absorption.

The solution
The solution is based on an SVM (Support Vector Machine) classifier that is built from each writer’s writing samples. The classifier is created for each writer according to the following steps:

  • Gathering a sufficient number of examples of each character
  • Using a GUI to classify the examples, attach a label to each example, extract its graphical properties and collect information about the spaces between letters and words
  • Training a classifier on the collected database. The training parameters and the optimal properties (features) used for training were selected using the 5-fold cross-validation error and a feature selection algorithm
  • Classifying new handwritten documents using the final classifier
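The selection step in the third item can be illustrated with a minimal sketch of 5-fold cross-validation error used to compare candidate feature subsets. This is not the project's code: to stay self-contained it uses a simple nearest-centroid classifier as a stand-in for the SVM, and all function names and the data layout (lists of numeric feature vectors with one label each) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_nearest_centroid(features, labels):
    """Stand-in classifier (NOT an SVM): one mean feature vector per label."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_nearest_centroid(model, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def cross_validation_error(features, labels, k=5, seed=0):
    """Fraction of held-out samples misclassified, pooled over k folds."""
    errors = 0
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held]
        model = train_nearest_centroid(
            [features[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        errors += sum(
            predict_nearest_centroid(model, features[i]) != labels[i]
            for i in held_out
        )
    return errors / len(features)


def select_best_feature_subset(features, labels, candidate_subsets, k=5):
    """Pick the candidate feature subset with the lowest k-fold CV error."""
    def error_for(subset):
        projected = [[x[j] for j in subset] for x in features]
        return cross_validation_error(projected, labels, k)
    return min(candidate_subsets, key=error_for)
```

In a real pipeline the stand-in classifier would be replaced by the SVM, and the same cross-validation error could also be used to compare training parameters, not only feature subsets.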

Figure 1 – Block diagram of the classification flow of a new document

Example
The following example demonstrates the conversion of a scanned handwritten document into a text file that can be opened in any text editor.
Classification errors are marked in red.

Figure 2 – Scanned image of the handwritten document
Figure 3 – Output of the conversion-to-text-file flow
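
As a rough illustration of how classified characters might be reassembled into text while preserving lines and words, the sketch below inserts a space wherever the horizontal gap between neighbouring characters is wide enough. The input format (a label with left and right pixel coordinates per character), the function names and the midpoint rule for the gap threshold are all assumptions for this example; the project's actual use of the collected spacing information is not described here.

```python
def estimate_word_gap(letter_gaps, word_gaps):
    """Hypothetical threshold: midpoint between the mean gap inside words
    and the mean gap between words, both measured from labeled examples."""
    mean_letter = sum(letter_gaps) / len(letter_gaps)
    mean_word = sum(word_gaps) / len(word_gaps)
    return (mean_letter + mean_word) / 2


def line_to_text(chars, word_gap):
    """Join the classified characters of one line, adding a space at wide gaps.

    chars: list of (label, left_x, right_x) tuples in pixels (assumed format).
    word_gap: minimum horizontal gap, in pixels, treated as a word break.
    """
    ordered = sorted(chars, key=lambda c: c[1])  # left-to-right reading order
    pieces = []
    previous_right = None
    for label, left, right in ordered:
        if previous_right is not None and left - previous_right >= word_gap:
            pieces.append(" ")
        pieces.append(label)
        previous_right = right
    return "".join(pieces)


def document_to_text(lines, word_gap):
    """Convert a list of lines (each a list of characters) to output text."""
    return "\n".join(line_to_text(line, word_gap) for line in lines)
```

A per-writer threshold matters because letter and word spacing differ between writers, which is consistent with the spacing information being collected for each writer during labeling.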

Results and Conclusions
This project shows that scanned typescript can be classified almost perfectly, with an error rate of 0.07%.
The classification error rate for scanned handwriting is 2-2.9%.
Features must be selected specifically for each writer’s characters in order to reduce runtime and reach optimal results.
Using graphical features instead of the 200×200 pixels of each character simplifies the classification problem significantly.
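
To make the last point concrete, the sketch below reduces a binary character image to four numbers instead of 200×200 = 40,000 pixel values. The specific features (bounding-box aspect ratio, ink density and normalized ink centroid) are illustrative assumptions, not the features selected in this project, and the image format (a list of rows of 0/1 values) is likewise assumed.

```python
def graphical_features(image):
    """Compute four illustrative features from a binary character image.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    Returns [aspect_ratio, ink_density, centroid_x, centroid_y], where the
    last three are measured inside the ink bounding box and lie in [0, 1].
    """
    ink = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]
    if not ink:
        return [0.0, 0.0, 0.0, 0.0]  # blank image: no ink to measure
    rows = [r for r, _ in ink]
    cols = [c for _, c in ink]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    height = bottom - top + 1
    width = right - left + 1
    aspect_ratio = width / height
    ink_density = len(ink) / (width * height)
    # Pixel centres sit at +0.5, so a symmetric shape gives a centroid of 0.5.
    centroid_x = (sum(cols) / len(ink) - left + 0.5) / width
    centroid_y = (sum(rows) / len(ink) - top + 0.5) / height
    return [aspect_ratio, ink_density, centroid_x, centroid_y]
```

For example, a solid vertical bar 20 pixels wide and 100 pixels tall yields an aspect ratio of 0.2, an ink density of 1.0 and a centroid at the centre of its bounding box, regardless of where the bar sits in the 200×200 frame.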

Acknowledgment
We are grateful to our project supervisor, Dori Peleg, for his help and guidance throughout this work.
Special thanks to the people behind OSU SVM for creating an easy-to-use interface to their implementation of the SVM classification algorithm.
We are also grateful to the Ollendorff Minerva Center, which supported this project.