Form Processing: Automatic Recognition of ID Numbers in Scanned Forms

Form processing is a process whereby information entered into data fields is converted into electronic form.

Introduction: What is form processing?
Form processing is a process whereby information entered into data fields is converted into electronic form:

  • Entered data are “captured” from their respective fields
  • Forms themselves are digitized and saved as images

In most cases, form processing is considered complete when the data from all the forms have been captured, verified, and saved into a database. It is also essential that the integrity of the captured data be preserved.
Forms can be processed manually or with form processing software. The advantages of processing forms with software are clear.

The aim of the project
In this project we were asked to develop an automatic solution that retrieves, on request, the scanned exam form of a particular student from a database of such forms. The form is identified by recognizing the student’s personal ID number on its title page. On each form, the student marks this ID number by checking the appropriate digits in a table; see the examples below.

Problems that we had to overcome
Forms filled in by hand may present many obstacles, such as a skewed image, unclear marks, and deleted marks.
Once these obstacles are overcome, we can segment and label the scanned form and extract the student’s ID number from the marked squares.
Thus, our project was divided into two main parts:

  • Correcting the image and cropping the relevant check box area
  • Segmenting the check box area and extracting the ID number

The solution
General scheme:

1

  • The form is scanned into a standard image format (JPG, BMP)
  • The image is read and roughly cropped
  • An algorithm is applied to estimate the skew of the image
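The source does not say which skew-estimation algorithm was used. As one plausible illustration, the following minimal sketch uses a projection-profile search: each candidate angle is undone on the dark-pixel coordinates, and the angle that concentrates the pixels into the fewest rows is taken as the skew. The function names, the candidate angle range, and the synthetic ruled lines are assumptions made for this example, and Python is used only for illustration (the project itself was written in MATLAB).

```python
import math


def estimate_skew_deg(dark_pixels, candidate_angles):
    """Return the candidate angle (in degrees) that best explains the skew.

    dark_pixels: iterable of (x, y) coordinates of dark pixels.
    For each candidate, the points are rotated back by that angle and
    binned by row. Ruled lines that become horizontal pile up in a few
    rows, which maximizes the sum of squared row counts.
    """
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        t = math.radians(angle)
        sin_t, cos_t = math.sin(t), math.cos(t)
        row_counts = {}
        for x, y in dark_pixels:
            row = round(y * cos_t - x * sin_t)  # y after rotating by -angle
            row_counts[row] = row_counts.get(row, 0) + 1
        score = sum(n * n for n in row_counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(skew_deg, n_lines=3, spacing=20, length=200):
    """Hypothetical test data: horizontal ruled lines rotated by skew_deg."""
    t = math.radians(skew_deg)
    points = []
    for k in range(n_lines):
        y = k * spacing
        for x in range(length):
            points.append((x * math.cos(t) - y * math.sin(t),
                           x * math.sin(t) + y * math.cos(t)))
    return points


skewed = synthetic_lines(3)
estimated = estimate_skew_deg(skewed, range(-5, 6))  # whole degrees, -5..5
print(estimated)
```

A real form would need a finer angle grid and dark-pixel coordinates taken from the thresholded scan rather than synthetic lines.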

2

  • The skew is corrected using a correlation algorithm, so that the exact location of the check boxes can be detected
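The exact correlation algorithm is not described in the source. The sketch below shows one common variant, normalized cross-correlation template matching, under the assumption that a small reference pattern (here a hypothetical 3 × 3 box outline) is slid over the page and the best-matching position is taken as the location. The names `ncc` and `locate` and the tiny test image are invented for illustration.

```python
import math


def ncc(image, template, top, left):
    """Normalized cross-correlation of template with the patch at (top, left)."""
    h, w = len(template), len(template[0])
    patch = [image[top + r][left + c] for r in range(h) for c in range(w)]
    tmpl = [template[r][c] for r in range(h) for c in range(w)]
    mean_p = sum(patch) / len(patch)
    mean_t = sum(tmpl) / len(tmpl)
    dp = [p - mean_p for p in patch]
    dt = [t - mean_t for t in tmpl]
    denom = math.sqrt(sum(p * p for p in dp) * sum(t * t for t in dt))
    if denom == 0:
        return 0.0  # flat patch or flat template: correlation is undefined
    return sum(p * t for p, t in zip(dp, dt)) / denom


def locate(image, template):
    """Return ((top, left), score) of the best match of template in image."""
    h, w = len(template), len(template[0])
    best_score, best_pos = -2.0, None
    for top in range(len(image) - h + 1):
        for left in range(len(image[0]) - w + 1):
            score = ncc(image, template, top, left)
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos, best_score


# Hypothetical data: an empty 8 x 8 page with one box outline at row 2, column 4.
box = [[1, 1, 1],
       [1, 0, 1],
       [1, 1, 1]]
page = [[0] * 8 for _ in range(8)]
for r in range(3):
    for c in range(3):
        page[2 + r][4 + c] = box[r][c]

position, score = locate(page, box)
print(position, round(score, 3))
```

Normalizing by the patch and template energy makes the score insensitive to overall brightness, which may matter when both black-and-white and color scans are processed.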

3

  • The check box area is segmented and labeled
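Segmentation and labeling can be illustrated with connected-component labeling, which assigns one integer label to every connected group of dark pixels. This is a minimal sketch and not necessarily the procedure used in the project: 4-connectivity, the column-by-column scan order, and the toy image are all assumptions.

```python
from collections import deque


def label_components(binary):
    """Label 4-connected groups of nonzero pixels.

    Returns (labels, count), where labels has the same shape as binary,
    background pixels are 0, and components are numbered 1..count in the
    order they are first met when scanning column by column.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for c in range(cols):
        for r in range(rows):
            if not binary[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy image: four 2 x 2 blobs standing in for check boxes.
demo = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
labels, count = label_components(demo)
print(count)
```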

4

  • The marked squares are detected and the ID number is extracted; each mark has a value in the range 1-90
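The source states only that each mark has a value in the range 1-90. One layout consistent with that range, assumed here purely for illustration, is 9 ID positions with 10 check boxes each (digits 0-9), numbered position by position. Under that assumption, the following minimal sketch picks the darkest box for each position and decodes the ID. The function name, the fill-ratio input, the threshold, and the example ID are all hypothetical.

```python
def decode_id(fill, n_positions=9, n_digits=10, min_fill=0.3):
    """Decode an ID string from per-box fill ratios.

    fill[k - 1] is the fraction of dark pixels in the box labeled k
    (k = 1..90). Assumed numbering: boxes 1-10 are digits 0-9 of the
    first ID position, boxes 11-20 belong to the second, and so on.
    """
    if len(fill) != n_positions * n_digits:
        raise ValueError("expected one fill ratio per check box")
    digits = []
    for pos in range(n_positions):
        column = fill[pos * n_digits:(pos + 1) * n_digits]
        digit = max(range(n_digits), key=lambda d: column[d])
        if column[digit] < min_fill:
            raise ValueError(f"no clear mark for ID position {pos + 1}")
        digits.append(str(digit))
    return "".join(digits)


# Made-up example: every box is nearly empty except one per position.
fill = [0.05] * 90
for pos, digit in enumerate("012345678"):
    fill[pos * 10 + int(digit)] = 0.8

student_id = decode_id(fill)
print(student_id)
```

Choosing the darkest box per position, rather than every box above a threshold, is one simple way to tolerate a faint stray mark next to the intended one; whether the project handled deleted marks this way is not stated in the source.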


Tools
The project was developed in MATLAB version 7.1 on a PC platform.
The forms were scanned using the scanner that is used for scanning exam forms at the computer administration center of the EE faculty.

 

Conclusions
We developed and implemented a program which extracts the ID number from scanned exam forms, as used in the EE faculty.
The program was tested on 100 exam forms (both B&W and color forms, including many problematic forms with the obstacles described above). On these forms, the program achieved 100% accuracy. We believe that our program is ready for regular use with scanned forms of the type that we worked with.
As a follow-up project, a user-friendly graphical interface should be built that integrates scanning, automatic identification of the student’s ID, and encoding of the ID in the filename of the scanned exam.

 

Acknowledgment
We are grateful to our project supervisor Johanan Erez for his help and guidance throughout the work, and to Shula Fine from the EE computer administration center, who gave technical support with the scanning of the forms. We are also grateful to the Ollendorff Minerva Center Fund for supporting this project.