Real Time Gesture Recognition

Abstract

The goal of this project is to develop a computer program implementing gesture recognition in real time.
At any time, the user can present his hand in front of a video camera linked to a computer, making a specific gesture – showing zero to five fingers.
The computer program has to process the video stream captured by the camera, identify the user’s hand, separate it from the other elements in the picture, count the fingers shown by the user, and present the result to the user as fast as possible (within real-time requirements).

Identification and isolation of the hand
Problem analysis

The first problem to deal with is how, by means of image processing and given a set of representative images, to analyze each image and find the user’s hand in it.
Since the user will not always place his hand in the same area of the picture, the solution cannot be based on cropping the picture around a fixed target area.
Here are some examples of the same sign made in different areas of the picture:

Figure 1 – Same sign done in different areas of the picture


Moreover, since the user will not always hold his hand at the same distance from the video camera, the solution cannot be based on size evaluation by pixel counting (pixel counting is used only to disqualify candidates of exceptional size).
Here are a few examples of the same sign made at different distances from the video camera:
Figure 2 – Same sign done in different distances from the video camera

Therefore, following a “minimum-possible constraints” approach, the chosen solution is color characterization.

Background removal

According to the requirements, the video camera is fixed in place and is not supposed to move, which implies that the background is more or less permanent and can therefore be removed.
By removing the background, the given problem is reduced to identifying and isolating the user’s hand in a picture containing only the hand and perhaps other parts of the user’s body.
Here are some images describing the background removal process:

Figure 3 – Removing the background
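
As an illustrative aside, here is a minimal sketch of such a background removal step, assuming 8-bit grayscale frames stored as flat arrays; the function name and the threshold value are hypothetical placeholders, not taken from the project code:

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Mark a pixel as foreground when it differs from the stored
    // background frame by more than a fixed threshold.
    std::vector<uint8_t> removeBackground(const std::vector<uint8_t>& frame,
                                          const std::vector<uint8_t>& background,
                                          int threshold = 30)  // hypothetical value
    {
        std::vector<uint8_t> mask(frame.size(), 0);
        for (std::size_t i = 0; i < frame.size(); ++i)
            if (std::abs(int(frame[i]) - int(background[i])) > threshold)
                mask[i] = 255;  // pixel kept: part of the user, not the background
        return mask;
    }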

Color characterization

Now that we need to identify and isolate the user’s hand in a picture containing only the hand and perhaps some other parts of the user’s body, we use a color characterization method based on both the RGB and HSV color models.

Figure 4 – RGB and HSV color models

The final isolation of the user’s hand from the picture is based on an average-hue criterion, because of the illumination setup around the video camera:
a light source positioned right above the camera brightens the user’s hand, which is very close to the light source when exhibited.
Figure 5 – The video camera and the light source above it

Here are some images describing the color characterization process:
Figure 6 – Color characterization
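
As an illustration, here is a minimal sketch of a hue-based pixel test of the kind described above, using the standard RGB-to-HSV hue conversion; the skin-tone hue band below is a hypothetical placeholder, since the actual criterion is calibrated against the average hue under the project’s illumination setup:

    #include <algorithm>

    // Standard RGB -> HSV hue conversion; r, g, b in [0, 1], hue in [0, 360).
    double rgbToHue(double r, double g, double b)
    {
        double mx = std::max({r, g, b});
        double mn = std::min({r, g, b});
        double d  = mx - mn;
        if (d == 0.0) return 0.0;          // gray pixel: hue undefined, return 0
        double h;
        if (mx == r)      h = 60.0 * ((g - b) / d);
        else if (mx == g) h = 60.0 * ((b - r) / d + 2.0);
        else              h = 60.0 * ((r - g) / d + 4.0);
        return (h < 0.0) ? h + 360.0 : h;
    }

    // Keep a pixel if its hue falls inside a reddish skin-tone band
    // (placeholder limits; the real criterion uses the average hue).
    bool isSkinHue(double r, double g, double b)
    {
        double h = rgbToHue(r, g, b);
        return h < 40.0 || h > 340.0;
    }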

Finger count

Problem analysis

Now that we have a binary picture containing only the pixels of the user’s hand, the problem to solve is how, again by means of image processing alone, to analyze the picture and count the user’s fingers exposed in it.
As in previous stages, since the user will not always hold his hand at the same distance from the video camera, the solution cannot be based on finger-size evaluation by pixel counting (pixel counting is used only to disqualify candidates of exceptional size).
Therefore, the chosen solution involves size evaluation based on the length-width ratio.
Moreover, to perform such a size evaluation we need an algorithm that can assess the width and length of each finger, and for that it first has to identify and isolate the fingers exhibited in the input picture.
For these reasons, we developed a suitable algorithm based on pixel connectivity.

Figure 7 – 8-connected pixels

Perimeter determination

The pixel-connectivity-based algorithm is applied to a perimeter picture of the hand.
For that reason, the finger count process begins with creating a perimeter picture of the hand:
Figure 8 – Perimeter determination
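
A minimal sketch of such a perimeter determination, assuming the isolated hand is given as a binary mask (names are illustrative): a foreground pixel is a perimeter pixel when it lies on the image border or has at least one background 4-neighbor.

    #include <vector>

    using Binary = std::vector<std::vector<bool>>;

    Binary perimeterOf(const Binary& mask)
    {
        const int H = int(mask.size());
        const int W = int(mask[0].size());
        Binary per(H, std::vector<bool>(W, false));
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                if (!mask[y][x]) continue;  // background pixel: skip
                bool onBorder = (x == 0 || y == 0 || x == W - 1 || y == H - 1);
                // A hand pixel with a background 4-neighbor is on the perimeter.
                if (onBorder || !mask[y - 1][x] || !mask[y + 1][x] ||
                                !mask[y][x - 1] || !mask[y][x + 1])
                    per[y][x] = true;
            }
        return per;
    }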

Identification of candidate fingers using pixel connectivity

This is the main part of the finger count process.
The main idea is to scan the hand-perimeter binary picture in a way that finds and collects the different groups of connected pixels that form the set of finger candidates in the picture.
The scan is executed for each y value, from the top of the picture to its bottom.
Each scanline is traversed in both x directions (from the left of the picture to its right, and back).
For each new perimeter pixel found during the scan, the program checks whether it belongs to an existing group of pixels, i.e. an existing finger candidate.
This check is carried out by testing whether the current pixel is 8-connected to an edge pixel of a previously collected group (or to the pixel next to the edge pixel in that group).
If connected, the pixel is added to the existing group as its new edge pixel. If not connected, there are two possibilities: if this happens on the first pass (of the two passes the scan makes over the current scanline, one per direction), the pixel is inserted into a waiting list to await its turn in the second pass;
if it happens on the second (and last) pass, the pixel “founds” a new group of pixels, to be considered later as a potential finger candidate.

Figure 9 – Scan Process

The scheme in Figure 9 illustrates the above scanning process on a hand-perimeter binary picture of a user’s hand exhibiting two fingers (the thumb and index fingers):
the black arrow marks the y value of the current scanline (the current scan direction is left to right, on its first pass);
the red arrow marks the x value of the currently scanned pixel;
the differently colored pixel groups mark previously scanned (and therefore already existing) groups forming three different finger candidates (green, yellow and blue, discovered in that order);
the two pixels marked with a red question mark were previously inserted into the waiting list, to await the second pass over the current scanline, because they were found unconnected to any existing group during the first pass;
the red pixel is the currently scanned pixel, which will be added to the existing green group because it is 8-connected to it and because groups are tried in order of discovery (an arbitrary choice).
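
To make the scan concrete, here is a simplified sketch of the two-pass row scan; all names are illustrative, and the waiting list is kept implicit (a pixel left unplaced on the first pass is simply revisited on the second), which slightly simplifies the original formulation:

    #include <cstdlib>
    #include <vector>

    struct Pixel { int x, y; };
    // Pixels in discovery order; the last element is the group's edge pixel.
    using Group  = std::vector<Pixel>;
    using Binary = std::vector<std::vector<bool>>;

    bool connected8(const Pixel& a, const Pixel& b)
    {
        return (a.x != b.x || a.y != b.y) &&
               std::abs(a.x - b.x) <= 1 && std::abs(a.y - b.y) <= 1;
    }

    // A pixel joins a group if it touches the group's edge pixel
    // or the pixel next to the edge pixel.
    bool joins(const Group& g, const Pixel& p)
    {
        if (connected8(g.back(), p)) return true;
        return g.size() > 1 && connected8(g[g.size() - 2], p);
    }

    std::vector<Group> collectCandidates(const Binary& perimeter)
    {
        const int H = int(perimeter.size());
        const int W = int(perimeter[0].size());
        std::vector<Group> groups;                       // oldest group first
        Binary placed(H, std::vector<bool>(W, false));
        for (int y = 0; y < H; ++y)
            for (int pass = 0; pass < 2; ++pass)         // 0: left->right, 1: right->left
                for (int i = 0; i < W; ++i) {
                    int x = (pass == 0) ? i : W - 1 - i;
                    if (!perimeter[y][x] || placed[y][x]) continue;
                    Pixel p{x, y};
                    bool joined = false;
                    for (Group& g : groups)              // tried in discovery order
                        if (joins(g, p)) {
                            g.push_back(p);              // p becomes the new edge pixel
                            placed[y][x] = true;
                            joined = true;
                            break;
                        }
                    if (!joined && pass == 1) {          // unmatched on the last pass:
                        groups.push_back(Group{p});      // this pixel founds a new group
                        placed[y][x] = true;
                    }
                    // Unmatched on the first pass: left for the second pass
                    // (this plays the role of the waiting list).
                }
        return groups;
    }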
Here are images describing the input and the output of this process:

Figure 10 – Scan input and output images

Pixel group normalization

The results obtained in the previous stage still do not allow size evaluation based on the length-width ratio, because some of the collected pixel groups, which are supposed to be potential finger candidates, may contain redundant pixels from the hand’s perimeter:

Figure 11 – Finger candidates with redundant hand perimeter pixels

In order to assess each finger candidate’s length and width (for size evaluation), we have to normalize these pixel groups by getting rid of their redundant pixels.
The algorithm we developed specifically for this purpose enables us to do so.
To understand how the algorithm works, consider the scheme described in Figure 12:
the highlighted pixels form one of the finger candidate groups, and it contains redundant perimeter pixels (from the left edge of the hand) that we want to dispose of.
First, we calculate the distance between the last pixel and the first pixel of the group (denoted by a).
Then we move the pointer to the last pixel’s previous neighbor in the group, calculate its distance from the first pixel, and so on.
This process stops when the calculated distance stops diminishing and starts to grow again (denoted by c).
We then move the last-pixel pointer back to the pixel at which the distance last diminished (denoted by b), determine this pixel as the group’s last pixel, and discard all the pixels after it.
The result is a normalized group of pixels with normal length and width dimensions.
Figure 12 – Pixel group normalization process
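
A minimal sketch of this normalization step, under the same illustrative types as above: walk back from the group’s last pixel while the distance to the first pixel keeps diminishing, and cut the group at the minimum-distance pixel (pixel b in Figure 12).

    #include <cmath>
    #include <vector>

    struct Pixel { int x, y; };
    using Group = std::vector<Pixel>;

    double dist(const Pixel& a, const Pixel& b)
    {
        double dx = a.x - b.x, dy = a.y - b.y;
        return std::sqrt(dx * dx + dy * dy);
    }

    // Trim redundant trailing perimeter pixels from a finger candidate.
    void normalizeGroup(Group& g)
    {
        if (g.size() < 3) return;
        const Pixel first = g.front();
        std::size_t last  = g.size() - 1;
        double d = dist(g[last], first);                // distance "a" in Figure 12
        while (last > 0 && dist(g[last - 1], first) < d) {
            --last;                                     // distance still diminishing
            d = dist(g[last], first);
        }
        g.resize(last + 1);                             // g[last] is pixel "b"; drop the rest
    }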

Here are images describing the input and the output of this process:
Figure 13 – Pixel group normalization example

Size evaluation using length-width ratio

The result of the previous stage is a list of normalized pixel groups, which are the potential finger candidates in the picture.
First, we disqualify groups of exceptional size (fewer pixels than a minimum count measured beforehand).
Then we calculate each finger candidate’s length and width (the maximum longitudinal and lateral distances between two pixels of the group) and filter out candidates whose length-width ratio is unsuitable (relative to the normal ratio measured beforehand).
Finally, we count the remaining groups in the list and present the result to the user.
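
A minimal sketch of this final screening, with hypothetical placeholder thresholds (the project measures the actual minimum pixel count and normal ratio beforehand); length and width are approximated here by the group’s bounding-box extents:

    #include <algorithm>
    #include <vector>

    struct Pixel { int x, y; };
    using Group = std::vector<Pixel>;

    int countFingers(const std::vector<Group>& candidates,
                     std::size_t minPixels = 20,        // placeholder size cutoff
                     double minRatio = 1.5,             // placeholder ratio band
                     double maxRatio = 8.0)
    {
        int fingers = 0;
        for (const Group& g : candidates) {
            if (g.size() < minPixels) continue;         // exceptional size: disqualify
            int minX = g[0].x, maxX = g[0].x, minY = g[0].y, maxY = g[0].y;
            for (const Pixel& p : g) {
                minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
                minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
            }
            double length = maxY - minY + 1;            // longitudinal extent
            double width  = maxX - minX + 1;            // lateral extent
            double ratio  = length / width;
            if (ratio >= minRatio && ratio <= maxRatio) // suitable length-width ratio
                ++fingers;
        }
        return fingers;
    }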

Software tools
Figure 14 – Project main stages

The work in this project was done in two main stages.
The first stage was developing an image-processing algorithm.
The development of the algorithm was done in the MATLAB 6.5 environment, which provides a large variety of built-in image-processing functions.
At this stage there were no real-time constraints, so it was possible to work on typical, representative movies previously chosen and saved using ASUS LIVE (for more information about ASUS LIVE see Appendix 1 of the project report).
The second stage was translating the MATLAB program into a C++ program in order to meet the real-time requirements.
Although MATLAB can translate MATLAB code into C code automatically, previous projects encountered problems using this method with image-processing code, and because the resulting C code is practically impossible to debug, we preferred to translate the MATLAB code into C++ ourselves.
Some of the translated functions needed a second and even a third optimized version in order to meet the real-time requirements of the program.
The counting result is not entirely stable and is subject to small variations caused by sampling variability and different kinds of noise in the system.
To overcome such noise and improve the system’s reliability, we present the majority count of the last five samples.
This addition, feasible in a real-time application, introduces a small delay in presenting the result but improves its reliability immensely.
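
A minimal sketch of this smoothing step (illustrative names; counts range over 0–5 fingers):

    #include <algorithm>
    #include <deque>

    // Report the majority count of the last five samples to suppress
    // frame-to-frame noise in the finger count.
    class MajorityFilter {
        std::deque<int> window;                  // most recent raw counts
    public:
        int add(int rawCount)
        {
            window.push_back(rawCount);
            if (window.size() > 5) window.pop_front();
            int best = rawCount, bestVotes = 0;
            for (int c = 0; c <= 5; ++c) {       // candidate finger counts
                int votes = int(std::count(window.begin(), window.end(), c));
                if (votes > bestVotes) { bestVotes = votes; best = c; }
            }
            return best;                         // value presented to the user
        }
    };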
To capture images from the video camera, we used the VideoOCX ActiveX control (for more information about VideoOCX see Appendix 2 of the project report) within a Visual C++ application (written using Visual Studio C++ 6.0).

Figure 15 – Project GUI

Conclusions

This project’s goal was to implement a real time gesture recognition program.
The proposed algorithm is based on several image-processing techniques such as color characterization, pixel connectivity and size evaluation based on length-width ratio.
The project report goes into more detail about the image-processing methods used to achieve this goal.
We tested the program on several different users, and the results were very satisfying.
The resulting VideoOCX-based Visual C++ application shows that a real-time gesture recognition program can provide good results with quite simple computation.
This project focused mainly on hand gesture recognition and dealt with a specific gesture – presenting zero up to five fingers of the user’s hand.
The method at hand can be broadened and applied not only to the above specific gesture but also to other gestures, such as sign language, given an adjusted pixel-scan method and an automatic orientation mechanism that would enable the identification of more complex signs.
Moreover, an obvious advantage of the presented finger-counting algorithm is that it separates and defines the individual fingers, in contrast to previous approaches (presented in previous VISL lab projects on this subject) that settled for merely counting them.
By defining the individual fingers, as our algorithm does, we are able to examine and explore them.
In this project we used the length-width ratio, but the possibilities are unlimited.
We can, for example, check the angles between fingers, their relative lengths, and so on.
By “understanding” that one finger is much wider than its neighbors, we can deduce that a fault has occurred in the image-processing stage and conclude that we are actually looking at two attached fingers.
Basically, it is the beginning of A.I. (Artificial Intelligence) – “understanding” what happens in the picture.
A major improvement could be identifying different fingers by edge detection methods, using gradient differences.
This would allow recognition of folded fingers, not only extended ones.
Initially we examined this option but decided against it, having had trouble separating the palm from the other edges in the picture.
However, it may be more effective to apply such methods to a palm already isolated by the selection methods developed in this project.
The main drawback of our program is that it depends on a near-perpendicular orientation of the user’s hand, a problem that the automatic orientation mechanism mentioned above could solve.
Acknowledgments
We would like to thank the Lab chief engineer Johanan Erez for his professional guidance during this project.
We also wish to express our gratitude to the rest of the Lab’s staff for their help and support throughout this project and to the Ollendorff Minerva Center.