Speaker Independent Isolated Digit Recognition using Hidden Markov Models

Abstract
Today the keyboard is the most common way of giving commands to a computer. It keeps the user's hands busy whenever they want to communicate with the PC.
A speech-recognizing computer would have several advantages over a keyboard-driven one:

Hands-free communication
The ability to command the PC remotely (even from a cell phone)
Faster input (for most users, typing is slower than talking)

These are only some of the advantages that computer voice recognition could offer.
They are the main motivation for developing a reliable voice recognition system.
The goal of this project is to write code that, after a learning process, can recognize a one-second recording of one of the ten digits. This is an example of a small-vocabulary recognition algorithm.

Background
The main problem for a speech-recognizing computer is the great variety of commands (words) that can be used to control it. When typing, each command is composed of a finite set of symbols, each of which the computer recognizes easily from the keys pressed on the keyboard. Voice commands, on the other hand, do not consist of predefined components. Each has its own unique voice signature. In an ideal world, recording each word once and comparing every new sample to that recording would solve the problem. This approach is not practical, however: even for a single speaker, recordings vary with the speaker's mood, the distance from the microphone, background noise, and so on. Any other user will produce a different voice signature for the same command, which makes recognition even more complicated.
Many methods have been developed to overcome this problem; one of them is presented in this project.

The method
The method we applied in the project is based on hidden Markov models. The algorithm analyzes the speech signal and divides it into frames. Each frame is processed and then represented by a point in a p-dimensional space. A special Markov machine (built and tuned according to the training set of voice samples) then estimates the probability that the current sequence of frames represents each of the ten digits. The probabilities obtained for the digits are then used to classify the current voice sample.
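The framing step described above can be sketched as follows. This is a minimal illustration in Python, not the project's MATLAB code. The function names, the frame length, the hop size and the two features (log-energy and zero-crossing rate, giving p = 2) are all assumptions chosen for the example; the project does not state which p features it used.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Split a 1-D sequence of samples into (possibly overlapping) frames."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    # Keep only full frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def frame_features(frame):
    """Map one frame to a point in a p-dimensional space (here p = 2).

    Assumption: log-energy and zero-crossing rate stand in for whatever
    features the project actually computed.
    """
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-12)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return (log_energy, zcr)


# Illustrative input only: one second of a 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
frames = split_into_frames(signal, frame_len=200, hop=100)
points = [frame_features(f) for f in frames]
```

Each entry of `points` is the feature vector of one frame, so a one-second recording becomes a sequence of points that the Markov machines can score.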

The process consists of two parts:
1. Learning part: Each voice sample is processed and its parameters are added to the general statistics of the training set. These statistics are then used to represent each digit by a special Markov machine, which is used to classify new voice samples in the recognition part.

2. Recognition part: Each sample is preprocessed in the same way as in the learning part. After preprocessing, the algorithm tries to match the frame sequence to each of the ten Markov machines, each of which represents one digit. The probability that the frame sequence matches each machine is calculated, and the sample is classified as the digit whose machine gives the highest probability.
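The recognition step can be sketched as scoring the frame sequence against every model and picking the best one. The Python sketch below uses the standard scaled forward algorithm for a discrete-observation hidden Markov model. It assumes each frame has already been mapped to a symbol index (for example by vector quantization), which the project description does not confirm. The function names and the two toy models are hypothetical; the project uses ten models, one per digit.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[s]     probability of starting in state s
    trans[r][s]  probability of moving from state r to state s
    emit[s][k]   probability that state s emits symbol k
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale at every step to avoid underflow on long sequences.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


def classify(obs, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))


# Two toy left-to-right models over the symbols {0, 1} (made-up parameters).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "A": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),  # 0s, then 1s
    "B": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),  # 1s, then 0s
}
label = classify([0, 0, 1, 1], models)
```

Working with log-likelihoods rather than raw probabilities does not change which model wins, but it keeps the scores numerically stable for sequences with many frames.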

Tools
The code was written in MATLAB 7. No special hardware was needed to run the algorithm.

Results and Conclusions
The recognition algorithm was tested on two groups:

1. Single speaker: The best result for this group is:

2. Multiple speakers: The best result for this group is:

The average percentage of correct guesses clearly shows that the result is not due to chance and that the algorithm tends toward correct classification. However, some digits are significantly harder for the algorithm to recognize than others.
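One way to make this comparison concrete is to compute the accuracy per digit and overall, and to compare the overall figure with the 10% expected from guessing uniformly among ten digits. The sketch below is a Python illustration with hypothetical function names; the label lists are made-up placeholders, not the project's results.

```python
from collections import Counter


def per_digit_accuracy(true_labels, predicted_labels):
    """Return {digit: fraction of that digit's samples classified correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {digit: hits[digit] / totals[digit] for digit in sorted(totals)}


def overall_accuracy(true_labels, predicted_labels):
    """Return the fraction of all samples classified correctly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


CHANCE_LEVEL = 1 / 10  # uniform guessing among ten digits

# Placeholder labels for illustration only (not measured data).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 0, 1, 3, 5, 7]
by_digit = per_digit_accuracy(true_labels, predicted_labels)
overall = overall_accuracy(true_labels, predicted_labels)
above_chance = overall > CHANCE_LEVEL
```

The per-digit breakdown is what reveals which digits are harder to recognize, which a single average would hide.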

As expected, recognizing voice samples from a single speaker is much easier than recognizing samples from a set of multiple speakers.

The results achieved here could be improved significantly by tuning some of the algorithm's parameters and running it for a longer time (more calculation cycles).

Acknowledgment
We are grateful to our project supervisor Eugeni Gershikov for his guidance and help during the project.
We would also like to thank Johanan Erez and the whole team of the VISL laboratory for their technical support, without which the project could not have been completed.
We are also grateful to the Ministry of Transport and the Ollendorf Minerva Center for supporting this project.