Speech Recognition using SVM

Abstract
The Speech Recognition project was developed as a dedicated tool for controlling computer applications.
Speech recognition serves many roles in everyday life, such as applications for the hearing impaired, voice dialing, and voice-to-text (V2T) applications.
Speech recognition therefore appears to be not only a luxury but also a need. Our goal is to develop a fully integrated I/O system from this basic tool.
The application should recognize 12 words related to voice dialing: the digits 0-9 and the words “send” and “stop” (to begin and end a call).
The system is operated through a GUI (graphical user interface), which controls the basic features.

The problem (background)
Speech recognition with a learning machine poses a problem for the non-expert.
One may ask how sound signals can be recognized or classified and, moreover, how a learning machine can be built.
A learning machine is a powerful method that can be adapted to a wide range of problems.
Sound signal recognition/classification is adapted to the learning machine by “cleaning” the signal and passing it through a probabilistic model.

The solution (basic approach)
The solution is divided into 3 steps:
First:
a. Acquiring the signals and processing them in the time domain: calculating the signal’s envelope and truncating the signal to discard noise-only samples.

The word “Four”: initial signal

The word “Four”: envelope

The word “Four”: after truncation

b. Transforming them to the frequency domain and applying a low-pass filter (LPF) to the FFT, in order to discard the frequencies where noise is expected to appear.
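The two preprocessing stages above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's Matlab code: the moving-average envelope, the relative threshold used for truncation, the brick-wall low-pass filter, and all function and parameter names (`envelope`, `truncate`, `dft`, `lowpass_spectrum`, `window`, `rel_threshold`, `cutoff_hz`) are hypothetical choices. A naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath


def envelope(signal, window=5):
    """Estimate the envelope as a moving average of |signal| (assumed method)."""
    rectified = [abs(s) for s in signal]
    half = window // 2
    env = []
    for i in range(len(rectified)):
        lo = max(0, i - half)
        hi = min(len(rectified), i + half + 1)
        env.append(sum(rectified[lo:hi]) / (hi - lo))
    return env


def truncate(signal, env, rel_threshold=0.1):
    """Keep the span from the first to the last sample whose envelope
    reaches rel_threshold * max(env); the rest is treated as noise."""
    peak = max(env, default=0.0)
    if peak == 0.0:
        return []
    active = [i for i, e in enumerate(env) if e >= rel_threshold * peak]
    return signal[active[0]:active[-1] + 1]


def dft(x):
    """Naive O(n^2) discrete Fourier transform (an FFT computes the same values)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def lowpass_spectrum(spectrum, sample_rate, cutoff_hz):
    """Zero every bin whose frequency magnitude is above cutoff_hz."""
    n = len(spectrum)
    filtered = []
    for k, value in enumerate(spectrum):
        # Bins above n/2 hold negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        filtered.append(value if freq <= cutoff_hz else 0j)
    return filtered
```

A typical call chain would be `trimmed = truncate(signal, envelope(signal))` followed by `lowpass_spectrum(dft(trimmed), sample_rate, cutoff_hz)`. The sampling rate and cutoff frequency are not given in this summary, so they are left as parameters.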

Second:
Modeling the signal as an observation vector.
There are several known methods for creating that vector; the methods used in this project are:

The LPC algorithm (Linear Predictive Coding), which estimates each sample of the audio signal from previous samples.

The BINS method: windowing the FFT into bins and calculating the mean and standard deviation within each bin.

We obtained 20 coefficients from the LPC algorithm and 20 coefficients from the BINS method (10 bins, with a mean and a standard deviation for each).
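As a rough illustration of the two feature extractors, the sketch below computes LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion, and bin features as the mean and standard deviation of the spectrum magnitudes in equal-width bins. Both choices are assumptions: the project's exact LPC variant, bin boundaries, and normalization are not described here, and the names `lpc`, `bin_features`, `order`, and `num_bins` are hypothetical.

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Predictor coefficients a[1..order] with x[t] ~ sum_k a[k] * x[t-k],
    found by the Levinson-Durbin recursion on the autocorrelation."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break  # signal already predicted exactly (or all zeros)
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:]


def bin_features(magnitudes, num_bins=10):
    """Mean and (population) standard deviation of each of num_bins
    equal-width bins, giving 2 * num_bins features."""
    n = len(magnitudes)
    if n < num_bins:
        raise ValueError("need at least one magnitude per bin")
    features = []
    for b in range(num_bins):
        chunk = magnitudes[b * n // num_bins:(b + 1) * n // num_bins]
        mean = sum(chunk) / len(chunk)
        var = sum((m - mean) ** 2 for m in chunk) / len(chunk)
        features.extend([mean, math.sqrt(var)])
    return features
```

With `order=20` and `num_bins=10`, each function returns 20 values, matching the vector sizes quoted above.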
Third:
Passing the vector into the learning machine.
In this project the SVM algorithm (Support Vector Machine) was used to create the learning machine.
SVM tries to divide the space (of dimension N) into classes, with maximal margins between the classes.

A two-class example, linearly separated.

In practice, dividing the space is much harder and cannot be done perfectly, so classification errors must be allowed.
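For reference, allowing classification errors is commonly written as the standard soft-margin SVM problem below. The notation is an assumption introduced here, not taken from the project: $x_i$ are the observation vectors, $y_i \in \{-1, +1\}$ their class labels, $m$ the number of training samples, $\xi_i$ the slack variables that measure each violation, and $C$ the penalty that trades margin width against errors.

```latex
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,m
\]
```

Setting every $\xi_i = 0$ recovers the maximal-margin case in which the classes are separated perfectly.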

Moreover, sometimes a non-linear classifier is needed:

A case in which a non-linear classifier is needed.

We tried three classifier kernels: Linear, Polynomial, and Radial (RBF).
After these 3 steps, there is a complete system ready to classify input voice signals.
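The three kernels mentioned above can be written in their textbook forms as in the sketch below. The parameter names and default values (`degree`, `coef0`, `gamma`) are illustrative assumptions; the settings actually used with the SVM tool are not stated in this summary.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree"""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The linear kernel keeps the decision boundary a hyperplane in the original feature space, while the polynomial and RBF kernels allow curved boundaries at the cost of extra parameters to tune.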

Tools
Matlab 6.5
SVM Tool – http://www.ece.osu.edu/~maj/osu_svm

Performance
The three classifiers were checked on one speaker and on two speakers, using both methods.

Reps per Construction \ Num. of Speakers    Radial             Polynomial         Linear
5 Reps \ 1 Speaker                          48.8333 1.6596     25.4667 0.5963     25.9767 0.5445
10 Reps \ 1 Speaker                         38.75 0.5501       12.35 0.4894       11.85 0.6708
18 Reps \ 1 Speaker                         26.7 3.2904        5.23 0.256         4.750 1.1180
20 Reps \ 2 Speakers                        36.8 0.199         27.5 1.356         20.275 4.9961
30 Reps \ 2 Speakers                        27.8 0.5606        16.33 1.7029       21.77 3.8346
38 Reps \ 2 Speakers                        23 2.5355          11.5 2.4152        9.35 6.6914

Conclusions
Over the course of the project we reached several conclusions:

a. Despite its theoretical limitations, the Linear classifier generally performed
better than the other two classifiers, because it does not over-fit to the training set.

b. The success of the learning system is affected by:

Dictionary size: At first we had a dictionary of 100 words.
This proved very hard to implement, so it was reduced to 12 words.
Samples: as more samples of the words were given, the classification
improved.

Recording quality: the better the recording environment, the better the
classification.

Number of speakers: The system supports one or two classifiers. The more
classifiers, the higher the computational complexity.

Coefficient handling: deriving the coefficients from the signal is the most
crucial stage of the system.

c. It is important to compare against other methods, such as HMMs
(Hidden Markov Models) and hybrid systems, in search of better
performance.
d. For a real-time system, the recording environment should be as
noise-free as possible.
e. The dictionary should be expanded to support more applications and
needs, while trying not to degrade performance significantly.

Acknowledgments
We are grateful to our project instructor, Dori Peleg, for his help and guidance throughout this work, and to the Lab Supervisor, Johanan Erez.