Voice Recognition using Neural Networks

Voice recognition is a major open problem that has not yet been solved completely. In this project we approach it in a new way, using neural networks.

Introduction
Voice recognition is a major open problem; to date it has not been solved completely. We try to approach it in a new way, using neural networks. The motivation stems from the fact that voice recognition is performed remarkably well by the human brain, and we would like to get closer to that performance. We tried two different techniques for using a neural network to separate speakers. The first technique is based on a previous project done by Yaron Eshet. It relies on the chaotic property of a neural network: different initial conditions (speech segments) cause different network responses. The second technique is based on the Liquid State Machine model, described in the article “Isolated word recognition using a Liquid State Machine” by David Verstraeten, Benjamin Schrauwen and Dirk Stroobandt.


Part A:

Basic approach
In part A we used the properties of a chaotic system. A chaotic system is one in which two close initial conditions lead to two completely different outputs, in contrast to an ordered system, in which all initial conditions lead to the same output. We were interested in finding the edge between chaos and order, so that the system would divide the set of initial conditions into a small number of groups (basins). Because the neural network is a non-conservative system, a periodic driving force must be added (see: Chaos & Separation in Biological Neural Networks). The number of basins into which the system divides the initial conditions depends on system parameters such as the connectivity and the period of the driving force. We investigated this dependence; a sketch of the kind of experiment involved is given below, and our conclusions follow.
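The following is a minimal sketch, under our own assumptions (the network model, sizes, thresholds and the counting criterion are illustrative and not the project's actual code), of how such a basin-counting experiment can be set up: a small random recurrent network with a periodic drive is run from many random initial conditions, and the distinct steady-state responses are counted.

    # A minimal sketch (not the project's actual code) of the basin-counting
    # experiment: a small random recurrent network driven by a periodic force
    # is run from many random initial conditions, and the distinct steady-state
    # responses are grouped to estimate the number of basins.
    import numpy as np

    rng = np.random.default_rng(0)

    N = 50               # number of neurons (assumed)
    connectivity = 0.2   # fraction of non-zero synapses (a parameter we vary)
    period = 10          # period of the external driving force, in time steps

    # Random sparse weight matrix (the recurrent network).
    W = rng.normal(0.0, 1.0, size=(N, N)) * (rng.random((N, N)) < connectivity)

    def run(x0, steps=400):
        """Evolve a binary-threshold network with a periodic drive and
        return the activity pattern over the last full drive period."""
        x = x0.copy()
        for t in range(steps):
            drive = np.sin(2 * np.pi * t / period)     # periodic force
            x = (W @ x + drive > 0).astype(float)      # threshold update
        tail = []
        for t in range(steps, steps + period):
            drive = np.sin(2 * np.pi * t / period)
            x = (W @ x + drive > 0).astype(float)
            tail.append(x.copy())
        return np.concatenate(tail)

    # Many random initial conditions; distinct tails ~ distinct basins.
    tails = {run(rng.integers(0, 2, N).astype(float)).tobytes() for _ in range(200)}
    print("estimated number of basins:", len(tails))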


Conclusions

We can see that there is a very thin edge between chaos and order, which makes finding it difficult. Moreover, as we checked later, the system is very sensitive to noise and is not robust. For these reasons we decided to take a different approach.

Part B:


Basic approach

In part B we used the Liquid State Machine (LSM) model in order to recognize speech. The main goal was to reproduce the results of the article “Isolated word recognition using a Liquid State Machine” (by David Verstraeten, Benjamin Schrauwen and Dirk Stroobandt), but with a slight change: separating speakers instead of recognizing a word from a specific vocabulary. The LSM model (see: “Biological Neural Networks” by Yigal Reichelhaus and Karina Odanyev) works as follows: the memory is constructed of a “liquid”; the input creates perturbations in the liquid; the “liquid state” is the current picture of the liquid, which contains information about the input; this information is extracted by a readout function, which must be trained.
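As a minimal illustration of the LSM idea (the liquid model, sizes and readout below are assumptions, not the project's actual configuration), the following sketch perturbs a random recurrent “liquid” with input spike trains, samples the liquid state, and trains a simple linear readout on it (the project itself used a feed-forward readout network):

    # Minimal LSM sketch (assumptions, not the project's code): a random
    # recurrent "liquid" is perturbed by input spike trains, the liquid state
    # is sampled, and a simple linear readout is trained on it.
    import numpy as np

    rng = np.random.default_rng(1)

    n_in, n_liquid, T = 20, 200, 300   # input channels, liquid size, time steps (assumed)

    W_in = rng.normal(0, 0.5, (n_liquid, n_in))                       # input -> liquid
    W = rng.normal(0, 1.0 / np.sqrt(n_liquid), (n_liquid, n_liquid))  # liquid recurrence (with loops)

    def liquid_state(spikes, leak=0.9):
        """Drive a leaky rate 'liquid' with an (n_in, T) spike array and
        return the final state as the feature vector for the readout."""
        x = np.zeros(n_liquid)
        for t in range(spikes.shape[1]):
            x = leak * x + np.tanh(W @ x + W_in @ spikes[:, t])
        return x

    # Toy dataset: two "speakers" = two different input firing probabilities.
    def sample(speaker):
        p = 0.05 if speaker == 0 else 0.15
        return (rng.random((n_in, T)) < p).astype(float)

    X = np.array([liquid_state(sample(s)) for s in [0, 1] * 100])
    y = np.array([0, 1] * 100)

    # Linear readout trained by least squares (kept linear to keep the sketch short).
    A = np.c_[X, np.ones(len(X))]
    w = np.linalg.lstsq(A, 2 * y - 1, rcond=None)[0]
    pred = (A @ w > 0).astype(int)
    print("training accuracy:", (pred == y).mean())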


In our case the “liquid” is a neural network with randomly distributed connections (containing loops), and the readout function is a feed-forward network. Preprocessing of the speech is done using two techniques: MFCC (Mel Frequency Cepstral Coefficients) and LPE (Lyon Passive Ear). The preprocessing reduces the amount of information fed to the system, leaving only the data most relevant to the recognition task.
MFCC: this method exploits the logarithmic sensitivity of the human auditory system to intensity and frequency.
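As an illustration of this step, here is a hedged sketch that computes MFCCs with the librosa library (an assumed, commonly available tool; the project did not necessarily use it, and the file name and frame parameters are hypothetical):

    # Hedged MFCC sketch: compute Mel Frequency Cepstral Coefficients with
    # librosa (an assumed library, not necessarily what the project used).
    import librosa

    signal, sr = librosa.load("speaker_sample.wav", sr=16000)  # hypothetical file

    # 13 cepstral coefficients per 25 ms frame with a 10 ms hop (typical choices).
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    print(mfcc.shape)  # (13, number_of_frames): the reduced representation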


LPE: this method models the entire path of the auditory signal through the ear, using band-pass filters, half-wave rectifiers and adaptive gain controllers.
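The following is a highly simplified sketch of the kind of processing the Lyon Passive Ear performs (the band edges, gain-control scheme and test signal are our own assumptions; the real model contains many more stages and specific parameters):

    # Highly simplified Lyon-Passive-Ear-style sketch (assumptions, not the
    # actual LPE implementation): a bank of band-pass filters followed by
    # half-wave rectification and a crude adaptive gain control per channel.
    import numpy as np
    from scipy.signal import butter, lfilter

    sr = 16000
    t = np.arange(sr) / sr
    signal = np.sin(2 * np.pi * 440 * t)          # hypothetical 1 s test tone

    bands = [(100, 400), (400, 1000), (1000, 3000), (3000, 7000)]  # assumed channels

    def lyon_like(signal, sr, bands, tau=0.01):
        channels = []
        for lo, hi in bands:
            b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
            x = lfilter(b, a, signal)             # band-pass filter (cochlear channel)
            x = np.maximum(x, 0.0)                # half-wave rectifier
            # crude adaptive gain control: divide by a running level estimate
            gain_state, out = 1e-3, np.empty_like(x)
            alpha = 1.0 / (tau * sr)
            for i, v in enumerate(x):
                gain_state += alpha * (v - gain_state)
                out[i] = v / (gain_state + 1e-6)
            channels.append(out)
        return np.stack(channels)                 # (n_channels, n_samples) cochleagram-like output

    print(lyon_like(signal, sr, bands).shape)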


Encoding techniques are used to translate the output of the preprocessing step into spikes that are fed into the neural system. We used three different methods:
Poisson: the firing rate is proportional to the amplitude of the signal.
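A hedged sketch of such Poisson encoding (the maximum rate, time step and test data are assumptions):

    # Poisson encoding sketch (assumptions only): the instantaneous firing
    # probability of each channel is proportional to the non-negative,
    # normalized signal amplitude in that channel.
    import numpy as np

    rng = np.random.default_rng(2)

    def poisson_encode(analog, max_rate=200.0, dt=1e-3):
        """analog: (channels, frames) array scaled to [0, 1].
        Returns a boolean spike array of the same shape."""
        rate = analog * max_rate                  # Hz, proportional to amplitude
        p_spike = np.clip(rate * dt, 0.0, 1.0)    # spike probability per time bin
        return rng.random(analog.shape) < p_spike

    analog = np.abs(rng.normal(size=(13, 100)))
    analog /= analog.max()
    spikes = poisson_encode(analog)
    print(spikes.mean())                          # overall spike density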


LIF: the signal is fed into a leaky integrate-and-fire neuron model; the neuron's output spike train is the encoding.
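A hedged sketch of this encoding with a leaky integrate-and-fire neuron (the neuron parameters and test signal are assumptions):

    # LIF encoding sketch (assumed parameters): each channel of the
    # preprocessed signal drives a leaky integrate-and-fire neuron, and the
    # neuron's output spike train is the encoding of that channel.
    import numpy as np

    def lif_encode(current, dt=1e-3, tau=0.02, v_thresh=1.0, v_reset=0.0):
        """current: 1-D input signal driving the neuron. Returns a 0/1 spike train."""
        v = 0.0
        spikes = np.zeros_like(current)
        for i, I in enumerate(current):
            v += dt / tau * (-v + I)      # leaky integration of the input
            if v >= v_thresh:             # threshold crossing -> spike
                spikes[i] = 1.0
                v = v_reset               # reset after the spike
        return spikes

    signal = np.abs(np.sin(np.linspace(0, 10, 1000))) * 3.0   # hypothetical channel
    print(int(lif_encode(signal).sum()), "spikes")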


BSA (Ben's Spiker Algorithm): a spike is fired only when there is a good fit between the signal and a known filter.
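A hedged, simplified sketch of this idea (the filter and threshold are assumptions, and the code is only an approximation of the actual BSA algorithm): at each step, if subtracting the known filter from the upcoming signal reduces the error enough, a spike is emitted and the filter is subtracted.

    # Simplified BSA-style encoding sketch (assumptions, not the reference
    # implementation): spike when the known filter fits the local signal well,
    # then subtract the filter from the signal.
    import numpy as np

    def bsa_encode(signal, fir, threshold=0.05):
        """signal: 1-D non-negative array; fir: known filter (impulse response)."""
        s = signal.astype(float).copy()
        spikes = np.zeros_like(s)
        L = len(fir)
        for t in range(len(s) - L):
            window = s[t:t + L]
            err_spike = np.abs(window - fir).sum()   # error if the filter "explains" this part
            err_none = np.abs(window).sum()          # error if no spike is emitted here
            if err_spike <= err_none - threshold:    # good fit between signal and filter
                spikes[t] = 1.0
                s[t:t + L] -= fir                    # remove the explained part
        return spikes

    fir = np.hanning(16) / np.hanning(16).sum()      # assumed smoothing filter
    rng = np.random.default_rng(3)
    signal = np.convolve((rng.random(300) < 0.05), fir, mode="same")
    print(int(bsa_encode(signal, fir).sum()), "spikes")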



Results

Trying all combinations of preprocessing and encoding for the task of separating two speakers, the best combination was MFCC+LIF, which gave an error of approximately 3%. As we increased the number of speakers, the error grew and saturated at about 27%.

 

Conclusions
As we can see, learning does take place, but the results are not good enough and there is room for optimization. The error for two speakers is small, but as the number of speakers increases the error becomes unacceptable.

 

Acknowledgment
We would like to thank our supervisors, Karina Odinaev and Igal Raichelgauz, for guiding us so well through this project. We would like to thank Johanan Erez and Eli Appelboim for helping us solve our problems. We would also like to thank David Verstraeten and Benjamin Schrauwen for their kind help and tips.
We would also like to thank the Ollendorff Minerva Center, which supported this project.