Voice Recognition by a Realistic Model of Biological Neural Networks

Abstract
In this project, a new model for a voice recognition system is suggested. The model is based on a realistic model of biological neural networks, and it integrates principles from the theories of chaotic systems and Liquid State Machines. The model was implemented in MATLAB, and several tests were performed on it. The task of the system in these tests was to recognize the voice of a specific person (a voice that it was trained on) among hundreds of other voices.

 

The Problem
The objective of this project was to design a system that can classify voices, i.e., recognize the voice of a specific person. However, the problem of voice classification is too broad to be solved by finite state machines, since it is obviously impossible to create a state for every word that every person in the world might say. Even if it were possible, recordings of every word in the voice of every person could not be stored in order to perform bit-by-bit comparison.

 

The Solution
The task of voice recognition is highly suitable for neural networks, since such networks can work as classifiers and distinguish the voice they have learned to identify from other voices. Such networks learn the characteristics of the voice, so using them does not require endless recordings of words.

A new model for a voice recognition system based on neural networks is suggested in this project. Our approach to voice recognition integrates concepts from the theories of chaotic neural networks and Liquid State Machines. The main principle of the proposed model is that the input signal is recognized according to the current state of the model: a limit cycle in which the output of the network is periodic and uniquely defines that state. We defined this state as a basin. The model is presented in figure 1.
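
Since a basin is characterized by periodic network output, convergence to a basin can be detected by checking whether the recent output repeats itself. The following MATLAB sketch illustrates one simple way to do this; the variable names, window length, and tolerance are illustrative assumptions, not the project's actual implementation.

    % Illustrative detection of convergence to a limit cycle (basin).
    % counts - N x T matrix of spike counts (N neurons, T time bins).
    % The output is considered periodic when the most recent window
    % repeats the output of the preceding window.
    win = 50;                                % window length in time bins (assumed)
    tol = 1e-3;                              % relative tolerance (assumed)
    w1  = counts(:, end-2*win+1:end-win);    % preceding window
    w2  = counts(:, end-win+1:end);          % most recent window
    isPeriodic = norm(w1 - w2, 'fro') / max(norm(w1, 'fro'), eps) < tol;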

Figure 1: The model of the proposed system for voice recognition

The input signal u, which represents the auditory stimulus, consists of several parallel spike trains that are transmitted simultaneously to the input neurons of the neural network (see section 4.1 for more information on the creation of the stimulus). The input neurons transmit an internal signal x1 to the neural network l, which consists of spiking neurons arranged in a three-dimensional structure (135 neurons in this implementation, as detailed below). The input signal pushes the network into a basin. The readout function f receives the spike trains of all the neurons in l and recognizes the basin that the network has converged to by comparing it with the output patterns of the basins that appear in the indicator map. It then classifies the input according to the indicator that belongs to that basin. The output signal y determines the class of voices to which the input signal belongs.
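
As a rough illustration of the readout stage, the following MATLAB sketch classifies the network state by matching it against stored basin patterns. The nearest-pattern matching rule, the variable names, and the threshold are assumptions for illustration only, not the project's actual readout.

    % Illustrative readout: recognize the current basin and classify the input.
    % spikeCounts   - 1 x N vector of spike counts of the N network neurons
    % basinPatterns - B x N matrix; row b is the stored output pattern of basin b
    % indicators    - B x 1 vector of indicator values (the indicator map)
    % threshold     - decision threshold on the indicator value (assumed)
    function label = readoutSketch(spikeCounts, basinPatterns, indicators, threshold)
        B = size(basinPatterns, 1);
        % Find the stored basin pattern closest to the current network state
        diffs = basinPatterns - repmat(spikeCounts, B, 1);
        [minDist, bestBasin] = min(sum(diffs.^2, 2));
        % Classify the input according to the indicator of that basin
        if indicators(bestBasin) >= threshold
            label = 'wanted';
        else
            label = 'other';
        end
    end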

The neural network l consists of 135 spiking neurons in a 3x3x15 formation. The behavior of the neurons is simulated by the Leaky Integrate and Fire (LIF) model, and the neurons are connected by dynamic spiking synapses. Twenty percent of the neurons in the network are randomly chosen to be inhibitory, and the rest are excitatory, in line with biological values. The connectivity of the network is moderate (lambda = 2).
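
For reference, the LIF model integrates the membrane potential according to tau_m * dV/dt = -(V - V_rest) + R_m * I and emits a spike (followed by a reset) when the potential crosses a threshold. The MATLAB sketch below simulates a single LIF neuron with a constant input current; the parameter values are typical textbook choices, not the project's.

    % Minimal simulation of a single leaky integrate-and-fire (LIF) neuron.
    dt     = 1e-4;           % time step [s]
    T      = 0.5;            % total simulated time [s]
    tau_m  = 30e-3;          % membrane time constant [s]
    R_m    = 1e6;            % membrane resistance [ohm]
    V_rest = 0;              % resting potential [V]
    V_th   = 15e-3;          % firing threshold [V]
    I      = 20e-9;          % constant input current [A]

    nSteps = round(T / dt);
    V      = V_rest;
    spikes = zeros(1, nSteps);
    for k = 1:nSteps
        % Integrate the membrane equation: tau_m * dV/dt = -(V - V_rest) + R_m * I
        V = V + (dt / tau_m) * (-(V - V_rest) + R_m * I);
        if V >= V_th                     % threshold crossed: spike and reset
            spikes(k) = 1;
            V = V_rest;
        end
    end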

An important advantage of the proposed model is that several tasks can be performed with a single network at the same time: the readout function can be trained to recognize several people by finding the current basin of the network and comparing it to several indicator maps (one for each person).

 

Tools
MATLAB 6.5 was used for the development of the voice recognition system and of a GUI that enables full control of it. The tests were performed with two databases of recorded speech: the first was recorded in the SIPL laboratory of the Electrical Engineering faculty of the Technion (IIT), and the second was taken from the NIST database that is offered at http://www.nist.gov/speech/tests/lang/2003/.

The neural network that was studied in this project was created with CSIM, a new simulator for neural microcircuits in the MATLAB environment. Full details of the CSIM simulator can be found at http://www.lsm.tugraz.at/csim/.

Two methods were used for encoding the recorded speech signal into spike trains:

  • Amplitude Encoding: In this method, a straightforward conversion is performed between the signal amplitude at time t and the number of input neurons that fire at that time (see the sketch after this list).
  • MFCC Encoding: In this method the auditory signal is represented by Mel Frequency Cepstral Coefficients (MFCCs), coefficients that are based on human auditory perception. The signal is divided into short segments, and each of them is transformed (by FFT) into the frequency domain. The frequency bands are positioned logarithmically on the mel scale, a scale of pitches judged by listeners to be equally spaced from one another.
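
As a concrete illustration of the first method, the following MATLAB sketch maps the normalized amplitude at each time step to the number of input neurons that fire at that step; the number of input neurons and the rounding rule are assumptions for illustration. The comments at the end outline the standard MFCC pipeline for comparison.

    % Illustrative amplitude encoding of a speech signal into spike trains.
    % sig - vector of speech samples; nIn input neurons are assumed.
    nIn = 10;                                 % number of input neurons (assumed)
    a   = abs(sig) / max(abs(sig));           % normalized amplitude in [0, 1]
    spikeTrains = zeros(nIn, length(a));
    for t = 1:length(a)
        nFiring = round(a(t) * nIn);          % amplitude -> number of firing neurons
        spikeTrains(1:nFiring, t) = 1;        % the first nFiring neurons spike at t
    end

    % The standard MFCC pipeline, for comparison (conceptual steps only):
    % 1. Split the signal into short overlapping frames.
    % 2. Compute the power spectrum of each frame with an FFT.
    % 3. Apply a mel-scaled filter bank and take the log of the band energies.
    % 4. Apply a DCT to the log energies; the leading coefficients are the MFCCs.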

 

The Classification Process

The classification process consists of three main parts:

  1. Training: In this part the system is trained on different auditory stimuli: some of them contain the voice that the system should learn to identify, and others contain voices of other people. Simulations of the neural network are performed for every voice segment. The system learns the basins that the network converges to in each simulation and creates an indicator map: each indicator is a number associated with a basin, and it indicates how well that basin represents the wanted voice (a sketch of this computation appears after this list).
  2. Tuning: In this part the simulations are performed on another database (which includes voice segments of the wanted person and of other people), and the user tunes the classification parameters so that the indicator map best suits the person that the system should identify.
  3. Testing: In this part a new stimulus is presented to the neural network. The system finds the basin that the network converged to and makes a classification decision based on the indicator of that basin. The output states whether the stimulus is the voice of the wanted person or of another person.
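
To make the training stage concrete, the MATLAB sketch below builds an indicator map from the basins observed during training. The scoring rule (the fraction of a basin's training segments that came from the wanted voice) is an assumption for illustration, not necessarily the project's exact formula.

    % Illustrative construction of an indicator map during training.
    % basinOfSeg(i) - index of the basin the network converged to for segment i
    % isWanted(i)   - 1 if training segment i belongs to the wanted voice, else 0
    nBasins    = max(basinOfSeg);
    indicators = zeros(nBasins, 1);
    for b = 1:nBasins
        inBasin = (basinOfSeg == b);          % training segments that reached basin b
        if any(inBasin)
            % Indicator: fraction of the basin's segments from the wanted voice
            indicators(b) = sum(isWanted(inBasin)) / sum(inBasin);
        end
    end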

The different stages of the voice recognition process are depicted in figure 2.

Figure 2: A flow chart of the classification process

Results
Several terms were defined for evaluating the classification results (a sketch for computing the corresponding rates follows this list):

  • Hit Segments – Voice segments that were classified correctly.
  • Miss-Hit Segments – Voice segments of the person that the system was trained to identify that were classified as voices of other people.
  • False Alarm Segments – Voice segments of different people (not the one that the system was trained to identify), which were classified as the voice of the wanted person.
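
Given the true and predicted labels of the test segments, the corresponding rates can be computed directly, as in the following MATLAB sketch (the variable names are illustrative):

    % Evaluation measures for a set of classified voice segments.
    % trueLabel(i) = 1 if segment i belongs to the wanted voice, 0 otherwise
    % outLabel(i)  = 1 if the system classified segment i as the wanted voice
    hitRate        = sum(outLabel == 1 & trueLabel == 1) / sum(trueLabel == 1);
    missHitRate    = sum(outLabel == 0 & trueLabel == 1) / sum(trueLabel == 1);
    falseAlarmRate = sum(outLabel == 1 & trueLabel == 0) / sum(trueLabel == 0);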

Table 3 presents the database that was used for the tests.

 

                            Num. of Voice Segments
                          Wanted Voice    Other Voices
  Data Set 1 (Training)        30             300
  Data Set 2 (Tuning)          30              30
  Data Set 3 (Test I)         100             400
  Data Set 4 (Test II)         38              40

Table 3: The database that was used for the tests

 

Results for Amplitude Encoded Input
The stimuli that were created by amplitude encoding were examined, and no significant difference was found between the stimuli of the wanted person and those of other people. The task of the system was therefore to identify the voice of the wanted person, not to identify a certain type of stimulus.

Table 4 presents the results of two classification tests that were performed on two different (parallel) systems. In these tests, data set 1 was used for training the system, data set 2 for tuning it, and data set 3 for testing it.

  Segments   Classified as           Test Num. 1   Test Num. 2   True Classification
  1-100      Wanted (Hit)                 71%           94%             100%
             Unwanted (Miss-Hit)          29%            6%               0%
  101-492    Wanted (False Alarm)       55.9%        61.23%               0%
             Unwanted                   41.1%        38.77%             100%

Table 4: The results of classification tests number 1 and 2

 

The results presented in table 4 show that both systems identified most of the segments of the wanted voice. The indicator map of the network that was used in test 2 was clearly much better than that of the first network: 94% of the wanted voice segments were identified in the second test, while only 71% of them were identified in the first test. This shows that the internal structure of the neural network (which is randomly drawn) significantly influences its classification ability.
The difference (in percentage points) between the segments that were classified correctly and the false alarm segments was 15% in the first test and 34% in the second test. These large differences show that the classification of a voice segment as a wanted segment is not a random process. The reason for the high false alarm rate is that the system was designed to find most of the voice segments of the wanted voice, even at the cost of many other segments being classified as wanted.

 

Results for MFCC Encoded Input
An examination of the stimuli that were generated by the MFCC method revealed that the stimuli of the wanted voice were quite similar to each other but very different from the stimuli of the other voices. The classification task thus became a task of distinguishing between two types of stimuli.
Table 5 presents the results of two classification tests that were performed on the same system. Data set 1 was used for training the system, data set 2 was used for tuning it, and data sets 3 and 4 were used for testing it.

 

  True             Classified as           Test I                 Test II
  Classification                           (100 wanted,           (30 wanted,
                                           400 unwanted segs.)    30 unwanted segs.)
  Wanted           Wanted (Hit)                 87%                   86.8%
  Wanted           Unwanted (Miss-Hit)          13%                   13.2%
  Unwanted         Wanted (False Alarm)       55.3%                     45%
  Unwanted         Unwanted                   44.7%                     55%

Table 5: The results of the classification tests

The results detailed in table 5 indicate that the system is quite reliable in classifying new data: most of the segments of the wanted voice and about half of the segments of the unwanted voices were classified correctly. The consistency of the system is supported by the fact that the hit rate was almost identical in the two classification tests, even though they consisted of significantly different numbers of voice segments and there was no overlap between them.

 

Conclusion
A new method for performing voice recognition with a realistic model of biological neural networks was presented and implemented in this project. Several systems were configured and trained by the presented method. They were tested on two types of stimuli: one created by amplitude encoding of recorded speech, and another created by the MFCC method. The amplitude encoding method was found to be efficient, while the MFCC method yielded stimuli that were highly characteristic of each person.

Tests that were performed on stimuli of both kinds showed that the systems were effective in identifying the voice that they were trained to find: the systems found and classified correctly most of the voice segments of the 'wanted person', even when they were given a stimulus that contained many more voice segments of other people. The conclusion is that such systems can handle a very high level of noise. Other tests showed that the systems were consistent and stable in their classification performance.

Altogether, the tests that were carried out in this project showed that our model for a neural network based voice recognition system is well suited for performing voice classification tasks.

 

Acknowledgment
I am grateful to my project supervisors Karina Odinaev and Igal Raichelgauz for their help and guidance throughout this work. I would also like to thank the Ollendorff Minerva Center for supporting this project.