Speech Recognition Using Hidden Markov Models

Abstract

Speech recognition has been around for decades and, slowly but surely, has worked its way into a wide range of applications, including cellular phones, computer interfaces, and military intelligence. Many tools have been developed for analyzing speech, and many algorithms have been tested and improved. In particular, statistical modeling techniques such as Hidden Markov Models have proved their worth as an integral part of any modern speech recognition algorithm.
The goal of this project was to lay the groundwork for a speech-driven Hebrew keyboard. We built a system that can distinguish between words spoken by two different male speakers, using a vocabulary consisting of the twenty-two Hebrew letters and the digits zero through nine, 32 words in total. Such a system can be extended to a larger variety of speakers and a larger vocabulary.

 

Background

Speech recognition encompasses a vast array of techniques. Even within the subgroup of techniques based on Hidden Markov Models (HMMs), the design choices are numerous.
The greatest challenge facing any speech recognition algorithm is drastically reducing the enormous amount of data in a speech signal while preserving the relevant information. This reduction is necessary to bring the computational load down to a level that is feasible with current technology. In addition, the processing algorithms must be chosen wisely.
Our project uses quick and simple techniques to make such processing possible.

 

Basic Approach

The system consists of two main parts: training and recognition. Both parts process speech in a similar manner: each word is divided in time into a sequence of frames, where each frame of approximately 20 ms contains one “phoneme”, the shortest and most basic unit of speech. Each frame is then processed in two main stages: parameterization and vector quantization. The sequence of quantized parameters from the entire word is then processed using HMMs.
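
As an illustration of this framing step (a sketch of ours, not the project's original code), the following Matlab fragment splits a recorded word into non-overlapping 20 ms frames; the sampling rate of 8 kHz is an assumption:

    % Split a recorded word into non-overlapping 20 ms frames.
    % x is a column vector of speech samples (assumed already recorded).
    fs = 8000;                            % assumed sampling rate [Hz]
    frameLen = round(0.020 * fs);         % samples per 20 ms frame
    numFrames = floor(length(x) / frameLen);
    frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);
    % Each column of 'frames' is one frame, processed independently.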

The parameterization is done with a “bank of filters” that measures the energy of different frequency bands within the frame.
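
One simple way to realize such a filter bank (again our sketch, not necessarily the filters used in the project) is to group the bins of each frame's power spectrum into bands; the choice of 24 bands and a 256-point FFT are assumptions:

    % Filter-bank energies for each frame (the columns of 'frames').
    numBands = 24;                                  % assumed number of filters
    nfft = 256;                                     % assumed FFT length
    n = (0:frameLen-1)';
    win = 0.54 - 0.46*cos(2*pi*n/(frameLen-1));     % Hamming window
    spec = abs(fft(frames .* win, nfft)).^2;        % power spectrum per frame
    spec = spec(1:nfft/2, :);                       % keep positive frequencies
    edges = round(linspace(1, nfft/2, numBands + 1));
    energies = zeros(numBands, numFrames);
    for b = 1:numBands
        energies(b, :) = sum(spec(edges(b):edges(b+1), :), 1);
    end
    % Each column of 'energies' is the parameter vector of one frame.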

The training algorithm uses these parameters to build a “phoneme dictionary” of 1024 phonemes using a simple clustering method. The recognition algorithm quantizes each parameter vector by assigning it to the closest entry in the phoneme dictionary.

The HMM training is carried out separately for each word in the vocabulary. The sequences of quantized parameters of each word are processed with the Baum-Welch algorithm, and a statistical model (a Markov chain) is created to represent each word based on the given sequences. Such a model gives a high score to any new sequence of the word it was trained to recognize, and a low score to any other.

The recognition algorithm extracts a sequence of quantized parameters from a recording and then measures the probability of that sequence being generated by the model of each word in the vocabulary. The word is recognized as the vocabulary entry corresponding to the model with the highest probability.
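
To make these steps concrete, here is an illustrative Matlab sketch (ours, not the project's code) of the quantization and of scoring a word with the forward algorithm. The names codebook, A, B and piv are assumptions standing in for the outputs of the clustering and Baum-Welch training described above:

    % --- Vector quantization -------------------------------------------
    % 'codebook' is a 1024-by-numBands matrix of centroids produced by
    % the clustering stage (e.g. k-means over all training vectors).
    T = size(energies, 2);
    obs = zeros(1, T);                    % one dictionary index per frame
    for t = 1:T
        d = sum((codebook - energies(:, t)').^2, 2);  % squared distances
        [~, obs(t)] = min(d);             % index of the closest "phoneme"
    end

    % --- Scoring with the forward algorithm -----------------------------
    % A   : N-by-N state transition matrix of one word model
    % B   : N-by-1024 emission probabilities (state x dictionary entry)
    % piv : N-by-1 initial state distribution
    % All three are estimated by Baum-Welch during training; in Matlab
    % this function would live in its own file, forwardScore.m.
    function logL = forwardScore(A, B, piv, obs)
        alpha = piv(:) .* B(:, obs(1));   % forward variables at t = 1
        logL = 0;
        for t = 2:numel(obs)
            alpha = (A' * alpha) .* B(:, obs(t));
            c = sum(alpha);               % rescale to avoid underflow
            alpha = alpha / c;
            logL = logL + log(c);
        end
        logL = logL + log(sum(alpha));
    end

Recognition then amounts to evaluating forwardScore for every word model and picking the word whose model yields the highest log-likelihood.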

 

Results

The recognition process cannot be considered fully “real time”, since a word must be recorded in its entirety before it can be processed; the processing itself, however, takes only about 15 ms per word. The training algorithm, on the other hand, is much more complex and requires much more time. We succeeded in reducing this time to approximately 25 minutes for a vocabulary of 32 words; the time needed grows approximately linearly with the number of words in the vocabulary.
The success rate for a single speaker and a vocabulary consisting of the ten Hebrew digits was 94%.
The success rate for two male speakers and a vocabulary consisting of the ten Hebrew digits and the twenty-two Hebrew letters was 85%.

Tools

This project was programmed in Matlab and run on a personal computer with a 2.4 GHz Pentium 4 processor. The speech was recorded using home equipment.

 

Conclusions

We found that it is not only possible to create highly accurate speech recognition algorithms, but also possible to do so within a reasonable processing time.

 

Acknowledgment

We are grateful to our project supervisor, Dori Peleg, for his help and guidance throughout this work.
We are also grateful to the Ollendorff Minerva Center Fund for supporting this project.