Abstract
The aim of this project is to develop an algorithm for lip reading: given a silent video of a speaking person, the algorithm's output is the spoken word (its label). Our approach is learning-based, which requires a paired training set of videos and their labels. Searching the web, we could not find such a set. However, one can easily find a set of paired examples of audio and labels, as well as a set of paired video and audio. How can we utilize these sets for our goal? To estimate the label from silent video frames, we use the audio as an auxiliary variable. We present a solution to this problem and its implementation.
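The two-set idea above can be sketched as a two-stage regression: one map learned from the (video, audio) pairs, another from the (audio, label) pairs, composed to predict a label from silent video. This is only a minimal toy sketch with synthetic data; all dimensions, variable names, and the use of linear least squares are illustrative assumptions, not the project's actual features or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: video features (d_v), audio features (d_a).
d_v, d_a, n = 8, 4, 200

# Ground-truth linear maps, used ONLY to synthesize toy data.
A = rng.normal(size=(d_v, d_a))   # video -> audio
w = rng.normal(size=d_a)          # audio -> label score

# Set 1: paired (video, audio) examples -- no labels.
V1 = rng.normal(size=(n, d_v))
X1 = V1 @ A

# Set 2: paired (audio, label) examples -- no video.
X2 = rng.normal(size=(n, d_a))
y2 = np.sign(X2 @ w)              # binary "word" label in {-1, +1}

# Stage 1: regress audio features on video features (least squares).
A_hat, *_ = np.linalg.lstsq(V1, X1, rcond=None)

# Stage 2: regress labels on audio features.
w_hat, *_ = np.linalg.lstsq(X2, y2, rcond=None)

# Composition: estimate the label directly from silent video,
# with audio acting only as the auxiliary variable.
V_test = rng.normal(size=(50, d_v))
y_pred = np.sign(V_test @ A_hat @ w_hat)
y_true = np.sign(V_test @ A @ w)
accuracy = (y_pred == y_true).mean()
```

On this noiseless linear toy problem the composed estimator recovers the labels almost perfectly; real video and audio features would of course be noisy and nonlinear.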
Lip Reading
How can the robot identify what is being said with all the noise?
Using Lip Reading!
Flowchart
Results
- A similar experiment is described in the "Multimodal Deep Learning" article
- Their result for lip reading of digits (label estimation from video using audio) was 29.4%
Conclusions and improvement suggestions
- The video feature that gave the best result was the matrix histogram, but it also consumes the largest amount of computation
- A more sophisticated algorithm for choosing the number of frames, or using a variable number of frames, could improve the results
- Using non-parametric regression instead of linear parametric regression could yield a more accurate estimation function
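The last suggestion can be illustrated on a toy 1-D problem: when the true relation is nonlinear, a non-parametric estimator (here Nadaraya-Watson kernel regression, chosen purely as an example) fits better than a linear least-squares line. The data, target function, and bandwidth are all hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a nonlinear target (illustrative only).
x_train = rng.uniform(-3, 3, size=300)
y_train = np.tanh(2 * x_train) + 0.05 * rng.normal(size=300)
x_test = np.linspace(-2.5, 2.5, 100)
y_target = np.tanh(2 * x_test)

# Linear parametric regression: least squares with an intercept.
X = np.column_stack([x_train, np.ones_like(x_train)])
coef, *_ = np.linalg.lstsq(X, y_train, rcond=None)
y_lin = coef[0] * x_test + coef[1]

# Non-parametric regression: Nadaraya-Watson kernel smoother.
def kernel_regress(xq, xs, ys, bandwidth=0.3):
    """Gaussian-kernel weighted average of the training targets."""
    w = np.exp(-0.5 * ((xq[:, None] - xs[None, :]) / bandwidth) ** 2)
    return (w @ ys) / w.sum(axis=1)

y_ker = kernel_regress(x_test, x_train, y_train)

mse_lin = np.mean((y_lin - y_target) ** 2)
mse_ker = np.mean((y_ker - y_target) ** 2)
```

The kernel estimator tracks the saturating shape of the target, while the straight line cannot; the trade-off is that the non-parametric fit needs the whole training set at prediction time.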




