Multi-view automatic lip-reading using neural network
Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei …, 2017•Springer
Abstract
It is well known that automatic lip-reading (ALR), also known as visual speech recognition (VSR), improves speech recognition performance in noisy environments and also has applications of its own. However, ALR is a challenging task due to the variety of lip shapes and the ambiguity of visemes (the basic units of visual speech information). In this paper, we tackle ALR as a classification task using an end-to-end neural network based on convolutional neural network and long short-term memory architectures. We conduct single-view, cross-view, and multi-view experiments in a speaker-independent setting, with various network configurations for integrating the multi-view data. We achieve average classification accuracies of 77.9%, 83.8%, and 78.6% on the single-view, cross-view, and multi-view experiments, respectively. These results exceed the best score (76%) among the preliminary single-view results provided by the ACCV 2016 workshop on multi-view lip-reading/audio-visual challenges. They also show that additional view information helps improve the performance of ALR with a neural network architecture.