Cross database training of audio-visual hidden Markov models for phone recognition

S Kalantari, D Dean, H Ghaemmaghami… - Proceedings of the …, 2015 - eprints.qut.edu.au
Proceedings of the 16th Annual Conference of the International …, 2015 - eprints.qut.edu.au
Speech recognition can be improved by using visual information, in the form of the speaker's lip movements, in addition to audio information. To date, state-of-the-art techniques for audio-visual speech recognition train their models on audio and visual data from the same database. In this paper, we present a new approach that makes use of one modality of an external dataset in addition to a given audio-visual dataset. This makes it possible to build more powerful models from extensive audio-only databases and adapt them to our comparatively smaller multi-stream databases. Results show that, for phone recognition, the presented approach outperforms the widely adopted synchronous hidden Markov models (HMMs) trained jointly on the audio and visual data of a given audio-visual database by 29% relative. It also outperforms external audio models trained on extensive external audio datasets by 5.5% relative, and internal audio models by 46% relative. We also show that the proposed approach is beneficial in noisy environments, where the audio source is affected by environmental noise.
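
For readers unfamiliar with the synchronous HMM baseline mentioned above, the following is a minimal sketch of how such a model is commonly scored: each state holds one emission density per stream, and the per-frame log-likelihoods are combined with stream weights. This is a generic illustration, not the paper's implementation. The function names, the use of single diagonal Gaussians, and the default weight of 0.7 are all assumptions made for the example.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def synchronous_state_loglik(audio_obs, visual_obs, audio_model, visual_model,
                             audio_weight=0.7):
    """Stream-weighted log-likelihood of one frame under one HMM state.

    audio_model and visual_model are hypothetical (mean, var) pairs.
    audio_weight is an assumed tuning parameter; the visual stream
    receives 1 - audio_weight.
    """
    log_audio = diag_gaussian_logpdf(audio_obs, *audio_model)
    log_visual = diag_gaussian_logpdf(visual_obs, *visual_model)
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


if __name__ == "__main__":
    # Toy parameters, chosen only to make the example runnable.
    audio_model = ([0.0, 0.0], [1.0, 1.0])
    visual_model = ([1.0], [2.0])
    score = synchronous_state_loglik([0.1, -0.2], [0.5], audio_model, visual_model)
    print(score)
```

Lowering `audio_weight` shifts the score toward the visual stream, which is one common way such models are tuned when the audio is noisy.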