Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Makino, Takaki; Liao, Hank; Assael, Yannis; Shillingford, Brendan; Garcia, Basilio; Braga, Otavio; Siohan, Olivier

arXiv:1911.04890 (eess)

[Submitted on 8 Nov 2019]

Title: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Authors: Takaki Makino (1), Hank Liao (1), Yannis Assael (2), Brendan Shillingford (2), Basilio Garcia (1), Otavio Braga (1), Olivier Siohan (1) ((1) Google Inc., (2) DeepMind)

Abstract: This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves on the state of the art on the LRS3-TED set.

Comments:	To be presented at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:1911.04890 [eess.AS]
	(or arXiv:1911.04890v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.1911.04890

Submission history

From: Takaki Makino
[v1] Fri, 8 Nov 2019 22:01:42 UTC (195 KB)
