Multimodal Speech Recognition with Unstructured Audio Masking

Srinivasan, Tejas; Sanabria, Ramon; Metze, Florian; Elliott, Desmond

arXiv:2010.08642 (cs)

[Submitted on 16 Oct 2020]

Abstract: Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words are systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.
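
The idea of masking arbitrary word segments can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how segments are selected or what replaces the masked audio, so the function name `rand_word_mask`, the per-word masking probability `mask_prob`, the frame-level word alignment format, and the choice to overwrite masked frames with zeros are all assumptions made for illustration.

```python
import random


def rand_word_mask(frames, word_segments, mask_prob=0.2, rng=None):
    """Zero out the frames of randomly chosen word segments.

    frames: list of per-frame feature vectors (lists of floats).
    word_segments: list of (start, end) frame indices, end exclusive,
        one pair per word (assumed to come from a forced alignment).
    mask_prob: hypothetical probability of masking each word independently.
    Returns the masked copy and the indices of the words that were masked.
    """
    if rng is None:
        rng = random.Random()
    masked = [list(f) for f in frames]  # copy so the input stays untouched
    masked_words = []
    for idx, (start, end) in enumerate(word_segments):
        # Any word may be masked, not only words from a fixed list.
        if rng.random() < mask_prob:
            for t in range(start, end):
                masked[t] = [0.0] * len(masked[t])  # assumed replacement: zeros
            masked_words.append(idx)
    return masked, masked_words


# Toy input: 6 frames of 2-dimensional features and three aligned words.
frames = [[1.0, 1.0] for _ in range(6)]
segments = [(0, 2), (2, 5), (5, 6)]
masked, which = rand_word_mask(frames, segments, mask_prob=0.5,
                               rng=random.Random(0))
```

Under these assumptions, the contrast with the fixed-word setting is that the set of masked words changes from call to call, so a model trained this way cannot rely on knowing in advance which words will be missing.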

Comments:	Accepted to NLP Beyond Text workshop, EMNLP 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2010.08642 [cs.CL]
	(or arXiv:2010.08642v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.08642

Submission history

From: Tejas Srinivasan
[v1] Fri, 16 Oct 2020 21:49:20 UTC (9,027 KB)
