MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

Chen, Junkun; Ma, Mingbo; Zheng, Renjie; Huang, Liang


arXiv:2010.11445 (cs)

[Submitted on 22 Oct 2020 (v1), last revised 8 Feb 2021 (this version, v2)]

Abstract: End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions depend heavily on source language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion on the speech side only, which can utilize speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving E2E-ST, but can also perform pre-training on any acoustic signals (including non-speech ones) without annotation. We conduct our experiments over 8 different translation directions. In the setting without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training. Pre-training MAM with arbitrary acoustic signals also yields an average improvement of +1.6 BLEU for those languages. Compared with the ASR multi-task learning solution, which relies on transcription during training, our pre-trained MAM model, which does not use transcription, achieves similar accuracy.
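
The abstract names the technique but does not spell out its mechanics. One plausible reading of "masked acoustic modeling" is that some frames of the input spectrogram are hidden and the encoder is trained to reconstruct them from the surrounding audio, which needs no transcription. The sketch below illustrates only that reading. The function names, the mask ratio, the span length, the zero fill, and the squared-error loss are all illustrative assumptions, not details taken from the abstract, and no model is included.

```python
import random


def mask_frames(spectrogram, mask_ratio=0.15, span=3, seed=0):
    """Hide random spans of frames in a non-empty list of frame vectors.

    Hypothetical helper: mask_ratio, span, and the zero fill are assumed
    values chosen for illustration only.
    Returns (masked_copy, sorted_indices_of_hidden_frames).
    """
    rng = random.Random(seed)
    n = len(spectrogram)
    masked = [list(frame) for frame in spectrogram]  # leave the input untouched
    target = max(1, int(n * mask_ratio))  # hide at least this many frames
    hidden = set()
    while len(hidden) < target:
        start = rng.randrange(n)
        for t in range(start, min(start + span, n)):
            hidden.add(t)
    for t in hidden:
        masked[t] = [0.0] * len(masked[t])  # assumed mask value: all zeros
    return masked, sorted(hidden)


def reconstruction_loss(predicted, original, hidden):
    """Mean squared error computed over the hidden frames only (assumed loss)."""
    total, count = 0.0, 0
    for t in hidden:
        for p, o in zip(predicted[t], original[t]):
            total += (p - o) ** 2
            count += 1
    return total / count
```

In a training loop under this reading, an encoder would receive the masked copy, and its output would be scored against the original frames with `reconstruction_loss`; a perfect reconstruction gives a loss of zero.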

Comments:	12 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2010.11445 [cs.CL]
	(or arXiv:2010.11445v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.11445

Submission history

From: Mingbo Ma
[v1] Thu, 22 Oct 2020 05:02:06 UTC (43,580 KB)
[v2] Mon, 8 Feb 2021 20:36:39 UTC (27,233 KB)
