Unsupervised pre-training for sequence to sequence speech recognition

Z Fan, S Zhou, B Xu - arXiv preprint arXiv:1910.12418, 2019 - arxiv.org

This paper proposes a novel approach to pre-training an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and unpaired transcripts. Our pre-training method is divided into two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training method integrates rich acoustic and linguistic knowledge into the seq2seq model, which benefits downstream automatic speech recognition (ASR) tasks. The unsupervised pre-training is performed on the AISHELL-2 dataset, and we apply the pre-trained model to multiple paired-data ratios of AISHELL-1 and HKUST. We obtain relative character error rate reductions (CERR) ranging from 38.24% to 7.88% on AISHELL-1 and from 12.00% to 1.20% on HKUST. In addition, we apply our pre-trained model to a cross-lingual case with the CALLHOME dataset. For all six languages in the CALLHOME dataset, our pre-training method makes the model consistently outperform the baseline.
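The acoustic pre-training objective described above (predicting masked speech feature chunks from context) might be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names `mask_chunks` and `masked_l1_loss`, the chunk length, the masking probability, the use of zeros as the mask value, and the choice of an L1 loss are all assumptions that the abstract does not specify.

```python
import random


def mask_chunks(features, chunk_len=4, mask_prob=0.15, seed=0):
    """Zero out randomly chosen fixed-length chunks of frames.

    `features` is a list of frames, each frame a list of floats.
    chunk_len, mask_prob, and zero-masking are illustrative assumptions.
    Returns (masked copy of the features, per-frame mask flags).
    """
    rng = random.Random(seed)
    num_frames = len(features)
    masked = [list(frame) for frame in features]
    is_masked = [False] * num_frames
    # Walk over non-overlapping chunks and mask each one with probability mask_prob.
    for start in range(0, num_frames, chunk_len):
        if rng.random() < mask_prob:
            for t in range(start, min(start + chunk_len, num_frames)):
                masked[t] = [0.0] * len(features[t])
                is_masked[t] = True
    return masked, is_masked


def masked_l1_loss(predicted, target, is_masked):
    """Mean absolute error computed over masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for pred, tgt, flag in zip(predicted, target, is_masked):
        if flag:
            total += sum(abs(p - q) for p, q in zip(pred, tgt))
            count += len(tgt)
    return total / count if count else 0.0
```

In such a setup, the encoder would receive the masked features and its output would be compared against the original features with `masked_l1_loss`, so that only the hidden chunks contribute to the training signal.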
