An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Fan, Ruchao; Chu, Wei; Chang, Peng; Xiao, Jing; Alwan, Abeer


arXiv:2106.09885 (eess)

[Submitted on 18 Jun 2021 (v1), last revised 22 Jul 2021 (this version, v2)]

Abstract: Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on the CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real-time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness to CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods, are 3.1%/7.2% on the Librispeech test clean/other sets and the CER is 5.4% on the Aishell1 test set, achieving a 7% to 21% relative WER/CER improvement. For the analyses, we plot attention weight distributions in the decoders to visualize the relationships between token-level acoustic embeddings. When the acoustic embeddings are visualized, we find that they behave similarly to word embeddings, which explains why the improved CASS-NAT performs similarly to AT.
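The trigger mask expansion mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact procedure, so everything below is an assumption made for illustration: the blank index `BLANK`, the helper names `token_boundaries` and `trigger_mask`, the rule that assigns blank frames to the preceding token, and the `expand` parameter (the number of extra frames allowed on each side). It is not the authors' implementation.

```python
from typing import List, Tuple

BLANK = 0  # hypothetical index of the CTC blank symbol


def token_boundaries(alignment: List[int], blank: int = BLANK) -> List[Tuple[int, int]]:
    """Return an inclusive (start, end) frame span per token of a CTC alignment.

    Assumption for this sketch: a token starts at the first frame of its
    non-blank run, and its span lasts until the frame before the next token
    starts. Leading blanks are given to the first token.
    """
    starts = []
    prev = blank
    for t, label in enumerate(alignment):
        # Standard CTC collapse rule: a new token begins on a non-blank
        # label that differs from the label of the previous frame.
        if label != blank and label != prev:
            starts.append(t)
        prev = label
    if not starts:
        return []
    spans = []
    for i, s in enumerate(starts):
        start = 0 if i == 0 else s
        end = starts[i + 1] - 1 if i + 1 < len(starts) else len(alignment) - 1
        spans.append((start, end))
    return spans


def trigger_mask(spans: List[Tuple[int, int]], num_frames: int, expand: int = 0) -> List[List[bool]]:
    """Build a (num_tokens x num_frames) boolean mask.

    mask[i][t] is True when token i may attend to frame t. Each span is
    widened by `expand` frames on both sides and clipped to the utterance.
    """
    mask = []
    for start, end in spans:
        lo = max(0, start - expand)
        hi = min(num_frames - 1, end + expand)
        mask.append([lo <= t <= hi for t in range(num_frames)])
    return mask


# Toy alignment over 8 frames: blank, 5, 5, blank, 5, 7, 7, blank
alignment = [0, 5, 5, 0, 5, 7, 7, 0]
spans = token_boundaries(alignment)
strict = trigger_mask(spans, len(alignment), expand=0)
relaxed = trigger_mask(spans, len(alignment), expand=1)
```

In this toy case the second token occupies a single frame (frame 4) under the strict mask. With `expand=1` its window grows to frames 3 through 5, so a one-frame error in the alignment no longer hides the relevant acoustics from that token.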

Comments:	Accepted to Interspeech 2021
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2106.09885 [eess.AS]
	(or arXiv:2106.09885v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2106.09885

Submission history

From: Ruchao Fan
[v1] Fri, 18 Jun 2021 02:58:30 UTC (538 KB)
[v2] Thu, 22 Jul 2021 00:56:50 UTC (538 KB)
