End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

Wang, Jixuan; Radfar, Martin; Wei, Kai; Chung, Clement

arXiv:2305.02937 (cs)

[Submitted on 4 May 2023 (v1), last revised 2 Jun 2023 (this version, v2)]

Abstract: It is challenging to extract semantic meanings directly from audio signals in spoken language understanding (SLU), due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics, which, however, require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings and use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset and 1.3% absolute improvement over the SOTA SLU model on the SLURP dataset.
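
Illustrative sketch (not the authors' code): the abstract describes training with joint CTC and SLU losses. The pure-Python example below shows one way such an objective could be formed, assuming the SLU loss is a cross-entropy over utterance-level labels and that the two terms are mixed by a convex weight. The names ctc_nll, cross_entropy, joint_loss and the parameter ctc_weight are hypothetical; the abstract does not say how the losses are weighted or combined.

```python
import math

def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward algorithm.

    log_probs: list of T frames, each a list of per-token log-probabilities.
    target:    list of token ids without blanks.
    """
    # Extended label sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)

    # Initialise: a path may start in the first blank or the first label.
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank between distinct labels
            new[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single utterance-level prediction."""
    return log_sum_exp(*logits) - logits[label]

def joint_loss(frame_log_probs, transcript, slu_logits, slu_label, ctc_weight=0.5):
    """Assumed convex combination of the CTC and SLU losses (ctc_weight is hypothetical)."""
    return (ctc_weight * ctc_nll(frame_log_probs, transcript)
            + (1.0 - ctc_weight) * cross_entropy(slu_logits, slu_label))

# Toy usage: 2 frames, 3 tokens (id 0 is blank), uniform frame posteriors.
# Target [1] has three valid alignments, each with probability 1/9,
# so the CTC term equals log(3).
uniform = [[math.log(1.0 / 3.0)] * 3 for _ in range(2)]
loss = joint_loss(uniform, [1], slu_logits=[0.0, 0.0], slu_label=0)
```

In practice the frame-level log-probabilities would come from the CTC head of the fine-tuned acoustic encoder and the SLU logits from a classifier on top of it; here both are toy inputs so the arithmetic can be checked by hand.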

Comments: ICASSP 2023
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2305.02937 [cs.CL] (or arXiv:2305.02937v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2305.02937

Submission history

From: Jixuan Wang
[v1] Thu, 4 May 2023 15:36:37 UTC (2,400 KB)
[v2] Fri, 2 Jun 2023 13:25:06 UTC (1,201 KB)
