Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

Gonen, Hila; Goldberg, Yoav

arXiv:1810.11895 (cs)

[Submitted on 28 Oct 2018 (v1), last revised 10 Nov 2019 (this version, v3)]

Abstract: We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR-directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup that is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.
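
The abstract does not spell out the mechanics of the evaluation setup. As a rough illustration only, one way an ASR-decoupled evaluation could be organized is as a ranking task: a model scores each sentence in a small set of candidates, and it is counted correct when the reference sentence gets the highest score. The sketch below assumes that framing. The data layout, the names `ranking_accuracy` and `toy_score`, and the example sentences are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout (an assumption, not the paper's format):
# each item is (candidate sentences, index of the reference sentence).
EvalSet = Sequence[Tuple[List[str], int]]


def ranking_accuracy(score: Callable[[str], float], eval_sets: EvalSet) -> float:
    """Fraction of candidate sets in which the reference sentence scores highest."""
    if not eval_sets:
        raise ValueError("eval_sets must not be empty")
    correct = 0
    for candidates, gold_index in eval_sets:
        scores = [score(sentence) for sentence in candidates]
        # Index of the best-scoring candidate; ties go to the earliest one.
        predicted = max(range(len(candidates)), key=scores.__getitem__)
        correct += int(predicted == gold_index)
    return correct / len(eval_sets)


def toy_score(sentence: str) -> float:
    # Stand-in scorer for illustration only: prefers fewer tokens.
    # A real scorer would be a language model's sentence score.
    return -float(len(sentence.split()))


# Made-up candidate sets, purely to exercise the function.
toy_sets = [
    (["quiero ir to the store", "quiero ir to the the store"], 0),
    (["she said que no no", "she said que no"], 1),
]

accuracy = ranking_accuracy(toy_score, toy_sets)
```

Because the scorer is just a function from sentence to number, the same loop would accept either a generative model's log-probability or a discriminatively trained scorer, which is the kind of comparison the abstract describes.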

Comments:	EMNLP 2019
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1810.11895 [cs.CL]
	(or arXiv:1810.11895v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1810.11895

Submission history

From: Hila Gonen
[v1] Sun, 28 Oct 2018 22:15:32 UTC (38 KB)
[v2] Tue, 24 Sep 2019 11:38:52 UTC (40 KB)
[v3] Sun, 10 Nov 2019 07:11:04 UTC (39 KB)
