A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Godard, P.; Adda, G.; Adda-Decker, M.; Benjumea, J.; Besacier, L.; Cooper-Leavitt, J.; Kouarata, G-N.; Lamel, L.; Maynard, H.; Mueller, M.; Rialland, A.; Stueker, S.; Yvon, F.; Zanon-Boito, M.

arXiv:1710.03501 (cs)

[Submitted on 10 Oct 2017 (v1), last revised 15 Feb 2018 (this version, v3)]

Abstract: Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages have neither such resources nor a stable orthography. Systems built under these near-zero-resource conditions are promising not only for speech technology but also for computational language documentation, whose goal is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks include automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It comprises 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they use a non-standard graphemic form close to the language's phonology. We describe how the data was collected, cleaned, and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.

Comments:	accepted to LREC 2018
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1710.03501 [cs.CL]
	(or arXiv:1710.03501v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1710.03501

Submission history

From: Laurent Besacier
[v1] Tue, 10 Oct 2017 10:39:20 UTC (215 KB)
[v2] Wed, 11 Oct 2017 08:27:29 UTC (106 KB)
[v3] Thu, 15 Feb 2018 06:32:01 UTC (178 KB)
