A four-dialect treebank for Occitan: Building process and parsing experiments

A Miletić, M Bras, M Vergez-Couret, L Esher, C Poujade, J Sibille
Proceedings of the 7th Workshop on NLP for Similar Languages …, 2020 - aclanthology.org
Abstract
Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, is not standardized, and displays substantial diatopic variation, resulting in a rich system of dialects. Recently, a first treebank for this language was created. However, this corpus is based exclusively on texts in the Lengadocian dialect. Our paper describes work aimed at extending the existing corpus with content in three new dialects, namely Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and experiments allowing us to identify the most efficient method for further enrichment of the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages, even though the latter's training corpus is much larger (20K vs. 900K tokens). The results of the native Occitan models show a strong impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, incorporating several Occitan varieties, will facilitate the training of robust NLP tools capable of processing all kinds of Occitan texts.