Comparing the Performance of CNNs and Shallow Models for Language Identification

Andrea Ceolin


Abstract
In this work we compare the performance of convolutional neural networks and shallow models on three of the four language identification shared tasks proposed in the VarDial Evaluation Campaign 2021. In our experiments, convolutional neural networks and shallow models yielded comparable performance on the Romanian Dialect Identification (RDI) and Dravidian Language Identification (DLI) shared tasks once the training data was augmented, while an ensemble of support vector machines and Naïve Bayes models was the best-performing model on the Uralic Language Identification (ULI) task. While the deep learning models did not achieve state-of-the-art performance on the tasks and tended to overfit the data, the ensemble method was one of two methods that beat the existing baseline on the first track of the ULI shared task.
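As an illustration of the kind of shallow ensemble described above, the sketch below combines a linear SVM and a multinomial Naïve Bayes classifier over character n-gram features with majority voting, using scikit-learn. The feature settings, estimator choices, and the texts/labels variables are illustrative assumptions for this sketch, not the exact configuration used in the paper.

# Minimal sketch of an SVM + Naive Bayes ensemble for language identification.
# Assumes scikit-learn and two parallel lists, `texts` and `labels` (hypothetical).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

def build_ensemble():
    # Character n-grams are a common representation for dialect/language ID;
    # the (1, 4) range here is an illustrative choice, not the paper's setting.
    features = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4))
    ensemble = VotingClassifier(
        estimators=[
            ("svm", LinearSVC()),     # linear support vector machine
            ("nb", MultinomialNB()),  # multinomial Naive Bayes
        ],
        voting="hard",                # majority vote over predicted labels
    )
    return make_pipeline(features, ensemble)

# Usage (hypothetical data):
# model = build_ensemble()
# model.fit(texts, labels)
# predictions = model.predict(test_texts)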
Anthology ID:
2021.vardial-1.12
Volume:
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
April
Year:
2021
Address:
Kiyv, Ukraine
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer, Tommi Jauhiainen
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
102–112
URL:
https://aclanthology.org/2021.vardial-1.12
Cite (ACL):
Andrea Ceolin. 2021. Comparing the Performance of CNNs and Shallow Models for Language Identification. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 102–112, Kiyv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Comparing the Performance of CNNs and Shallow Models for Language Identification (Ceolin, VarDial 2021)
PDF:
https://aclanthology.org/2021.vardial-1.12.pdf
Code
andreaceolin/vardial2021
Data
MOROCO