Multi-Dialect Arabic BERT for Country-Level Dialect Identification

B Talafha, M Ali, ME Za'ter, H Seelawi, I Tuffaha… - arXiv preprint arXiv …, 2020 - arxiv.org
arXiv preprint arXiv:2007.05612, 2020
Arabic dialect identification is a complex problem because of a number of inherent properties of the language itself. In this paper, we present the experiments conducted and the models developed by our competing team, Mawdoo3 AI, on the way to our winning solution to subtask 1 of the Nuanced Arabic Dialect Identification (NADI) shared task. The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries. The competition organizers also provide an unlabeled corpus of 10M tweets from the same domain for optional use. Our winning solution is an ensemble of different training iterations of our pre-trained BERT model, which achieved a micro-averaged F1-score of 26.78% on the subtask. We publicly release the pre-trained language model component of our winning solution under the name Multi-dialect-Arabic-BERT for interested researchers.
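The abstract does not say how the training iterations were combined, so the following is only a minimal sketch under an assumption: that the ensemble averages per-class probabilities across runs and takes the most probable class. The function names `ensemble_predict` and `micro_f1` and the toy data are hypothetical and not taken from the paper. The sketch also shows how a micro-averaged F1-score is computed by pooling true positives, false positives, and false negatives over all classes.

```python
def ensemble_predict(prob_runs):
    """Average per-class probabilities across runs, then take the argmax.

    prob_runs: list of runs; each run is a list (one entry per example)
    of per-class probability lists. Averaging is an assumed combination
    rule, not necessarily the one used in the paper.
    """
    n_runs = len(prob_runs)
    predictions = []
    for i in range(len(prob_runs[0])):
        n_classes = len(prob_runs[0][i])
        avg = [sum(run[i][c] for run in prob_runs) / n_runs
               for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): 3 runs, 4 examples, 3 classes.
runs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.4, 0.5, 0.1]],
    [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]],
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.3, 0.2]],
]
y_true = [0, 1, 2, 1]
y_pred = ensemble_predict(runs)
score = micro_f1(y_true, y_pred)
```

For single-label classification such as country-level dialect identification, each misclassified example counts once as a false positive and once as a false negative, so the micro-averaged F1-score equals plain accuracy.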