Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling

Hande, Adeep; Puranik, Karthik; Yasaswini, Konthala; Priyadharshini, Ruba; Thavareesan, Sajeetha; Sampath, Anbukkarasi; Shanmugavadivel, Kogilavani; Thenmozhi, Durairaj; Chakravarthi, Bharathi Raja


arXiv:2108.12177 (cs)

[Submitted on 27 Aug 2021]


Abstract: Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free expression of thoughts and facts in text, images, and video, there is a pressing need to screen them to protect individuals and groups from offensive content targeted at them. Our work classifies code-mixed social media comments and posts in the Dravidian languages Tamil, Kannada, and Malayalam. We aim to improve offensive language identification by generating pseudo-labels on the dataset. A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language (Kannada, Malayalam, or Tamil) and then generating pseudo-labels for the transliterated dataset. The two datasets are combined using the generated pseudo-labels to create a custom dataset called CMTRA. As Dravidian languages are under-resourced, our approach increases the amount of training data for the language models. We fine-tune several recent pretrained language models on the newly constructed dataset. We extract the pretrained language embeddings and pass them on to recurrent neural networks. We observe that fine-tuning ULMFiT on the custom dataset yields the best results on the code-mixed test sets of all three languages. Our approach yields the best results among the benchmarked models on Tamil-English, achieving a weighted F1-score of 0.7934, while scoring competitive weighted F1-scores of 0.9624 and 0.7306 on the code-mixed test sets of Malayalam-English and Kannada-English, respectively.
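The dataset construction and the evaluation metric described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes that pseudo-labels come from some classifier applied to the transliterated text. The names build_combined_dataset, toy_transliterate, and toy_predict are hypothetical, and the two toy functions are placeholders for a real transliteration tool and a real trained model. The weighted_f1 function follows the standard definition of support-weighted F1.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def build_combined_dataset(
    labeled: List[Example],
    transliterate: Callable[[str], str],
    predict: Callable[[str], str],
) -> List[Example]:
    """Return the original examples plus pseudo-labeled transliterations."""
    pseudo: List[Example] = []
    for text, _gold in labeled:
        native = transliterate(text)  # code-mixed text -> native script
        # Assumption: the label is predicted by a model, not copied from gold.
        pseudo.append((native, predict(native)))
    return labeled + pseudo


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Per-class F1, averaged with weights proportional to class support."""
    if not y_true:
        return 0.0
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


# Hypothetical stand-ins so the sketch runs without external tools.
def toy_transliterate(text: str) -> str:
    return text.upper()  # placeholder for a real transliteration step


def toy_predict(text: str) -> str:
    return "Offensive" if "BAD" in text else "Not_offensive"  # placeholder model


labeled = [("bad padam", "Offensive"), ("nalla padam", "Not_offensive")]
combined = build_combined_dataset(labeled, toy_transliterate, toy_predict)
print(len(combined), "training examples after adding pseudo-labeled data")
```

In this sketch the combined set is twice the size of the original, which mirrors the abstract's point that the approach increases the amount of training data. The quality of the pseudo-labels depends entirely on the model that produces them.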

Comments:	27 pages, 12 figures, 10 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2108.12177 [cs.CL]
	(or arXiv:2108.12177v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2108.12177

Submission history

From: Adeep Hande
[v1] Fri, 27 Aug 2021 08:43:08 UTC (641 KB)
