SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

Rosenthal, Sara; Atanasova, Pepa; Karadzhov, Georgi; Zampieri, Marcos; Nakov, Preslav

arXiv:2004.14454 (cs)

[Submitted on 29 Apr 2020 (v1), last revised 24 Sep 2021 (this version, v2)]

Abstract: The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.

Comments:	offensive language, hate speech, cyberbullying, cyber-aggression, taxonomy for offensive language identification
Subjects:	Computation and Language (cs.CL)
MSC classes:	68T50, 68T07
ACM classes:	F.2.2; I.2.7
Cite as:	arXiv:2004.14454 [cs.CL]
	(or arXiv:2004.14454v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2004.14454
Journal reference:	ACL-2021 (Findings)

Submission history

From: Preslav Nakov
[v1] Wed, 29 Apr 2020 20:02:58 UTC (185 KB)
[v2] Fri, 24 Sep 2021 16:36:35 UTC (178 KB)
