Training Dynamic based data filtering may not work for NLP datasets

Talukdar, Arka; Dagar, Monika; Gupta, Prachi; Menon, Varun

arXiv:2109.09191 (cs)

[Submitted on 19 Sep 2021]

Abstract: The recent increase in dataset size has brought about significant advances in natural language understanding. These large datasets are usually collected through automation (search engines or web crawlers) or crowdsourcing, both of which inherently introduce incorrectly labeled data. Training on such datasets leads to memorization and poor generalization, so it is pertinent to develop techniques that help identify and isolate mislabeled data. In this paper, we study the applicability of the Area Under the Margin (AUM) metric for identifying and removing or rectifying mislabeled examples in NLP datasets. We find that the AUM metric can filter mislabeled samples in NLP datasets, but it also removes a significant number of correctly labeled points and leads to the loss of a large amount of relevant language information. We show that models rely on distributional information rather than on syntactic and semantic representations.
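For readers unfamiliar with the metric, the following is a minimal sketch of how an AUM score is commonly computed, not the authors' implementation. It assumes the usual definition: a sample's margin at one epoch is the logit of its assigned label minus the largest logit among the other classes, and AUM is the mean of those margins over training epochs. The function names, the nested-list input layout, and the fixed `threshold` parameter are hypothetical choices made for illustration; the abstract does not say how the cutoff is chosen.

```python
from typing import List, Sequence


def margin(logits: Sequence[float], label: int) -> float:
    """Assigned-label logit minus the largest logit among all other classes."""
    assigned = logits[label]
    largest_other = max(v for i, v in enumerate(logits) if i != label)
    return assigned - largest_other


def area_under_margin(
    logits_per_epoch: Sequence[Sequence[Sequence[float]]],
    labels: Sequence[int],
) -> List[float]:
    """Mean margin per sample across epochs.

    logits_per_epoch is indexed as [epoch][sample][class] (assumed layout).
    """
    n_epochs = len(logits_per_epoch)
    totals = [0.0] * len(labels)
    for epoch_logits in logits_per_epoch:
        for i, (logits, label) in enumerate(zip(epoch_logits, labels)):
            totals[i] += margin(logits, label)
    return [total / n_epochs for total in totals]


def flag_low_aum(aum: Sequence[float], threshold: float) -> List[int]:
    """Indices of samples whose AUM falls below a (hypothetical) cutoff."""
    return [i for i, value in enumerate(aum) if value < threshold]


if __name__ == "__main__":
    # Toy logits for 2 epochs, 3 samples, 2 classes (made-up numbers).
    logits = [
        [[2.0, 0.0], [0.5, 1.0], [0.0, 1.0]],
        [[3.0, 0.0], [0.0, 2.0], [1.0, 2.0]],
    ]
    labels = [0, 0, 1]
    scores = area_under_margin(logits, labels)
    print(scores)
    print(flag_low_aum(scores, threshold=0.0))
```

A persistently negative score means the model keeps preferring a class other than the assigned label, which is why low-AUM samples are treated as candidates for mislabeling. The abstract's caution applies directly here: a correctly labeled but hard or rare example can also score low, so a cutoff like the one above may discard useful data.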

Comments:	EMNLP BlackBoxNLP 2021
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2109.09191 [cs.CL]
	(or arXiv:2109.09191v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.09191

Submission history

From: Varun Menon
[v1] Sun, 19 Sep 2021 18:50:45 UTC (2,056 KB)
