Competency Problems: On Finding and Removing Artifacts in Language Data

Gardner, Matt; Merrill, William; Dodge, Jesse; Peters, Matthew E.; Ross, Alexis; Singh, Sameer; Smith, Noah A.


arXiv:2104.08646 (cs)

[Submitted on 17 Apr 2021 (v1), last revised 28 Dec 2021 (this version, v3)]

Abstract: Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
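The abstract mentions a simple statistical test for dataset artifacts without spelling it out. As a rough illustration only, the sketch below assumes the test can be approximated by a one-sample proportion z-test: for each token, compare the observed rate of a label among examples containing that token against a uniform null of 1 / (number of labels). The function names `z_statistic` and `flag_tokens`, the whitespace-token features, and the threshold of 3.0 are all assumptions made for this example, not details taken from the paper, and no multiple-comparison correction is applied.

```python
import math
from collections import Counter


def z_statistic(p_hat: float, n: int, p0: float) -> float:
    """One-sample proportion z-statistic for observed rate p_hat over n trials,
    against the null probability p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)


def flag_tokens(examples, num_labels, z_threshold=3.0):
    """Return (token, label, count, p_hat, z) rows whose z exceeds z_threshold.

    examples: iterable of (tokens, label) pairs.
    The uniform null p0 = 1 / num_labels and the threshold are assumptions
    of this sketch.
    """
    p0 = 1.0 / num_labels
    token_counts = Counter()  # number of examples containing each token
    pair_counts = Counter()   # number of examples containing token with label
    for tokens, label in examples:
        for tok in set(tokens):  # count each token once per example
            token_counts[tok] += 1
            pair_counts[(tok, label)] += 1

    flagged = []
    for (tok, label), k in pair_counts.items():
        n = token_counts[tok]
        p_hat = k / n
        z = z_statistic(p_hat, n, p0)
        if z > z_threshold:
            flagged.append((tok, label, n, p_hat, z))
    # Strongest deviations from the null first.
    flagged.sort(key=lambda row: row[4], reverse=True)
    return flagged


# Toy data: "amazing" always co-occurs with label 1; "movie" is balanced.
toy = [(["amazing", "movie"], 1)] * 20 + [(["movie"], 0)] * 20
flagged = flag_tokens(toy, num_labels=2)
for tok, label, n, p_hat, z in flagged:
    print(f"{tok!r} -> label {label}: n={n}, p_hat={p_hat:.2f}, z={z:.2f}")
```

On the toy data, only "amazing" is flagged, since "movie" appears equally often with both labels. Under the competency-problem view described in the abstract, any token that deviates significantly from the uniform rate would be treated as an artifact, so a real analysis would also need to account for testing many tokens at once.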

Comments:	EMNLP 2021. This version fixes an error in Proposition 1 and adds discussion (the EMNLP camera-ready version is unfixed); v3 adds the acknowledgements that we forgot to put into v2.
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2104.08646 [cs.CL]
	(or arXiv:2104.08646v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2104.08646

Submission history

From: Matt Gardner
[v1] Sat, 17 Apr 2021 21:34:10 UTC (970 KB)
[v2] Tue, 30 Nov 2021 21:00:33 UTC (1,007 KB)
[v3] Tue, 28 Dec 2021 20:03:28 UTC (1,007 KB)
