Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions

Elazar, Yanai; Kassner, Nora; Ravfogel, Shauli; Feder, Amir; Ravichander, Abhilasha; Mosbach, Marius; Belinkov, Yonatan; Schütze, Hinrich; Goldberg, Yoav

arXiv:2207.14251 (cs)

[Submitted on 28 Jul 2022 (v1), last revised 24 Mar 2023 (this version, v2)]

Abstract: Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.

Comments:	We received criticism regarding the validity of the causal formulation in this paper. We will address it in an upcoming version.
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2207.14251 [cs.CL]
	(or arXiv:2207.14251v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2207.14251

Submission history

From: Yanai Elazar
[v1] Thu, 28 Jul 2022 17:36:24 UTC (327 KB)
[v2] Fri, 24 Mar 2023 07:18:59 UTC (327 KB)
