Task Contamination: Language Models May Not Be Few-Shot Anymore
Abstract
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs’ training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.
1 Introduction
Recently there has been much interest in few-shot methods, in particular in-context learning (ICL, Brown et al. 2020) with large language models. In-context learning has the benefit of yielding excellent performance while requiring very little data, sometimes relying on only a few examples for the task. These promising results have led to an explosion of work on in-context learning methods across a wide variety of tasks (Schick and Schütze 2021a, b; Poesia et al. 2022; Hu et al. 2022b), including prompt tuning methods (Qin and Eisner 2021; Lester, Al-Rfou, and Constant 2021), chain-of-thought methods (Wei et al. 2022; Wang, Deng, and Sun 2022; Wang et al. 2023; Aiyappa et al. 2023), and tool-based methods (Schick et al. 2023; Yang et al. 2023).
However, along with this explosion of work in ICL, many have raised concerns about data contamination (Brown et al. 2020; Jacovi et al. 2023), that is, prior knowledge of data or a task which is thought to be unseen by the model. Data contamination can happen in multiple ways. One common contaminant is test data contamination, the inclusion of test data examples and labels in the pre-training data. Another contaminant for zero or few-shot methods, which we call task contamination, is the inclusion of task training examples in the pre-training data, effectively making the evaluation no longer zero or few-shot. [1]
[1] Zero-shot evaluation is evaluation where a model has seen zero examples for the task. Few-shot, or N-shot, where N is a small number, is where the model has seen N examples for the task. Prior work has sometimes defined zero-shot for multi-class classification as predicting classes that have never been seen during training, but most recent work does not use this definition.
Simply evaluating the scope of this contamination is difficult to do (Magar and Schwartz 2022; Jacovi et al. 2023). Closed models do not release their pre-training data. While open models give the sources, crawling the sites to obtain that data is non-trivial, especially if the data has changed from when it was crawled. For models that are pre-trained on freely available pre-training corpora, simply grepping for examples in the pre-training corpora may not be reliable due to differences in data formatting (such as XML vs. CSV) or differences in text normalization and tokenization.
In this paper we empirically measure the scope of task contamination for few-shot methods across various models and tasks. To the best of our knowledge, we are the first to systematically analyze this problem. We evaluate 12 different models, ranging from closed GPT-3 series models (OpenAI 2023b) to open models including Fairseq MoE (Artetxe et al. 2022), GPT-J (Wang and Komatsuzaki 2021), Bloom (Scao et al. 2022), OPT (Zhang et al. 2022), LLaMA (Touvron et al. 2023), Alpaca (Taori et al. 2023), and Vicuna (Chiang et al. 2023) on 16 classification tasks and 1 semantic parsing task.
We analyze each model on datasets created before its training data was crawled from the internet versus datasets created afterward. We find that models are significantly more likely to outperform the majority baseline on datasets created before the LLM training data was collected (Fig. 1).
We perform training data inspection and task example extraction to look for possible task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, models rarely demonstrate statistically significant improvements over simple majority baselines across a range of tasks, in both zero and few-shot settings (Fig. 2).
As a case study, we also attempt to conduct a membership inference attack for a semantic parsing task (Spider, Yu et al. 2019) for all models in our analysis. We find a strong correlation (R=.88) between the number of extracted examples and the accuracy of the model on the final task (Fig. 6). This is strong evidence that the performance increase in zero-shot performance on this task is due to task contamination.
Additionally, we look closely at the GPT-3 series models. We find that training examples can be extracted from the GPT-3 models, and that the number of extractable training examples increased with each version from davinci to GPT-3.5-turbo, closely tracking the increase in zero-shot performance of the GPT-3 models on that task (Fig. 2). This is strong evidence that the increase in performance on these tasks across GPT-3 models from davinci to GPT-3.5-turbo is due to task contamination.
2 Overview
We employ four methods of measuring task contamination.
1. Training data inspection: Search through the training data to find task training examples.
2. Task example extraction: Extract task examples from an existing model. Extraction is only possible with instruction-tuned models. This analysis can also be done for training data or testing data extraction (Sainz et al. 2023b). Note: For the purposes of detecting task contamination, the extracted task examples need not exactly match existing training data examples. Any examples demonstrating the task indicate possible contamination for zero and few-shot learning.
3. Membership inference: This method only applies to generation tasks. Check if the model-generated content for an input instance is exactly the same as the original dataset (Hu et al. 2022a). If there is an exact match, we can infer it is a member of the LLM’s training data. This differs from task example extraction because generated output is checked for an exact match. Exact matches for an open-ended generation task strongly indicate the model has seen those examples during training. The model is not just good, it is psychic: it has knowledge of the exact phrasing used in the data. [2]
4. Chronological analysis: For a set of models whose training data has been collected at a range of known times, measure performance on a dataset with a known release date, and check for evidence of contamination using chronological evidence.
[2] Exact matches for the input do not indicate task contamination because the input text could have been seen, but it needs to be paired with the output label for task contamination.
The first three methods have high precision, but suffer from low recall. If task examples are found in the training data, then it is certain that the model has seen examples of the task. But because of data formatting variations, variations in keywords used to define the task, and the size of the dataset, the absence of evidence for contamination using the first three methods is not evidence of absence.
The fourth method, chronological analysis, is high recall, but low precision. If the performance is high due to task contamination, then a chronological analysis will have a high chance of catching it. But other factors could also contribute to increased performance over time, so the precision is low.
Due to their inherent trade-offs, we employ all four methods for detecting task contamination. With all four methods, we find strong evidence of task contamination for some combinations of models and datasets. We begin with a chronological analysis for all models and datasets we tested, since it has the highest potential for catching possible contamination (§4). We then look for further evidence of task contamination using training data inspection (§5) and task example extraction (§6). Next we look at the performance of LLMs on tasks without contamination (§7), and conclude with additional analysis using a membership inference attack (§8).
3 Models and Datasets
Models
We experimented with 12 models. Table 1 lists these models, along with the collection dates of the training data and release dates for each model. [3] The 12 models we use can be further categorized into two broad groups: (1) five proprietary GPT-3 series models ("closed") and (2) seven open models with free access to their weights ("open"). Comparing models from these two groups yields valuable insights into the difference between proprietary, high-performance models like those from the GPT-3 series and more accessible, community-driven open models. More information about hyperparameters for these models is given in Appendix A.
[3] GPT-3 series training data collection dates are obtained from https://platform.openai.com/docs/models/overview
Model | Training data | Release |
---|---|---|
davinci | Up to Oct 2019 | May 2020 |
davinci-001 | Up to Oct 2019 | Jun 2020 |
davinci-002 | Up to Jun 2021 | Jan 2022 |
davinci-003 | Up to Jun 2021 | Nov 2022 |
GPT-3.5-T | Up to Sep 2021 | Mar 2023 |
Model | Training data | Release |
---|---|---|
Fairseq MoE | Up to Feb 2019 | Dec 2021 |
GPT-J | Up to 2020 | Jun 2021 |
OPT | Up to Oct 2021 | May 2022 |
BLOOM | Prior to Aug 2022 | Nov 2022 |
LLaMA | Up to Aug 2022 | Feb 2023 |
Alpaca | From davinci-003 | Mar 2023 |
Vicuna | From ChatGPT | Mar 2023 |
Datasets
Zero-shot and few-shot evaluations involve models making predictions on tasks that they have never seen or seen only a few times during training. The key premise is that the models have no prior exposure to the particular task at hand, ensuring a fair evaluation of their learning capacity. Contaminated models, however, give a false impression of their zero- or few-shot competency, as they have already been trained on task examples during pretraining. Detecting such contamination is easier when datasets are ordered chronologically, since any overlap or anomaly stands out. We therefore split the datasets into two categories: datasets released before or after January 1st, 2021, identified as pre-2021 datasets and post-2021 datasets. We use this division to analyze the zero-shot or few-shot performance difference between older datasets and newer ones, with the same division applied for all LLMs. We also use a per-LLM division into pre-collection and post-collection datasets, which distinguishes datasets that the model was possibly trained on (pre-collection datasets) from the datasets it could not have been trained on (post-collection datasets). Table 1 presents the creation time of the training data for each model. Information about the datasets can be found in Appendix B, while release dates for each dataset are listed in Table 2.
Pre-2021 dataset | Year | Post-2021 dataset | Year |
---|---|---|---|
RTE | 2009 | StrategyQA | 2021 |
WNLI | 2011 | NewsMTSC-MT | 2021 |
COPA | 2011 | NewsMTSC-RW | 2021 |
SST-2 | 2013 | NLI4Wills | 2022 |
MRPC | 2015 | CREPE | 2023 |
QNLI | 2018 | FOMC | 2023 |
CB | 2019 | NewsMet | 2023 |
WiC | 2019 | | |
BoolQ | 2019 | | |
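The per-LLM split into pre-collection and post-collection datasets can be illustrated with a minimal sketch. The function name and the comparison rule (release date on or before the cutoff counts as pre-collection) are assumptions for illustration. Table 2 lists only release years, so the month and day below are placeholders.

```python
from datetime import date

def split_by_collection_date(dataset_release, collection_cutoff):
    """Partition datasets into pre- and post-collection lists for one model.

    Assumption: a dataset released on or before the model's training data
    collection cutoff is treated as pre-collection.
    """
    pre, post = [], []
    for name, released in dataset_release.items():
        if released <= collection_cutoff:
            pre.append(name)
        else:
            post.append(name)
    return pre, post

# Years follow Table 2; month and day are placeholders, not real release dates.
releases = {
    "RTE": date(2009, 1, 1),
    "BoolQ": date(2019, 1, 1),
    "NLI4Wills": date(2022, 1, 1),
    "CREPE": date(2023, 1, 1),
}

# davinci-002: training data "Up to Jun 2021" (Table 1); the day is a placeholder.
pre, post = split_by_collection_date(releases, date(2021, 6, 30))
print(pre)   # ['RTE', 'BoolQ']
print(post)  # ['NLI4Wills', 'CREPE']
```

Datasets released in the same year as a model's cutoff would need month-level release dates to be classified reliably.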
4 Chronological Analysis
We start with a chronological analysis. This allows us to detect patterns of possible task contamination across the LLMs and datasets we examine.
Analysis of Pre- and Post-collection Datasets
We perform a global chronological analysis across all datasets and LLMs. We look at the difference between performance on datasets released before the training data collection date for the LLM (pre-collection) versus after the training data collection date (post-collection). Specifically, we focus on whether the model is above the majority baseline. [4] In this section we use this measure, instead of averaging the performance across datasets, to avoid datasets with large performance differences dominating the analysis.
[4] The majority baseline for a classification task is the performance of a model that labels every example with the label that occurs most frequently in the dataset.
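As a concrete illustration of the majority baseline, the following minimal sketch computes the accuracy obtained by always predicting the most frequent label. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def beats_majority_baseline(model_accuracy, labels):
    """True if the model's accuracy is strictly above the majority baseline."""
    return model_accuracy > majority_baseline_accuracy(labels)

# Hypothetical toy label distribution: 6 of 10 examples share one label.
toy_labels = ["entailment"] * 6 + ["not_entailment"] * 4
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

On a skewed dataset the baseline can be high, which is why a raw accuracy figure is only meaningful relative to it.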
With 12 models and 16 datasets, we have 192 model/dataset combinations. For 136 of these combinations, the dataset was released before the LLM training data collection date (pre-collection), and for 56 the dataset was released after (post-collection). For both sets, we compute the percentage of model/dataset combinations for which the model beats the majority baseline, both zero-shot and few-shot. The results are shown in Fig. 1. We find that for datasets released prior to the creation of the LLM, it is more likely the LLM beats the majority baseline for both zero and few-shot settings. Using the Mann-Whitney U test (Mann and Whitney 1947), we find the difference in those above the majority baseline between pre- and post-collection populations to be statistically significant at the 99% confidence level for both zero and few-shot settings.
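The comparison between the two populations can be sketched as follows. This is a minimal illustration, not the exact procedure used in our experiments: it assumes each model/dataset combination is encoded as 1 (above the majority baseline) or 0, and it uses the normal approximation to the Mann-Whitney U distribution with a tie correction and no continuity correction. The indicator values are made up.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                      key=lambda pair: pair[0])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at sorted positions i..j-1 share the average of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all observations identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value

# Made-up indicators: 1 = model beats the majority baseline, 0 = it does not.
pre_collection = [1] * 30 + [0] * 10
post_collection = [1] * 5 + [0] * 15
u, p = mann_whitney_u(pre_collection, post_collection)
print(f"U = {u}, p = {p:.5f}")
```

With binary indicators almost all observations are tied, so the tie correction in the variance matters considerably.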
For some model/dataset combinations, the performance difference above the majority baseline is small, so we also compute the percentage of model/dataset combinations for which the model beats the majority baseline and the difference above the majority baseline is statistically significant at the 99% level, calculated using Student's t-test (Student 1908) (Fig. 1, darker). Again, we find that for datasets released prior to the creation of the LLM, it is far more likely the LLM beats the majority baseline with statistical significance for both zero and few-shot settings. Similarly, the Mann-Whitney U test indicates these differences between pre and post are statistically significant at the 99% confidence level for both zero and few-shot settings.
These results indicate the possibility of task contamination for open LLMs and GPT-3 series LLMs.
Caveats
There are two considerations we need to make in the global chronological analysis.
First, datasets may have become more difficult over time, meaning LLMs are less likely to outperform the majority baseline despite the lack of task contamination. To account for this, we carefully review the tasks and remove tasks known to be difficult for LLMs, such as GSM8K (Cobbe et al. 2021) and TrackingShuffledObjects (Srivastava et al. 2023). The remaining datasets all have acceptable performance using fine-tuned pretrained language models (PLMs), and, importantly, there is no correlation between release date and the performance of fine-tuned PLMs () on our datasets, as shown in Fig. 4.
Secondly, post-collection datasets, despite being released after data collection, may still suffer from contamination. For example, the FOMC dataset (Shah, Paturi, and Chava 2023) was officially released post-collection for the GPT-3 series, but the performance of subsequent versions of GPT-3 is notably high. This may be the result of the authors’ preliminary experimentation with the GPT-3 series (as stated in their paper), as OpenAI may have then utilized their experimental data for model updates.
Analysis of Pre- and Post-collection for Individual LLMs
In this section, we consider the performance on pre- and post-collection datasets for each LLM individually (see Fig. 2). We find the difference in performance between the two categories to be statistically significant at 95% confidence according to the paired sign test (Dixon and Mood 1946).
We plot the percentage of datasets on which the model beats the majority baseline as in the last section, but for each LLM individually. The results are shown in Fig. 2. We observe that the global trend from the previous section holds across models with the full range of dates, further indicating that the absolute date of the dataset is not the main factor; rather, the date of the dataset relative to the training data collection date for the LLM is the more important factor. (Note: because of the recency of BLOOM, LLaMA, Alpaca, and Vicuna, we have fewer datasets in our experiments released after their training data collection date.) The results indicate the possibility of task contamination for both open LLMs and GPT-3 series LLMs, with a stronger indication of contamination in the GPT-3 series with davinci-001 and after.
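The paired sign test used in this subsection can be sketched as an exact two-sided binomial test on the signs of paired differences. The pairing shown here (one pre-collection and one post-collection rate per model) and the numbers are assumptions for illustration only.

```python
from math import comb

def sign_test_p_value(pairs):
    """Exact two-sided paired sign test; pairs with equal values are dropped."""
    positive = sum(1 for a, b in pairs if a > b)
    negative = sum(1 for a, b in pairs if a < b)
    n = positive + negative
    if n == 0:
        return 1.0
    k = min(positive, negative)
    # Probability of k or fewer of the rarer sign under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical (pre-collection rate, post-collection rate) pairs, one per model.
rates = [(0.8, 0.3), (0.7, 0.2), (0.6, 0.4), (0.9, 0.5), (0.5, 0.1),
         (0.6, 0.2), (0.7, 0.3), (0.8, 0.6), (0.4, 0.1), (0.9, 0.4)]
print(sign_test_p_value(rates))  # 0.001953125
```

The test only uses the direction of each difference, so it makes no assumption about the distribution of the performance gaps.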
Performance over Time
Next we perform a chronological analysis that examines the change in average performance over time for both GPT-3 series and open LLMs (Fig. 3). On the x-axis, LLMs are ordered chronologically by training data collection date. To also be sensitive to the release time of the datasets, we split our datasets into two sets: datasets released before or after January 1st, 2021, identified as pre-2021 datasets and post-2021 datasets, respectively.
Pre-2021 Datasets
For open LLMs, we see a slight increase over time on pre-2021 datasets (Fig. 2(c)). We find that the performance hovers around the majority baseline for both zero and few-shot settings, and does not increase very much across LLM data collection dates ranging from 2019 to 2022.
For the GPT-3 series, on the other hand, the trend on pre-2021 datasets is particularly suspect (Fig. 2(a)). We see that for these datasets, the performance has increased dramatically over time, with later davinci models much higher than the majority baseline for both zero and few-shot settings. The comparison to open LLMs indicates that zero and few-shot evaluations may have task contamination issues due to data collected from user inputs.
Post-2021 Datasets
For post-2021 datasets, GPT-3 average performance has also increased over time (Fig. 2(b)), particularly in the zero-shot setting. This makes sense, as many of the post-2021 datasets were released prior to the training data collection date for the later davinci models. (To see which datasets are pre- or post-training data collection time, see the line separating pre- and post-collection datasets in Table 4.) Open LLMs' average performance also increased over time, but it remains lower than both the majority baseline and the GPT-3 series.
One could hypothesize that the high performance of the GPT-3 series is due to instruction tuning (Ouyang et al. 2022); however, we do not believe this is the case. While we observe an increase in performance from davinci-001 to davinci-002 on pre-2021 datasets, there is a corresponding decrease in performance on post-2021 datasets, which we measure with the sign test to be statistically significant at the 95% confidence level. This demonstrates that the GPT-3 series instruction tuning is specific to certain earlier datasets, and suggests dataset contamination for zero and few-shot evaluation of the GPT-3 series.
5 Training Data Inspection
To search for direct evidence of task contamination, we conduct training data inspection on two instruction fine-tuned open LLMs (Alpaca and Vicuna) for all classification tasks in our experiments. We search for task-related instruction patterns in the training data, and manually inspect them to see if they contain task training examples. Because we must check manually, we can perform this analysis only for the small fine-tuning datasets of Alpaca and Vicuna. We then compare the performance to see if more task-specific training examples have boosted performance.
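The pattern search that precedes manual inspection can be sketched as below. This is an illustration under stated assumptions: the keyword patterns are hypothetical and are not the patterns used in our experiments, and the record fields (`instruction`, `input`, `output`) assume an Alpaca-style instruction-tuning format. Matches are only candidates and still require manual checking.

```python
import re

# Hypothetical keyword patterns, one per task family (illustrative only).
TASK_PATTERNS = {
    "sentiment": re.compile(r"\b(sentiment|positive or negative)\b", re.IGNORECASE),
    "nli": re.compile(r"\b(entail(s|ment)?|contradict(s|ion)?|hypothesis)\b",
                      re.IGNORECASE),
}

def find_candidate_examples(records, patterns=TASK_PATTERNS):
    """Return, per task, the indices of records whose prompt text matches.

    Assumes each record is a dict with optional 'instruction' and 'input' keys.
    """
    hits = {task: [] for task in patterns}
    for idx, record in enumerate(records):
        text = " ".join(str(record.get(field, ""))
                        for field in ("instruction", "input"))
        for task, pattern in patterns.items():
            if pattern.search(text):
                hits[task].append(idx)
    return hits

# Made-up fine-tuning records.
records = [
    {"instruction": "Classify the sentiment of this review.",
     "input": "Great movie!", "output": "positive"},
    {"instruction": "Write a poem about spring.", "input": "", "output": "..."},
    {"instruction": "Does the premise entail the hypothesis?",
     "input": "...", "output": "yes"},
]
print(find_candidate_examples(records))  # {'sentiment': [0], 'nli': [2]}
```

Keyword search of this kind misses examples phrased without the chosen keywords, which is one source of the low recall discussed in Section 2.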
Table 3 shows the number of task examples on Alpaca and Vicuna, as well as the change in performance over LLaMA averaged over zero and few-shot settings and all tasks. We find that performance has improved for Alpaca and Vicuna over the original LLaMA model for tasks with more than one task example. Because Alpaca and Vicuna are fine-tuned LLaMA models, this indicates that the performance can be improved with small sets of task examples in the training data, which can compromise zero-shot or few-shot evaluation.
Dataset | Alpaca | Vicuna |
---|---|---|
RTE | 0, +3.1% | 33, +10.6% |
WNLI | 0, -1.4% | 33, +7.7% |
COPA | ?, 0% | ?, +10% |
SST-2 | 8, +14.6% | 0, -1.0% |
MRPC | 0, -0.7% | 0, -8.0% |
QNLI | 0, -0.4% | 28, +10.0% |
CB | 0, +9.8% | 0, -23.2% |
WiC | 0, -4.9% | 0, -2.5% |
BoolQ | ?, +1.9% | ?, +4.0% |
StrategyQA | 0, -3.3% | 0, +10.3% |
MTSC-RW | ?, +9.6% | ?, +11.3% |
MTSC-MT | ?, +6.9% | ?, +8.0% |
NLI4Wills | 0, -13.5% | 0, -11.6% |
CREPE | 0, +24.2% | 0, -0.4% |
FOMC | 0, -5.7% | 1, -5.4% |
NewsMet | 4, +7.2% | 0, -11.4% |
Task | Davinci | davinci-001 | davinci-002 | davinci-003 | GPT-3.5-T | MoE | GPT-J | OPT | Bloom | LLaMA | Alpaca | Vicuna |
RTE | X | X | X | X | X | |||||||
WNLI | X | X | X | X | X | |||||||
COPA | X | X | ||||||||||
SST-2 | X | X | X | |||||||||
MRPC | X | X | ||||||||||
QNLI | X | X | X | |||||||||
CB | X | X | X | X | ||||||||
WiC | X | X | X | |||||||||
BoolQ | X | X | ||||||||||
StrategyQA | ||||||||||||
NewsMTSC-MT | X | X | X | |||||||||
NewsMTSC-RW | X | X | X | |||||||||
NLI4Wills | ||||||||||||
CREPE | ||||||||||||
FOMC | X | X | X | |||||||||
NewsMet | X | X |
6 Task Example Extraction
We test for task data contamination by attempting to extract task examples from the LLM. Prior work (Sainz et al. 2023b) has tested if there exists testing data contamination by prompting an LLM to generate examples for a task. If the LLM can generate examples that exactly match examples in the test data, it is evidence that the test set of the task has been seen during training by the LLM. Inspired by their method, we adopt a similar approach to test for task contamination. Instead of attempting to generate test data, we prompt the model to generate training examples, since for zero- or few-shot evaluation, the model should not be trained on any task examples. If an LLM can generate training examples based on the prompt, this is evidence of task contamination. Note we do not require an exact match of the generated examples with the training data for the task, since any examples for the task seen during training indicate possible task contamination. Our prompts for task example extraction are given in Appendix H.
Table 4 shows the task example extraction results on all tasks across all models. For all pre-collection datasets, GPT-3 series models starting from davinci-001 can generate task-specific training examples. There are some post-collection datasets that have evidence of contamination for the GPT-3 series. These datasets may have been contaminated if the authors of these datasets experimented with the GPT-3 series before releasing the dataset. For example, the FOMC paper (Shah, Paturi, and Chava 2023) states they tested with the GPT-3 series, which could have caused contamination. For open LLMs, almost no models can generate training examples of specific tasks except for Vicuna, which is fine-tuned on ChatGPT data. Note that models without instruction tuning cannot follow the instructions directing them to generate task examples, so this analysis is not conclusive for these models.
Comparison to Training Data Inspection
Comparing Tables 3 and 4, we find that training data inspection (TDI) and task example extraction (TEE) both suffer from low recall. TDI has demonstrated task contamination in Alpaca for SST-2 and NewsMet datasets, but TEE failed to catch this contamination. Similarly, TEE has demonstrated task contamination for Vicuna for NewsMTSC, but TDI has failed to catch it. Both suffer from low recall, and highlight the difficulties of employing these methods for detecting task contamination.
7 LLM Performance on Tasks With No Contamination
We find that for tasks without demonstrated possibility of task contamination, LLMs rarely show statistically significant improvements over majority baselines. In Table 4, for the model/dataset combinations that are post-collection and have no extracted task examples, only 1 out of 51, or about 2%, demonstrates a statistically significant improvement over the majority baseline in either the zero or few-shot setting. This combination is davinci-001 on MTSC-RW, which shows a statistically significant improvement over the majority baseline (Tables 8 and 9 in the Appendix) but does not generate task examples with our prompt. This combination is found by cross-referencing Table 4 with Tables 8 and 9 in the Appendix, looking for datasets which are post-collection, are not marked X in Table 4, and are bold in either Table 8 or 9.
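The cross-referencing step can be sketched as a simple filter. The record layout and field names below are hypothetical, and the toy records are made up; only the final ratio, 1 out of 51, comes from our results.

```python
def uncontaminated_significant(combos):
    """Count significant results among combos with no sign of contamination.

    Each combo is a dict with hypothetical boolean fields:
      post_collection       - dataset released after the model's data cutoff
      extracted             - marked X in the task example extraction table
      significant_zero_shot - significantly above the majority baseline (zero-shot)
      significant_few_shot  - significantly above the majority baseline (few-shot)
    Returns (number significant, number of uncontaminated combos).
    """
    clean = [c for c in combos if c["post_collection"] and not c["extracted"]]
    significant = [c for c in clean
                   if c["significant_zero_shot"] or c["significant_few_shot"]]
    return len(significant), len(clean)

# Made-up records illustrating the filter.
toy = [
    {"post_collection": True, "extracted": False,
     "significant_zero_shot": False, "significant_few_shot": True},
    {"post_collection": True, "extracted": False,
     "significant_zero_shot": False, "significant_few_shot": False},
    {"post_collection": True, "extracted": True,
     "significant_zero_shot": True, "significant_few_shot": True},
    {"post_collection": False, "extracted": False,
     "significant_zero_shot": True, "significant_few_shot": False},
]
print(uncontaminated_significant(toy))  # (1, 2)

# The ratio reported in the text:
print(f"{1 / 51:.1%}")  # 2.0%
```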
8 Membership Inference
To further examine the effect of training data contamination, we apply a membership inference attack (Hu et al. 2022a), which checks if model-generated content exactly matches the examples in the dataset. While this test is possible for generation tasks, it is not possible for classification tasks, since inputs may be in the training data of LLMs (and likely are, for many datasets), but we do not know for certain if the inputs are also paired with the labels without looking at the training data. We use Spider (Yu et al. 2018), a semantic parsing and text-to-SQL generation task, as our target for analysis.
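The exact-match check at the core of this test can be sketched as follows. The function names are hypothetical and the SQL strings are illustrative. The optional whitespace normalization is an assumption added for illustration; the strict setting compares strings as-is.

```python
def collapse_whitespace(text):
    """Optional normalizer (an assumption): collapse runs of whitespace."""
    return " ".join(text.split())

def count_exact_matches(generated, references, normalize=None):
    """Count positions where the generated output equals the reference output.

    generated[i] is the model output for the same input as references[i].
    """
    if len(generated) != len(references):
        raise ValueError("generated and references must be aligned")
    norm = normalize or (lambda s: s)
    return sum(1 for g, r in zip(generated, references) if norm(g) == norm(r))

# Illustrative strings, not actual model outputs.
generated = ["SELECT count(*) FROM singer", "SELECT  name FROM singer"]
references = ["SELECT count(*) FROM singer", "SELECT name FROM singer"]
print(count_exact_matches(generated, references))                       # 1
print(count_exact_matches(generated, references, collapse_whitespace))  # 2
```

Any normalization loosens the test, so the strict comparison is the more conservative indicator of memorized phrasing.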
Fig. 4(a) and Fig. 4(b) show how many generated examples from the sampled training set and full development set are exactly the same over versions of the GPT-3 series and recent open sourced LLMs, respectively. The database schemas are not in the zero-shot prompts, so if the model can generate exactly the same table name or field name as found in the training or development data, there must be contamination. As shown in Fig. 5, the number of exact matched generated examples increases over time, which indicates the extent of the task contamination on Spider is increasing.
We also compute the execution accuracy after adding the schema in the prompts, and plot it against the number of exact matched generations (Fig. 6). We find a strong positive correlation between the number of exact matched generated examples and execution accuracy (R=.88), strongly indicating increased contamination is related to increased performance. However, we still cannot determine the extent of the contamination’s effect on performance improvement. We leave this for future work.
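A correlation of this kind can be computed as in the sketch below, which assumes the Pearson correlation coefficient. The per-model numbers are made up for illustration and are not our measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up per-model values: exact-match counts and execution accuracies.
exact_matches = [0, 2, 5, 20, 40]
execution_accuracy = [0.05, 0.10, 0.30, 0.55, 0.70]
print(round(pearson_r(exact_matches, execution_accuracy), 2))
```

A high coefficient shows that the two quantities move together across models; it does not by itself quantify how much of the accuracy gain is caused by contamination.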
9 Take-Aways
We now share some takeaways which our experiments have brought to light:
- Due to task contamination, closed-sourced models may demonstrate inflated performance in zero-shot or few-shot evaluation, and are therefore not trustworthy baselines in these settings, especially those including instruction fine-tuning or reinforcement learning with human feedback (RLHF). The extent of this contamination is still unknown, and we therefore recommend caution.
- In our experiments, for classification tasks without demonstrated possibility of task contamination, LLMs rarely show statistically significant improvements over majority baselines, in both zero and few-shot settings.
- The observed increase over time of GPT-3 series models for zero-shot or few-shot performance for many downstream tasks is likely due to task contamination.
- Inspection for task contamination of training data even for open-sourced LLMs can be difficult for several reasons. First, determining membership is difficult unless the processed dataset used for training the LLM is released (e.g., OPT and LLaMA did not release the data they used to train the model, but Alpaca and Vicuna did, so we can obtain more definite information). Second, we cannot always rely on the model to reproduce evidence of contamination even if it exists. And third, formatting differences (such as CSV and JSON) of a dataset complicate analysis.
- We encourage publicly releasing training datasets to allow for easier diagnosis of contamination issues.
10 Related Work
The investigation into potential data contamination in large language models (LLMs) has recently been gaining attention in the research community. Brown et al. (2020), in their work with GPT-3, presented an in-depth analysis of data contamination. Although they acknowledged the presence of a bug that led to data contamination in multiple datasets, their position was that it did not affect the overall performance of the model. Intriguingly, they noted that contaminated datasets outperformed the uncontaminated ones which, in a way, contradicted their original assertion. Magar and Schwartz (2022) extracted training data from GPT-2 and indicated potential leaks of private data in the pre-trained language model. Chang et al. (2023) discovered that OpenAI models were memorizing substantial amounts of copyrighted materials, which increased concern over data contamination. Aiyappa et al. (2023) highlighted the severity and scope of data contamination problems for ChatGPT evaluations. Highlighting the need for strategic interventions to address these issues, Jacovi et al. (2023) proposed several strategies for mitigating testing data contamination. Additional work has further looked into test data contamination (Sainz et al. 2023b; Zhou et al. 2023; Golchin and Surdeanu 2023; Sainz et al. 2023a; Deng et al. 2023; Oren et al. 2023; Li 2023).
The previous work listed above has investigated test data contamination, but has not considered task contamination for zero-shot or few-shot settings. Prior work has noticed our proposed task contamination problem for zero-shot or few-shot learning (Blevins, Gonen, and Zettlemoyer 2023; Briakou, Cherry, and Foster 2023), but did not systematically analyze it. Our work seeks to add to the existing knowledge by providing an exhaustive evaluation of task contamination for few-shot or zero-shot learning scenarios.
11 Conclusion and Future Work
We investigate task contamination for LLMs, and conduct a chronological analysis, training data inspection, task example extraction, and a membership inference attack to analyze it. We find evidence that some LLMs have seen task examples during pre-training for a range of tasks, and are therefore no longer zero or few-shot for these tasks. Additionally, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings. We recommend additional research be conducted on task contamination for zero and few-shot settings to reveal the extent and impact of task contamination for large language models in these settings.
Acknowledgements
We are grateful for valuable feedback from Nilay Patel on an earlier version of this draft. We are thankful for the computing resources provided by the Pacific Research Platform’s Nautilus cluster, supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019, the University of California Office of the President, and the University of California San Diego’s California Institute for Telecommunications and Information Technology/Qualcomm Institute. Thanks to CENIC for the 100Gbps networks.
References
- Aiyappa et al. (2023) Aiyappa, R.; An, J.; Kwak, H.; and Ahn, Y.-Y. 2023. Can we trust the evaluation on ChatGPT? arXiv:2303.12767.
- Artetxe et al. (2022) Artetxe, M.; Bhosale, S.; Goyal, N.; Mihaylov, T.; Ott, M.; Shleifer, S.; Lin, X. V.; Du, J.; Iyer, S.; Pasunuru, R.; Anantharaman, G.; Li, X.; Chen, S.; Akin, H.; Baines, M.; Martin, L.; Zhou, X.; Koura, P. S.; O’Horo, B.; Wang, J.; Zettlemoyer, L.; Diab, M.; Kozareva, Z.; and Stoyanov, V. 2022. Efficient Large Scale Language Modeling with Mixtures of Experts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 11699–11732. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
- Bang et al. (2023) Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; Do, Q. V.; Xu, Y.; and Fung, P. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity.
- Blevins, Gonen, and Zettlemoyer (2023) Blevins, T.; Gonen, H.; and Zettlemoyer, L. 2023. Prompting Language Models for Linguistic Structure. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 6649–6663. Toronto, Canada: Association for Computational Linguistics.
- Briakou, Cherry, and Foster (2023) Briakou, E.; Cherry, C.; and Foster, G. 2023. Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 9432–9452. Toronto, Canada: Association for Computational Linguistics.
- Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners.
- Chang et al. (2023) Chang, K. K.; Cramer, M.; Soni, S.; and Bamman, D. 2023. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. arXiv:2305.00118.
- Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
- Clark et al. (2019) Clark, C.; Lee, K.; Chang, M.-W.; Kwiatkowski, T.; Collins, M.; and Toutanova, K. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2924–2936. Minneapolis, Minnesota: Association for Computational Linguistics.
- Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. CoRR, abs/2110.14168.
- de Marneffe, Simons, and Tonhauser (2019) de Marneffe, M.-C.; Simons, M.; and Tonhauser, J. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse.
- Demszky, Guu, and Liang (2018) Demszky, D.; Guu, K.; and Liang, P. 2018. Transforming Question Answering Datasets Into Natural Language Inference Datasets. ArXiv, abs/1809.02922.
- Deng et al. (2023) Deng, C.; Zhao, Y.; Tang, X.; Gerstein, M.; and Cohan, A. 2023. Investigating Data Contamination in Modern Benchmarks for Large Language Models. arXiv:2311.09783.
- Dixon and Mood (1946) Dixon, W. J.; and Mood, A. M. 1946. The Statistical Sign Test. Journal of the American Statistical Association, 41(236): 557–566.
- Dolan and Brockett (2005) Dolan, W. B.; and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing (IWP2005).
- Geva et al. (2021) Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; and Berant, J. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics, 9: 346–361.
- Giampiccolo et al. (2008) Giampiccolo, D.; Dang, H. T.; Magnini, B.; Dagan, I.; Cabrio, E.; and Dolan, W. B. 2008. The Fourth PASCAL Recognizing Textual Entailment Challenge. In Text Analysis Conference.
- Golchin and Surdeanu (2023) Golchin, S.; and Surdeanu, M. 2023. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. arXiv:2308.08493.
- Hamborg and Donnay (2021) Hamborg, F.; and Donnay, K. 2021. NewsMTSC: A Dataset for (Multi-)Target-dependent Sentiment Classification in Political News Articles. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1663–1675. Online: Association for Computational Linguistics.
- Hu et al. (2022a) Hu, H.; Salcic, Z.; Sun, L.; Dobbie, G.; Yu, P. S.; and Zhang, X. 2022a. Membership inference attacks on machine learning: A survey. ACM Computing Surveys (CSUR), 54(11s): 1–37.
- Hu et al. (2022b) Hu, Y.; Lee, C.-H.; Xie, T.; Yu, T.; Smith, N. A.; and Ostendorf, M. 2022b. In-Context Learning for Few-Shot Dialogue State Tracking. In Findings of the Association for Computational Linguistics: EMNLP 2022, 2627–2643. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
- Jacovi et al. (2023) Jacovi, A.; Caciularu, A.; Goldman, O.; and Goldberg, Y. 2023. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. arXiv:2305.10160.
- Joseph et al. (2023) Joseph, R.; Liu, T.; Ng, A. B.; See, S.; and Rai, S. 2023. NewsMet : A ‘do it all’ Dataset of Contemporary Metaphors in News Headlines. In Findings of the Association for Computational Linguistics: ACL 2023, 10090–10104. Toronto, Canada: Association for Computational Linguistics.
- Kwak et al. (2022) Kwak, A.; Israelsen, J.; Morrison, C.; Bambauer, D.; and Surdeanu, M. 2022. Validity Assessment of Legal Will Statements as Natural Language Inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, 6047–6056. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
- Lester, Al-Rfou, and Constant (2021) Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Levesque, Davis, and Morgenstern (2012) Levesque, H. J.; Davis, E.; and Morgenstern, L. 2012. The Winograd schema challenge. KR, 2012: 13th.
- Li (2023) Li, Y. 2023. Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation. arXiv:2309.10677.
- Magar and Schwartz (2022) Magar, I.; and Schwartz, R. 2022. Data Contamination: From Memorization to Exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 157–165. Dublin, Ireland: Association for Computational Linguistics.
- Mann and Whitney (1947) Mann, H. B.; and Whitney, D. R. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18(1): 50–60.
- OpenAI (2023a) OpenAI. 2023a. OpenAI Examples.
- OpenAI (2023b) OpenAI. 2023b. OpenAI Models.
- Oren et al. (2023) Oren, Y.; Meister, N.; Chatterji, N.; Ladhak, F.; and Hashimoto, T. B. 2023. Proving Test Set Contamination in Black Box Language Models. arXiv:2310.17623.
- Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Gray, A.; Schulman, J.; Hilton, J.; Kelton, F.; Miller, L.; Simens, M.; Askell, A.; Welinder, P.; Christiano, P.; Leike, J.; and Lowe, R. 2022. Training language models to follow instructions with human feedback. In Oh, A. H.; Agarwal, A.; Belgrave, D.; and Cho, K., eds., Advances in Neural Information Processing Systems.
- Pilehvar and Camacho-Collados (2019) Pilehvar, M. T.; and Camacho-Collados, J. 2019. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1267–1273. Minneapolis, Minnesota: Association for Computational Linguistics.
- Poesia et al. (2022) Poesia, G.; Polozov, A.; Le, V.; Tiwari, A.; Soares, G.; Meek, C.; and Gulwani, S. 2022. Synchromesh: Reliable Code Generation from Pre-trained Language Models. In International Conference on Learning Representations.
- Qin et al. (2023) Qin, C.; Zhang, A.; Zhang, Z.; Chen, J.; Yasunaga, M.; and Yang, D. 2023. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
- Qin and Eisner (2021) Qin, G.; and Eisner, J. 2021. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5203–5212. Online: Association for Computational Linguistics.
- Roemmele, Bejan, and Gordon (2011) Roemmele, M.; Bejan, C. A.; and Gordon, A. S. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, 90–95.
- Sainz et al. (2023a) Sainz, O.; Campos, J.; García-Ferrero, I.; Etxaniz, J.; de Lacalle, O. L.; and Agirre, E. 2023a. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Linguistics: EMNLP 2023.
- Sainz et al. (2023b) Sainz, O.; Campos, J. A.; García-Ferrero, I.; Etxaniz, J.; and Agirre, E. 2023b. Did ChatGPT cheat on your test? https://hitz-zentroa.github.io/lm-contamination/blog/.
- Scao et al. (2022) Scao, T. L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A. S.; Yvon, F.; Gallé, M.; et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Schick et al. (2023) Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools.
- Schick and Schütze (2021a) Schick, T.; and Schütze, H. 2021a. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 255–269. Online: Association for Computational Linguistics.
- Schick and Schütze (2021b) Schick, T.; and Schütze, H. 2021b. Few-Shot Text Generation with Natural Language Instructions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 390–402. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Shah, Paturi, and Chava (2023) Shah, A.; Paturi, S.; and Chava, S. 2023. Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 6664–6679. Toronto, Canada: Association for Computational Linguistics.
- Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. Seattle, Washington, USA: Association for Computational Linguistics.
- Srivastava et al. (2023) Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; Kluska, A.; Lewkowycz, A.; Agarwal, A.; Power, A.; Ray, A.; Warstadt, A.; Kocurek, A. W.; Safaya, A.; Tazarv, A.; Xiang, A.; Parrish, A.; Nie, A.; Hussain, A.; Askell, A.; Dsouza, A.; Slone, A.; Rahane, A.; Iyer, A. S.; Andreassen, A. J.; Madotto, A.; Santilli, A.; Stuhlmüller, A.; Dai, A. M.; La, A.; Lampinen, A.; Zou, A.; Jiang, A.; Chen, A.; Vuong, A.; Gupta, A.; Gottardi, A.; Norelli, A.; Venkatesh, A.; Gholamidavoodi, A.; Tabassum, A.; Menezes, A.; Kirubarajan, A.; Mullokandov, A.; Sabharwal, A.; Herrick, A.; Efrat, A.; Erdem, A.; Karakaş, A.; Roberts, B. R.; Loe, B. S.; Zoph, B.; Bojanowski, B.; Özyurt, B.; Hedayatnia, B.; Neyshabur, B.; Inden, B.; Stein, B.; Ekmekci, B.; Lin, B. Y.; Howald, B.; Orinion, B.; Diao, C.; Dour, C.; Stinson, C.; Argueta, C.; Ferri, C.; Singh, C.; Rathkopf, C.; Meng, C.; Baral, C.; Wu, C.; Callison-Burch, C.; Waites, C.; Voigt, C.; Manning, C. D.; Potts, C.; Ramirez, C.; Rivera, C. E.; Siro, C.; Raffel, C.; Ashcraft, C.; Garbacea, C.; Sileo, D.; Garrette, D.; Hendrycks, D.; Kilman, D.; Roth, D.; Freeman, C. D.; Khashabi, D.; Levy, D.; González, D. M.; Perszyk, D.; Hernandez, D.; Chen, D.; Ippolito, D.; Gilboa, D.; Dohan, D.; Drakard, D.; Jurgens, D.; Datta, D.; Ganguli, D.; Emelin, D.; Kleyko, D.; Yuret, D.; Chen, D.; Tam, D.; Hupkes, D.; Misra, D.; Buzan, D.; Mollo, D. C.; Yang, D.; Lee, D.-H.; Schrader, D.; Shutova, E.; Cubuk, E. D.; Segal, E.; Hagerman, E.; Barnes, E.; Donoway, E.; Pavlick, E.; Rodolà, E.; Lam, E.; Chu, E.; Tang, E.; Erdem, E.; Chang, E.; Chi, E. A.; Dyer, E.; Jerzak, E.; Kim, E.; Manyasi, E. E.; Zheltonozhskii, E.; Xia, F.; Siar, F.; Martínez-Plumed, F.; Happé, F.; Chollet, F.; Rong, F.; Mishra, G.; Winata, G. I.; de Melo, G.; Kruszewski, G.; Parascandolo, G.; Mariani, G.; Wang, G. 
X.; Jaimovitch-Lopez, G.; Betz, G.; Gur-Ari, G.; Galijasevic, H.; Kim, H.; Rashkin, H.; Hajishirzi, H.; Mehta, H.; Bogar, H.; Shevlin, H. F. A.; Schuetze, H.; Yakura, H.; Zhang, H.; Wong, H. M.; Ng, I.; Noble, I.; Jumelet, J.; Geissinger, J.; Kernion, J.; Hilton, J.; Lee, J.; Fisac, J. F.; Simon, J. B.; Koppel, J.; Zheng, J.; Zou, J.; Kocon, J.; Thompson, J.; Wingfield, J.; Kaplan, J.; Radom, J.; Sohl-Dickstein, J.; Phang, J.; Wei, J.; Yosinski, J.; Novikova, J.; Bosscher, J.; Marsh, J.; Kim, J.; Taal, J.; Engel, J.; Alabi, J.; Xu, J.; Song, J.; Tang, J.; Waweru, J.; Burden, J.; Miller, J.; Balis, J. U.; Batchelder, J.; Berant, J.; Frohberg, J.; Rozen, J.; Hernandez-Orallo, J.; Boudeman, J.; Guerr, J.; Jones, J.; Tenenbaum, J. B.; Rule, J. S.; Chua, J.; Kanclerz, K.; Livescu, K.; Krauth, K.; Gopalakrishnan, K.; Ignatyeva, K.; Markert, K.; Dhole, K.; Gimpel, K.; Omondi, K.; Mathewson, K. W.; Chiafullo, K.; Shkaruta, K.; Shridhar, K.; McDonell, K.; Richardson, K.; Reynolds, L.; Gao, L.; Zhang, L.; Dugan, L.; Qin, L.; Contreras-Ochando, L.; Morency, L.-P.; Moschella, L.; Lam, L.; Noble, L.; Schmidt, L.; He, L.; Oliveros-Colón, L.; Metz, L.; Senel, L. K.; Bosma, M.; Sap, M.; Hoeve, M. T.; Farooqi, M.; Faruqui, M.; Mazeika, M.; Baturan, M.; Marelli, M.; Maru, M.; Ramirez-Quintana, M. J.; Tolkiehn, M.; Giulianelli, M.; Lewis, M.; Potthast, M.; Leavitt, M. L.; Hagen, M.; Schubert, M.; Baitemirova, M. O.; Arnaud, M.; McElrath, M.; Yee, M. A.; Cohen, M.; Gu, M.; Ivanitskiy, M.; Starritt, M.; Strube, M.; Swędrowski, M.; Bevilacqua, M.; Yasunaga, M.; Kale, M.; Cain, M.; Xu, M.; Suzgun, M.; Walker, M.; Tiwari, M.; Bansal, M.; Aminnaseri, M.; Geva, M.; Gheini, M.; T, M. V.; Peng, N.; Chi, N. A.; Lee, N.; Krakover, N. G.-A.; Cameron, N.; Roberts, N.; Doiron, N.; Martinez, N.; Nangia, N.; Deckers, N.; Muennighoff, N.; Keskar, N. S.; Iyer, N. S.; Constant, N.; Fiedel, N.; Wen, N.; Zhang, O.; Agha, O.; Elbaghdadi, O.; Levy, O.; Evans, O.; Casares, P. A. 
M.; Doshi, P.; Fung, P.; Liang, P. P.; Vicol, P.; Alipoormolabashi, P.; Liao, P.; Liang, P.; Chang, P. W.; Eckersley, P.; Htut, P. M.; Hwang, P.; Miłkowski, P.; Patil, P.; Pezeshkpour, P.; Oli, P.; Mei, Q.; Lyu, Q.; Chen, Q.; Banjade, R.; Rudolph, R. E.; Gabriel, R.; Habacker, R.; Risco, R.; Millière, R.; Garg, R.; Barnes, R.; Saurous, R. A.; Arakawa, R.; Raymaekers, R.; Frank, R.; Sikand, R.; Novak, R.; Sitelew, R.; Bras, R. L.; Liu, R.; Jacobs, R.; Zhang, R.; Salakhutdinov, R.; Chi, R. A.; Lee, S. R.; Stovall, R.; Teehan, R.; Yang, R.; Singh, S.; Mohammad, S. M.; Anand, S.; Dillavou, S.; Shleifer, S.; Wiseman, S.; Gruetter, S.; Bowman, S. R.; Schoenholz, S. S.; Han, S.; Kwatra, S.; Rous, S. A.; Ghazarian, S.; Ghosh, S.; Casey, S.; Bischoff, S.; Gehrmann, S.; Schuster, S.; Sadeghi, S.; Hamdan, S.; Zhou, S.; Srivastava, S.; Shi, S.; Singh, S.; Asaadi, S.; Gu, S. S.; Pachchigar, S.; Toshniwal, S.; Upadhyay, S.; Debnath, S. S.; Shakeri, S.; Thormeyer, S.; Melzi, S.; Reddy, S.; Makini, S. P.; Lee, S.-H.; Torene, S.; Hatwar, S.; Dehaene, S.; Divic, S.; Ermon, S.; Biderman, S.; Lin, S.; Prasad, S.; Piantadosi, S.; Shieber, S.; Misherghi, S.; Kiritchenko, S.; Mishra, S.; Linzen, T.; Schuster, T.; Li, T.; Yu, T.; Ali, T.; Hashimoto, T.; Wu, T.-L.; Desbordes, T.; Rothschild, T.; Phan, T.; Wang, T.; Nkinyili, T.; Schick, T.; Kornev, T.; Tunduny, T.; Gerstenberg, T.; Chang, T.; Neeraj, T.; Khot, T.; Shultz, T.; Shaham, U.; Misra, V.; Demberg, V.; Nyamai, V.; Raunak, V.; Ramasesh, V. V.; vinay uday prabhu; Padmakumar, V.; Srikumar, V.; Fedus, W.; Saunders, W.; Zhang, W.; Vossen, W.; Ren, X.; Tong, X.; Zhao, X.; Wu, X.; Shen, X.; Yaghoobzadeh, Y.; Lakretz, Y.; Song, Y.; Bahri, Y.; Choi, Y.; Yang, Y.; Hao, Y.; Chen, Y.; Belinkov, Y.; Hou, Y.; Hou, Y.; Bai, Y.; Seid, Z.; Zhao, Z.; Wang, Z.; Wang, Z. J.; Wang, Z.; and Wu, Z. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
- Student (1908) Student. 1908. The probable error of a mean. Biometrika, 1–25.
- Taori et al. (2023) Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- Touvron et al. (2023) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
- Wang et al. (2019) Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Red Hook, NY, USA: Curran Associates Inc.
- Wang et al. (2018) Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. Brussels, Belgium: Association for Computational Linguistics.
- Wang, Deng, and Sun (2022) Wang, B.; Deng, X.; and Sun, H. 2022. Iteratively Prompt Pre-trained Language Models for Chain of Thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2714–2730. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
- Wang and Komatsuzaki (2021) Wang, B.; and Komatsuzaki, A. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
- Wang et al. (2023) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q. V.; Chi, E. H.; Narang, S.; Chowdhery, A.; and Zhou, D. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; and Zhou, D. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Yang et al. (2023) Yang, R.; Song, L.; Li, Y.; Zhao, S.; Ge, Y.; Li, X.; and Shan, Y. 2023. GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. arXiv:2305.18752.
- Yu et al. (2018) Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; Zhang, Z.; and Radev, D. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3911–3921. Brussels, Belgium: Association for Computational Linguistics.
- Yu et al. (2023) Yu, X.; Min, S.; Zettlemoyer, L.; and Hajishirzi, H. 2023. CREPE: Open-Domain Question Answering with False Presuppositions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 10457–10480. Toronto, Canada: Association for Computational Linguistics.
- Zhang et al. (2022) Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X. V.; Mihaylov, T.; Ott, M.; Shleifer, S.; Shuster, K.; Simig, D.; Koura, P. S.; Sridhar, A.; Wang, T.; and Zettlemoyer, L. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068.
- Zhou et al. (2023) Zhou, K.; Zhu, Y.; Chen, Z.; Chen, W.; Zhao, W. X.; Chen, X.; Lin, Y.; Wen, J.-R.; and Han, J. 2023. Don’t Make Your LLM an Evaluation Benchmark Cheater. arXiv:2311.01964.
Appendix A Hyperparameters
We use greedy decoding to ensure a fair comparison across all approaches. For GPT-3 series models, we set the temperature to 0 to ensure deterministic results. For few-shot learning, we use the same few-shot examples across models for each instance in a task. We run open-sourced models on an NVIDIA A100 GPU.
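As a minimal sketch of the few-shot setup described above, the snippet below prepends one fixed list of demonstrations to every test instance, so that every model receives an identical prompt for a given instance. The function name `build_few_shot_prompt` and the `Input:`/`Output:` template are hypothetical illustrations, not the prompt format used in the experiments.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    demonstrations: List[Tuple[str, str]],
    instance: str,
    instruction: str = "",
) -> str:
    """Build a prompt from fixed demonstrations followed by the test instance.

    The "Input:"/"Output:" layout is an assumed template for illustration only.
    """
    parts = [instruction] if instruction else []
    for text, label in demonstrations:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The test instance comes last, with the label left for the model to fill in.
    parts.append(f"Input: {instance}\nOutput:")
    return "\n\n".join(parts)


# The same demonstrations are reused for every instance (and every model).
DEMOS = [("A wonderful film.", "Positive"), ("A dull, lifeless plot.", "Negative")]
prompt_a = build_few_shot_prompt(DEMOS, "I loved every minute.")
prompt_b = build_few_shot_prompt(DEMOS, "I want my money back.")
```

Because the demonstrations are fixed, `prompt_a` and `prompt_b` differ only in the final test instance, which keeps the comparison across models and instances controlled.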
Appendix B Datasets
The pre-2021 datasets are common GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019) tasks: MRPC (Dolan and Brockett 2005), BoolQ (Clark et al. 2019), SST-2 (Socher et al. 2013), QNLI (Demszky, Guu, and Liang 2018), WNLI (Levesque, Davis, and Morgenstern 2012), RTE (Giampiccolo et al. 2008), CB (de Marneffe, Simons, and Tonhauser 2019), COPA (Roemmele, Bejan, and Gordon 2011), and WiC (Pilehvar and Camacho-Collados 2019). The post-2021 datasets are StrategyQA (Geva et al. 2021), NLI4Wills (Kwak et al. 2022), NewsMTSC (Hamborg and Donnay 2021), CREPE (Yu et al. 2023), FOMC (Shah, Paturi, and Chava 2023), and NewsMet (Joseph et al. 2023).
Dataset | Year | Test set size |
---|---|---|
RTE | 2009 | 277 |
WNLI | 2011 | 71 |
COPA | 2011 | 100 |
SST-2 | 2013 | 872 |
MRPC | 2015 | 408 |
QNLI | 2018 | 5463 |
CB | 2019 | 56 |
WiC | 2019 | 638 |
BoolQ | 2019 | 3270 |
StrategyQA | 2021 | 229 |
NewsMTSC-mt | 2021 | 1476 |
NewsMTSC-rw | 2021 | 1146 |
NLI4Wills | 2022 | 255 |
CREPE | 2023 | 2000 |
FOMC | 2023 | 496 |
NewsMet | 2023 | 554 |
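The grouping into pre-2021 and post-2021 datasets can be sketched as a simple split on the release year in the table above. This is an illustrative sketch only: the helper name `split_by_cutoff` is hypothetical, and treating datasets released in 2021 as belonging to the later group is inferred from the dataset lists in this appendix.

```python
from typing import Dict, List, Tuple

# Release years copied from the dataset table in this appendix.
DATASET_YEARS: Dict[str, int] = {
    "RTE": 2009, "WNLI": 2011, "COPA": 2011, "SST-2": 2013, "MRPC": 2015,
    "QNLI": 2018, "CB": 2019, "WiC": 2019, "BoolQ": 2019,
    "StrategyQA": 2021, "NewsMTSC-mt": 2021, "NewsMTSC-rw": 2021,
    "NLI4Wills": 2022, "CREPE": 2023, "FOMC": 2023, "NewsMet": 2023,
}


def split_by_cutoff(
    years: Dict[str, int], cutoff_year: int = 2021
) -> Tuple[List[str], List[str]]:
    """Return (earlier, later) dataset names; `later` includes the cutoff year."""
    earlier = sorted(name for name, year in years.items() if year < cutoff_year)
    later = sorted(name for name, year in years.items() if year >= cutoff_year)
    return earlier, later


earlier, later = split_by_cutoff(DATASET_YEARS)
```

Changing `cutoff_year` to a given model's training data creation date would give the per-model split used in a chronological comparison.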
Appendix C Prompt Sources
The prompts for these tasks are taken from previous research (Bang et al. 2023; Qin et al. 2023) that uses them as evaluation benchmarks and from the OpenAI (2023a) examples, or are designed based on related tasks from these sources. Table 6 shows the prompt source for each dataset. Appendix G lists example prompts for each task.
Dataset | Prompt source |
---|---|
RTE | Bang et al. (2023)* |
WNLI | Bang et al. (2023)* |
COPA | Bang et al. (2023)* |
SST-2 | OpenAI (2023a) |
MRPC | OpenAI (2023a)* |
QNLI | Bang et al. (2023)* |
CB | Bang et al. (2023)* |
WiC | OpenAI (2023a)* |
BoolQ | Qin et al. (2023)* |
StrategyQA | Qin et al. (2023) |
NewsMTSC-mt | OpenAI (2023a)* |
NewsMTSC-rw | OpenAI (2023a)* |
NLI4Wills | Bang et al. (2023)* |
CREPE | OpenAI (2023a)* |
FOMC | Shah, Paturi, and Chava (2023) |
NewsMet | Bang et al. (2023)* |
Appendix D Training Data Inspection Details
We manually inspect training examples found using regular expressions for each task. The regular expression or string search patterns for each task are listed in Table 7. Some tasks, such as COPA and BoolQ, do not have a specific pattern that can be matched. We count an example if it is directly related to the task and contains the input and output for the task. We do not count examples that discuss the task without giving input and output examples.
Dataset | RE pattern |
---|---|
RTE | [Ee]ntailment |
WNLI | [Ee]ntailment |
COPA | – |
SST-2 | [cC]lassify the sentiment |
MRPC | [Pp]araphrase |
QNLI | [Ee]ntailment |
CB | [Ee]ntailment |
WiC | [Ww]ord sense |
BoolQ | – |
StrategyQA | ([tT]he answer is)*([Yy]es\|[Nn]o) |
NLI4Wills | [sS]upport\|[Rr]efute |
MTSC-RW | – |
MTSC-MT | – |
CREPE | presupposition |
FOMC | "hawkish" or "dovish" |
NewsMet | "metaphorical" |
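The pattern-based search that precedes manual inspection can be sketched as follows. This is a minimal illustration assuming the training data is available as a list of strings; the function name `find_candidates` and the toy documents are hypothetical, and only a few of the patterns from Table 7 are included. Matches are candidates only, since an example is counted only after manual inspection confirms it contains the task input and output.

```python
import re
from typing import Dict, List

# A subset of the per-task patterns listed in Table 7.
TASK_PATTERNS: Dict[str, str] = {
    "RTE": r"[Ee]ntailment",
    "SST-2": r"[cC]lassify the sentiment",
    "WiC": r"[Ww]ord sense",
    "CREPE": r"presupposition",
}


def find_candidates(task: str, documents: List[str]) -> List[str]:
    """Return the documents matching the task's pattern, for manual review."""
    pattern = re.compile(TASK_PATTERNS[task])
    return [doc for doc in documents if pattern.search(doc)]


# Toy stand-in for training data; real training sets are far larger.
docs = [
    "Classify the sentiment of this review: great movie -> positive",
    "The weather is nice today.",
    "Does the premise imply the hypothesis? Label: entailment",
]
sst2_hits = find_candidates("SST-2", docs)
rte_hits = find_candidates("RTE", docs)
```

Tasks without a distinctive pattern (marked with a dash in Table 7) cannot be searched this way.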
Appendix E Detailed Results Tables
In this section, we report the performance numbers for all models and datasets in our experiments with confidence intervals.
Dataset | Majority | davinci | davinci-001 | davinci-002 | davinci-003 | GPT-3.5-T | MoE-7B | GPT-J-6B | OPT-6.7B | BLOOM-7B | LLaMA-7B | Alpaca-7B | Vicuna-7B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RTE | 52.7 | 29.6 ± 2.9 | 57.4 ± 3.5 | 75.1 ± 2.6 | 83.8 ± 1.9 | 72.6 ± 2.8 | 61.7 ± 3.3 | 53.1 ± 3.5 | 53.1 ± 3.5 | 52.7 ± 3.5 | 63.2 ± 3.3 | 54.9 ± 3.5 | 60.7 ± 3.4 |
WNLI | 56.3 | 33.8 ± 6.4 | 43.7 ± 7.0 | 66.2 ± 6.4 | 60.6 ± 6.8 | 66.2 ± 6.4 | 45.1 ± 7.1 | 43.7 ± 7.0 | 43.7 ± 7.0 | 43.7 ± 7.0 | 46.5 ± 7.1 | 43.7 ± 7.0 | 43.7 ± 7.0 |
COPA | 55.0 | 66.0 ± 5.4 | 70.0 ± 5.0 | 89.0 ± 2.3 | 93.0 ± 1.6 | 82.0 ± 3.5 | 56.0 ± 5.9 | 50.0 ± 6.0 | 53.0 ± 5.9 | 53.0 ± 5.9 | 55.0 ± 5.9 | 58.0 ± 5.8 | 72.0 ± 4.8 |
SST-2 | 50.9 | 0.3 ± 0.0 | 58.0 ± 1.9 | 85.1 ± 1.0 | 73.4 ± 1.5 | 81.8 ± 1.2 | 5.4 ± 0.4 | 49.1 ± 2.0 | 34.7 ± 1.8 | 53.4 ± 2.0 | 57.8 ± 1.9 | 87.3 ± 0.9 | 62.0 ± 1.9 |
MRPC | 68.4 | 9.3 ± 1.0 | 68.4 ± 2.5 | 68.4 ± 2.5 | 72.5 ± 2.3 | 69.9 ± 2.4 | 34.8 ± 2.6 | 69.9 ± 2.4 | 55.6 ± 2.9 | 31.6 ± 2.5 | 68.9 ± 2.5 | 68.4 ± 2.5 | 68.4 ± 2.5 |
QNLI | 50.5 | 28.0 ± 0.6 | 49.5 ± 0.8 | 57.2 ± 0.8 | 84.6 ± 0.4 | 85.1 ± 0.4 | 55.0 ± 0.8 | 49.7 ± 0.8 | 53.0 ± 0.8 | 49.5 ± 0.8 | 51.5 ± 0.8 | 49.6 ± 0.8 | 59.0 ± 0.8 |
CB | 50.0 | 35.7 ± 7.5 | 75.0 ± 6.1 | 75.0 ± 6.1 | 76.8 ± 5.8 | 75.0 ± 6.1 | 26.8 ± 6.4 | 44.6 ± 8.1 | 41.1 ± 7.9 | 50.0 ± 8.1 | 41.1 ± 7.9 | 48.2 ± 8.1 | 12.5 ± 3.6 |
WiC | 50.0 | 16.3 ± 1.2 | 45.5 ± 2.2 | 48.9 ± 2.2 | 60.5 ± 2.1 | 54.4 ± 2.2 | 50.3 ± 2.2 | 51.3 ± 2.2 | 55.3 ± 2.2 | 50.5 ± 2.2 | 59.6 ± 2.2 | 50.3 ± 2.2 | 52.7 ± 2.2 |
BoolQ | 62.2 | 19.6 ± 0.6 | 78.7 ± 0.6 | 83.5 ± 0.5 | 85.0 ± 0.5 | 87.1 ± 0.4 | 55.8 ± 0.9 | 60.1 ± 0.9 | 59.5 ± 0.9 | 44.6 ± 0.9 | 66.5 ± 0.8 | 74.9 ± 0.7 | 76.3 ± 0.7 |
StrategyQA | 53.3 | 31.9 ± 3.4 | 55.9 ± 3.8 | 53.7 ± 3.9 | 62.0 ± 3.7 | 65.1 ± 3.5 | 46.7 ± 3.9 | 23.6 ± 2.8 | 12.2 ± 1.7 | 24.0 ± 2.8 | 36.2 ± 3.6 | 21.8 ± 2.7 | 53.3 ± 3.9 |
MTSC-MT | 50.7 | 3.3 ± 0.2 | 48.8 ± 1.5 | 34.8 ± 1.4 | 63.8 ± 1.4 | 67.1 ± 1.3 | 0.0 ± 0.0 | 4.2 ± 0.2 | 2.6 ± 0.2 | 3.3 ± 0.2 | 2.2 ± 0.1 | 5.1 ± 0.3 | 12.3 ± 0.7 |
MTSC-RW | 39.7 | 4.5 ± 0.3 | 50.4 ± 1.7 | 34.8 ± 1.6 | 60.9 ± 1.6 | 69.2 ± 1.5 | 0.0 ± 0.0 | 4.3 ± 0.3 | 3.1 ± 0.2 | 3.3 ± 0.2 | 2.3 ± 0.2 | 7.8 ± 0.5 | 10.7 ± 0.7 |
NLI4Wills | 55.7 | 17.6 ± 2.1 | 23.1 ± 2.6 | 15.7 ± 1.9 | 33.7 ± 3.3 | 41.6 ± 3.6 | 14.5 ± 1.8 | 14.5 ± 1.8 | 2.0 ± 0.3 | 3.5 ± 0.5 | 7.1 ± 1.0 | 19.2 ± 2.3 | 21.6 ± 2.5 |
CREPE | 72.8 | 20.5 ± 0.9 | 40.1 ± 1.3 | 28.1 ± 1.1 | 42.1 ± 1.3 | 69.3 ± 1.1 | 4.1 ± 0.2 | 16.5 ± 0.7 | 44.3 ± 1.3 | 68.5 ± 1.1 | 20.4 ± 0.8 | 67.2 ± 1.1 | 18.1 ± 0.8 |
FOMC | 49.4 | 33.3 ± 2.3 | 52.6 ± 2.6 | 61.5 ± 2.5 | 54.0 ± 2.6 | 59.5 ± 2.5 | 11.1 ± 1.0 | 24.2 ± 1.9 | 11.5 ± 1.1 | 25.0 ± 2.0 | 39.1 ± 2.5 | 25.0 ± 2.0 | 28.4 ± 2.1 |
NewsMet | 52.3 | 20.4 ± 1.6 | 50.9 ± 2.5 | 57.0 ± 2.4 | 50.2 ± 2.5 | 51.1 ± 2.5 | 7.8 ± 0.7 | 47.5 ± 2.5 | 34.8 ± 2.3 | 36.1 ± 2.3 | 31.0 ± 2.1 | 46.9 ± 2.5 | 8.7 ± 0.8 |
Dataset | Majority | davinci | davinci-001 | davinci-002 | davinci-003 | GPT-3.5-T | MoE-7B | GPT-J-6B | OPT-6.7B | BLOOM-7B | LLaMA-7B | Alpaca-7B | Vicuna-7B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RTE | 52.7 | 50.5 ± 3.5 | 65.0 ± 3.2 | 83.4 ± 2.0 | 85.6 ± 1.7 | 84.8 ± 1.8 | 46.6 ± 3.5 | 46.6 ± 3.5 | 62.8 ± 3.3 | 51.6 ± 3.5 | 48.0 ± 3.5 | 62.5 ± 3.3 | 71.8 ± 2.9 |
WNLI | 56.3 | 57.7 ± 7.0 | 46.5 ± 7.1 | 60.6 ± 6.8 | 71.8 ± 5.8 | 85.9 ± 3.5 | 56.3 ± 7.0 | 46.5 ± 7.1 | 43.7 ± 7.0 | 52.1 ± 7.1 | 46.5 ± 7.1 | 46.5 ± 7.1 | 64.8 ± 6.5 |
COPA | 55.0 | 47.0 ± 5.9 | 83.0 ± 3.4 | 96.0 ± 0.9 | 96.0 ± 0.9 | 97.0 ± 0.7 | 90.0 ± 2.1 | 45.0 ± 5.9 | 54.0 ± 5.9 | 45.0 ± 5.9 | 69.0 ± 5.1 | 66.0 ± 5.4 | 72.0 ± 4.8 |
SST-2 | 50.9 | 91.7 ± 0.6 | 92.7 ± 0.5 | 92.2 ± 0.6 | 78.2 ± 1.3 | 90.1 ± 0.7 | 1.7 ± 0.1 | 79.5 ± 1.3 | 87.4 ± 0.9 | 84.7 ± 1.0 | 93.6 ± 0.5 | 93.2 ± 0.5 | 87.3 ± 0.9 |
MRPC | 68.4 | 52.7 ± 2.9 | 69.1 ± 2.5 | 71.6 ± 2.4 | 77.0 ± 2.1 | 72.8 ± 2.3 | 31.6 ± 2.5 | 85.3 ± 1.5 | 67.2 ± 2.6 | 31.6 ± 2.5 | 69.4 ± 2.5 | 68.4 ± 2.5 | 53.9 ± 2.9 |
QNLI | 50.5 | 51.7 ± 0.8 | 59.0 ± 0.8 | 79.0 ± 0.5 | 79.9 ± 0.5 | 84.4 ± 0.4 | 50.6 ± 0.8 | 49.5 ± 0.8 | 55.6 ± 0.8 | 52.1 ± 0.8 | 57.7 ± 0.8 | 58.8 ± 0.8 | 70.3 ± 0.7 |
CB | 50.0 | 50.0 ± 8.1 | 80.4 ± 5.1 | 78.6 ± 5.5 | 78.6 ± 5.5 | 80.4 ± 5.1 | 0.0 ± 0.0 | 44.6 ± 8.1 | 41.1 ± 7.9 | 41.1 ± 7.9 | 71.4 ± 6.6 | 83.9 ± 4.4 | 53.6 ± 8.1 |
WiC | 50.0 | 51.1 ± 2.2 | 55.6 ± 2.2 | 57.2 ± 2.2 | 66.5 ± 2.0 | 63.2 ± 2.1 | 50.0 ± 2.2 | 54.9 ± 2.2 | 50.2 ± 2.2 | 51.3 ± 2.2 | 50.5 ± 2.2 | 49.8 ± 2.2 | 52.4 ± 2.2 |
BoolQ | 62.2 | 55.8 ± 0.9 | 79.5 ± 0.6 | 87.1 ± 0.4 | 88.4 ± 0.4 | 85.1 ± 0.5 | 37.9 ± 0.9 | 62.9 ± 0.9 | 66.9 ± 0.8 | 52.6 ± 1.0 | 77.8 ± 0.7 | 73.2 ± 0.7 | 76.0 ± 0.7 |
StrategyQA | 53.3 | 52.4 ± 3.9 | 58.5 ± 3.8 | 62.4 ± 3.6 | 70.3 ± 3.2 | 69.0 ± 3.3 | 48.5 ± 3.9 | 45.0 ± 3.8 | 52.8 ± 3.9 | 49.8 ± 3.9 | 53.3 ± 3.9 | 61.1 ± 3.7 | 56.8 ± 3.8 |
MTSC-MT | 50.7 | 40.0 ± 1.5 | 43.2 ± 1.5 | 61.0 ± 1.4 | 68.4 ± 1.3 | 70.7 ± 1.3 | 0.1 ± 0.0 | 36.7 ± 1.4 | 24.1 ± 1.1 | 2.9 ± 0.2 | 48.3 ± 1.5 | 59.2 ± 1.5 | 54.3 ± 1.5 |
MTSC-RW | 39.7 | 33.2 ± 1.5 | 52.9 ± 1.7 | 66.8 ± 1.5 | 64.6 ± 1.6 | 69.4 ± 1.5 | 0.1 ± 0.0 | 31.0 ± 1.5 | 30.8 ± 1.5 | 3.1 ± 0.2 | 41.4 ± 1.7 | 55.2 ± 1.7 | 55.7 ± 1.7 |
NLI4Wills | 55.7 | 47.1 ± 3.7 | 30.2 ± 3.1 | 5.1 ± 0.7 | 28.2 ± 3.0 | 36.5 ± 3.4 | 0.4 ± 0.1 | 21.6 ± 2.5 | 24.3 ± 2.7 | 54.9 ± 3.6 | 56.9 ± 3.6 | 17.6 ± 2.1 | 19.2 ± 2.3 |
CREPE | 72.8 | 60.9 ± 1.2 | 44.9 ± 1.3 | 73.8 ± 1.0 | 70.9 ± 1.1 | 62.2 ± 1.2 | 67.7 ± 1.1 | 72.8 ± 1.0 | 72.8 ± 1.0 | 14.8 ± 0.7 | 71.2 ± 1.1 | 72.8 ± 1.0 | 72.8 ± 1.0 |
FOMC | 49.4 | 40.7 ± 2.5 | 54.4 ± 2.6 | 55.2 ± 2.6 | 61.7 ± 2.5 | 63.5 ± 2.4 | 25.0 ± 2.0 | 49.4 ± 2.6 | 49.4 ± 2.6 | 42.3 ± 2.6 | 50.2 ± 2.6 | 52.8 ± 2.6 | 50.0 ± 2.6 |
NewsMet | 52.3 | 48.0 ± 2.5 | 51.3 ± 2.5 | 49.5 ± 2.5 | 50.2 ± 2.5 | 56.0 ± 2.4 | 39.4 ± 2.4 | 47.7 ± 2.5 | 52.5 ± 2.5 | 47.7 ± 2.5 | 52.3 ± 2.5 | 50.9 ± 2.5 | 52.0 ± 2.5 |
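To make the comparison against the majority baseline concrete, the sketch below runs a one-sided exact binomial test of whether an accuracy is significantly above the majority rate. This is an assumed, illustrative procedure and not necessarily the significance test used in this paper; the helper names `binom_sf` and `beats_majority` and the threshold `alpha=0.05` are hypothetical choices. The inputs in the usage lines are RTE values taken from the tables above (majority 52.7, test set size 277).

```python
import math
from typing import Tuple


def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    assert 0.0 < p < 1.0, "p must lie strictly between 0 and 1"
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def beats_majority(
    accuracy_pct: float, majority_pct: float, n: int, alpha: float = 0.05
) -> Tuple[bool, float]:
    """Test H0: true accuracy <= majority rate, given n test examples."""
    correct = round(accuracy_pct / 100 * n)  # recover the count of correct answers
    p_value = binom_sf(correct, n, majority_pct / 100)
    return p_value < alpha, p_value


# RTE: an accuracy of 83.8 versus 53.1, both against the 52.7 majority baseline.
strong, p_strong = beats_majority(83.8, 52.7, 277)
weak, p_weak = beats_majority(53.1, 52.7, 277)
```

Under these assumptions, an accuracy only slightly above the majority rate on a small test set does not come out as a significant improvement, which is why raw accuracy alone is not enough to claim that a model beats the baseline.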
Appendix F Additional Figures
Appendix G Prompt Examples for Each Task
In this section we give examples of zero-shot prompts for each task.
Task: MRPC
Prompting Inputs:
Expected Outputs: Yes

Task: BOOLQ
Prompting Inputs:
Expected Outputs: No

Task: SST
Prompting Inputs:
Expected Outputs: Positive

Task: QQP
Prompting Inputs:
Expected Outputs: No

Task: QNLI
Prompting Inputs:
Expected Outputs: Yes

Task: WNLI
Prompting Inputs:
Expected Outputs: No

Task: RTE
Prompting Inputs:
Expected Outputs: No

Task: CB
Prompting Inputs:
Expected Outputs: No

Task: COPA
Prompting Inputs:
Expected Outputs: 2

Task: WIC
Prompting Inputs:
Expected Outputs: No

Task: STRATEGYQA
Prompting Inputs:
Expected Outputs: No

Task: NLI4WILLS
Prompting Inputs:
Expected Outputs: Refute

Task: NEWSMTSC-RW
Prompting Inputs:
Expected Outputs: negative

Task: NEWSMTSC-MT
Prompting Inputs:
Expected Outputs: negative

Task: Spider without schema
Prompting Inputs:
Expected Outputs: SELECT count(*) FROM singer

Task: Spider with schema
Prompting Inputs:
Expected Outputs: SELECT count(*) FROM singer

Task: FOMC
Prompting Inputs:
Expected Outputs: Hawkish

Task: CREPE
Prompting Inputs:
Expected Outputs: No

Task: NewsMet
Prompting Inputs:
Expected Outputs: metaphorical
Appendix H Prompts for Task Example Extraction
Task | Prompt used |
---|---|
RTE | Generate several training examples for Recognizing Textual Entailment dataset including premise and hypothesis with entailment and not_entailment as labels. |
WNLI | Generate several training examples for Winograd Schema Natural Language Inference dataset including premise and hypothesis with entailment and not_entailment as labels. |
COPA | Generate several training examples for Choice of Plausible Alternatives (COPA) dataset including premise and choices as input with 0 or 1 as labels. |
SST-2 | Generate several training examples for sentiment analysis task with positve and negative as labels |
MRPC | Generate several training examples for Microsoft Research Paraphrase Corpus task. |
QNLI | Generate several training examples for Question answering Natural Language Inference dataset using question answer pairs with entailment and not_entailment as labels. |
CB | Generate several training examples for CommitmentBank Natural Language Inference dataset including premise and hypothesis as input with entailment, neutral, as contradiction labels. |
WiC | Generate several training examples for The Word-in-Context (WiC) Dataset task including 2 sentences and a word in both sentences as input with true or false as labels. |
BoolQ | Generate several training examples for BoolQ dataset which is a question answering dataset for yes/no questions including passage and question as input with yes or no as labels. |
StrategyQA | Generate several training examples for StrategyQA task which is a question-answering task focusing on open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy. Generate with a question and reasoning steps as input and Yes or No as Labels. |
NewsMTSC | Generate several training examples for Multi-Target-dependent Sentiment Classification in Political News Articles including a sentence and a target in the sentence as input with positive and negative as labels. |
NLI4Wills | Generate several training examples for the validity evaluation of the legal will statements including statement, conditions and law as input with support, refute, or unrelated as labels. |
CREPE | Generate several training examples for a QA task containing a natural distribution of presupposition failures for questions with whether there is any false presuppositions including question and comment as input with true or false as labels |
FOMC | Generate several training examples for Federal Open Market Committee (FOMC) dataset for a measure of monetary policy stance task including sentence from FOMC document as input with Dovish, Hawkish or Neutral as labels. |
NewsMet | Generate several training examples from NewsMet, a large high-quality contemporary dataset of news headlines hand-annotated with metaphorical verbs with a task to detect if the headline is metaphorical including a headline sentence as input with 0 or 1 as labels to represent metaphorical or not metaphorical. |