Generating Datasets With Pretrained Language Models
Task: Write two sentences that mean the same thing.
Model               | UD  | STS12 | STS13 | STS14 | STS15 | STS16 | STSb  | SICK  | Avg.
sup.
S-BERT (base)       | –   | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.89
S-RoBERTa (base)    | –   | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21
unsup.
Avg. GloVe          | –   | 55.14 | 70.66 | 59.73 | 68.25 | 63.66 | 58.02 | 53.76 | 61.32
Avg. BERT           | –   | 38.78 | 57.98 | 57.98 | 63.15 | 61.06 | 46.35 | 58.40 | 54.81
BERT CLS            | –   | 20.16 | 30.01 | 20.09 | 36.88 | 38.08 | 16.50 | 42.63 | 29.19
Zhang et al. (2020) | NLI | 56.77 | 69.24 | 61.21 | 75.23 | 70.16 | 69.21 | 64.25 | 66.58
Li et al. (2020)    | NLI | 59.54 | 64.69 | 64.66 | 72.92 | 71.84 | 58.56 | 65.44 | 65.38
Li et al. (2020)    | STS | 63.48 | 72.14 | 68.42 | 73.77 | 75.37 | 70.72 | 63.11 | 69.57
DINO (STS-🎲-x1x2)  | –   | 64.87 | 78.30 | 66.38 | 79.60 | 76.47 | 76.51 | 74.26 | 73.77
DINO (STS-🎲-x2)    | STS | 70.27 | 81.26 | 71.25 | 80.49 | 77.18 | 77.82 | 68.09 | 75.20
Table 1: Spearman's rank correlation on STS12–16, STSb and SICK without finetuning on task-specific examples, for models with NLI supervision ("sup.") and fully unsupervised ("unsup.") models, using the same evaluation setup as Reimers and Gurevych (2019). The second column shows which unlabeled data ("UD") is used by unsupervised approaches in addition to the original pretraining data; the final column shows average performance. Results for all baselines except Zhang et al. (2020) and Li et al. (2020) are from Reimers and Gurevych (2019). The best unsupervised result is shown in bold, the best overall result is underlined. DINO outperforms all unsupervised approaches and, surprisingly, also supervised approaches on four out of six STS datasets.
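The evaluation protocol behind Table 1 can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it assumes the standard setup of scoring cosine similarities between sentence embeddings against gold labels with Spearman's ρ, and the `embed` function shown is a toy stand-in for a real sentence encoder.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sts_spearman(embed, pairs, gold_scores):
    """Spearman's rank correlation (x100) between cosine similarities of
    sentence embeddings and gold similarity scores, mirroring the STS
    evaluation setup of Reimers and Gurevych (2019)."""
    preds = [cosine_similarity(embed(x1), embed(x2)) for x1, x2 in pairs]
    rho, _ = spearmanr(preds, gold_scores)
    return 100.0 * rho

# Toy character-count "encoder" standing in for a real embedding model:
def embed(sentence):
    vec = np.zeros(26)
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec
```

Because Spearman's ρ depends only on ranks, any monotone rescaling of the similarity scores leaves the numbers in Table 1 unchanged.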
Table 3: Comparison of similarity scores in STS-🎲-x2 to human judgments for 100 examples. Examples are chosen randomly from the version of STS-🎲-x2 used for training (including label smoothing, augmentation with random pairs and removal of examples where x1 = x2). For column i and row j, the value shown is the percentage of examples generated by DINO for similarity score i that were assigned score j in our human evaluation.

Human labels:
  0.0:  95%  15%   0%   0%
  0.9:   0%   0%  29%  47%
(Only these two rows of Table 3 are recoverable; the column headers, the generated similarity scores, are lost in the extraction.)

Table 4: A selection of high-quality (✓) and low-quality (✗) examples in STS-🎲-x2. Many sentence pairs for y = 1 are not similar and have quite different meanings. Some sentence pairs for y = 0 are not on completely different topics.

y = 1
  ✗  x1 = US closes embassy in Syria / x2 = US Embassy in Syria
  ✗  x1 = A man is playing the cello. / x2 = The cello is playing the man.
  ✗  x1 = A plane is taking off. / x2 = I want to be a pilot.
y = 0.5
  ✓  x1 = A woman is seasoning a piece of meat. / x2 = A man is cooking the meat and adding spices [...]
  ✓  x1 = Second day of Egyptian presidential election / x2 = The first night of the election.
y = 0
  ✓  x1 = A white bus with the word Julia is near water [...] / x2 = There is an open beach in my hometown.
  ✓  x1 = Strong earthquake in Mexico / x2 = It's the best time to get a job
  ✗  x1 = Closed roads in Armenia / x2 = Open roads in Azerbaijan
  ✗  x1 = The man is playing the guitar. / x2 = I'm not a guitar player.
  ✗  x1 = A man is playing a large flute. / x2 = A man is listening to a small flute.

posed to be on completely different topics, many (41%) still have a certain similarity according to human judgment. In contrast, randomly sampled pairs are indeed on completely different topics in almost all cases. Moreover, we can see that GPT2-XL has particular difficulty in generating pairs of non-identical sentences that really mean the same thing: only 47% of all examples that should have the same meaning do actually mean (almost) the same thing. However, the strong performance of S-RoBERTa trained on STS-🎲-x2 suggests that, despite this noise, there is sufficient signal in this dataset for successful training.
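The dataset post-processing mentioned in the Table 3 caption — label smoothing, augmentation with random pairs, and removal of examples where x1 = x2 — can be sketched as below. This is an illustrative reconstruction, not the authors' exact implementation: the function name, the smoothing scheme (scaling labels by 1 − ε so that y = 1 becomes 0.9, matching the scores visible in Table 3), and the random-pair count are all assumptions.

```python
import random

def postprocess(examples, smoothing=0.1, num_random_pairs=100, seed=0):
    """Post-process generated (x1, x2, y) triples:
    1. drop trivial examples where both sentences are identical,
    2. smooth labels by scaling with (1 - smoothing), so y = 1 becomes
       0.9 (an assumed scheme, cf. Szegedy et al., 2016),
    3. add randomly sampled sentence pairs as extra y = 0 examples.
    """
    rng = random.Random(seed)

    # 1) Remove examples where x1 = x2.
    examples = [(x1, x2, y) for x1, x2, y in examples if x1 != x2]

    # 2) Label smoothing.
    examples = [(x1, x2, y * (1.0 - smoothing)) for x1, x2, y in examples]

    # 3) Augmentation with random pairs, assumed to be dissimilar (y = 0).
    sentences = [x1 for x1, _, _ in examples] + [x2 for _, x2, _ in examples]
    for _ in range(num_random_pairs):
        x1, x2 = rng.sample(sentences, 2)
        examples.append((x1, x2, 0.0))
    return examples
```

The random pairs give the trained model explicit evidence for the low end of the similarity scale, which the generator alone produces unreliably, as the human evaluation above shows.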
We finally take a qualitative look at both positive examples where DINO is able to create high-quality text pairs and at some typical errors found in many of the generated examples. As shown in Table 4, for y = 1 the PLM sometimes comes up with decent paraphrases (e.g. "notches a victory" ↦ "wins") or substitutes with very similar meaning ("cutting" ↦ "slicing"), but more often it generates sentences that either omit or mix up important information, and sometimes it produces sentences with an entirely different meaning. Whereas sentences generated for y = 0.5 by and large look reasonable, for y = 0 the PLM often simply flips words ("closed" ↦ "open", "large" ↦ "small") instead of producing sentences on completely different topics.

5 Conclusion

We have introduced DINO, a method for using large PLMs to generate entire datasets of labeled sentence pairs from scratch, requiring no labeled data and no parameter updates. This is achieved by providing instructions in natural language, combined with the self-debiasing method of Schick et al. (2021). With appropriate measures for handling noisy data, models trained on datasets generated with DINO achieve strong results on several semantic textual similarity datasets.

For future work, it would be interesting to see whether the noise in datasets generated with DINO can further be reduced, e.g., by using different sets of instructions (Jiang et al., 2020; Schick and Schütze, 2021a) or by supplementing our pipeline with some additional filtering steps.

Acknowledgments This work was funded by the European Research Council (ERC #740516). We thank the anonymous reviewers for their helpful comments.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, Colorado. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin, Ireland. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, Atlanta, Georgia, USA. Association for Computational Linguistics.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 385–393, USA. Association for Computational Linguistics.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7383–7390.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Avia Efrat and Omer Levy. 2020. The turking test: Can language models understand instructions? Computing Research Repository, arXiv:2010.11982.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Computing Research Repository, arXiv:2101.03961.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

John M. Giorgi, Osvald Nitski, Gary D. Bader, and Bo Wang. 2020. DeCLUTR: Deep contrastive learning for unsupervised textual representations. Computing Research Repository, arXiv:2006.03659.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, Melbourne, Australia. Association for Computational Linguistics.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2021. Data augmentation using pre-trained transformer models. Computing Research Repository, arXiv:2003.02245.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196, Bejing, China. PMLR.

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computing Research Repository, arXiv:1301.3781.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. Computing Research Repository, arXiv:2010.10216.

Yannis Papanikolaou and Andrea Pierleoni. 2020. DARE: Data augmented relation extraction with GPT-2. Computing Research Repository, arXiv:2004.13845.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Nina Pörner and Hinrich Schütze. 2019. Multi-view domain adapted sentence embeddings for low-resource unsupervised duplicate question detection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1630–1641. Association for Computational Linguistics.

Nina Pörner, Ulli Waltinger, and Hinrich Schütze. 2020. Sentence meta-embeddings for unsupervised semantic textual similarity. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7027–7034. Association for Computational Linguistics.

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. Computing Research Repository, arXiv:1912.10165.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. Computing Research Repository, arXiv:2012.11926.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze questions for few shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Kyiv, Ukraine (Online). International Committee on Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.

Orion Weller, Nicholas Lourie, Matt Gardner, and Matthew Peters. 2020. Learning from task descriptions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive learning for sentence representation. Computing Research Repository, arXiv:2012.15466.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610, Online. Association for Computational Linguistics.
C Additional Results

Our main results do not include scores for DeCLUTR (Giorgi et al., 2020) and CLEAR (Wu et al., 2020) – two recent approaches using contrastive learning – as their evaluation setup differs from that described in Reimers and Gurevych (2019) (and used by all other baselines) in the following respects: