Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, pages 59 - 69
May 27, 2022 ©2022 Association for Computational Linguistics
2 Related Work

Bias Issues in Machine Learning. Unfair behaviors have been found in many machine learning and artificial intelligence applications, including facial recognition (Raji and Buolamwini, 2019), recommendation systems (Schnabel et al., 2016), and speech recognition (Koenecke et al., 2020). One major source of bias comes from training datasets that cause models to behave negatively towards underrepresented groups (Mehrabi et al., 2021). For example, Shankar et al. (2017) found that ImageNet (Russakovsky et al., 2015) and the Open Images dataset (Krasin et al., 2017) disproportionately represented people from North America and Europe. To mitigate biased behaviors in machine learning models, researchers have proposed methods targeting different tasks and domains, such as classification (Menon and Williamson, 2018; Roh et al., 2021), regression (Agarwal et al., 2019; Berk et al., 2017), and adversarial learning (Xu et al., 2018).

Bias Issues in NLP Models. Traditional static word embedding models are no exception to this trend and also demonstrate gender bias. Bolukbasi et al. (2016) showed that in word2vec (Mikolov et al., 2013), the embedding vector for “doctor” is closer to “male” than to “female.” Similarly, Caliskan et al. (2017) found that GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) contained the same stereotype associations found in classic human psychology studies (Greenwald et al., 1998). Sheng et al. (2019) and May et al. (2019) revealed harmful stereotypes in pre-trained language models and their contextual word embeddings such as ELMo (Peters et al., 2018), GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019).

Early work measured bias at the word level using the cosine similarity between embedding vectors, as in Bolukbasi et al. (2016) and the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). May et al. (2019) extended WEAT to the Sentence Encoder Association Test (SEAT) to measure bias in ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). However, they found inconsistencies in such cosine-based measurements applied to contextual word embeddings. Later, Kurita et al. (2019) proposed a more consistent metric by masking combinations of target words and attributes and measuring the predicted token probabilities from a BERT model. Sheng et al. (2019) defined and measured a concept of regard and sentiment for GPT-2 output. Finally, Nadeem et al. (2021) proposed a new benchmark called StereoSet. It includes sentence- and discourse-level measurements that cover bias among genders, races, professions, and religions. In this work, we applied StereoSet to evaluate our models.

Mitigating Bias in NLP Models. Bolukbasi et al. (2016) mitigated bias by subtracting the projected gender direction from words that should be gender-neutral while also maintaining equal distance between non-gendered words and pairs of gendered words. Zhao et al. (2018b) reserved certain dimensions of embedding vectors for gender information, where gender-neutral words were made orthogonal to the gender direction. Gonen and Goldberg (2016) pointed out a limitation of the two previous methods: the relative similarity among words persists; i.e., words that are biased towards the same group remain close to each other. Zhao et al. (2018a) and Zhao et al. (2019) used data augmentation to replace gendered words with their opposites in the original training corpus, and they trained a new model on the union of both corpora. However, this method requires retraining, which is expensive for large-scale neural networks. Finally, Peng et al. (2020) applied normative fine-tuning on GPT-2 to reduce the frequency of non-normative output.

Transfer Learning and Fine-Tuning. Transfer learning studies how to transfer machine-learned knowledge to different but related domains (Zhuang et al., 2020). Fine-tuning, one approach to transfer learning, has been widely used for neural network models (Ge and Yu, 2017; Jung et al., 2015; Maqsood et al., 2019; Shin et al., 2016). Specifically in the field of NLP, fine-tuning can transfer language models such as transformers (Vaswani et al., 2017) into various other task modalities (Abramson et al., 2020; Dosovitskiy et al., 2020; Lu et al., 2021; Radford et al., 2021). For example, Lu et al. (2021) fine-tuned transformers pre-trained on English text to perform well on sequence classification tasks in the domains of numerical computation, vision, and biology.
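The projection-based mitigation attributed to Bolukbasi et al. (2016) in the related work above can be illustrated with a minimal sketch. It covers only the neutralizing step (removing a word vector's component along a gender direction); how the direction is estimated and the equalization of gendered word pairs are left out. The function names and the toy three-dimensional vectors are hypothetical and are not taken from any of the cited works.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def remove_direction(word_vec, direction):
    """Subtract from word_vec its projection onto the given direction."""
    unit = normalize(direction)
    scale = dot(word_vec, unit)
    return [w - scale * d for w, d in zip(word_vec, unit)]


# Toy example with made-up numbers (not real embeddings).
gender_direction = [1.0, 0.0, 0.0]
doctor = [0.4, 0.8, 0.2]
neutral_doctor = remove_direction(doctor, gender_direction)
```

After this step the vector has no component along the chosen direction, while its other coordinates are unchanged.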
3 Method

3.1 Dataset

We curated a fine-tuning dataset by combining the WinoBias (Zhao et al., 2018a) and CrowS-Pairs (Nangia et al., 2020) datasets to obtain a total of 4,600 sentences (1,584 from WinoBias plus both sentences from each of the 1,508 CrowS-Pairs pairs), further split into training (80%), cross-validation (10%), and testing (10%) sets. We describe the contents of each dataset below.

3.1.1 WinoBias

The WinoBias dataset provided by Zhao et al. (2018a) contains 1,584 training sentences involving both genders and professions such that professions are described with an equal distribution of masculine and feminine pronouns.

3.1.2 CrowS-Pairs

Additionally, we incorporated the CrowS-Pairs dataset provided by Nangia et al. (2020), containing 1,508 pairs of sentences. The first sentence of each pair targets a stereotype of a historically marginalized group; the second sentence is a minor edit of the first, but it targets a different demographic or attribute. We use both the stereotyped and anti-stereotyped sentences to remain impartial towards each demographic.

3.2 Fine-Tuning

We modified the GPT-2 small model publicly available via the Hugging Face Transformers library (https://huggingface.co/docs/transformers/model_doc/gpt2). For each experiment, we froze the entire model and applied one or more of the following modifications:

1. Unfreezing the layer norm parameters
2. Unfreezing the word embeddings
3. Unfreezing the word positioning embeddings
4. Adding a linear input transformation
5. Adding a linear output transformation

The linear input and output transformation layers are each initialized as an identity matrix with unfrozen parameters.

We trained the models with a cross-entropy loss and a batch size of 50. See Table 3 for the learning rate and training epochs of each model combination. After fine-tuning each altered model with hyperparameters optimized on the cross-validation dataset, we applied the StereoSet benchmark.

3.3 StereoSet Benchmark

StereoSet (Nadeem et al., 2021) provides a quantitative assessment of how prone a language model is to stereotypical bias. The benchmark consists of various fill-in-the-blank tests (called Context Association Tests or CATs) with three multiple-choice answers. A CAT prompt partially describes a person or situation. The model in question must complete the prompt with one of three given options. One response reflects a traditional stereotype, another reflects the opposite of that stereotype, and the last is nonsensical.

StereoSet contains two types of tasks: intrasentence and intersentence. Intrasentence prompts consist of one sentence with the final word redacted, and the model must complete that sentence. Intersentence prompts begin with one complete sentence, and the model must choose the logical next sentence. While the original StereoSet work used both intrasentence and intersentence tasks, we focused only on intrasentence.

StereoSet calculates three scores according to how the model completes the prompts. The language modeling score (LMS) represents the percentage of tests in which the model picks a logical answer (either the stereotyped or anti-stereotyped answer) over the nonsensical answer. An ideal language model would have an LMS of 100. The stereotype score (SS) represents the percentage of tests in which the model picks a stereotyped answer over the anti-stereotyped answer. An ideal language model's SS would be 50, where the model chooses the stereotyped and anti-stereotyped responses with equal probability. StereoSet makes the assumption that both of these answers should be equally likely, despite any real-world context such as the actual gender distribution across professions. Finally, the Idealized CAT score (ICAT) is a combination of the LMS and SS with the following formula:

$$\mathrm{ICAT} = \mathrm{LMS} \cdot \frac{\min(\mathrm{SS},\, 100 - \mathrm{SS})}{50}$$

The ICAT score has the following properties: it reaches 100 when the LMS is 100 and the SS is 50, representing the ideal model; when the model always picks the stereotyped or anti-stereotyped answer (representing an SS of 100 or 0, respectively), the ICAT will be 0; finally, a completely random model will have an ICAT of 50.
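As a quick check of the ICAT formula and the properties listed above, the score can be computed directly from an LMS and an SS. The sketch below is only an illustration of the formula as stated in this section; the function name `icat` is an assumption, and it is not the StereoSet evaluation code. The values 91.11 and 61.93 are the overall LM and SS scores reported for the baseline in Table 1.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score: LMS scaled by how close SS is to 50."""
    return lms * min(ss, 100.0 - ss) / 50.0


# Properties stated in the text.
ideal = icat(100.0, 50.0)            # ideal model
always_stereo = icat(100.0, 100.0)   # always picks the stereotype
always_anti = icat(100.0, 0.0)       # always picks the anti-stereotype
random_model = icat(50.0, 50.0)      # completely random model

# Baseline overall LM and SS from Table 1.
baseline = round(icat(91.11, 61.93), 2)
```

The baseline value rounds to 69.37, which agrees with the overall ICAT reported for the unmodified model. For the fine-tuned rows, the table reports averages over ten runs, so recomputing ICAT from the averaged LMS and SS need not reproduce the tabulated ICAT exactly.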
StereoSet Intrasentence Scores

                                       Overall               Gender                Profession            Race                  Religion
Modifications                          LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT
Baseline (unmodified)                  91.11  61.93  69.37   93.28  62.67  69.65   92.29  63.97  66.50   89.76  60.35  71.18   88.46  58.02  74.27
LN                                     92.32  61.24  71.57   92.62  60.07  73.96   93.61  61.30  72.45   91.47  61.73  70.01   88.74  58.57  73.51
LN + WPE                               92.31  61.04  71.93   92.61  60.34  73.45   93.77  61.17  72.81   91.33  61.38  70.54   88.45  57.91  74.45
LN + WPE + WTE                         90.18  60.89  70.54   91.60  64.71  64.64   91.71  61.12  71.31   88.90  60.04  71.05   85.54  56.05  75.20
LN + WPE + WTE + Input/Output Layer    90.79  60.88  71.03   91.08  66.08  61.79   92.15  60.69  72.45   89.72  60.10  71.60   89.05  54.85  80.45
Full Model Unfrozen                    91.22  61.41  70.40   92.53  61.47  71.31   92.80  62.46  69.67   89.89  60.87  70.34   87.04  57.27  74.38

Table 1: Various model combinations and their corresponding StereoSet Intrasentence scores. The baseline is an
unmodified GPT-2 model. Models with LN fine-tune the layer norm parameters. Models with WPE fine-tune the
word positioning embeddings. Models with WTE fine-tune the word embeddings. Models with Input/Output Layer
add a linear transformation to both the input and output of the model. All other parameters in the modified models
remained frozen. Each experiment was run n=10 times, with their average displayed in the table. The best score for
each column is bold. See Table 4 for the standard deviations of each cell.
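
The LN, WPE, and WTE configurations named in the caption amount to freezing every parameter and then re-enabling gradients for selected groups. The sketch below shows one way this could be done; it is not the authors' code. It assumes parameter names of the form used by the Hugging Face GPT-2 implementation (`wte`, `wpe`, `ln_1`, `ln_2`, `ln_f`), and the helper name `configure_trainable` is hypothetical. Stand-in objects replace real tensors so that the example runs without PyTorch; the added input and output layers are not covered.

```python
from types import SimpleNamespace


def configure_trainable(named_params, ln=False, wpe=False, wte=False):
    """Freeze all parameters, then unfreeze the selected groups by name.

    named_params: iterable of (name, param) pairs; each param must expose
    a writable requires_grad attribute.
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        parts = name.split(".")
        is_ln = any(p in ("ln_1", "ln_2", "ln_f") for p in parts)
        is_wpe = "wpe" in parts
        is_wte = "wte" in parts
        param.requires_grad = (
            (ln and is_ln) or (wpe and is_wpe) or (wte and is_wte)
        )
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters (assumed names, one transformer block only).
names = [
    "transformer.wte.weight",
    "transformer.wpe.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.ln_2.bias",
    "transformer.h.0.mlp.c_fc.weight",
    "transformer.ln_f.weight",
]
params = [(n, SimpleNamespace(requires_grad=True)) for n in names]

# The "LN + WPE" configuration from the table.
ln_wpe = configure_trainable(params, ln=True, wpe=True)
```

With a real PyTorch model, the same helper could presumably be applied to `model.named_parameters()`, since each parameter there exposes a `requires_grad` flag.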
ize any groups. We propose a method of mitigating gender bias in a GPT-2 language model by fine-tuning less than 1% of the original model on a curated training set of only 3,680 sentences. Through the StereoSet quantitative benchmark, we demonstrate that fine-tuning can help to reduce model prejudice at scale while preventing catastrophic forgetting. Future work may look at reducing prejudice in other demographics beyond the four types tested in StereoSet. We may also look into how much training data is required to effectively mitigate bias in these language models and what types of training data work best. Finally, we want to investigate the limitations of such methods and inquire if any prejudice is embedded in the model beyond what we measured in our initial experiments.

Acknowledgements

This work was supported in part by NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129, and the Understanding and Reducing Inequalities Initiative of the University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel. 2020. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained transformers as universal computation engines.

Muazzam Maqsood, Faria Nazir, Umair Khan, Farhan Aadil, Habibullah Jamal, Irfan Mehmood, and Oh-young Song. 2019. Transfer learning assisted classification and detection of Alzheimer’s disease stages using 3D MRI scans. Sensors, 19(11):2645.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35.

Aditya Krishna Menon and Robert C Williamson. 2018. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 107–118. PMLR.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.

Xiangyu Peng, Siyan Li, Spencer Frazier, and Mark Riedl. 2020. Reducing non-normative text generation from language models. In Proceedings of the 13th International Conference on Natural Language Generation, pages 374–383, Dublin, Ireland. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Inioluwa Deborah Raji and Joy Buolamwini. 2019. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 429–435, New York, NY, USA. Association for Computing Machinery.

Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. 2021. FairBatch: Batch selection for model fairness.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1670–1679, New York, New York, USA. PMLR.

Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D Sculley. 2017. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.

Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298.

Irene Solaiman and Christy Dennison. 2021. Process for adapting language models to society (PALMS) with values-targeted datasets.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. 2018. FairGAN: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 570–575. IEEE.

Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song. 2022. A survey of controllable text generation using transformer-based pre-trained language models.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 629–634, Minneapolis, Minnesota. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computational Linguistics.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018b. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium. Association for Computational Linguistics.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.

A Appendix

A.1 Hyperparameters

Modifications                          Learning Rate  Training Epochs  Dataset Test Loss
Baseline (unmodified)                  -              -                4.22
LN                                     0.003          6                3.32
LN + WPE                               0.003          6                3.32
LN + WPE + WTE                         0.0005         2                3.33
LN + WPE + WTE + Input/Output Layer    0.0006         2                3.23
Full Model Unfrozen                    0.0002         2                3.06

Table 3: Fine-tuning hyperparameters for the models. The average of n=10 trials was taken to measure the dataset test loss.

A.2 Example Output

Prompts taken from the StereoSet intrasentence benchmark. WARNING: The following are model outputs which contain offensive material.
StereoSet Intrasentence Scores with Standard Deviations

                                       Overall               Gender                Profession            Race                  Religion
Modifications                          LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT    LM     SS     ICAT
Baseline (unmodified)                  91.11  61.93  69.37   93.28  62.67  69.65   92.29  63.97  66.50   89.76  60.35  71.18   88.46  58.02  74.27
                                       ±0.00  ±0.00  ±0.00   ±0.00  ±0.00  ±0.00   ±0.00  ±0.00  ±0.00   ±0.00  ±0.00  ±0.00   ±0.00  ±0.00  ±0.00
LN                                     92.32  61.24  71.57   92.62  60.07  73.96   93.61  61.30  72.45   91.47  61.73  70.01   88.74  58.57  73.51
                                       ±0.16  ±0.45  ±0.83   ±0.48  ±1.29  ±2.43   ±0.13  ±0.80  ±1.49   ±0.27  ±0.52  ±1.07   ±0.93  ±1.94  ±3.26
LN + WPE                               92.31  61.04  71.93   92.61  60.34  73.45   93.77  61.17  72.81   91.33  61.38  70.54   88.45  57.91  74.45
                                       ±0.22  ±0.57  ±1.01   ±0.29  ±1.51  ±2.72   ±0.33  ±0.85  ±1.57   ±0.25  ±0.83  ±1.52   ±0.63  ±1.97  ±3.32
LN + WPE + WTE                         90.18  60.89  70.54   91.60  64.71  64.64   91.71  61.12  71.31   88.90  60.04  71.05   85.54  56.05  75.20
                                       ±0.11  ±0.32  ±0.56   ±0.56  ±0.84  ±1.35   ±0.22  ±0.40  ±0.71   ±0.21  ±0.39  ±0.65   ±0.55  ±1.18  ±2.09
LN + WPE + WTE + Input/Output Layer    90.79  60.88  71.03   91.08  66.08  61.79   92.15  60.69  72.45   89.72  60.10  71.60   89.05  54.85  80.45
                                       ±0.20  ±0.28  ±0.53   ±0.32  ±1.02  ±1.90   ±0.28  ±0.40  ±0.82   ±0.34  ±0.74  ±1.24   ±1.33  ±1.98  ±4.37
Full Model Unfrozen                    91.22  61.41  70.40   92.53  61.47  71.31   92.80  62.46  69.67   89.89  60.87  70.34   87.04  57.27  74.38
                                       ±0.36  ±0.52  ±0.91   ±0.71  ±0.79  ±1.82   ±0.40  ±0.95  ±1.61   ±0.42  ±0.54  ±0.93   ±0.94  ±2.27  ±3.86

Table 4: Same results as Table 1 but with standard deviations. Models with LN fine-tune the layer norm parameters.
Models with WPE fine-tune the word positioning embeddings. Models with WTE fine-tune the word embeddings.
Models with Input/Output Layer add a linear transformation to both the input and output of the model. All other
parameters in the modified models remained frozen. Each experiment was run n=10 times, with their average
displayed in the table. The best score for each column is bold.