Universal Sentence Encoder
Daniel Cer^a, Yinfei Yang^a, Sheng-yi Kong^a, Nan Hua^a, Nicole Limtiaco^b,
Rhomni St. John^a, Noah Constant^a, Mario Guajardo-Céspedes^a, Steve Yuan^c,
Chris Tar^a, Yun-Hsuan Sung^a, Brian Strope^a, Ray Kurzweil^a

^a Google Research, Mountain View, CA   ^b Google Research, New York, NY   ^c Google, Cambridge, MA
Abstract
We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks.
Table 2: Model performance on transfer tasks. USE T is the universal sentence encoder (USE) using the Transformer. USE D is the universal sentence encoder using the DAN model. Models tagged with w2v w.e. use pre-trained word2vec skip-gram embeddings for the transfer task model, while models tagged with lrn w.e. use randomly initialized word embeddings that are learned only on the transfer task data. Accuracy is reported for all evaluations except STS Benchmark, where we report the Pearson correlation of the similarity scores with human judgments. Pairwise similarity scores are computed directly from the sentence embeddings of the universal sentence encoder, as in Eq. (1).
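For reference, the pairwise scoring mentioned in the caption can be sketched in a few lines of Python. This is an illustrative re-implementation rather than the released code, and it assumes Eq. (1) is the arccos-based angular similarity applied directly to the sentence embeddings; the embed() call in the usage comment is a stand-in for any encoder returning fixed-length vectors.

import numpy as np

def angular_similarity(u, v):
    # Eq. (1)-style score: cosine similarity mapped through arccos so that
    # scores remain discriminative for highly similar sentence pairs.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating point drift
    return 1.0 - np.arccos(cos) / np.pi

# Usage sketch (embed is a hypothetical sentence encoder):
# u, v = embed("a sentence"), embed("another sentence")
# score = angular_similarity(u, v)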
to the transfer task classification layers. For completeness, we also explore concatenating the representations from sentence level transfer models with the baseline models that do not make use of word level transfer learning.

6 Experiments

Transfer task model hyperparameters are tuned using a combination of Vizier (Golovin et al., 2017) and light manual tuning. When available, model hyperparameters are tuned using task dev sets. Otherwise, hyperparameters are tuned by cross-validation on the task training data when available, or on the evaluation test data when neither training nor dev data are provided. Training repeats ten times for each transfer task model with different randomly initialized weights, and we report evaluation results averaged across runs.

Transfer learning is critically important when training data for a target task is limited. We explore the impact on task performance of varying the amount of training data available for the task, both with and without the use of transfer learning. Contrasting the transformer and DAN based encoders, we demonstrate trade-offs in model complexity and the amount of data required to reach a desired level of accuracy on a task.

To assess bias in our encoding models, we evaluate the strength of various associations learned by our model on WEAT word lists. We compare our results to those of Caliskan et al. (2017), who discovered that word embeddings could be used to reproduce human performance on implicit association tasks for both benign and potentially undesirable associations.

7 Results

Transfer task performance is summarized in Table 2. We observe that transfer learning from the transformer based sentence encoder usually performs as well as or better than transfer learning from the DAN encoder. However, transfer learning using the simpler and faster DAN encoder can, for some tasks, perform as well as or better than the more sophisticated transformer encoder. Models that make use of sentence level transfer learning tend to perform better than models that only use word level transfer. The best performance on most tasks is obtained by models that make use of both sentence and word level transfer.

Table 3 illustrates transfer task performance for varying amounts of training data. We observe that, for smaller quantities of data, sentence level transfer learning can achieve surprisingly good task performance.
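The repeated-training protocol described in Section 6 (ten runs per transfer model with different randomly initialized weights, results averaged across runs) amounts to a simple outer loop over random seeds. The sketch below is only meant to make that averaging explicit; train_and_eval and load_task are hypothetical helpers, not functions from the paper.

import numpy as np

def average_over_runs(train_and_eval, task_data, n_runs=10):
    # Train the same transfer-task model n_runs times with different random
    # initializations and report the mean (and spread) of the evaluation score.
    scores = [train_and_eval(task_data, seed=run) for run in range(n_runs)]
    return float(np.mean(scores)), float(np.std(scores))

# Usage sketch (hypothetical helpers):
# mean_acc, std_acc = average_over_runs(train_and_eval, load_task("SST"))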
Model SST 1k SST 2k SST 4k SST 8k SST 16k SST 32k SST 67.3k
Sentence & Word Embedding Transfer Learning
USE D+DNN (w2v w.e.) 78.65 78.68 79.07 81.69 81.14 81.47 82.14
USE D+CNN (w2v w.e.) 77.79 79.19 79.75 82.32 82.70 83.56 85.29
USE T+DNN (w2v w.e.) 85.24 84.75 85.05 86.48 86.44 86.38 86.62
USE T+CNN (w2v w.e.) 84.44 84.16 84.77 85.70 85.22 86.38 86.69
Sentence Embedding Transfer Learning
USE D 77.47 76.38 77.39 79.02 78.38 77.79 77.62
USE T 84.85 84.25 85.18 85.63 85.83 85.59 85.38
USE D+DNN (lrn w.e.) 75.90 78.68 79.01 82.31 82.31 82.14 83.41
USE D+CNN (lrn w.e.) 77.28 77.74 79.84 81.83 82.64 84.24 85.27
USE T+DNN (lrn w.e.) 84.51 84.87 84.55 85.96 85.62 85.86 86.24
USE T+CNN (lrn w.e.) 82.66 83.73 84.23 85.74 86.06 86.97 87.21
Word Embedding Transfer Learning
DNN (w2v w.e.) 66.34 69.67 73.03 77.42 78.29 79.81 80.24
CNN (w2v w.e.) 68.10 71.80 74.91 78.86 80.83 81.98 83.74
Baselines with No Transfer Learning
DNN (lrn w.e.) 66.87 71.23 73.70 77.85 78.07 80.15 81.52
CNN (lrn w.e.) 67.98 71.81 74.90 79.14 81.04 82.72 84.90
Table 3: Task performance on SST for varying amounts of training data. SST 67.3k represents the full
training set. Using only 1,000 examples for training, transfer learning from USE T is able to obtain
performance that rivals many of the other models trained on the full 67.3 thousand example training set.
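To make the limited-data setting of Table 3 concrete, the sketch below trains a lightweight classifier on frozen sentence embeddings for progressively larger subsets of the training data. Scikit-learn's logistic regression stands in for the paper's DNN and CNN transfer models, and the precomputed embedding matrices are assumed inputs rather than anything produced by the released encoder.

import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_curve(train_emb, train_labels, test_emb, test_labels,
                   sizes=(1000, 2000, 4000, 8000, 16000, 32000)):
    # train_emb / test_emb: numpy arrays of frozen sentence embeddings,
    # shape (n_examples, embedding_dim); labels are integer class ids.
    results = {}
    for n in sizes:
        idx = np.random.RandomState(0).permutation(len(train_emb))[:n]
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_emb[idx], train_labels[idx])
        results[n] = clf.score(test_emb, test_labels)  # test accuracy at size n
    return results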
As the training set size increases, models that do not make use of transfer learning approach the performance of the other models.

Table 4 contrasts Caliskan et al. (2017)'s findings on bias within GloVe embeddings with the DAN variant of the universal encoder. Similar to GloVe, our model reproduces human associations between flowers vs. insects and pleasantness vs. unpleasantness. However, our model demonstrates weaker associations than GloVe for probes targeted at revealing ageism, racism and sexism.6 The differences in word association patterns can be attributed to differences in the training data composition and the mixture of tasks used to train the sentence embeddings.

6 Researchers and developers are strongly encouraged to independently verify whether biases in their overall model or model components impact their use case. For resources on ML fairness, visit https://developers.google.com/machine-learning/fairness-overview/.

7.1 Discussion

Transfer learning leads to performance improvements on many tasks. Using transfer learning is more critical when less training data is available. When task performance is close, the correct modeling choice should take into account engineering trade-offs regarding the memory and compute resource requirements introduced by the different models that could be used.

8 Resource Usage

This section describes memory and compute resource usage for the transformer and DAN sentence encoding models for different sentence lengths. Figure 2 plots model resource usage against sentence length.

Compute Usage: The transformer model time complexity is O(n²) in sentence length, while the DAN model is O(n). As seen in Figure 2 (a-b), for short sentences, the transformer encoding model is only moderately slower than the much simpler DAN model. However, compute time for the transformer increases noticeably as sentence length increases. In contrast, the compute time for the DAN model stays nearly constant as sentence length is increased. Since the DAN model is already remarkably computationally efficient, using GPUs over CPUs will often have a much larger practical impact for the transformer based encoder.

Memory Usage: The transformer model space complexity also scales quadratically, O(n²), in sentence length, while the DAN model space complexity is constant in the length of the sentence.
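The scaling behavior described above is easy to probe empirically. The sketch below times a generic encode(batch) callable (a stand-in for either the transformer or the DAN encoder) on synthetic inputs of increasing length; it illustrates the kind of measurement behind Figure 2 rather than the actual benchmark harness.

import time

def time_encoder(encode, lengths=(8, 16, 32, 64, 128, 256),
                 batch_size=32, repeats=5):
    # encode: callable taking a list of sentences and returning embeddings.
    timings = {}
    for n in lengths:
        batch = [" ".join(["token"] * n)] * batch_size  # synthetic sentences of length n
        start = time.perf_counter()
        for _ in range(repeats):
            encode(batch)
        timings[n] = (time.perf_counter() - start) / repeats  # seconds per batch
    return timings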
Figure 2: Model resource usage for both USE D and USE T at different batch sizes and sentence lengths. (a) CPU time vs. sentence length; (b) GPU time vs. sentence length; (c) memory vs. sentence length.
Target words                          | Attribute words                  | Ref | GloVe d | GloVe p | Uni. Enc. (DAN) d | Uni. Enc. (DAN) p
Eur.-American vs. Afr.-American names | Pleasant vs. Unpleasant 1        | a   | 1.41    | 10^-8   | 0.361             | 0.035
Eur.-American vs. Afr.-American names | Pleasant vs. Unpleasant from (a) | b   | 1.50    | 10^-4   | -0.372            | 0.87
Eur.-American vs. Afr.-American names | Pleasant vs. Unpleasant from (c) | b   | 1.28    | 10^-3   | 0.721             | 0.015
Male vs. female names                 | Career vs. family                | c   | 1.81    | 10^-3   | 0.0248            | 0.48
Math vs. arts                         | Male vs. female terms            | c   | 1.06    | 0.018   | 0.588             | 0.12
Science vs. arts                      | Male vs. female terms            | d   | 1.24    | 10^-2   | 0.236             | 0.32
Mental vs. physical disease           | Temporary vs. permanent          | e   | 1.38    | 10^-2   | 1.60              | 0.0027
Young vs. old people's names          | Pleasant vs. unpleasant          | c   | 1.21    | 10^-2   | 1.01              | 0.022
Flowers vs. insects                   | Pleasant vs. Unpleasant          | a   | 1.50    | 10^-7   | 1.38              | 10^-7
Instruments vs. Weapons               | Pleasant vs. Unpleasant          | a   | 1.53    | 10^-7   | 1.44              | 10^-7

Table 4: Word Embedding Association Tests (WEAT) for GloVe and the Universal Encoder. Effect size is reported as Cohen's d over the mean cosine similarity scores across grouped attribute words. Statistical significance is reported as one-tailed p-values. The letters in the Ref column indicate the source of the IAT word lists: (a) Greenwald et al. (1998), (b) Bertrand and Mullainathan (2004), (c) Nosek et al. (2002a), (d) Nosek et al. (2002b), (e) Monteith and Pettit (2011).
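For reference, the effect size in Table 4 follows the WEAT statistic of Caliskan et al. (2017): each target word is scored by the difference between its mean cosine similarity to the two attribute sets, and Cohen's d is the normalized difference of those scores between the two target sets. The sketch below is a minimal re-implementation of that statistic under this reading, not the paper's evaluation code; X, Y, A, and B are embedding matrices for the target and attribute word lists.

import numpy as np

def _cosine_matrix(W, V):
    # Pairwise cosine similarity between rows of W and rows of V.
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return W @ V.T

def weat_effect_size(X, Y, A, B):
    # s(w, A, B): mean cosine similarity to A minus mean cosine similarity to B.
    s_X = _cosine_matrix(X, A).mean(axis=1) - _cosine_matrix(X, B).mean(axis=1)
    s_Y = _cosine_matrix(Y, A).mean(axis=1) - _cosine_matrix(Y, B).mean(axis=1)
    # Cohen's d: difference of mean association scores between the target sets,
    # normalized by the standard deviation of scores over all target words.
    pooled = np.concatenate([s_X, s_Y])
    return (s_X.mean() - s_Y.mean()) / pooled.std(ddof=1)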
Similar to compute usage, memory usage for the transformer model increases quickly with sentence length, while the memory usage for the DAN model remains constant. We note that, for the DAN model, memory usage is dominated by the parameters used to store the model's unigram and bigram embeddings. Since the transformer model only needs to store unigram embeddings, for short sequences it requires nearly half as much memory as the DAN model.

9 Conclusion

Transfer learning is most helpful when limited training data is available for the transfer task. The encoding models make different trade-offs regarding accuracy and model complexity that should be considered when choosing the best model for a particular application. The pre-trained encoding models will be made publicly available for research and use in applications that can benefit from a better understanding of natural language.

Acknowledgments
References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of SemEval-2017.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. 2017. Google Vizier: A service for black-box optimization. In Proceedings of KDD '17.

Anthony G. Greenwald, Debbie E. McGhee, and Jordan L. K. Schwartz. 1998. Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6).

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. CoRR, abs/1705.00652.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD '04.

Lindsey L. Monteith and Jeremy W. Pettit. 2011. Implicit and explicit stigmatizing attitudes and stereotypes about depression. Journal of Social and Clinical Psychology, 30(5).

Brian A. Nosek, Mahzarin R. Banaji, and Anthony G. Greenwald. 2002a. Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics, 6(1).

Brian A. Nosek, Mahzarin R. Banaji, and Anthony G. Greenwald. 2002b. Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology, 83(1).

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL'05.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.