
Universal Sentence Encoder

Daniel Cer [a], Yinfei Yang [a], Sheng-yi Kong [a], Nan Hua [a], Nicole Limtiaco [b],
Rhomni St. John [a], Noah Constant [a], Mario Guajardo-Céspedes [a], Steve Yuan [c],
Chris Tar [a], Yun-Hsuan Sung [a], Brian Strope [a], Ray Kurzweil [a]

[a] Google Research, Mountain View, CA
[b] Google Research, New York, NY
[c] Google, Cambridge, MA

arXiv:1803.11175v2 [cs.CL] 12 Apr 2018

Abstract

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines that do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.

Figure 1: Sentence similarity scores using embeddings from the universal sentence encoder.

1 Introduction

Limited amounts of training data are available for many NLP tasks. This presents a challenge for data hungry deep learning methods. Given the high cost of annotating supervised training data, very large training sets are usually not available for most research or industry NLP tasks. Many models address the problem by implicitly performing limited transfer learning through the use of pre-trained word embeddings such as those produced by word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). However, recent work has demonstrated strong transfer task performance using pre-trained sentence level embeddings (Conneau et al., 2017).

In this paper, we present two models for producing sentence embeddings that demonstrate good transfer to a number of other NLP tasks. We include experiments with varying amounts of transfer task training data to illustrate the relationship between transfer task performance and training set size. We find that our sentence embeddings can be used to obtain surprisingly good task performance with remarkably little task specific training data. The sentence encoding models are made publicly available on TF Hub.

Engineering characteristics of models used for transfer learning are an important consideration. We discuss modeling trade-offs regarding memory requirements as well as compute time on CPU and GPU. Resource consumption comparisons are made for sentences of varying lengths.
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/"
                   "universal-sentence-encoder/1")

embedding = embed([
    "The quick brown fox jumps over the lazy dog."])

Listing 1: Python example code for using the universal sentence encoder.

2 Model Toolkit

We make available two new models for encoding sentences into embedding vectors. One makes use of the transformer (Vaswani et al., 2017) architecture, while the other is formulated as a deep averaging network (DAN) (Iyyer et al., 2015). Both models are implemented in TensorFlow (Abadi et al., 2016) and are available to download from TF Hub [1]:

https://tfhub.dev/google/universal-sentence-encoder/1

[1] The encoding model for the DAN based encoder is already available. The transformer based encoder will be made available at a later point.

The models take as input English strings and produce as output a fixed dimensional embedding representation of the string. Listing 1 provides a minimal code snippet to convert a sentence into a tensor containing its sentence embedding. The embedding tensor can be used directly or incorporated into larger model graphs for specific tasks [2].

[2] Visit https://colab.research.google.com/ to try the code snippet in Listing 1. Example code and documentation are available on the universal encoder website provided above.

As illustrated in Figure 1, the sentence embeddings can be trivially used to compute sentence level semantic similarity scores that achieve excellent performance on the semantic textual similarity (STS) Benchmark (Cer et al., 2017). When included within larger models, the sentence encoding models can be fine tuned for specific tasks using gradient based updates.
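To make Listing 1 concrete, the sketch below shows one way to evaluate the embedding tensor and reproduce a Figure 1 style pairwise similarity comparison. It assumes a TF1-style runtime in which hub.Module is available, as in Listing 1; the example sentences and the explicit cosine normalization are illustrative choices, not taken from the paper.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the module as in Listing 1 and build the embedding op for a small
# batch of illustrative sentences.
embed = hub.Module("https://tfhub.dev/google/"
                   "universal-sentence-encoder/1")
messages = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps over a sleepy dog.",
    "The stock market fell sharply today.",
]
embeddings = embed(messages)

with tf.Session() as sess:
    # hub.Module adds variables and lookup tables that must be initialized.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embeddings)  # shape: (3, 512)

# Pairwise cosine similarities, in the style of the Figure 1 heatmap.
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity_matrix = np.inner(normed, normed)
print(np.round(similarity_matrix, 3))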
3 Encoders

We introduce the model architecture for our two encoding models in this section. Our two encoders have different design goals. One based on the transformer architecture targets high accuracy at the cost of greater model complexity and resource consumption. The other targets efficient inference with slightly reduced accuracy.

3.1 Transformer

The transformer based sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture (Vaswani et al., 2017). This sub-graph uses attention to compute context aware representations of words in a sentence that take into account both the ordering and identity of all the other words. The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position [3]. The encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional vector as the sentence embedding.

[3] We then divide by the square root of the length of the sentence so that the differences between short sentences are not dominated by sentence length effects.

The encoding model is designed to be as general purpose as possible. This is accomplished by using multi-task learning whereby a single encoding model is used to feed multiple downstream tasks. The supported tasks include: a Skip-Thought like task (Kiros et al., 2015) for the unsupervised learning from arbitrary running text; a conversational input-response task for the inclusion of parsed conversational data (Henderson et al., 2017); and classification tasks for training on supervised data. The Skip-Thought task replaces the LSTM (Hochreiter and Schmidhuber, 1997) used in the original formulation with a model based on the Transformer architecture.

As will be shown in the experimental results below, the transformer based encoder achieves the best overall transfer task performance. However, this comes at the cost of compute time and memory usage scaling dramatically with sentence length.
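As a minimal sketch of the pooling step described above, including the length normalization from footnote [3]: the context aware word representations here are random stand-ins for the output of the transformer sub-graph, which is not reproduced.

import numpy as np

def pool_to_sentence_embedding(word_representations):
    """Reduce context aware word vectors of shape (n_words, 512) to one 512-d vector.

    Element-wise sum over word positions, scaled by 1/sqrt(sentence length)
    so that short and long sentences produce embeddings of comparable magnitude.
    """
    n_words = word_representations.shape[0]
    return word_representations.sum(axis=0) / np.sqrt(n_words)

# Stand-in for the transformer sub-graph output for a 7-token sentence.
contextual = np.random.randn(7, 512).astype(np.float32)
sentence_embedding = pool_to_sentence_embedding(contextual)  # shape: (512,)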
3.2 Deep Averaging Network (DAN)

The second encoding model makes use of a deep averaging network (DAN) (Iyyer et al., 2015) whereby input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings. Similar to the Transformer encoder, the DAN encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional sentence embedding. The DAN encoder is trained similarly to the Transformer based encoder. We make use of multitask learning whereby a single DAN encoder is used to supply sentence embeddings for multiple downstream tasks.

The primary advantage of the DAN encoder is that compute time is linear in the length of the input sequence. Similar to Iyyer et al. (2015), our results demonstrate that DANs achieve strong baseline performance on text classification tasks.
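The DAN computation can be sketched as follows. The layer count, widths, and ReLU nonlinearity are assumptions for illustration; the paper does not specify the DAN's internal dimensions.

import numpy as np

def dan_encode(unigram_vecs, bigram_vecs, hidden_weights):
    """Sketch of a deep averaging network (DAN) sentence encoder.

    unigram_vecs: (n_tokens, d) embeddings for each token.
    bigram_vecs:  (n_tokens - 1, d) embeddings for each adjacent token pair.
    hidden_weights: list of (d_in, d_out) matrices; hypothetical sizes.
    """
    # Average the unigram and bigram input embeddings together.
    pooled = np.concatenate([unigram_vecs, bigram_vecs], axis=0).mean(axis=0)
    # Pass the average through a feedforward DNN (biases omitted for brevity).
    hidden = pooled
    for weights in hidden_weights:
        hidden = np.maximum(0.0, hidden @ weights)  # ReLU nonlinearity (assumed)
    return hidden  # final layer sized to yield the 512-d sentence embedding

d = 512
layers = [np.random.randn(d, d) * 0.01 for _ in range(3)]  # hypothetical 3-layer DNN
tokens = np.random.randn(9, d)
bigrams = np.random.randn(8, d)
embedding = dan_encode(tokens, bigrams, layers)  # shape: (512,)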
3.3 Encoder Training Data

Unsupervised training data for the sentence encoding models are drawn from a variety of web sources. The sources are Wikipedia, web news, web question-answer pages and discussion forums. We augment unsupervised learning with training on supervised data from the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015). Similar to the findings of Conneau et al. (2017), we observe that training to SNLI improves transfer performance.

4 Transfer Tasks

This section presents an overview of the data used for the transfer learning experiments and the Word Embedding Association Test (WEAT) data used to characterize model bias [4]. Table 1 summarizes the number of samples provided by the test portion of each evaluation set and, when available, the size of the dev and training data.

[4] For the datasets MR, CR, SUBJ, SST, and TREC we use the preparation of the data provided by Conneau et al. (2017).

MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).

CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).

SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).

MPQA: Phrase level opinion polarity from news data (Wiebe et al., 2005).

TREC: Fine grained question classification sourced from TREC (Li and Roth, 2002).

SST: Binary phrase level sentiment classification (Socher et al., 2013).

STS Benchmark: Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).

WEAT: Word pairs from the psychology literature on implicit association tests (IAT) that are used to characterize model bias (Caliskan et al., 2017).

Dataset     Train   Dev    Test
SST         67,349  872    1,821
STS Bench   5,749   1,500  1,379
TREC        5,452   -      500
MR          -       -      10,662
CR          -       -      3,775
SUBJ        -       -      10,000
MPQA        -       -      10,606

Table 1: Transfer task evaluation sets.

5 Transfer Learning Models

For sentence classification transfer tasks, the output of the transformer and DAN sentence encoders is provided to a task specific DNN. For the pairwise semantic similarity task, we directly assess the similarity of the sentence embeddings produced by our two encoders. As shown in Eq. (1), we first compute the cosine similarity of the two sentence embeddings and then use arccos to convert the cosine similarity into an angular distance [5].

    sim(u, v) = 1 − arccos( (u · v) / (||u|| ||v||) ) / π        (1)

[5] We find that using a similarity based on angular distance performs better on average than raw cosine similarity.
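Eq. (1) translates directly into code. The clipping of the cosine value below is an added numerical safeguard rather than something specified in the paper.

import numpy as np

def angular_similarity(u, v):
    """Eq. (1): 1 - arccos(cosine_similarity(u, v)) / pi."""
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to the valid arccos domain to guard against floating point drift.
    cosine = np.clip(cosine, -1.0, 1.0)
    return 1.0 - np.arccos(cosine) / np.pi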
5.1 Baselines

For each transfer task, we include baselines that only make use of word level transfer and baselines that make use of no transfer learning at all. For word level transfer, we use word embeddings from a word2vec skip-gram model trained on a corpus of news data (Mikolov et al., 2013). The pretrained word embeddings are included as input to two model types: a convolutional neural network model (CNN) (Kim, 2014) and a DAN. The baselines that use pretrained word embeddings allow us to contrast word versus sentence level transfer. Additional baseline CNN and DAN models are trained without using any pretrained word or sentence embeddings.
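A sketch of one of the word level baselines described above, written with tf.keras for brevity; the paper's implementation details, filter widths, and feature sizes are not given, so those below are assumptions. Passing a pretrained word2vec matrix corresponds to the w2v w.e. baselines, while omitting it corresponds to the lrn w.e. baselines.

import tensorflow as tf

def build_word_cnn_baseline(vocab_size, num_classes, embedding_matrix=None):
    """Kim (2014) style CNN baseline over word embeddings (sketch only)."""
    words = tf.keras.Input(shape=(None,), dtype="int32")
    if embedding_matrix is not None:
        # Initialize from pretrained word2vec vectors ("w2v w.e.").
        embed = tf.keras.layers.Embedding(
            vocab_size, embedding_matrix.shape[1],
            weights=[embedding_matrix])(words)
    else:
        # Randomly initialized embeddings learned on the task data ("lrn w.e.").
        embed = tf.keras.layers.Embedding(vocab_size, 300)(words)
    pooled = []
    for width in (3, 4, 5):  # assumed filter widths
        conv = tf.keras.layers.Conv1D(128, width, activation="relu")(embed)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
    features = tf.keras.layers.Concatenate()(pooled)
    logits = tf.keras.layers.Dense(num_classes)(features)
    return tf.keras.Model(words, logits)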
5.2 Combined Transfer Models

We explore combining the sentence and word level transfer models by concatenating their representations prior to feeding the combined representation to the transfer task classification layers. For completeness, we also explore concatenating the representations from sentence level transfer models with the baseline models that do not make use of word level transfer learning.
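The combination described above can be sketched as a simple concatenation ahead of the task specific classification layers. Input and hidden dimensions here are illustrative, and tf.keras is again used only for brevity; the sentence embedding and word level features are assumed to be computed upstream.

import tensorflow as tf

def build_combined_classifier(num_classes, sentence_dim=512, word_feature_dim=384):
    """Concatenate a sentence embedding with word level features (sketch only)."""
    sentence_embedding = tf.keras.Input(shape=(sentence_dim,))
    word_features = tf.keras.Input(shape=(word_feature_dim,))
    combined = tf.keras.layers.Concatenate()(
        [sentence_embedding, word_features])
    hidden = tf.keras.layers.Dense(256, activation="relu")(combined)  # assumed size
    logits = tf.keras.layers.Dense(num_classes)(hidden)
    return tf.keras.Model([sentence_embedding, word_features], logits)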
Model    MR    CR    SUBJ    MPQA    TREC    SST    STS Bench (dev / test)
Sentence & Word Embedding Transfer Learning
USE D+DAN (w2v w.e.) 77.11 81.71 93.12 87.01 94.72 82.14 –
USE D+CNN (w2v w.e.) 78.20 82.04 93.24 85.87 97.67 85.29 –
USE T+DAN (w2v w.e.) 81.32 86.66 93.90 88.14 95.51 86.62 –
USE T+CNN (w2v w.e.) 81.18 87.45 93.58 87.32 98.07 86.69 –
Sentence Embedding Transfer Learning
USE D 74.45 80.97 92.65 85.38 91.19 77.62 0.763 / 0.719 (r)
USE T 81.44 87.43 93.87 86.98 92.51 85.38 0.814 / 0.782 (r)
USE D+DAN (lrn w.e.) 77.57 81.93 92.91 85.97 95.86 83.41 –
USE D+CNN (lrn w.e.) 78.49 81.49 92.99 85.53 97.71 85.27 –
USE T+DAN (lrn w.e.) 81.36 86.08 93.66 87.14 96.60 86.24 –
USE T+CNN (lrn w.e.) 81.59 86.45 93.36 86.85 97.44 87.21 –
Word Embedding Transfer Learning
DAN (w2v w.e.) 74.75 75.24 90.80 81.25 85.69 80.24 –
CNN (w2v w.e.) 75.10 80.18 90.84 81.38 97.32 83.74 –
Baselines with No Transfer Learning
DAN (lrn w.e.) 75.97 76.91 89.49 80.93 93.88 81.52 –
CNN (lrn w.e.) 76.39 79.39 91.18 82.20 95.82 84.90 –

Table 2: Model performance on transfer tasks. USE T is the universal sentence encoder (USE) using Transformer. USE D is the universal encoder DAN model. Models tagged with w2v w.e. make use of pretrained word2vec skip-gram embeddings for the transfer task model, while models tagged with lrn w.e. use randomly initialized word embeddings that are learned only on the transfer task data. Accuracy is reported for all evaluations except STS Bench, where we report the Pearson correlation of the similarity scores with human judgments. Pairwise similarity scores are computed directly using the sentence embeddings from the universal sentence encoder as in Eq. (1).

6 Experiments

Transfer task model hyperparameters are tuned using a combination of Vizier (Golovin et al.) and light manual tuning. When available, model hyperparameters are tuned using task dev sets. Otherwise, hyperparameters are tuned by cross-validation on the task training data when available or the evaluation test data when neither training nor dev data are provided. Training repeats ten times for each transfer task model with different randomly initialized weights and we report evaluation results by averaging across runs.

Transfer learning is critically important when training data for a target task is limited. We explore the impact on task performance of varying the amount of training data available for the task both with and without the use of transfer learning. Contrasting the transformer and DAN based encoders, we demonstrate trade-offs in model complexity and the amount of data required to reach a desired level of accuracy on a task.

To assess bias in our encoding models, we evaluate the strength of various associations learned by our model on WEAT word lists. We compare our result to those of Caliskan et al. (2017), who discovered that word embeddings could be used to reproduce human performance on implicit association tasks for both benign and potentially undesirable associations.

7 Results

Transfer task performance is summarized in Table 2. We observe that transfer learning from the transformer based sentence encoder usually performs as well as or better than transfer learning from the DAN encoder. However, transfer learning using the simpler and faster DAN encoder can for some tasks perform as well as or better than the more sophisticated transformer encoder. Models that make use of sentence level transfer learning tend to perform better than models that only use word level transfer. The best performance on most tasks is obtained by models that make use of both sentence and word level transfer.

Table 3 illustrates transfer task performance for varying amounts of training data. We observe that, for smaller quantities of data, sentence level transfer learning can achieve surprisingly good task performance.
Model SST 1k SST 2k SST 4k SST 8k SST 16k SST 32k SST 67.3k
Sentence & Word Embedding Transfer Learning
USE D+DNN (w2v w.e.) 78.65 78.68 79.07 81.69 81.14 81.47 82.14
USE D+CNN (w2v w.e.) 77.79 79.19 79.75 82.32 82.70 83.56 85.29
USE T+DNN (w2v w.e.) 85.24 84.75 85.05 86.48 86.44 86.38 86.62
USE T+CNN (w2v w.e.) 84.44 84.16 84.77 85.70 85.22 86.38 86.69
Sentence Embedding Transfer Learning
USE D 77.47 76.38 77.39 79.02 78.38 77.79 77.62
USE T 84.85 84.25 85.18 85.63 85.83 85.59 85.38
USE D+DNN (lrn w.e.) 75.90 78.68 79.01 82.31 82.31 82.14 83.41
USE D+CNN (lrn w.e.) 77.28 77.74 79.84 81.83 82.64 84.24 85.27
USE T+DNN (lrn w.e.) 84.51 84.87 84.55 85.96 85.62 85.86 86.24
USE T+CNN (lrn w.e.) 82.66 83.73 84.23 85.74 86.06 86.97 87.21
Word Embedding Transfer Learning
DNN (w2v w.e.) 66.34 69.67 73.03 77.42 78.29 79.81 80.24
CNN (w2v w.e.) 68.10 71.80 74.91 78.86 80.83 81.98 83.74
Baselines with No Transfer Learning
DNN (lrn w.e.) 66.87 71.23 73.70 77.85 78.07 80.15 81.52
CNN (lrn w.e.) 67.98 71.81 74.90 79.14 81.04 82.72 84.90

Table 3: Task performance on SST for varying amounts of training data. SST 67.3k represents the full
training set. Using only 1,000 examples for training, transfer learning from USE T is able to obtain
performance that rivals many of the other models trained on the full 67.3 thousand example training set.

As the training set size increases, models that do not make use of transfer learning approach the performance of the other models.

Table 4 contrasts Caliskan et al. (2017)'s findings on bias within GloVe embeddings with the DAN variant of the universal encoder. Similar to GloVe, our model reproduces human associations between flowers vs. insects and pleasantness vs. unpleasantness. However, our model demonstrates weaker associations than GloVe for probes targeted at revealing ageism, racism and sexism [6]. The differences in word association patterns can be attributed to differences in the training data composition and the mixture of tasks used to train the sentence embeddings.

[6] Researchers and developers are strongly encouraged to independently verify whether biases in their overall model or model components impact their use case. For resources on ML fairness, visit https://developers.google.com/machine-learning/fairness-overview/.

7.1 Discussion

Transfer learning leads to performance improvements on many tasks. Using transfer learning is more critical when less training data is available. When task performance is close, the correct modeling choice should take into account engineering trade-offs regarding the memory and compute resource requirements introduced by the different models that could be used.

8 Resource Usage

This section describes memory and compute resource usage for the transformer and DAN sentence encoding models for different sentence lengths. Figure 2 plots model resource usage against sentence length.

Compute Usage: The transformer model time complexity is O(n^2) in sentence length, while the DAN model is O(n). As seen in Figure 2 (a-b), for short sentences, the transformer encoding model is only moderately slower than the much simpler DAN model. However, compute time for the transformer increases noticeably as sentence length increases. In contrast, the compute time for the DAN model stays nearly constant as sentence length is increased. Since the DAN model is remarkably computationally efficient, using GPUs over CPUs will often have a much larger practical impact for the transformer based encoder.

Memory Usage: The transformer model space complexity also scales quadratically, O(n^2), in sentence length, while the DAN model space complexity is constant in the length of the sentence. Similar to compute usage, memory usage for the transformer model increases quickly with sentence length, while the memory usage for the DAN model remains constant. We note that, for the DAN model, memory usage is dominated by the parameters used to store the model unigram and bigram embeddings. Since the transformer model only needs to store unigram embeddings, for short sequences it requires nearly half as much memory as the DAN model.
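A rough sketch of how a compute-time comparison in the style of Figure 2(a) might be reproduced; the synthetic repeated-token sentences, batch size, and repeat count below are arbitrary choices, not the paper's benchmarking protocol.

import time
import tensorflow as tf
import tensorflow_hub as hub

def time_encoding(module_url, token="word", lengths=(8, 64, 256), repeats=10):
    """Measure wall-clock encode time for synthetic sentences of varying length."""
    embed = hub.Module(module_url)
    inputs = tf.placeholder(tf.string, shape=[None])
    embeddings = embed(inputs)
    # Synthetic inputs: repeat a single token to reach each target length.
    batches = {n: [" ".join([token] * n)] * 32 for n in lengths}  # batch of 32
    timings = {}
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        for n, batch in batches.items():
            sess.run(embeddings, feed_dict={inputs: batch})  # warm-up run
            start = time.time()
            for _ in range(repeats):
                sess.run(embeddings, feed_dict={inputs: batch})
            timings[n] = (time.time() - start) / repeats
    return timings

print(time_encoding("https://tfhub.dev/google/universal-sentence-encoder/1"))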
Figure 2: Model resource usage for both USE D and USE T at different batch sizes and sentence lengths. (a) CPU time vs. sentence length. (b) GPU time vs. sentence length. (c) Memory vs. sentence length.
Target words                            Attrib. words                      Ref   GloVe d   GloVe p   Uni. Enc. (DAN) d   Uni. Enc. (DAN) p
Eur.-American vs. Afr.-American names   Pleasant vs. Unpleasant 1          a     1.41      10^-8     0.361               0.035
Eur.-American vs. Afr.-American names   Pleasant vs. Unpleasant from (a)   b     1.50      10^-4     -0.372              0.87
Eur.-American vs. Afr.-American names   Pleasant vs. Unpleasant from (c)   b     1.28      10^-3     0.721               0.015
Male vs. female names                   Career vs. family                  c     1.81      10^-3     0.0248              0.48
Math vs. arts                           Male vs. female terms              c     1.06      0.018     0.588               0.12
Science vs. arts                        Male vs. female terms              d     1.24      10^-2     0.236               0.32
Mental vs. physical disease             Temporary vs. permanent            e     1.38      10^-2     1.60                0.0027
Young vs. old peoples names             Pleasant vs. unpleasant            c     1.21      10^-2     1.01                0.022
Flowers vs. insects                     Pleasant vs. Unpleasant            a     1.50      10^-7     1.38                10^-7
Instruments vs. Weapons                 Pleasant vs. Unpleasant            a     1.53      10^-7     1.44                10^-7

Table 4: Word Embedding Association Tests (WEAT) for GloVe and the Universal Encoder. Effect size is reported as Cohen's d over the mean cosine similarity scores across grouped attribute words. Statistical significance is reported for one-tailed p-scores. The letters in the Ref column indicate the source of the IAT word lists: (a) Greenwald et al. (1998), (b) Bertrand and Mullainathan (2004), (c) Nosek et al. (2002a), (d) Nosek et al. (2002b), (e) Monteith and Pettit (2011).
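For reference, the WEAT effect size reported in Table 4 can be computed as below, following the formulation of Caliskan et al. (2017); the permutation test that yields the p-values is omitted, and the use of the sample standard deviation is an implementation choice.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size (Cohen's d), following Caliskan et al. (2017).

    X, Y: lists of embedding vectors for the two target word groups.
    A, B: lists of embedding vectors for the two attribute word groups.
    A word's association is its mean cosine to A minus its mean cosine to B;
    the effect size is the standardized difference of mean association
    between the two target groups.
    """
    def association(w):
        return (np.mean([cosine(w, a) for a in A]) -
                np.mean([cosine(w, b) for b in B]))
    x_assoc = np.array([association(x) for x in X])
    y_assoc = np.array([association(y) for y in Y])
    pooled = np.concatenate([x_assoc, y_assoc])
    return (x_assoc.mean() - y_assoc.mean()) / pooled.std(ddof=1)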

9 Conclusion

Both the transformer and DAN based universal encoding models provide sentence level embeddings that demonstrate strong transfer performance on a number of NLP tasks. The sentence level embeddings surpass the performance of transfer learning using word level embeddings alone. Models that make use of sentence and word level transfer achieve the best overall performance. We observe that transfer learning is most helpful when limited training data is available for the transfer task. The encoding models make different trade-offs regarding accuracy and model complexity that should be considered when choosing the best model for a particular application. The pre-trained encoding models will be made publicly available for research and use in applications that can benefit from a better understanding of natural language.

Acknowledgments

We thank our teammates from Descartes, Ai.h and other Google groups for their feedback and suggestions. Special thanks goes to Ben Packer and Yoni Halpern for implementing the WEAT assessments and discussions on model bias.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of USENIX OSDI'16.

Marianne Bertrand and Sendhil Mullainathan. 2004. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. The American Economic Review, 94(4).

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of SemEval-2017.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of KDD '17.

Anthony G. Greenwald, Debbie E. McGhee, and Jordan L. K. Schwartz. 1998. Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6).

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. CoRR, abs/1705.00652.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD '04.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of ACL/IJCNLP.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Proceedings of NIPS.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of COLING '02.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS'13.

Lindsey L. Monteith and Jeremy W. Pettit. 2011. Implicit and explicit stigmatizing attitudes and stereotypes about depression. Journal of Social and Clinical Psychology, 30(5).

Brian A. Nosek, Mahzarin R. Banaji, and Anthony G. Greenwald. 2002a. Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics, 6(1).

Brian A. Nosek, Mahzarin R. Banaji, and Anthony G. Greenwald. 2002b. Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology, 83(1).

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL'05.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.
