
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

StarSpace: Embed All The Things!

Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston
Facebook AI Research

Abstract

We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. In each case the model works by embedding those entities comprised of discrete features and comparing them against each other – learning similarities dependent on the task. Empirical results on a number of tasks show that StarSpace is highly competitive with existing methods, whilst also being generally applicable to new cases where those methods are not.

1 Introduction

We introduce StarSpace, a neural embedding model that is general enough to solve a wide variety of problems:

• Text classification, or other labeling tasks, e.g. sentiment classification.
• Ranking of sets of entities, e.g. ranking web documents given a query.
• Collaborative filtering-based recommendation, e.g. recommending documents, music or videos.
• Content-based recommendation where content is defined with discrete features, e.g. words of documents.
• Embedding graphs, e.g. multi-relational graphs such as Freebase.
• Learning word, sentence or document embeddings.

StarSpace can be viewed as a straightforward and efficient strong baseline for any of these tasks. In experiments it is shown to be on par with or to outperform several competing methods, whilst being generally applicable to cases where many of those methods are not.

The method works by learning entity embeddings with discrete feature representations from relations among collections of those entities, directly for the ranking or classification task of interest. In the general case, StarSpace embeds entities of different types into a vectorial embedding space, hence the “star” (“*”, meaning all types) and “space” in the name, and in that common space compares them against each other. It learns to rank a set of entities, documents or objects given a query entity, document or object, where the query is not necessarily of the same type as the items in the set.

We evaluate the quality of our approach on six different tasks, namely text classification, link prediction in knowledge bases, document recommendation, article search, sentence matching and learning general sentence embeddings. StarSpace is available as an open-source project at https://github.com/facebookresearch/Starspace.

2 Related Work

Latent text representations, or embeddings, are vectorial representations of words or documents, traditionally learned in an unsupervised way over large corpora. Work on neural embeddings in this domain includes (Bengio et al. 2003), (Collobert et al. 2011), word2vec (Mikolov et al. 2013) and, more recently, fastText (Bojanowski et al. 2017). In our experiments we compare to word2vec and fastText as representative scalable models for unsupervised embeddings; we also compare on the SentEval tasks (Conneau et al. 2017) against a wide range of unsupervised models for sentence embedding.

In the domain of supervised embeddings, SSI (Bai et al. 2009) and WSABIE (Weston, Bengio, and Usunier 2011) are early approaches that showed promise in NLP and information retrieval tasks (Weston et al. 2013; Hermann et al. 2014). Several more recent works, including (Tang, Qin, and Liu 2015), (Zhang and LeCun 2015), (Conneau et al. 2016), TagSpace (Weston, Chopra, and Adams 2014) and fastText (Joulin et al. 2016), have yielded good results on classification tasks such as sentiment analysis or hashtag prediction.

In the domain of recommendation, embedding models have had a large degree of success, starting from SVD (Goldberg et al. 2001) and its improvements such as SVD++ (Koren and Bell 2015), as well as a host of other techniques, e.g. (Rendle 2010; Lawrence and Urtasun 2009; Shi et al. 2012). Many of those methods have focused on the collaborative filtering setup where user IDs and movie IDs have individual embeddings, such as in the Netflix challenge setup (see e.g., (Koren and Bell 2015)), and so new users or items cannot naturally be incorporated. We show

how StarSpace can naturally cater for both that setting and the content-based setting where users and items are represented as features, and hence have natural out-of-sample extensions rather than considering only a fixed set.

Performing link prediction in knowledge bases (KBs) with embedding-based methods has also shown promising results in recent years. A series of work has been done in this direction, such as (Bordes et al. 2013) and (Garcia-Duran, Bordes, and Usunier 2015). In our work, we show that StarSpace can be used for this task as well, outperforming several methods, and matching the TransE method presented in (Bordes et al. 2013).

3 Model

The StarSpace model consists of learning entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. An entity such as a document or a sentence can be described by a bag of words or n-grams, an entity such as a user can be described by the bag of documents, movies or items they have liked, and so forth. Importantly, the StarSpace model is free to compare entities of different kinds. For example, a user entity can be compared with an item entity (recommendation), or a document entity with label entities (text classification), and so on. This is done by learning to embed them in the same space such that comparisons are meaningful – by optimizing with respect to the metric of interest.

Denoting the dictionary of D features as F, which is a D × d matrix where F_i indexes the i-th feature (row) yielding its d-dimensional embedding, we embed an entity a with $\sum_{i \in a} F_i$.

That is, like other embedding models, our model starts by assigning a d-dimensional vector to each of the discrete features in the set that we want to embed directly (which we call a dictionary; it can contain features like words, etc.). Entities comprised of features (such as documents) are represented by a bag-of-features of the features in the dictionary, and their embeddings are learned implicitly. Note an entity could consist of a single (unique) feature like a single word, name or user or item ID if desired.
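The following minimal sketch illustrates this bag-of-features lookup and the learned similarity in Python/NumPy. The toy dictionary, array names and functions are ours for illustration; they are not part of the released StarSpace code.

```python
import numpy as np

d = 5                                    # embedding dimension
dictionary = {"the": 0, "cat": 1, "sat": 2, "label__pets": 3}
F = np.random.randn(len(dictionary), d)  # D x d matrix of feature embeddings

def embed(entity_features):
    """Embed an entity (a bag of discrete features) as the sum of its feature rows."""
    rows = [F[dictionary[f]] for f in entity_features]
    return np.sum(rows, axis=0)

def sim(a_vec, b_vec, metric="cosine"):
    """Similarity between two embedded entities: cosine or inner product."""
    if metric == "dot":
        return float(np.dot(a_vec, b_vec))
    return float(np.dot(a_vec, b_vec) /
                 (np.linalg.norm(a_vec) * np.linalg.norm(b_vec) + 1e-8))

doc = embed(["the", "cat", "sat"])       # a document entity (bag of words)
lab = embed(["label__pets"])             # a label entity (singleton feature)
print(sim(doc, lab))
```

Because entities of all types share the same feature space, the same two functions embed and compare documents, labels, users or items alike.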
To train our model, we need to learn to compare entities. Specifically, we want to minimize the following loss function:

$$\sum_{(a,b)\in E^{+},\; b^{-}\in E^{-}} L^{batch}\big(sim(a,b),\ sim(a,b_{1}^{-}),\ \ldots,\ sim(a,b_{k}^{-})\big)$$

There are several ingredients to this recipe:

• The generator of positive entity pairs (a, b) coming from the set E+. This is task dependent and will be described subsequently.

• The generator of negative entities b−_i coming from the set E−. We utilize a k-negative sampling strategy (Mikolov et al. 2013) to select k such negative pairs for each batch update. We select randomly from within the set of entities that can appear in the second argument of the similarity function (e.g., for text labeling tasks a are documents and b are labels, so we sample b− from the set of labels). An analysis of the impact of k is given in Sec. 4.

• The similarity function sim(·, ·). In our system, we have implemented both cosine similarity and inner product, and the choice between them is selected as a hyperparameter. Generally, they work similarly well for small numbers of label features (e.g. for classification), while cosine works better for larger numbers, e.g. for sentence or document similarity.

• The loss function L^batch that compares the positive pair (a, b) with the negative pairs (a, b−_i), i = 1, . . . , k. We also implement two possibilities: margin ranking loss (i.e. max(0, μ − sim(a, b)), where μ is the margin parameter), and negative log loss of softmax. All experiments use the former as it performed on par or better.

We optimize by stochastic gradient descent (SGD), i.e., each SGD step is one sample from E+ in the outer sum, using Adagrad (Duchi, Hazan, and Singer 2011) and hogwild (Recht et al. 2011) over multiple CPUs. We also apply a max norm of the embeddings to restrict the vectors learned to lie in a ball of radius r in space R^d, as in other works, e.g. (Weston, Bengio, and Usunier 2011).
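As a rough illustration of one such SGD step, here is a minimal NumPy sketch using dot-product similarity, k-negative sampling, and a per-negative hinge of the form max(0, μ − sim(a, b) + sim(a, b−)), which is one common way to instantiate a margin ranking loss. It uses a plain gradient step; the real system uses Adagrad, hogwild parallelism and a max-norm constraint, and its exact loss bookkeeping may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, k, mu, lr = 1000, 50, 5, 0.1, 0.01
F = 0.01 * rng.standard_normal((D, d))    # feature embedding matrix (D x d)

def bag(feature_ids):
    return F[feature_ids].sum(axis=0)

def sgd_step(a_ids, b_ids, candidates):
    """One update on a positive pair (a, b) with k sampled negative entities.

    a_ids, b_ids: lists of feature ids; candidates: list of candidate negative bags.
    """
    a, b = bag(a_ids), bag(b_ids)
    grad_a = np.zeros(d)
    for _ in range(k):
        n_ids = candidates[rng.integers(len(candidates))]
        n = bag(n_ids)
        if mu - a @ b + a @ n > 0:         # margin violated: apply gradients
            grad_a += n - b
            F[b_ids] += lr * a             # pull positive features toward a
            F[n_ids] -= lr * a             # push negative features away from a
    F[a_ids] -= lr * grad_a
```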
At test time, one can use the learned function sim(·, ·) to measure similarity between entities. For example, for classification, a label is predicted at test time for a given input a using max_b̂ sim(a, b̂) over the set of possible labels b̂. Or, in general, for ranking one can sort entities by their similarity. Alternatively, the embedding vectors can be used directly for some other downstream task, e.g., as is typically done with word embedding models. However, if sim(·, ·) directly fits the needs of your application, this is recommended, as this is the objective that StarSpace is trained to be good at.

We now describe how this model can be applied to a wide variety of tasks, in each case describing how the generators E+ and E− work for that setting.

Multiclass Classification (e.g. Text Classification)  The positive pair generator comes directly from a training set of labeled data specifying (a, b) pairs where a are documents (bags-of-words) and b are labels (singleton features). Negative entities b− are sampled from the set of possible labels.

Multilabel Classification  In this case, each document a can have multiple positive labels; one of them is sampled as b at each SGD step to implement multilabel classification.
Collaborative Filtering-based Recommendation  The training data consists of a set of users, where each user is described by a bag of items (described as unique features from the dictionary) that the user likes. The positive pair generator picks a user, selects a to be the unique singleton feature for that user ID, and a single item that they like as b. Negative entities b− are sampled from the set of possible items.

Collaborative Filtering-based Recommendation with out-of-sample user extension  One problem with classical collaborative filtering is that it does not generalize to new users, as a separate embedding is learned for each user ID. Using the same training data as before, one can learn an alternative model using StarSpace. The positive pair generator

instead picks a user, selects a as all the items they like except one, and b as the left-out item. That is, the model learns to estimate if a user would like an item by modeling the user not as a single embedding based on their ID, but by representing the user as the sum of embeddings of items they like.

Content-based Recommendation  This task consists of a set of users, where each user is described by a bag of items, where each item is described by a bag of features from the dictionary (rather than being a unique feature). For example, for document recommendation, each user is described by the bag-of-documents they like, while each document is described by the bag-of-words it contains. Now a can be selected as all of the items except one, and b as the left-out item. The system now extends to both new items and new users, as both are featurized.
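A compact sketch of this leave-one-out pair generator (names are illustrative; for the content-based setting, each item would itself be a bag of word features rather than a single item ID):

```python
import random

def leave_one_out_pairs(user_histories, all_items):
    """Yield (a, b, negatives): a is everything a user liked except one held-out
    item b, so new users are handled by summing the features of their items."""
    while True:
        items = random.choice(user_histories)   # list of item ids one user likes
        if len(items) < 2:
            continue
        held_out = random.randrange(len(items))
        b = items[held_out]
        a = items[:held_out] + items[held_out + 1:]
        yield a, [b], [[i] for i in all_items if i != b]   # candidate negatives
```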
Multi-Relational Knowledge Graphs (e.g. Link Prediction)  Given a graph of (h, r, t) triples, consisting of a head concept h, a relation r and a tail concept t, e.g. (Beyoncé, born-in, Houston), one can learn embeddings of that graph. Instantiations of h, r and t are all defined as unique features in the dictionary. We select uniformly at random either: (i) a consists of the bag of features h and r, while b consists only of t; or (ii) a consists of h, and b consists of r and t. Negative entities b− are sampled from the set of possible concepts. The learnt embeddings can then be used to answer link prediction questions such as (Beyoncé, born-in, ?) or (?, born-in, Houston) via the learnt function sim(a, b).
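A sketch of this triple-to-pair conversion (a hypothetical helper under our naming). The paper notes that the relation gets two different embeddings depending on whether it lands in a or b; a simple way to mimic that here is to give the relation a side-specific feature id.

```python
import random

def kg_pairs(triples, all_concepts, num_negatives=10):
    """Turn (head, relation, tail) triples into StarSpace (a, b) pairs."""
    while True:
        h, r, t = random.choice(triples)
        if random.random() < 0.5:
            a, b = [h, f"{r}@a"], [t]          # (i) a = {h, r}, b = {t}
        else:
            a, b = [h], [f"{r}@b", t]          # (ii) a = {h}, b = {r, t}
        negatives = [[c] for c in random.sample(all_concepts, k=num_negatives)]
        yield a, b, negatives
```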
Information Retrieval (e.g. Document Search) and Document Embeddings  Given supervised training data consisting of (search keywords, relevant document) pairs, one can directly train an information retrieval model: a contains the search keywords, b is a relevant document, and b− are other, irrelevant documents. If only unsupervised training data is available, consisting of a set of unlabeled documents, an alternative is to select a as random keywords from the document and b as the remaining words. Note that both these approaches implicitly learn document embeddings which could be used for other purposes.

Learning Word Embeddings  We can also use StarSpace to learn unsupervised word embeddings using training data consisting of raw text. We select a as a window of words (e.g., four words, two either side of a middle word), and b as the middle word, following (Collobert et al. 2011; Mikolov et al. 2013; Bojanowski et al. 2017).

Learning Sentence Embeddings  Learning word embeddings (e.g. as above) and using them to embed sentences does not seem optimal when you can learn sentence embeddings directly. Given a training set of unlabeled documents, each consisting of sentences, we select a and b as a pair of sentences both coming from the same document; b− are sentences coming from other documents. The intuition is that semantic similarity between sentences is shared within a document (one can also only select sentences within a certain distance of each other if documents are very long). Further, the embeddings will automatically be optimized for sets of words of sentence length, so train time matches test time, rather than training with short windows as typically learned with word embeddings – window-based embeddings can deteriorate when the sum of words in a sentence gets too large.
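A sketch of the two unsupervised generators just described (tokenization and sentence splitting are assumed to exist already; the function names are ours):

```python
import random

def word_window_pairs(tokens, window=2):
    """Word embedding mode: a = context window, b = the middle word.
    Assumes the token list is longer than 2 * window."""
    while True:
        i = random.randrange(window, len(tokens) - window)
        a = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        yield a, [tokens[i]]

def sentence_pairs(documents):
    """Sentence embedding mode: a and b are two sentences from the same document;
    negatives would be sentences drawn from other documents."""
    while True:
        doc = random.choice(documents)        # a document = list of token lists
        if len(doc) < 2:
            continue
        s1, s2 = random.sample(range(len(doc)), 2)
        yield doc[s1], doc[s2]
```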
Multi-Task Learning  Any of these tasks can be combined and trained at the same time if they share some features in the base dictionary F. For example, one could combine supervised classification with unsupervised word or sentence embedding, to give semi-supervised learning.

4 Experiments

Text Classification

We employ StarSpace for the task of text classification and compare it with a host of competing methods, including fastText, on three datasets which were all previously used in (Joulin et al. 2016). To ensure fair comparison, we use an identical dictionary to fastText and use the same implementation of n-grams and pruning (those features are implemented in our open-source distribution of StarSpace). In these experiments we set the dimension of embeddings to be 10, as in (Joulin et al. 2016).

We use three datasets:

• AG news¹ is a 4-class text classification task given title and description fields as input. It consists of 120K training examples, 7,600 test examples, 4 classes, ∼100K words and 5M tokens in total.

• DBpedia (Lehmann et al. 2015) is a 14-class classification problem given the title and abstract of Wikipedia articles as input. It consists of 560K training examples, 70K test examples, 14 classes, ∼800K words and 32M tokens in total.

• The Yelp reviews dataset is obtained from the 2015 Yelp Dataset Challenge². The task is to predict the full number of stars the user has given (from 1 to 5). It consists of 1.2M training examples, 157K test examples, 5 classes, ∼500K words and 193M tokens in total.

¹ http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
² https://www.yelp.com/dataset_challenge

Results are given in Table 2. Baselines are quoted from the literature (some methods are only reported on AG news and DBpedia, others only on Yelp15). StarSpace outperforms a number of methods, and performs similarly to fastText. We measure the training speed for n-grams > 1 in Table 3. fastText and StarSpace are both efficient compared to deep learning approaches, e.g. (Zhang and LeCun 2015) takes 5h per epoch on DBpedia, 375x slower than StarSpace. Still, fastText is faster than StarSpace. However, as we will see in the following sections, StarSpace is a more general system.

Content-based Document Recommendation

We consider the task of recommending new documents to a user given their past history of liked documents. We follow a process very similar to that described in (Weston, Chopra, and Adams 2014) in our experiment.

Metric Hits@1 Hits@10 Hits@20 Mean Rank Training Time
Unsupervised methods
TFIDF 0.97% 3.3% 4.3% 3921.9 -
word2vec 0.5% 1.2% 1.7% 4161.3 -
fastText (public Wikipedia model) 0.5% 1.7% 2.5% 4154.4 -
fastText (our dataset) 0.79% 2.5% 3.7% 3910.9 4h30m
Tagspace† 1.1% 2.7% 4.1% 3455.6 -
Supervised methods
SVM Ranker: BoW features 0.99% 3.3% 4.6% 2440.1 -
SVM Ranker: fastText features (our dataset) 0.92% 3.3% 4.2% 3833.8 -
StarSpace 3.1% 12.6% 17.6% 1704.2 12h18m

Table 1: Test metrics and training time on the Content-based Document Recommendation task. † Tagspace training is supervised
but for another task (hashtag prediction) not our task of interest here.

Model                    AG news   DBpedia   Yelp15
BoW*                     88.8      96.6      -
ngrams*                  92.0      98.6      -
ngrams TFIDF*            92.4      98.7      -
char-CNN*                87.2      98.3      -
char-CRNN                91.4      98.6      -
VDCNN                    91.3      98.7      -
SVM+TF†                  -         -         62.4
CNN†                     -         -         61.5
Conv-GRNN†               -         -         66.0
LSTM-GRNN†               -         -         67.6
fastText (ngrams=1)‡∗∗   91.5      98.1      62.2
StarSpace (ngrams=1)     91.6      98.3      62.4
fastText (ngrams=2)‡     92.5      98.6      -
StarSpace (ngrams=2)     92.7      98.6      -
fastText (ngrams=5)‡     -         -         66.6
StarSpace (ngrams=5)     -         -         65.3

Table 2: Text classification test accuracy. * indicates models from (Zhang and LeCun 2015); char-CRNN is from (Xiao and Cho 2016); VDCNN from (Conneau et al. 2016); † from (Tang, Qin, and Liu 2015); ‡ from (Joulin et al. 2016); ∗∗ we ran ourselves.

Training time            AG news   DBpedia   Yelp15
fastText (ngrams=2)      2s        10s       -
StarSpace (ngrams=2)     4s        34s       -
fastText (ngrams=5)      -         -         2m01s
StarSpace (ngrams=5)     -         -         3m38s

Table 3: Training speed on the text classification tasks.
The data for this task is comprised of anonymized two-week-long interaction histories for a subset of people on a popular social networking service. For each of the 641,385 people considered, we collected the text of public articles that s/he clicked to read, giving a total of 3,119,909 articles. Given the person's trailing (n − 1) clicked articles, we use our model to predict the n'th article by ranking it against 10,000 other unrelated articles, and evaluate using ranking metrics. The score of the n'th article is obtained by applying StarSpace: the input a is the previous (n − 1) articles, and the output b is the n'th candidate article. We measure the results by computing hits@k, i.e. the proportion of correct entities ranked in the top k for k = 1, 10, 20, and the mean predicted rank of the clicked article among the 10,000 articles.
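A minimal sketch of this ranking evaluation. The scoring function stands in for the trained sim(a, b); names are illustrative.

```python
import numpy as np

def rank_of_target(score_fn, a, target, distractors):
    """Rank of the true next article among itself plus the unrelated ones (1 = best)."""
    scores = [score_fn(a, target)] + [score_fn(a, d) for d in distractors]
    return 1 + sum(s > scores[0] for s in scores[1:])

def ranking_metrics(ranks, ks=(1, 10, 20)):
    """hits@k for each k, plus the mean predicted rank, over a list of ranks."""
    ranks = np.asarray(ranks)
    hits = {k: float((ranks <= k).mean()) for k in ks}
    return hits, float(ranks.mean())
```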
here. Unsupervised FastText, which is an enhancement of
As this is not a classification task (i.e. there are not a fixed word2vec is also slightly inferior to Tagspace, but better than
set of labels to classify amongst, but a variable set of never
seen before documents to rank per user) we cannot use su- 3
https://code.google.com/archive/p/word2vec/
4
pervised classification models directly. Starspace however https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Metric Hits@10 r. Mean Rank r. Hits@10 f. Mean Rank f. Train Time
SE* (Bordes et al. 2011) 28.8% 273 39.8% 162 -
SME(LINEAR)* (Bordes et al. 2014) 30.7% 274 40.8% 154 -
SME(BILINEAR)* (Bordes et al. 2014) 31.3% 284 41.3% 158 -
LFM* (Jenatton et al. 2012) 26.0% 283 33.1% 164 -
RESCAL† (Nickel, Tresp, and Kriegel 2011) - - 58.7% -
TransE (dim=50) 47.4% 212.4 71.8% 63.9 1h27m
TransE (dim=100) 51.1% 225.2 82.8% 72.2 1h44m
TransE (dim=200) 51.2% 234.3 83.2% 75.6 2h50m
StarSpace (dim=50) 45.7% 191.2 74.2% 70.0 1h21m
StarSpace (dim=100) 50.8% 209.5 83.8% 62.9 2h35m
StarSpace (dim=200) 52.1% 245.8 83.0% 62.1 2h41m

Table 4: Test metrics on Freebase 15k dataset. * indicates results cited from (Bordes et al. 2013). † indicates results cited from
(Nickel et al. 2016).

K 1 5 10 25 50 100 250 500 1000


Epochs 3260 711 318 130 69 34 13 7 4
hit@10 67.05% 68.08% 68.13% 67.63% 69.05% 66.99% 63.95% 60.32% 54.14%

Table 5: Adapting the number of negative samples k for a 50-dim model for 1 hour of training on Freebase 15k.

However, StarSpace, which is naturally more suited to this task, outperforms all those methods, including Tagspace and SVMs, by a significant margin. Overall, from the evaluation one can see that unsupervised methods of learning word embeddings are inferior to training specifically for the document recommendation task at hand, which StarSpace does.

Link Prediction: Embedding Multi-relation Knowledge Graphs

We show that one can also use StarSpace on tasks of knowledge representation. We use the Freebase 15k dataset from (Bordes et al. 2013), which consists of a collection of triplets (head, relation type, tail) extracted from Freebase⁵. This data set can be seen as a 3-mode tensor depicting ternary relationships between synsets. There are 14,951 concepts (mids) and 1,345 relation types among them. The training set contains 483,142 triplets, the validation set 50,000 and the test set 59,071. As described in (Bordes et al. 2013), evaluation is performed by, for each test triplet, removing the head and replacing it by each of the entities in the dictionary in turn. Scores for those corrupted triplets are first computed by the models and then sorted; the rank of the correct entity is finally stored. This whole procedure is repeated while removing the tail instead of the head. We report the mean of those predicted ranks and the hits@10. We also conduct a filtered evaluation that is the same, except all other valid heads or tails from the train or test set are discarded in the ranking, following (Bordes et al. 2013).
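A sketch of the raw and filtered ranking protocol for one test triple, shown for head corruption; the same is repeated with the tail. The score function stands in for sim(a, b) built from the learned embeddings, and the names are ours.

```python
def head_prediction_ranks(score_fn, triple, all_concepts, known_triples):
    """Raw and filtered rank of the true head for one test triple (1 = best).

    known_triples: set of all (h, r, t) from train/valid/test; corrupted triples
    that are themselves valid are discarded in the filtered setting."""
    h, r, t = triple
    true_score = score_fn(h, r, t)
    raw = filtered = 1
    for c in all_concepts:
        if c == h or score_fn(c, r, t) <= true_score:
            continue
        raw += 1
        if (c, r, t) not in known_triples:
            filtered += 1
    return raw, filtered
```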
We compare with a number of methods, including TransE, presented in (Bordes et al. 2013). TransE was shown to outperform RESCAL (Nickel, Tresp, and Kriegel 2011), LFM (Jenatton et al. 2012), SE (Bordes et al. 2011) and SME (Bordes et al. 2014), and is considered a standard benchmark method. TransE uses an L2 similarity ||head + relation − tail||₂ and SGD updates with single entity corruptions of head or tail that should have a larger distance. In contrast, StarSpace uses a dot product, k-negative sampling, and two different embeddings to represent the relation entity, depending on whether it appears in a or b.

The results are given in Table 4. Results for SE, SME and LFM are reported from (Bordes et al. 2013) and optimize the dimension from the choices 20, 50 and 75 as a hyperparameter. RESCAL is reported from (Nickel et al. 2016). For TransE we ran it ourselves so that we could report the results for different embedding dimensions, and because we obtained better results by fine-tuning it than previously reported. Comparing TransE and StarSpace for the same embedding dimension, these two methods then give similar performance. Note there are some recent improved results on this dataset using larger embeddings (Kadlec, Bajgar, and Kleindienst 2017) or more complex, but less general, methods (Shen et al. 2017).

Influence of k  In this section, we ran experiments on the Freebase 15k dataset to illustrate the complexity of our model in terms of the number of negative examples searched. We set dim = 50, and the max training time of the algorithm to be 1 hour for all experiments. We report the number of epochs the algorithm completes within the time limit and the best filtered hits@10 result over possible learning rate choices, for different k (the number of negatives searched for each positive training example). We set k = [1, 5, 10, 25, 50, 100, 250, 500, 1000].

The results are presented in Table 5. We observe that the number of epochs finished within the 1 hour training time constraint is close to an inverse linear function of k. In this particular setup, [1, 100] is a good range for k, and the best result is achieved at k = 50.

⁵ http://www.freebase.com

Metric Hits@1 Hits@10 Hits@20 Mean Rank Training Time
Unsupervised methods
TFIDF 56.63% 72.80% 76.16% 578.98 -
fastText (public Wikipedia model) 18.08% 36.36% 42.97% 987.27 -
fastText (our dataset) 16.89% 37.60% 45.25% 786.77 40h
Supervised method
SVM Ranker BoW features 56.73% 69.24% 71.86% 723.47 -
SVM Ranker: fastText features (public) 18.44% 37.80% 45.91% 887.96 -
StarSpace 56.75% 78.14% 83.15% 122.26 89h

Table 6: Test metrics and training time on Wikipedia Article Search (Task 1).

Metric Hits@1 Hits@10 Hits@20 Mean Rank Training Time


Unsupervised methods
TFIDF 24.79% 35.53% 38.25% 2523.68 -
fastText (public Wikipedia model) 5.77% 14.08% 17.79% 2393.38 -
fastText (our dataset) 5.47% 13.54% 17.60% 2363.74 40h
StarSpace (word-level training) 5.89% 16.41% 20.60% 1614.21 45h
Supervised methods
SVM Ranker BoW features 26.36% 36.48% 39.25% 2368.37 -
SVM Ranker: fastText features (public) 5.81% 12.14% 15.20% 1442.05 -
StarSpace (sentence pair training) 30.07% 50.89% 57.60% 422.00 36h
StarSpace (word+sentence training) 25.54% 45.21% 52.08% 484.27 69h

Table 7: Test metrics and training time on Wikipedia Sentence Matching (Task 2).

Wikipedia Article Search & Sentence Matching

In this section, we apply our model to a Wikipedia article search and a sentence matching problem. We use the Wikipedia dataset introduced by (Chen et al. 2017), which is the 2016-12-21 dump of English Wikipedia. For each article, only the plain text is extracted and all structured data sections such as lists and figures are stripped. It contains a total of 5,075,182 articles with 9,008,962 unique uncased token types. The dataset is split into 5,035,182 training examples, 10,000 validation examples and 10,000 test examples. We then consider the following evaluation tasks:

• Task 1: given a sentence from a Wikipedia article as a search query, we try to find the Wikipedia article it came from. We rank the true Wikipedia article (minus the sentence) against 10,000 other Wikipedia articles using ranking evaluation metrics. This mimics a web-search-like scenario where we would like to search for the most relevant Wikipedia articles (web documents). Note that we effectively have supervised training data for this task.

• Task 2: pick two random sentences from a Wikipedia article, use one as the search query, and try to find the other sentence coming from the same original document. We rank the true sentence against 10,000 other sentences from different Wikipedia articles. This fits the scenario where we want to find sentences that are closely semantically related by topic (but do not necessarily have strong word overlap). Note also that we effectively have supervised training data for this task.

We can train our StarSpace model in the following way: each update step selects a Wikipedia article from our training set. Then, one random sentence is picked from the article as the input, and for Task 2 another random sentence (different from the input) is picked from the article as the label (otherwise the rest of the article, for Task 1). Negative entities can be selected at random from the training set. In the case of training for Task 1, for label features we use a feature dropout probability of 0.8, which both regularizes and greatly speeds up training. We also try StarSpace word-level training, and multi-tasking both sentence and word-level for Task 2.
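A sketch of this per-update sampling with label feature dropout. The 0.8 dropout probability and the Task 1/Task 2 split come from the description above; everything else (data layout, names) is our illustrative scaffolding.

```python
import random

def wiki_update(articles, task=1, label_dropout=0.8):
    """One training example for Wikipedia article search (Task 1) or sentence matching (Task 2).

    articles: list of articles, each a list of tokenized sentences (at least two per article)."""
    sentences = random.choice(articles)
    i = random.randrange(len(sentences))
    a = sentences[i]                               # the query sentence
    if task == 2:
        j = random.choice([s for s in range(len(sentences)) if s != i])
        b = sentences[j]                           # another sentence from the same article
    else:
        rest = [w for s, sent in enumerate(sentences) if s != i for w in sent]
        # feature dropout on the label side: keep each word with probability 0.2
        b = [w for w in rest if random.random() > label_dropout]
    return a, b
```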
We compare StarSpace with the publicly released fastText model, as well as a fastText model trained on the text of our dataset.⁶ We also compare to a TFIDF baseline. For fair comparison, we set the dimension of all embedding models to be 300. The results for Tasks 1 and 2 are summarized in Tables 6 and 7 respectively. StarSpace outperforms TFIDF and fastText by a significant margin; this is because StarSpace can train directly for the tasks of interest, whereas that is not in the declared scope of fastText. Note that StarSpace word-level training, which is similar to fastText in method, obtains similar results to fastText. Crucially, it is StarSpace's ability to do sentence- and document-level training that brings the performance gains.

⁶ fastText training is unsupervised even on our dataset, since its original design does not support directly using supervised data here.

Query 1: She is the 1962 Blue Swords champion and 1960 Winter Universiade silver medalist.
  StarSpace result (Article: Eva Groajov): Eva Groajov, later Bergerov-Groajov, is a former competitive figure skater who represented Czechoslovakia. She placed 7th at the 1961 European Championships and 13th at the 1962 World Championships. She was coached by Hilda Mdra.
  fastText result (Article: Michael Reusch): Michael Reusch (February 3, 1914 – April 6, 1989) was a Swiss gymnast and Olympic Champion. He competed at the 1936 Summer Olympics in Berlin, where he received silver medals in parallel bars and team combined exercises...

Query 2: The islands are accessible by a one-hour speedboat journey from Kuala Abai jetty, Kota Belud, 80 km north-east of Kota Kinabalu, the capital of Sabah.
  StarSpace result (Article: Mantanani Islands): The Mantanani Islands form a small group of three islands off the north-west coast of the state of Sabah, Malaysia, opposite the town of Kota Belud, in northern Borneo. The largest island is Mantanani Besar; the other two are Mantanani Kecil and Lungisan...
  fastText result (Article: Gum-Gum): Gum-Gum is a township of Sandakan, Sabah, Malaysia. It is situated about 25 km from Sandakan town along Labuk Road.

Query 3: Maggie withholds her conversation with Neil from Tom and goes to the meeting herself, and Neil tells her the spirit that contacted Tom has asked for something and will grow upset if it does not get done.
  StarSpace result (Article: Stir of Echoes): Stir of Echoes is a 1999 American supernatural horror-thriller released in the United States on September 10, 1999, starring Kevin Bacon and directed by David Koepp. The film is loosely based on the novel "A Stir of Echoes" by Richard Matheson...
  fastText result (Article: The Fabulous Five): The Fabulous Five is an American book series by Betsy Haynes in the late 1980s. Written mainly for preteen girls, it is a spin-off of Haynes' other series about Taffy Sinclair...

Table 8: StarSpace predictions for some example Wikipedia Article Search (Task 1) queries where StarSpace is correct.

Task MR CR SUBJ MPQA SST TREC MRPC SICK-R SICK-E STS14


Unigram-TFIDF* 73.7 79.2 90.3 82.4 - 85.0 73.6 / 81.7 - - 0.58 / 0.57
ParagraphVec (DBOW)* 60.2 66.9 76.3 70.7 - 59.4 72.9 / 81.1 - - 0.42 / 0.43
SDAE* 74.6 78.0 90.8 86.9 - 78.4 73.7 / 80.7 - - 0.37 / 0.38
SIF(GloVe+WR)* - - - 82.2 - - - - 84.6 0.69 / -
word2vec* 77.7 79.8 90.9 88.3 79.7 83.6 72.5 / 81.4 0.80 78.7 0.65 / 0.64
GloVe* 78.7 78.5 91.6 87.6 79.8 83.6 72.1 / 80.9 0.80 78.6 0.54 / 0.56
fastText (public Wikipedia model)* 76.5 78.9 91.6 87.4 78.8 81.8 72.4 / 81.2 0.80 77.9 0.63 / 0.62
StarSpace [word] 73.8 77.5 91.53 86.6 77.2 82.2 73.1 / 81.8 0.79 78.8 0.65 / 0.62
StarSpace [sentence] 69.1 75.1 85.4 80.5 72.0 63.0 69.2 / 79.7 0.76 76.2 0.70 / 0.67
StarSpace [word + sentence] 72.1 77.1 89.6 84.1 77.5 79.0 70.2 / 80.3 0.79 77.8 0.69 / 0.66
StarSpace [ensemble w+s] 76.6 80.3 91.8 88.0 79.9 85.2 71.8 / 80.6 0.78 82.1 0.69 / 0.65

Table 9: Transfer test results on SentEval. * indicates model results that have been extracted from (Conneau et al. 2017). For MR, CR, SUBJ, MPQA, SST, TREC and SICK-E we report accuracies; for MRPC, we report accuracy/F1; for SICK-R we report Pearson correlation with the relatedness score; for STS we report Pearson/Spearman correlations between the cosine distance of two sentences and the human-labeled similarity score.

Task STS12 STS13 STS14 STS15 STS16


fastText (public Wikipedia model) 0.60 / 0.59 0.62 / 0.63 0.63 / 0.62 0.68 / 0.69 0.62 / 0.66
StarSpace [word] 0.53 / 0.54 0.60 / 0.60 0.65 / 0.62 0.68 / 0.67 0.64 / 0.65
StarSpace [sentence] 0.58 / 0.58 0.66 / 0.65 0.70 / 0.67 0.74 / 0.73 0.69 / 0.69
StarSpace [word+sentence] 0.58 / 0.59 0.63 / 0.63 0.68 / 0.65 0.72 / 0.72 0.68 / 0.68
StarSpace [ensemble w+s] 0.58 / 0.59 0.64 / 0.64 0.69 / 0.65 0.73 / 0.72 0.69 / 0.69

Table 10: Transfer test results on STS tasks using Pearson/Spearman correlations between sentence similarity and human scores.

A comparison of the predictions of StarSpace and fastText on the article search task (Task 1) on a few random queries is given in Table 8. While fastText results are semantically in roughly the right part of the space, they lack finer precision. For example, the first query is looking for articles about an Olympic skater, which StarSpace correctly understands, whereas fastText picks an Olympic gymnast. Note that the query does not specifically mention the word skater; StarSpace can only understand this by understanding related phrases, e.g. the phrase "Blue Swords" refers to an international figure skating competition. The other two examples given yield similar conclusions.

Learning Sentence Embeddings

In this section, we evaluate sentence embeddings generated by our model using SentEval⁷, a tool from (Conneau et al. 2017) for measuring the quality of general-purpose sentence embeddings. We use a total of 14 transfer tasks including binary classification, multi-class classification, entailment, paraphrase detection, semantic relatedness and semantic textual similarity from SentEval. Detailed descriptions of these transfer tasks and baseline models can be found in (Conneau et al. 2017).

⁷ https://github.com/facebookresearch/SentEval

We train the following models on the Wikipedia Task 2 from the previous section, and evaluate sentence embeddings generated by those models:

• StarSpace trained on the word level.

• StarSpace trained on the sentence level.

• StarSpace trained (multi-tasked) on both the word and sentence level.

• Ensemble of StarSpace models trained on both the word and sentence level: we train a set of 13 models, multi-tasking on Wikipedia sentence matching and word-level training, then concatenate all embeddings together to generate a 13 × 300 = 3900-dimensional embedding for each word.
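The ensemble construction is simple concatenation of per-word embeddings; a one-line sketch (the `word_embedding` accessor on each trained model is an assumption of ours, not an actual StarSpace API):

```python
import numpy as np

def ensemble_word_embedding(models, word):
    """Concatenate one word's 300-d embedding from each of 13 models into a 3900-d vector."""
    return np.concatenate([m.word_embedding(word) for m in models])
```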

We present the results in Table 9 and Table 10. StarSpace performs well, outperforming many methods on many of the tasks, although no method wins outright across all tasks. Particularly on the STS (Semantic Textual Similarity) tasks, StarSpace has very strong results. Please refer to (Conneau et al. 2017) for further results and analysis of these datasets.

5 Discussion and Conclusion

In this paper, we propose StarSpace, a method of embedding and ranking entities using the relationships between entities, and show that the method we propose is a general system capable of working on many tasks:

• Text Classification / Sentiment Analysis: we show that our method achieves good results, comparable to fastText (Joulin et al. 2016), on three different datasets.

• Content-based Document Recommendation: it can directly solve these tasks well, whereas applying off-the-shelf fastText, Tagspace or word2vec gives inferior results.

• Link Prediction in Knowledge Bases: we show that our method outperforms several methods, and matches TransE (Bordes et al. 2013) on Freebase 15k.

• Wikipedia Search and Sentence Matching tasks: it outperforms off-the-shelf embedding models due to directly training sentence- and document-level embeddings.

• Learning Sentence Embeddings: it performs well on the 14 SentEval transfer tasks of (Conneau et al. 2017) compared to a host of embedding methods.

StarSpace should also be highly applicable to other tasks we did not evaluate here, such as other classification, ranking, retrieval or metric learning tasks. Importantly, what is more general about our method compared to many existing embedding models is: (i) the flexibility of using features to represent labels that we want to classify or rank, which enables it to train directly on a downstream prediction/ranking task; and (ii) different ways of selecting positives and negatives suitable for those tasks. Choosing the wrong generators E+ and E− gives greatly inferior results, as shown e.g. in Table 7.

Future work will consider the following enhancements: going beyond discrete features, e.g. to continuous features, considering nonlinear representations, and experimenting with other entities such as images. Finally, while our model is relatively efficient, we could consider hierarchical classification schemes as in fastText to try to make it more efficient; the trick here would be to do this while maintaining the generality of our model, which is what makes it so appealing.

6 Acknowledgement

We would like to thank Timothee Lacroix for sharing with us his implementation of TransE. We also thank Edouard Grave, Armand Joulin and Arthur Szlam for helpful discussions on the StarSpace model.

References

Bai, B.; Weston, J.; Grangier, D.; Collobert, R.; Sadamasa, K.; Qi, Y.; Chapelle, O.; and Weinberger, K. 2009. Supervised semantic indexing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 187–196. ACM.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.

Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

Bordes, A.; Weston, J.; Collobert, R.; Bengio, Y.; et al. 2011. Learning structured embeddings of knowledge bases. In AAAI, volume 6, 6.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787–2795.

Bordes, A.; Glorot, X.; Weston, J.; and Bengio, Y. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning 94(2):233–259.

Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.

Conneau, A.; Schwenk, H.; Barrault, L.; and Lecun, Y. 2016. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.

Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Garcia-Duran, A.; Bordes, A.; and Usunier, N. 2015. Composing relationships with translations. Ph.D. Dissertation, CNRS, Heudiasyc.

Goldberg, K.; Roeder, T.; Gupta, D.; and Perkins, C. 2001. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2):133–151.

Hermann, K. M.; Das, D.; Weston, J.; and Ganchev, K. 2014. Semantic frame identification with distributed word representations. In ACL (1), 1448–1458.

Jenatton, R.; Roux, N. L.; Bordes, A.; and Obozinski, G. R. 2012. A latent factor model for highly multi-relational data. In Advances in Neural Information Processing Systems, 3167–3175.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Kadlec, R.; Bajgar, O.; and Kleindienst, J. 2017. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744.

Koren, Y., and Bell, R. 2015. Advances in collaborative filtering. In Recommender Systems Handbook. Springer. 77–118.
Lawrence, N. D., and Urtasun, R. 2009. Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th Annual International Conference on Machine Learning, 601–608. ACM.

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2):167–195.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Nickel, M.; Rosasco, L.; Poggio, T. A.; et al. 2016. Holographic embeddings of knowledge graphs.

Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 809–816.

Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 693–701.

Rendle, S. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, 995–1000. IEEE.

Shen, Y.; Huang, P.-S.; Chang, M.-W.; and Gao, J. 2017. Modeling large-scale structured relationships with shared memory for knowledge base completion. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 57–68.

Shi, Y.; Karatzoglou, A.; Baltrunas, L.; Larson, M.; Oliver, N.; and Hanjalic, A. 2012. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the Sixth ACM Conference on Recommender Systems, 139–146. ACM.

Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, 1422–1432.

Weston, J.; Bengio, S.; and Usunier, N. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, 2764–2770.

Weston, J.; Bordes, A.; Yakhnenko, O.; and Usunier, N. 2013. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973.

Weston, J.; Chopra, S.; and Adams, K. 2014. #TagSpace: Semantic embeddings from hashtags. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1822–1827.

Xiao, Y., and Cho, K. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.

Zhang, X., and LeCun, Y. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.

