Multivariate Gaussian Document Representation From Word Embeddings For Text Categorization
Yannis Stavrakas
IMIS / RC ATHENA
[email protected]
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 450–455, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics.
different datasets.

The rest of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 provides a description of the proposed approach. Section 4 evaluates the proposed representation. Finally, Section 5 concludes.
2 Related Work

Mitchell and Lapata (2008) proposed a general framework for generating representations of phrases or sentences. They computed vector representations of short phrases as a mixture of the original word vectors, using several different element-wise vector operations. Later, their work was extended to take into account syntactic structure and grammars (Erk and Padó, 2008; Baroni and Zamparelli, 2010; Coecke et al., 2010). Lebret and Collobert (2015) proposed to learn representations for documents by averaging their word representations. Their model learns word representations suitable for summation. Le and Mikolov (2014) presented an algorithm to learn vector representations for paragraphs by inserting an additional memory vector in the input layer. Song and Roth (2015) presented three mechanisms for generating dense representations of short documents by combining Wikipedia-based explicit semantic analysis representations with distributed word representations.

Neural networks with convolutional and pooling layers have also been widely used for generating representations of phrases or documents. These networks allow the model to learn which sequences of words are good indicators of each topic, and then combine them to produce vector representations for documents. These architectures have proved effective in many NLP tasks, such as document classification (Johnson and Zhang, 2015), short-text categorization (Wang et al., 2015), sentiment classification (Kalchbrenner et al., 2014; Kim, 2014) and paraphrase detection (Yin and Schütze, 2015).
3 Gaussian Document Representation from Word Embeddings

Let D = {d1, d2, ..., dm} be a set of m documents. The documents are pre-processed (tokenization, punctuation and special character removal) and the vocabulary of the corpus V is extracted. To obtain a distributed representation for each word w ∈ V, we employed the word2vec model (Mikolov et al., 2013). Specifically, for our experiments, we used a publicly available model M (https://code.google.com/archive/p/word2vec/) consisting of 300-dimensional vectors trained on a Google News dataset of about 100 billion words. Words contained in the vocabulary (w ∈ V) but not contained in the model (w ∉ M) were initialized to random vectors.
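As an illustration of this lookup step (not code from the paper), the pre-trained vectors can be loaded and out-of-model words initialized randomly as in the following minimal sketch; the gensim loader, the file name GoogleNews-vectors-negative300.bin, the embed helper and the initialization range are all our assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

EMB_DIM = 300  # dimensionality of the pre-trained Google News vectors

# Load the publicly available word2vec model (file name is an assumption).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

rng = np.random.RandomState(0)
_oov_cache = {}  # keep one fixed random vector per out-of-model word

def embed(word):
    """Return the embedding of `word`; words missing from the model get a random vector."""
    if word in w2v:
        return w2v[word]
    if word not in _oov_cache:
        # The uniform range is an assumption; the paper only says "random vectors".
        _oov_cache[word] = rng.uniform(-0.25, 0.25, EMB_DIM).astype(np.float32)
    return _oov_cache[word]
```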
To generate a representation for each document, we assume that its words were generated by a multivariate Gaussian distribution. Specifically, we regard the embeddings of all words w present in a document as i.i.d. samples drawn from a multivariate Gaussian distribution:

    w ∼ N(µ, Σ)    (1)

where w is the distributed representation of a word w, µ is the mean vector of the distribution and Σ its covariance matrix.

We set µ and Σ to their Maximum Likelihood estimates, given by the sample mean and the empirical covariance matrix respectively. More specifically, the sample mean of a document corresponds to the centroid of its words, i.e. we add the vectors of the words present in the text and normalize the sum by the total number of words. For an input sequence of words d, its mean vector µ is given by:

    µ = (1/|d|) ∑_{w ∈ d} w    (2)

where |d| is the cardinality of d, i.e. its number of words. The empirical covariance matrix is then defined as:

    Σ = (1/|d|) ∑_{w ∈ d} (w − µ)(w − µ)^T    (3)

Hence, each document is represented as a multivariate Gaussian distribution and the problem transforms from classifying textual documents to classifying distributions.
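To make the estimation step concrete, the following minimal sketch computes the two Maximum Likelihood estimates of Equations 2 and 3 with NumPy; the embed helper from the previous sketch is assumed, and the function name gaussian_representation is ours, not the paper's.

```python
import numpy as np

def gaussian_representation(tokens):
    """Represent a tokenized document as the (mu, Sigma) of a multivariate Gaussian."""
    W = np.stack([embed(w) for w in tokens])     # |d| x n matrix of word embeddings
    mu = W.mean(axis=0)                          # Eq. (2): sample mean (centroid)
    centered = W - mu
    sigma = centered.T @ centered / len(tokens)  # Eq. (3): empirical covariance matrix
    return mu, sigma
```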
To measure the similarity between pairs of documents, we compare their Gaussian representations. There are several well-known definitions of similarity or distance between distributions. Some examples include the Kullback-Leibler divergence, the Fisher kernel, the χ² distance and the Bhattacharyya kernel. However, most of these measures are very time-consuming. In our setting, where µ and Σ are very high-dimensional (if n is the dimensionality of the distributed representations, then µ ∈ R^n and Σ ∈ R^{n×n}), the complexity of these measures is prohibitive, even for small document collections.

We proceed by defining a more efficient function for measuring the similarity between two distributions. More specifically, the similarity between two documents d1 and d2 is set equal to the convex combination of the similarities of their mean vectors µ1 and µ2 and their covariance matrices Σ1 and Σ2. The similarity between the mean vectors µ1 and µ2 is calculated using cosine similarity:

    sim(µ1, µ2) = (µ1 · µ2) / (‖µ1‖ ‖µ2‖)    (4)

where ‖·‖ is the Euclidean norm for vectors. The similarity between the covariance matrices Σ1 and Σ2 can be computed using the following formula:

    sim(Σ1, Σ2) = ∑(Σ1 ◦ Σ2) / (‖Σ1‖_F ‖Σ2‖_F)    (5)

where (· ◦ ·) is the Hadamard (element-wise) product between matrices, ∑(·) denotes the sum over all elements of the resulting matrix, and ‖·‖_F is the Frobenius norm for matrices. Hence, the similarity between two documents is equal to:

    sim(d1, d2) = α sim(µ1, µ2) + (1 − α) sim(Σ1, Σ2)    (6)

where α ∈ [0, 1]. It is trivial to show that the above similarity measure is also a valid kernel function.
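Equations 4–6 translate directly into NumPy, as in the sketch below. The function names, the default α = 0.5 and the use of scikit-learn's SVC with a precomputed kernel are our assumptions; the excerpt states only that the measure is a valid kernel, not which classifier consumes it.

```python
import numpy as np
from sklearn.svm import SVC

def doc_similarity(doc1, doc2, alpha=0.5):
    """Similarity of two documents given as (mu, Sigma) pairs, following Eqs. 4-6."""
    mu1, sigma1 = doc1
    mu2, sigma2 = doc2
    sim_mu = mu1 @ mu2 / (np.linalg.norm(mu1) * np.linalg.norm(mu2))  # Eq. (4)
    sim_sigma = (sigma1 * sigma2).sum() / (
        np.linalg.norm(sigma1) * np.linalg.norm(sigma2))              # Eq. (5), Frobenius norms
    return alpha * sim_mu + (1 - alpha) * sim_sigma                   # Eq. (6)

def gram_matrix(docs_a, docs_b, alpha=0.5):
    """Pairwise similarity matrix between two collections of (mu, Sigma) pairs."""
    return np.array([[doc_similarity(a, b, alpha) for b in docs_b] for a in docs_a])

# Illustrative use as a precomputed kernel (our assumption, not stated in the excerpt):
# K_train = gram_matrix(train_docs, train_docs)
# clf = SVC(kernel="precomputed").fit(K_train, train_labels)
# predictions = clf.predict(gram_matrix(test_docs, train_docs))
```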
4 Experiments

We evaluate the proposed approach as well as the baselines in the context of text categorization on eight standard datasets.

4.1 Baselines

We next present the baselines against which we compared our approach:

1) BOW (binary): Documents are represented as bag-of-words vectors. If a word is present in the document its entry in the vector is 1, otherwise 0. To perform text categorization, we employed a linear SVM classifier (a minimal sketch of this baseline is given after the list).

2) NBSVM: It combines a Naive Bayes classifier with an SVM and achieves remarkable results on several tasks (Wang and Manning, 2012). We used a combination of both unigrams and bigrams as features.

3) Centroid: Documents are projected in the word embedding space as the centroids of their words. This representation corresponds to the mean vector µ of the Gaussian representation presented in Section 3. Similarity between documents is computed using cosine similarity (Equation 4).

4) WMD: Distances between documents are computed using the Word Mover's Distance (Kusner et al., 2015). To compute the distances, we used pre-trained vectors from word2vec. A k-nn algorithm is then employed to classify the documents based on the distances between them. As in (Kusner et al., 2015), we used values of k ranging from 1 to 19.

5) CNN: A convolutional neural network architecture that has recently shown state-of-the-art results on sentence classification (Kim, 2014). We used a model with pre-trained vectors from word2vec where all word vectors are kept static during training. As regards the hyperparameters, we used the same settings as in (Kim, 2014): rectified linear units, filter windows of 3, 4, 5 with 100 feature maps each, dropout rate of 0.5, l2 constraint of 3, mini-batch size of 50, and 25 epochs.
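For concreteness, the BOW (binary) baseline could be reproduced roughly as follows with scikit-learn; the binary CountVectorizer, the LinearSVC defaults and the variable names are our assumptions, since the excerpt does not give the exact preprocessing or regularization settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# BOW (binary) baseline: 0/1 word-presence vectors fed to a linear SVM.
bow_svm = make_pipeline(
    CountVectorizer(binary=True),  # 1 if the word occurs in the document, else 0
    LinearSVC(),                   # regularization left at its default (assumption)
)

# train_texts / train_labels / test_texts are placeholders for a loaded dataset.
# bow_svm.fit(train_texts, train_labels)
# predictions = bow_svm.predict(test_texts)
```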
4.2 Datasets

In our experiments, we used several standard datasets: (1) Reuters: contains stories collected from the Reuters news agency. (2) Amazon: product reviews acquired from Amazon over four different sub-collections (Blitzer et al., 2007). (3) TREC: a set of questions classified into 6 different types (Li and Roth, 2002). (4) Snippets: consists of snippets that were collected from the results of Web search transactions (Phan et al., 2008). (5) BBCSport: consists of sports news articles from the BBC Sport website (Greene and Cunningham, 2006). (6) Polarity: consists of positive and negative snippets acquired from Rotten Tomatoes (Pang and Lee, 2005). (7)

Dataset        # training examples   # test examples   # classes   vocabulary size   word2vec size
Reuters        5,485                 2,189             8           23,585            15,587
Amazon         8,000                 CV                4           39,133            30,526
TREC           5,452                 500               6           9,513             9,048
Snippets       10,060                2,280             8           29,276            17,067
BBCSport       348                   389               5           14,340            13,390
Polarity       10,662                CV                2           18,777            16,416
Subjectivity   10,000                CV                2           21,335            17,896
Twitter        3,115                 CV                3           6,266             4,460

Table 1: Summary of the 8 datasets that were used in our document classification experiments.
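Table 2 below reports accuracy and macro-averaged F1-score for every method and dataset. As a point of reference, both metrics can be computed as in the following sketch; the use of scikit-learn is our choice, and y_true / y_pred are placeholders for the gold and predicted labels of a test split (or cross-validation fold).

```python
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```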
Dataset        Reuters             Amazon              TREC                Snippets
Method         Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score
BOW (binary)   0.9571    0.8860    0.9126    0.9127    0.9660    0.9692    0.6171    0.5953
Centroid       0.9676    0.9171    0.9311    0.9312    0.9540    0.9586    0.8123    0.8170
WMD            0.9502    0.8204    0.9200    0.9201    0.9240    0.9336    0.7417    0.7388
NBSVM          0.9712    0.9155    0.9486    0.9486    0.9780    0.9805    0.6474    0.6357
CNN            0.9707    0.9297    0.9448    0.9449    0.9800    0.9800    0.8478    0.8466
Gaussian       0.9712    0.9388    0.9498    0.9497    0.9820    0.9841    0.8224    0.8244

Dataset        BBCSport            Polarity            Subjectivity        Twitter
Method         Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score
BOW (binary)   0.9640    0.9690    0.7615    0.7614    0.9004    0.9004    0.7467    0.6205
Centroid       0.9923    0.9915    0.7783    0.7782    0.9100    0.9100    0.7361    0.5727
WMD            0.9871    0.9866    0.6642    0.6639    0.8604    0.8603    0.7031    0.4436
NBSVM          0.9871    0.9892    0.8698    0.8698    0.9369    0.9368    0.7852    0.6191
CNN            0.9486    0.9461    0.8037    0.8031    0.9315    0.9314    0.7549    0.6137
Gaussian       0.9974    0.9974    0.8021    0.8020    0.9310    0.9310    0.7534    0.6443

Table 2: Performance (accuracy and macro-average F1-score) in text categorization on the 8 datasets.
values of α close to 0.5. Furthermore, when dropping the second term of Equation 6 (α = 1), the method is equivalent to the Centroid baseline and the performance drops significantly.
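The excerpt does not state how α was selected. One plausible procedure (an assumption on our part, reusing the gram_matrix helper sketched earlier) is a small grid search over [0, 1] with cross-validation on the training kernel, as below.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical selection of alpha; train_docs / train_labels are placeholders.
best_alpha, best_score = 0.5, -np.inf
for alpha in np.linspace(0.0, 1.0, 11):
    K_train = gram_matrix(train_docs, train_docs, alpha)  # square training kernel
    score = cross_val_score(SVC(kernel="precomputed"),
                            K_train, train_labels, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
# alpha = 1 keeps only the mean term of Equation 6 and recovers the Centroid baseline.
```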
5 Conclusion

We proposed an approach that models each document as a Gaussian distribution based on the embeddings of its words. We then defined a function that measures the similarity between two documents based on the similarity of their distributions. Empirical evaluation demonstrated the effectiveness of the approach across a range of datasets. We attribute this performance gain to the high quality of the embeddings and to the proposed approach's ability to utilize them effectively.
References

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Katrin Erk and Sebastian Padó. 2008. A Structured Vector Space Model for Word Meaning in Context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 897–906.

Derek Greene and Pádraig Cunningham. 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In Proceedings of the 23rd International Conference on Machine Learning, pages 377–384.

Rie Johnson and Tong Zhang. 2015. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 655–665.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis using Subjectivity Summarization based on Minimum Cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 271–278.