Multivariate Gaussian Document Representation From Word Embeddings For Text Categorization
Yannis Stavrakas
IMIS / RC ATHENA
[email protected]
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 450–455, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics.
different datasets.

The rest of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 provides a description of the proposed approach. Section 4 evaluates the proposed representation. Finally, Section 5 concludes.
2 Related Work

Mitchell and Lapata (2008) proposed a general framework for generating representations of phrases or sentences. They computed vector representations of short phrases as a mixture of the original word vectors, using several different element-wise vector operations. Later, their work was extended to take into account syntactic structure and grammars (Erk and Padó, 2008; Baroni and Zamparelli, 2010; Coecke et al., 2010). Lebret and Collobert (2015) proposed to learn representations for documents by averaging their word representations. Their model learns word representations suitable for summation. Le and Mikolov (2014) presented an algorithm to learn vector representations for paragraphs by inserting an additional memory vector in the input layer. Song and Roth (2015) presented three mechanisms for generating dense representations of short documents by combining Wikipedia-based explicit semantic analysis representations with distributed word representations.

Neural networks with convolutional and pooling layers have also been widely used for generating representations of phrases or documents. These networks allow the model to learn which sequences of words are good indicators of each topic, and then combine them to produce vector representations for documents. These architectures have proved effective in many NLP tasks, such as document classification (Johnson and Zhang, 2015), short-text categorization (Wang et al., 2015), sentiment classification (Kalchbrenner et al., 2014; Kim, 2014) and paraphrase detection (Yin and Schütze, 2015).
3 Gaussian Document Representation from Word Embeddings

Let D = {d1, d2, ..., dm} be a set of m documents. The documents are pre-processed (tokenization, punctuation and special character removal) and the vocabulary of the corpus V is extracted. To obtain a distributed representation for each word w ∈ V, we employed the word2vec model (Mikolov et al., 2013). Specifically, for our experiments, we used a publicly available model M (https://code.google.com/archive/p/word2vec/) consisting of 300-dimensional vectors trained on a Google News dataset of about 100 billion words. Words contained in the vocabulary (w ∈ V) but not contained in the model (w ∉ M) were initialized to random vectors.
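As an illustration of this lookup step (not code from the paper), the pre-trained vectors can be loaded and out-of-model words initialized randomly as in the following minimal sketch; the gensim loader, the file name GoogleNews-vectors-negative300.bin, the embed helper and the initialization range are all our assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

EMB_DIM = 300  # dimensionality of the pre-trained Google News vectors

# Load the publicly available word2vec model (file name is an assumption).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

rng = np.random.RandomState(0)
_oov_cache = {}  # keep one fixed random vector per out-of-model word

def embed(word):
    """Return the embedding of `word`; words missing from the model get a random vector."""
    if word in w2v:
        return w2v[word]
    if word not in _oov_cache:
        # The uniform range is an assumption; the paper only says "random vectors".
        _oov_cache[word] = rng.uniform(-0.25, 0.25, EMB_DIM).astype(np.float32)
    return _oov_cache[word]
```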
To generate a representation for each document, we assume that its words were generated by a multivariate Gaussian distribution. Specifically, we regard the embeddings of all words w present in a document as i.i.d. samples drawn from a multivariate Gaussian distribution:

    w ∼ N(µ, Σ)    (1)

where w is the distributed representation of a word w, µ is the mean vector of the distribution and Σ its covariance matrix.

We set µ and Σ to their Maximum Likelihood estimates, given by the sample mean and the empirical covariance matrix respectively. More specifically, the sample mean of a document corresponds to the centroid of its words, i.e. we add the vectors of the words present in the text and normalize the sum by the total number of words. For an input sequence of words d, its mean vector µ is given by:

    µ = (1/|d|) ∑_{w ∈ d} w    (2)

where |d| is the cardinality of d, i.e. its number of words. The empirical covariance matrix is then defined as:

    Σ = (1/|d|) ∑_{w ∈ d} (w − µ)(w − µ)^T    (3)

Hence, each document is represented as a multivariate Gaussian distribution and the problem transforms from classifying textual documents to classifying distributions.
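To make the estimation step concrete, the following minimal sketch computes the two Maximum Likelihood estimates of Equations 2 and 3 with NumPy; the embed helper from the previous sketch is assumed, and the function name gaussian_representation is ours, not the paper's.

```python
import numpy as np

def gaussian_representation(tokens):
    """Represent a tokenized document as the (mu, Sigma) of a multivariate Gaussian."""
    W = np.stack([embed(w) for w in tokens])     # |d| x n matrix of word embeddings
    mu = W.mean(axis=0)                          # Eq. (2): sample mean (centroid)
    centered = W - mu
    sigma = centered.T @ centered / len(tokens)  # Eq. (3): empirical covariance matrix
    return mu, sigma
```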
To measure the similarity between pairs of documents, we compare their Gaussian representations. There are several well-known definitions of similarity or distance between distributions. Some examples include the Kullback-Leibler divergence, the Fisher kernel, the χ² distance and the Bhattacharyya kernel. However, most of these measures are very time-consuming. In our setting, where µ and Σ are very high-dimensional (if n is the dimensionality of the distributed representations, then µ ∈ R^n and Σ ∈ R^{n×n}), the complexity of these measures is prohibitive, even for small document collections.

We proceed by defining a more efficient function for measuring the similarity between two distributions. More specifically, the similarity between two documents d1 and d2 is set equal to the convex combination of the similarities of their mean vectors µ1 and µ2 and their covariance matrices Σ1 and Σ2. The similarity between the mean vectors µ1 and µ2 is calculated using cosine similarity:

    sim(µ1, µ2) = (µ1 · µ2) / (‖µ1‖ ‖µ2‖)    (4)

where ‖·‖ is the Euclidean norm for vectors. The similarity between the covariance matrices Σ1 and Σ2 can be computed using the following formula:

    sim(Σ1, Σ2) = ∑(Σ1 ◦ Σ2) / (‖Σ1‖_F ‖Σ2‖_F)    (5)

where (· ◦ ·) is the Hadamard (element-wise) product between matrices, ∑(·) denotes the sum over all elements of the resulting matrix, and ‖·‖_F is the Frobenius norm for matrices. Hence, the similarity between two documents is equal to:

    sim(d1, d2) = α sim(µ1, µ2) + (1 − α) sim(Σ1, Σ2)    (6)

where α ∈ [0, 1]. It is trivial to show that the above similarity measure is also a valid kernel function.
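Equations 4–6 translate directly into NumPy, as in the sketch below. The function names, the default α = 0.5 and the use of scikit-learn's SVC with a precomputed kernel are our assumptions; the excerpt states only that the measure is a valid kernel, not which classifier consumes it.

```python
import numpy as np
from sklearn.svm import SVC

def doc_similarity(doc1, doc2, alpha=0.5):
    """Similarity of two documents given as (mu, Sigma) pairs, following Eqs. 4-6."""
    mu1, sigma1 = doc1
    mu2, sigma2 = doc2
    sim_mu = mu1 @ mu2 / (np.linalg.norm(mu1) * np.linalg.norm(mu2))  # Eq. (4)
    sim_sigma = (sigma1 * sigma2).sum() / (
        np.linalg.norm(sigma1) * np.linalg.norm(sigma2))              # Eq. (5), Frobenius norms
    return alpha * sim_mu + (1 - alpha) * sim_sigma                   # Eq. (6)

def gram_matrix(docs_a, docs_b, alpha=0.5):
    """Pairwise similarity matrix between two collections of (mu, Sigma) pairs."""
    return np.array([[doc_similarity(a, b, alpha) for b in docs_b] for a in docs_a])

# Illustrative use as a precomputed kernel (our assumption, not stated in the excerpt):
# K_train = gram_matrix(train_docs, train_docs)
# clf = SVC(kernel="precomputed").fit(K_train, train_labels)
# predictions = clf.predict(gram_matrix(test_docs, train_docs))
```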
4 Experiments

We evaluate the proposed approach as well as the baselines in the context of text categorization on eight standard datasets.

4.1 Baselines

We next present the baselines against which we compared our approach:

1) BOW (binary): Documents are represented as bag-of-words vectors. If a word is present in the document its entry in the vector is 1, otherwise 0. To perform text categorization, we employed a linear SVM classifier (a minimal sketch of this baseline is given after the list).

2) NBSVM: It combines a Naive Bayes classifier with an SVM and achieves remarkable results on several tasks (Wang and Manning, 2012). We used a combination of both unigrams and bigrams as features.

3) Centroid: Documents are projected in the word embedding space as the centroids of their words. This representation corresponds to the mean vector µ of the Gaussian representation presented in Section 3. Similarity between documents is computed using cosine similarity (Equation 4).

4) WMD: Distances between documents are computed using the Word Mover's Distance (Kusner et al., 2015). To compute the distances, we used pre-trained vectors from word2vec. A k-nn algorithm is then employed to classify the documents based on the distances between them. As in (Kusner et al., 2015), we used values of k ranging from 1 to 19.

5) CNN: A convolutional neural network architecture that has recently shown state-of-the-art results on sentence classification (Kim, 2014). We used a model with pre-trained vectors from word2vec where all word vectors are kept static during training. As regards the hyperparameters, we used the same settings as in (Kim, 2014): rectified linear units, filter windows of 3, 4, 5 with 100 feature maps each, dropout rate of 0.5, l2 constraint of 3, mini-batch size of 50, and 25 epochs.
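For concreteness, the BOW (binary) baseline could be reproduced roughly as follows with scikit-learn; the binary CountVectorizer, the LinearSVC defaults and the variable names are our assumptions, since the excerpt does not give the exact preprocessing or regularization settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# BOW (binary) baseline: 0/1 word-presence vectors fed to a linear SVM.
bow_svm = make_pipeline(
    CountVectorizer(binary=True),  # 1 if the word occurs in the document, else 0
    LinearSVC(),                   # regularization left at its default (assumption)
)

# train_texts / train_labels / test_texts are placeholders for a loaded dataset.
# bow_svm.fit(train_texts, train_labels)
# predictions = bow_svm.predict(test_texts)
```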
4.2 Datasets

In our experiments, we used several standard datasets: (1) Reuters: contains stories collected from the Reuters news agency. (2) Amazon: product reviews acquired from Amazon over four different sub-collections (Blitzer et al., 2007). (3) TREC: a set of questions classified into 6 different types (Li and Roth, 2002). (4) Snippets: consists of snippets that were collected from the results of Web search transactions (Phan et al., 2008). (5) BBCSport: consists of sports news articles from the BBC Sport website (Greene and Cunningham, 2006). (6) Polarity: consists of positive and negative snippets acquired from Rotten Tomatoes (Pang and Lee, 2005). (7)

Dataset        # training examples   # test examples   # classes   vocabulary size   word2vec size
Reuters        5,485                 2,189             8           23,585            15,587
Amazon         8,000                 CV                4           39,133            30,526
TREC           5,452                 500               6           9,513             9,048
Snippets       10,060                2,280             8           29,276            17,067
BBCSport       348                   389               5           14,340            13,390
Polarity       10,662                CV                2           18,777            16,416
Subjectivity   10,000                CV                2           21,335            17,896
Twitter        3,115                 CV                3           6,266             4,460

Table 1: Summary of the 8 datasets that were used in our document classification experiments.
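Table 2 below reports accuracy and macro-averaged F1-score for every method and dataset. As a point of reference, both metrics can be computed as in the following sketch; the use of scikit-learn is our choice, and y_true / y_pred are placeholders for the gold and predicted labels of a test split (or cross-validation fold).

```python
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```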
Dataset        Reuters             Amazon              TREC                Snippets
Method         Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score
BOW (binary)   0.9571    0.8860    0.9126    0.9127    0.9660    0.9692    0.6171    0.5953
Centroid       0.9676    0.9171    0.9311    0.9312    0.9540    0.9586    0.8123    0.8170
WMD            0.9502    0.8204    0.9200    0.9201    0.9240    0.9336    0.7417    0.7388
NBSVM          0.9712    0.9155    0.9486    0.9486    0.9780    0.9805    0.6474    0.6357
CNN            0.9707    0.9297    0.9448    0.9449    0.9800    0.9800    0.8478    0.8466
Gaussian       0.9712    0.9388    0.9498    0.9497    0.9820    0.9841    0.8224    0.8244

Dataset        BBCSport            Polarity            Subjectivity        Twitter
Method         Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score
BOW (binary)   0.9640    0.9690    0.7615    0.7614    0.9004    0.9004    0.7467    0.6205
Centroid       0.9923    0.9915    0.7783    0.7782    0.9100    0.9100    0.7361    0.5727
WMD            0.9871    0.9866    0.6642    0.6639    0.8604    0.8603    0.7031    0.4436
NBSVM          0.9871    0.9892    0.8698    0.8698    0.9369    0.9368    0.7852    0.6191
CNN            0.9486    0.9461    0.8037    0.8031    0.9315    0.9314    0.7549    0.6137
Gaussian       0.9974    0.9974    0.8021    0.8020    0.9310    0.9310    0.7534    0.6443

Table 2: Performance (accuracy and macro-average F1-score) in text categorization on the 8 datasets.
values of α close to 0.5. Furthermore, when dropping the second term of Equation 6 (α = 1), the method is equivalent to the Centroid baseline and the performance drops significantly.
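The excerpt does not state how α was selected. One plausible procedure (an assumption on our part, reusing the gram_matrix helper sketched earlier) is a small grid search over [0, 1] with cross-validation on the training kernel, as below.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical selection of alpha; train_docs / train_labels are placeholders.
best_alpha, best_score = 0.5, -np.inf
for alpha in np.linspace(0.0, 1.0, 11):
    K_train = gram_matrix(train_docs, train_docs, alpha)  # square training kernel
    score = cross_val_score(SVC(kernel="precomputed"),
                            K_train, train_labels, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
# alpha = 1 keeps only the mean term of Equation 6 and recovers the Centroid baseline.
```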
5 Conclusion

We proposed an approach that models each document as a Gaussian distribution based on the embeddings of its words. We then defined a function that measures the similarity between two documents based on the similarity of their distributions. Empirical evaluation demonstrated the effectiveness of the approach across a range of datasets. We attribute this performance gain to the high quality of the embeddings and to the proposed approach's ability to utilize them effectively.
References

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Katrin Erk and Sebastian Padó. 2008. A Structured Vector Space Model for Word Meaning in Context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 897–906.

Derek Greene and Pádraig Cunningham. 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In Proceedings of the 23rd International Conference on Machine Learning, pages 377–384.

Rie Johnson and Tong Zhang. 2015. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 655–665.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis using Subjectivity Summarization based on Minimum Cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 271–278.