A Hybrid Framework For Text Modeling With Convolutional RNN
Chenglong Wang, Feijun Jiang, Hongxia Yang
dissimilarities of a sentence pair. Recurrent neural networks (RNNs) are powerful tools for modeling sequential data, yet training them by back-propagation through time can be difficult. [43] proposes to add attention before computing the sentence representation in attention-based RNN models. [17] shows that dependence models derived from Markov random fields can be naturally extended by assigning weights to concepts, and demonstrates that the dependence model can be trained with existing learning-to-rank techniques using a relatively small number of training queries. [23] proposes key-value memory networks, versatile models for reading documents or knowledge bases and answering questions about them, which allow prior knowledge about the task at hand to be encoded in the key-value memories.

1.1 Contributions
The major contributions of this paper can be summarized as follows:

(1) We propose the hybrid conv-RNN framework, which processes text with both convolutional and recurrent neural networks, seamlessly integrating their respective strengths in extracting different aspects of linguistic information and thus strengthening the matching and classification power of the framework.
(2) We extend the base conv-RNN and propose novel frameworks for SC and AS respectively.
(3) We test empirically on a very wide variety of datasets, including WikiQA [51], InsuranceQA [4] and several benchmark SC datasets: movie reviews (MR [31]), the Stanford Sentiment Treebank (SST [38]), IMDB [18] and Subj [30]. For AS, the proposed model outperforms the state of the art on both test datasets; for SC, we achieve the best performance on 4 out of the 5 tasks. To the best of our knowledge, this is by far the most complete set of comparison results in the fields of AS and SC.

The rest of the paper is organized as follows. In Section 2, we briefly review related work on RNNs, CNNs and their hybrid frameworks. In Section 3, we introduce conv-RNN as well as the SC model and the attention-based AS model. Section 4 presents experimental results on extensive datasets and applications. Finally, we conclude the paper in Section 5.

2 RELATED WORK

2.1 Recurrent Neural Network (RNN)
Long short-term memory (LSTM) is a popular RNN model that has been widely applied to various NLP problems. The H-dimensional hidden state h_t at time step t is updated as follows:

    i_t = σ(W_i w_t + U_i h_{t-1} + b_i),        (1)
    f_t = σ(W_f w_t + U_f h_{t-1} + b_f),        (2)
    o_t = σ(W_o w_t + U_o h_{t-1} + b_o),        (3)
    C̃_t = tanh(W_c w_t + U_c h_{t-1} + b_c),     (4)
    C_t = i_t ∗ C̃_t + f_t ∗ C_{t-1},             (5)
    h_t = o_t ∗ tanh(C_t),                       (6)

where there are three gates, the input gate i, the forget gate f and the output gate o, together with a cell memory vector C_t; σ is the sigmoid function, and W ∈ R^{H×d}, U ∈ R^{H×H}, and b ∈ R^{H×1} are the network parameters. Single-directional LSTMs suffer from the weakness that they cannot utilize contextual information from future tokens. BI-LSTMs solve this problem by using both the previous and the future context: they process the sequence in two directions and generate two sequences of output vectors. The output for each token is the concatenation of the two vectors from both directions.

There is another popular RNN unit, the gated recurrent unit (GRU) [1]. The GRU is capable of capturing dependencies on different time scales adaptively. Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but without a separate memory cell. Its procedure of taking a linear sum between the existing state and the newly computed state is similar to that of the LSTM unit. The GRU, however, has no mechanism to control the degree to which its state is exposed; it exposes the whole state each time. Hence it is more appropriate in our situation due to the imbalance in length between questions and answers in AS. The hidden state h_t used for learning sentence representations is computed by

    h_t = (1 − z_t) ◦ h_{t−1} + z_t ◦ h̃_t,            (7)
    h̃_t = σ(W w_t + U[r_t ◦ h_{t−1}] + b),            (8)
    z_t = σ(W_z w_t + U_z h_{t−1} + b_z),              (9)
    r_t = σ(W_r w_t + U_r h_{t−1} + b_r),              (10)

where W, W_z, W_r ∈ R^{H×d}; U, U_z, U_r ∈ R^{H×H}; and b, b_z, b_r ∈ R^{H×1} are network parameters.

2.2 Convolutional Neural Network (CNN)
A CNN leverages three important ideas that can help improve a machine learning system: sparse interaction, parameter sharing and equivariant representation. Sparse interaction contrasts with traditional neural networks, where each output interacts with every input. In a CNN, the filter size (or kernel size) is usually much smaller than the input size; as a result, each output interacts only with a narrow window of the input. Parameter sharing refers to reusing the filter parameters across convolution operations, whereas each element in the weight matrix of a traditional neural network is used only once to compute the output. Equivariant representation is related to the idea of k-MaxPooling, which is usually combined with a CNN: each filter of the CNN represents some feature, and after the convolution operation the 1-MaxPooling value represents the highest degree to which the input contains that feature. The position of the feature in the input is irrelevant due to the convolution. This property is very useful for many NLP applications. Below is an example to demonstrate the CNN computation.

Assume that W ∈ R^{n×d} is the input sentence matrix, with each word represented by a d-dimensional word embedding vector, and that f ∈ R^{m×d} is a filter with sliding window size m. Then the convolutional output of the input W and the filter f is an n-dimensional vector o:

    o_i = Σ_{k=0}^{m−1} Σ_{j=0}^{d−1} f_{m−k−1,j} W_{i−k,j}.    (11)

After k-MaxPooling, the k largest values are kept for the filter f, indicating the k highest degrees to which filter f matches the input W.
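As a concrete illustration of Equation (11) and k-MaxPooling, the following NumPy sketch (our own illustrative code, not from the paper) convolves a sentence matrix with a single filter and keeps the k largest responses; zero-padding at the sentence boundary is our assumption.

```python
import numpy as np

def conv_filter_response(W, f):
    """Convolve sentence matrix W (n x d) with filter f (m x d), Eq. (11).

    Returns an n-dimensional response vector o; positions with i - k out of
    range are treated as zero rows (a zero-padding assumption).
    """
    n, d = W.shape
    m = f.shape[0]
    o = np.zeros(n)
    for i in range(n):
        for k in range(m):
            if 0 <= i - k < n:
                o[i] += np.dot(f[m - k - 1], W[i - k])  # sum over j of f_{m-k-1,j} W_{i-k,j}
    return o

def k_max_pool(o, k):
    """Keep the k largest responses of the filter (k-MaxPooling)."""
    return np.sort(o)[-k:][::-1]

# toy example: 6 words, 4-dim embeddings, filter window m = 2
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
f = rng.normal(size=(2, 4))
print(k_max_pool(conv_filter_response(W, f), k=2))
```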
There are some fundamental differences between CNNs and RNNs, and these differences bring different benefits. Convolutional networks can be stacked to represent large context sizes and to extract hierarchical features over larger contexts with increasingly abstract features. In contrast, an RNN views the input as a chain structure and therefore requires a linear number O(N) of operations; the latter, however, is well designed for sequence modeling. Moreover, in an RNN the next output depends on the previous hidden state, which is not suitable for parallelization over the elements of a sequence. A CNN, on the other hand, is very amenable to this computing paradigm, since the computation over all input words can be performed simultaneously.

2.3 Hybrid Framework
With the recent advances of neural network models in natural language processing, a standard approach to sequence modeling is to encode a sequence of text as an embedding vector using models such as a CNN or an RNN. For example, to match two sequences, a straightforward approach is to encode each sequence as a vector and then combine the two vectors to make a decision. In a CNN, the filter size (or kernel size) is usually much smaller than the input size. As a result, each output interacts only with a narrow window of the input and usually emphasizes the local lexical connections of the n-gram. On the other hand, RNNs are well designed for sequence modeling. In particular, long short-term memory (LSTM) models can successfully keep useful information from long-range dependencies, but at the cost of ignoring local n-gram coherence. Fundamentally, recurrent and convolutional neural networks have their own pros and cons, and it has been found that using a vector from either a CNN or an RNN to encode an entire sequence is not sufficient to capture all the important information in the sequence [8, 9].

There have been several attempts to design hybrid frameworks that coherently combine CNNs and RNNs and enjoy the merits of both. [40] developed hybrid models that process text using both convolutional and recurrent neural networks, extracting linguistic information from both structures to address passage AS. [6] proposed a novel neural network model based on a hybrid of ConvNets and BI-LSTMs for the semantic textual similarity measurement problem; besides that, their pairwise word interaction model and similarity focus layer can better capture fine-grained semantic information compared to previous sentence modeling approaches that attempt to "cram" all sentence information into a fixed-length vector. [44] proposed an efficient hybrid model that combines a fast deep model with an initial information retrieval model to handle AS effectively and efficiently.

3 MODEL FORMULATION

3.1 conv-RNN
CNNs with convolutional and nonlinear layers followed by a pooling layer have been widely used for semantic representation in text modeling across various NLP tasks, and have proven to achieve better performance than traditional NLP methods. However, a CNN emphasizes local n-gram features and cannot capture long-range interactions. On the other hand, an RNN can efficiently keep the useful information from long-range dependencies while emphasizing the local information at each time step t simultaneously. Fundamentally, recurrent and convolutional neural networks have their own pros and cons, and it has been found that using a vector from either a CNN or an RNN to encode an entire sequence is not sufficient to capture all the important information in the sequence. Thus, we propose the following hybrid framework, namely conv-RNN, which coherently combines CNNs and RNNs and enjoys the merits of both. Our model consists of the following four types of layers: word embedding layer, BI-RNN layer, convolutional layer and max-pooling layer. The notation is summarized in Table 1, and the overall procedure is given in Algorithm 1.

Table 1: Notations used in conv-RNN

Notation             Description
w_i                  the i-th word in the input sentence
|s|                  the length of the input sentence
S                    the input sentence
V                    the vocabulary
|V|                  the size of the vocabulary
W                    the word embedding matrix
d_w, d_r             the dimensions of the word embedding and the RNN cell, respectively
v_i                  the word embedding of word w_i
r_t^f, r_t^b, r_t    the outputs of the forward/backward RNN units and the BI-RNN layer, respectively, at time step t
h_|s|^f, h_|s|^b     the final hidden states of the forward/backward RNN units, respectively
n                    the number of filter vectors
f_i                  the i-th filter vector used in the convolution layer
c_it                 the output of the convolution layer with filter vector i at time step t
A_q                  the attention vector based on input question q
X_s                  the final semantic representation of input sentence S

Algorithm 1: The conv-RNN Algorithm

Input: an input sentence consisting of a series of words w_1, ..., w_|s|, where each w_i is drawn from a finite-sized vocabulary V.
1: Represent w_i by its corresponding word embedding v_i ∈ R^{d_w} via a lookup table operation v_i = LT_W(w_i). Define S = v_1, ..., v_|s| as the input sentence embedding matrix with dimension R^{d_w×|s|}.
2: Apply a BI-RNN to S to obtain outputs r_t^f, r_t^b ∈ R^{d_r} and final hidden states h_|s|^f, h_|s|^b of the forward and backward RNN units, respectively, at each time step t. Concatenate r_t^f and r_t^b to get r_t = [r_t^f; r_t^b] ∈ R^{2d_r}.
3: Use a set of n filter vectors f_i ∈ R^{2d_r} to process R and obtain C, where C_it = f_i^T · r_t.
4: Apply the rectified linear (ReLU) function max(0, x) to C, yielding A with A_it = max(0, C_it).
5: Apply max pooling to A to get X_s ∈ R^n, where X_s[i] = max_t A_it.
Output: Return X_s.
… of input sentences is critical to the performance of sentence classification tasks. conv-RNN is adopted to extract the semantic information of the input texts, which is then used to predict their classes. Specifically, there is a joint layer on top of conv-RNN. This joint layer concatenates the output of conv-RNN, X_q, and the two final hidden states from the forward and backward RNN units into X_join = [h_|s|^f'; X_q'; h_|s|^b']', which is used as the final representation of the input text. The model includes an additional hidden layer on top of the joint layer to allow for modeling interactions between the components of the intermediate representations. On top of the whole model there is a softmax classification layer, which generates a distribution over the class labels.

3.3 Attention Based conv-RNN for Answer Selection

[Figure 3: conv-RNN based question-answer matching network.]

We further propose an attention-based conv-RNN for AS, as illustrated in Figure 3. The problem is formulated as follows: a question q is associated with a set of candidate answers {a_1, ..., a_n} accompanied by their judgements {y_1, ..., y_n}, where y_i = 1 if the answer is correct and y_i = 0 otherwise. To better capture the QA relationship, we augment the input word embeddings with additional dimensions that represent the semantic similarity between the words of the question and answer sentences. Formally, for each word w_i^q in question q, we augment its word embedding v_i^q with an overlapping score o_i^q, which is the maximum inner product of v_i^q with any word embedding in the answer; similarly, for each word w_i^a in answer a, we augment its word embedding v_i^a with an overlapping score o_i^a, which is the maximum inner product of v_i^a with any word embedding in the question. This word matching feature was inspired by [45]. Given a word w_i, the final word representation is obtained by concatenating the original word embedding and the corresponding overlapping score.

We use separate BI-RNN layers to process the questions and answers, but adopt a shared convolutional layer. This is because questions and answers usually have very different structures; for example, answers are usually much longer than questions. Using weight-sharing layers on top of the embedding layers has been shown to significantly improve performance and convergence rate [4]. The intuition is that, with a shared convolutional layer, the corresponding elements in q and a are guaranteed to represent the same topic, whereas there is no such constraint with separate layers.

In addition, we develop a simple but effective attention mechanism to improve the semantic representations of the answers based on the questions. In QA pairs, the answers may be much longer than the questions and contain many words that are irrelevant to the questions. Hence, entirely separate encoding of questions and answers may yield answer sentence representations distracted by irrelevant information. We therefore add external information from the question BI-RNN encoder to the input of the conv-RNN used for answer sentence encoding. In recurrent neural networks, the final hidden state h_|s| or the average of all hidden states, (1/|s|) Σ_{t=1}^{|s|} h_t, is usually adopted as the question representation. In this paper, we add the final hidden states h_|s|^f and h_|s|^b from the forward and backward RNN units to obtain the attention vector A_q. We apply the gated recurrent unit (GRU) [1] as the RNN unit. Formally, given A_q, the hidden state h_t used for learning answer representations is computed by

    h_t = (1 − z_t) ◦ h_{t−1} + z_t ◦ h̃_t,                    (14)
    h̃_t = σ(W v_t + U[r_t ◦ h_{t−1}] + C A_q + b),            (15)
    z_t = σ(W_z v_t + U_z h_{t−1} + C_z A_q + b_z),            (16)
    r_t = σ(W_r v_t + U_r h_{t−1} + C_r A_q + b_r),            (17)

where W, W_z, W_r ∈ R^{d_r×d_w}; U, U_z, U_r ∈ R^{d_r×d_r}; C, C_z, C_r ∈ R^{d_r×d_r}; and b, b_z, b_r ∈ R^{d_r} are weight matrices, and σ is a non-linear activation function. This attention mechanism is designed to focus on the words in the answer sentences that are strongly connected to the question.

Given the resulting vector representations X_q and X_a, the Geometric mean of Euclidean and Sigmoid Dot product (GESD) [4] is used to measure the relatedness between the two representations:

    X_sim = 1/(1 + ‖x − y‖) × 1/(1 + exp(−γ(x y^T + c))).      (18)

GESD has been shown to achieve superior performance to simple cosine similarity. On top of the GESD layer and the two encoding blocks, there is a joint layer which concatenates X_q, X_a and X_sim into a single vector X_join = [X_q'; X_a'; X_sim']'. This vector is then passed through two layers of fully-connected neural networks, which generate a distribution over the class labels.
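To make Equations (14)-(17) concrete, the following NumPy sketch (our own illustration, not the authors' code) performs one step of the attention-augmented GRU, where the question attention vector A_q is added to every gate; the shapes follow the dimensions stated above, and the parameter dictionary P is a hypothetical container we introduce for compactness.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attentive_gru_step(v_t, h_prev, A_q, P):
    """One step of the attention-augmented GRU, Eqs. (14)-(17).

    P holds parameters W*, U*, C*, b* with shapes d_r x d_w, d_r x d_r,
    d_r x d_r and d_r respectively; A_q is the question attention vector.
    """
    z = sigmoid(P["Wz"] @ v_t + P["Uz"] @ h_prev + P["Cz"] @ A_q + P["bz"])  # update gate, Eq. (16)
    r = sigmoid(P["Wr"] @ v_t + P["Ur"] @ h_prev + P["Cr"] @ A_q + P["br"])  # reset gate, Eq. (17)
    # candidate state, Eq. (15); the paper writes sigma here rather than tanh
    h_tilde = sigmoid(P["W"] @ v_t + P["U"] @ (r * h_prev) + P["C"] @ A_q + P["b"])
    return (1.0 - z) * h_prev + z * h_tilde                                   # Eq. (14)

# toy shapes: d_w = 4, d_r = 3
rng = np.random.default_rng(1)
P = {k: rng.normal(size=(3, 4)) for k in ("W", "Wz", "Wr")}
P.update({k: rng.normal(size=(3, 3)) for k in ("U", "Uz", "Ur", "C", "Cz", "Cr")})
P.update({k: rng.normal(size=3) for k in ("b", "bz", "br")})
print(attentive_gru_step(rng.normal(size=4), np.zeros(3), rng.normal(size=3), P))
```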
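The GESD relatedness score of Equation (18) is likewise only a few lines; the sketch below is our own illustration, with γ and c defaulting to the values 1.0 reported in Section 4.2.3 and the 200-dimensional inputs being illustrative.

```python
import numpy as np

def gesd(x, y, gamma=1.0, c=1.0):
    """Geometric mean of Euclidean and Sigmoid Dot (GESD), Eq. (18)."""
    euclidean = 1.0 / (1.0 + np.linalg.norm(x - y))                 # 1 / (1 + ||x - y||)
    sigmoid_dot = 1.0 / (1.0 + np.exp(-gamma * (np.dot(x, y) + c))) # 1 / (1 + exp(-gamma(xy^T + c)))
    return euclidean * sigmoid_dot

x_q = np.random.randn(200)   # question representation X_q (illustrative size)
x_a = np.random.randn(200)   # answer representation X_a
print(gesd(x_q, x_a))
```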
4 EXPERIMENTS AND EVALUATIONS

4.1 Sentence Classification

4.1.1 Experimental Datasets. We tested conv-RNN for SC (proposed in Section 3.2) on five widely used datasets, summarized as follows:

• MR: Short movie review dataset with one sentence per review. Each review is labeled with its overall sentiment polarity (positive or negative).
• SST-1: Stanford Sentiment Treebank 1, an extension of the movie review dataset. It includes fine-grained labels (very positive, positive, neutral, negative, very negative) for 215,154 phrases in the parse trees of 11,855 sentences, which makes it convenient for applications of recursive neural networks (RecNN).
• SST-2: Similar to SST-1 but with only binary labels (positive or negative, with neutral reviews removed).
• Subj: Subjectivity dataset containing sentences labeled with respect to their subjectivity status (subjective or objective).
• IMDB: A large Internet movie database for binary sentiment classification. It includes 50k full-length labeled reviews with provided training and testing splits. In addition, it provides 50k unlabeled reviews for unsupervised learning.

The samples of the first 4 tasks are short snippets with an average length of less than 60 words. IMDB is a much larger dataset containing reviews with an average length of more than 250 words. Summary statistics of these datasets are given in Table 2. We preprocessed the texts so that punctuation marks are treated as separate tokens and tokenized the text on whitespace. We did not truncate the sentences to a specific length. In addition, all characters were converted to lower case. For comparison with other published results, the standard splits are used when available; for MR and Subj, 10-fold cross-validation was used.

Table 2: Summary Statistics of SC Datasets.

Data    c   l    N        |V|      V_pre    Test
MR      2   20   10,662   18,765   16,448   CV
SST-1   5   53   11,855   17,836   16,262   2,210
SST-2   2   53   9,613    16,188   14,827   1,821
Subj    2   23   10,000   21,323   17,913   CV
IMDB    2   251  50,000   102,896  58,962   25,000

Note: c: number of classes. l: average sentence length. N: dataset size. |V|: vocabulary size. V_pre: number of words present in the set of pre-trained word embeddings. Test: test set size. CV (cross-validation): no standard train/test split, so 10-fold CV was used.

4.1.2 Baseline Competitors. We carried out very extensive comparisons with state-of-the-art methodologies, which can be broadly divided into the following categories:

Traditional Machine Learning (ML): [3] studied a statistical parsing framework for sentence-level sentiment classification; [46] identified that simple Naive Bayes (NB) and Support Vector Machine (SVM) variants outperform most published results on sentiment analysis datasets; [47] showed how to do fast dropout training by sampling from or integrating a Gaussian approximation, which is justified by the central limit theorem and empirical evidence and yields an order-of-magnitude speedup and more stability; [42] also showed that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix; [19] compared several machine learning approaches in the field of sentiment analysis and combined them to achieve better performance.

Deep Learning (DL): [14] extended word2vec with a new method called Paragraph-Vec, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. [36-38] are various extensions of recursive networks; [26] proposed to incorporate generic and target-domain embeddings in a CNN for SC; [45] proposed a general "compare-aggregate" framework that performs word-level matching followed by aggregation using a CNN; [10] reported on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks; [13] empirically studied desirable properties such as semantic coherence, attention mechanisms and kernel reusability in CNNs for learning SC tasks; [2] presented two approaches that use unlabeled data to improve sequence learning with recurrent networks; [7] leveraged the Combinatory Categorial Grammar (CCG) combinatory operators to guide a non-linear transformation of meaning within a sentence; [27] proposed novel approaches that use word embeddings created from both generic and target-domain corpora when it is difficult to find a domain corpus large enough for creating effective word embeddings.

Hybrid Framework of ML and DL: [28] presented a dependency-tree-based method for sentiment classification of Japanese and English subjective sentences using conditional random fields with hidden variables; [39] introduced a generalization of LSTMs to tree-structured network topologies; [16] used the multi-task learning framework to jointly learn across multiple related tasks based on recurrent neural networks.

4.1.3 Experimental Setup.

Pre-trained Word Vectors: For all of the SC tasks, we use the publicly available pre-trained word2vec [21, 22] vectors trained on part of the Google News dataset (about 100 billion words). This word embedding model was trained using the continuous bag-of-words architecture and contains 300-dimensional vectors for 3 million words and phrases. For words that are not present in word2vec, a uniform distribution was used to generate the vector representations. Preliminary experiments showed that it is better for randomly generated vectors to have the same variance as the pre-trained ones; a uniform distribution over [−0.25, 0.25] was therefore used to generate random vectors for words that are not present in word2vec. [35] has shown that it is better to keep the word embeddings static if the dataset is too small to fine-tune the word matrix. On the other hand, fine-tuning the word matrix along with the model may improve the final results for larger datasets. As a result, we used two sets of word embeddings, static and fine-tuned, and the two sets of vectors are concatenated to represent each word. During training, gradients are back-propagated through only one word matrix, so the model is able to fine-tune one word matrix while keeping the other static. Both word matrices are initialized with word2vec, or with the uniform distribution for words that are not present in word2vec.
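The initialization described above can be sketched as follows (our own illustrative NumPy code; `pretrained` stands in for a hypothetical word2vec lookup dictionary, and the two channels correspond to the static and fine-tuned matrices).

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300, seed=0):
    """Initialize one word matrix: the word2vec vector if available,
    otherwise a uniform sample over [-0.25, 0.25]."""
    rng = np.random.default_rng(seed)
    E = np.zeros((len(vocab), dim), dtype=np.float32)
    for idx, word in enumerate(vocab):
        if word in pretrained:
            E[idx] = pretrained[word]
        else:
            E[idx] = rng.uniform(-0.25, 0.25, size=dim)
    return E

# two channels built from the same initialization: one kept static,
# one fine-tuned during training (gradients flow through only the latter)
vocab = ["the", "movie", "was", "great", "unseen-word"]
pretrained = {"the": np.full(300, 0.1), "movie": np.full(300, 0.2)}  # toy stand-in
static_channel = build_embedding_matrix(vocab, pretrained)
tuned_channel = static_channel.copy()
```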
Training and Hyper-parameter settings: Similarly to [10], we used a grid search on the SST-2 dev set to determine the best configurations, with no fine-tuning for the remaining tasks. In particular, we tuned the hyper-parameter combinations shown in Table 3 and carried out experiments over the whole grid.

Table 3: Hyper-parameter Configurations for Grid Search

RNN model                    GRU or LSTM
Dimension of RNN unit d_r    100, 150, 200
Number of filters n          150, 200, 300
Dimension of hidden layer    200, 400, 600
Weight of L2-norm            10^-5, 10^-4, 10^-3, 10^-2

As a result, we used an LSTM with dimension 150 as the RNN unit. The number of filters n in the convolution layer is set to 200, and the dimension of the hidden layer is 200. We also added an L2 regularization term to the loss function, with the weight of the L2-norm set to 10^-3. For MR and Subj, we additionally apply dropout on the word embedding layer, BI-RNN layer and max-pooling layer. The optimal dropout rate is 0.2, selected from {0.2, 0.4, 0.6, 0.8}; the dropout rate was optimized along with the other hyper-parameters by grid search for MR and Subj. The overall network is trained to minimize the cross-entropy between the predicted and true labels. The model is trained with mini-batches by back-propagation using the Adam optimization method [11]. The batch size is set to 16, chosen from {16, 32, 64, 128}, and the learning rate is set to 5 × 10^-4, chosen from {10^-4, 5 × 10^-4, 10^-3}. Overall, we do not perform any task-specific tuning except for dropout.

4.1.4 Results and Discussions. Table 4 lists the test accuracy of our model compared to other published methods on the 5 benchmark datasets.

Table 4: Comparison Results of conv-RNN on sentence classification tasks.

Model                                    MR     SST-1  SST-2  Subj   IMDB
Sent-Parser [3]                          79.5   -      -      -      -
NBSVM [46]                               79.4   -      -      93.2   91.32
MNB [46]                                 79.0   -      -      93.6   86.59
G-Dropout [47]                           79.0   -      -      93.4   91.2
F-Dropout [47]                           79.1   -      -      93.6   91.1
Drop-Bi [42]                             -      -      -      -      91.98
NB-SVM Trigram [19]                      -      -      -      -      91.87
Paragraph-Vec [14]                       -      48.7   87.7   -      92.58
RAE [37]                                 77.7   43.2   82.4   -      -
MV-RNN [36]                              79.0   44.4   82.9   -      -
RNTN [38]                                -      45.7   85.4   -      -
DCNN [26]                                -      48.5   86.8   -      -
CNN-non-static [10]                      81.5   48.0   87.2   93.4   -
CNN-multichannel [10]                    81.1   47.4   88.1   93.2   -
SA-LSTM [2]                              -      -      -      -      92.76
WkA + 25% flexible filters (FF) [13]     80.02  46.11  84.29  92.68  90.16
Fully Connected Layer Combination [27]   81.59  -      -      -      -
Tree-CRF [28]                            77.3   -      -      -      -
Tree-LSTM [39]                           -      50.6   86.9   -      -
Multi-Task [16]                          -      49.6   87.9   94.1   91.3
CCAE [7]                                 77.8   -      -      -      -
conv-RNN                                 81.99  51.67  88.91  94.13  90.39

We use blue to highlight wins and '-' to represent results that are not provided.

For MR/SST-2, the best performances are achieved by CNN-based models; for SST-1, LSTM-based models behave better overall. Interestingly, for Subj and IMDB, simple models such as Naive Bayes/SVM with bag-of-words features achieve excellent performance, and none of the deep learning models do significantly better. Specifically, the best result for IMDB was achieved by SA-LSTM with additional unlabeled data; [46] explored this situation with follow-up discussions (https://github.com/sidaw/nbsvm). We argue that IMDB is a much larger dataset, containing reviews with an average length of more than 250 and a maximum length of 2,635. [24] also suggests that "statistical methods" work well for datasets with hundreds of words in each example but cannot handle snippets with only a few sentences. Deep neural networks are limited in their representations of large bodies of text, which is a worthy direction for further study. When there is sufficient content, simple methods such as bag of words (BOW) are good enough.

As we can see, in 4 out of 5 tasks our model exceeds all the other state-of-the-art methods with little task-specific hyper-parameter tuning. Notice that the previous state-of-the-art results cover different families of methods: traditional ML, DL, and hybrid frameworks of ML and DL. For MR/Subj, the number of sentences is an order of magnitude smaller than the number of parameters in our model, hence regularization has a great effect on performance. We use the L2-norm and dropout to control overfitting. Dropout has proven to be such a powerful regularizer that it enables us to use a sufficiently large network. Consistently, dropout achieved a 2%-3% relative improvement, while the L2-norm gained only slightly better results (usually less than 1%). For SST-1/SST-2, the labels are provided at the phrase level, with more than 200k samples to train the models, and regularization is not very important for avoiding overfitting in this situation.
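For completeness, the optimization setup reported in Section 4.1.3 can be summarized in a few lines of PyTorch. This is our own illustrative sketch, not the authors' code: the ReLU activation of the hidden layer and the exact size of the joint vector (n filters plus the two final RNN hidden states) are assumptions based on the reported settings.

```python
import torch
import torch.nn as nn

# joint representation X_join = [h_f; X_q; h_b]: 200 filters plus two 150-dim
# final hidden states (sizes from the grid-search results above)
classifier = nn.Sequential(
    nn.Linear(200 + 2 * 150, 200),  # hidden layer of dimension 200
    nn.ReLU(),
    nn.Dropout(p=0.2),              # dropout rate selected from {0.2, 0.4, 0.6, 0.8}
    nn.Linear(200, 2),              # logits for the softmax over class labels
)
criterion = nn.CrossEntropyLoss()   # cross-entropy of predicted vs. true labels
optimizer = torch.optim.Adam(classifier.parameters(),
                             lr=5e-4, weight_decay=1e-3)  # Adam + L2 weight 1e-3

def train_step(x_join, labels):
    """One mini-batch update (batch size 16 in the reported setup)."""
    optimizer.zero_grad()
    loss = criterion(classifier(x_join), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(16, 500), torch.randint(0, 2, (16,))))
```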
4.2 Answer Selection

4.2.1 Datasets. To test the performance of the proposed attention-based conv-RNN in Section 3.3, we focus on the following two widely used benchmark datasets, with summary statistics given in Table 5.

Table 5: Summary statistics of the AS datasets.

           InsuranceQA                 WikiQA
           Train    Dev     Test       Train   Dev    Test
#Q         12887    1000    1800*2     832     126    243
#C         50       500     500        10      9      10
#w in Q    7.2      7.2     7.2        6.5     6.5    6.4
#w in A    92.1     92.1    92.1       25.5    24.7   25.1

Note: #Q: number of questions. #C: average number of answers per question. #w in Q: average number of words per question. #w in A: average number of words per answer.

• WikiQA: An open-domain AS dataset containing 3,047 questions originally sampled from Bing query logs. The candidate answers were extracted from the summary paragraphs of the associated Wikipedia pages, with crowdsourced labels indicating whether each sentence is a correct answer to the question. 20.3% of the answers in the WikiQA dataset share no content words with their questions, so the dataset is constructed in a natural and realistic manner. We followed the same pre-processing steps as [50] and adopted the standard setup of considering only questions that have correct answers for training and evaluation.
• InsuranceQA: A large-scale non-factoid QA dataset. All of the pairs are from the insurance domain. It provides a training set, a validation set and two test sets. For each question in the test sets and the dev set, there is a set of 500 candidate answers, which includes the ground-truth answers and randomly selected negative answers.

4.2.2 Baseline Competitors. [51] released the WikiQA dataset and compared methods that achieve very competitive results. Other methods can be categorized into Information Retrieval, DNN and the recently popular Attention Based DNN as follows:

Information Retrieval: [23] introduced a new method, Key-Value Memory Networks, that makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation; [49] presented an information retrieval approach for chatbot engines that can leverage unstructured documents, instead of Q-R pairs, to respond to utterances; [17] showed that one of the most effective existing term dependence models can be naturally extended by assigning weights to concepts, and demonstrated that the weighted dependence model can be trained using existing learning-to-rank techniques, even with a relatively small number of training queries.

DNN: [6] presented a hybrid deep learning network to explicitly model pairwise word interactions, with a novel similarity focus mechanism to identify important correspondences for better similarity measurement; [20] introduced a generic variational inference framework for generative and conditional models of text and validated this framework on two very different text modeling applications, generative document modeling and supervised question answering; [48] designed a model that takes into account both similarities and dissimilarities by decomposing and composing lexical semantics over sentences; [35] used the relational information given by the matches between words from the two members of the pair through a CNN.

Attention Based DNN: [34] proposed Attentive Pooling (AP), a two-way attention mechanism for discriminative model training; [52] presented a similar general Attention Based CNN (ABCNN) for modeling a pair of sentences; [43] analyzed the deficiency of traditional attention-based RNN models quantitatively and qualitatively and presented three new RNN models that add attention information before the RNN hidden representation.

4.2.3 Experimental Setup. For the AS tasks, we utilize the Global Vectors for Word Representation (GloVe) [32]. Specifically, we use the provided Common Crawl model, with 300-dimensional vectors and a 2.2M-word vocabulary, to initialize the word matrix. For InsuranceQA, we used two sets of word embeddings during training, static and fine-tuned; these two sets of vectors are concatenated to represent the corresponding words. For WikiQA, we only use the static embedding, because WikiQA is an open-domain dataset and its train/dev/test sets contain separate questions from different domains, so there are far fewer overlapping words among the train/dev/test sets. Moreover, WikiQA is much smaller than InsuranceQA, so fine-tuning the word matrix during training would easily lead to overfitting and hurt the final outputs.

Similarly to the setup of the SC experiments, a grid search on WikiQA is used to determine the best configurations, with no tuning for InsuranceQA. In particular, we tuned the same hyper-parameters as in Section 4.1.3. As a result, we use the GRU as the RNN unit in the BI-RNN to encode questions and answers respectively. Questions and answers share the same convolution layer and max-pooling layer. The dimension of the GRU, d_r, is set to 150, and the number of filters is set to 200. The parameters of GESD, γ and c, are both set to 1.0. The cross-entropy between the predicted and true distributions is the objective function to be optimized. An L2-norm term is also added to the loss function for regularization, with the regularization weight set to 10^-4. We use dropout on the BI-RNN layer and the joint layer, with the dropout rate set to 0.8. We use two layers of fully-connected neural networks on top of X_join to predict the probability distribution over classes, with a hidden size of 200. Training is done via stochastic gradient descent over shuffled mini-batches, updated through Adam. The learning rate is set to 5 × 10^-4 and the batch size is set to 64. We use mean average precision (MAP) and mean reciprocal rank (MRR) over the ranked set of candidate answers to measure performance on WikiQA; for InsuranceQA, performance is measured using top-one accuracy.
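For reference, MAP and MRR over ranked candidate lists can be computed as in the small sketch below (our own illustrative code, assuming binary relevance labels as in WikiQA and candidates already sorted by model score).

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(relevance):
    """Reciprocal rank of the first relevant candidate (0 if none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def map_mrr(ranked_labels_per_question):
    """MAP and MRR over a set of questions; each entry is one question's
    label list, sorted by the model's predicted score."""
    aps = [average_precision(labels) for labels in ranked_labels_per_question]
    rrs = [reciprocal_rank(labels) for labels in ranked_labels_per_question]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

# toy example: two questions with 4 and 3 ranked candidates
print(map_mrr([[0, 1, 0, 1], [1, 0, 0]]))  # (MAP, MRR)
```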
4.2.4 Results and Discussions. Tables 6 and 7 summarize the results of the proposed attention-based conv-RNN.

Table 6: Comparison Results of conv-RNN on WikiQA.

Model                           MAP     MRR
Word Cnt [51]                   0.4891  0.4924
Wgt Word Cnt [51]               0.5099  0.5132
LCLR [51]                       0.5993  0.6086
Key-Value Memory Network [23]   0.7069  0.7265
DocChat+(2) [49]                0.7008  0.7222
Paragraph-Vec [51]              0.5110  0.5160
CNN [51]                        0.6190  0.6281
Paragraph-Vec-Cnt [51]          0.5976  0.6058
CNN-Cnt [51]                    0.6520  0.6652
CubeCNN [6]                     0.7090  0.7234
NASM + Cnt [20]                 0.689   0.707
L.D.C [48]                      0.7058  0.7226
CNNr [35]                       0.6951  0.7107
IARNN-Occam(context) [43]       0.7341  0.7418
PairwiseRank+SentLevel [33]     0.701   0.718
AP-CNN [34]                     0.6886  0.6957
ABCNN [52]                      0.6914  0.7127
conv-RNN                        0.7427  0.7504

We use blue to highlight wins.

Table 7: Comparison Results of conv-RNN on InsuranceQA.

Model                           dev    test1   test2
IR model [17]                   52.7   55.1    50.8
QA-LSTM with attention [41]     68.4   68.1    62.2
CNN with GESD [4]               65.4   65.3    61.0
Attentive LSTM [40]             68.9   69.0    64.8
IARNN-Occam [43]                69.1   68.9    65.1
IARNN-Gate [43]                 70.0   70.1    62.8
AP-BILSTM [34]                  68.4   71.7    66.4
conv-RNN                        71.7   71.4    68.3

For WikiQA, it is clear that sentence semantic models based on deep neural networks (CNN or RNN) significantly outperform traditional information retrieval methods, suggesting that semantic understanding beyond lexical semantics is important for AS tasks. Much previous work [20, 51] has demonstrated a significant accuracy boost from combining a lexical overlap feature with the output of the deep semantic model. Results from attention-based neural network models verify the effectiveness of the attention mechanism for AS tasks. Our attention-based conv-RNN also demonstrates its effectiveness in semantic representation on this task. Notice that the comparisons also revealed that attention is important for semantic matching of questions and answers: the proposed attention mechanism consistently boosts the MAP measure on WikiQA by 1.0%-2.0% on average.

5 CONCLUSIONS
We propose a generic hybrid framework for text modeling, namely conv-RNN, which seamlessly integrates the merits of both CNNs and RNNs. Based on conv-RNN, we also propose a novel sentence classification model and an attention-based answer selection model, both of which utilize the effectiveness of conv-RNN for semantic understanding to strengthen sentence classification and matching power respectively. We test empirically on a very wide variety of datasets for sentence classification and answer selection and demonstrate the effectiveness of conv-RNN.

REFERENCES
[1] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.
[2] A.M. Dai and Q.V. Le. 2015. Semi-Supervised Sequence Learning. In Advances in Neural Information Processing Systems. 3079-3087.
[3] L. Dong, F. Wei, S. Liu, M. Zhou, and K. Xu. 2015. A Statistical Parsing Framework for Sentiment Classification. Computational Linguistics 41, 2 (2015), 293-336.
[4] M. Feng, B. Xiang, M.R. Glass, L. Wang, and B. Zhou. 2015. Applying Deep Learning to Answer Selection: A Study and an Open Task. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[5] Z. Gan, Y. Pu, R. Henao, C. Li, X. He, and L. Carin. 2016. Unsupervised Learning of Sentence Representations using Convolutional Neural Networks. arXiv preprint arXiv:1611.07897.
[6] H. He and J. Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In Proceedings of NAACL-HLT.
[7] K.M. Hermann and P. Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics. In The 51st Annual Meeting of the Association for Computational Linguistics.
[8] K.M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems. 1693-1701.
[9] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. In Proceedings of the International Conference on Learning Representations.
[10] Y. Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1746-1751.
[11] D.P. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR (2014).
[12] R. Kiros, Y. Zhu, R.R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. 2015. Skip-Thought Vectors. In Advances in Neural Information Processing Systems. 3294-3302.
[13] M. Lakshmana, S. Sellamanickam, S. Shevade, and K. Selvaraj. 2016. Learning Semantically Coherent and Reusable Kernels in Convolution Neural Nets for Sentence Classification. arXiv:1608.00466.
[14] Q.V. Le and T. Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning.
[15] Q.V. Le and T. Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, Vol. 14. 1188-1196.
[16] P. Liu, X. Qiu, and X. Huang. 2016. Recurrent Neural Network for Text Classification with Multi-Task Learning. arXiv:1605.05101.
[17] M. Bendersky, D. Metzler, and W.B. Croft. 2010. Learning Concept Importance Using a Weighted Dependence Model. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM).
[18] A.L. Maas, R.E. Daly, P.T. Pham, D. Huang, A.Y. Ng, and C. Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. 142-150.
[19] M. Mesnil, T. Mikolov, M. Ranzato, and Y. Bengio. 2015. Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews. Workshop contribution at ICLR 2015.
[20] Y. Miao, L. Yu, and P. Blunsom. 2016. Neural Variational Inference for Text Processing. In Proceedings of the 33rd International Conference on Machine Learning.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR.
[22] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.). Curran Associates, Inc., 3111-3119.
[23] A. Miller, A. Fisch, J. Dodge, A.H. Karimi, A. Bordes, and J. Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. arXiv:1602.03126.
[24] K. Moilanen and S. Pulman. 2007. Sentiment Composition. In Proceedings of RANLP. 378-382.
[25] R.J. Mooney. 2014. Semantic Parsing: Past, Present, and Future. In Association for Computational Linguistics (ACL) Workshop on Semantic Parsing.
[26] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 655-665.
[27] N. Limsopatham and N. Collier. 2016. Modelling the Combination of Generic and Target Domain Embeddings in a Convolutional Neural Network for Sentence Classification. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing. 103-112.
[28] T. Nakagawa, K. Inui, and S. Kurohashi. 2010. Dependency Tree-based Sentiment Classification Using CRFs with Hidden Variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 786-794.
[29] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. 2016. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24, 4 (2016), 694-707.
[30] B. Pang and L. Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of ACL.
[31] B. Pang and L. Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of ACL. 115-124.
[32] J. Pennington, R. Socher, and C.D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532-1543.
[33] J. Rao, H. He, and J. Lin. 2016. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks. In Proceedings of CIKM '16.
[34] C. Santos, M. Tan, B. Xiang, and B. Zhou. 2016. Attentive Pooling Networks. arXiv:1602.03609.
[35] A. Severyn and A. Moschitti. 2016. Modeling Relational Information in Question-Answer Pairs with Convolutional Neural Networks. arXiv:1602.01178.
[36] R. Socher, B. Huval, C.D. Manning, and A.Y. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 1201-1211.
[37] R. Socher, J. Pennington, E.H. Huang, A.Y. Ng, and C.D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 151-161.
[38] R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, and C. Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1631-1642.
[39] K.S. Tai, R. Socher, and C.D. Manning. 2015. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 1556-1566.
[40] M. Tan, C. Santos, B. Xiang, and B. Zhou. 2016. Improved Representation Learning for Question Answer Matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 464-473.
[41] M. Tan, B. Xiang, and B. Zhou. 2015. LSTM-based Deep Learning Models for Non-factoid Answer Selection. arXiv preprint arXiv:1511.04108.
[42] S. Wager, S. Wang, and P.S. Liang. 2013. Dropout Training as Adaptive Regularization. In Advances in Neural Information Processing Systems. 351-359.
[43] B. Wang, K. Liu, and J. Zhao. 2016. Inner Attention Based Recurrent Neural Networks for Answer Selection. In The Annual Meeting of the Association for Computational Linguistics.
[44] L. Wang, M. Tan, and J. Han. 2016. FastHybrid: A Hybrid Model for Efficient Answer Selection. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2378-2388.
[45] S. Wang and J. Jiang. 2016. A Compare-Aggregate Model for Matching Text Sequences. CoRR abs/1611.01747 (2016).
[46] S. Wang and C.D. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. 90-94.
[47] S.I. Wang and C.D. Manning. 2013. Fast Dropout Training. In Proceedings of the 30th International Conference on Machine Learning.
[48] Z. Wang, H. Mi, and A. Ittycheriah. 2016. Sentence Similarity Learning by Lexical Decomposition and Composition. arXiv:1602.07019.
[49] Z. Yan, N. Duan, J. Bao, P. Chen, M. Zhou, Z. Li, and J. Zhou. 2016. DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents. In The Annual Meeting of the Association for Computational Linguistics.
[50] Y. Yang, W. Yih, and C. Meek. 2015. A Challenge Dataset for Open-Domain Question Answering.
[51] Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of EMNLP. 2013-2018.
[52] W. Yin, H. Schütze, B. Xiang, and B. Zhou. 2016. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. In Transactions of the Association for Computational Linguistics, Vol. 4. 259-272.