MOQA Paper

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Addressing Complex and Subjective Product-Related

Queries with Customer Reviews

Julian McAuley Alex Yang


University of California University of California
San Diego San Diego
[email protected] [email protected]

ABSTRACT decision? And (2) how can we address specific queries that a user
Online reviews are often our first port of call when considering wishes to answer in order to evaluate a product?
products and purchases online. When evaluating a potential pur- To help users answer specific queries, review websites like Ama-
chase, we may have a specific query in mind, e.g. ‘will this baby zon offer community-Q/A systems that allow users to pose product-
seat fit in the overhead compartment of a 747?’ or ‘will I like this specific questions to other consumers.1 Our goal here is to respond
album if I liked Taylor Swift’s 1989?’. To answer such questions to such queries automatically and on-demand. To achieve this we
we must either wade through huge volumes of consumer reviews make the basic insight that our two goals above naturally comple-
hoping to find one that is relevant, or otherwise pose our question ment each other: given a large volume of community-Q/A data
directly to the community via a Q/A system. (i.e., questions and answers), and a large volume of reviews, we
In this paper we hope to fuse these two paradigms: given a large can automatically learn what makes a review relevant to a query.
volume of previously answered queries about products, we hope We see several reasons why reviews might be a useful source
to automatically learn whether a review of a product is relevant to of information to address product-related queries, especially com-
a given query. We formulate this as a machine learning problem pared to existing work that aims to solve Q/A-like tasks by building
using a mixture-of-experts-type framework—here each review is knowledge bases of facts about the entities in question:
an ‘expert’ that gets to vote on the response to a particular query; • General question-answering is a challenging open problem. It is
simultaneously we learn a relevance function such that ‘relevant’ certainly hard to imagine that a query such as “Will this baby seat
reviews are those that vote correctly. At test time this learned rele- fit in the overhead compartment of a 747?” could be answered
vance function allows us to surface reviews that are relevant to new by building a knowledge-base using current techniques. How-
queries on-demand. We evaluate our system, Moqa, on a novel cor- ever it is more plausible that some review of that product will
pus of 1.4 million questions (and answers) and 13 million reviews. contain information that is relevant to this query. By casting the
We show quantitatively that it is effective at addressing both binary problem as one of surfacing relevant opinions (rather than nec-
and open-ended queries, and qualitatively that it surfaces reviews essarily generating a conclusive answer), we can circumvent this
that human evaluators consider to be relevant. difficulty, allowing us to handle complex and arbitrary queries.
• Fundamentally, many of the questions users ask on review web-
Keywords sites will be those that can’t be answered using knowledge bases
Relevance ranking; question answering; text modeling; reviews; derived from product specifications, but rather their questions
bilinear models will be concerned with subjective personal experiences. Reviews
are a natural and rich source of data to address such queries.
1. INTRODUCTION • Finally, the massive volume and range of opinions makes review
Consumer reviews are invaluable as a source of data to help peo- systems difficult to navigate, especially if a user is interested in
ple form opinions on a wide range of products. Beyond telling us some niche aspect of a product. Thus a system that identifies
whether a product is ‘good’ or ‘bad’, reviews tell us about a wide opinions relevant to a specific query is of fundamental value in
range of personal experiences; these include objective descriptions helping users to navigate such large corpora of reviews.
of the products’ properties, subjective qualitative assessments, as To make our objectives more concrete, we aim to formalize the
well as unique use- (or failure-) cases. problem in terms of the following goal:
The value and diversity of these opinions raises two questions of
interest to us: (1) How can we help users navigate massive volumes Goal: Given a query about a particular product, we
of consumer opinions in order to find those that are relevant to their want to determine how relevant each review of that
product is to the query, where ‘relevance’ is measured
in terms of how helpful the review will be in terms of
identifying the correct response.
The type of system we produce to address this goal is demon-
Copyright is held by the International World Wide Web Conference Com- strated in Figure 1. Here we surface opinions that are identified
mittee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the as being ‘relevant’ to the query, which can collectively vote (along
author’s site if the Material is used in electronic media. with all other opinions, in proportion to their relevance) to deter-
WWW 2016, April 11–15, 2016, Montréal, Québec, Canada. mine the response to the query.
ACM 978-1-4503-4143-1/16/04. 1
http://dx.doi.org/10.1145/2872427.2883044. E.g. amazon.com/ask/questions/asin/B00B71FJU2
swer the question per se, but rather to surface relevant opinions that
Product: BRAVEN BRV-1 Wireless Bluetooth Speaker
will help the questioner form their own conclusion.
It seems then that to address our goal we’ll need a system with
two components: (1) A relevance function, to determine which re-
views contain information relevant to a query, and (2) a prediction
function, allowing relevant reviews to ‘vote’ on the correct answer.
However as we stated, our main goal is not to answer questions
directly but rather to surface relevant opinions that will help the
user answer the question themselves; thus it may seem as though
Query: “I want to use this with my iPad air while taking a this ‘voting’ function is not required. Indeed, at test time, only the
jacuzzi bath. Will the volume be loud enough over the bath relevance function is required—this is exactly the feature that shall
jets?” allow our model to handle arbitrary, open-ended, and subjective
customer opinions, ranked by relevance: vote: queries. However the voting function is critical at training time,
so that with a large corpus of already-answered questions, we can
“The sound quality is great, especially for the simultaneously learn relevance and voting functions such that ‘rel-
size, and if you place the speaker on a hard sur- evant’ reviews are those that vote for the correct answer.
yes
face it acts as a sound board, and the bass really The properties that we want above are captured by a classical
kicks up.” machine learning framework known as mixtures of experts [18].
Mixtures of experts are traditionally used when one wishes to com-
“If you are looking for a water resistant blue bine a series of ‘weak learners’—there the goal is to simultaneously
tooth speaker you will be very pleased with this yes estimate (a) how ‘expert’ each predictor is with respect to a partic-
product.” ular input and (b) the parameters of the predictors themselves. This
is an elegant framework as it allows learners to ‘focus’ on inputs
“However if you are looking for something to that they are good at classifying—it doesn’t matter if they some-
throw a small party this just doesnt have the no times make incorrect predictions, so long as they correctly classify
sound output.” those instances where they are predicted to be experts.
In our setting, individual reviews or opinions are treated as ex-
etc. etc. perts that get to vote on the answer to each query; naturally some
opinions will be unrelated to some queries, so we must also learn
how relevant (i.e., expert) each opinion is with respect to each
Response: Yes query. Our prediction (i.e., voting) function and relevance func-
tion are then learned simultaneously such that ‘relevant’ opinions
are precisely those that are likely to vote correctly. At test time, the
Figure 1: An example of how our system, Moqa, is used. This relevance function can be used directly to surface relevant opinions.
is a real output produced by Moqa, given the customer query We evaluate our model using a novel corpus of questions and
about the product above. We simultaneously learn which cus- answers from Amazon. We consider both binary questions (such as
tomer opinions are ‘relevant’ to the query, as well as a pre- the example in Figure 1), and open-ended questions, where reviews
diction function that allows each opinion to ‘vote’ on the re- must vote amongst alternative answers. Quantitatively, we compare
sponse, in proportion to its relevance. These relevance and pre- our technique to state-of-the-art methods for relevance ranking, and
diction functions are learned automatically from large corpora find that our learned definition of relevance is more capable of re-
of training queries and reviews. solving queries compared to hand-crafted relevance measures.
Qualitatively, we evaluate our system by measuring whether hu-
man evaluators agree with the notion of ‘relevance’ that we learn.
This is especially important for open-ended queries, where it is in-
feasible to answer questions directly, but rather we want to surface
This simple example demonstrates exactly the features that make
opinions that are helpful to the user.
our problem interesting and difficult: First, the query (‘is this loud
enough?’) is inherently subjective, and depends on personal expe-
rience; it is hard to imagine that any fact-based knowledge reposi- 1.1 Contributions
tory could provide a satisfactory answer. Secondly, it is certainly a We summarize our contributions as follows: First, we develop
‘long-tail’ query—it would be hard to find relevant opinions among a new method, Moqa, that is able to uncover opinions that are rel-
the (300+) reviews for this product, so a system to automatically evant to product-related queries, and to learn this notion of rele-
retrieve them is valuable. Third, it is linguistically complex—few vance from training data of previously answered questions. Sec-
of the important words in the query appear among the most rele- ond, we collect a large corpus of 1.4 million answered questions
vant reviews (e.g. ‘jacuzzi bath’/‘loud enough’)—this means that and 13 million reviews on which to train the model. Ours is among
existing solutions based on word-level similarity are unlikely to be the first works to combine community Q/A and review data in this
effective. This reveals the need to learn a complex definition of way, and certainly the first to do it at the scale considered here.
‘relevance’ that is capable of accounting for subtle linguistic differ- Third, we evaluate our system against state-of-the-art approaches
ences such as synonyms. for relevance ranking, where we demonstrate (a) the need to learn
Finally, in the case of Figure 1, our model is able to respond to the notion of ‘relevance’ from training data; (b) the need to handle
the query (in this instance correctly) with a binary answer. More heterogeneity between questions, reviews, and answers; and (c) the
importantly though, the opinions surfaced allow the user to deter- value of opinion data to answer product-related queries, as opposed
mine the answer themselves—in this way we can extend our model to other data like product specifications.
to handle general open-ended queries, where the goal is not to an- Code and data is available on the first author’s webpage.
2. RELATED WORK
Table 1: Notation.
The most closely related branches of work to ours are (1) those
Symbol Description
that aim to mine and summarize opinions and facets from docu-
ments (especially from review corpora), and (2) those that study q ∈ Q, a ∈ A query and query set, answer and answer set
Q/A systems in general. To our knowledge our work is among the y∈Y label set (for binary questions)
first at the interface between these two tasks, i.e., to use consumer r∈R review and review set
reviews as a means of answering general queries about products, s relevance/scoring function
though we build upon ideas from several related areas. v prediction/voting function
δ indicator function (1 iff the argument is true)
Document summarization. Perhaps most related to our goal of θ, ϑ, A, B terms in the bilinear relevance function
selecting relevant opinions among large corpora of reviews is the ϑ0 , X, Y terms in the bilinear prediction function
problem of multi-document summarization [25, 30]. Like ours, this p(r|q) relevance of a review r to a query q
task consists of finding relevant or ‘salient’ parts of documents [7, p(y|r, q) probability of selecting a positive answer to
30] and intelligently combining them. Most related are approaches a query q given a review r
that apply document summarization techniques to ‘evaluative text’ p(a > ā|r) preference of answer a over ā
(i.e., reviews), in order to build an overview of opinions or product
features [6, 22, 31]. In contrast to our contribution, most of the
above work is not ‘query-focused,’ e.g. the goal is to summarize
product features or positive vs. negative opinions, rather than to
address specific queries, though we note a few exceptions below. training data as it is here. Next (as mentioned above) is the notion
that questions, answers, and documents are heterogeneous, mean-
Relevance ranking. A key component of the above line of work is ing that simple bag-of-words type approaches may be insufficient
to learn whether a document (or a phrase within a document) is rel- to compare them [3, 46], so that instead one must decompose ques-
evant to a given query. ‘Relevance’ can mean many things, from the tions [15] or model their syntax [32]. Also relevant is the problem
‘quality’ of the text [1], to its lexical salience [10], or its diversity of identifying experts [5, 21, 35, 40] or high-quality answers [2],
compared to already-selected documents [6]. In query-focused set- or otherwise identifying instances where similar questions have al-
tings, one needs a query-specific notion of relevance, i.e., to deter- ready been answered elsewhere [13, 19], though these differ from
mine whether a document is relevant in the context of a given query. our paradigm in that the goal is to select among answers (or an-
For this task, simple (yet effective) word-level similarity measures swerers), rather than to address the questions themselves.
have been developed, such as Okapi BM25, a state-of-the-art TF- Naturally also relevant is the large volume of Q/A work from the
IDF-based relevance ranking measure [20, 26]. A natural limita- information retrieval community (e.g. TREC Q/A2 ); however note
tion one must overcome though is that queries and documents may first that due to the data involved (in particular, subjective opinions)
be linguistically heterogeneous, so that word-level measures may our approach is quite different from systems that build knowledge
fail [3, 46]. This can be addressed by making use of grammatical bases (e.g. systems like Watson [11]), or generally systems whose
rules and phrase-level approaches (e.g. ROUGE measures [44]), or task is to retrieve a list of objective facts that conclusively answer
through probabilistic language models ranging from classical meth- a query. Rather, our goal is to use Q/A data as a means of learning
ods [37] to recent approaches based on deep networks [23, 41]. We a ‘useful’ relevance function, and as such our experiments mainly
discuss ranking measures more in Section 3.1. focus on state-of-the-art relevance ranking techniques.
Opinion mining. Studying consumer opinions, especially through
rating and review datasets is a broad and varied topic. Review text
2.1 Key differences
has been used to augment ‘traditional’ recommender systems by Though related to the above areas, our work is novel in a vari-
finding the aspects or facets that are relevant to people’s opinions ety of ways. Our work is among the first at the interface of Q/A
[14, 28, 43] and, more related to our goal, to find ‘helpful’ reviews and opinion mining, and is novel in terms of the combination of
[4, 9] or experts on particular topics [34]. There has also been data used, and in terms of scale. In contrast to the above work on
work on generating summaries of product features [17], includ- summarization and relevance ranking, given a large volume of an-
ing work using multi-document summarization as mentioned above swered queries and a corpus of weakly relevant documents (i.e.,
[6, 22, 31]. This work is related in terms of the data used, and the reviews of the product being queried), our goal is to be as agnostic
need to learn some notion of ‘relevance,’ though the goal is not as possible to the definition of “what makes an opinion relevant to a
typically to address general queries as we do here. We are aware of query?,” and to learn this notion automatically from data. This also
relatively little work that attempts to combine question-answering differentiates our work from traditional Q/A systems as our goal is
with opinion mining, though a few exceptions include [33], which not to answer queries directly (i.e., to output ‘facts’ or factoids), but
answers certain types of queries on Amazon data (e.g. “find 100 rather to learn a relevance function that will help users effectively
books with over 200 5-star ratings”); or [45] which learns to distin- navigate multiple subjective viewpoints and personal experiences.
guish ‘facts’ from subjective opinions; or [36], which tries to solve Critically, the availability of a large training corpus allows us to
cold-start problems by finding opinion sentences of old products learn complex mappings between questions, reviews, and answers,
that will be relevant to new ones. Though in none of these cases is while accounting for the heterogeneity between them.
the goal to address general queries.
Q/A systems. Many of the above ideas from multi-document sum- 3. MODEL PRELIMINARIES
marization, relevance ranking, and topical expert-finding have been Since our fundamental goal is to learn relevance functions so as
adapted to build state-of-the-art automated Q/A systems. First is to surface useful opinions in response to queries, we mainly build
‘query-focused’ summarization [7, 24], which is similar to our task upon and compare to existing techniques for relevance ranking.
in that phrases must be selected among documents that match some
2
query, though typically the relevance function is not learned from http://trec.nist.gov/tracks.html
We also briefly describe the mixture-of-experts framework (upon is very high-dimensional (in our application, the size of the vocab-
which we build our model) before we describe Moqa in Section 4. ulary squared), we assume that it is low-rank, i.e., that it can be
approximated by M ∼ AB T where A and B are each rank K.3
3.1 Standard measures for relevance ranking Thus our similarity measure becomes
We first describe a few standard measures for relevance ranking,
given a query q and a document d (in our case, a question and a qAB T dT = (qA) · (dB). (5)
review), whose relevance to the query we want to determine. This has an intuitive explanation, which is that A and B project
Cosine similarity is a simple similarity measure that operates on terms from the query and the document into a low-dimensional
Bag-of-Words representations of a document and a query. Here the space such that ‘similar’ terms (such as synonyms) in the query and
similarity is given by the document are projected nearby (and have a high inner product).

q·d 3.2 Mixtures of Experts


cos(q, d) = , (1)
kqkkdk Mixtures of experts (MoEs) are a classical way to combine the
i.e., the cosine of the angle between (the bag-of-words representa- outputs of several classifiers (or ‘weak learners’) by associating
tions of) the query q and a document d. This can be further refined weighted confidence scores with each classifier [18]. In our setting
by weighting the individual dimensions, i.e., ‘experts’ shall be individual reviews, each of which lends support
for or against a particular response to a query. The value of such
cos ϑ (q, d) =
(q d) · θ
, (2) a model is that relevance and classification parameters are learned
kqkkdk simultaneously, which allows individual learners to focus on clas-
sifying only those instances where they are considered ‘relevant,’
where (q d) is the Hadamard product.
without penalizing them for misclassification elsewhere. In the next
Okapi BM25 is state-of-the-art among ‘TF-IDF-like’ ranking func- section we show how this is useful in our setting, where only a tiny
tions and is regularly used for document retrieval tasks [20, 27]. subset of reviews may be helpful in addressing a particular query.
TF-IDF-based ranking measures address a fundamental issue with Generally speaking, for a binary classification task, each expert
measures like the cosine similarity (above) whereby common—but outputs a probability associated with a positive label. The final
irrelevant—words can dominate the ranking function. This can be classification output is then given by aggregating the predictions of
addressed by defining a ranking function that rewards words which the experts, in proportion to their confidence (or expertise). This
appear many times in a selected document (high TF), but which are can be expressed probabilistically as
rare among other documents (high IDF). Okapi BM25 is a param- confidence in f ’s ability to classify X
eterized family of functions based on this idea: X z }| {
p(y|X) = p(f |X) p(y|f, X) . (6)
n
X IDF(qi ) · f (qi , d) · (k1 + 1) f
| {z }
bm25 (q, d) = |d|
. (3) f ’s prediction
f (qi , d) + k1 · (1 − b + b · )
i=1 avgdl Here our confidence in each expert, p(f |X), is treated as a prob-
Again q and d are the query and a document, and f and IDF are ability, which can be obtained from an arbitrary real-valued score
the term frequency (of a word qi in the query) and inverse docu- s(f, X) using a softmax function:
ment frequency as described above. ‘avgdl’ is the average docu- exp(s(f, X))
ment length, and b and k1 are tunable parameters, which we set as p(f |X) = P 0
. (7)
described in [27]. See [20, 27] for further detail. f 0 exp(s(f , X))

Essentially, we treat BM25 as a state-of-the-art ‘off-the-shelf’ Similarly for binary classification tasks the prediction of a particu-
document ranking measure that we can use for evaluation and bench- lar expert can be obtained using a logistic function:
marking, and also as a feature for ranking in our own model.
1
p(y|f, X) = σ(v(f, X)) = . (8)
Bilinear models. While TF-IDF-like measures help to discover 1 + e−v(f,X)
rare but important words, an issue that still remains is that of syn-
Here s and v are our ‘relevance’ and ‘voting’ functions respec-
onyms, i.e., different words being used to refer to the same concept,
tively. To define an MoE model, we must now define (parame-
and therefore being ignored by the similarity measure in question.
terized) functions s(f, X) and v(f, X), and tune their parameters
This is especially an issue in our setting, where questions and re-
to maximize the likelihood of the available training labels. We next
views are only tangentially related and may draw from very dif-
describe how this formulation can be applied to queries and re-
ferent vocabularies [3, 46]—thus one needs to learn that a word
views, and describe our parameter learning strategy in Section 4.2.
used in (say) a question about whether a baby seat fits in overhead
luggage is ‘related to’ a review that describes its dimensions.
Bilinear models [8, 12, 42] can help to address this issue by 4. MOQA
learning complex mappings between words in one corpus and words We now present our model, Mixtures of Opinions for Question
in another (or more generally between arbitrary feature spaces). Answering, or Moqa for short. In the previous section we outlined
Here compatibility between a query and a document is given by the ‘Mixture of Experts’ framework, which combines weak learn-
X ers by aggregating their outputs with weighted confidence scores.
qM dT = Mij qi dj , (4) Here, we show that such a model can be adapted to simultaneously
i,j identify relevant reviews, and combine them to answer complex
where M is a matrix whose entry Mij encodes the relationship queries, by treating reviews as experts that either support or oppose
between a term qi in the query and a term dj in the document (set- a particular response.
ting M = I on normalized vectors recovers the cosine similarity). 3
This is similar to the idea proposed by Factorization Machines
This is a highly flexible model, which even allows that the dimen- [38], allowing complex pairwise interactions to be handled by as-
sions of the two feature spaces be different; in practice, since M suming that they have low-rank structure (i.e., they factorize).
4.1 Mixtures of Experts for review relevance the relevance parameters Θ = {θ, ϑ, A, B} and the prediction pa-
ranking rameters Θ0 = {ϑ0 , X, Y } simultaneously so as to maximize the
As described in Section 3.2, our MoE model is defined in terms likelihood that the training answers will be given the correct labels.
of two parameterized functions: s, which determines whether a In other words, we want to define these functions such that reviews
review (‘expert’) is relevant to the query, and v, which given the given high relevance scores are precisely those that help to predict
query and a review makes a prediction (or vote). Our goal is that the correct answer. Using the expression in (eq. 6), the likelihood
predictions are correct exactly for those reviews considered to be function is given by
relevant. We first define our relevance function s before defining Y Y
our prediction functions for binary queries in Section 4.2 and arbi- LΘ,Θ0 (Y|Q, R) = pΘ,Θ0 (y|q) (1 − pΘ,Θ0 (y|q)),
trary queries in Section 4.3. q∈Qyes
(train) (train)
q∈Qno
Our scoring function s(r, q) defines the relevance of a review r (12)
to a query q. In principle we could make use of any of the relevance (train) (train)
where Qyes and Qno are training sets of questions with positive
measures from Section 3.1 ‘as is,’ but we want our scoring function and negative answers, and Y and R are the label set and reviews
to be parameterized so that we can learn from training data what respectively. p(y|q) (the probability of selecting the answer ‘yes’
constitutes a ‘relevant’ review. Thus we define a parameterized given the query q) is given by
scoring function as follows:
X  esΘ (q,r)

1
sΘ (r, q) = φ(r, q) · θ + ψ(q)M ψ(r)T . (9) pΘ,Θ0 (y|q) = P 0 ,
| {z } | {z }
r∈Ri(q) r 0 ∈Ri(q) e
sΘ (q,r )
1 + e−vΘ0 (q,r)
pairwise similarity bilinear model | {z }
prediction
| {z }
relevance
Here φ(r, q) is a feature vector that is made up of existing pairwise (13)
similarity measures. θ then weights these measures so as to de- where Ri(q) is the set of reviews associated with the item referred
termine how they should be combined in order to achieve the best to in the query q. We optimize the (log) likelihood of the parameters
ranking. Thus φ(r, q) allows us to straightforwardly make use of in (eq. 12) using L-BFGS, a quasi-Newton method for non-linear
existing ‘off-the-shelf’ similarity measures that are considered to optimization of problems with many variables. We added a simple
be state-of-the-art. In our case we make use of BM25+ [26] and `2 regularizer to the model parameters, though did not run into is-
ROUGE-L [44] (longest common subsequence) features, though sues of overfitting, as the number of parameters is far smaller than
we describe our experimental setup in more detail in Section 5. the number of samples available for training.
The second expression in (eq. 9) is a bilinear scoring function
between features of the query (ψ(q)) and the review (ψ(r)). As 4.3 Open-ended questions
features we us a simple bag-of-words representation of the two ex-
While binary queries already account for a substantial fraction of
pressions with an F = 5000 word vocabulary. As we suggested
our dataset, and are a valuable testbed for quantitatively evaluating
previously, learning an F × F dimensional parameter M is not
our method, we wish to extend our method to arbitrary open-ended
tractable, so we approximate it by
questions, both to increase its coverage, and to do away with the
M = (ψ(q) ψ(r)) · ϑ + ψ(q)AB T ψ(r)T . (10) need for labeled yes/no answers at training time.
Here our goal is to train a method that given a corpus of can-
| {z } | {z }
diagonal term low-rank term
didate answers (one of which is the ‘true’ answer that a responder
ϑ (the diagonal component of M ) then accounts for simple term- provided) will assign a higher score to the true answer than to all
to-term similarity, whereas A and B (the low-rank component of non-answers. Naturally in a live system one does not have access
M ) are projections that map ψ(q) and ψ(r) (respectively) into K- to such a corpus containing the correct answer, but recall that this is
dimensional space (K = 5 in our experiments) in order to account not required: rather, we use answers only at training time to learn
for linguistic differences (such as synonym use) between the two our relevance function, so that at test time we can surface relevant
sources of text. Thus rather than fitting F × F parameters we need reviews without needing candidate answers to be available.
to fit only (2K + 1) · F parameters in order to approximate M . Specifically, we want to train the model such that the true answer
To obtain the final relevance function, we optimize all parame- is given a higher rank than all non-answers, i.e., to train a ranking
ters Θ = {θ, ϑ, A, B} using supervised learning, as described in function to maximize the average Area Under the Curve (AUC):
the following section.
1 X 1 X
4.2 Binary (i.e., yes/no) questions AUC (train) = δ(a(q) > ā), (14)
|Q(train) | |A| ā∈A
Dealing with binary (yes/no) questions is a relatively straight- q∈Q(train)

forward application of an MoE-type model, where each of the ‘ex-


where a(q) is the ‘true’ answer for the query q (A is the answer set)
perts’ (i.e., reviews) must make a binary prediction as to whether
and δ(a(q) > ā) is an indicator counting whether this answer was
the query is supported by the content of the review. This we also
preferred over a non-answer ā. In other words, the above simply
achieve using a bilinear scoring function:
counts the fraction of cases where the true answer was considered
vΘ0 (q, r) = (ψ(q) ψ(r)) · ϑ0 + ψ(q)XY T ψ(r)T . (11) better than non-answers.
In practice, the AUC is (approximately) maximized by optimiz-
Note that this is different from the relevance function s in (eq. 9)
ing a pairwise ranking measure, where the true answer should be
(though it has a similar form). The role of (eq. 11) above is to vote
given a higher score than a (randomly chosen) non-answer, i.e., in-
on a binary outcome; how much weight/relevance is given to this
stead of optimizing pΘ,Θ0 (y|q) from (eq. 13) we optimize
vote is determined by (eq. 9). Positive/negative v(q, r) corresponds
to a vote in favor of a positive or negative answer (respectively). relevance
X z }| {
Learning. Given a training set of questions with labeled yes/no p(a > ā|q) p(r|q) p(a > ā|r) .
answers (to be described in Section 5.2), our goal is to optimize r
| {z }
a is a better answer than ā
To do so we make use of the same relevance function s and the same
scoring function v used in (eq. 11), with two important differences: Table 2: Dataset Statistics.
First, the scoring function takes a candidate answer (rather than questions
Dataset products reviews
the query) as a parameter (i.e., v(a, r) rather than v(q, r)). This is (w/ answers)
because our goal is no longer to estimate a binary response to the electronics 314,263 39,371 4,314,858
query q, but rather to determine whether the answer a is supported home and kitchen 184,439 24,501 2,012,777
by the review r. Second, since we want to use this function to rank sports and outdoors 146,891 19,332 1,013,196
answers, we no longer care that v(a, r) is maximized, but rather tools and home impr. 101,088 13,397 752,947
that v(a, r) (for the true answer) is higher than v(ā, r) for non- automotive 89,923 12,536 395,872
answers ā. This can be approximated by optimizing the logistic cell phones 85,865 10,407 1,534,094
loss health and personal care 80,496 10,860 1,162,587
1 patio lawn and garden 59,595 7,986 451,473
p(a > ā|r) = σ(v(a, r) − v(ā, r)) = . (15)
1 + ev(ā,r)−v(a,r) total 1,447,173 191,185 13,498,681
This will approximate the AUC if enough random non-answers are
selected; optimizing pairwise ranking losses as a means of opti-
mizing the AUC is standard practice in recommender systems that
Note that the purpose of this small, manually labeled sample is
make use of implicit feedback [39]. Otherwise, training proceeds as
not to train Moqa but rather to evaluate simple techniques for au-
before, with the two differences being that (1) p(a > ā|r) replaces
tomatically labeling yes/no questions and answers. This is much
the prediction function in (eq. 13), and (2) multiple non-answers
easier than our overall task, since we are given the answer and sim-
must be sampled for training. In practice we use 10 epochs (i.e.,
ply want to determine whether it was positive or negative, for which
we generate 10 random non-answers per query during each train-
simple NLP techniques suffice.
ing iteration). On our largest dataset (electronics), training requires
To identify whether a question is binary, a recent approach devel-
around 4-6 hours on a standard desktop machine.
oped by Google proved to be effective [16]. This approach consists
of a series of complex grammatical rules which are used to form
5. EXPERIMENTS regular expressions, which essentially identify occurrences of ‘be’,
We evaluate Moqa in terms of three aspects: First for binary modal, and auxiliary verbs. Among our labeled data these rules
queries, we evaluate its ability to resolve them. Second, for open- identified yes/no questions with 97% precision at 82% recall. Note
ended queries, its ability to select the correct answer among alter- that in this setting we are perfectly happy to sacrifice some recall
natives. Finally we evaluate Moqa qualitatively, in terms of its abil- for the sake of precision—what we want is a sufficiently large sam-
ity to identify reviews that humans consider to be relevant to their ple of labeled yes/no questions to train Moqa, but we are willing to
query. We evaluate this on a large dataset of reviews and queries discard ambiguous cases in order to do so.
from Amazon, as described below. Next we want to label answers as being yes/no. Ultimately we
trained a simple bag-of-unigrams SVM, plus an additional feature
5.1 Data based on the first word only (which is often simply ‘yes’ or ‘no’).
We collected review and Q/A data from Amazon.com. We started Again, since we are willing to sacrifice recall for precision, we dis-
with a previous crawl from [29], which contains a snapshot of prod- carded test instances that were close to the decision hyperplane.
uct reviews up to July 2014 (but which includes only review data). By keeping only the 50% of instances about which the classifier
For each product in that dataset, we then collected all questions was the most confident, we obtained 98% classification accuracy
on its Q/A page, and the top-voted answer chosen by users. We on held-out data.
also crawled descriptions of all products, in order to evaluate how Finally we consider a question only if both of the above tests
description text compares to text from reviews. This results in a pass, i.e., the question is identified as being binary and the answer
dataset of 1.4 million questions (and answers) on 191 thousand is classified as yes/no with high confidence. Ultimately through the
products, about which we have over 13 million customer reviews. above process we obtained 309,419 questions that we were able to
We train separate models for each top-level category (electronics, label with high confidence, which can be used to train the binary
automotive, etc.). Statistics for the 8 largest categories (on which version of Moqa in Section 5.4.1.
we report results) are shown in Table 2.
5.3 Baselines
5.2 Labeling yes/no answers We compare Moqa against the following baselines:
Although the above data is already sufficient for addressing open- rand ranks and classifies all instances randomly. By definition this
ended questions, for binary questions we must first obtain addi- has 50% accuracy (on average) for both of the tasks we consider.
tional labels for training. Here we need to identify whether each Recall also that for yes/no questions around 62% are answered
question in our dataset is a yes/no question, and if so, whether it affirmatively, roughly reflecting the performance of ‘always yes’
has a yes/no answer. In spite of this need for additional labels, ad- classification.
dressing yes/no questions is valuable as it gives us a simple and
objective way to evaluate our system. Cosine similarity (c). The relevance of a review to a query is de-
We began by manually labeling one thousand questions to iden- termined by their cosine similarity, as in (eq. 1).
tify those which were binary, and those which had binary answers Okapi-BM25+ (o). BM25 is a state-of-the-art TF-IDF-based rele-
(note that these are not equivalent concepts, as some yes/no ques- vance measure that is commonly used in retrieval applications [20,
tions may be answered ambiguously). We found that 56.1% of 27]. Here we use a recent extension of BM25 P known as BM25+
questions are binary, and that 76.5% of these had conclusive binary [26], which includes an additional term (δ n i=1 IDF(qi )) in the
answers. Of those questions with yes/no answers, slightly over half above expression in order to lower-bound the normalization by doc-
(62.4%) had positive (i.e., ‘yes’) answers. ument length.
ROUGE-L (r). Review relevance is determined by ROUGE met-
rics, which are commonly used to measure similarity in document Table 3: Performance of Moqa against baselines in terms of the
summarization tasks [44]. Here we use ROUGE-L (longest com- accuracy@50%; only learning (i.e., -L) baselines are shown as
mon subsequence) scores. non-learning baselines are not applicable to this task.

Learning vs. non learning (-L). The above measures (c), (o), and red. in
(r) can be applied ‘off the shelf,’ i.e., without using a training set. rand ro-L cro-L Moqa error
We analyze the effect of applying maximum-likelihood training (as vs. cro-L
in eq. 12) to tune their parameters (c-L, o-L, etc.). electronics 50% 78.9% 79.7% 82.6% 3.7%
home and kitchen 50% 70.3% 64.6% 73.6% 13.9%
Mdqa is the same as Moqa, except that reviews are replaced by
sports and outdoors 50% 71.9% 72.8% 74.1% 1.8%
product descriptions.
tools and home impr. 50% 70.7% 69.0% 73.2% 6.1%
automotive 50% 74.8% 76.6% 78.4% 2.3%
The above baselines are designed to assess (1) the efficacy of cell phones 50% 74.6% 76.3% 79.4% 4.1%
existing state-of-the-art ‘off-the-shelf’ relevance measures for the health and personal care 50% 61.7% 75.5% 76.2% 0.9%
ranking tasks we consider (c, o, and r); (2) the benefit of using patio lawn and garden 50% 74.6% 75.4% 76.8% 1.8%
a training set to optimize the relevance and scoring functions (c-
L, o-L, etc.); (3) the effectiveness of reviews as a source of data average 50% 72.2% 73.7% 76.8% 4.3%
versus other potential knowledge bases (Mdqa); and finally (4) the
influence of the bilinear term and the performance of Moqa itself.
For the baselines above we use a linear scoring function in the
predictor (vΘ0 (q, r) = (ψ(q) ψ(r)) · ϑ0 ), though for Mdqa and Accuracy versus confidence (electronics)
Moqa we also include the bilinear term as in (eq. 11). Recall that
0.90 Moqa
our model already includes the cosine similarity, ROUGE score,
Mdqa
and BM25+ measures as features, so that comparison between the
baseline ‘cro-L’ (i.e., all of the above measures weighted by maxi- cro-L
mum likelihood) and Moqa essentially assesses the value of using 0.85 rouge/bm25+

accuracy@k
bilinear models for relevance ranking.
For all methods, we split reviews at the level of sentences, which
we found to be more convenient when surfacing results via an in- 0.80
terface, as we do in our qualitative evaluation. We found that this
also led to slightly (but consistently) better performance than using
complete reviews—while reviews contain more information, sen-
tences are much better targeted to specific product details. 0.75

5.4 Quantitative evaluation 90% 75% 60% 45% 30% 15% 0%


confidence (percentile rank)
5.4.1 Yes/no questions
We first evaluate our method in terms of its ability to correctly
classify held-out yes/no questions, using the binary groundtruth de- Figure 2: Accuracy as a function of confidence. Moqa correctly
scribed above. Here we want to measure the classification accuracy assigns high confidence to those queries it is able to accurately
(w.r.t. a query set Q): resolve.
accuracy(Q) =
1 X 1 1 Table 3 shows the performance of Moqa and baselines, in terms
δ(q ∈ Qyes )δ(p(y|q) > ) + δ(q ∈ Qno )δ(p(yq ) < ),
|Q| q∈Q | 2 2 of the accuracy@50% (i.e., for the 50% of predictions about which
each algorithms is most confident). Note that only methods with
{z } | {z }
true positives true negatives
learning (-L) are shown as non-learning approaches are not appli-
i.e., the fraction of queries that were given the correct binary label. cable here (since there is no good way to determine parameters
We found this to be an incredibly difficult measure to perform for a binary decision function in eq. 13 without learning). Here
well on (for any method), largely due to the fact that some fraction Moqa is substantially more accurate than alternatives, especially
of queries are simply not addressed among the reviews available. on larger datasets (where more data is available to learn a mean-
Fortunately, since we are training probabilistic classifiers, we can ingful bilinear map). Among the baselines ro-L (ROUGE+Okapi
also associate a confidence with each classification (i.e., its distance BM25+ with learned weights) was the second strongest, with addi-
from the decision boundary, | 21 − p(y|q)|). Our hope is that a good tional similarity-based features (cro-L) helping only slightly.
model will assign high confidence scores to exactly those queries Figure 2 shows the full spectrum of accuracy as a function of
that can be (correctly) addressed. To evaluate algorithms as a func- confidence on ‘electronics’ queries, i.e., it shows how performance
tion of confidence, we consider the accuracy@k: degrades as confidence decreases (other categories yielded similar
 X 1  results). Indeed we find that for all methods performance degrades
A@k = accuracy argmax | − p(y|q)| , (16) for low-confidence queries. Nevertheless Moqa remains more ac-
Q0 ∈Pk (Q) 2 curate than alternatives across the full confidence spectrum, and
q∈Q0
| {z }
k most confident predictions
for queries about which it is most confident obtains an accuracy of
around 90%, far exceeding the performance of any baseline. Figure
Where Pk (Q) is the set of k-sized subsets of Q. 2 also shows the performance of Mdqa, as we discuss below.
5.4.2 Open-ended questions
In Table 4 we show the performance of Moqa against baselines
for open-ended queries on our largest datasets. Cosine similarity
(c) was the strongest non-learning baseline, slightly outperform-
ing the ROUGE score (r) and BM25+ (o, not shown for brevity).
Learning improved all baselines, with the strongest being ROUGE
and BM25+ combined (ro-L), over which adding weighted cosine
similarity did not further improve performance (cro-L), much as we
found with binary queries above. Moqa was strictly dominant on
all datasets, reducing the error over the strongest baseline by 50.6%
on average.

5.4.3 Reviews versus product descriptions


We also want to evaluate whether review text is a better source
of data than other sources, such as product descriptions or specifi-
cations. To test this we collected description/specification text for
each of the products in our catalogue. From here, we simply inter-
change reviews with descriptions (recall that both models operate at
the level of sentences). We find that while Moqa with descriptions
(i.e., Mdqa) performs well (on par with the strongest baselines), it
is still substantially outperformed when we use review text. Here Figure 3: A screenshot of our interface for user evaluation.
Moqa yields a 37.5% reduction in error over Mdqa in Table 4;
similarly in Figure 2, for binary queries Mdqa is on par with the
strongest baseline but substantially outperformed by Moqa (again dom order without labels, from which they had to select whichever
other datasets are similar and not shown for brevity). they considered to be the most relevant.4 We also asked workers
Partly, reviews perform better because we want to answer sub- whether they considered a question to be ‘subjective’ or not, in
jective queries that depend on personal experiences, for which re- order to evaluate whether the subjectivity of the question impacts
views are simply a more appropriate source of data. But other than performance. A screenshot of our interface is shown in Figure 3.
that, reviews are simply more abundant—we have on the order of Results of this evaluation are shown in Figure 4. On average,
100 times as many reviews as descriptions (products with active Moqa was preferred in 73.1% of instances across the six datasets
Q/A pages tend to be reasonably popular ones); thus it is partly the we considered. This is a significant improvement; improvements
sheer volume and diversity of reviews available that makes them were similar across datasets (between 66.2% on Sports and Out-
effective as a means of answering questions. doors and 77.6% on Baby), and for both subjective and objective
We discuss these findings in more detail in Section 6. queries (62.9% vs. 74.1%). Ultimately Moqa consistently outper-
forms our strongest baseline in terms of subjective performance,
5.5 Qualitative evaluation though relative performance seems to be about the same for objec-
Finally, we evaluate Moqa qualitatively through a user study. Al- tive and subjective queries, and across datasets.
though we have shown Moqa to be effective at correctly resolving
binary queries, and at maximizing the AUC to select a correct an- 5.5.1 Examples
swer among alternatives, what remains to be seen is whether the Finally, a few examples of the output produced by Moqa are
relevance functions that we learned to do so are aligned with what shown in Figure 5. Note that none of these examples were avail-
humans consider to be ‘relevant.’ Evaluating this aspect is espe- able at training time, and only the question (along with the prod-
cially important because in a live system our approach would pre- uct being queried) are provided as input. These examples demon-
sumably not be used to answer queries directly (which we have strate a few features of Moqa and the data in question: First is the
shown to be very difficult, and in general still an open problem), wide variety of products, questions, and opinions that are reflected
but rather to surface relevant reviews that will help the user to eval- in the data; this linguistic variability demonstrates the need for a
uate the product themselves. model that learns the notion of relevance from data. Second, the
Here we use the relevance functions sΘ (q, r) that we learned in questions themselves (like the example from Figure 1) are quite
the previous section (i.e., from Table 4) to compare which definition different from those that could be answered through knowledge
of ‘relevance’ is best aligned with real users’ evaluations—note that bases; even those that seem objective (e.g. “how long does this
the voting function v is not required at this stage. stay hot?”) are met with a variety of responses representing differ-
We performed our evaluation using Amazon’s Mechanical Turk, ent (and sometimes contradictory) experiences; thus reviews are the
using ‘master workers’ to evaluate 100 queries from each of our perfect source of data to capture this variety of views. Third is the
five largest datasets, as well as one smaller dataset (baby) to assess heterogeneity between queries and opinions; words like “girl” and
whether our method still performs well when less data is available “tall” are identified as being relevant to “daughter” and “medium,”
for training. Workers were presented with a product’s title, im- demonstrating the need for a flexible model that is capable of learn-
age, and a randomly selected query (binary or otherwise). We then ing complicated semantics in general, and synonyms in particular.
presented them the top-ranked result from our method, as well as Also note that while our bilinear model has many thousands of
the top-ranked result using Okapi-BM25+/ROUGE measures (with parameters, at test time relevance can be computed extremely ef-
tuned parameters, i.e., ro-L from Table 4); this represents a state-of- ficiently, since in (eq. 10) we can project all reviews via B in ad-
the-art ‘off-the-shelf’ relevance ranking benchmark, with parame-
ters tuned following best practices; it is also the most competitive 4
We also showed a randomly selected result, and gave users the
baseline from Table 4. Results were shown to evaluators in a ran- option to select no result. We discarded cases with overlaps.
Table 4: Performance of Moqa against baselines (a key is shown at right for baselines from Section 5.3). Reported numbers are
average AUC (i.e., the models’ ability to assign the highest possible rank to the correct answer); higher is better.
red. in red. in
Dataset rand c r ro-L cro-L Mdqa Moqa error error
vs. cro-L vs. Mdqa
electronics 0.5 0.633 0.626 0.886 0.855 0.865 0.912 65.6% 54.5% rand random
home and kitchen 0.5 0.643 0.635 0.850 0.840 0.863 0.907 73.5% 48.1% c cosine similarity
sports and outdoors 0.5 0.653 0.645 0.848 0.845 0.860 0.885 35.1% 22.5% r ROUGE measures
tools and home impr. 0.5 0.638 0.632 0.860 0.817 0.834 0.884 58.8% 43.7% o Okapi BM25+
automotive 0.5 0.648 0.640 0.810 0.821 0.825 0.863 30.4% 27.7% -L ML parameters
cell phones 0.5 0.624 0.617 0.768 0.797 0.844 0.886 78.7% 37.5% Moqa our method
health and personal care 0.5 0.632 0.625 0.818 0.817 0.842 0.880 52.7% 31.9% Mdqa w/ descriptions
patio lawn and garden 0.5 0.634 0.628 0.835 0.833 0.796 0.848 10.2% 34.4%
average 0.5 0.638 0.631 0.834 0.828 0.841 0.883 50.6% 37.5%

Mechanical turk study the benefit of having substantially more data available for training
Sports and Outdoors Moqa rouge/bm25+
when considering open-ended questions.
Also surprising is that in our user study we obtained roughly
Tools and Home Impr. equal performance on subjective vs. objective queries. Partly this
Electronics may be because subjective queries are simply ‘more difficult’ to ad-
dress, so that there is less separation between methods, though this
Home and Kitchen would require a larger labeled dataset of subjective vs. objective
Automotive
queries to evaluate quantitatively. In fact, contrary to expectation
only around 20% of queries were labeled as being ‘subjective’ by
Baby workers. However the full story seems more complicated—queries
such as “how long does this stay hot?” (Figure 5) are certainly la-
beled as being ‘objective’ by human evaluators, though the variety
Qs. labeled ‘subjective’ Moqa rouge/bm25+ of responses shows a more nuanced situation. Really, a large frac-
Qs. labeled ‘objective’ tion of seemingly objective queries are met with contradictory an-
swers representing different user experiences, which is exactly the
0% 100% class of questions that our method is designed to address.
Relative subjective performance
6.1 Future work
We see several potential ways to extend Moqa.
Figure 4: User study. Bars indicate the fraction of times First, while we have made extensive use of reviews, there is a
that opinions surfaced by Moqa are preferred over those of wealth of additional information available on review websites that
the strongest baseline (a tuned combination of BM25+ and the could potentially be used to address queries. One is rating infor-
ROUGE score, ro-L from Section 5.3). mation, which could improve performance on certain evaluative
queries (though to an extent we already capture this information
as our model is expressive enough to learn the polarity of sen-
vance. Thus computing relevance takes only O(K + |q| + |r|) (i.e., timent words). Another is user information—the identity of the
the number of projected dimensions plus the number of words in questioner and the reviewer could be used to learn better relevance
the query and review); in practice this allows us to answer queries models, both in terms of whether their opinions are aligned, or even
in a few milliseconds, even for products with thousands of reviews. to identify topical experts, as has been done with previous Q/A sys-
tems [2, 5, 21, 35, 40].
In categories like electronics, a large fraction of queries are re-
6. DISCUSSION AND FUTURE WORK lated to compatibility (e.g. “will this product work with X?”). Ad-
Surprisingly, performance for open-ended queries (Table 4) ap- dressing compatibility-related queries with user reviews is another
pears to be better than performance for binary queries (Table 3), promising avenue of future work—again, the massive number of
both compared to random classification and to our strongest base- potential product combinations means that large volumes of user
line, against our intuition that the latter task might be more difficult. reviews are potentially an ideal source of data to address such ques-
There are a few reasons for this: One is simply that the task of dif- tions. Although our system can already address such queries to
ferentiating the true answer from a (randomly selected) non-answer some extent, ideally a model of compatibility-related queries would
is ‘easier’ than resolving a binary query; this explains why outper- make use of additional information, for instance reviews of both
forming a random baseline is easier, but does not explain the higher products being queried, or the fact that compatibility relationships
relative improvement against baselines. For the latter, note that the tend to be symmetric, or even co-purchasing statistics as in [29].
main difference between our method and the strongest baseline is Finally, since we are dealing with queries that are often subjec-
the use of a bilinear model; while a highly flexible model, it has tive, we would like to handle the possibility that they may have
far more parameters than baselines, meaning that a large dataset multiple and potentially inconsistent answers. Currently we have
is required for training. Thus what we are seeing may simply be selected the top-voted answer to each question as an ‘authoritative’
Binary model:
Product: Schwinn Searcher Bike (26-Inch, Silver) (amazon.com/dp/B007CKH61C)
Question: “Is this bike a medium? My daughter is 5’8”.”
Ranked opinions and votes: “The seat was just a tad tall for my girl so we actually sawed a bit off of the seat pole so that it
would sit a little lower.” (yes, .698); “The seat height and handlebars are easily adjustable.” (yes, .771); “This is a great bike for a
tall person.” (yes, .711)
Response: Yes (.722)
Actual answer (labeled as ‘yes’): My wife is 5’5” and the seat is set pretty low, I think a female 5’8” would fit well with the seat
raised.
Product: Davis & Sanford EXPLORERV Vista Explorer 60" Tripod (amazon.com/dp/B000V7AF8E)
Question: “Is this tripod better then the AmazonBasics 60-Inch Lightweight Tripod with Bag one?”
Ranked opinions and votes: “However, if you are looking for a steady tripod, this product is not the product that you are looking
for” (no, .295); “If you need a tripod for a camera or camcorder and are on a tight budget, this is the one for you.” (yes, .901);
“This would probably work as a door stop at a gas station, but for any camera or spotting scope work I’d rather just lean over the
hood of my pickup.” (no, .463);
Response: Yes (.863)
Actual answer (labeled as ‘yes’): The 10 year warranty makes it much better and yes they do honor the warranty. I was sent a
replacement when my failed.

Open-ended model:
Product: Mommy’s Helper Kid Keeper (amazon.com/dp/B00081L2SU)
Question: “I have a big two year old (30 lbs) who is very active and pretty strong. Will this harness fit him? Will there be any
room to grow?”
Ranked opinions: “So if you have big babies, this may not fit very long.”; “They fit my boys okay for now, but I was really
hoping they would fit around their torso for longer.”; “I have a very active almost three year old who is huge.”
Actual answer: One of my two year olds is 36lbs and 36in tall. It fits him. I would like for there to be more room to grow, but it
should fit for a while.
Product: Thermos 16 Oz Stainless Steel (amazon.com/dp/B00FKPGEBO)
Question: “how many hours does it keep hot and cold ?”
Ranked opinions: “Does keep the coffee very hot for several hours.”; “Keeps hot Beverages hot for a long time.”; “I bought this
to replace an aging one which was nearly identical to it on the outside, but which kept hot liquids hot for over 6 hours.”; “Simple,
sleek design, keeps the coffee hot for hours, and that’s all I need.”; “I tested it by placing boiling hot water in it and it did not keep
it hot for 10 hrs.”; “Overall, I found that it kept the water hot for about 3-4 hrs.”;
Actual answer: It doesn’t, I returned the one I purchased.

Figure 5: Examples of opinions recommended by Moqa. The top two examples are generated by the binary model, the bottom two
by the open-ended model. Note that none of these examples were available at training time, and only the question is provided as
input (the true answer and its label are shown for comparison). Opinions are shown in decreasing order of relevance. Note in the
second example that all opinions get to vote in proportion to their relevance; in this case the many positive votes among less-relevant
opinions outweigh the negative votes above, ultimately yielding a strong ‘yes’ vote.

response to be used at training time. But handling multiple, in- The main findings of our evaluation were as follows: First, re-
consistent answers could be valuable in several ways, for instance views proved particularly effective as a source of data for answering
to automatically identify whether a question is subjective or con- product-related queries, outperforming other sources of text like
tentious, or otherwise to generate relevance rankings that support a product specifications; this demonstrates the value of personal ex-
spectrum of subjective viewpoints. periences in addressing users’ queries. Second, we demonstrated
the need to handle heterogeneity between various text sources (i.e.,
questions, reviews, and answers); our large corpus of training data
7. CONCLUSION allowed us to train a flexible bilinear model that it capable of auto-
We presented Moqa, a system that automatically responds to matically accounting for linguistic differences between text sources,
product-related queries by surfacing relevant consumer opinions. outperforming hand-crafted word- and phrase-level relevance mea-
We achieved this by observing that a large corpus of previously- sures. Finally, we showed that Moqa is quantitatively able to ad-
answered questions can be used to learn the notion of relevance, dress both binary and open-ended questions, and qualitatively that
in the sense that ‘relevant’ opinions are those for which an accurate human evaluators prefer our learned notion of ‘relevance’ over hand-
predictor can be trained to select the correct answer as a function of crafted relevance measures.
the question and the opinion. We cast this as a mixture-of-experts
learning problem, where each opinion corresponds to an ‘expert’
that gets to vote on the correct response, in proportion to its rele- References
vance. These relevance and voting functions are learned automati- [1] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and
cally and evaluated on a large training corpus of questions, answers, G. Mishne. Finding high-quality content in social media. In
and reviews from Amazon. WSDM, 2008.
[2] A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. memory networks for natural language processing. CoRR,
Discovering value from community activity on focused ques- abs/1506.07285, 2015.
tion answering sites: a case study of Stack Overflow. In KDD, [24] J. Li and L. Sun. A query-focused multi-document summa-
2012. rizer based on lexical chains. In NIST, 2007.
[3] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. [25] C.-Y. Lin and E. Hovy. From single to multi-document sum-
Bridging the lexical chasm: statistical approaches to answer- marization: A prototype system and its evaluation. In ACL,
finding. In SIGIR, 2000. 2002.
[4] J. Bian, Y. Liu, D. Zhou, E. Agichtein, and H. Zha. Learning [26] Y. Lv and C. Zhai. Lower-bounding term frequency normal-
to recognize reliable users and content in social media with ization. In CIKM, 2011.
coupled mutual reinforcement. In World Wide Web, 2009. [27] C. Manning, P. Raghavan, and H. Schültze. An Introduction
[5] M. Bouguessa, B. Dumoulin, and S. Wang. Identifying au- to Information Retrieval. Cambridge University Press, 2009.
thoritative actors in question-answering forums: the case of [28] J. McAuley and J. Leskovec. Hidden factors and hidden top-
Yahoo! Answers. In KDD, 2008. ics: understanding rating dimensions with review text. In
[6] G. Carenini, R. Ng, and A. Pauls. Multi-document summa- ACM Conference on Recommender Systems, 2013.
rization of evaluative text. In ACL, 2006. [29] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks
[7] Y. Chali and S. Joty. Selecting sentences for answering com- of substitutable and complementary products. In Knowledge
plex questions. In EMNLP, 2008. Discovery and Data Mining, 2015.
[8] W. Chu and S.-T. Park. Personalized recommendation on [30] K. McKeown and D. Radev. Generating summaries of multi-
dynamic content using predictive bilinear models. In World ple news articles. In SIGIR, 1995.
Wide Web, 2009. [31] X. Meng and H. Wang. Mining user reviews: From specifica-
[9] C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and tion to summarization. In ACL Short Papers, 2009.
L. Lee. How opinions are received by online communities: A [32] A. Moschitti, S. Quarteroni, R. Basili, and S. Manandhar. Ex-
case study on amazon.com helpfulness votes. In World Wide ploiting syntactic and shallow semantic kernels for question
Web, 2009. answer classification. In ACL, 2007.
[10] G. Erkan and D. Radev. LexRank: Graph-based lexical cen- [33] A. Nazi, S. Thirumuruganathan, V. Hristidis, N. Zhang, and
trality as salience in text summarization. JAIR, 2004. G. Das. Answering complex queries in an online community
[11] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, network. In ICWSM, 2015.
A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, [34] A. Pal and S. Counts. Identifying topical authorities in mi-
J. Prager, N. Schlaefer, and C. Welty. Building Watson: An croblogs. In Web Search and Data Mining, 2011.
overview of the DeepQA project. In AI Magazine, 2010.
[35] A. Pal, R. Farzan, J. Konstan, and R. Kraut. Early detection
[12] W. Freeman and J. Tenenbaum. Learning bilinear models for of potential experts in question answering communities. In
two-factor problems in vision. In CVPR, 1996. UMAP, 2011.
[13] R. Gangadharaiah and B. Narayanaswamy. Natural language [36] D. H. Park, H. D. Kim, C. Zhai, and L. Guo. Retrieval of
query refinement for problem resolution from crowd-sourced relevant opinion sentences for new products. In SIGIR, 2015.
semi-structured data. In International Joint Conference on
[37] J. Ponte and B. Croft. A language modeling approach to in-
Natural Language Processing, 2013.
formation retrieval. In SIGIR, 1998.
[14] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars:
[38] S. Rendle. Factorization machines. In ICDM, 2010.
Improving rating predictions using review text content. In
WebDB, 2009. [39] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-
Thieme. BPR: Bayesian personalized ranking from implicit
[15] S. Harabagiu, F. Lacatusu, and A. Hicki. Answering complex
feedback. In UAI, 2009.
questions with random walk models. In SIGIR, 2006.
[40] F. Riahi, Z. Zolaktaf, M. Shafiei, and E. Milios. Finding ex-
[16] J. He and D. Dai. Summarization of yes/no questions using a
pert users in community question answering. In World Wide
feature function model. JMLR, 2011.
Web, 2012.
[17] M. Hu and B. Liu. Mining and summarizing customer re-
[41] A. Severyn and A. Moschitti. Learning to rank short text pairs
views. In KDD, 2004.
with convolutional deep neural networks. In SIGIR, 2015.
[18] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive
[42] J. Tenenbaum and W. Freeman. Separating style and content
mixtures of local experts. Neural Computation, 1991.
with bilinear models. Neural Computation, 2000.
[19] J. Jeon, W. B. Croft, and J. H. Lee. Finding similar questions
[43] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis on
in large question and answer archives. In CIKM, 2005.
review text data: a rating regression approach. In Knowledge
[20] K. S. Jones, S. Walker, and S. Robertson. A probabilistic Discovery and Data Mining, 2010.
model of information retrieval: development and comparative
[44] C. yew Lin. ROUGE: a package for automatic evaluation
experiments. In Information Processing and Management,
of summaries. In ACL Workshop on Text Summarization
2000.
Branches Out, 2004.
[21] P. Jurczyk and E. Agichtein. Discovering authorities in ques-
[45] H. Yu and V. Hatzivassiloglou. Towards answering opinion
tion answer communities by using link analysis. In Confer-
questions: Separating facts from opinions and identifying the
ence on Information and Knowledge Management, 2007.
polarity of opinion sentences. In EMNLP, 2003.
[22] R. Katragadda and V. Varma. Query-focused summaries or
[46] K. Zhang, W. Wu, H. Wu, Z. Li, and M. Zhou. Question re-
query-biased summaries? In ACL Short Papers, 2009.
trieval with high quality answers in community question an-
[23] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, swering. In CIKM, 2014.
I. Gulrajani, and R. Socher. Ask me anything: Dynamic

You might also like