
Enriching Conversation Context in Retrieval-based Chatbots

Amir Vakili and Azadeh Shakery

University of Tehran
{a vakili,shakery}@ut.ac.ir


arXiv:1911.02290v1 [cs.CL] 6 Nov 2019

Abstract. Work on retrieval-based chatbots, like most sequence pair matching tasks, can be divided into Cross-encoders, which perform word-level matching over the pair, and Bi-encoders, which encode the two sequences separately. The former has better performance; however, since candidate responses cannot be encoded offline, it is also much slower. Lately, multi-layer transformer architectures pre-trained as language models have been used to great effect on a variety of natural language processing and information retrieval tasks. Recent work has shown that these language models can be used in text-matching scenarios to create Bi-encoders that perform almost as well as Cross-encoders while having a much faster inference speed. In this paper, we expand upon this work by developing a sequence matching architecture that utilizes the entire training set as a makeshift knowledge base during inference. We perform detailed experiments demonstrating that this architecture can be used to further improve Bi-encoder performance while still maintaining a relatively high inference speed.

Keywords: Conversational Information Retrieval · Pre-trained Transformers · Retrieval-based Chatbots

1 Introduction & Related Works


Previous literature divides conversational agents into two groups: task-oriented dialogue systems, which are designed to fulfil a single purpose within a vertical domain such as purchasing an airline ticket, and non-task-oriented chatbots, which are designed to have natural and meaningful conversations with humans on open-domain topics [7]. Non-task-oriented chatbots can also be trained for information-seeking tasks if they are given the right data. To achieve this, existing methods employ either retrieval-based [15,24,5,22] or generative [16,19] conversational agents. Retrieval-based systems select a response from candidates retrieved from chat logs according to how well they match the current conversation context, as opposed to generative systems, which synthesise new sentences based on the context. As such, retrieval-based systems enjoy the advantage of fluent and informative responses, since these were originally written by humans. Retrieval-based chatbots have also been used in real-world products such as Microsoft XiaoIce [21] and Alibaba Group's AliMe Assist [12].
Retrieval-based chatbots, like most sequence matchers, are either Bi-Encoders, which encode the input pair separately and compare the resulting vectors (also known as representation-focused), or Cross-Encoders, which perform matching between all the words in the input pair (also known as interaction-focused) [8]. Cross-Encoders usually give better results; however, their word-by-word matching incurs higher computational complexity [20]. In Bi-Encoders, on the other hand, candidate response representations are independent of the conversation context, so they can be pre-computed offline and cached, greatly speeding up inference, especially when the pool of candidate responses is large [9].
Recently, the introduction of language models pre-trained on large corpora has had a resounding impact on the field of natural language processing [4,13] and has been slowly making its way into the conversational information retrieval community [2]. By fine-tuning these models, researchers have achieved state-of-the-art results on various tasks, including sequence matching. These language models usually consist of several transformer layers and can be used in both Bi-Encoder and Cross-Encoder architectures. Cross-Encoders typically concatenate the sequence pair and feed it to the language model to obtain a single vector, while Bi-Encoders calculate a representation for each sequence and then compare the resulting vectors. Recent work showed that Bi-Encoders utilizing BERT [4] can be very effective for response selection and can be further enhanced to bring their results closer to Cross-Encoders while still maintaining their fast inference time [9].
In this paper, we expand upon this work by proposing another possible modification to the BERT Bi-Encoder for use in the response retrieval setting, which we refer to as Context Enrichment. Instead of only comparing the candidate response and conversation context representations, we also compare the conversation context to the conversation contexts in the training set that are most similar to the candidate response. We expect this change to improve performance: information-seeking dialogue systems can cover a vast array of very specific topics, and matching a response vector against a particular set of related conversations can help the model better determine whether it is relevant to the conversation context at hand. Essentially, we treat the training set as a makeshift knowledge base. Since conversation representations in the training set and their similarities to candidate responses can be pre-computed and cached offline, this does not add significant overhead to the prediction process.
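As an illustration of this offline step, the following sketch (the naming and file paths are our own, not from the paper; plain PyTorch is used rather than a dedicated vector index) pre-computes and caches, for every candidate response, the indices of its most similar training contexts:

```python
import torch
import torch.nn.functional as F

# Assumed to be produced beforehand by the context/response encoders.
train_ctx_vecs = torch.load("train_context_vectors.pt")      # (N_contexts, d)
cand_resp_vecs = torch.load("candidate_response_vectors.pt")  # (N_responses, d)

# Cosine similarity between every candidate response and every training context.
sims = F.normalize(cand_resp_vecs, dim=-1) @ F.normalize(train_ctx_vecs, dim=-1).T

# Cache the top-k most similar training contexts per candidate response,
# so no chat-log search is required at inference time.
k = 20
topk = sims.topk(k, dim=-1)
torch.save({"indices": topk.indices, "scores": topk.values}, "response_context_cache.pt")
```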
This approach is similar to [6], which proposed finding conversations similar to the current conversation and concatenating their corresponding responses to the current context. That technique works with most methods, including our own, as it can be considered a form of data augmentation. Moreover, our method does not require searching chat logs at inference time, since vector similarities between training contexts and candidate responses can be computed and cached offline. Our approach is also inspired by [1], which proposed learning a global context from a list of retrieved documents in the learning-to-rank setting instead of ranking each document in isolation.
The overall goal of this paper is to show that sentence representations produced by transformer architectures carry enough information to be used for knowledge enhancement in retrieval-based chatbots. We conduct experiments comparing the new approach to the typical BERT Bi-Encoder, as well as to Cross-Encoder and Bi-Encoder methods that do not use pre-trained transformers. The experiments are conducted on an existing information-seeking conversational dataset and show that the new architecture improves upon the baseline without greatly affecting its inference speed.
The rest of this paper is organized as follows: in Section 2 we explain the task and our proposed method in greater detail, and Sections 3 and 4 detail the experiment settings and the obtained results.

2 Method

In this section, we explain the task in greater detail and provide an overview of
our proposed method.

2.1 Task Definition

Response retrieval is a setting in which a conversational agent must select the proper response from a group of candidates given the conversation history up to that point. A model first ranks these candidates according to its prediction of how well each fits the conversation context, and these rankings are then evaluated using ranking metrics based on gold annotations.
In this paper, we use the Ubuntu corpus, which is extracted from the logs of an online chatroom dedicated to support for the Ubuntu operating system. The dataset features conversations between two people, where one is asking for technical help and the other does their best to aid them [15]. We use the version of the dataset prepared by [25].
The task can be formalized as follows. Suppose we have a dataset $D = \{(c_i, r_i, y_i)\}_{i=1}^{N}$, where $c_i = \{t_1, \cdots, t_m\}$ represents the conversation context, $r_i = \{t_1, \cdots, t_n\}$ represents a candidate response, and $y_i \in \{0, 1\}$ is a label; $y_i = 1$ means that $r_i$ is a suitable response for $c_i$. The goal of a model is to learn a function $g(c, r)$ that predicts the matching degree between any new context $c$ and candidate response $r$.
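To make the evaluation protocol concrete, the sketch below (not from the paper; the function and variable names are our own) ranks the candidates of one sample with a scoring function $g(c, r)$ and computes the ranking metrics used later in Table 1:

```python
import torch

def gold_rank(g, context, candidates):
    """Rank candidates by g(context, r); return the 0-based rank of the gold response.

    g          -- a callable implementing the matching function g(c, r)
    context    -- the conversation context
    candidates -- list of candidate responses; index 0 is the gold response
    """
    scores = torch.tensor([g(context, r) for r in candidates])
    order = torch.argsort(scores, descending=True)
    return (order == 0).nonzero(as_tuple=True)[0].item()

def recall_at_k(ranks, k):
    """R_n@k: fraction of samples whose gold response is ranked in the top k."""
    return sum(r < k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank of the gold response."""
    return sum(1.0 / (r + 1) for r in ranks) / len(ranks)
```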

[Figure omitted: word-piece embeddings of the conversation context are encoded by the BERT transformer into a context representation; representations of training contexts similar to the candidate response are matched against it with SubMult, passed through a BiGRU, and aggregated with attention into an enriched context; a gate fuses the enriched and regular contexts, and the fused context is matched with the candidate response representation and fed to fully connected layers to produce the final score.]

Fig. 1. Architecture of the proposed method.



2.2 BERT Bi-Encoder


The BERT Bi-Encoder [9] is the basis of our architecture. It uses BERT, a multi-layer transformer pre-trained as a language model [4], to compute sentence representations for the conversation context and the candidate response separately, and then matches these representations to predict their matching degree. The two representations are computed as follows:

$v_c = \mathrm{reduce}(T(\mathrm{context})), \qquad v_r = \mathrm{reduce}(T(\mathrm{response}))$

where $T$ is the BERT transformer and $\mathrm{reduce}$ is a function that aggregates the final hidden states of the transformer into a single vector. Following previous work, $\mathrm{reduce}$ selects the first hidden state (corresponding to the special token [CLS]).
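As a rough sketch of this step (assuming the Huggingface transformers DistilBERT checkpoint used in Section 3; the helper name is ours), the two vectors can be computed as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def encode(text: str) -> torch.Tensor:
    """reduce(T(text)): run the transformer and keep the first ([CLS]) hidden state."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden[:, 0]                            # (1, hidden_size)

v_c = encode("how do I mount a usb drive __eou__ which ubuntu version are you on")
v_r = encode("run lsblk to find the device then sudo mount it under /mnt")
```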
To compare the two vectors, we use a slightly modified version of the SubMult function [23], coupled with a two-layer feed-forward network, as we found this gives better results than simply using the dot product of the two vectors:

$\mathrm{SubMult}(v_c, v_r) = [v_c, v_r, v_c - v_r, v_r - v_c, (v_c - v_r) * (v_r - v_c)]$

$\hat{y} = m(v_c, v_r) = W_2\,\mathrm{ReLU}(W_1\,\mathrm{SubMult}(v_c, v_r))$

where $\hat{y}$ is the predicted matching score for the context-response pair and $[\cdot, \cdot]$ denotes concatenation. The network is trained to minimize a cross-entropy loss whose logits are $m(c, r_0), \cdots, m(c, r_{B-1})$, where $r_0$ is the correct response and the remaining candidates are drawn from the other samples in the training batch (of size $B$), as this greatly speeds up training [9,17].
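A minimal sketch of this comparison head and the in-batch-negative objective (layer sizes and class names are our own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubMultHead(nn.Module):
    """Compare two vectors with the modified SubMult features and a 2-layer FFN."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.w1 = nn.Linear(5 * dim, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, v_c: torch.Tensor, v_r: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [v_c, v_r, v_c - v_r, v_r - v_c, (v_c - v_r) * (v_r - v_c)], dim=-1)
        return self.w2(F.relu(self.w1(feats))).squeeze(-1)  # matching score

def in_batch_loss(head: SubMultHead, ctx: torch.Tensor, resp: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a batch: response i is the positive for context i,
    the other responses in the batch serve as negatives."""
    B, d = ctx.shape
    logits = head(ctx.unsqueeze(1).expand(B, B, d),    # (B, B) pairwise scores
                  resp.unsqueeze(0).expand(B, B, d))
    return F.cross_entropy(logits, torch.arange(B))
```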

2.3 BERT Bi-Encoder+CE


Bi-Encoder+CE is our proposed architecture. The BERT Bi-Encoder+CE not only compares the context vector to the candidate response vector, but also compares it to $k$ context vectors that are similar to the candidate response and are retrieved from the training set using cosine similarity. The architecture is depicted in Figure 1.
The model first constructs the representation vector $C$ of the conversation context, as in the regular BERT Bi-Encoder. It then uses the pre-computed candidate response vector $R$ to retrieve a set of $k$ similar contexts $\{C_1^R, \cdots, C_k^R\}$ from the training set, which have also been pre-computed, using cosine similarity. Each of them is compared to the context vector using the SubMult function, and the resulting vectors are passed to a bidirectional GRU [3].

$\hat{C}_i^R = \mathrm{SubMult}(C, C_i^R), \qquad H_i = \mathrm{BiGRU}(\hat{C}_i^R)$


We use additive attention to aggregate the GRU hidden states $\{H_1, \cdots, H_k\} \subset \mathbb{R}^{2d}$ into a single vector $\hat{C} \in \mathbb{R}^{2d}$:

$a_i = \mathrm{softmax}(W_{13} \tanh(W_{11} H_i + W_{12} R)), \qquad \hat{C} = \sum_{i=1}^{k} a_i H_i$

so that the model can rely on the most relevant helper contexts when building the enriched context vector.
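A sketch of this context-enrichment step under our own naming assumptions (the pre-computed training-context matrix and the response vector are taken as given; see the caching sketch in Section 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEnricher(nn.Module):
    """Retrieve the k training contexts most similar to the candidate response,
    compare each to the current context with SubMult, and aggregate the results
    with a BiGRU followed by additive attention."""
    def __init__(self, dim: int, k: int = 20):
        super().__init__()
        self.k = k
        self.gru = nn.GRU(5 * dim, dim, batch_first=True, bidirectional=True)
        self.w11 = nn.Linear(2 * dim, dim)
        self.w12 = nn.Linear(dim, dim)
        self.w13 = nn.Linear(dim, 1)

    def forward(self, C, R, train_ctx):
        # C: (B, d) context, R: (B, d) response, train_ctx: (N, d) cached contexts.
        sims = F.normalize(R, dim=-1) @ F.normalize(train_ctx, dim=-1).T  # cosine, (B, N)
        idx = sims.topk(self.k, dim=-1).indices                           # (B, k)
        C_r = train_ctx[idx]                                              # (B, k, d)
        Cq = C.unsqueeze(1).expand_as(C_r)
        submult = torch.cat(
            [Cq, C_r, Cq - C_r, C_r - Cq, (Cq - C_r) * (C_r - Cq)], dim=-1)
        H, _ = self.gru(submult)                                          # (B, k, 2d)
        scores = self.w13(torch.tanh(self.w11(H) + self.w12(R).unsqueeze(1)))
        a = torch.softmax(scores, dim=1)                                  # attention over k
        return (a * H).sum(dim=1)                                         # enriched Ĉ: (B, 2d)
```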

For the model to learn when to rely on the enriched context $C_e$ and when to rely on the regular context, we split $\hat{C}$ into two halves, using one half as the enriched context $C_e$ and the other as the control $G$ of a gating mechanism. The resulting fused context vector $C_f$ is then compared to the response representation $R$:

$C_e = \{\hat{c}_1, \cdots, \hat{c}_d\}, \qquad G = \{\hat{c}_{d+1}, \cdots, \hat{c}_{2d}\}$

$C_f = \mathrm{sigmoid}(G) * C_e + (1 - \mathrm{sigmoid}(G)) * C$

$\hat{y} = W_{22}\,\mathrm{ReLU}(W_{21}\,\mathrm{SubMult}(C_f, R))$
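A sketch of the gate and the final scoring step under the same assumptions (class and layer names are ours; submult is the feature function from Section 2.2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def submult(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Modified SubMult comparison features."""
    return torch.cat([a, b, a - b, b - a, (a - b) * (b - a)], dim=-1)

class GatedScorer(nn.Module):
    """Fuse enriched and regular context with a sigmoid gate, then score the
    fused context against the response with SubMult and a 2-layer FFN."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.w21 = nn.Linear(5 * dim, hidden)
        self.w22 = nn.Linear(hidden, 1)

    def forward(self, C_hat, C, R):
        # C_hat: (B, 2d) enriched vector; C, R: (B, d) context and response vectors.
        C_e, G = C_hat.chunk(2, dim=-1)        # split Ĉ into two halves
        gate = torch.sigmoid(G)
        C_f = gate * C_e + (1 - gate) * C      # fused context
        return self.w22(F.relu(self.w21(submult(C_f, R)))).squeeze(-1)
```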

3 Experiment Settings
Our models are implemented in the PyTorch framework [18]. For the BERT component, we use the DistilBERT implementation available in Huggingface's transformers library (https://github.com/huggingface/transformers), since it provides reasonable results despite having only 6 transformer layers instead of the 12 in the original implementation. We train each network three times with different seeds and, for each model, select the run that gives the best results on the development set. We also implemented Dual-LSTM [15] and ESIM [5] as examples of non-transformer-based Bi-Encoders and Cross-Encoders. Our code will be released as open source.
To shorten training time, we first train a BERT Bi-Encoder until convergence using the AdamW optimizer [14] and a learning rate of $5 \times 10^{-5}$. We then fine-tune the other networks, using it as a starting point, with the Adam optimizer [10] and a learning rate of $10^{-4}$. The batch size is 32 for BERT models and 128 for the rest.
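For concreteness, a sketch of this two-stage setup (the variable names and the placeholder for the added components are our own assumptions):

```python
import torch
from transformers import AutoModel

bert = AutoModel.from_pretrained("distilbert-base-uncased")  # 6-layer DistilBERT

# Stage 1: train the BERT Bi-Encoder until convergence with AdamW, lr = 5e-5.
bi_encoder_optimizer = torch.optim.AdamW(bert.parameters(), lr=5e-5)

# Stage 2: starting from the converged Bi-Encoder, fine-tune the full network
# (including the Context Enrichment components) with Adam, lr = 1e-4.
extra_modules = torch.nn.ModuleList()  # placeholder for ContextEnricher / GatedScorer
fine_tune_optimizer = torch.optim.Adam(
    list(bert.parameters()) + list(extra_modules.parameters()), lr=1e-4)
```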

4 Results and Discussion


The results depicted in Table 1 show that the inclusion of training contexts similar to the candidate response can be beneficial to the response retrieval process. It can also be noted that Context Enrichment (CE) adds a relatively small amount of overhead and is still faster than even simple Cross-Encoder architectures. Table 2 shows that increasing the number of considered contexts similar to the candidate response improves performance, but only up to $k = 20$.
Due to limited compute resources, we were not able to match the state-of-the-art Bi-Encoder result [9]. Their models were trained on eight 16GB GPUs, while the results here were obtained using a single 11GB GPU. This means we had to make compromises on sequence lengths, the number of negative samples, and model size. This is especially apparent in our BERT Cross-Encoder result, as increasing the number of negative samples to match the Bi-Encoder is infeasible on one GPU.

4.1 Ablation Study


Table 3 depicts how well the model performs with various components altered. These results show that only when we combine the enriched context and the regular context are we able to see significant gains.

Model                R10@1  R10@2  R10@5  MRR   Inference time (ms)
SMN* [24]            72.6   84.7   96.1   —     —
ESIM* [5]            75.9   87.2   97.3   84.8  —
MRFN* [22]           78.6   88.6   97.6   —     —
Dual-LSTM [15]       54.8   73.1   91.3   70.3  —
ESIM [5]             74.4   85.2   96.1   83.4  9
BERT Bi-Encoder      78.0   88.5   97.4   86.2  3
BERT Bi-Encoder+CE   79.3   89.3   97.5   87.0  4.5
BERT Cross-Encoder   76.5   86.4   96.4   84.8  30

Table 1. Comparison of model performance. Starred results are reported from their respective papers; all other models were re-implemented in PyTorch. Our main baseline is the BERT Bi-Encoder; BERT Bi-Encoder+CE metrics that are statistically significant relative to it are marked in bold (paired two-tailed t-tests, p < 0.05). Inference time is the average number of milliseconds needed to process a single sample on a GPU.
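The significance test mentioned in the caption can be reproduced with a paired two-tailed t-test over per-sample metric values, roughly as follows (a sketch; the value arrays are dummy placeholders, not actual results):

```python
from scipy import stats

# Per-sample metric values (e.g. reciprocal rank), aligned so that entry i refers
# to the same test sample under both models. Placeholder numbers only.
baseline_rr = [1.0, 0.5, 1.0, 0.33, 1.0]  # BERT Bi-Encoder
ce_rr       = [1.0, 1.0, 1.0, 0.50, 1.0]  # BERT Bi-Encoder+CE

t_stat, p_value = stats.ttest_rel(ce_rr, baseline_rr)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```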

These gains suggest that the enriched context vector contains information complementary to the regular context vector. The results also demonstrate the effectiveness of the SubMult function.

5 Conclusion and Future Work

In this paper, we introduced a new architecture for use in retrieval-based chatbots. When predicting the matching degree between a candidate response and a conversation context, the new architecture takes into account contexts within the training set that are possible matches for the candidate response and compares them to the current context. The model improves upon the BERT Bi-Encoder baseline without greatly affecting inference speed. We also provide an overview of the performance/speed trade-off between the mentioned architectures.
The advent of powerful sequence encoders based on transformer architectures gives researchers a new avenue to explore, as Bi-Encoders have been somewhat ignored in favour of the more computationally expensive Cross-Encoders. One approach we intend to explore is using graph convolutional networks [11] to construct a makeshift knowledge base that provides a richer and more structurally sound model of the relations between contexts and responses, similar to previous work on text classification [26].

k    Dev R10@1  Test R10@1
0    78.1       78.0
2    78.7       78.5
5    78.6       78.6
10   78.9       79.0
15   79.0       79.1
20   79.2       79.3
25   79.1       79.2

Table 2. Analysis of hyper-parameter k.

Model            Dev R10@1  Test R10@1
Bi-Encoder+CE    79.2       79.3
 -Attention      78.9       78.8
 -Gate           78.0       77.9
Bi-Encoder       78.1       78.0
 -SubMult        76.8       76.5

Table 3. Ablated model metrics.

References
1. Ai, Q., Bi, K., Guo, J., Croft, W.B.: Learning a deep listwise context model for
ranking refinement. In: The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval. pp. 135–144. ACM (2018)
2. Aliannejadi, M., Zamani, H., Crestani, F., Croft, W.B.: Asking clarifying questions
in open-domain information-seeking conversations. In: Proceedings of the 42nd In-
ternational ACM SIGIR Conference on Research and Development in Information
Retrieval. pp. 475–484. ACM (2019)
3. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated re-
current neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep
Learning, December 2014 (2014)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
pp. 4171–4186 (2019)
5. Dong, J., Huang, J.: Enhance word representation for out-of-vocabulary on ubuntu
dialogue corpus. CoRR abs/1802.02614 (2018)
6. Ganhotra, J., Patel, S.S., Fadnis, K.P.: Knowledge-incorporating ESIM models
for response selection in retrieval-based dialog systems. CoRR abs/1907.05792
(2019)
7. Gao, J., Galley, M., Li, L., et al.: Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13(2-3), 127–298 (2019)
8. Guo, J., Fan, Y., Ai, Q., Croft, W.B.: A deep relevance matching model for ad-
hoc retrieval. In: Proceedings of the 25th ACM International on Conference on
Information and Knowledge Management. pp. 55–64. ACM (2016)
9. Humeau, S., Shuster, K., Lachaux, M.A., Weston, J.: Real-time inference in multi-
sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969
(2019)
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd In-
ternational Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings (2015)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: 5th International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017)
12. Li, F., Qiu, M., Chen, H., Wang, X., Gao, X., Huang, J., Ren, J., Zhao, Z., Zhao,
W., Wang, L., Jin, G., Chu, W.: AliMe Assist: An intelligent assistant for cre-
ating an innovative e-commerce experience. In: Proceedings of the 2017 ACM on
Conference on Information and Knowledge Management, CIKM 2017, Singapore,
November 06 - 10, 2017. pp. 2495–2498 (2017)
13. Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural
language understanding. In: Proceedings of the 57th Conference of the Association
for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers. pp. 4487–4496 (2019)
14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: 7th Inter-
national Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019 (2019)
15. Lowe, R., Pow, N., Serban, I., Pineau, J.: The ubuntu dialogue corpus: A large
dataset for research in unstructured multi-turn dialogue systems. In: Proceedings of

the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
pp. 285–294 (2015)
16. Lowe, R., Pow, N., Serban, I.V., Charlin, L., Liu, C.W., Pineau, J.: Training end-
to-end dialogue systems with the ubuntu dialogue corpus. Dialogue & Discourse
8(1), 31–65 (2017)
17. Mazaré, P., Humeau, S., Raison, M., Bordes, A.: Training millions of personalized
dialogue agents. In: Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018.
pp. 2775–2779 (2018)
18. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In:
NIPS Autodiff Workshop (2017)
19. Serban, I.V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., Bengio,
Y.: A hierarchical latent variable encoder-decoder model for generating dialogues.
In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
20. Shen, D., Zhang, Y., Henao, R., Su, Q., Carin, L.: Deconvolutional latent-variable
model for text sequence matching. In: Thirty-Second AAAI Conference on Artificial
Intelligence (2018)
21. Shum, H.Y., He, X.d., Li, D.: From eliza to xiaoice: challenges and opportunities
with social chatbots. Frontiers of Information Technology & Electronic Engineering
19(1), 10–26 (2018)
22. Tao, C., Wu, W., Xu, C., Hu, W., Zhao, D., Yan, R.: Multi-representation fu-
sion network for multi-turn response selection in retrieval-based chatbots. In: Pro-
ceedings of the Twelfth ACM International Conference on Web Search and Data
Mining. pp. 267–275. ACM (2019)
23. Wang, S., Jiang, J.: A compare-aggregate model for matching text sequences. In:
5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings (2017)
24. Wu, Y., Wu, W., Xing, C., Zhou, M., Li, Z.: Sequential matching network: A
new architecture for multi-turn response selection in retrieval-based chatbots. In:
Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). vol. 1, pp. 496–505 (2017)
25. Xu, Z., Liu, B., Wang, B., Sun, C., Wang, X.: Incorporating loose-structured knowl-
edge into conversation modeling via recall-gate lstm. In: 2017 International Joint
Conference on Neural Networks (IJCNN). pp. 3506–3513. IEEE (2017)
26. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification.
In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp.
7370–7377 (2019)
