Enriching Conversation Context in Retrieval-Based Chatbots

1 Introduction & Related Works
We compare the new approach to the typical BERT Bi-Encoder, as well as to Cross-Encoder and Bi-Encoder methods that do not use pre-trained transformers. The experiments are conducted on an existing information-seeking conversational dataset and show that the new architecture improves upon the baseline without greatly affecting its inference speed.
The rest of this paper is organized as follows: in Section 2 we explain the task and our proposed method in greater detail; Sections 3 and 4 detail the experiment settings and the obtained results.
2 Method
In this section, we explain the task in greater detail and provide an overview of
our proposed method.
[Figure: Overview of the proposed architecture. Word piece embeddings of the conversation context are passed through the BERT transformer; its final layer states yield the context representation and the context-context matching vectors, which form the enriched context. A gating mechanism combines the enriched and regular contexts into the fused context, and the resulting fused matching signal is fed to fully connected layers that produce the final score.]
The context c and response r are encoded independently as v_c = reduce(T(c)) and v_r = reduce(T(r)), where T is the BERT transformer and reduce is a function that aggregates the final hidden states of the transformer into a single vector. Following previous work, reduce selects the first hidden vector (corresponding to the special token [CLS]).
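A minimal sketch of this encoding step, assuming Huggingface's DistilBERT checkpoint and a first-token reduction; the helper name `encode`, the truncation length, and the turn separator in the example are our own choices, not from the paper:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def encode(texts, max_length=256):
    """reduce(T(x)): run the transformer and keep the first hidden state."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, dim)
    return hidden[:, 0]                           # vector at the [CLS] position

# Example: context turns joined into one string (separator choice is ours).
v_c = encode(["how do i mount a usb drive ? [EOT] which ubuntu version are you on ?"])
v_r = encode(["you can use the disks utility or the mount command ."])
```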
To compare the two vectors we use a slightly modified version of the SubMult
function [23] coupled with a two-layer feed-forward network as we found it gave
better results than simply using the dot product of the two vectors:
SubMult(v_c, v_r) = [v_c, v_r, v_c − v_r, v_r − v_c, (v_c − v_r) ∗ (v_r − v_c)]

ŷ = m(v_c, v_r) = W_2 ReLU(W_1 SubMult(v_c, v_r))
where ŷ is the predicted matching score for the context-response pair and [·, ·] is the concatenation operation. The network is trained to minimize the cross-entropy loss whose logits are m(c, r_0), …, m(c, r_n), where r_0 is the correct response. The rest of the candidate responses are selected from the other samples in the training batch, as this greatly speeds up training [9,17].
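A hedged sketch of this scoring head and the in-batch-negatives objective, assuming 768-dimensional encoder outputs; the hidden width (512) and the exact set of difference terms follow our reading of the formula above rather than the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubMultHead(nn.Module):
    """Two-layer feed-forward network over the SubMult comparison features."""
    def __init__(self, dim=768, hidden=512):
        super().__init__()
        self.w1 = nn.Linear(5 * dim, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, v_c, v_r):
        feats = torch.cat([v_c, v_r, v_c - v_r, v_r - v_c,
                           (v_c - v_r) * (v_r - v_c)], dim=-1)
        return self.w2(F.relu(self.w1(feats))).squeeze(-1)   # matching score

def in_batch_loss(head, ctx, resp):
    """Cross-entropy where resp[i] is the positive for ctx[i] and every other
    response in the batch serves as a negative."""
    b = ctx.size(0)
    # Score every context against every response in the batch: (b, b) logits.
    logits = head(ctx.unsqueeze(1).expand(b, b, -1).reshape(b * b, -1),
                  resp.unsqueeze(0).expand(b, b, -1).reshape(b * b, -1)).view(b, b)
    return F.cross_entropy(logits, torch.arange(b, device=ctx.device))

head = SubMultHead()
ctx, resp = torch.randn(32, 768), torch.randn(32, 768)  # batch of v_c and v_r
loss = in_batch_loss(head, ctx, resp)
```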
For the model to learn when to use the enriched context C_e and when to use the regular context, we use the first half of Ĉ as a control for a gating mechanism. The resulting fused context vector C_f is then compared to the response representation R with the matching function m described above.
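The gating equation itself is not reproduced here, so the following is only an illustrative sketch of one plausible formulation: a sigmoid gate computed from the first half of Ĉ that interpolates between the enriched context C_e and the regular context C. The linear projection and the convex combination are our assumptions, not the paper's definition:

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Illustrative gate (not the paper's exact formulation): the first half of
    the enriched-context vector C_hat controls how the enriched context C_e and
    the regular context C are mixed into the fused context C_f."""
    def __init__(self, dim=768):
        super().__init__()
        self.control = nn.Linear(dim // 2, dim)

    def forward(self, c_hat, c, c_e):
        gate = torch.sigmoid(self.control(c_hat[:, : c_hat.size(-1) // 2]))
        return gate * c_e + (1.0 - gate) * c   # fused context C_f
```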
3 Experiment Settings
Our models are implemented in the PyTorch framework [18]. For our BERT component, we used the DistilBERT implementation available in Huggingface's Transformers library, since it provides reasonable results despite having only 6 transformer layers instead of the 12 in the original implementation. We train each network three times with different seeds and, for each model, select the run that gives the best results on the development set. We also implemented Dual-LSTM [15] and ESIM [5] as examples of non-transformer-based Bi-Encoders and Cross-Encoders. Our code will be released as open source.
To shorten training time, we train a BERT Bi-Encoder until convergence using the AdamW optimizer [14] and a learning rate of 5 × 10⁻⁵. We then fine-tune the other networks, using it as a starting point, with the Adam optimizer [10] and a learning rate of 10⁻⁴. The batch size is 32 for BERT models and 128 for the rest.
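For reference, a minimal sketch of the two-stage optimizer setup just described; the placeholder modules stand in for the actual Bi-Encoder and the enriched-context network:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the real models; swap in the DistilBERT
# Bi-Encoder and the enriched-context network from Section 2.
bi_encoder = nn.Linear(768, 768)
enriched_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 1))

# Stage 1: train the BERT Bi-Encoder to convergence with AdamW at lr = 5e-5
# (batch size 32 for BERT models).
stage1_opt = torch.optim.AdamW(bi_encoder.parameters(), lr=5e-5)

# Stage 2: fine-tune the other networks, initialized from the Bi-Encoder,
# with plain Adam at lr = 1e-4 (batch size 128 for non-BERT models).
stage2_opt = torch.optim.Adam(enriched_model.parameters(), lr=1e-4)
```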
Table 1. Comparison of model performance. Starred results are reported from their respective papers; all other models were re-implemented in PyTorch. Our main baseline is the BERT Bi-Encoder; BERT Bi-Encoder+CE metrics that are statistically significant relative to it are marked in bold. We use paired two-tailed t-tests with a p-value < 0.05. Inference time is the average number of milliseconds needed to process a single sample on a GPU.
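To make the significance test concrete, a small sketch using SciPy's paired two-tailed t-test; the per-sample metric values below are made-up placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-sample metric values (e.g. reciprocal rank) for the
# baseline and the proposed model, measured on the same evaluation samples.
baseline = np.array([0.50, 1.00, 0.33, 1.00, 0.25, 0.50, 1.00, 0.20])
proposed = np.array([1.00, 1.00, 0.50, 1.00, 0.33, 0.50, 1.00, 0.25])

t_stat, p_value = ttest_rel(proposed, baseline)   # paired two-tailed t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```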
References
1. Ai, Q., Bi, K., Guo, J., Croft, W.B.: Learning a deep listwise context model for
ranking refinement. In: The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval. pp. 135–144. ACM (2018)
2. Aliannejadi, M., Zamani, H., Crestani, F., Croft, W.B.: Asking clarifying questions
in open-domain information-seeking conversations. In: Proceedings of the 42nd In-
ternational ACM SIGIR Conference on Research and Development in Information
Retrieval. pp. 475–484. ACM (2019)
3. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated re-
current neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep
Learning, December 2014 (2014)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
pp. 4171–4186 (2019)
5. Dong, J., Huang, J.: Enhance word representation for out-of-vocabulary on ubuntu
dialogue corpus. CoRR abs/1802.02614 (2018)
6. Ganhotra, J., Patel, S.S., Fadnis, K.P.: Knowledge-incorporating ESIM models
for response selection in retrieval-based dialog systems. CoRR abs/1907.05792
(2019)
7. Gao, J., Galley, M., Li, L., et al.: Neural approaches to conversational ai. Founda-
tions and Trends® in Information Retrieval 13(2-3), 127–298 (2019)
8. Guo, J., Fan, Y., Ai, Q., Croft, W.B.: A deep relevance matching model for ad-
hoc retrieval. In: Proceedings of the 25th ACM International on Conference on
Information and Knowledge Management. pp. 55–64. ACM (2016)
9. Humeau, S., Shuster, K., Lachaux, M.A., Weston, J.: Real-time inference in multi-
sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969
(2019)
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd In-
ternational Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings (2015)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: 5th International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017)
12. Li, F., Qiu, M., Chen, H., Wang, X., Gao, X., Huang, J., Ren, J., Zhao, Z., Zhao,
W., Wang, L., Jin, G., Chu, W.: AliMe Assist: An intelligent assistant for creating
an innovative e-commerce experience. In: Proceedings of the 2017 ACM on
Conference on Information and Knowledge Management, CIKM 2017, Singapore,
November 06 - 10, 2017. pp. 2495–2498 (2017)
13. Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural
language understanding. In: Proceedings of the 57th Conference of the Association
for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers. pp. 4487–4496 (2019)
14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: 7th Inter-
national Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019 (2019)
15. Lowe, R., Pow, N., Serban, I., Pineau, J.: The ubuntu dialogue corpus: A large
dataset for research in unstructured multi-turn dialogue systems. In: Proceedings of
the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
pp. 285–294 (2015)
16. Lowe, R., Pow, N., Serban, I.V., Charlin, L., Liu, C.W., Pineau, J.: Training end-
to-end dialogue systems with the ubuntu dialogue corpus. Dialogue & Discourse
8(1), 31–65 (2017)
17. Mazaré, P., Humeau, S., Raison, M., Bordes, A.: Training millions of personalized
dialogue agents. In: Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018.
pp. 2775–2779 (2018)
18. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In:
NIPS Autodiff Workshop (2017)
19. Serban, I.V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., Bengio,
Y.: A hierarchical latent variable encoder-decoder model for generating dialogues.
In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
20. Shen, D., Zhang, Y., Henao, R., Su, Q., Carin, L.: Deconvolutional latent-variable
model for text sequence matching. In: Thirty-Second AAAI Conference on Artificial
Intelligence (2018)
21. Shum, H.Y., He, X.d., Li, D.: From eliza to xiaoice: challenges and opportunities
with social chatbots. Frontiers of Information Technology & Electronic Engineering
19(1), 10–26 (2018)
22. Tao, C., Wu, W., Xu, C., Hu, W., Zhao, D., Yan, R.: Multi-representation fu-
sion network for multi-turn response selection in retrieval-based chatbots. In: Pro-
ceedings of the Twelfth ACM International Conference on Web Search and Data
Mining. pp. 267–275. ACM (2019)
23. Wang, S., Jiang, J.: A compare-aggregate model for matching text sequences. In:
5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings (2017)
24. Wu, Y., Wu, W., Xing, C., Zhou, M., Li, Z.: Sequential matching network: A
new architecture for multi-turn response selection in retrieval-based chatbots. In:
Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). vol. 1, pp. 496–505 (2017)
25. Xu, Z., Liu, B., Wang, B., Sun, C., Wang, X.: Incorporating loose-structured knowl-
edge into conversation modeling via recall-gate lstm. In: 2017 International Joint
Conference on Neural Networks (IJCNN). pp. 3506–3513. IEEE (2017)
26. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification.
In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp.
7370–7377 (2019)