appropriate. These responses often lack a profound understanding of the specific context. While this approach may reduce the generation of responses unrelated to the conversation topic, it does not align with the objectives of dialogue system research.

Inspired by Jiang et al. [7], we address this issue by introducing an additional loss function known as Frequency-Aware Cross-Entropy loss. This method penalizes frequently occurring answers by applying weighted penalties. To accomplish this, we calculate the relative frequency R_i of each token c_i in the training corpus before processing the next batch of samples and derive their corresponding frequency-based weights w_i:

R_i = \frac{\mathrm{freq}(c_i)}{\sum_{j=1}^{V} \mathrm{freq}(c_j)}    (21)

w_i = a \times R_i + 1    (22)
Here, V represents the size of the vocabulary, and a = -(\max_{1 \le j \le V} R_j)^{-1} represents the frequency slope. To ensure that w_i falls within the [0, 1] range, we incorporate an additional bias factor of 1. Consequently, tokens with higher frequencies in the corpus are assigned lower weights. Finally, we normalize w_i to have a mean value of 1 and calculate the diversity loss:

\mathcal{L}_{div} = -\sum_{t=1}^{T}\sum_{i=1}^{V} w_i \, \delta_t(c_i) \log P(c_i \mid y_{<t}, C)    (23)

\delta_t(c_i) = \begin{cases} 1, & c_i = y_t \\ 0, & \text{otherwise} \end{cases}    (24)
Here, c_i denotes a candidate token and \delta_t(c_i) is an indicator function. Ultimately, we employ a multitask learning approach to jointly optimize the generation loss, emotion loss, and diversity loss:

\mathcal{L} = \lambda_1 \mathcal{L}_{nll} + \lambda_2 \mathcal{L}_{emo} + \lambda_3 \mathcal{L}_{div}    (25)
Here, λ1, λ2, and λ3 are hyperparameters. In our experiments, we observed that setting them to commonly used values, such as λ1 = 1, λ2 = 2, and λ3 = 1, still resulted in a tendency to generate similar answers to some extent. We therefore believe that assigning a higher weight to the diversity loss is essential to address this issue. Through multiple experiments, we also confirmed that setting λ1 = 1, λ2 = 2, and λ3 = 1.5 yields the strongest results.
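To make the training objective concrete, the following is a minimal PyTorch sketch of how Eqs. (21)-(25) could be implemented. The function names, the padding convention, and the way token counts are gathered from the training corpus are illustrative assumptions rather than details of the actual implementation.

```python
import torch
import torch.nn.functional as F

def frequency_weights(token_counts: torch.Tensor) -> torch.Tensor:
    """Eqs. (21)-(22): relative token frequencies R and weights w = a*R + 1,
    normalized to a mean of 1 so that frequent tokens receive lower weight.
    token_counts holds per-token counts over the training corpus (shape: vocab)."""
    R = token_counts.float() / token_counts.sum()      # Eq. (21)
    a = -1.0 / R.max()                                 # frequency slope
    w = a * R + 1.0                                    # Eq. (22), values in [0, 1]
    return w / w.mean()                                # normalize to mean 1

def diversity_loss(logits: torch.Tensor, targets: torch.Tensor,
                   weights: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Eqs. (23)-(24): frequency-weighted cross-entropy. The indicator delta
    selects the gold token, so only the -log P(y_t | y_<t, C) terms survive.
    logits: (batch, len, vocab); targets: (batch, len); weights: (vocab,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    w_t = weights[targets]                             # w_i of each gold token
    mask = (targets != pad_id).float()                 # ignore padding (assumed id 0)
    return (w_t * nll * mask).sum() / mask.sum()       # mean over tokens is an assumption

# Eq. (25): multitask objective with the weights reported above; loss_nll and
# loss_emo are assumed to be produced elsewhere by the generation and emotion heads.
# total_loss = 1.0 * loss_nll + 2.0 * loss_emo + 1.5 * diversity_loss(logits, targets, w)
```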
III. EXPERIMENTS

A. Implementation Details
The EmpatheticDialogues dataset was split into training, validation, and test sets in an 8:1:1 ratio for our experiments. All models were implemented in PyTorch, and we initialized word embeddings with pre-trained GloVe vectors. The learning rate was initialized to 2e-5, and we used a linear warm-up with 4,000 warm-up steps to dynamically adjust the learning rate. Our optimizer was Adam, with parameters β1 = 0.9 and β2 = 0.999. Furthermore, we restricted the number of externally introduced concepts to at most 10 per dialogue and 5 per token. The model was trained on an RTX A5000 GPU. During the inference phase, we imposed a maximum of 30 decoding steps.
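For reproducibility, the optimizer and warm-up schedule described above can be set up in PyTorch roughly as follows; the behaviour after warm-up (holding the learning rate constant) is an assumption, since only the linear warm-up itself is specified.

```python
import torch

model = torch.nn.Linear(300, 300)   # placeholder; any nn.Module with parameters

# Adam with the reported hyperparameters: lr = 2e-5, beta1 = 0.9, beta2 = 0.999.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.999))

warmup_steps = 4000

def lr_scale(step: int) -> float:
    # Linearly ramp the learning rate over the first 4000 steps,
    # then keep it at the base value (post-warm-up schedule assumed constant).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

# Training loop sketch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```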
B. Baselines
Transformer: A fully attention-based Seq2Seq model trained with Maximum Likelihood Estimation (MLE) loss.
MoEL: A transformer-based model that discerns emotional states through context encoding and generates responses for each emotion using multiple decoders [8].
MIME: Another transformer-based model that uses polarity-based emotion clusters and emotion imitation to create empathetic responses [6].
EmpDG: A multi-resolution adversarial model built on Wasserstein Generative Adversarial Networks (WGAN). It captures nuanced human emotions by employing both coarse-grained dialogue-level and fine-grained token-level emotion modeling to produce empathetic dialogues [9].
CEM: A model that generates commonsense knowledge with pre-trained models and enhances empathetic response generation by incorporating it as supplementary information [4].
C. Results
To validate the model's performance, we employed both a set of automatic evaluation metrics and a set of human evaluation metrics. Additionally, to rigorously verify the effectiveness of each component within the MKEMP framework, we conducted a series of ablation experiments.
Automatic evaluation
Following prior research, this paper adopts Perplexity (PPL), Distinct-1 (Dist-1), and Distinct-2 (Dist-2) as the primary automatic evaluation metrics. PPL is a commonly used measure of overall generation quality: it reflects the model's confidence in the candidate responses, with higher confidence resulting in lower PPL. Distinct-n measures generation diversity as the proportion of distinct n-grams in the overall generated results; higher Dist-1 and Dist-2 values indicate greater diversity. We also report Emotion Prediction Accuracy (Acc), where a higher Acc implies stronger emotion prediction capability.
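For clarity, the diversity and perplexity metrics can be computed as in the following sketch; the exact tokenization and whether Dist-n is aggregated per response or over the whole test set are assumptions, since they are not specified above.

```python
from collections import Counter
import math

def distinct_n(responses, n):
    """Dist-n: number of unique n-grams divided by the total number of
    n-grams across all generated responses (corpus-level aggregation assumed)."""
    ngrams = Counter()
    for tokens in responses:                      # each response: a list of tokens
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def perplexity(token_nlls):
    """PPL: exponential of the average per-token negative log-likelihood
    (natural log) that the model assigns to the reference responses."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Example: dist1 = distinct_n(generated, 1); dist2 = distinct_n(generated, 2)
```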
As illustrated in Table 1, MKEMP outperforms the other models across all four automatic evaluation metrics. This suggests that MKEMP generates responses with higher overall quality, better readability, improved grammatical correctness, and greater diversity. Furthermore, the notable improvement in Dist-2 can be attributed to the introduction of multi-scale knowledge, which enables the model to generate diverse responses from richer information.

Human evaluation
To evaluate the model's performance more objectively, we conducted comprehensive manual evaluations. We enlisted three professional data annotators from a crowdsourcing company to assess 200 randomly selected sets of dialogues. They were given two tasks: (1) rating the responses generated by the various models on a scale of 1, 3, or 5 according to three predefined criteria: fluency, relevance, and empathy [10]; and (2) for a given scenario, comparing the response from each baseline model with that from MKEMP and selecting the more suitable of the two, with the model receiving more selections deemed to perform better.
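Purely as an illustration of how the A/B judgments could be aggregated into the Win/Loss/Tie rates reported in Table 2, a small Python sketch follows; allowing annotators to record a tie (no preference) is an assumption on our part.

```python
from collections import Counter

def ab_rates(judgments):
    """judgments: one label per evaluated dialogue, each in
    {'win', 'loss', 'tie'} from MKEMP's point of view."""
    counts = Counter(judgments)
    total = len(judgments)
    return {k: 100.0 * counts[k] / total for k in ("win", "loss", "tie")}

# e.g. ab_rates(judgments_vs_transformer) would yield percentages such as
# the first row of Table 2: {'win': 45.6, 'loss': 17.1, 'tie': 37.3}
```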
The scoring results of the manual evaluation are presented in Table 2. MKEMP demonstrates a significant advantage in terms of empathy and relevance. This indicates that, by capturing multi-scale commonsense information, our model effectively improves the consistency of generated responses with the conversation topic. Furthermore, rich commonsense knowledge helps the model better comprehend the emotional context of the conversation, even in complex contexts. On the other hand, we observed that MKEMP's fluency is slightly below that of the other baseline models. We attribute this to the exceptional performance of the transformer, which means that responses generated by these transformer-based models already exhibit good fluency. The A/B testing results of the manual evaluation are displayed in Table 2, confirming that, from the perspective of real human evaluators, MKEMP performs better overall.

Models | Win | Loss | Tie
MKEMP vs Transformer | 45.6% | 17.1% | 37.3%
MKEMP vs MoEL | 38.1% | 18.5% | 43.4%
MKEMP vs MIME | 36.7% | 18.9% | 44.4%
MKEMP vs EmpDG | 36.2% | 20.3% | 43.5%
MKEMP vs CEM | 34.8% | 25.5% | 39.7%
Table 2: Result of human A/B test.

Ablation study
Specifically, we created four variants of MKEMP by removing specific components from the model: (1) w/o G: removing the part that extracts fine-grained knowledge by building a knowledge context graph; (2) w/o W: eliminating the part that uses COMET to extract coarse-grained knowledge; (3) w/o D: excluding the diversity loss from the loss function; (4) w/o R: omitting the emotional context refinement encoder and the cognitive context refinement encoder. The results of the ablation experiments are presented in Table 3, and we observe that all four variants perform significantly worse than MKEMP on most metrics, demonstrating the effectiveness of each component in our model.

Models | Acc | PPL↓ | Dist-1 | Dist-2
MKEMP | 39.84 | 35.23 | 0.67 | 2.82
w/o G | 38.86 | 35.03 | 0.62 | 2.64
w/o W | 37.88 | 35.77 | 0.53 | 2.49
w/o D | 39.51 | 34.97 | 0.41 | 2.27
w/o R | 34.69 | 36.05 | 0.54 | 2.55
Table 3: Result of ablation study.

IV. CONCLUSION
In this paper, we propose a Mixed Knowledge-Enhanced Empathetic Dialogue Generation Model (MKEMP) that generates high-quality empathetic responses by incorporating multi-scale external knowledge to capture rich emotional states. Extensive automatic and human evaluations empirically demonstrate the effectiveness of our approach for empathetic response generation. We hope our work inspires future research on similar tasks to expand from single-source information to multi-scale information.

REFERENCES
[1] Zandie, R., & Mahoor, M. H. (2020). EmpTransfo: A multi-head transformer architecture for creating empathetic dialog systems. arXiv preprint arXiv:2003.02958.
[2] Welivita, A., & Pu, P. (2020). A taxonomy of empathetic response intents in human social conversations. arXiv preprint arXiv:2012.04080.
[3] Zheng, C., Liu, Y., Chen, W., Leng, Y., & Huang, M. (2021). CoMAE: A multi-factor hierarchical framework for empathetic response generation. arXiv preprint arXiv:2105.08316.
[4] Sabour, S., Zheng, C., & Huang, M. (2022, June). CEM: Commonsense-aware empathetic response generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11229-11237).
[5] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
[6] Majumder, N., Hong, P., Peng, S., Lu, J., Ghosal, D., Gelbukh, A., ... & Poria, S. (2020). MIME: MIMicking emotions for empathetic response generation. arXiv preprint arXiv:2010.01454.
[7] Jiang, S., Ren, P., Monz, C., & de Rijke, M. (2019, May). Improving neural response diversity with frequency-aware cross-entropy loss. In The World Wide Web Conference (pp. 2879-2885).
[8] Lin, Z., Madotto, A., Shin, J., Xu, P., & Fung, P. (2019). MoEL: Mixture of empathetic listeners. arXiv preprint arXiv:1908.07687.
[9] Li, Q., Chen, H., Ren, Z., Ren, P., Tu, Z., & Chen, Z. (2019). EmpDG: Multiresolution interactive empathetic dialogue generation. arXiv preprint arXiv:1911.08698.
[10] Rashkin, H., Smith, E. M., Li, M., & Boureau, Y. L. (2018). Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.