
Mixed Knowledge-Enhanced Empathetic Dialogue Generation

Abstract—Empathy plays a pivotal role in genuine human communication and is therefore an essential capability for any human-centered dialogue system. Early research on empathetic response generation often focused on directly capturing the emotional state of the context and expressing empathy accordingly. However, the logic exhibited in human conversation relies heavily on experience and knowledge stored in the brain. Whether the aim is to capture more nuanced emotional states or to generate more informative responses, empathetic dialogue systems therefore need external knowledge as supplementary information. To address this need, we propose an approach that integrates multi-scale commonsense knowledge through two components: a context-specific fine-grained knowledge graph and COMET-based coarse-grained knowledge acquisition, which together capture external knowledge at two scales. This allows the model to attain a deeper understanding of the user's context and thus to express empathy more effectively. We conduct extensive experiments on the EMPATHETICDIALOGUES dataset, and the results indicate that our method outperforms baseline models in both automatic and human evaluations, reaffirming the advantages of incorporating diverse external knowledge in empathetic response generation.

Keywords—empathetic dialogue generation, dialog system, external knowledge, deep learning, Natural Language Processing

I. INTRODUCTION
Empathy refers to the capacity to perceive another person's situation and respond appropriately, and it plays a pivotal role in human interaction and communication. Previous research has consistently shown that empathetic dialogue models can raise user satisfaction and elicit positive feedback across diverse domains, and that they can improve the user-machine interaction experience. Developing effective methods to strengthen the empathetic capabilities of dialogue systems is therefore important. Recent studies have explored many approaches in this domain. Zandie and Mahoor [1], drawing inspiration from pretrained language models, built a GPT-based model for generating empathetic responses. Welivita and Pu [2] introduced fine-grained empathetic responses and intents into empathetic dialogues, enabling the model to learn emotions with greater precision. Zheng et al. [3] identified three fundamental elements of empathy expression (communication mechanisms, dialogue behaviors, and emotions) and integrated them into a hierarchical modeling framework for empathetic dialogue systems. Most recently, Sabour et al. [4] used commonsense knowledge to improve the model's comprehension of the interlocutor's situation and emotions, thereby strengthening empathy in dialogue systems.

While these approaches have improved the model's capacity to comprehend emotional states during interaction, they frequently overlook the role of personal knowledge in ensuring the richness and coherence of human communication. Some researchers acknowledge the significance of external knowledge in empathetic dialogue generation but focus predominantly on word-to-word or sentence-to-sentence connections, neglecting that dialogue is a holistic process that demands attention to both the broader context and fine details. This limitation impedes the faithful simulation of authentic human communication.

Figure 1: An Example of Real-world Empathetic Dialogue. In the dialogue, emotionally relevant words are highlighted in bold italics, with arrows pointing to concepts with higher emotional intensity. The arrows originating from the boxes represent the coarse-grained knowledge of the conversation.

In a scenario resembling real human communication, as illustrated in Figure 1, it becomes evident that the generation of



responses in the human brain results from the combined influence of conversational context and individual cognition. While terms like "school" and "making new friends" may not have been explicitly mentioned in the conversation history, the responder, upon receiving the conversational input, connects their personal knowledge both with the conversation as a whole and with specific words. This process not only deepens the responder's comprehension of the interlocutor's emotions but also, under the joint influence of cognition and emotion, leads to high-quality responses. Furthermore, the way in which humans link the content of a conversation to their personal knowledge is multidimensional: it considers the overall conversation and its individual components simultaneously. This matches the pattern humans follow when encountering new information, starting with a surface-level understanding, then going deeper, and finally summarizing and integrating what they have found. Drawing on real-world scenarios such as the example above, we posit that during a dialogue humans must establish associations between the overall conversation history, specific words within it, and their personal knowledge. This process is highly relevant to improving empathetic dialogue systems.

To achieve this goal, we propose a mixed knowledge-enhanced approach for empathetic dialogue generation, which we term MKEMP. MKEMP uses two processes, coarse-grained knowledge acquisition and fine-grained knowledge acquisition, to extract information related to the overall conversation and to specific words, respectively. This improves the model's comprehension of the user's emotional state and current situation. External knowledge, provided to the model as supplementary information, gives the dialogue system stronger cognitive capabilities and enables it to generate more empathetic responses. Additionally, we employ a multi-task learning framework to jointly optimize our objectives. To assess the model's performance, we conduct extensive experiments on the benchmark dataset EmpatheticDialogues and compare MKEMP with five state-of-the-art empathetic dialogue models through both automatic and human evaluations. Our results demonstrate that MKEMP produces responses that are richer in information and more empathetic. In summary, our primary contributions are as follows:
(1) We introduce the MKEMP method, which can perceive implicit emotions and express them appropriately. MKEMP uses external knowledge at both coarse-grained and fine-grained scales to enhance empathetic expression in dialogues. To our knowledge, this is the first attempt to enhance empathetic dialogue generation across multiple scales using external knowledge.
(2) We design emotion commonsense refinement encoders and cognition commonsense refinement encoders separately for the emotional and cognitive aspects, establishing connections between external knowledge at both scales and the dialogue and making effective use of rich external knowledge.
(3) We conduct extensive experiments and analyze the effectiveness of MKEMP from both automatic and human evaluation perspectives.

II. METHODOLOGY
The backbone of MKEMP is based on the Transformer model, and its structure is illustrated in Figure 2. MKEMP consists of five stages: fine-grained knowledge acquisition, coarse-grained knowledge acquisition, context refinement, knowledge selection, and response generation. In our task, each conversation history composed of $N$ utterances is represented as $X = [u_1, u_2, u_3, \dots, u_N]$, where the $i$-th utterance $u_i = [t_{i1}, t_{i2}, t_{i3}, \dots, t_{iM_i}]$ is a sequence of $M_i$ words.

A. Fine-grained Knowledge Acquisition
During this stage, we employ commonsense knowledge from ConceptNet and the emotional lexicon NRC_VAD as external knowledge bases to construct knowledge context graphs. ConceptNet is a widely recognized large-scale knowledge graph comprising 5.9 million tuples that cover extensive knowledge related to human activities. NRC_VAD is a three-dimensional word lexicon extensively used in psychology, offering emotional descriptions for 20k English words along the dimensions of Valence, Arousal, and Dominance.

Context Graph
Initially, we flatten the entire conversation history into one long word sequence and insert a [CLS] token at its beginning, giving $H = [\mathrm{CLS}, x_1, x_2, \dots, x_m]$. We then construct a knowledge context graph in four steps:
(1) For each non-stop word $x_i$ in the conversation history $H$, we independently query ConceptNet for a set of candidate tuples $T_i = \{(x_i, r_i^k, c_i^k, s_i^k)\}_{k=1,\dots,K}$, where $r_i^k$ is the relation, $c_i^k$ the related concept, and $s_i^k$ the confidence score.
(2) From this pool of candidates, we retain only tuples whose relations are pertinent to empathy and whose confidence score $s_i^k$ exceeds 0.1. This yields a subset $\hat{T}_i \subseteq T_i$.
(3) For every word $x_i$, we rank the concepts $c_i^k$ in $\hat{T}_i$ by their emotional intensity according to NRC_VAD and select the top $K'$ tuples to compose a knowledge subgraph.
(4) On this subgraph, we use three types of directed edges to connect vertices: (a) temporal edges between adjacent words; (b) knowledge edges between a word $x_i$ and its retrieved concepts $c_i^k$; (c) global edges between the [CLS] token at the beginning of the conversation history and all other vertices.

Context Graph Encoding
To prepare the data in a format suitable for the encoder, we use a word embedding layer and a positional embedding layer to transform each vertex $v_i \in V$ of the knowledge context graph into vectors $E_w(v_i) \in \mathbb{R}^d$ and $E_p(v_i) \in \mathbb{R}^d$, where $d$ is the embedding dimension. To let the model selectively capture information from the conversation history and from external knowledge, we introduce a vertex state embedding $E_v(v_i) \in \mathbb{R}^d$ that marks the source of $v_i$. The vector representation of vertex $v_i$ is the sum of the three embeddings:

$$v_i = E_w(v_i) + E_p(v_i) + E_v(v_i) \tag{1}$$
Figure 2: Overview of our model.
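The four-step graph construction described under "Context Graph" can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy ConceptNet and NRC_VAD entries are invented, the set of empathy-relevant relations is not listed in the text, and the emotional intensity score (distance of valence and halved arousal from a neutral point) is one plausible choice rather than the formula used by MKEMP.

```python
import math

# Toy data invented for illustration; not taken from ConceptNet or NRC_VAD.
STOP_WORDS = {"i", "my", "the", "a", "to"}
# Assumption: the text does not list which relations count as empathy-relevant.
EMPATHY_RELATIONS = {"RelatedTo", "Causes"}

# word -> list of (relation, concept, confidence score)
TOY_CONCEPTNET = {
    "failed": [("RelatedTo", "disappointment", 0.7), ("Antonym", "passed", 0.9)],
    "exam": [("RelatedTo", "stress", 0.8), ("RelatedTo", "paper", 0.6),
             ("Causes", "anxiety", 0.05)],
}

# concept -> (valence, arousal, dominance), each in [0, 1]
TOY_VAD = {
    "disappointment": (0.1, 0.6, 0.2),
    "stress": (0.1, 0.9, 0.3),
    "paper": (0.5, 0.2, 0.5),
    "anxiety": (0.1, 0.9, 0.2),
}


def emotional_intensity(concept):
    """Assumed score: distance of (valence, arousal / 2) from the neutral point (0.5, 0)."""
    valence, arousal, _ = TOY_VAD.get(concept, (0.5, 0.0, 0.5))
    return math.hypot(valence - 0.5, arousal / 2)


def build_context_graph(utterances, top_k=1, min_confidence=0.1):
    """Return (vertices, edges); each edge is (source index, target index, type)."""
    # Flatten the history and put [CLS] in front.
    words = [w for u in utterances for w in u.lower().split()]
    vertices = ["[CLS]"] + words
    edges = []

    # (a) Temporal edges between adjacent words.
    for i in range(1, len(words)):
        edges.append((i, i + 1, "temporal"))

    # Steps (1)-(3): query, filter, rank, and keep the top-K' concepts per word.
    for i, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue
        candidates = TOY_CONCEPTNET.get(word, [])
        kept = [t for t in candidates
                if t[0] in EMPATHY_RELATIONS and t[2] > min_confidence]
        kept.sort(key=lambda t: emotional_intensity(t[1]), reverse=True)
        for _, concept, _ in kept[:top_k]:
            vertices.append(concept)
            # (b) Knowledge edge from the word to its concept.
            edges.append((i, len(vertices) - 1, "knowledge"))

    # (c) Global edges from [CLS] to every other vertex.
    for j in range(1, len(vertices)):
        edges.append((0, j, "global"))
    return vertices, edges


vertices, edges = build_context_graph(["I failed my exam"])
print(vertices)
```

In this toy run, "anxiety" is dropped by the confidence threshold and "passed" by the relation filter, so only the most emotionally intense surviving concept per word joins the graph.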

Subsequently, to update the vector representations of the vertices in the graph, we apply a multi-head graph attention mechanism [5]. Specifically, each vertex $v_i$ attends to all of its neighboring vertices $\{v_j\}_{j \in A_i}$, which contextualizes the vertices with respect to one another:

$$\hat{v}_i = v_i + \Big\Vert_{n=1}^{H} \sum_{j \in A_i} \alpha_{ij}^{n} W_v^{n} v_j, \qquad \alpha_{ij}^{n} = a^{n}(v_i, v_j) \tag{2}$$

Here, $\Vert$ denotes the concatenation of the $H$ attention heads, $A_i$ denotes the neighbors of vertex $v_i$ in the adjacency matrix $A$, and $a^{n}$ is the attention mechanism of the $n$-th head, which produces the attention weight $\alpha_{ij}^{n}$.

The operations above concern only local context, that is, neighboring vertices. Building on this, we use Transformer layers to update the vertex representations with global context information covering all vertices. Specifically, the following operations are applied to all vertices $\{v_i\}_{i=1,\dots,m}$:

$$h_i^{l} = \mathrm{LayerNorm}\big(v_i^{l-1} + \mathrm{MultiAtt}(v_i^{l-1})\big) \tag{3}$$

$$v_i^{l} = \mathrm{LayerNorm}\big(h_i^{l} + \mathrm{FFN}(h_i^{l})\big) \tag{4}$$

Here, LayerNorm is the layer normalization commonly used in deep neural networks, MultiAtt is the multi-head self-attention sublayer with $H$ attention heads, and FFN is a two-layer feed-forward network with ReLU as the activation function. The hidden representation of the knowledge context graph $G$ is denoted as $h_G = \{v_i\}_{i=1,\dots,m}$, where $v_i$ is the output $v_i^{l}$ of the last layer.

B. Coarse-grained Knowledge Acquisition
Unlike fine-grained knowledge acquisition, which focuses on specific words in the dialogue history, coarse-grained knowledge acquisition considers the conversation history as a whole. For this task we employ COMET, a commonsense reasoning model built on BART. COMET can output six commonsense reasoning relations for each event: xReact, xWant, xNeed, xIntent, xEffect, and xAttr. Because the speaker's characteristic attributes have limited relevance to empathy, we exclude xAttr and use only the other five relations.

To connect the overall conversation history with external knowledge, for the input sequence $C$ we append each relation token ([xReact], [xWant], [xNeed], [xIntent], [xEffect]) to the last utterance of the dialogue history. From this combination of dialogue history and relation, COMET generates the corresponding commonsense inferences $cs_1, cs_2, \dots, cs_5$. We then concatenate the inferences generated by COMET to obtain a commonsense sequence $CS = cs_1 \oplus cs_2 \oplus \dots \oplus cs_5$.

To help the model learn the emotions and knowledge implied by the entire conversation history more accurately, we divide the five relations into two types: emotional relations and cognitive relations. We observe that for the xReact relation COMET often generates emotion-related vocabulary (e.g., sad, happy, angry). We therefore treat xReact as an emotional relation and the remaining four relations as cognitive relations. Based on this categorization, the sequences of the two types are fed separately into the affective encoder and the cognitive encoder:

$$E_{emo} = \mathrm{Encoder}_{aff}(CS_{emo}) \tag{5}$$

$$E_r = \mathrm{Encoder}_{cog}(CS_r) \tag{6}$$
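The relation handling in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not call COMET, the function names `build_relation_queries` and `group_inferences` are hypothetical, the inference strings are hand-written stand-ins for COMET outputs, and joining inferences with a space is an assumed reading of the $\oplus$ concatenation.

```python
EMOTIONAL_RELATIONS = ("xReact",)
COGNITIVE_RELATIONS = ("xWant", "xNeed", "xIntent", "xEffect")


def build_relation_queries(history):
    """Append each relation token to the last utterance of the dialogue history."""
    last_utterance = history[-1]
    return {rel: f"{last_utterance} [{rel}]"
            for rel in EMOTIONAL_RELATIONS + COGNITIVE_RELATIONS}


def group_inferences(inferences):
    """Concatenate the inferences of each relation and split them into two groups."""
    joined = {rel: " ".join(texts) for rel, texts in inferences.items()}
    emotional = {rel: joined[rel] for rel in EMOTIONAL_RELATIONS}
    cognitive = {rel: joined[rel] for rel in COGNITIVE_RELATIONS}
    return emotional, cognitive


history = ["I have an exam tomorrow.", "I failed my exam."]
queries = build_relation_queries(history)

# Hypothetical COMET outputs, hand-written for illustration only.
toy_inferences = {
    "xReact": ["sad", "disappointed"],
    "xWant": ["to study harder", "to retake the exam"],
    "xNeed": ["to take the exam"],
    "xIntent": ["to pass the course"],
    "xEffect": ["gets a bad grade"],
}
emotional, cognitive = group_inferences(toy_inferences)
print(queries["xReact"])
print(emotional, cognitive)
```

The emotional group would then go to the affective encoder and each cognitive sequence to the cognitive encoder; xAttr never appears because it is excluded from both relation tuples.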
In Eqs. (5) and (6), $E_{emo} \in \mathbb{R}^{l_{xReact} \times d}$ and $E_r \in \mathbb{R}^{l_r \times d}$, where $l_{xReact}$ and $l_r$ are the lengths of the commonsense reasoning sequences and $r \in \{\text{xWant}, \text{xNeed}, \text{xIntent}, \text{xEffect}\}$.

Because xReact is an emotional relation, its commonsense inferences are typically single words, while the inferences for the four cognitive relations typically take the form of sentences. We therefore handle their hidden representations differently. For the affective relation we use the average hidden representation of the sequence, and for the cognitive relations we use the hidden representation of the special token [CLS]:

$$h_{emo} = \mathrm{Average}(E_{emo}) \tag{7}$$

$$h_r = E_r[0] \tag{8}$$

where $h_{emo}, h_r \in \mathbb{R}^d$.

C. Context Refinement
On one hand, the knowledge extracted at the two scales needs to be connected. On the other hand, the multifaceted and diverse nature of this knowledge makes it difficult for the model to use it effectively. We therefore further refine the context enriched with external knowledge. Similar to Majumder et al. [6], we concatenate the representations of all relations (Equations 7 and 8) with the token-level representations of the knowledge context graph:

$$U_{emo}[i] = h_G[i] \oplus h_{emo} \tag{9}$$

$$U_r[i] = h_G[i] \oplus h_r \tag{10}$$

Unlike a connection at the sequence level, this token-level connection lets the model focus on the relationships between individual words and thereby acquire valuable additional knowledge.

Subsequently, in a manner similar to coarse-grained knowledge acquisition, we pass these context representations through the emotional context refinement encoder and the cognitive context refinement encoder:

$$H_{emo} = \mathrm{Encoder}_{ref\text{-}emo}(U_{emo}) \tag{11}$$

$$H_{cog,r} = \mathrm{Encoder}_{ref\text{-}cog}(U_r) \tag{12}$$

where $H_{emo}, H_{cog,r} \in \mathbb{R}^{m \times d}$.

Emotion Classification
Each dialogue is annotated with an emotion label $e^*$ that describes the genuine emotion expressed. To generate empathetic responses that convey emotions accurately, we use the hidden representation of the [CLS] token in the emotion-refined context representation $H_{emo}$ for emotion classification, which lets the model predict the speaker's emotion more accurately:

$$\tilde{h}_{emo} = H_{emo}[0] \tag{13}$$

With $\tilde{h}_{emo} \in \mathbb{R}^d$, we apply a linear layer followed by a softmax to obtain the emotion category distribution $P_{emo} \in \mathbb{R}^e$, where $e$ is the number of emotion categories:

$$P_{emo} = \mathrm{softmax}(W_e \tilde{h}_{emo}) \tag{14}$$

Here $W_e \in \mathbb{R}^{d \times e}$ is the weight matrix of the linear layer. During training, we optimize these weights by minimizing the cross-entropy (CE) loss between the emotion category distribution $P_{emo}$ and the ground truth label $e^*$:

$$L_{emo} = -\log\big(P_{emo}(e^*)\big) \tag{15}$$

D. Knowledge Selection
Inferring commonsense relations within coarse-grained knowledge is intricate, and the reasoning underlying real human communication is equally complex. This often leads the model to overemphasize either emotions or the conversational context while neglecting the other. Such an imbalance can produce empathetic responses that deviate from reality. Our goal is to guide the model to use this knowledge judiciously. To this end, we concatenate, at the token level, the refined contexts corresponding to the five commonsense reasoning relations:

$$H_{cog}[i] = \Big\Vert_{j=1}^{4} H_{cog,r_j}[i] \tag{16}$$

$$H_{refine}[i] = H_{emo}[i] \oplus H_{cog}[i] \tag{17}$$

where $H_{refine} \in \mathbb{R}^{m \times 5d}$. To assess how strongly the refined context of each relation should influence response generation, and thus to emphasize the more significant features, we compute importance scores with the sigmoid function. We then multiply $H_{refine}$ by these importance scores and feed the result into a multi-layer perceptron (MLP) with ReLU as the activation function. This yields a context representation that fuses the knowledge of the different relations:

$$H_{CTX} = \mathrm{MLP}\big(\sigma(H_{refine}) \odot H_{refine}\big) \tag{18}$$

where $H_{CTX} \in \mathbb{R}^{m \times d}$ and $\odot$ denotes element-wise multiplication.

E. Response Generator
We feed the hybrid context representation $H_{CTX}$ into the decoder as a sequence. The decoder generates the target response $Y = [y_1, y_2, \dots, y_T]$, comprising $T$ words:

$$P(y_t \mid y_{<t}, C) = \mathrm{Decoder}(E_{y_{<t}}, H_{CTX}) \tag{19}$$

where $E_{y_{<t}}$ denotes the embeddings of the tokens that have already been generated. The negative log-likelihood of the target response $Y$ is adopted as the generation loss:

$$L_{nll} = -\sum_{t=1}^{T} \log P(y_t \mid C, y_{<t}) \tag{20}$$

In our initial experiments, we observed that using a combination of generation loss and emotion loss as the final loss function tends to make the model generate similar responses (e.g., "I'm sorry to hear that"). We attribute this to the emotionally rich expressions in the training dataset, which predispose the model towards emotionally congruent outputs. This often leads to the inclusion of more complex information in the conversation. The model has difficulty integrating dialogue emotions and context, resulting in emotionally accurate responses that may sometimes diverge from the conversation's content but remain contextually
appropriate. These responses often lack a deep understanding of the specific context. While this tendency may reduce the generation of responses unrelated to the conversation topic, it does not align with the objectives of dialogue system research.

Inspired by Jiang et al. [7], we address this issue by introducing an additional loss function, the Frequency-Aware Cross-Entropy loss, which penalizes frequently occurring tokens through weighting. Before processing each batch of samples, we calculate the relative frequency $R_i$ of each token $c_i$ in the training corpus and derive the corresponding frequency-based weight $w_i$:

$$R_i = \frac{\mathrm{freq}(c_i)}{\sum_{j=1}^{V} \mathrm{freq}(c_j)} \tag{21}$$

$$w_i = a \times R_i + 1 \tag{22}$$

Here, $V$ is the size of the vocabulary, and $a = -\big(\max_{1 \le j \le V} R_j\big)^{-1}$ is the frequency slope. The bias term of 1 ensures that $w_i$ falls within the range [0, 1]. Consequently, tokens with higher frequencies in the corpus receive lower weights. Finally, we normalize $w_i$ to have a mean of 1 and calculate the diversity loss:

$$L_{div} = -\sum_{t=1}^{T} \sum_{i=1}^{V} w_i \, \delta_t(c_i) \log P(c_i \mid y_{<t}, C) \tag{23}$$

$$\delta_t(c_i) = \begin{cases} 1, & c_i = y_t \\ 0, & \text{otherwise} \end{cases} \tag{24}$$

where $c_i$ denotes a candidate token and $\delta_t(c_i)$ is an indicator function. Finally, we use multi-task learning to jointly optimize the generation loss, emotion loss, and diversity loss:

$$L = \lambda_1 L_{nll} + \lambda_2 L_{emo} + \lambda_3 L_{div} \tag{25}$$

Here, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters. In our experiments, we observed that setting them to commonly used values, such as $\lambda_1 = 1$, $\lambda_2 = 2$, and $\lambda_3 = 1$, still left some tendency to generate similar answers. We therefore believe that a higher weight on the diversity loss is needed to address this issue. Across multiple experiments, we found that $\lambda_1 = 1$, $\lambda_2 = 2$, and $\lambda_3 = 1.5$ yields good results.

III. EXPERIMENTS
A. Implementation Details
The EmpatheticDialogues dataset was split in an 8:1:1 ratio for our experiments. All models were implemented in PyTorch, and word embeddings were initialized with pre-trained GloVe vectors. The learning rate was initialized to $2 \times 10^{-5}$, with a linear warm-up over 4000 steps. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Furthermore, we limited the number of external concepts introduced to at most 10 per dialogue and 5 per token. The model was trained on an RTX A5000 GPU. During inference, we imposed a maximum of 30 decoding steps.

B. Baselines
Transformer: A fully attention-based Seq2Seq model trained with Maximum Likelihood Estimation (MLE) loss.
MoEL: A Transformer-based model that discerns emotional states through context encoding and generates responses for each emotion using multiple decoders [8].
MIME: Another Transformer-based model that uses polarity-based emotion clusters and emotion mimicry to create empathetic responses [6].
EmpDG: A multi-scale adversarial model built on the principles of Wasserstein Generative Adversarial Networks (WGAN). It captures nuanced human emotions by employing both coarse-grained dialogue-level and fine-grained token-level emotion modeling to produce empathetic dialogues [9].
CEM: A model that generates commonsense knowledge using pre-trained models and enhances empathetic response generation by incorporating it as supplementary information [4].

C. Results
To validate the model's performance, we used both a set of automatic evaluation metrics and a set of human evaluation metrics.

Models        Acc     PPL↓    Dist-1  Dist-2  Empathy  Relevance  Fluency
Transformer   -       37.16   0.46    2.02    3.13     3.51       3.67
MoEL          32.43   36.92   0.44    2.10    3.37     3.72       3.61
MIME          34.19   37.09   0.49    1.98    3.34     3.69       3.63
EmpDG         34.25   37.29   0.50    2.07    3.39     3.74       3.68
CEM           39.11   36.11   0.63    2.69    3.43     3.80       3.66
MKEMP         39.84   35.23   0.67    2.82    3.54     3.87       3.66
Table 1: Evaluation results of all models.

Models                  Win     Loss    Tie
MKEMP vs Transformer    45.6%   17.1%   37.3%
MKEMP vs MoEL           38.1%   18.5%   43.4%
MKEMP vs MIME           36.7%   18.9%   44.4%
MKEMP vs EmpDG          36.2%   20.3%   43.5%
MKEMP vs CEM            34.8%   25.5%   39.7%
Table 2: Result of human A/B test.
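The Dist-1 and Dist-2 columns in the evaluation tables report the Distinct-n metric. As a minimal sketch, Distinct-n is commonly computed as the number of unique n-grams divided by the total number of n-grams over all generated responses. Whitespace tokenization, lowercasing, and pooling over the whole set of responses are assumptions made here; the evaluation script behind the reported numbers is not described in the text.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams across a list of response strings."""
    unique_ngrams = set()
    total = 0
    for response in responses:
        tokens = response.lower().split()  # assumed tokenization
        for i in range(len(tokens) - n + 1):
            unique_ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique_ngrams) / total if total else 0.0


responses = ["I am sorry to hear that", "I am so happy for you"]
print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

A model that repeats a generic reply such as "I'm sorry to hear that" for every input would contribute few new n-grams per response and therefore score low on this metric.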
Additionally, to verify the effectiveness of each component within the MKEMP framework, we conducted a series of ablation experiments.

Automatic evaluation
Following prior research, we adopt Perplexity (PPL), Distinct-1 (Dist-1), and Distinct-2 (Dist-2) as the primary automatic evaluation metrics. PPL is a commonly used metric for overall generation quality. It reflects the model's confidence over the candidate response set, with higher confidence resulting in lower PPL. Distinct-n measures generation diversity as the proportion of distinct n-grams among all n-grams in the generated results. Higher values of Dist-1 and Dist-2 indicate greater diversity. We also use Emotion Prediction Accuracy (Acc) as a metric. A higher Acc indicates better emotion prediction.

As shown in Table 1, MKEMP outperforms the other models on all four automatic evaluation metrics. This suggests that MKEMP generates responses of higher overall quality and greater diversity. The notable improvement in Dist-2 can be attributed to the introduction of multi-scale knowledge, which lets the model generate diverse responses based on more abundant information.

Human evaluation
To evaluate the model's performance more objectively, we conducted manual evaluations. We enlisted three professional data annotators from a crowdsourcing company to assess 200 randomly selected sets of dialogues. They were given two tasks:
(1) Rating the responses generated by the various models on a scale of 1, 3, and 5 according to three predefined criteria: fluency, relevance, and empathy [10].
(2) For a given scenario, comparing the response from each baseline model with the response from MKEMP and selecting the more suitable of the two. The model selected more often was deemed to perform better.

The scoring results of the manual evaluation are presented in Table 1. MKEMP shows a clear advantage in empathy and relevance. This indicates that our model improves the consistency of generated responses with the conversation's topic by capturing multi-scale commonsense information. Furthermore, rich commonsense knowledge helps the model better comprehend the emotional context of the conversation, even in complex contexts. On the other hand, MKEMP's fluency differs only slightly from that of the baseline models. We attribute this to the strong performance of the Transformer, which gives all of these Transformer-based models good fluency. The A/B testing results of the manual evaluation are shown in Table 2 and confirm that, from the perspective of human evaluators, MKEMP performs better overall.

Ablation study
We created four variants of MKEMP by removing specific components from the model:
(1) w/o G: removing the part that extracts fine-grained knowledge by building a knowledge context graph.
(2) w/o W: removing the part that uses COMET to extract coarse-grained knowledge.
(3) w/o D: excluding the diversity loss from the loss function.
(4) w/o R: omitting the emotional context refinement encoder and the cognitive context refinement encoder.

Models   Acc     PPL↓    Dist-1  Dist-2
MKEMP    39.84   35.23   0.67    2.82
w/o G    38.86   35.03   0.62    2.64
w/o W    37.88   35.77   0.53    2.49
w/o D    39.51   34.97   0.41    2.27
w/o R    34.69   36.05   0.54    2.55
Table 3: Result of ablation study.

The results of the ablation experiments are presented in Table 3. All four variants perform worse than MKEMP on most metrics (the exceptions are the slightly lower PPL of w/o G and w/o D), which supports the effectiveness of each component in our model.

IV. CONCLUSION
In this paper, we propose a Mixed Knowledge-Enhanced Empathetic Dialogue Generation Model (MKEMP) that generates high-quality empathetic responses by incorporating multi-scale external knowledge to capture rich emotional states. Extensive automatic and human evaluations demonstrate the effectiveness of our approach in empathetic response generation. We hope that this work will inspire future research to move from single-source information to multi-scale information in similar tasks.

REFERENCES
[1] Zandie, R., & Mahoor, M. H. (2020). Emptransfo: A multi-head transformer architecture for creating empathetic dialog systems. arXiv preprint arXiv:2003.02958.
[2] Welivita, A., & Pu, P. (2020). A taxonomy of empathetic response intents in human social conversations. arXiv preprint arXiv:2012.04080.
[3] Zheng, C., Liu, Y., Chen, W., Leng, Y., & Huang, M. (2021). Comae: A multi-factor hierarchical framework for empathetic response generation. arXiv preprint arXiv:2105.08316.
[4] Sabour, S., Zheng, C., & Huang, M. (2022, June). Cem: Commonsense-aware empathetic response generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11229-11237).
[5] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
[6] Majumder, N., Hong, P., Peng, S., Lu, J., Ghosal, D., Gelbukh, A., ... & Poria, S. (2020). MIME: MIMicking emotions for empathetic response generation. arXiv preprint arXiv:2010.01454.
[7] Jiang, S., Ren, P., Monz, C., & de Rijke, M. (2019, May). Improving neural response diversity with frequency-aware cross-entropy loss. In The World Wide Web Conference (pp. 2879-2885).
[8] Lin, Z., Madotto, A., Shin, J., Xu, P., & Fung, P. (2019). Moel: Mixture of empathetic listeners. arXiv preprint arXiv:1908.07687.
[9] Li, Q., Chen, H., Ren, Z., Ren, P., Tu, Z., & Chen, Z. (2019). EmpDG: Multiresolution interactive empathetic dialogue generation. arXiv preprint arXiv:1911.08698.
[10] Rashkin, H., Smith, E. M., Li, M., & Boureau, Y. L. (2018). Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
