TelME: Teacher-Leading Multimodal Fusion Network for Emotion Recognition in Conversation
Abstract
Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests.
3.2.4 Attention-based Modality Shifting Fusion

The emotional features from the enhanced student networks can influence the teacher model's emotion-relevant representations by providing information that may not be captured from the text. To fully utilize these features, we adopt a multimodal fusion approach in which feature vectors from the student models manipulate the representation vectors of the teacher, effectively incorporating non-verbal information into the representation vector. To highlight non-verbal characteristics, we concatenate the vectors of the student models and perform multi-head self-attention. The non-verbal vectors generated through the multi-head self-attention and the emotional features of the teacher encoder serve as the input of the shifting step (Figure 4). The shifting step is inspired by Rahman et al. (2020). In the shifting step, a gating vector is generated by concatenating and transforming the vector of the teacher model and the vector of the non-verbal information:

g_{AV}^{k} = R(W_1 \cdot \langle F_T^{k}, F_{attention}^{k} \rangle + b_1) \qquad (16)

where \langle \cdot, \cdot \rangle denotes vector concatenation, R(x) is a non-linear activation function, W_1 is the weight matrix of the linear transform, and b_1 is a scalar bias. F_{attention}^{k} is the emotional representation vector of the non-verbal information, and g_{AV}^{k} is the gating vector. The gating vector highlights the relevant information in the non-verbal vector according to the representations of the teacher model. We then define the displacement vector by applying the gating vector as follows:

H^{k} = g_{AV}^{k} \cdot (W_2 \cdot F_{attention}^{k} + b_2) \qquad (17)

where W_2 is the weight matrix of the linear transform and b_2 is a scalar bias; H^{k} is the non-verbal information-based displacement vector.

We subsequently take the weighted sum of the teacher's representation vector and the displacement vector to generate a multimodal vector, and finally predict emotions from the multimodal vector:

Z^{k} = F_T^{k} + \lambda \cdot H^{k} \qquad (18)

\lambda = \min\!\left(\frac{\lVert F^{k} \rVert_2}{\lVert H^{k} \rVert_2} \cdot \theta,\; 1\right) \qquad (19)

where Z^{k} is the multimodal vector. The scaling factor \lambda controls the magnitude of the displacement vector, with \theta a threshold hyperparameter; \lVert F^{k} \rVert_2 and \lVert H^{k} \rVert_2 denote the L2 norms of the F^{k} and H^{k} vectors, respectively.
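For concreteness, the shifting step can be written as a short PyTorch-style sketch of Equations (16)-(19). It is illustrative rather than a reference implementation: the hidden size, the use of ReLU for R(·), the stacking of the audio and visual student vectors as a two-token sequence, and the mean pooling after self-attention are assumptions not taken from the paper.

```python
import torch
import torch.nn as nn


class AttentionShiftingFusion(nn.Module):
    """Sketch of the Attention-based modality Shifting Fusion (Eqs. 16-19).

    The hidden size, ReLU as R(.), the two-token stacking of the student
    vectors, and the mean pooling after self-attention are assumptions.
    """

    def __init__(self, dim: int = 768, num_heads: int = 4, theta: float = 0.01):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)   # W1, b1 in Eq. (16)
        self.displace = nn.Linear(dim, dim)   # W2, b2 in Eq. (17)
        self.theta = theta                    # threshold hyperparameter in Eq. (19)

    def forward(self, f_text, f_audio, f_visual):
        # Highlight non-verbal characteristics with multi-head self-attention
        # over the two student vectors, then pool them into F_attention.
        nonverbal = torch.stack([f_audio, f_visual], dim=1)            # (B, 2, D)
        attn_out, _ = self.self_attn(nonverbal, nonverbal, nonverbal)  # (B, 2, D)
        f_attention = attn_out.mean(dim=1)                             # (B, D)

        # Eq. (16): gating vector from the teacher and non-verbal features.
        g = torch.relu(self.gate(torch.cat([f_text, f_attention], dim=-1)))

        # Eq. (17): non-verbal displacement vector.
        h = g * self.displace(f_attention)

        # Eq. (19): scaling factor that limits the magnitude of the shift
        # (small epsilon added for numerical safety).
        lam = torch.clamp(
            self.theta * f_text.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + 1e-8),
            max=1.0,
        )

        # Eq. (18): shift the teacher representation into the multimodal vector.
        return f_text + lam * h
```

An emotion classifier head would then be applied to the returned multimodal vector. Because \lambda is scaled by \theta and capped at 1, the non-verbal displacement can only nudge, not overwrite, the teacher's text representation.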
4 Experiments

4.1 Datasets

We evaluate our proposed network on MELD (Poria et al., 2018) and IEMOCAP (Busso et al., 2008), following other works on ERC listed in Appendix A.3. The statistics are shown in Table 1.

Table 1: Statistics of the two benchmark datasets.

MELD is a multi-party dataset comprising over 1,400 dialogues and over 13,000 utterances extracted from the TV series Friends. Each utterance is annotated with one of seven emotion categories: neutral, surprise, fear, sadness, joy, disgust, and anger.

IEMOCAP consists of 7,433 utterances and 151 dialogues in 5 sessions, each involving two speakers. Each utterance is labeled with one of six emotion categories: happy, sad, angry, excited, frustrated, and neutral. The training and development sets consist of the first four sessions, randomly split at a 9:1 ratio; the test set consists of the last session.

We purposely exclude CMU-MOSEI (Zadeh et al., 2018), a well-known multimodal sentiment analysis dataset, as it comprises single-speaker videos and is not suitable for ERC, where emotions change dynamically within each conversation turn.

4.2 Experiment Settings

We evaluate all experiments with the weighted-average F1 score, since both datasets are class-imbalanced. We initialize the encoders with pre-trained weights from Huggingface's Transformers (Wolf et al., 2019), and the output dimension of all encoders is unified to 768. The optimizer is AdamW with an initial learning rate of 1e-5, and we use a linear learning-rate schedule with warmup. All experiments are conducted on a single NVIDIA GeForce RTX 3090. More details are given in Appendix A.2.
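As a minimal sketch of this setup, the optimizer, scheduler, and evaluation metric can be assembled from standard PyTorch, Transformers, and scikit-learn utilities; the warmup and total step counts below are placeholders rather than values reported in the paper.

```python
# Minimal sketch of the optimization and evaluation setup described above.
# num_training_steps / num_warmup_steps are placeholders, not paper values.
import torch
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score


def build_optimizer(model, num_training_steps, num_warmup_steps=0):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler


def weighted_f1(y_true, y_pred):
    # Weighted-average F1 is used because both datasets are class-imbalanced.
    return f1_score(y_true, y_pred, average="weighted")
```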
Table 2: Comparison with previous multimodal ERC methods. For MELD we report per-emotion F1 and the overall weighted F1; for IEMOCAP we report the overall weighted F1.

Models                                 Neutral  Surprise  Fear  Sadness  Joy  Disgust  Anger  MELD F1  IEMOCAP F1
DialogueRNN (Majumder et al., 2019) 73.50 49.40 1.20 23.80 50.70 1.70 41.50 57.03 62.75
ConGCN (Zhang et al., 2019) 76.70 50.30 8.70 28.50 53.10 10.60 46.80 59.40 64.18
MMGCN (Hu et al., 2021b) - - - - - - - 58.65 66.22
DialogueTRM (Mao et al., 2021) - - - - - - - 63.50 69.23
DAG-ERC (Shen et al., 2021) - - - - - - - 63.65 68.03
MM-DFN (Hu et al., 2022a) 77.76 50.69 - 22.94 54.78 - 47.82 59.46 68.18
M2FNet (Chudasama et al., 2022) - - - - - - - 66.71 69.86
EmoCaps (Li et al., 2022) 77.12 63.19 3.03 42.52 57.50 7.69 57.54 64.00 71.77
UniMSE (Hu et al., 2022b) - - - - - - - 65.51 70.66
GA2MIF (Li et al., 2023a) 76.92 49.08 - 27.18 51.87 - 48.52 58.94 70.00
FacialMMT (Zheng et al., 2023) 80.13 59.63 19.18 41.99 64.88 18.18 56.00 66.58 -
TelME 80.22 60.33 26.97 43.45 65.67 26.42 56.70 67.37 70.48
4.3 Main Results

We compare TelME with various multimodal ERC methods (described in Appendix A.3) on both datasets in Table 2. TelME achieves robust results on both datasets and state-of-the-art performance on MELD. Specifically, TelME outperforms the previous state-of-the-art method on MELD (M2FNet) by 0.66%, and shows a substantial 3.37% improvement on MELD over EmoCaps, which currently holds state-of-the-art performance on IEMOCAP. Previous methods such as EmoCaps and UniMSE are also effective on IEMOCAP but perform somewhat weaker on MELD.

Table 2 also reports the per-emotion performance of the compared methods on MELD. TelME outperforms the other models on all emotions except Surprise and Anger. However, treating Surprise and Fear, as well as Disgust and Anger, as similar emotions, EmoCaps shows a bias towards Surprise and Anger at inference, achieving only 3.03% and 7.69% F1 for Fear and Disgust, respectively. In contrast, TelME distinguishes these similar emotions better, raising the scores for Fear and Disgust to 26.97% and 26.42%. We speculate that our framework predicts minority emotions more accurately because the non-verbal modality information (e.g., intensity and pitch of an utterance), enhanced through our KD strategy, better assists the teacher in judging these confusable emotions.

Methods        Remarks  IEMOCAP  MELD
Audio          KD       48.11    46.60
Visual         KD       18.85    36.72
Text           -        66.60    66.57
Text + Visual  ASF      67.94    67.05
Text + Audio   ASF      69.26    67.19
TelME                   70.48    67.37

Table 3: Performance comparison of single modalities and multimodal combinations.

4.4 The Impact of Each Modality

Table 3 presents the results for single modalities and multimodal combinations. The single-modality results for audio and visual are obtained after applying our knowledge distillation method, and the same fusion approach as in TelME is used for the dual-modality results. The text modality performs best among the single modalities, which supports our decision to use the text encoder as the teacher model. Additionally, combining the non-verbal modalities with the text modality yields superior performance compared to using text alone. Our findings also indicate that the audio modality contributes considerably more to emotion recognition than the visual modality. We speculate that this is because audio captures the intensity of emotion through variations in the tone and pitch of the speaker. Overall, our method achieves a 3.52% improvement on IEMOCAP and 0.8% on MELD over using text alone.

4.5 Ablation Study

We conduct an ablation study in Table 4 to validate our knowledge distillation and fusion strategies. The initial row for each dataset represents the out-
Dataset   ASF  L_response  L_feature  F1
IEMOCAP   ✗    ✗           ✗          63.33
          ✓    ✗           ✗          68.19
          ✓    ✓           ✗          69.42
          ✓    ✓           ✓          70.48
MELD      ✗    ✗           ✗          67.04
          ✓    ✗           ✗          66.75
          ✓    ✓           ✗          67.23
          ✓    ✓           ✓          67.37

Table 4: Ablation results of TelME on IEMOCAP and MELD (weighted F1).
Limitations
References

Saurabh Gupta, Judy Hoffman, and Jitendra Malik. 2016. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827–2836.

Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018a. ICON: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2594–2604.

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018b. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, volume 2018, page 2122. NIH Public Access.

Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. 2020. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7360–7370.

Wenxiang Jiao, Haiqin Yang, Irwin King, and Michael R. Lyu. 2019. HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition. arXiv preprint arXiv:1904.04446.

Woojeong Jin, Maziar Sanjabi, Shaoliang Nie, Liang Tan, Xiang Ren, and Hamed Firooz. 2021. MSD: Saliency-aware knowledge distillation for multimodal understanding. arXiv preprint arXiv:2101.01881.

Taewoon Kim and Piek Vossen. 2021. EmoBERTa: Speaker-aware emotion recognition in conversation with RoBERTa. arXiv preprint arXiv:2108.12009.

Joosung Lee and Wooin Lee. 2021. CoMPM: Context modeling with speaker's pre-trained memory tracking for emotion recognition in conversation. arXiv preprint arXiv:2108.11626.

Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2023a. GA2MIF: Graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Transactions on Affective Computing.

Jingye Li, Donghong Ji, Fei Li, Meishan Zhang, and Yijiang Liu. 2020. HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4190–4200.

Yong Li, Yuanzhi Wang, and Zhen Cui. 2023b. Decoupled multimodal distilling for emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6631–6640.

Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 2022. EmoCaps: Emotion capsule based model for conversational emotion recognition. arXiv preprint arXiv:2203.13504.

Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2022. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, and Bo Xu. 2023. A transformer-based model with self-distillation for multimodal emotion recognition in conversations. IEEE Transactions on Multimedia.

Yukun Ma, Khanh Linh Nguyen, Frank Z. Xing, and Erik Cambria. 2020. A survey on empathetic dialogue systems. Information Fusion, 64:50–70.

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6818–6825.

Yuzhao Mao, Guang Liu, Xiaojie Wang, Weiguo Gao, and Xuan Li. 2021. DialogueTRM: Exploring multi-modal emotional dynamics in a conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2694–2704.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. IEEE Access, 7:100943–100953.

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 2020, page 2359. NIH Public Access.

Weizhou Shen, Siyue Wu, Yunyi Yang, and Xiaojun Quan. 2021. Directed acyclic graph network for conversational emotion recognition. arXiv preprint arXiv:2105.12907.

Xiaohui Song, Longtao Huang, Hui Xue, and Songlin Hu. 2022a. Supervised prototypical contrastive learning for emotion recognition in conversation. arXiv preprint arXiv:2210.08713.

Xiaohui Song, Liangjun Zang, Rong Zhang, Songlin Hu, and Longtao Huang. 2022b. EmotionFlow: Capture the dialogue level emotion transitions. In ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8542–8546. IEEE.

Vinh Tran, Niranjan Balasubramanian, and Minh Hoai. 2022. From within to between: Knowledge distillation for cross modality retrieval. In Proceedings of the Asian Conference on Computer Vision, pages 3223–3240.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. 2022. The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation. arXiv preprint arXiv:2206.06487.
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.

Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019. Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In IJCAI, pages 5415–5421.

Jiahao Zheng, Sen Zhang, Zilu Wang, Xiaoping Wang, and Zhigang Zeng. 2022. Multi-channel weight-sharing autoencoder based on cascade multi-head attention for multimodal emotion recognition. IEEE Transactions on Multimedia.

Wenjie Zheng, Jianfei Yu, Rui Xia, and Shijin Wang. 2023. A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15445–15459.

Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. Knowledge-enriched transformer for emotion detection in textual conversations. arXiv preprint arXiv:1909.10681.

Lixing Zhu, Gabriele Pergola, Lin Gui, Deyu Zhou, and Yulan He. 2021. Topic-driven and knowledge-aware transformer for dialogue emotion detection. arXiv preprint arXiv:2106.01071.
A Appendix

A.1 Effect of the prompt

                               MELD   IEMOCAP
w/o prompt ([CLS] + context)   65.25  66.48
context + prompt               66.57  66.60

Table 7: Comparison of the teacher performance based on the use of the prompt.

Table 7 shows an ablation experiment on the prompt. We remove the prompt and use the CLS token, and compare the emotion prediction results with those obtained using the prompt. We observe from the results that the prompt helps to infer the emotion of a recent speaker from a set of textual utterances.
A.2 Hyperparameter Settings

Through our KD strategy, the audio and visual encoders are trained with the loss functions given in Equation 6. In L_student, the balancing factors are all set to 1, except α for IEMOCAP. The temperature parameter for L_response is adjusted to 4 for MELD and 2 for IEMOCAP, and the temperature parameter for L_feature is set to 1 regardless of the dataset. We also use a fusion method that shifts vectors in the teacher model, where the threshold parameter is set to 0.01 for IEMOCAP and 0.1 for MELD. Furthermore, Dropout is adjusted to 0.2 for MELD and 0.1 for IEMOCAP. The number of heads used in the multi-head attention process is 4 for IEMOCAP and 3 for MELD.

Hyperparameter                                   IEMOCAP  MELD
Knowledge distillation
  Balance factors for L_student                  α = 0.1  1
  Temperature for L_response                     4        2
  Temperature for L_feature                      1        1
Attention-based modality Shifting Fusion
  Threshold parameter                            0.01     0.1
  Dropout                                        0.2      0.1
  Number of heads for multi-head attention       4        3

Table 8: Hyperparameter settings of TelME on the two datasets.
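To make the role of these temperatures concrete, the snippet below shows a generic temperature-scaled response-distillation term. It is illustrative only: the exact forms of L_student, L_response, and L_feature are defined by Equation 6 in the main text and may differ from this sketch.

```python
# Generic temperature-scaled response distillation, shown only to illustrate
# how the temperature hyperparameters in Table 8 enter the loss; TelME's
# actual losses are defined in Equation 6 and may differ from this sketch.
import torch.nn.functional as F


def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
```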
A.3 Compared Models

We compare TelME against the following models. DialogueRNN (Majumder et al., 2019) employs Recurrent Neural Networks (RNNs) to capture the speaker identity as well as the historical context and the emotions of past utterances, modeling the nuances of conversation dynamics. ConGCN (Zhang et al., 2019) utilizes a Graph Convolutional Network (GCN) to represent relationships within a graph that incorporates both context and speaker information of multiple conversations. MMGCN (Hu et al., 2021b) also proposes a GCN-based approach, but captures representations of a conversation through a graph that contains long-distance flow of information as well as speaker information. DialogueTRM (Mao et al., 2021) focuses on modeling both the local and global context of conversations to capture temporal and spatial dependencies. DAG-ERC (Shen et al., 2021) studies how the conversation background affects information in the surrounding context of a conversation. MM-DFN (Hu et al., 2022a) proposes a framework that aims to enhance the integration of multimodal features through dynamic fusion. EmoCaps (Li et al., 2022) introduces an emotion capsule that fuses information from multiple modalities with emotional tendencies to provide a more nuanced understanding of emotions within a conversation. UniMSE (Hu et al., 2022b) seeks to unify ERC with multimodal sentiment analysis through a T5-based framework. GA2MIF (Li et al., 2023a) introduces a two-stage multimodal fusion of information from a graph and an attention network. FacialMMT (Zheng et al., 2023) focuses on extracting the real speaker's face sequence from multi-party conversation videos and then leverages auxiliary frame-level facial expression recognition tasks to generate emotional visual representations.

A.4 Error Analysis

Seed                 MELD    IEMOCAP
0                    67.27   70.50
1                    67.41   70.69
1234                 67.44   70.21
2023                 67.24   69.95
42                   67.37   70.48
Mean                 67.35   70.37
Standard deviation   0.0781  0.2581

Table 9: Performance of the full framework for five random seeds.
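As a quick sanity check, the mean and standard-deviation rows of Table 9 can be recomputed from the per-seed scores; the reported values appear to use the population standard deviation.

```python
# Reproduce the mean and (population) standard deviation rows of Table 9.
import statistics

scores = {
    "MELD": [67.27, 67.41, 67.44, 67.24, 67.37],
    "IEMOCAP": [70.50, 70.69, 70.21, 69.95, 70.48],
}

for dataset, values in scores.items():
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    # Matches Table 9 up to rounding of the last digit.
    print(f"{dataset}: mean={mean:.2f}, std={std:.4f}")
```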