TelME: Teacher-Leading Multimodal Fusion Network for Emotion Recognition in Conversation
Abstract
Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests.
3.2.4 Attention-based Modality Shifting Fusion

The emotional features from the enhanced student networks can influence the teacher model's emotion-relevant representations by providing information that may not be captured from the text. To fully utilize these features, we adopt a multimodal fusion approach in which feature vectors from the student models manipulate the representation vectors of the teacher, effectively incorporating non-verbal information into the representation vector. To highlight non-verbal characteristics, we concatenate the vectors of the student models and perform multi-head self-attention. The non-verbal vectors generated through the multi-head self-attention and the emotional features of the teacher encoder serve as the input of the shifting step (Figure 4). The shifting step is inspired by Rahman et al. (2020). In the shifting step, a gating vector is generated by concatenating and transforming the vector of the teacher model and the vector of the non-verbal information:

g_{AV}^{k} = R(W_1 \cdot \langle F_T^{k}, F_{attention}^{k} \rangle + b_1) \qquad (16)

where \langle \cdot, \cdot \rangle denotes vector concatenation, R(x) is a non-linear activation function, W_1 is the weight matrix of the linear transform, and b_1 is a scalar bias. F_{attention}^{k} is the emotional representation vector of the non-verbal information, and g_{AV}^{k} is the gating vector. The gating vector highlights the relevant information in the non-verbal vector according to the representations of the teacher model. We then define the displacement vector by applying the gating vector as follows:

H^{k} = g_{AV}^{k} \cdot (W_2 \cdot F_{attention}^{k} + b_2) \qquad (17)

where W_2 is the weight matrix of the linear transform and b_2 is a scalar bias; H^{k} is the non-verbal information-based displacement vector.

We subsequently take the weighted sum of the teacher's representation vector and the displacement vector to generate a multimodal vector, and finally predict emotions from the multimodal vector:

Z^{k} = F_T^{k} + \lambda \cdot H^{k} \qquad (18)

\lambda = \min\!\left(\frac{\lVert F^{k} \rVert_2}{\lVert H^{k} \rVert_2} \cdot \theta,\; 1\right) \qquad (19)

where Z^{k} is the multimodal vector. The scaling factor \lambda controls the magnitude of the displacement vector, with \theta a threshold hyperparameter; \lVert F^{k} \rVert_2 and \lVert H^{k} \rVert_2 denote the L2 norms of the F^{k} and H^{k} vectors, respectively.
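For concreteness, the shifting step can be written as a short PyTorch-style sketch of Equations (16)-(19). It is illustrative rather than a reference implementation: the hidden size, the use of ReLU for R(·), the stacking of the audio and visual student vectors as a two-token sequence, and the mean pooling after self-attention are assumptions not taken from the paper.

```python
import torch
import torch.nn as nn


class AttentionShiftingFusion(nn.Module):
    """Sketch of the Attention-based modality Shifting Fusion (Eqs. 16-19).

    The hidden size, ReLU as R(.), the two-token stacking of the student
    vectors, and the mean pooling after self-attention are assumptions.
    """

    def __init__(self, dim: int = 768, num_heads: int = 4, theta: float = 0.01):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)   # W1, b1 in Eq. (16)
        self.displace = nn.Linear(dim, dim)   # W2, b2 in Eq. (17)
        self.theta = theta                    # threshold hyperparameter in Eq. (19)

    def forward(self, f_text, f_audio, f_visual):
        # Highlight non-verbal characteristics with multi-head self-attention
        # over the two student vectors, then pool them into F_attention.
        nonverbal = torch.stack([f_audio, f_visual], dim=1)            # (B, 2, D)
        attn_out, _ = self.self_attn(nonverbal, nonverbal, nonverbal)  # (B, 2, D)
        f_attention = attn_out.mean(dim=1)                             # (B, D)

        # Eq. (16): gating vector from the teacher and non-verbal features.
        g = torch.relu(self.gate(torch.cat([f_text, f_attention], dim=-1)))

        # Eq. (17): non-verbal displacement vector.
        h = g * self.displace(f_attention)

        # Eq. (19): scaling factor that limits the magnitude of the shift
        # (small epsilon added for numerical safety).
        lam = torch.clamp(
            self.theta * f_text.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + 1e-8),
            max=1.0,
        )

        # Eq. (18): shift the teacher representation into the multimodal vector.
        return f_text + lam * h
```

An emotion classifier head would then be applied to the returned multimodal vector. Because \lambda is scaled by \theta and capped at 1, the non-verbal displacement can only nudge, not overwrite, the teacher's text representation.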
4 Experiments

4.1 Datasets

We evaluate our proposed network on MELD (Poria et al., 2018) and IEMOCAP (Busso et al., 2008), following other works on ERC listed in Appendix A.3. The statistics are shown in Table 1.

Table 1: Statistics of the two benchmark datasets.

MELD is a multi-party dataset comprising over 1,400 dialogues and over 13,000 utterances extracted from the TV series Friends. Each utterance is annotated with one of seven emotion categories: neutral, surprise, fear, sadness, joy, disgust, and anger.

IEMOCAP consists of 7,433 utterances and 151 dialogues in 5 sessions, each involving two speakers. Each utterance is labeled with one of six emotion categories: happy, sad, angry, excited, frustrated, and neutral. The training and development sets consist of the first four sessions, randomly split at a 9:1 ratio; the test set consists of the last session.

We purposely exclude CMU-MOSEI (Zadeh et al., 2018), a well-known multimodal sentiment analysis dataset, as it comprises single-speaker videos and is not suitable for ERC, where emotions change dynamically within each conversation turn.

4.2 Experiment Settings

We evaluate all experiments with the weighted-average F1 score, since both datasets are class-imbalanced. We initialize the encoders with pre-trained weights from Huggingface's Transformers (Wolf et al., 2019), and the output dimension of all encoders is unified to 768. The optimizer is AdamW with an initial learning rate of 1e-5, and we use a linear learning-rate schedule with warmup. All experiments are conducted on a single NVIDIA GeForce RTX 3090. More details are given in Appendix A.2.
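As a minimal sketch of this setup, the optimizer, scheduler, and evaluation metric can be assembled from standard PyTorch, Transformers, and scikit-learn utilities; the warmup and total step counts below are placeholders rather than values reported in the paper.

```python
# Minimal sketch of the optimization and evaluation setup described above.
# num_training_steps / num_warmup_steps are placeholders, not paper values.
import torch
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score


def build_optimizer(model, num_training_steps, num_warmup_steps=0):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler


def weighted_f1(y_true, y_pred):
    # Weighted-average F1 is used because both datasets are class-imbalanced.
    return f1_score(y_true, y_pred, average="weighted")
```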
Table 2: Comparison with previous multimodal ERC methods. For MELD we report per-emotion F1 and the overall weighted F1; for IEMOCAP we report the overall weighted F1.

Models                                 Neutral  Surprise  Fear  Sadness  Joy  Disgust  Anger  MELD F1  IEMOCAP F1
DialogueRNN (Majumder et al., 2019) 73.50 49.40 1.20 23.80 50.70 1.70 41.50 57.03 62.75
ConGCN (Zhang et al., 2019) 76.70 50.30 8.70 28.50 53.10 10.60 46.80 59.40 64.18
MMGCN (Hu et al., 2021b) - - - - - - - 58.65 66.22
DialogueTRM (Mao et al., 2021) - - - - - - - 63.50 69.23
DAG-ERC (Shen et al., 2021) - - - - - - - 63.65 68.03
MM-DFN (Hu et al., 2022a) 77.76 50.69 - 22.94 54.78 - 47.82 59.46 68.18
M2FNet (Chudasama et al., 2022) - - - - - - - 66.71 69.86
EmoCaps (Li et al., 2022) 77.12 63.19 3.03 42.52 57.50 7.69 57.54 64.00 71.77
UniMSE (Hu et al., 2022b) - - - - - - - 65.51 70.66
GA2MIF (Li et al., 2023a) 76.92 49.08 - 27.18 51.87 - 48.52 58.94 70.00
FacialMMT (Zheng et al., 2023) 80.13 59.63 19.18 41.99 64.88 18.18 56.00 66.58 -
TelME 80.22 60.33 26.97 43.45 65.67 26.42 56.70 67.37 70.48
4.3 Main Results

We compare TelME with various multimodal ERC methods (described in Appendix A.3) on both datasets in Table 2. TelME achieves robust results on both datasets and state-of-the-art performance on MELD. Specifically, TelME outperforms the previous state-of-the-art method on MELD (M2FNet) by 0.66%, and shows a substantial 3.37% improvement on MELD over EmoCaps, which currently holds state-of-the-art performance on IEMOCAP. Previous methods such as EmoCaps and UniMSE are also effective on IEMOCAP but perform somewhat weaker on MELD.

Table 2 also reports the per-emotion performance of the compared methods on MELD. TelME outperforms the other models on all emotions except Surprise and Anger. However, treating Surprise and Fear, as well as Disgust and Anger, as similar emotions, EmoCaps shows a bias towards Surprise and Anger at inference, achieving only 3.03% and 7.69% F1 for Fear and Disgust, respectively. In contrast, TelME distinguishes these similar emotions better, raising the scores for Fear and Disgust to 26.97% and 26.42%. We speculate that our framework predicts minority emotions more accurately because the non-verbal modality information (e.g., intensity and pitch of an utterance), enhanced through our KD strategy, better assists the teacher in judging these confusable emotions.

Methods        Remarks  IEMOCAP  MELD
Audio          KD       48.11    46.60
Visual         KD       18.85    36.72
Text           -        66.60    66.57
Text + Visual  ASF      67.94    67.05
Text + Audio   ASF      69.26    67.19
TelME                   70.48    67.37

Table 3: Performance comparison of single modalities and multimodal combinations.

4.4 The Impact of Each Modality

Table 3 presents the results for single modalities and multimodal combinations. The single-modality results for audio and visual are obtained after applying our knowledge distillation method, and the same fusion approach as in TelME is used for the dual-modality results. The text modality performs best among the single modalities, which supports our decision to use the text encoder as the teacher model. Additionally, combining the non-verbal modalities with the text modality yields superior performance compared to using text alone. Our findings also indicate that the audio modality contributes considerably more to emotion recognition than the visual modality. We speculate that this is because audio captures the intensity of emotion through variations in the tone and pitch of the speaker. Overall, our method achieves a 3.52% improvement on IEMOCAP and 0.8% on MELD over using text alone.

4.5 Ablation Study

We conduct an ablation study in Table 4 to validate our knowledge distillation and fusion strategies. The initial row for each dataset represents the out-
Dataset   ASF  L_response  L_feature  F1
IEMOCAP   ✗    ✗           ✗          63.33
          ✓    ✗           ✗          68.19
          ✓    ✓           ✗          69.42
          ✓    ✓           ✓          70.48
MELD      ✗    ✗           ✗          67.04
          ✓    ✗           ✗          66.75
          ✓    ✓           ✗          67.23
          ✓    ✓           ✓          67.37

Table 4: Ablation results of TelME on IEMOCAP and MELD (weighted F1).
Limitations
References

Saurabh Gupta, Judy Hoffman, and Jitendra Malik. 2016. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827–2836.

Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018a. ICON: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2594–2604.

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018b. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, volume 2018, page 2122. NIH Public Access.

Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. 2020. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7360–7370.

Wenxiang Jiao, Haiqin Yang, Irwin King, and Michael R. Lyu. 2019. HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition. arXiv preprint arXiv:1904.04446.

Woojeong Jin, Maziar Sanjabi, Shaoliang Nie, Liang Tan, Xiang Ren, and Hamed Firooz. 2021. MSD: Saliency-aware knowledge distillation for multimodal understanding. arXiv preprint arXiv:2101.01881.

Taewoon Kim and Piek Vossen. 2021. EmoBERTa: Speaker-aware emotion recognition in conversation with RoBERTa. arXiv preprint arXiv:2108.12009.

Joosung Lee and Wooin Lee. 2021. CoMPM: Context modeling with speaker's pre-trained memory tracking for emotion recognition in conversation. arXiv preprint arXiv:2108.11626.

Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2023a. GA2MIF: Graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Transactions on Affective Computing.

Jingye Li, Donghong Ji, Fei Li, Meishan Zhang, and Yijiang Liu. 2020. HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4190–4200.

Yong Li, Yuanzhi Wang, and Zhen Cui. 2023b. Decoupled multimodal distilling for emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6631–6640.

Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 2022. EmoCaps: Emotion capsule based model for conversational emotion recognition. arXiv preprint arXiv:2203.13504.

Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2022. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, and Bo Xu. 2023. A transformer-based model with self-distillation for multimodal emotion recognition in conversations. IEEE Transactions on Multimedia.

Yukun Ma, Khanh Linh Nguyen, Frank Z. Xing, and Erik Cambria. 2020. A survey on empathetic dialogue systems. Information Fusion, 64:50–70.

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6818–6825.

Yuzhao Mao, Guang Liu, Xiaojie Wang, Weiguo Gao, and Xuan Li. 2021. DialogueTRM: Exploring multi-modal emotional dynamics in a conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2694–2704.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. IEEE Access, 7:100943–100953.

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 2020, page 2359. NIH Public Access.

Weizhou Shen, Siyue Wu, Yunyi Yang, and Xiaojun Quan. 2021. Directed acyclic graph network for conversational emotion recognition. arXiv preprint arXiv:2105.12907.

Xiaohui Song, Longtao Huang, Hui Xue, and Songlin Hu. 2022a. Supervised prototypical contrastive learning for emotion recognition in conversation. arXiv preprint arXiv:2210.08713.

Xiaohui Song, Liangjun Zang, Rong Zhang, Songlin Hu, and Longtao Huang. 2022b. EmotionFlow: Capture the dialogue level emotion transitions. In ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8542–8546. IEEE.

Vinh Tran, Niranjan Balasubramanian, and Minh Hoai. 2022. From within to between: Knowledge distillation for cross modality retrieval. In Proceedings of the Asian Conference on Computer Vision, pages 3223–3240.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. 2022. The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation. arXiv preprint arXiv:2206.06487.
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.

Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019. Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In IJCAI, pages 5415–5421.

Jiahao Zheng, Sen Zhang, Zilu Wang, Xiaoping Wang, and Zhigang Zeng. 2022. Multi-channel weight-sharing autoencoder based on cascade multi-head attention for multimodal emotion recognition. IEEE Transactions on Multimedia.

Wenjie Zheng, Jianfei Yu, Rui Xia, and Shijin Wang. 2023. A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15445–15459.

Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. Knowledge-enriched transformer for emotion detection in textual conversations. arXiv preprint arXiv:1909.10681.

Lixing Zhu, Gabriele Pergola, Lin Gui, Deyu Zhou, and Yulan He. 2021. Topic-driven and knowledge-aware transformer for dialogue emotion detection. arXiv preprint arXiv:2106.01071.
A Appendix

A.1 Effect of the prompt

                               MELD   IEMOCAP
w/o prompt ([CLS] + context)   65.25  66.48
context + prompt               66.57  66.60

Table 7: Comparison of the teacher performance based on the use of the prompt.

Table 7 shows an ablation experiment on the prompt. We remove the prompt and use the CLS token, and compare the emotion prediction results with those obtained using the prompt. We observe from the results that the prompt helps to infer the emotion of a recent speaker from a set of textual utterances.
A.2 Hyperparameter Settings

Through our KD strategy, the audio and visual encoders are trained with the loss functions given in Equation 6. In L_student, the balancing factors are all set to 1, except α for IEMOCAP. The temperature parameter for L_response is adjusted to 4 for MELD and 2 for IEMOCAP, and the temperature parameter for L_feature is set to 1 regardless of the dataset. We also use a fusion method that shifts vectors in the teacher model, where the threshold parameter is set to 0.01 for IEMOCAP and 0.1 for MELD. Furthermore, Dropout is adjusted to 0.2 for MELD and 0.1 for IEMOCAP. The number of heads used in the multi-head attention process is 4 for IEMOCAP and 3 for MELD.

Hyperparameter                                   IEMOCAP  MELD
Knowledge distillation
  Balance factors for L_student                  α = 0.1  1
  Temperature for L_response                     4        2
  Temperature for L_feature                      1        1
Attention-based modality Shifting Fusion
  Threshold parameter                            0.01     0.1
  Dropout                                        0.2      0.1
  Number of heads for multi-head attention       4        3

Table 8: Hyperparameter settings of TelME on the two datasets.
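To make the role of these temperatures concrete, the snippet below shows a generic temperature-scaled response-distillation term. It is illustrative only: the exact forms of L_student, L_response, and L_feature are defined by Equation 6 in the main text and may differ from this sketch.

```python
# Generic temperature-scaled response distillation, shown only to illustrate
# how the temperature hyperparameters in Table 8 enter the loss; TelME's
# actual losses are defined in Equation 6 and may differ from this sketch.
import torch.nn.functional as F


def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
```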
A.3 Compared Models

We compare TelME against the following models. DialogueRNN (Majumder et al., 2019) employs Recurrent Neural Networks (RNNs) to capture the speaker identity as well as the historical context and the emotions of past utterances, modeling the nuances of conversation dynamics. ConGCN (Zhang et al., 2019) utilizes a Graph Convolutional Network (GCN) to represent relationships within a graph that incorporates both context and speaker information of multiple conversations. MMGCN (Hu et al., 2021b) also proposes a GCN-based approach, but captures representations of a conversation through a graph that contains long-distance flow of information as well as speaker information. DialogueTRM (Mao et al., 2021) focuses on modeling both the local and global context of conversations to capture temporal and spatial dependencies. DAG-ERC (Shen et al., 2021) studies how the conversation background affects information in the surrounding context of a conversation. MM-DFN (Hu et al., 2022a) proposes a framework that aims to enhance the integration of multimodal features through dynamic fusion. EmoCaps (Li et al., 2022) introduces an emotion capsule that fuses information from multiple modalities with emotional tendencies to provide a more nuanced understanding of emotions within a conversation. UniMSE (Hu et al., 2022b) seeks to unify ERC with multimodal sentiment analysis through a T5-based framework. GA2MIF (Li et al., 2023a) introduces a two-stage multimodal fusion of information from a graph and an attention network. FacialMMT (Zheng et al., 2023) focuses on extracting the real speaker's face sequence from multi-party conversation videos and then leverages auxiliary frame-level facial expression recognition tasks to generate emotional visual representations.

A.4 Error Analysis

Seed                 MELD    IEMOCAP
0                    67.27   70.50
1                    67.41   70.69
1234                 67.44   70.21
2023                 67.24   69.95
42                   67.37   70.48
Mean                 67.35   70.37
Standard deviation   0.0781  0.2581

Table 9: Performance of the full framework for five random seeds.
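As a quick sanity check, the mean and standard-deviation rows of Table 9 can be recomputed from the per-seed scores; the reported values appear to use the population standard deviation.

```python
# Reproduce the mean and (population) standard deviation rows of Table 9.
import statistics

scores = {
    "MELD": [67.27, 67.41, 67.44, 67.24, 67.37],
    "IEMOCAP": [70.50, 70.69, 70.21, 69.95, 70.48],
}

for dataset, values in scores.items():
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    # Matches Table 9 up to rounding of the last digit.
    print(f"{dataset}: mean={mean:.2f}, std={std:.4f}")
```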