Textualized and Feature-based Models for Compound Multimodal Emotion
Recognition in the Wild

Nicolas Richet1,  Soufiane Belharbi1,  Haseeb Aslam1,  Meike Emilie Schadt1,  Manuela González-González2,3,
 Gustave Cortal4,6,  Alessandro Lameiras Koerich1,  Marco Pedersoli1,  Alain Finkel4,5,  Simon Bacon2,3,
and  Eric Granger1
1 LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada
2 Dept. of Health, Kinesiology & Applied Physiology, Concordia University, Montreal, Canada
3 Montreal Behavioural Medicine Centre, CIUSSS Nord-de-l’Ile-de-Montréal, Canada
4 Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF, 91190, Gif-sur-Yvette, France
5 Institut Universitaire de France, France
6 Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
{nicolas.richet.1,muhammad-haseeb.aslam.1}@ens.etsmtl.ca
{soufiane.belharbi,eric.granger}@etsmtl.ca
Abstract

Systems for multimodal emotion recognition (ER) are commonly trained to extract features from different modalities (e.g., visual, audio, and textual) that are combined to predict individual basic emotions. However, compound emotions often occur in real-world scenarios, and the uncertainty of recognizing such complex emotions over diverse modalities is challenging for feature-based models. As an alternative, emerging large language models (LLMs) like BERT and LLaMA can rely on explicit non-verbal cues that may be translated from different non-textual modalities (e.g., audio and visual) into text. Textualization of modalities augments data with emotional cues to help the LLM encode the interconnections between all modalities in a shared text space. In such text-based models, prior knowledge of ER tasks is leveraged to textualize relevant non-verbal cues such as audio tone from vocal expressions, and action unit intensity from facial expressions. Since pre-trained weights are publicly available for many LLMs, training on large-scale datasets is unnecessary, and these models can be fine-tuned directly for downstream tasks such as compound ER (CER). This paper compares the potential of text- and feature-based approaches for compound multimodal ER in videos. Experiments were conducted on the challenging in-the-wild C-EXPR-DB dataset for CER, and contrasted with results on the MELD dataset for basic ER. Our results indicate that multimodal textualization provides lower accuracy than feature-based models on C-EXPR-DB, where text transcripts are captured in the wild. However, higher accuracy can be achieved when the video data has rich transcripts. Our code is available at: github.com/nicolas-richet/feature-vs-text-compound-emotion.

Keywords: Emotion Recognition, Compound Expressions, Multimodal Learning, Multimodal Textualization, Large Language Models

Figure 1: Models for compound multimodal ER in videos. (a) In the feature-based approach, a dedicated feature extractor produces embeddings for each input modality. A feature-level fusion module then combines embeddings from all different modalities to produce joint feature representations for classification. (b) In the text-based approach, textual descriptions are extracted for nonverbal modalities, such as audio and visual. These texts are combined with verbal cues (i.e., text transcripts) and fed to an LLM as a joint textual representation for classification.

1 Introduction

Emotion recognition (ER) plays a critical role in human behavior analysis, human-computer interaction, and affective computing [13, 60]. Research in ER mainly focuses on recognizing the seven basic emotions – anger, surprise, disgust, enjoyment, fear, sadness, and contempt [4, 46]. Recently, there has been growing interest in recognizing complex emotions commonly occurring in real-world scenarios, such as compound emotions, where a mixture of emotions is exhibited [26]. ER is more challenging for complex emotions because they are often ambiguous and subtle, and can be easily confused with basic emotions. Despite the availability of multiple modalities that can potentially help to recognize these complex emotions [70], they can introduce additional uncertainty and conflict [23, 63].

Multimodal information, e.g., faces, voice, and text extracted from videos, has been used extensively to develop robust ER models [1, 60, 62]. Convolutional and transformer-based backbones are commonly trained to extract discriminant features from each modality. These include vision backbones such as ResNet [18] and audio backbones such as VGGish [19]. A fusion model is required to combine features from verbal (spoken text) and nonverbal cues (visual and audio), thereby producing contextualized features to predict accurate emotion classes [39]. Multimodal learning allows for building complex joint feature representations that can achieve high accuracy for ER. This feature-based approach has driven much progress in ER [60]. Indeed, it requires simple and minimal emotion-class annotations – typically enough to allow these models to learn to automatically extract and combine relevant features from different modalities for prediction. However, using only a single emotion class for CER in real-world scenarios and without any other guidance, such as output supervision or input cues, is challenging, especially using videos captured in the wild [30].

Recently, another multimodal learning approach called TextMI has been proposed for sentiment analysis [17] in videos. In contrast with the feature-based approach, input modalities like audio and visual are textualized. Based on prior knowledge of the task, the authors propose extracting nonverbal cues deemed relevant to the task in the form of a descriptive text. This can include a textual description of action unit (AU) intensity from visual [12, 14] and the tone of the audio. This conversion of modalities to text could be seen as an expert-based data augmentation with emotion-related cues used as inputs for the model during training and evaluation. This augmentation provides the models with direct guidance and emotional context to learn the task at hand. Since all modalities are formatted as text, a language model is required to better understand the interconnections between words and modalities and their relations to emotions. However, state-of-the-art language models are typically large, and training them requires a considerable amount of text data, which is not always available. The recent surge of large language models (LLMs) [72] made their use possible [17]. Powerful LLMs such as BERT [10] and LLaMA [56] have been pre-trained, and their weights have been made public, allowing us to fine-tune these models for downstream tasks [22]. Their application in multimodal ER provides a simple method for multimodal fusion, and results shown in [17] are promising. Multimodal textualization remains largely unexplored in the literature. In a very recent work [5], modality descriptions are employed to model output supervision for emotional reasoning and recognition. This differs from TextMI [17], which uses these cues as inputs. In this paper, we analyze text-based approaches like TextMI for multimodal CER.

Figure 2: A common feature-based approach used for multimodal CER.

This paper focuses on the following question: how does textualized modeling perform against feature-based modeling for CER in videos? As shown in Fig. 1(a), feature-based methods employ different backbone models to automatically extract features from each modality. Feature-level fusion models also allow us to automatically learn joint feature representations for ER, although combining diverse modalities over videos remains a challenge [39]. The textualized approach [17] simplifies fusion since all nonverbal modalities (visual and audio) are converted into a single textual modality (see Fig. 1(b)). However, a bottleneck of the approach is that textualizing nonverbal modalities requires the choice of textual descriptions. For instance, the same audio segment can be textualized into the high/low valence/arousal spectrum, or the vocal intonation can be textualized. Similarly, for the facial modality, the choice of AUs and the granularity and context window are some of the design choices that can heavily affect the overall performance. The manual selection and design of textual descriptions require the intervention of domain experts to construct relevant nonverbal cues, making them application-dependent. This is similar to the contrast between using learned versus handcrafted features and the challenges of the latter.

In this paper, we compare the performance of state-of-the-art deep learning models that follow standard feature-based vs. text-based modeling approaches. An extensive set of experiments was conducted on the challenging C-EXPR-DB video dataset for CER in the context of the 7th Workshop and Competition on Affective Behavior Analysis in-the-Wild (ABAW) [28, 33, 30, 29, 25, 24, 32, 31, 27]. To further assess the benefits of using a textualized approach in basic ER, experiments were also conducted on the MELD video dataset.

2 Feature-based Modeling

This approach extracts features from audio (vocal) and video (facial) modalities and text transcripts for multimodal CER in videos. The resulting feature embeddings are combined for feature-level classification (see Fig. 2). The rest of this section provides more details on this approach.

Feature Representation. In ER applications, the de facto strategy to leverage different modalities extracted from videos is to extract their features [1, 34, 54, 62, 75]. These modalities typically include vision and audio. Textual modality, such as audio transcripts, is also included when available. Other ER applications such as pain estimation leverage bio-signals such as physiological modalities [62, 64]: electrodermal activity (EDA), electromyograph (EMG), and electrocardiogram (ECG). The general motivation behind combining these modalities is to leverage their complementary information over a video sequence.

Each modality typically employs a dedicated pre-trained feature extractor, which can be pre-trained on different large-scale datasets. In addition, public weights pre-trained on related datasets can be employed. For visual modality, ResNet [18] backbone is commonly used, which is followed by a module that leverages temporal information such as temporal convolutional network (TCN) [2]. 3D models such as R3D-CNN [57] can better leverage spatio-temporal dependency between frames directly at feature extraction. For audio modality, a variety of public pre-trained feature extractors are available, such as VGGish [19], Wav2Vec 2.0 [61], and HuBERT [21]. In addition, traditional audio features can be easily computed, such as spectrograms and MFCCs [65]. Multiple text feature extractors are available for the text modality, such as BERT [10] and RoBERTa [43]. Feature extractors are typically kept frozen while the subsequent modules are fine-tuned to avoid expensive computational costs.

Figure 3: A common text-based approach used for multimodal CER, where non-verbal modalities are textualized.

Feature-Level Fusion. A bottleneck in feature-based models is fusion (Fig. 1). Some methods rely on temporal models such as LSTMs [7, 54, 48] to combine features over a video, while others rely on simple concatenation [34, 73]. Recent works focus more on self- and cross-modal attention and transformers [59] to perform attention-based fusion [35, 44, 51, 58, 62, 69, 74]. These approaches have shown promising results, as they can capture inter- and intra-modality relationships. Note that aligning modalities in a video setup is challenging as well. For example, it remains unclear which part of the audio or text should be assigned to a single frame. Usually, an empirical sliding window is employed to align other modalities with a frame.

This paper follows the recent work [69] to experiment with a feature-based approach for the ER task (Fig. 2). Their method achieved good results for continuous ER (valence/arousal) in videos in a recent ABAW challenge. In our case, it was adapted to perform emotion classification in videos as well. In particular, we use ResNet50 [18], pre-trained on MS-CELEB1M [16] and FER+ [3] datasets for visual modality. For audio modality, we employ VGGish [19], while BERT [10] is used over text modality. Temporal information is further exploited by using temporal convolutional network (TCN) [2] after each feature extractor. We employ the co-attention block [69] to attend to features from different modalities. This builds a single embedding per frame while leveraging a contextual window. The per-frame feature is then fed to a classifier head to predict emotions.

3 Text-based Modeling

This approach extracts textual descriptions from audio (vocal) and visual (facial) modalities for multimodal CER in videos, and combines them with text transcripts. This joint textual description is processed by an LLM for classification (see Fig. 3). The rest of this section provides more details on this approach.

Audio and Visual Text Description. The API of Hume Inc. (https://www.hume.ai) is employed over a sliding window to analyze the tone of the audio. Their model is trained on millions of human interactions. It is based on an empathic large language model (eLLM), combining language modeling and text-to-speech. The API scores each tone characteristic, allowing us to sort and select the top 10. Examples of tone characteristics include confusion, anxiety, disappointment, distress, and even basic emotions. The name of each tone characteristic and a "Low" or "High" prefix, determined using a threshold on the score, are used to describe the tone textually. We also use a fine-tuned Wav2Vec 2.0 model [61] (huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) to predict scores for arousal, valence, and dominance for each audio clip. Those scores are then categorized as "Low" or "High" using a threshold. The textual description produced from audio concatenates all tone and arousal-valence-dominance textual descriptions.
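The audio textualization step described above can be illustrated with a minimal sketch. It assumes the tone and arousal-valence-dominance scores are already available as plain dictionaries; the function name `describe_scores`, the example scores, and the threshold values are hypothetical and are not taken from the paper or from the Hume API.

```python
def describe_scores(scores, threshold, top_k=None):
    """Turn {name: score} into a 'High x, Low y' text description.

    Scores are sorted in decreasing order, optionally truncated to the
    top_k entries, and prefixed with 'High' or 'Low' by thresholding.
    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    return ", ".join(
        f"{'High' if score >= threshold else 'Low'} {name}"
        for name, score in items
    )


# Hypothetical scores for one audio window (illustrative values only).
tone_scores = {"confusion": 0.71, "anxiety": 0.64,
               "disappointment": 0.22, "distress": 0.58}
avd_scores = {"arousal": 0.80, "valence": 0.30, "dominance": 0.55}

# The paper keeps the top 10 tones; top_k=3 is used here for brevity.
tone_description = describe_scores(tone_scores, threshold=0.6, top_k=3)
avd_text = describe_scores(avd_scores, threshold=0.5)

# The audio description concatenates both parts.
audio_text = f"{tone_description}; {avd_text}"
print(audio_text)
```

In this sketch, a single helper serves both the tone characteristics and the arousal-valence-dominance scores, since both follow the same "threshold, then name" pattern.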

Face cropping and alignment are first performed at each frame using RetinaFace [8]. The Py-Feat library [6] is then used to extract action unit (AU) intensities [12, 14] along with basic emotion probabilities. The AU codebook [4] maps each facial expression to a set of action units. For instance, the expression "Happy" is associated with "AU6" (Cheek raise), "AU12" (Lip corner puller), and "AU25" (Lips part). Typically, a set of these facial AUs activate at once. Over a sequence of frames, we select the maximum intensity of each action unit over time. Then, we use a threshold to determine which AUs are activated. The text description is then the concatenation of the names of the selected action units, as depicted in Fig. 3. The same procedure is repeated with basic emotions, where the top 3 emotions are selected, and their textual names are used as a description.
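As an illustration of the visual textualization step, the following sketch takes per-frame AU intensities and emotion probabilities as plain dictionaries, keeps the maximum over time, and builds the two text descriptions. The helper names, the threshold of 0.5, and the example values are assumptions made for this sketch; the actual extraction relies on Py-Feat, which is not called here.

```python
def peak_over_time(frames):
    """Maximum value of each key over a list of per-frame {key: value} dicts."""
    peak = {}
    for frame in frames:
        for key, value in frame.items():
            peak[key] = max(peak.get(key, 0.0), value)
    return peak


def textualize_aus(frames, au_names, threshold=0.5):
    """Names of the AUs whose peak intensity reaches the (assumed) threshold."""
    peak = peak_over_time(frames)
    # Sort by AU number so that 'AU6' comes before 'AU12'.
    active = sorted((au for au, v in peak.items() if v >= threshold),
                    key=lambda au: int(au[2:]))
    return ", ".join(f"{au} ({au_names[au]})" for au in active)


def textualize_emotions(frames, k=3):
    """Names of the k emotions with the highest peak probability."""
    peak = peak_over_time(frames)
    return ", ".join(sorted(peak, key=peak.get, reverse=True)[:k])


# AU names as given in the text; intensities are illustrative only.
AU_NAMES = {"AU6": "Cheek raise", "AU12": "Lip corner puller",
            "AU25": "Lips part"}
au_frames = [{"AU6": 0.2, "AU12": 0.7, "AU25": 0.1},
             {"AU6": 0.8, "AU12": 0.6, "AU25": 0.4}]
emotion_frames = [
    {"happiness": 0.60, "surprise": 0.30, "neutral": 0.10, "sadness": 0.00},
    {"happiness": 0.50, "surprise": 0.45, "neutral": 0.04, "sadness": 0.01},
]

visual_au_text = textualize_aus(au_frames, AU_NAMES)
visual_emotions_text = textualize_emotions(emotion_frames)
print(visual_au_text)
print(visual_emotions_text)
```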

Combination of Transcripts with Audio and Visual Texts. Some multimodal datasets like MELD [49] and CMU-MOSEI [68] provide text transcriptions. We use Whisper-large-v2 [52] to generate transcripts when they are unavailable. Once all textual descriptions are acquired, we combine them into a single prompt and feed it to an LLM, which is LLaMA-3 [47] in our case. Our prompts follow this template:

Speech transcription of the video : [transcription];
Facial Action Units activated during the video : [visual_AU_text];
Emotions predicted from visual modality: [visual_emotions_text];
Characteristics of the prosody : [tone_description];
Audio emotional state : [arousal-valence-dominance_text]

A fully connected layer (FC) follows LLaMA-3 to perform classification. Both the LLM and the FC layer are fine-tuned using ground-truth class labels.
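The prompt template above can be filled in with a few lines of code. This is a minimal sketch: the field names mirror the placeholders of the template, while the function name `build_prompt` and the example strings are hypothetical. The resulting string would then be tokenized and passed to the LLM followed by the FC layer, which is not shown here.

```python
PROMPT_TEMPLATE = (
    "Speech transcription of the video : {transcription};\n"
    "Facial Action Units activated during the video : {visual_au_text};\n"
    "Emotions predicted from visual modality: {visual_emotions_text};\n"
    "Characteristics of the prosody : {tone_description};\n"
    "Audio emotional state : {avd_text}"
)


def build_prompt(transcription, visual_au_text, visual_emotions_text,
                 tone_description, avd_text):
    """Combine the transcript with the textualized audio and visual cues."""
    return PROMPT_TEMPLATE.format(
        transcription=transcription,
        visual_au_text=visual_au_text,
        visual_emotions_text=visual_emotions_text,
        tone_description=tone_description,
        avd_text=avd_text,
    )


# Illustrative inputs only; none of these strings come from the datasets.
prompt = build_prompt(
    transcription="I can't believe you did that!",
    visual_au_text="AU6 (Cheek raise), AU12 (Lip corner puller)",
    visual_emotions_text="happiness, surprise, neutral",
    tone_description="High confusion, Low distress",
    avd_text="High arousal, Low valence, High dominance",
)
print(prompt)
```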

4 Results and Discussion

This paper compares feature- and text-based approaches for multimodal CER in videos in the wild. For a fair comparison, our experimental setup is constrained to be as similar as possible for the two approaches. This section provides the experimental methodology, results, and discussion. We also include results on basic emotions in videos.

4.1 Experimental Methodology

(1) Datasets. Two video-based datasets for emotion recognition are used: C-EXPR-DB (compound emotion), and MELD (basic emotion).

a) C-EXPR-DB [26]: It is composed of 56 videos taken from the test set of the 7th ABAW CER Challenge. The full C-EXPR-DB [26] contains 400 videos with 200,000 frames in total. However, we do not have access to all 400 videos, but only to the 56 test videos of the challenge. Each frame is annotated with one of twelve compound emotions. For the 7th ABAW Challenge, only seven compound emotions are considered from the original twelve: Fearfully Surprised, Happily Surprised, Sadly Surprised, Disgustedly Surprised, Angrily Surprised, Sadly Fearful, and Sadly Angry. In this work, these 56 videos used for the test are referred to as C-EXPR-DB. The challenge organizers provide them without annotation. To perform experiments before the challenge deadline, these 56 videos were annotated by our internal expert team. Each video may contain several parts that exhibit compound emotions. The annotation is done at the frame level with the same seven compound emotions of the challenge, in addition to the class "Other", which represents any other emotion, compound or basic, that differs from the seven considered. Once annotated, we cut each original video into segments using the annotation timeline. Each segment contains only one compound emotion. We obtained 125 segments. We refer to this annotated dataset as C-EXPR-DB. Its class distribution is presented in Table 1. We split C-EXPR-DB for 5-fold cross-validation. Performance is reported on the validation set of each fold. The evaluation is done only on compound emotions, discarding the class "Other". However, the training may or may not include this extra class. More details about our annotation are provided later.

Compound class           Number of segments   Total duration (secs)
Angrily Surprised                         8                   53.92
Sadly Angry                              16                  100.60
Fearfully Surprised                      24                   98.85
Happily Surprised                        15                  150.57
Sadly Fearful                            19                  107.56
Disgustedly Surprised                    10                   82.45
Sadly Surprised                          13                  182.17
Other                                    20                   71.00
Total                                   125                  847.15
Table 1: Class distribution and duration of our internally labeled C-EXPR-DB dataset from the 56 videos of the 7th ABAW CER Challenge taken from the C-EXPR-DB dataset.
Note: In some of our experiments, the class "Other" has been used only for training. Only the seven compound emotions are considered for evaluation.
Code: Definition
Fearfully Surprised: Experiencing a sudden shock or surprise accompanied by fear. This might happen if something unexpected occurs that also appears threatening or dangerous. Must include at least one cue related to fear and a cue related to surprise, either occurring simultaneously or closely following each other.
Happily Surprised: Experiencing a sudden and unexpected event that brings joy or happiness. This type of surprise is pleasant and delightful. Must include at least one cue related to happiness and at least one cue related to surprise. They should occur simultaneously or closely following each other.
Sadly Surprised: Experiencing a sudden and unexpected event that brings sadness or disappointment. This type of surprise is unpleasant and upsetting. Must include at least one cue related to sadness and at least one cue related to surprise. They should occur simultaneously or closely following each other.
Disgustedly Surprised: Experiencing a sudden shock or surprise that also causes feelings of disgust or revulsion. This might happen if something unexpected occurs that is also repulsive. Must include at least one cue related to disgust and at least one cue related to surprise. They should occur simultaneously or closely following each other.
Angrily Surprised: Experiencing a sudden shock or surprise that provokes anger. This might happen if something unexpected occurs that is also infuriating. Must include at least one cue related to anger and at least one cue related to surprise. They should occur simultaneously or closely following each other.
Sadly Fearful: Feeling both sadness and fear simultaneously. This can occur when facing a situation that is both threatening and sorrowful. Must include at least one cue related to sadness and at least one cue related to fear. They should occur simultaneously or closely following each other.
Sadly Angry: Feeling both sadness and anger simultaneously. This can happen when dealing with a situation that evokes both sorrow and frustration or rage. Must include at least one cue related to sadness and at least one cue related to anger. They should occur simultaneously or closely following each other.
Table 2: Codebook used by our experts to annotate compound C-EXPR-DB emotions.

b) MELD (basic emotions) [49]: This multi-party dataset was created by clipping utterances from the TV show "Friends". The train, validation, and test sets consist of 9988, 1108, and 2610 utterances, respectively. Each utterance has one global label from seven basic emotions: anger, sadness, joy, neutral, fear, surprise, or disgust. In addition, a transcript of each utterance is provided. This dataset is unbalanced: neutral is the most dominant label with 4710 utterances, while disgust is the least frequent label with 271 utterances.

(2) Annotation of Compound Emotions in C-EXPR-DB. The 56 test videos of C-EXPR-DB for the 7th ABAW CER Challenge were annotated by two expert annotators, both with a psychology background and one with extensive experience with emotion recognition. Each video was annotated by both annotators and subsequently triangulated to create one unified annotation file. The triangulation included a discussion to create agreement on the compound emotions found in each video, the segment of the video where it could be found (time stamps), and the reasoning behind the choice of compound emotion identified.

Annotators followed a codebook (Table 2), created specifically for the challenge, where the seven compound emotions (Fearfully Surprised, Happily Surprised, Sadly Surprised, Disgustedly Surprised, Angrily Surprised, Sadly Fearful, Sadly Angry) were properly described. In addition, the individual emotions (Happiness, Sadness, Anger, Fear, Disgust, and Surprise) were described and broken down into specific cues, including behavioral responses and facial, language, audio, and body language markers for each of the basic emotions. This allowed for the use of multiple modalities to properly identify the presence of the emotions. Compound emotions were annotated when both emotions occurred simultaneously or closely followed each other. Additionally, other emotions not related to the compound emotions were tagged as "Other". Videos were annotated using the software ELAN 6.8 (archive.mpi.nl/tla/elan) since both annotators had previous experience with this annotation tool.

(3) Baseline Models. For a fair comparison, recent models are considered. For the feature-based approach, we follow the work in [69], using ResNet50 [18] for feature extraction over the visual modality. It is pre-trained on the MS-CELEB1M [16] dataset for a facial recognition task, and then fine-tuned on the FER+ [3] dataset. VGGish [19] is used for the audio modality. For the text modality, the BERT Base Uncased model is used [10]. To leverage temporal dependency in videos, we used a temporal convolutional network (TCN) [2], which has been shown to yield good results in previous ABAW challenges [50, 69]. For the fusion module, we used the co-attention method (LFAN) proposed in [69], followed by a classification head. All three feature extractors are frozen, and every subsequent module is fine-tuned. For the text-based approach, once the text of all modalities is acquired, it is fed to a language model, which is followed by a dense layer to perform classification. In our experiments, we used LLaMA-3 8B [47], a recent open LLM that can fit in an average GPU. Both modules are fine-tuned using emotion labels as supervision.

(4) Training Protocol. The following learning strategies are used:
a) 7th ABAW CER Challenge: We train our model on MELD over the seven basic emotions. We then evaluate the final model over the 56 unlabeled videos of the test set C-EXPR-DB, following the challenge protocol. Since the model is trained over basic emotions, for each of the seven compound emotions, we sum the probabilities of its two constituent basic emotions and pick the pair with the highest score as the predicted compound emotion. We submitted different cases of feature and text approaches based on the training size of MELD: with 1% and 100% of the total training samples. We also explored zero-shot predictions of a multimodal large language model (MLLM), LLaVA-NeXT-Video [71], which uses the visual modality.
b) Feature vs. text comparison: Over C-EXPR-DB and MELD, we perform supervised learning on each dataset separately. For C-EXPR-DB, we report the performance on the validation set of each fold of the 5-fold cross-validation, while we report test performance on MELD.
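The basic-to-compound prediction rule of protocol a) can be sketched as follows. The mapping from each compound emotion to its two basic emotions is our reading of the class names, and pairing "Happily" with the MELD label "joy" is an assumption; the function name and the example probabilities are hypothetical.

```python
# Assumed mapping from each compound class to two MELD basic labels.
COMPOUND_TO_BASIC = {
    "Fearfully Surprised": ("fear", "surprise"),
    "Happily Surprised": ("joy", "surprise"),
    "Sadly Surprised": ("sadness", "surprise"),
    "Disgustedly Surprised": ("disgust", "surprise"),
    "Angrily Surprised": ("anger", "surprise"),
    "Sadly Fearful": ("sadness", "fear"),
    "Sadly Angry": ("sadness", "anger"),
}


def predict_compound(basic_probs):
    """Score each compound class by summing its two basic probabilities.

    Returns the best compound label and the score of every compound class.
    The 'neutral' probability is not used by any pair.
    """
    scores = {
        name: basic_probs[a] + basic_probs[b]
        for name, (a, b) in COMPOUND_TO_BASIC.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative softmax output of a model trained on MELD.
probs = {"anger": 0.05, "sadness": 0.30, "joy": 0.05, "neutral": 0.10,
         "fear": 0.25, "surprise": 0.20, "disgust": 0.05}
label, scores = predict_compound(probs)
print(label, round(scores[label], 2))
```

With these illustrative probabilities, sadness and fear are the two strongest basic emotions, so the sketch selects "Sadly Fearful".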

Next, we present the experimental details of each approach:
1) Feature-based approach: The pre-processing of videos is carried out as follows. For the visual modality, RetinaFace [8] is used to crop and align faces from each frame, which are then resized to $48 \times 48$. A sliding window over frames is used for training. Its size is estimated by validation among $16n$, where $n$ spans from 2 to 18. Windows overlap with a size of 16. A window of frames is fed to a pre-trained ResNet50 [18]. The audio of a video is initially converted to WAV format with a sampling rate of 16 kHz. To synchronize with frames, we set the hop length to $1/\text{frame rate}$. For audio feature extraction, we used VGGish [19]. For the text modality, we used the transcripts provided by MELD. However, for C-EXPR-DB, in both its challenge and annotated versions, we used Whisper-large-v2 [52] to generate each video transcript. BERT [10] is then used to extract features aligned with frames. The three feature extractors are frozen. Only the subsequent modules are fine-tuned. The model [69] is fine-tuned at the frame level. When only a video-level emotion label is available, this same label is transferred to each frame in the video. Stochastic gradient descent (SGD) is used for optimization with a batch size estimated by validation from $\{2, 4, 8, 16, 18\}$ and a weight decay of $10^{-4}$. For the MELD dataset with 1%, we trained for 1000 epochs, while 300 epochs are used for the case of full data (100%) due to the long computation time (37 hours on an NVIDIA A100 GPU). For C-EXPR-DB, we trained for 100 epochs. Standard cross-entropy loss is used for training.
2) Text-based approach: The pre-processing of each modality is performed as follows. For the visual modality, the Py-Feat library (github.com/cosanlab/py-feat) is used to extract the intensity of 20 facial action units for each frame of a video. A maximum is then applied over a sliding window of frames to summarize the action unit scores for each frame. This sliding window is used as the context for each frame. The audio parts of the video are used as input to the prosody model using the API of hume.ai, and the top-10 tone characteristics are then used. We also used a fine-tuned Wav2Vec 2.0 model [61] (huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) to predict scores for arousal, valence, and dominance for each audio clip. We use QLoRA [9] to efficiently fine-tune LLaMA-3 8B [47]. The training is done at the frame level, and when only a video-level emotion label is available, this same label is transferred to each frame in the video. To reduce the computation time on the MELD dataset and avoid too many prompt duplicates, we only use a few frames with their context from each video during training, while the test is conducted over all frames. SGD is also used for optimization with a batch size estimated by validation from $\{8, 10, 14\}$ and a weight decay of $10^{-3}$. For the MELD dataset with 1%, we use a learning rate estimated by validation from $\{2 \times 10^{-3}, 5 \times 10^{-3}, 7 \times 10^{-3}\}$, while learning rates from $\{7 \times 10^{-3}, 10^{-2}, 3 \times 10^{-2}\}$ are used for the case of full data. For the C-EXPR-DB dataset, we set a window hop size to decrease the number of redundant prompts used in training. We estimate the batch size from $\{6, 8\}$ and the learning rate from $\{2 \times 10^{-3}, 5 \times 10^{-3}, 7 \times 10^{-3}\}$. The window size is also selected from $\{10, 15, 20, 30\}$, and we use a hop size of 10. Standard cross-entropy loss is used for training.
3) Zero-shot MLLM: The previous two approaches are compared with a zero-shot MLLM based on LLaVA-NeXT-Video [71], an open-source MLLM trained on text-image data and fine-tuned on video data. This model shows strong zero-shot modality transfer, outperforming existing open-source MLLMs specifically trained for videos and achieving comparable performance to proprietary MLLMs. For basic ER and CER, we used raw videos and specific prompts to constrain the MLLM’s output to single responses related to discrete emotions: "Are the persons in the video either surprised, angry, joyful, sad, fearful, disgusted, or neutral? Choose a single answer.", and "Identify the primary compound emotion displayed by the individuals in the video from the following options: disgust-surprise, anger-surprise, fear-sadness, anger-sadness, fear-surprise, happy-surprise, or sad-surprise." This constraint affects the MLLM’s performance, as it was designed to provide a rich description for understanding video content rather than making decisions on a single category. As a result, for some short videos, the MLLM returned non-valid emotions, such as descriptions of scenes or other text. We post-processed the outputs in these cases, replacing them with the "neutral" category.

(5) Performance Measures. To assess model performance, we use the average F1𝐹1F1italic_F 1 score required in the 7th ABAW Challenge. It is defined as follows,

{F1c=2×Precisionc×RecallcPrecisionc+Recallc;Precisionc=TPcTPc+FPc;Recallc=TPcTPc+FNc;F1=c=17wc×F1c;\left\{\begin{aligned} &F1_{c}=\frac{2\times{\text{Precision}}_{c}\times{\text% {Recall}}_{c}}{{\text{Precision}}_{c}+{\text{Recall}}_{c}};\\ &{\text{Precision}}_{c}=\frac{{TP}_{c}}{{TP}_{c}+{FP}_{c}};\quad{\text{Recall}% }_{c}=\frac{{TP}_{c}}{{TP}_{c}+{FN}_{c}};\\ &\quad F1=\sum_{c=1}^{7}w_{c}\times F1_{c};\end{aligned}\right.{ start_ROW start_CELL end_CELL start_CELL italic_F 1 start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG 2 × Precision start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT × Recall start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG start_ARG Precision start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + Recall start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG ; end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL Precision start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG italic_T italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG start_ARG italic_T italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_F italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG ; Recall start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG italic_T italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG start_ARG italic_T italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_F italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG ; end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_F 1 = ∑ start_POSTSUBSCRIPT italic_c = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT × italic_F 1 start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ; end_CELL end_ROW (1)

where $c$ represents the class ID, $TP$ represents True Positives, $FP$ represents False Positives, and $FN$ represents False Negatives. Average $F1$ is computed with $w_{c} = 1/7$, while weighted $F1$ is computed with $w_{c}$ set to the proportion of each class.
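As an illustration of Eq. (1), the sketch below computes per-class, average, and weighted $F1$ from lists of true and predicted labels. The function names are hypothetical, and returning 0 when a denominator is zero is an assumed convention that the text does not specify.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, classes):
    """Return {class: F1_c} computed from TP, FP and FN counts."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Assumed convention: a zero denominator yields a score of 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred, classes):
    """Average F1: every class gets the same weight w_c = 1 / len(classes)."""
    f1 = per_class_f1(y_true, y_pred, classes)
    return sum(f1.values()) / len(classes)


def weighted_f1(y_true, y_pred, classes):
    """Weighted F1: w_c is the proportion of class c among the true labels."""
    f1 = per_class_f1(y_true, y_pred, classes)
    support = Counter(y_true)
    return sum(support[c] / len(y_true) * f1[c] for c in classes)


if __name__ == "__main__":
    y_true = [0, 0, 0, 1]
    y_pred = [0, 0, 1, 1]
    print(average_f1(y_true, y_pred, classes=[0, 1]))
    print(weighted_f1(y_true, y_pred, classes=[0, 1]))
```

With unbalanced classes the two scores differ, which is why the choice of $w_{c}$ matters when comparing results across datasets.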

Following the literature, on the MELD dataset, we use the weighted $F1$ score, which accounts for unbalanced classes. The same weighted score is used on C-EXPR-DB, where evaluation is done at the frame level.

With the MELD dataset, only global video-level labels are available where a class is assigned to the entire video. Evaluation must also be performed at the video level, not the frame level. For video-level prediction, we post-process the frame-level predictions using three different strategies:

1) Majority voting: Majority voting is performed over the predicted classes of all the frames. The winning class becomes the prediction for the video.

2) Average logits: For each class, its logits are averaged across frames. This yields an average logit vector. The video class is the class with the maximum average logit.

3) Average probabilities: This is similar to average logits, but it is performed over probabilities instead.
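The three strategies above can be sketched as follows, assuming each frame comes with a vector of class logits. The function names are hypothetical, probabilities are assumed to be obtained with a softmax, and ties are broken in favour of the class seen first; none of these details are stated in the text.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def mean_vector(rows):
    """Element-wise mean over frames."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def majority_voting(frame_logits):
    """Predict a class per frame, then return the most frequent class."""
    votes = Counter(argmax(row) for row in frame_logits)
    return votes.most_common(1)[0][0]


def average_logits(frame_logits):
    """Average the logits over frames, then take the arg-max class."""
    return argmax(mean_vector(frame_logits))


def average_probabilities(frame_logits):
    """Average the per-frame softmax probabilities, then take the arg-max."""
    return argmax(mean_vector([softmax(row) for row in frame_logits]))


if __name__ == "__main__":
    # Two frames weakly favour class 0, one frame strongly favours class 1.
    video = [[2.0, 1.9], [2.0, 1.9], [0.0, 5.0]]
    print(majority_voting(video), average_logits(video), average_probabilities(video))
```

The toy input shows how the strategies can disagree: voting counts frames equally, while averaging lets a single confident frame outweigh several uncertain ones.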

(6) Frame-based Ensembling. For the ABAW CER challenge, we submitted an ensemble over the predicted labels of different models at the frame level. Given a set of models, we first obtain the prediction of each model at the frame level. Then, we aggregate the predicted labels at each frame to obtain a final frame label. To do so, at time $t$, we consider a window of length 10 that covers frame $t$ and its previous frames. We then perform majority voting over the predicted labels across all frames and models within that window. The winning label is assigned as the final prediction for frame $t$.
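A minimal sketch of this windowed voting is given below. Several details are assumptions rather than facts from the text: the window is taken to contain 10 frames including frame $t$, it is truncated at the start of the video, ties go to the label seen first, and the optional `weights` argument is a hypothetical extension that mirrors the weighted variant reported in Table 6.

```python
from collections import Counter


def ensemble_frames(model_predictions, window=10, weights=None):
    """Fuse per-frame labels from several models by windowed majority voting.

    model_predictions: one list of per-frame labels per model, all of the
    same length. weights: optional per-model vote weights (assumed option).
    """
    n_frames = len(model_predictions[0])
    if any(len(preds) != n_frames for preds in model_predictions):
        raise ValueError("all models must predict the same number of frames")
    if weights is None:
        weights = [1.0] * len(model_predictions)

    fused = []
    for t in range(n_frames):
        # Window covers frame t and its previous frames (truncated at 0).
        start = max(0, t - window + 1)
        votes = Counter()
        for preds, w in zip(model_predictions, weights):
            for label in preds[start : t + 1]:
                votes[label] += w
        fused.append(votes.most_common(1)[0][0])
    return fused


if __name__ == "__main__":
    m1 = ["a", "a", "b", "b"]
    m2 = ["a", "b", "b", "b"]
    m3 = ["a", "a", "a", "b"]
    print(ensemble_frames([m1, m2, m3], window=3))
```

Because the window only looks backwards, the fused label at frame $t$ lags behind abrupt changes, which smooths isolated single-frame errors.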

| Modalities | Feature-based: MELD | Feature-based: C-EXPR-DB | Text-based: MELD | Text-based: C-EXPR-DB |
|---|---|---|---|---|
| Visual only | 37.88 | **53.46** | 25.18 | 19.82 |
| Audio only | 39.51 | 25.45 | 43.97 | **50.96** |
| Text (transcript) only | **59.78** | 33.09 | **62.55** | 46.11 |
| All 3 modalities | 60.87 | 42.74 | 64.22 | 48.09 |

Table 3: Ablation study: weighted $F1$ scores for unimodal (visual, audio, and text) and multimodal cases. Results are reported at the video level (MELD) and the frame level (C-EXPR-DB). For the C-EXPR-DB dataset, we consider fold 0 and the case w/o "Other".

4.2 Results for Basic (MELD) and Compound (C-EXPR-DB) Emotions

Ablation studies were conducted on text- and feature-based approaches over unimodal and multimodal cases (see Table 3). Both datasets were considered: MELD, with a relatively controlled setup, and C-EXPR-DB, an extreme in-the-wild case. The first observation is that the contribution of a single modality depends on the dataset. On MELD, where transcripts are rich, both approaches achieved their highest unimodal $F1$ score on the text modality, with 59.78% for feature-based and 62.55% for text-based. The other modalities contributed less. We note that visual textualization achieved a poor $F1$ score of 25.18% compared to audio with 43.97%. This suggests that either there is less information in the visual modality or the textual cues used are less effective at capturing the emotion. This pattern is consistent with the feature-based approach, but with a smaller performance gap between the two modalities.

On C-EXPR-DB, the text transcript is very limited. In most videos, people shout, scream, talk very briefly, or do not talk and only express compound emotions visually. Such characteristics seem to be tied to compound emotions, making their prediction difficult. Interestingly, each approach leverages a different modality to deal with compound emotion prediction. In the feature-based method, the visual modality seems to be the strongest, yielding an $F1$ score of 53.46%. In the text-based method, however, audio seems to be the strongest modality with an $F1$ score of 50.96%, while the text modality ranks second with 46.11%, and the visual modality yields a poor score of 19.82%. This very low score may indicate that the textualization of the visual modality yields inadequate information, since it contradicts the feature-based approach, where the visual modality scores the highest (53.46%) and therefore appears to hold rich information. Converting a modality into a set of textual descriptions can lead to a drastic and irreversible loss of information. Handcrafting these textual cues requires expertise, and poorly chosen cues may lead to poor performance. This makes modality textualization difficult to apply in real-world applications. An additional observation is that finetuned LLMs, as in our case where a fully connected layer is used for classification on top of the LLM, are powerful on the text modality alone when the text is rich. This can be observed on the MELD dataset, where the text-based method yields a score of 62.55% using the transcript only. However, the performance drops to 46.11% with the poorer transcripts of the compound emotion dataset (C-EXPR-DB).

Another observation from Table 3 is the impact of multiple modalities on performance. On MELD, using multiple modalities seems to improve performance compared to a single modality. On compound emotions (C-EXPR-DB), however, it leads to a considerable performance drop relative to the best single modality for both approaches. This may be explained by conflicting emotions expressed simultaneously through different modalities. For example, the transcript "OK" can convey the emotion 'Neutral'. However, the way it is said and the facial appearance while saying it can change the emotion to suggest the dual emotions of 'Surprise' and 'Happy'. In other cases, multiple conflicting instances within the same modality can be an issue, as in audio, where it is difficult to separate the sound sources. Consider, for example, a person commenting on an event, where we are interested in the emotion of the commentator. The overlay of the audio signals of the event and of the person could easily reflect conflicting emotions. Videos in the wild are extremely challenging, and there are many different cases to consider. Focusing on the right instance in a modality remains a challenge [40], but it may be easier with the visual modality than with audio and text.

| Prediction Method | Feature-based: 1% | Feature-based: 100% | Text-based: 1% | Text-based: 100% | MLLM zero-shot: LLaVa-NeXT-Video [71] (Visual) |
|---|---|---|---|---|---|
| Majority voting | 43.64 | 59.92 | **33.31** | **65.50** | |
| Average logits | **45.58** | **60.87** | 33.12 | 65.35 | 36.25 |
| Average probabilities | 43.34 | 59.72 | 33.09 | 65.29 | |

Table 4: Video-level weighted $F1$ score on the MELD test set for feature- and text-based approaches. Three methods are used to extract video-level predictions from frame-level predictions. Training is performed on 1% and 100% of the original MELD training set.

A comparison between text- and feature-based approaches for basic (MELD) and compound (C-EXPR-DB) ER is presented in Tables 4 and 5, respectively. On MELD, the text-based approach yields the best performance, $\approx 4\%$ above the feature-based approach. As discussed earlier, this is mainly due to the high quality of the transcripts and the LLM. On C-EXPR-DB, however, the feature-based approach is ahead of the text-based approach. Given the poor transcript quality, textualization does not bridge the gap with the feature-based approach. Based on our experiments, textualization should be preferred over feature-based methods only when rich transcripts are available.

| Methods | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Mean ± std |
|---|---|---|---|---|---|---|
| Training w/ "Other" | | | | | | |
| · Text-based | 34.82 | 54.47 | 36.47 | 51.06 | 48.08 | 44.98 ± 7.90 |
| · Feature-based | **39.07** | **64.73** | **41.61** | **55.65** | **70.08** | **54.22 ± 12.26** |
| Training w/o "Other" | | | | | | |
| · Text-based | **48.09** | 48.09 | 34.59 | **59.24** | 55.83 | 49.17 ± 8.49 |
| · Feature-based | 42.74 | **65.56** | **41.66** | 54.74 | **70.65** | **55.07 ± 11.70** |

Table 5: Frame-level weighted $F1$ score on C-EXPR-DB validation sets.

Experiments on C-EXPR-DB with ensembling at the frame level (see Table 6) indicate that combining the predictions of relatively good models yields better performance. Regarding the zero-shot MLLM LLaVa-NeXT-Video [71], we obtained low performance on both datasets. It is important to acknowledge that we constrained its output to a single label. This may have limited its performance, since a single label may not be enough to express an emotion. Despite this, the model produced decent performance on MELD with an $F1$ score of 36.25%. However, its performance on the compound emotion dataset C-EXPR-DB is limited. We observed that on this dataset the model predicts the class 'Fearfully Surprised' at almost every frame. Predicting two emotions simultaneously seems more challenging for this model than predicting a single emotion.

| Methods | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Mean ± std |
|---|---|---|---|---|---|---|
| Text-based | 48.09 | 48.09 | 34.59 | 59.24 | 55.83 | 49.16 ± 9.49 |
| Feature-based | 42.74 | 65.56 | 41.66 | 54.74 | 70.65 | 55.07 ± 13.08 |
| Zero-shot MLLM LLaVa-NeXT-Video [71] | 9.31 | 4.83 | 8.27 | 12.18 | 8.87 | 8.69 ± 2.63 |
| Frame-based ensembling over all 3 previous methods | **54.21** | 65.86 | 35.63 | 63.39 | 69.04 | 58.48 ± 15.40 |
| Weighted frame-based ensembling: feature- and text-based only | 50.75 | **69.08** | **49.57** | **63.62** | **75.15** | **61.63 ± 11.24** |

Table 6: Frame-level weighted $F1$ score on C-EXPR-DB validation sets for our three compared methods, in addition to the ensembling of their predictions. We consider the case w/o "Other" in Table 5. The weighted frame-based ensembling gives twice the weight to the feature-based predictions compared to the text-based predictions.
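The "Mean ± std" columns of Tables 5 and 6 can be recomputed from the per-fold scores. The sketch below uses the w/o "Other" scores reported in both tables; the helper name `summarize` is hypothetical. The arithmetic suggests that the spreads in Table 5 are consistent with a population standard deviation and those in Table 6 with a sample standard deviation, but this is only an inference from the numbers, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev, stdev

# Per-fold frame-level weighted F1 (w/o "Other"), as listed in Tables 5 and 6.
text_based = [48.09, 48.09, 34.59, 59.24, 55.83]
feature_based = [42.74, 65.56, 41.66, 54.74, 70.65]


def summarize(scores):
    """Return the mean and both common standard-deviation estimates."""
    return {
        "mean": mean(scores),
        "population_std": pstdev(scores),  # divides by n
        "sample_std": stdev(scores),       # divides by n - 1
    }


if __name__ == "__main__":
    for name, scores in [("text-based", text_based), ("feature-based", feature_based)]:
        s = summarize(scores)
        print(f"{name}: mean={s['mean']:.2f} "
              f"pstd={s['population_std']:.2f} std={s['sample_std']:.2f}")
```

With only five folds the two estimators differ noticeably, so it helps to state which one is reported when comparing spreads across tables.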

We note that our results on MELD are competitive with state-of-the-art performance, as presented in Table 7. Among methods that do not use contextual information, the text-based method achieved a new state-of-the-art $F1$ score of 65.50%.

| Model (year) | Modalities | Context Info | Weighted F1 |
|---|---|---|---|
| HCAM ('24) [11] | T+A | yes | 65.80 |
| SDT ('23) [45] | T+A+V | yes | 66.60 |
| DF-ERC ('23) [37] | T+A+V | yes | 67.03 |
| EACL ('24) [66] | T | yes | 67.12 |
| TelME ('24) [67] | T+A+V | yes | 67.37 |
| InstructERC ('23) [36] | T | yes | 69.15 |
| CKERC ('24) [15] | T | yes | 69.27 |
| SMCN ('22) [20] | T+A | no | 62.3 |
| HCAM Stage I ('24) [11] | T | no | 63.3 |
| SSE-FT ('20) [55] | T+A+V | no | 63.9 |
| Text-based | T+A+V | no | 65.50 |
| Feature-based | T+A+V | no | 60.87 |
| Zero-shot MLLM LLaVa-NeXT-Video [71] ('24) | V | no | 36.25 |

Table 7: Comparison of recent results on the MELD dataset.

We conducted additional experiments on C-EXPR-DB to investigate the impact of weight initialization by comparing random and MELD-pretrained weights for both the text- and feature-based methods. Results are reported in Table 8. These results convey a mixed message: depending on the fold, pretrained weights sometimes help, and sometimes it is better to start from random initialization. However, we note that the feature-based method is more robust to initialization, as we observe only a slight performance shift (around 2%) between the random and pretrained cases. For the text-based method, in contrast, the performance shift is large in some folds and can reach 8% and 14%.

Table 10 presents the video-level weighted $F1$ score on C-EXPR-DB validation sets. At the video level, the feature-based method has a clear overall margin over the text-based method.

| Methods | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---|---|---|---|---|---|
| Feature-based | | | | | |
| · Random init. | 42.74 | 65.56 | **41.66** | **54.74** | **70.65** |
| · Init. over MELD | **43.96** | **68.14** | 38.87 | 54.07 | 68.45 |
| Text-based | | | | | |
| · Random init. | **48.09** | 48.09 | 34.59 | **59.24** | 55.83 |
| · Init. over MELD | 34.99 | **56.30** | **37.94** | 58.53 | **57.13** |

Table 8: Impact of pre-training: Frame-level weighted $F1$ score on C-EXPR-DB validation sets with different weights initialization: random vs pre-trained over MELD. We consider the case w/o "Other". In the pre-trained case, the fully connected last layer is randomly initialized.

| Case / Approach | Text-based | Feature-based | Zero-shot MLLM LLaVa-NeXT-Video [71] |
|---|---|---|---|
| Train time (1 epoch), MELD | 9.3 min | 6.5 min | |
| Train time (1 epoch), C-EXPR-DB | 2.1 min | 23 sec | |
| Inference time per frame | 35 ms | 0.12 ms | 715 ms |
| Total n. params. | 7.51B | 223.91M | 7.06B |
| N. learnable params. | 3.44M | 5.00M | |
| N. FLOPs | 597.52 TFLOPs | 1.87 TFLOPs | |
| N. MACs | 298.75 TMACs | 938.76 GMACs | |

Table 9: Comparison of computation time, number of parameters, number of FLOPs/MACs of text- and feature-based approach used in our experiments, in addition to a zero-shot MLLM.

| Folds / Case | w/ "Other": Text-based | w/ "Other": Feature-based | w/o "Other": Text-based | w/o "Other": Feature-based | Zero-shot LMM LLaVa-NeXT-Video [71] (Visual) |
|---|---|---|---|---|---|
| Majority voting | | | | | |
| Fold 0 | 17.96 | 32.83 | 32.91 | 42.74 | |
| Fold 1 | 44.46 | 46.59 | 30.08 | 49.47 | |
| Fold 2 | 18.61 | 45.45 | 31.87 | 45.67 | |
| Fold 3 | 35.10 | 30.15 | 34.06 | 40.73 | |
| Fold 4 | 28.15 | 54.32 | 26.05 | 53.96 | |
| Average logits | | | | | |
| Fold 0 | 20.76 | 36.56 | 32.91 | 46.52 | 12.02 |
| Fold 1 | 42.12 | 49.01 | 26.92 | 43.98 | 34.20 |
| Fold 2 | 18.61 | 45.45 | 31.87 | 44.66 | 25.06 |
| Fold 3 | 35.10 | 29.90 | 39.21 | 37.14 | 29.23 |
| Fold 4 | 28.15 | 54.32 | 16.50 | 53.96 | 26.45 |
| Average probabilities | | | | | |
| Fold 0 | 17.96 | 36.56 | 32.91 | 42.74 | |
| Fold 1 | 44.46 | 40.98 | 26.92 | 40.38 | |
| Fold 2 | 18.61 | 36.55 | 31.87 | 34.80 | |
| Fold 3 | 35.10 | 29.90 | 34.06 | 40.73 | |
| Fold 4 | 28.15 | 54.32 | 22.35 | 53.96 | |

Table 10: Video-level weighted $F1$ score on C-EXPR-DB validation sets.

4.3 7th ABAW CER Challenge Results

Table 11 presents our results in the 7th ABAW CER Challenge. In our first four submissions, we presented our text-based (100% of MELD), feature-based (1% and 100% of MELD), and zero-shot MLLM models. Except for the zero-shot case, training is done on the basic emotion dataset MELD, as described in Section 4.1. Then, we combined the predicted labels to construct a compound emotion prediction on C-EXPR-DB. Among these four submissions, the feature-based method leads with an $F1$ score of 22.64%, followed by the text-based method with a score of 19.86%. Our final submission, which performs frame-level prediction fusion across the four submissions, achieved our highest score of 25.90%, ranking third in this challenge.

| Methods | Average $F1$ Score |
|---|---|
| Netease Fuxi AI Lab, Liu et al. [41] | 60.63 |
| HSEmotion, Savchenko [53] | 32.43 |
| HFUT-MAC2, Liu et al. [42] | 22.81 |
| AIPL-BME-SEU, Li et al. [38] | 16.44 |
| ETS-LIVIA (our methods) | |
| · Text-based (100% of MELD) | 19.86 |
| · Feature-based (100% of MELD) | 22.64 |
| · Feature-based (1% of MELD) | 16.20 |
| · Zero-shot MLLM LLaVa-NeXT-Video [71] | 17.67 |
| · Frame-based ensembling of our 4 submissions | 25.90 |

Table 11: ABAW CER Challenge: frame-level average $F1$ score on C-EXPR-DB test set.

4.4 Computation Time

Table 9 presents the computation time and the number of parameters of the different methods. Computations are done on an NVIDIA A100 GPU. The reported time does not account for data pre-processing such as face detection/cropping/alignment, audio transcript extraction, cue extraction (action units, basic emotions, audio tone), and feature extraction in frozen encoders (feature-based approach). The train time is measured with a batch size of 8. A window of 224 with a hop length of 16 is used for the feature-based method. A window of 15 with a hop length of 10 is used for the text-based method. The number of parameters in the feature-based method is decomposed as follows: 218.91M for the three feature extractors (37.28M for ResNet50 [18], 72.14M for VGGish [19], 109.48M for BERT [10]), and 5M for the fusion module. The VGGish and BERT models are used only to pre-compute and store the features on disk to speed up training and evaluation. They are not used in any computations afterward. The number of learnable parameters in the text-based method is controlled by QLoRA [9].

The total number of FLOPs and MACs is computed on the same device, an NVIDIA A100 GPU, using the calflops library (pypi.org/project/calflops, V0.3.2) over a sequence of 224 frames lasting roughly 9 seconds. For the feature-based model, the entire sequence is processed at once. The text input has roughly 12 words. The full model, including the 3 backbones, requires 1.87 TFLOPs and 938.76 GMACs in total. This amounts to 8.39 GFLOPs and 4.19 GMACs per frame. In our experiments, the audio and text backbones are run offline to extract the corresponding features. Excluding them leads to 1.36 TFLOPs and 680.31 GMACs for processing the full sequence. Most of the total computation is consumed by the visual backbone (ResNet50 [18]) alone, with 1.35 TFLOPs and 678.10 GMACs. For the text-based model, we use one prompt per frame with a context window size of 20. This amounts to a total of 597.52 TFLOPs and 298.75 TMACs, which is equivalent to 2.67 TFLOPs and 1.33 TMACs per frame. These results indicate that the text-based model consumes more than 300 times the computation needed by the feature-based model, which makes the text-based approach computationally expensive. This approach also requires substantial preprocessing by large models to extract additional textual information such as action units, emotions, and audio cues from the visual and audio modalities. Note that all measurements reported here depend on the sequence length and account only for the forward computations, excluding all preprocessing steps such as face detection.
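The per-frame figures and the cost ratio quoted above follow from dividing the sequence totals by the 224 frames. The sketch below redoes this arithmetic from the rounded totals of Table 9; variable names are illustrative. Because the inputs are rounded, the feature-based GFLOPs per frame comes out close to, but not exactly at, the reported 8.39.

```python
N_FRAMES = 224  # sequence length used for the FLOPs/MACs measurements

# Sequence-level totals as reported in Table 9.
feature_tflops, feature_gmacs = 1.87, 938.76
text_tflops, text_tmacs = 597.52, 298.75


def per_frame(total, n_frames=N_FRAMES):
    """Spread a sequence-level cost evenly over its frames."""
    return total / n_frames


# Feature-based model: convert TFLOPs to GFLOPs before dividing.
feature_gflops_per_frame = per_frame(feature_tflops * 1e3)
feature_gmacs_per_frame = per_frame(feature_gmacs)

# Text-based model: the per-frame cost stays in the tera range.
text_tflops_per_frame = per_frame(text_tflops)
text_tmacs_per_frame = per_frame(text_tmacs)

# How many times more forward computation the text-based model needs.
cost_ratio = text_tflops / feature_tflops

if __name__ == "__main__":
    print(f"feature-based: {feature_gflops_per_frame:.2f} GFLOPs/frame, "
          f"{feature_gmacs_per_frame:.2f} GMACs/frame")
    print(f"text-based: {text_tflops_per_frame:.2f} TFLOPs/frame, "
          f"{text_tmacs_per_frame:.2f} TMACs/frame")
    print(f"text-based / feature-based FLOPs ratio: {cost_ratio:.1f}")
```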

4.5 Challenges of Textualizing Modalities

Using deep feature-based models for multimodal ER is the most common approach. It is easy to use and it requires less effort and expertise from the user to implement. Adequate features can be automatically extracted and combined from different modalities by a feature extractor without the need for manual intervention. Since we provide full data to the model, we can assume that there is no loss of information at the model input. In addition, it is easily transferable to other tasks without much change. The publicly available pre-trained feature extractors make it more attractive. However, a well-known bottleneck of this approach is the multimodal and spatio-temporal fusion of diverse modalities [39].

Modality textualization is a very recent topic [17]. While it can leverage the recent progress of LLMs, it still faces several limitations before it can become a practical and competitive alternative to the feature-based method. We mention two main limitations. The first is the need for domain experts to select the cues to be extracted from each modality and to decide how they should be textualized. For example, cues other than AUs could be extracted from the visual modality, and there could be different ways to convert them into text. Handcrafting cues is challenging, as they are not guaranteed to provide an optimal textualization. Moreover, each modality requires its own textualization, which depends on the task. Changing the application or task requires new domain experts and new, more suitable textual cues. As a result, models are less transferable to other applications, which limits their usage.

Another issue is the loss of information. Textualization discretizes a modality from raw data into textual descriptions. This mapping will most likely lead to information loss and poor performance. This has been observed on C-EXPR-DB, where the feature-based approach achieves its best performance on the visual modality, whereas its text-based counterpart yields its lowest performance on the same modality, indicating a potential loss of information caused by our choice of textualization.

A main benefit of using a text-based approach is to leverage the potential of LLMs when dealing with a dataset with rich transcripts. Combining rich transcripts with textualized modalities may lead to higher ER accuracy than a feature-based method, as observed on MELD. However, conflicting modalities could lead to poor performance on compound emotions. Feature-based methods leverage fusion, especially late fusion, which may balance this conflict. In contrast, text-based approaches lack such explicit modules, as all modalities are processed indistinguishably by the same module. Although the authors of [17] motivated textualization as an easy way to perform multimodal fusion, this fusion could also be its limitation. Performing very early fusion by manually converting all modalities into text to be processed by the same module, such as an LLM, could diminish the benefit of a modality, as most of its information could already be lost. The LLM must then do most of the work to recover the missing information. In the feature-based approach, however, the specialized feature extractor of each modality does much of the work and produces reliable features, which eases the late fusion. We also note that the text-based approach is computationally expensive, with 597.52 TFLOPs for inference over a sequence of 224 frames, while the feature-based method requires 1.87 TFLOPs.

The choice of the best approach for CER or basic ER remains an open question. Similarly, the choice and design of textual cues of different modalities are still in their early stages. The feature-based models are easier to apply, yet text-based models can yield better results when dealing with rich text.

5 Conclusion

Multimodal ER is a challenging task, especially when recognizing the complex compound emotions that are captured in real-world unconstrained videos. The central question in this study is: how does textualized modeling perform compared to feature-based modeling for CER in videos? We performed several experiments to compare the performance of deep feature-based and text-based models on CER and basic ER datasets. Our results have uncovered several challenges related to modality textualization on C-EXPR-DB, where text transcripts are captured in the wild. Feature-based methods may still provide better accuracy in this case. However, a text-based approach may yield better results when videos have high-quality transcripts.

Acknowledgement

This work was supported by the Fonds de recherche du Québec – Santé (FRQS), the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Foundation for Innovation (CFI), and the Digital Research Alliance of Canada.

References

  • Aslam et al., [2024] Aslam, M., Zeeshan, M., Belharbi, S., Pedersoli, M., Koerich, A., Bacon, S., and Granger, E. (2024). Distilling privileged multimodal information for expression recognition using optimal transport. In International Conference on Automatic Face and Gesture Recognition.
  • Bai et al., [2018] Bai, S., Kolter, J., and Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR, abs/1803.01271.
  • Barsoum et al., [2016] Barsoum, E., Zhang, C., Canton-Ferrer, C., and Zhang, Z. (2016). Training deep networks for facial expression recognition with crowd-sourced label distribution. In ICMI.
  • Belharbi et al., [2024] Belharbi, S., Pedersoli, M., Koerich, A. L., Bacon, S., and Granger, E. (2024). Guided interpretable facial expression recognition via spatial action unit cues. In International Conference on Automatic Face and Gesture Recognition.
  • Cheng et al., [2024] Cheng, Z., Cheng, Z., He, J., Sun, J., Wang, K., Lin, Y., Lian, Z., Peng, X., and Hauptmann, A. (2024). Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. CoRR, abs/2406.11161.
  • Cheong et al., [2023] Cheong, J. H., Jolly, E., Xie, T., Byrne, S., Kenney, M., and Chang, L. J. (2023). Py-feat: Python facial expression analysis toolbox. Affective Science, 4(4):781–796.
  • Deng et al., [2021] Deng, D., Wu, L., and Shi, B. (2021). Iterative distillation for better uncertainty estimates in multitask emotion recognition. In ICCVw.
  • Deng et al., [2019] Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., and Zafeiriou, S. (2019). Retinaface: Single-stage dense face localisation in the wild. CoRR, abs/1905.00641.
  • Dettmers et al., [2023] Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). Qlora: Efficient finetuning of quantized llms. In NeurIPS.
  • Devlin et al., [2019] Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.
  • Dutta and Ganapathy, [2024] Dutta, S. and Ganapathy, S. (2024). Hcam – hierarchical cross attention model for multi-modal emotion recognition. CoRR, abs/2304.06910.
  • Ekman and Friesen, [1978] Ekman, P. and Friesen, W. V. (1978). Facial action coding system. Environmental Psychology & Nonverbal Behavior.
  • Ezzameli and Mahersia, [2023] Ezzameli, K. and Mahersia, H. (2023). Emotion recognition from unimodal to multimodal analysis: A review. Information Fusion, 99:101847.
  • Friesen and Ekman, [1978] Friesen, E. and Ekman, P. (1978). Facial action coding system: a technique for the measurement of facial movement. Palo Alto, 3(2):5.
  • Fu, [2024] Fu, Y. (2024). Ckerc : Joint large language models with commonsense knowledge for emotion recognition in conversation. CoRR, abs/2403.07260.
  • Guo et al., [2016] Guo, Y., Zhang, L., Hu, Y., He, X., and Gao, J. (2016). Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV.
  • Hasan et al., [2023] Hasan, M., Islam, M., Lee, S., Rahman, W., Naim, I., Khan, M., and Hoque, E. (2023). Textmi: Textualize multimodal information for integrating non-verbal cues in pre-trained language models. CoRR, abs/2303.15430.
  • He et al., [2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
  • Hershey et al., [2017] Hershey, S., Chaudhuri, S., Ellis, D., Gemmeke, J., Jansen, A., Moore, R., Plakal, M., Platt, D., Saurous, R., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. (2017). Cnn architectures for large-scale audio classification. In ICASSP.
  • Hou et al., [2022] Hou, M., Zhang, Z., and Lu, G. (2022). Multi-modal emotion recognition with self-guided modality calibration. In ICASSP.
  • Hsu et al., [2021] Hsu, W., Bolte, B., Tsai, Y. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio Speech and Language Processing, 29:3451–3460.
  • J et al., [2024] J, M. R., VM, K., Warrier, H., and Gupta, Y. (2024). Fine tuning LLM for enterprise: Practical guidelines and recommendations. CoRR, abs/2404.10779.
  • Ji et al., [2023] Ji, Y., Wang, J., Gong, Y., Zhang, L., Zhu, Y., Wang, H., Zhang, J., Sakai, T., and Yang, Y. (2023). MAP: multimodal uncertainty-aware vision-language pre-training model. In CVPR.
  • Kollias, [2022] Kollias, D. (2022). Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In CVPR.
  • [25] Kollias, D. (2023a). Abaw: Learning from synthetic data & multi-task learning challenges. In ECCV.
  • [26] Kollias, D. (2023b). Multi-label compound expression recognition: C-expr database & network. In CVPR.
  • Kollias et al., [2020] Kollias, D., Schulc, A., Hajiyev, E., and Zafeiriou, S. (2020). Analysing affective behavior in the first abaw 2020 competition. In International Conference on Automatic Face and Gesture Recognition.
  • [28] Kollias, D., Sharmanska, V., and Zafeiriou, S. (2024a). Distribution matching for multi-task learning of classification tasks: a large-scale study on faces & beyond. CoRR, abs/2401.01219.
  • Kollias et al., [2023] Kollias, D., Tzirakis, P., Baird, A., Cowen, A., and Zafeiriou, S. (2023). Abaw: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. In CVPR.
  • [30] Kollias, D., Tzirakis, P., Cowen, A., Zafeiriou, S., Shao, C., and Hu, G. (2024b). The 6th affective behavior analysis in-the-wild (abaw) competition. CoRR, abs/2402.19344.
  • [31] Kollias, D. and Zafeiriou, S. (2021a). Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. CoRR, abs/2103.15792.
  • [32] Kollias, D. and Zafeiriou, S. (2021b). Analysing affective behavior in the second abaw2 competition. In ICCV.
  • [33] Kollias, D., Zafeiriou, S., Kotsia, I., Dhall, A., Ghosh, S., Shao, C., and Hu, G. (2024c). 7th abaw competition: Multi-task learning and compound expression recognition. CoRR, abs/2407.03835.
  • Kuhnke et al., [2020] Kuhnke, F., Rumberg, L., and Ostermann, J. (2020). Two-stream aural-visual affect analysis in the wild. In International Conference on Automatic Face and Gesture Recognition.
  • Le et al., [2023] Le, H.-D., Lee, G.-S., Kim, S.-H., Kim, S., and Yang, H.-J. (2023). Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access, 11:14742–14751.
  • Lei et al., [2024] Lei, S., Dong, G., Wang, X., Wang, K., and Wang, S. (2024). Instructerc: Reforming emotion recognition in conversation with a retrieval multi-task llms framework. CoRR, abs/2309.11911.
  • Li et al., [2023] Li, B., Fei, H., Liao, L., Zhao, Y., Teng, C., Chua, T.-S., Ji, D., and Li, F. (2023). Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In International Conference on Multimedia.
  • Li et al., [2024] Li, S., Lian, H., Lu, C., Zhao, Y., Qi, T., Yang, H., Zong, Y., and Zheng, W. (2024). Temporal label hierachical network for compound emotion recognition. CoRR, abs/2407.12973.
  • Liang et al., [2024] Liang, P., Zadeh, A., and Morency, L. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):264.
  • Liao et al., [2023] Liao, J., Duan, H., Feng, K., Zhao, W., Yang, Y., and Chen, L. (2023). A light weight model for active speaker detection. In CVPR.
  • [41] Liu, C., Zhang, W., Qiu, F., Li, L., and Yu, X. (2024a). Affective behaviour analysis via progressive learning. CoRR, abs/2407.16945.
  • [42] Liu, X., Shen, K., Yao, J., Wang, B., Liu, M., An, L., Cui, Z., Feng, W., and Sun, X. (2024b). Compound expression recognition via multi model ensemble for the abaw7 challenge. CoRR, abs/2407.12257.
  • Liu et al., [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Lu et al., [2023] Lu, Z., Ozek, B., and Kamarthi, S. (2023). Transformer encoder with multiscale deep learning for pain classification using physiological signals. Frontiers in Physiology, 14:1294577.
  • Ma et al., [2023] Ma, H., Wang, J., Lin, H., Zhang, B., Zhang, Y., and Xu, B. (2023). A transformer-based model with self-distillation for multimodal emotion recognition in conversations. IEEE Transactions on Multimedia.
  • Matsumoto, [1992] Matsumoto, D. (1992). More evidence for the universality of a contempt expression. Motivation and Emotion, 16:363–368.
  • Meta LLaMA Team, [2024] Meta LLaMA Team (2024). Introducing Meta Llama 3: The most capable openly available LLM to date.
  • Nguyen et al., [2021] Nguyen, D., Nguyen, D., Zeng, R., Nguyen, T., Tran, S., Nguyen, T., Sridharan, S., and Fookes, C. (2021). Deep auto-encoders with sequential learning for multimodal dimensional emotion recognition. IEEE Transactions on Multimedia, 24:1313–1324.
  • Poria et al., [2019] Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., and Mihalcea, R. (2019). MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Conference of the Association for Computational Linguistics, pages 527–536.
  • Praveen and Alam, [2024] Praveen, R. and Alam, J. (2024). Recursive joint cross-modal attention for multimodal fusion in dimensional emotion recognition. In CVPRw.
  • Praveen et al., [2021] Praveen, R., Granger, E., and Cardinal, P. (2021). Cross attentional audio-visual fusion for dimensional emotion recognition. In International Conference on Automatic Face and Gesture Recognition.
  • Radford et al., [2023] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In ICML.
  • Savchenko, [2024] Savchenko, A. (2024). HSEmotion team at the 7th ABAW challenge: Multi-task learning and compound facial expression recognition. CoRR, abs/2407.13184.
  • Schoneveld et al., [2021] Schoneveld, L., Othmani, A., and Abdelkawy, H. (2021). Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognition Letters, 146:1–7.
  • Siriwardhana et al., [2020] Siriwardhana, S., Kaluarachchi, T., Billinghurst, M., and Nanayakkara, S. (2020). Multimodal emotion recognition with transformer-based self supervised feature fusion. IEEE Access, 8:176274–176285.
  • Touvron et al., [2023] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
  • Tran et al., [2018] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In CVPR.
  • Tran and Soleymani, [2022] Tran, M. and Soleymani, M. (2022). A pre-trained audio-visual transformer for emotion recognition. In ICASSP.
  • Vaswani et al., [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
  • Vijayaraghavan et al., [2024] Vijayaraghavan, G., T., M., Dhanasekaran, P., and E., U. (2024). Multimodal emotion recognition with deep learning: Advancements, challenges, and future directions. Information Fusion, 105:102218.
  • Wagner et al., [2023] Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., and Schuller, B. (2023). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. TPAMI, pages 1–13.
  • Waligora et al., [2024] Waligora, P., Aslam, H., Zeeshan, O., Belharbi, S., Koerich, A. L., Pedersoli, M., Bacon, S., and Granger, E. (2024). Joint multimodal transformer for emotion recognition in the wild. In CVPRw.
  • Wang et al., [2022] Wang, H., Zhang, J., Chen, Y., Ma, C., Avery, J., Hull, L., and Carneiro, G. (2022). Uncertainty-aware multi-modal learning via cross-modal random network prediction. In ECCV.
  • Werner et al., [2014] Werner, P., Al-Hamadi, A., Niese, R., Walter, S., Gruss, S., and Traue, H. (2014). Automatic pain recognition from video and biomedical signals. In International Conference on Pattern Recognition.
  • Xu et al., [2004] Xu, M., Duan, L.-Y., Cai, J., Chia, L.-T., Xu, C., and Tian, Q. (2004). HMM-based audio keyword generation. In Pacific-Rim Conference on Multimedia.
  • Yu et al., [2024] Yu, F., Guo, J., Wu, Z., and Dai, X. (2024). Emotion-anchored contrastive learning framework for emotion recognition in conversation. CoRR, abs/2403.20289.
  • Yun et al., [2024] Yun, T., Lim, H., Lee, J., and Song, M. (2024). TelME: Teacher-leading multimodal fusion network for emotion recognition in conversation. CoRR, abs/2401.12987.
  • Zadeh et al., [2018] Zadeh, A., Liang, P. P., Poria, S., Cambria, E., and Morency, L. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Association for Computational Linguistics.
  • Zhang et al., [2022] Zhang, S., An, R., Ding, Y., and Guan, C. (2022). Continuous emotion recognition using visual-audio-linguistic information: A technical report for ABAW3. In CVPRw.
  • Zhang et al., [2024a] Zhang, W., Qiu, F., Liu, C., Li, L., Du, H., Guo, T., and Yu, X. (2024a). Affective behaviour analysis via integrating multi-modal knowledge. In CVPRw.
  • Zhang et al., [2024b] Zhang, Y., Li, B., Liu, H., Lee, Y., Gui, L., Fu, D., Feng, J., Liu, Z., and Li, C. (2024b). LLaVA-NeXT: A strong zero-shot video understanding model.
  • Zhao et al., [2023] Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J., and Wen, J. (2023). A survey of large language models. CoRR, abs/2303.18223.
  • Zhi et al., [2021] Zhi, R., Zhou, C., Yu, J., Li, T., and Zamzmi, G. (2021). Multimodal-based stream integrated neural networks for pain assessment. IEICE Transactions on Information and Systems, 104-D(12):2184–2194.
  • Zhou et al., [2023] Zhou, W., Lu, J., Xiong, Z., and Wang, W. (2023). Leveraging TCN and transformer for effective visual-audio fusion in continuous emotion recognition. In CVPRw.
  • Zong et al., [2023] Zong, Y., Aodha, O. M., and Hospedales, T. (2023). Self-supervised multimodal learning: A survey. CoRR, abs/2304.01008.