1. Introduction
Sentiment analysis traditionally focuses on text to determine the underlying emotional tone as positive, negative, or neutral [1,2,42]. With the advent of multimedia sharing platforms, there has been an upsurge in video content featuring expressive modalities beyond text—spoken words [36,43], vocal dynamics, and visual cues. Multimodal sentiment analysis seeks to combine these elements to provide a more comprehensive understanding of sentiments expressed in videos [3,4].
Multimodal sentiment analysis is an increasingly important area of study within artificial intelligence, focusing on the interpretation and analysis of emotions from multiple types of data, such as text, audio, and video. This multidisciplinary field leverages advances in natural language processing, computer vision, and audio analysis to understand the nuances of human sentiment more comprehensively. Traditional sentiment analysis dealt primarily with text. However, with the growth of video content on platforms like YouTube, Instagram, and TikTok, analyzing sentiment from audio and visual cues in addition to text has become essential. Each mode—text, audio, visual—provides complementary information which, when integrated, offers a fuller understanding of the communicator's emotional state [1,5,6].
While the multimodal approach offers depth, the annotated datasets available for training are modest in size and variability [1,5,6]. This limitation often causes models to associate sentiment with incidental attributes of speakers, such as wearing glasses, rather than the intended sentiment indicators [49,50]. A model trained on sparse data that links negative sentiment to eyewear illustrates this dataset-bias problem. Such biases act as confounding factors, statistically skewing results and reducing the efficacy of sentiment analysis models across datasets [7,8]. An examination of the MOSI dataset revealed significant dependencies between the sentiments expressed and individual identities, indicating potential biases in model training that could lead to erroneous sentiment predictions based on personal characteristics rather than actual sentiment [6,58].
One of the major challenges in multimodal sentiment analysis is the fusion of different modalities; how best to integrate these sources of data remains a significant research question. Early fusion and late fusion are common strategies [25,27]: the former integrates features at the beginning of the process [59,60], while the latter combines the outputs near the end. More sophisticated approaches, such as hybrid fusion, attempt to capture the strengths of both methods [11]. Another challenge is the scarcity of labeled multimodal datasets, which are crucial for training machine learning models. The cost of obtaining high-quality annotations across multiple modalities can be prohibitive, limiting the availability of large-scale datasets [15].
To address these challenges, this paper proposes the Enhanced Multimodal Feature Learning (EMFL) approach, a robust framework designed for discriminative neural networks, particularly convolutional neural networks. EMFL involves a dual-phase process in which the first phase (Selection) identifies and isolates confounding factors, and the second phase (Addition) minimizes their impact by injecting stochastic disturbances into the data representation. This process ensures that the learning mechanism prioritizes relevant sentiment indicators over misleading biases.
We extensively evaluate the effectiveness of EMFL through a series of experiments conducted in a person-independent manner, ensuring that no individual appears in both training and testing datasets. Our findings confirm that EMFL not only improves within-dataset accuracy but also ensures robust generalization across multiple external datasets.
2. Related Work
Research in multimodal data utilization spans a broad spectrum of applications aimed at enhancing our understanding of, and interaction with, human behavioral patterns. These applications include, but are not limited to, person detection and identification [9,10,62], human action recognition [11,12,64], and face recognition [13,14]. Each of these fields contributes to the foundational technologies essential for advanced sentiment analysis systems.
Sentiment analysis initially focused on textual data, exploring various levels of granularity from words [15,73] to phrases [16], sentences [17], and entire documents [2,70]. The field subsequently integrated deep learning models, notably recursive neural networks, which significantly enhanced the ability to understand complex emotional expressions in text.
Recent advancements in multimodal sentiment analysis emphasize the importance of effective modality integration techniques for improving the accuracy and robustness of sentiment detection. Researchers have explored various approaches to modality fusion, including feature-level fusion, decision-level fusion, and hybrid approaches that aim to leverage the unique advantages of each method [12,51]. Feature-level fusion combines features from different modalities before inputting them into the classifier, enhancing the model's ability to capture interdependencies between modalities [14]. Decision-level fusion, by contrast, integrates the outputs of separate classifiers for each modality at a later stage, which has been shown to increase the system's resilience to errors in individual modal predictors [17].
Moreover, hybrid fusion techniques, which combine feature-level and decision-level methods to optimize performance, have attracted significant interest. These approaches typically employ machine learning techniques such as Support Vector Machines (SVMs) and neural networks to dynamically weight the contribution of each modality based on its reliability and relevance to the sentiment analysis task [23]. This dynamic weighting can adapt to the context of the data, potentially offering superior performance over static fusion methods.
In the realm of audio data, traditional methods often involve transcribing spoken content to text, followed by sentiment analysis [18]. There is also growing interest in assessing emotional state directly from vocal characteristics without transcription [19]. This direct approach aligns with work on recognizing emotional states through the Facial Action Coding System, which has been foundational for analyzing facial expressions [20,81]. The advent of convolutional neural networks has further advanced this area by pinpointing affective regions in images to assess sentiment [21].
The integration of textual, acoustic, and visual data—commonly referred to as multimodal sentiment analysis—has recently seen significant interest and development [1]. Various methods have been explored, ranging from simple feature concatenation to sophisticated hybrid models that fuse modalities to predict sentiment more accurately [22,23,24]. Among these, convolutional neural networks stand out for their effectiveness in multimodal data fusion, achieving state-of-the-art results [25].
Building on this body of work, we propose the Enhanced Multimodal Feature Learning (EMFL) methodology, which significantly improves the generalizability of neural networks across different data modalities. Our approach is designed to address the inherent challenges of multimodal sentiment analysis, such as modal disparity and data sparsity. We present extensive experimental results demonstrating the superiority of EMFL in enhancing prediction accuracy not only within individual modalities—verbal, acoustic, and visual—but also in their integrated form. The next section details the EMFL methodology, focusing on its selection and addition phases designed to mitigate bias and enhance generalization.
3. Enhanced Multimodal Feature Learning (EMFL)
The primary objective of our research is to enhance the generalizability of multimodal sentiment analysis models by encouraging the model to prioritize sentiment-related features (e.g., individuals smiling when conveying positive emotions) over identity-specific attributes (e.g., wearing glasses). By doing so, we aim to mitigate the influence of confounding factors that may otherwise bias the sentiment prediction.
We formalize this challenge by introducing an input feature matrix $X$ of dimensions $n \times p$, where $n$ represents the number of utterances and $p$ denotes the total number of features extracted from the verbal, acoustic, and visual modalities. Additionally, we define a sentiment vector $y$ of size $n$, corresponding to the sentiment label of each utterance. To account for speaker identities, we introduce a one-hot encoded matrix $Z$ of size $n \times m$, where $m$ is the number of unique individuals in the dataset.
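As a concrete illustration of this notation, the following sketch builds toy versions of $X$, $y$, and $Z$ with NumPy; the values of $n$, $p$, and $m$ are placeholders chosen purely for illustration.

```python
import numpy as np

n, p, m = 1250, 500, 62   # utterances, fused feature dimension, unique speakers (illustrative sizes)

X = np.random.randn(n, p)                 # n x p matrix of verbal + acoustic + visual features
y = np.random.randint(0, 2, size=n)       # sentiment label per utterance (0 = negative, 1 = positive)

speaker_ids = np.random.randint(0, m, n)  # which of the m speakers produced each utterance
Z = np.zeros((n, m))                      # n x m one-hot identity matrix
Z[np.arange(n), speaker_ids] = 1.0
```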
Our proposed EMFL framework is designed to augment an existing (pre-trained) discriminative neural network, enhancing its robustness against confounding variables. To formally describe EMFL, we identify two fundamental components commonly found in discriminative neural network classifiers (such as Convolutional Neural Networks, CNNs): a representation learning component and a classification component.
For simplicity, let us denote the representation learning component by $g(\cdot;\theta_g)$, where $\theta_g$ signifies its parameters. We hypothesize that confounding factors are confined to a subset of the dimensions within $g(X;\theta_g)$. Similarly, the classification component is denoted by $f(\cdot;\theta_f)$, with $\theta_f$ representing its parameters. Consequently, the complete neural network classifier can be expressed as $f(g(X;\theta_g);\theta_f)$.
Within the EMFL framework, the representation learner $g$ captures identity-related features, which we term identity-related confounding dimensions. The EMFL approach involves two main steps: first, identifying these confounding dimensions, and second, diminishing their impact by introducing noise.
To accurately select the identity-related confounding dimensions, EMFL incorporates a lightweight neural network component denoted by $h(\cdot;\theta_h)$, where $\theta_h$ represents its parameters. This component is tasked with predicting the identity-related confounding dimensions from the individual identities $Z$, by minimizing the discrepancy between $h(Z;\theta_h)$ and $g(X;\theta_g)$. As a result, $h$ effectively isolates the identity-related confounding dimensions within $g(X;\theta_g)$.
To compel the model to disregard the identity-related confounding dimensions, EMFL introduces Gaussian noise into these specific dimensions while simultaneously minimizing the prediction error. This ensures that the classification component $f$ learns to focus on the relevant representations, effectively ignoring the noised confounding dimensions. The noise addition is facilitated through a Gaussian Sampling Layer, as described in [26].
In the EMFL architecture, the uppermost section represents the Decision Making Layers (DML, denoted $f$ in the equations), responsible for making the final prediction. In our experiments, the DML is structured as a single-layer traditional neural network followed by a logistic regression layer. The Gaussian Sampling Layer (GSL) assists the DML in fine-tuning parameters to differentiate between relevant and confounding representations. The lower left section, highlighted in purple, performs Mixed Representation Learning (MRL, denoted $g$ in the equations) and can utilize any representation learner such as CNNs, autoencoders, LSTMs, or even pre-trained models; in our experiments, we employ a state-of-the-art CNN as outlined in [25]. The lower right section, marked in red, undertakes Confounding Representation Learning (CRL, denoted $h$ in the equations). For our experiments, the CRL is implemented as a single-layer traditional neural network.
The Gaussian Sampling Layer integrates EMFL into the original neural network architecture through a Gaussian sampling process, where the mean is determined by the representations learned from the original bottom layers, and the variance is dictated by the representations learned from EMFL.
3.1. Gaussian Sampling Layer
The Gaussian Sampling Layer enhances a conventional neural network layer by incorporating a variance term, as discussed in [26]. Unlike a traditional neural network layer, whose output $T$ is a deterministic function of the input $M$:
$$T = a(WM + B), \tag{1}$$
where $W$, $B$, and $a(\cdot)$ denote the weights, bias, and activation function respectively, the Gaussian Sampling Layer produces its output by sampling from a Gaussian distribution defined by the deterministic features input $M$ and the variance features input $S$:
$$T \sim \mathcal{N}\big(a(WM + B),\ \mathrm{diag}(S \circ S)\big).$$
For brevity, we denote this sampled output as $\mathrm{GSL}(M, S)$ in the subsequent discussions; when $S = 0$, the layer reduces to the deterministic output of Equation 1.
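A minimal PyTorch sketch of such a layer is given below, using the reparameterization $T = a(WM+B) + \epsilon \circ S$ with $\epsilon \sim \mathcal{N}(0, I)$, so that $S$ acts as an element-wise standard deviation and the layer reduces to a standard deterministic layer when $S$ is zero or absent. The layer type and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianSamplingLayer(nn.Module):
    """T = a(W M + B) + eps * S with eps ~ N(0, I): the mean comes from the
    deterministic input M, the element-wise std from the variance input S."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)   # weights W and bias B
        self.act = nn.ReLU()                       # activation a(.)

    def forward(self, M, S=None):
        mean = self.act(self.linear(M))
        if S is None or not self.training:
            return mean                            # S = 0 at validation/test time: deterministic output
        eps = torch.randn_like(mean)
        return mean + eps * S                      # sample around the mean with element-wise std S

gsl = GaussianSamplingLayer(128, 128)
M = torch.randn(4, 128)                            # mixed representation batch
S = torch.rand(4, 128)                             # confounding-derived std (must match the output size)
T_train = gsl(M, S)                                # stochastic during training
gsl.eval()
T_test = gsl(M)                                    # deterministic at test time
```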
With EMFL integrated into the original model architecture, the ultimate objective of parameter learning is to optimize the following function:
$$\min_{\theta_g,\,\theta_f,\,\theta_h}\ \mathcal{L}\Big(f\big(\mathrm{GSL}\big(g(X;\theta_g),\,h(Z;\theta_h)\big);\,\theta_f\big),\ y\Big)\ +\ \lambda\,\lVert\theta_h\rVert_1,$$
where $\mathcal{L}$ denotes the classification loss and $\lambda$ controls the sparsity of $\theta_h$. This optimization problem is inherently non-convex and poses significant challenges even when $g$, $f$, and $h$ are linear functions. To address this complexity, we propose a heuristic algorithm, Enhanced Multimodal Feature Learning (EMFL), which iteratively navigates the search space for $\theta_g$, $\theta_f$, and $\theta_h$ to achieve the optimization goal.
3.2. Enhanced Multimodal Feature Learning Algorithm
A fundamental prerequisite for our Enhanced Multimodal Feature Learning (EMFL) approach is the initial training of a discriminative neural classifier. In our experimental setup, this is accomplished by minimizing the following loss function:
$$\min_{\theta_g,\,\theta_f}\ \mathcal{L}\Big(f\big(g(X;\theta_g);\,\theta_f\big),\ y\Big), \tag{2}$$
where $\mathcal{L}$ is a standard classification loss such as the cross-entropy. This loss function is a standard choice in discriminative neural networks, as evidenced in [27].
Our EMFL methodology is versatile and can be seamlessly integrated into various deep learning models, provided that the model can be decomposed into a classification component $f$ and a representation learning component $g$. Moreover, EMFL can be applied to pre-existing published models to enhance their performance without necessitating significant architectural changes.
The EMFL algorithm is structured into three distinct phases: 1) Representation Learning Phase, 2) Confounding Representation Matching Phase, and 3) Confounding Representation Elimination Phase.
These phases are delineated based on the specific representations they target. The overarching aim is to optimize the objective in Equation 2, which mirrors that of a traditional neural network. However, when combined with the optimization of Equations 3 and 4 below, our EMFL algorithm significantly enhances the generalization capabilities of both $g$ and $f$, thereby enabling superior performance on unseen test data.
The Gaussian Sampling Layer (GSL) is utilized exclusively during the training phase, as its variance term is derived from the confounding representation. During the validation and testing phases, identity information $Z$ is unavailable, so $Z$ is represented as a vector of zeros, and the GSL output reduces to its deterministic mean, $T = a(WM + B)$. The subsequent layers operate as detailed below.
The lower left section comprises the Mixed Representation Learning (MRL) models, which process the input $X$ to generate a comprehensive mixed representation. The representation learners can be CNNs, autoencoders, or other suitable architectures; in our experiments, we adopt a state-of-the-art CNN configuration as described in [25]. The feature set $X$ is consistently available across the training, validation, and testing phases.
Conversely, the lower right section consists of Confounding Representation Learning (CRL) models, which utilize the input Z to extract confounding representations. In our experiments, the CRL is implemented using a single-layer traditional neural network. It is important to note that Z is only accessible during the training phase as a set of one-hot vectors encoding individual identities. During validation and testing phases, Z is a vector of zeros.
3.2.1. Representation Learning Phase
The Representation Learning Phase focuses on fine-tuning the parameters of the Mixed Representation Learning (MRL) component to solve Equation 2. In this phase, only the input feature matrix $X$ is utilized, and the representation is computed as $g(X;\theta_g)$. This phase is analogous to training a standard sentiment analysis deep learning model without considering confounding factors.
The figure illustrates the Decision Making Layers (DML) alongside their associated weights (represented as lines) and the intermediate representations (depicted as squares) that serve as inputs. The representation fed into the DML is a composite of both relevant (blue squares) and confounding (red squares) representations. At this stage, the DML lacks the capability to differentiate between these two types of representations, resulting in all its weights remaining active and contributing equally to the final prediction.
This phase is primarily intended to adjust the parameters of the MRL and DML components. Consequently, if pre-trained models are available, this phase can be omitted to expedite the training process.
3.2.2. Confounding Representation Matching Phase
Upon establishing the initial representation $g(X;\theta_g)$, the Confounding Representation Matching Phase seeks to identify and isolate the identity-related confounding dimensions. This is achieved by optimizing a new loss function specifically designed to uncover these dimensions. The optimization is carried out by tuning the parameters $\theta_h$ using the following loss function:
$$\min_{\theta_h}\ \big\lVert g(X;\theta_g) - h(Z;\theta_h) \big\rVert_2^2\ +\ \lambda\,\lVert \theta_h \rVert_1. \tag{3}$$
Here, $\lambda$ is a hyperparameter that controls the strength of the sparsity regularization imposed on $\theta_h$. This regularization is crucial to prevent overfitting, especially given that the output dimension of $h$ typically exceeds its input dimension.
During this phase, both X and Z are available as inputs. However, only the parameters of the confounding representation learner are updated.
The objective of this phase is to selectively identify the identity-related confounding dimensions within the original representation $g(X;\theta_g)$. By minimizing the difference between $h(Z;\theta_h)$ and $g(X;\theta_g)$, and given that $Z$ encapsulates solely identity information, the optimization ensures that $h(Z;\theta_h)$ aligns with the identity-related components of $g(X;\theta_g)$. The L1 regularization term promotes sparsity in $\theta_h$, ensuring that only the most pertinent dimensions are selected as confounders. The original model's weights (represented by the purple circle) remain fully active and interconnected with every dimension. In contrast, only a subset of weights within $h$ (depicted by the red circle) are active, corresponding to the identified identity-related confounding dimensions.
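A minimal sketch of one matching-phase update is shown below, assuming PyTorch modules for $g$ and $h$ as in the earlier fragment. Only $\theta_h$ receives gradients, and the L1 penalty on $h$'s weights corresponds to the sparsity term in Equation 3; sizes and the learning rate are illustrative.

```python
import torch
import torch.nn as nn

p, m, d = 500, 62, 128                          # illustrative sizes
g = nn.Sequential(nn.Linear(p, d), nn.ReLU())   # mixed representation learner (frozen in this phase)
h = nn.Linear(m, d, bias=False)                 # confounding representation learner

opt_h = torch.optim.Adam(h.parameters(), lr=1e-3)
lam = 1e-3                                      # sparsity strength lambda in Equation 3

x = torch.randn(32, p)                          # batch of utterance features
z = torch.eye(m)[torch.randint(0, m, (32,))]    # one-hot speaker identities

with torch.no_grad():                           # g is not updated in this phase
    rep = g(x)

match_loss = ((h(z) - rep) ** 2).mean() + lam * h.weight.abs().sum()
opt_h.zero_grad()
match_loss.backward()
opt_h.step()
```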
3.2.3. Confounding Representation Elimination Phase
Following the successful identification of the identity-related confounding dimensions in the Confounding Representation Matching Phase, the next step is to train a new neural network classifier that effectively "masks" these confounding dimensions. This is accomplished by introducing Gaussian noise to the identified dimensions, thereby rendering them non-informative. The classifier is then trained to focus on the remaining, relevant dimensions.
The Confounding Representation Elimination Phase is governed by the following loss function:
$$\min_{\theta_f}\ \mathcal{L}\Big(f\big(g(X;\theta_g)\ +\ \epsilon \circ h(Z;\theta_h);\ \theta_f\big),\ y\Big), \tag{4}$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and $\circ$ denotes the element-wise product.
In this phase, only the parameters $\theta_f$ of the classification component are updated. The input to $f$ now comprises the original representation $g(X;\theta_g)$ augmented with the noised confounding representation $\epsilon \circ h(Z;\theta_h)$. The introduction of noise ensures that the identity-related confounding dimensions do not carry meaningful information, compelling the classifier to ignore these dimensions and rely solely on the relevant features for accurate sentiment prediction. Because the identity-related confounding dimensions are contaminated with Gaussian noise, the weights are optimized to concentrate on the remaining, informative dimensions.
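The following sketch illustrates one elimination-phase update under the same assumed modules: the frozen $g$ and $h$ provide the mixed representation and the confounding dimensions, Gaussian noise of magnitude sigma is injected through the element-wise product of Equation 4, and only the classifier parameters are updated. All sizes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

p, m, d = 500, 62, 128
g = nn.Sequential(nn.Linear(p, d), nn.ReLU())   # frozen representation learner
h = nn.Linear(m, d, bias=False)                 # frozen confounder predictor (already matched)
f = nn.Linear(d, 2)                             # classifier to be retrained in this phase

opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
sigma = 1.0                                     # noise magnitude sigma in Equation 4
ce = nn.CrossEntropyLoss()

x = torch.randn(32, p)
z = torch.eye(m)[torch.randint(0, m, (32,))]
y = torch.randint(0, 2, (32,))

with torch.no_grad():
    rep = g(x)                                  # relevant + confounding dimensions
    conf = h(z)                                 # identity-related confounding dimensions
noisy = rep + sigma * torch.randn_like(conf) * conf   # contaminate the confounding dimensions

loss = ce(f(noisy), y)
opt_f.zero_grad()
loss.backward()
opt_f.step()
```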
It is important to note that EMFL introduces an additional set of parameters ($\theta_h$) to be trained and requires the selection of two hyperparameters: $\lambda$ in Equation 3 and $\sigma$ in Equation 4. Strategies for selecting optimal values for $\lambda$ and $\sigma$ are elaborated upon in the supplementary material.
3.3. Algorithmic Workflow of EMFL
The EMFL algorithm operates through an iterative process encompassing the three aforementioned phases. The workflow is summarized as follows:
Initialization: Begin with a pre-trained discriminative neural network, comprising the representation learner $g$ and the classifier $f$.
Representation Learning Phase: Fine-tune the parameters $\theta_g$ and $\theta_f$ by minimizing the primary loss function (Equation 2), thereby optimizing the mixed representation learning without considering confounders.
Confounding Representation Matching Phase: Introduce the confounding representation learner $h$ and optimize $\theta_h$ by minimizing the loss function in Equation 3. This phase isolates the identity-related confounding dimensions within the representation.
Confounding Representation Elimination Phase: Incorporate Gaussian noise into the identified confounding dimensions and retrain the classifier $f$ by minimizing the loss function in Equation 4. This step ensures that the classifier learns to ignore the noised confounders.
Iteration: Repeat steps 2 through 4 until convergence is achieved, i.e., when further iterations do not yield significant improvements in performance.
Final Model: The resultant model, now robust against confounding factors, is evaluated on unseen test data to assess its generalization capabilities.
Through this iterative process, EMFL effectively disentangles relevant sentiment-associated features from identity-related confounders, thereby enhancing the model’s ability to generalize across diverse datasets and scenarios.
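To make the workflow concrete, the sketch below assembles the three phases into one iterative loop, assuming PyTorch modules g, f, and h as in the earlier fragments, a data loader yielding (features, one-hot identities, labels), and illustrative hyperparameters; it is a simplified rendering of the procedure, not our exact implementation.

```python
import torch
import torch.nn as nn

def emfl_train(g, f, h, loader, n_rounds=5, lam=1e-3, sigma=1.0, lr=1e-3):
    """Iterate the three EMFL phases; `loader` yields (x, z, y) mini-batches."""
    ce = nn.CrossEntropyLoss()
    opt_gf = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=lr)
    opt_h = torch.optim.Adam(h.parameters(), lr=lr)
    opt_f = torch.optim.Adam(f.parameters(), lr=lr)

    for _ in range(n_rounds):
        # Phase 1: representation learning (Equation 2)
        for x, z, y in loader:
            loss = ce(f(g(x)), y)
            opt_gf.zero_grad(); loss.backward(); opt_gf.step()

        # Phase 2: confounding representation matching (Equation 3)
        for x, z, y in loader:
            with torch.no_grad():
                rep = g(x)
            loss = ((h(z) - rep) ** 2).mean() + lam * sum(w.abs().sum() for w in h.parameters())
            opt_h.zero_grad(); loss.backward(); opt_h.step()

        # Phase 3: confounding representation elimination (Equation 4)
        for x, z, y in loader:
            with torch.no_grad():
                rep, conf = g(x), h(z)
            noisy = rep + sigma * torch.randn_like(conf) * conf
            loss = ce(f(noisy), y)
            opt_f.zero_grad(); loss.backward(); opt_f.step()
    return g, f, h
```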
3.4. Advantages of EMFL
The Enhanced Multimodal Feature Learning (EMFL) framework offers several key advantages:
Robustness to Confounders: By explicitly identifying and mitigating the influence of identity-related confounding dimensions, EMFL ensures that the sentiment predictions are not biased by irrelevant identity features.
Versatility: EMFL is model-agnostic and can be integrated into a wide range of deep learning architectures, including CNNs, LSTMs, and autoencoders, as well as pre-trained models, thereby broadening its applicability.
Improved Generalization: By focusing on sentiment-relevant features, EMFL enhances the model’s ability to generalize to new, unseen data, thereby improving overall prediction performance.
Scalability: The framework is scalable to large datasets with numerous features and identities, making it suitable for real-world applications where data complexity is high.
Ease of Integration: The three-phase approach of EMFL allows for straightforward integration into existing training pipelines without necessitating significant architectural modifications.
These advantages make EMFL a compelling approach for advancing the state-of-the-art in multimodal sentiment analysis and potentially other domains where confounding factors may impede model performance.
3.5. Implementation Considerations
When implementing the EMFL framework, several considerations must be addressed to ensure optimal performance:
Hyperparameter Tuning: The selection of appropriate values for the hyperparameters $\lambda$ and $\sigma$ is crucial. These parameters control the sparsity of the confounding dimension selection and the magnitude of the introduced noise, respectively. Techniques such as cross-validation and grid search can be employed to identify optimal values (see the sketch after this list).
Computational Overhead: Introducing additional components such as the confounding representation learner and the Gaussian Sampling Layer (GSL) may increase computational requirements. Efficient implementation strategies and hardware acceleration can mitigate potential performance bottlenecks.
Data Quality and Representation: The effectiveness of EMFL is contingent upon the quality and representativeness of the input features. Ensuring comprehensive and relevant feature extraction across all modalities is essential for the successful identification of confounding dimensions.
Scalability to Multiple Confounders: While EMFL is designed to handle identity-related confounders, extending the framework to account for multiple types of confounders may require additional modifications and complexity.
Evaluation Metrics: Employing appropriate evaluation metrics that accurately reflect the model’s ability to generalize and its robustness to confounders is vital. Metrics such as accuracy, F1-score, and area under the ROC curve (AUC) can provide comprehensive insights into model performance.
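As a simple illustration of the hyperparameter tuning point above, a grid search over $\lambda$ and $\sigma$ could look like the following; `build_models`, `emfl_train`, `evaluate`, `train_loader`, and `val_loader` are hypothetical placeholders standing in for a model constructor, the training loop sketched in Section 3.3, a validation-accuracy routine, and the data loaders.

```python
import itertools

lambdas = [1e-4, 1e-3, 1e-2]   # candidate sparsity strengths for Equation 3
sigmas = [0.5, 1.0, 2.0]       # candidate noise magnitudes for Equation 4

best_config, best_acc = None, -1.0
for lam, sigma in itertools.product(lambdas, sigmas):
    g, f, h = build_models()                                 # hypothetical constructor for g, f, h
    emfl_train(g, f, h, train_loader, lam=lam, sigma=sigma)  # training loop from the Section 3.3 sketch
    acc = evaluate(g, f, val_loader)                         # hypothetical validation-accuracy routine
    if acc > best_acc:
        best_config, best_acc = (lam, sigma), acc
print("best (lambda, sigma):", best_config, "validation accuracy:", best_acc)
```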
Addressing these considerations during implementation will enhance the efficacy and reliability of the EMFL framework in practical applications.
4. Experiments
In this section, we conduct a comprehensive series of experiments across three distinct datasets to evaluate the efficacy of our proposed Enhanced Multimodal Feature Learning (EMFL) framework in enhancing the generalizability of discriminative neural classifiers. The primary metric for assessing generalizability involves cross-dataset validation, wherein models trained on one dataset are exclusively tested on the remaining two datasets. This approach ensures that the models are evaluated on unseen data distributions, thereby providing a robust measure of their generalization capabilities. All experiments adhere to a person-independent methodology, ensuring that no individual present in the training data appears in the test datasets, thereby eliminating potential data leakage and bias.
Table 1. Classification accuracy across three datasets for CNN and EMFL-CNN models across various modalities and their multimodal integrations. Models are trained and validated on a subset of 62 individuals from the MOSI dataset, and subsequently evaluated on test sets comprising 31 individuals from MOSI, 47 individuals from YouTube, and 55 individuals from MOUD, respectively.
| Modality group | Modality | MOSI (within) CNN | MOSI (within) EMFL-CNN | YouTube (across) CNN | YouTube (across) EMFL-CNN | MOUD (across) CNN | MOUD (across) EMFL-CNN |
|---|---|---|---|---|---|---|---|
| Single Modality | Text | 0.678 | 0.732 | 0.605 | 0.657 | 0.522 | 0.569 |
| | Audio | 0.588 | 0.618 | 0.441 | 0.564 | 0.455 | 0.549 |
| | Video | 0.572 | 0.636 | 0.492 | 0.549 | 0.555 | 0.548 |
| Double Modalities | Text+Audio | 0.687 | 0.725 | 0.642 | 0.652 | 0.515 | 0.574 |
| | Text+Video | 0.706 | 0.730 | 0.642 | 0.667 | 0.542 | 0.574 |
| | Audio+Video | 0.661 | 0.621 | 0.452 | 0.559 | 0.533 | 0.554 |
| All Modalities | Text+Audio+Video | 0.715 | 0.730 | 0.611 | 0.667 | 0.531 | 0.574 |
4.1. Model Architectures
To thoroughly assess the performance improvements introduced by the EMFL framework, we compare the following models:
CNN: We utilize a state-of-the-art seven-layer Convolutional Neural Network (CNN) architecture, previously established for multimodal sentiment analysis [25]. This CNN serves as our baseline model, leveraging its proven efficacy in capturing intricate patterns across multiple data modalities.
EMFL-CNN: Building upon the baseline CNN, we integrate the EMFL framework to enhance its generalizability. After the CNN is fully trained, EMFL is applied to mitigate the influence of confounding factors, thereby refining the model's ability to focus on sentiment-relevant features. The confounding representation learner $h$ within EMFL is implemented as a single-layer perceptron [27], ensuring effective identification and suppression of identity-related confounders.
4.2. Datasets Utilized
Our experimental evaluation is conducted on three prominent multimodal sentiment analysis datasets:
MOSI (Multimodal Opinion-level Sentiment Intensity): Comprising 93 YouTube videos, the MOSI dataset encapsulates opinions from 93 unique individuals. It contains 2,199 utterances, each manually segmented and annotated for sentiment intensity [6]. The dataset is well suited for sentiment analysis tasks due to its diverse expressions and comprehensive annotations.
YouTube: This dataset consists of 47 opinionated YouTube videos, encompassing a total of 280 utterances. Each video reflects the sentiments of one distinct individual, with sentiments manually annotated [1]. The YouTube dataset is characterized by its varying recording qualities and processing methodologies, providing a challenging testbed for cross-dataset generalization.
MOUD (Multimodal Opinion Utterance Dataset): The MOUD dataset includes 498 Spanish-language opinion utterances sourced from 55 unique individuals [5]. This dataset introduces an additional layer of complexity due to the necessity of translating Spanish transcripts into English, thereby testing the robustness of models across linguistic variations.
Despite originating primarily from the YouTube platform, these datasets exhibit significant differences in recording quality and post-curation processing steps. Furthermore, verbal features are extracted using disparate Automatic Speech Recognition (ASR) tools across the datasets. The MOUD dataset’s verbal features require an extra translation step from Spanish to English, introducing potential variability in linguistic representation. These distinctions render the three datasets ideal candidates for evaluating the cross-dataset generalization capabilities of our models.
4.3. Feature Extraction Techniques
For each dataset, we employed a meticulous feature extraction process tailored to capture the nuances across textual, acoustic, and visual modalities:
Textual Features: We extracted word embeddings using a Word2Vec model pre-trained on the Google News corpus [28]. To enrich the semantic representation, a 6-dimensional binary vector indicating the part-of-speech (POS) tag—noun, verb, adjective, adverb, preposition, conjunction—was appended to each word embedding. Each utterance's text feature was constructed by concatenating the embeddings of all constituent words, padding with zeros to ensure uniform dimensionality. We set the maximum utterance length to 60 words, discarding any additional words beyond this limit, which affected only approximately 0.5% of utterances across datasets. The penultimate fully connected layer of the CNN was extracted to serve as the final textual input for training.
For the YouTube dataset, transcripts were generated using IBM Bluemix's Speech-to-Text API. In contrast, the MOUD dataset required translation of Spanish transcripts into English to maintain consistency across all datasets.
Acoustic Features: Utilizing the openSMILE toolkit [29], we extracted low-level audio descriptors for each utterance, including Mel-Frequency Cepstral Coefficients (MFCCs), pitch, and various voice quality metrics. Each utterance was segmented into 50 trunks, and the features within each trunk were averaged, resulting in a 1,950-dimensional feature vector per utterance (a sketch of this trunk-averaging step follows at the end of this subsection).
Visual Features: Each video frame was processed to extract facial characteristic points using the CLM-Z library [30]. Frames corresponding to specific utterances were identified through audio-visual synchrony. Each utterance was divided into 5 trunks, with features within each trunk averaged to produce a 2,075-dimensional visual feature vector per utterance.
This multi-faceted feature extraction approach ensures that the models are equipped with rich and diverse representations, capturing the essential aspects of sentiment expression across modalities.
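To make the trunk-averaging step referenced above concrete, the sketch below splits a sequence of frame-level descriptors into a fixed number of contiguous trunks, averages within each trunk, and flattens the result. The 39-dimensional descriptors and the frame count are illustrative assumptions (50 trunks x 39 descriptors happens to yield the 1,950-dimensional acoustic vector stated above).

```python
import numpy as np

def trunk_average(frames: np.ndarray, n_trunks: int) -> np.ndarray:
    """Split frame-level features (T x D) into n_trunks contiguous segments,
    average each segment, and flatten to a fixed-length (n_trunks * D) vector."""
    segments = np.array_split(frames, n_trunks, axis=0)
    return np.concatenate([seg.mean(axis=0) for seg in segments])

# Illustrative: 317 audio frames, each described by 39 low-level descriptors
frames = np.random.randn(317, 39)
utterance_vec = trunk_average(frames, n_trunks=50)
print(utterance_vec.shape)   # (1950,), matching the acoustic dimensionality in this subsection
```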
4.4. Experimental Setup
To emulate real-world scenarios and rigorously evaluate the generalization performance of our models, we employed a stringent experimental setup:
Dataset Partitioning: We selected the first 62 individuals from the MOSI dataset to constitute the training and validation sets, encompassing a total of approximately 1,250 utterances. These utterances were randomly shuffled and partitioned into 1,000 training cases and 250 validation cases, maintaining an 80-20 split. This partitioning ensures that the model is trained and validated on a substantial and diverse subset of the data.
Test Sets: Three distinct test sets were constructed to evaluate cross-dataset performance:
MOSI Test Set: Comprising 546 utterances from the remaining 31 individuals in the MOSI dataset.
YouTube Test Set: Consisting of 195 utterances from 47 unique individuals in the YouTube dataset.
MOUD Test Set: Encompassing 450 utterances from 55 individuals in the MOUD dataset.
This partitioning ensures that the test sets are entirely disjoint from the training and validation sets, thereby providing an unbiased assessment of the models’ generalization capabilities.
Training Protocol: The CNN model was initially trained on the training set and validated on the validation set. The model achieving the highest validation accuracy was selected for subsequent evaluations. Following this, the EMFL framework was applied to the pre-trained CNN to produce the EMFL-CNN model. This model underwent further training to minimize the validation loss, ensuring optimal performance.
Evaluation Metrics: Classification accuracy was employed as the primary metric for evaluating model performance across different modalities and datasets. Additionally, statistical significance of performance improvements was assessed using permutation tests, with p-values reported to substantiate the efficacy of EMFL.
Handling Neutral Sentiments: Given the scarcity of neutral sentiment utterances across all datasets, neutral data points were excluded from the experiments. This decision focuses the evaluation on binary sentiment classification (positive and negative), which is more prevalent and statistically robust.
Baseline Verification: To ensure the reliability of our textual feature extraction, we compared the CNN’s performance against several recently published sentiment analysis tools trained on large-scale movie reviews. The testing accuracy of these tools ranged from 0.68 to 0.70, which was slightly lower than the CNN’s validation accuracy of 0.724, thereby validating the robustness of our baseline model.
4.5. Experimental Results
4.5.1. Within-Dataset Evaluation
Table 1 (Within Dataset columns) presents the classification accuracy for both CNN and EMFL-CNN models evaluated within the MOSI dataset, specifically on the subset of 31 individuals excluded from the training and validation phases. The results demonstrate that the integration of EMFL consistently enhances the model's performance across most modalities. Notably, the EMFL-CNN achieves superior accuracy in the verbal, acoustic, and visual modalities individually, as well as in most bimodal combinations. This improvement underscores EMFL's effectiveness in mitigating the influence of confounding identity-related factors, thereby enabling the model to focus more intently on sentiment-relevant features.
4.5.2. Cross-Dataset Evaluation
Table 1 (Across Datasets columns) reports the performance of CNN and EMFL-CNN models on the YouTube and MOUD test datasets. The findings reveal that the CNN baseline occasionally underperforms, with some accuracies falling below chance levels. This underperformance substantiates the presence of generalization challenges when models trained on one dataset are applied to others with differing characteristics.
The EMFL-CNN model consistently outperforms the CNN baseline across almost all modalities and dataset combinations. Specifically, in the YouTube dataset, EMFL-CNN improves verbal, acoustic, and visual modality accuracies from 0.605 to 0.657, 0.441 to 0.564, and 0.492 to 0.549, respectively. Similarly, in the MOUD dataset, EMFL-CNN enhances accuracies from 0.522 to 0.569 in the verbal modality and from 0.455 to 0.549 in the acoustic modality, while achieving competitive performance in the visual modality.
Modality-Specific Insights
Textual modality consistently exhibits the highest classification accuracy across both within-dataset and cross-dataset evaluations. This superiority can be attributed to two primary factors:
Semantic Richness: Sentiment is inherently more nuanced and accurately captured through textual expressions, which provide explicit cues compared to the more ambiguous visual or acoustic signals.
Language Independence from Identity: Textual information is less likely to be influenced by an individual’s identity, reducing the risk of confounding factors impacting sentiment prediction.
However, despite the robustness of textual features, our results indicate that even textual modalities can be susceptible to confounding by individual-specific language preferences and stylistic variations, particularly in spoken language contexts.
The disparate nature of the YouTube and MOUD datasets, despite originating from the same web platform, introduces variations in recording quality and post-processing techniques. These discrepancies manifest as differing data distributions, adversely affecting the CNN's performance in the acoustic and visual modalities. For instance, in the acoustic modality, CNN accuracy drops to 0.441 on YouTube from 0.588 within MOSI, while EMFL-CNN achieves a substantial improvement to 0.564. Similar trends are observed in the visual modality, where EMFL-CNN maintains competitive performance despite inherent dataset variations.
In multimodal settings, combining features from multiple modalities generally enhances sentiment prediction accuracy. Our experiments confirm that EMFL-CNN consistently improves multimodal fusion results across both within-dataset and cross-dataset evaluations. For example, in the MOSI dataset, integrating verbal and visual modalities yields an accuracy of 0.73 with EMFL-CNN compared to 0.706 with CNN. This enhancement underscores EMFL’s capability to effectively disentangle and leverage complementary information from multiple modalities, further boosting model robustness and accuracy.
To validate the significance of the observed performance improvements, we conducted permutation tests comparing the CNN and EMFL-CNN models. The null hypothesis posits no improvement with EMFL-CNN. The resulting p-values for MOSI, YouTube, and MOUD all reject the null hypothesis, confirming that the performance gains achieved by EMFL-CNN are statistically significant (a sketch of the test construction follows below).
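As an illustration of this kind of test, the fragment below implements a generic paired permutation test on per-utterance correctness vectors: under the null hypothesis that the two models are interchangeable, model assignments are randomly swapped per utterance and the accuracy gap is re-measured. This is a standard construction shown on synthetic data, not necessarily the exact test used here.

```python
import numpy as np

def permutation_test(correct_a, correct_b, n_perm=10000, seed=0):
    """One-sided paired permutation test for accuracy(b) > accuracy(a), given
    0/1 per-utterance correctness vectors of two models on the same test set."""
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    observed = correct_b.mean() - correct_a.mean()
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(correct_a)) < 0.5            # randomly swap the two models per utterance
        perm_a = np.where(swap, correct_b, correct_a)
        perm_b = np.where(swap, correct_a, correct_b)
        count += (perm_b.mean() - perm_a.mean()) >= observed
    return (count + 1) / (n_perm + 1)                      # smoothed p-value

# Synthetic correctness vectors standing in for CNN and EMFL-CNN predictions
rng = np.random.default_rng(1)
cnn_correct = (rng.random(546) < 0.68).astype(int)
emfl_correct = (rng.random(546) < 0.73).astype(int)
print(permutation_test(cnn_correct, emfl_correct))
```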
While EMFL-CNN generally outperforms the CNN baseline, there are a few exceptions. Specifically, in the MOUD dataset’s visual modality and the MOSI dataset’s acoustic and visual bimodal combinations, CNN occasionally matches or slightly outperforms EMFL-CNN. These instances may arise from dataset-specific nuances or the inherent limitations of EMFL in certain multimodal configurations. Further investigation is warranted to understand and address these anomalies, potentially through model refinement or additional confounder mitigation strategies.
The experimental results robustly demonstrate that the EMFL framework significantly enhances the generalizability and robustness of discriminative neural classifiers across diverse datasets and modalities. By effectively mitigating the influence of identity-related confounders, EMFL-CNN models exhibit superior performance, particularly in challenging cross-dataset scenarios where data distributions vary substantially. These findings validate the efficacy of EMFL in advancing the state-of-the-art in multimodal sentiment analysis.
While the current study focuses on identity-related confounders, future research could explore the extension of the EMFL framework to account for multiple types of confounders, such as contextual or environmental factors. Additionally, integrating advanced feature extraction techniques and exploring deeper neural architectures could further bolster the framework’s performance. Expanding evaluations to include more diverse and larger-scale datasets would also provide a more comprehensive assessment of EMFL’s generalizability across various real-world applications.
5. Conclusions and Future Work
5.1. Conclusion
In the realm of automatic multimodal sentiment analysis, the scarcity of high-quality datasets poses a significant challenge. Typically, the datasets available for training machine learning models consist of only a few thousand samples. This limitation inherently restricts the models’ ability to generalize effectively across diverse and unseen data, primarily due to the presence of confounding factors that can bias the sentiment predictions. These confounders, often stemming from identity-related features such as individual speaking styles or unique visual cues, can detract from the model’s focus on sentiment-associated features, thereby diminishing overall performance and reliability.
To address this critical issue, we introduced the Enhanced Multimodal Feature Learning (EMFL) framework, a novel approach designed to mitigate the adverse effects of confounding factors on sentiment analysis models. EMFL operates by identifying and suppressing identity-related confounding dimensions within the feature space, thereby allowing the model to concentrate more accurately on features that are genuinely indicative of sentiment. This selective enhancement ensures that the model’s predictive capabilities are anchored in sentiment-relevant information, rather than being inadvertently influenced by irrelevant identity-specific attributes.
Our extensive experimental evaluations across three distinct datasets—MOSI, YouTube, and MOUD—demonstrate the efficacy of the EMFL framework in significantly improving the generalizability of state-of-the-art multimodal sentiment analysis models. The results indicate substantial gains in prediction accuracy across all three modalities: verbal, acoustic, and visual. Furthermore, the integration of EMFL into existing models not only enhances unimodal performance but also fortifies multimodal fusion strategies, leading to more robust and reliable sentiment predictions.
A key highlight of our findings is EMFL’s ability to maintain high prediction accuracy even when models are tested across different datasets. This cross-dataset robustness underscores EMFL’s effectiveness in overcoming dataset-specific biases and variations, thereby ensuring that the models remain resilient and perform consistently well in diverse real-world scenarios. Such generalizability is crucial for the deployment of sentiment analysis systems in dynamic environments where data distributions can vary widely.
In summary, the Enhanced Multimodal Feature Learning (EMFL) framework offers a significant advancement in the field of multimodal sentiment analysis by effectively addressing the limitations imposed by limited and confounded datasets. By fostering a more focused and discriminative feature learning process, EMFL not only enhances the accuracy of sentiment predictions but also ensures that these models are better equipped to generalize across varied and unseen data landscapes.
5.2. Future Work
While the EMFL framework has demonstrated considerable improvements in mitigating confounding factors and enhancing model generalizability, there remain several avenues for future research to further refine and extend its capabilities.
Firstly, exploring the application of EMFL to larger and more diverse datasets could provide deeper insights into its scalability and effectiveness in even more complex sentiment analysis tasks. Additionally, integrating EMFL with more advanced neural architectures, such as Transformer-based models, may unlock further performance gains and adaptability across different modalities.
Secondly, extending the framework to handle multiple types of confounding factors beyond identity-related features could broaden its applicability. For instance, environmental factors or contextual elements that influence sentiment expression could be incorporated into the model, enabling a more comprehensive disentanglement of relevant and irrelevant features.
Thirdly, investigating semi-supervised or unsupervised variants of EMFL could alleviate the dependency on labeled data, making the framework more versatile in scenarios where annotated datasets are scarce or costly to obtain. This approach could leverage unlabeled data to further enhance feature learning and model robustness.
Lastly, applying the EMFL framework to other domains beyond sentiment analysis, such as emotion recognition, intent detection, or human-computer interaction, could demonstrate its versatility and potential for widespread impact across various fields. Each of these directions presents an opportunity to build upon the foundational strengths of EMFL, driving forward the capabilities of multimodal machine learning models.
By addressing these future research directions, the EMFL framework can continue to evolve, contributing to the development of more sophisticated, reliable, and generalizable multimodal sentiment analysis systems.