MMSA is a unified framework for Multimodal Sentiment Analysis.
M-SENA is an open-sourced platform for Multimodal Sentiment Analysis. It aims to facilitate advanced research by providing flexible toolkits, reliable benchmarks, and intuitive demonstrations. The platform features a fully modular video sentiment analysis framework consisting of data management, feature extraction, model training, and result analysis modules. In this paper, we first illustrate the overall architecture of the M-SENA platform and then introduce features of the core modules. Reliable baseline results of different modality features and MSA benchmarks are also reported. Moreover, we use model evaluation and analysis tools provided by M-SENA to present intermediate representation visualization, on-the-fly instance test, and generalization ability test results. The source code of the platform is publicly available at https://github.com/thuiar/M-SENA.
Multimodal Sentiment Analysis (MSA) aims to judge the speaker’s sentiment from video segments Mihalcea (2012); Soleymani et al. (2017); Guo et al. (2019). It has attracted increasing attention due to the boom in user-generated online content. Although impressive improvements have been witnessed in recent MSA research Tsai et al. (2019); Rahman et al. (2020); Yu et al. (2021), building an end-to-end video sentiment analysis system for real-world scenarios is still full of challenges.
The first challenge lies in effective acoustic and visual feature extraction. Most previous approaches Zadeh et al. (2017a); Hazarika et al. (2020); Han et al. (2021a) are developed on the provided modality sequences from CMU-MultimodalSDK (features provided by CMU). However, reproducing exactly identical acoustic and visual feature extraction is almost impossible due to the vague description of feature selection and backbone selection (neither COVAREP (https://github.com/covarep/covarep) nor Facet (https://imotions.com) can be directly used in Python). Moreover, recent literature Tsai et al. (2019); Gkoumas et al. (2021); Han et al. (2021b) observes that the text modality stands in the predominant position while acoustic and visual modalities contribute little to the final sentiment classification. Such results further draw attention to effective feature extraction for acoustic and visual modalities.
With the awareness of the importance of acoustic and visual feature extraction, researchers attempt to develop models based on customized modality sequences instead of provided features Dai et al. (2021); Hazarika et al. (2020). However, performance comparison with different modality features is unfair. Therefore, the demand for reliable comparison of modality features and fusion methods is increasingly urgent.
Another factor that limits the application of existing MSA models in real scenarios is the lack of comprehensive model evaluation and analysis approaches. Models that obtain outstanding performance on the given test set might degrade in real-world scenarios due to distribution discrepancy or random modality perturbations Liang et al. (2019); Zhao et al. (2021); Yuan et al. (2021). Besides, effective model analysis is also crucial for researchers to explain the improvements and perform model refinement.
The Multimodal SENtiment Analysis platform (M-SENA) is developed to address the above challenges. For acoustic and visual features, the platform integrates Librosa McFee et al. (2015), OpenSmile Eyben et al. (2010), OpenFace Baltrusaitis et al. (2018), MediaPipe Lugaresi et al. (2019) and provides a highly customized feature extraction API in Python. With the modular MSA pipeline, fair comparison between different features and MSA fusion models can be achieved. The results can be regarded as reliable baselines for future MSA research. Furthermore, the platform provides comprehensive model evaluation and analysis tools to reflect the model performance in real-world scenarios, including intermediate result visualization, on-the-fly instance demonstration, and generalization ability test. The contributions of this work are briefly summarized as follows:
By providing a highly customized feature extraction toolkit, the platform familiarizes researchers with the composition of modality features. Also, the platform bridges the gap between designing MSA models with provided, fixed modality features and building a real-world video sentiment analysis system.
The unified MSA pipeline guarantees fair comparison between different combinations of modality features and fusion models.
To help researchers evaluate and analyze MSA models, the platform provides tools such as intermediate result visualization, on-the-fly instance demonstration, and generalization ability test.
The M-SENA platform features convenient data access, customized feature extraction, a unified model training pipeline, and comprehensive model evaluation. It provides a graphical web interface as well as Python packages that give researchers access to all of the features above. The platform currently supports three popular MSA datasets across two languages, seven feature extraction backbones, and fourteen benchmark MSA models. Figure 1 illustrates the overall architecture of the M-SENA platform. In the remaining parts of this section, features of each module in Figure 1 will be described in detail.
The data management module is designed to ease the access of multimedia data on servers. Besides providing existing benchmark datasets, the module also enables researchers to build and manage their own datasets.
Benchmark Datasets. M-SENA currently supports three benchmark MSA datasets, including CMU-MOSI Zadeh et al. (2016), CMU-MOSEI Zadeh et al. (2018c) in English, and CH-SIMS Yu et al. (2020) in Chinese. Details of integrated datasets are shown in Appendix A. Users can filter and view raw videos conveniently without downloading them to the local environment.
Building Private Datasets. The M-SENA platform also provides a graphical interface for researchers to construct their own datasets using uploaded videos. Following the literature Yu et al. (2020), M-SENA supports unimodal sentiment labelling along with multimodal sentiment labelling. The constructed datasets can be directly used for model training and evaluation on the platform.
Emotion-bearing modality feature extraction is still an open challenge for MSA tasks. To facilitate effective modality feature extraction for MSA, M-SENA integrates seven most commonly used feature extraction tools and provides a unified Python API as well as a graphical interface. Part of the supported features for each modality are listed in Table 1 and described below:
|Acoustic Feature Sets|
|ComParE_2016 Schuller et al. (2016b)||Static (HSFs)|
|eGeMAPS Eyben et al. (2015)||Static (LLDs)|
|wav2vec2.0 Baevski et al. (2020)||Learnable|
|Visual Feature Sets|
|Facial Landmarks Zadeh et al. (2017b)||Static|
|Eye Gaze Wood et al. (2015)||Static|
|Action Unit Baltrušaitis et al. (2015)||Static|
|Textual Feature Sets|
|GloVe6B Pennington et al. (2014)||Static|
|BERT Devlin et al. (2018)||Learnable|
|RoBerta Liu et al. (2019)||Learnable|
Acoustic Modality. Various acoustic features have been proven effective for emotion recognition El Ayadi et al. (2011); Akçay and Oğuz (2020). Hand-crafted acoustic features can be divided into two classes, low level descriptors (LLDs), and high level statistics functions (HSFs). LLDs features, including prosodies, spectral domain features and others, are calculated on a frame-basis, while HSFs features are calculated on an entire utterance level. In addition to the hand-crafted features, M-SENA also provides pretrained acoustic model wav2vec2.0 Baevski et al. (2020) as a learnable feature extractor. Researchers can also design and build their own customized acoustic features using the provided Librosa extractor.
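To make the distinction between LLDs and HSFs concrete, the following minimal sketch computes one frame-level descriptor (short-time RMS energy) and then summarizes it with utterance-level statistics. It is not the extraction code used by M-SENA, OpenSmile, or Librosa; the function names, the frame and hop sizes, and the choice of RMS energy as the descriptor are illustrative assumptions.

```python
import math
import statistics


def frame_rms(signal, frame_length=400, hop_length=160):
    """Low-level descriptor (LLD): one RMS energy value per frame."""
    values = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return values


def utterance_functionals(lld):
    """High-level statistics functions (HSFs): one fixed-size summary per utterance."""
    return {
        "mean": statistics.fmean(lld),
        "std": statistics.pstdev(lld),
        "min": min(lld),
        "max": max(lld),
    }


# Synthetic 1-second, 16 kHz sine tone standing in for a real utterance.
sample_rate = 16000
signal = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
          for t in range(sample_rate)]

lld = frame_rms(signal)            # sequence length grows with utterance duration
hsf = utterance_functionals(lld)   # summary size is independent of duration
print(len(lld), sorted(hsf))
```

The LLD output is a sequence whose length depends on the utterance, while the HSF output has a fixed size, which is why the two feature families are typically consumed by different kinds of models.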
Visual Modality. In existing MSA research, facial landmarks, eye gaze, and facial action units are commonly used visual features. The M-SENA platform enables researchers to extract visual feature combinations flexibly using OpenFace and MediaPipe extractors.
Text Modality. Compared with acoustic and visual features, semantic text embeddings are much more mature thanks to the rapid development of pretrained language models Qiu et al. (2020). Following previous works Zadeh et al. (2017a); Rahman et al. (2020); Lian et al. (2022), M-SENA supports GloVe6B Pennington et al. (2014), pretrained BERT Devlin et al. (2018), and pretrained RoBerta Liu et al. (2019) as textual feature extractors.
All feature extractors above are available through both the Python API and the Graphical User Interface (GUI). Listing 1 shows a simple example of default acoustic feature extraction using the Python API. The process is similar for other modalities. Advanced usage and detailed documentation are available at the GitHub Wiki (https://github.com/thuiar/MMSA-FET/wiki).
|Easy||10 (en:4 ch:6)||8 (en:4 ch:4)||8 (en:4 ch:4)|
|Common||9 (en:4 ch:5)||11 (en:6 ch:5)||8 (en:4 ch:4)|
|Difficult||9 (en:4 ch:5)||9 (en:5 ch:4)||8 (en:4 ch:4)|
|Noise||9 (en:4 ch:5)||8 (en:4 ch:4)||7 (en:2 ch:5)|
|Missing||9 (en:4 ch:5)||9 (en:5 ch:4)||7 (en:3 ch:4)|
M-SENA provides a unified training module which currently integrates 14 MSA benchmarks, including tensor fusion methods: TFN Zadeh et al. (2017a), LMF Liu et al. (2018); modality factorization methods: MFM Tsai et al. (2018), MISA Hazarika et al. (2020), SELF-MM Yu et al. (2021); word-level fusion methods: MulT Tsai et al. (2019), BERT-MAG Rahman et al. (2020); multi-view learning methods: MFN Zadeh et al. (2018a), GMFN Zadeh et al. (2018c); and other MSA methods. A detailed introduction of the integrated baseline methods is provided in Appendix B. We will continue following advanced MSA benchmarks and put our best effort into providing reliable benchmark results for future MSA research.
The proposed M-SENA platform provides comprehensive model evaluation tools including intermediate result visualization, on-the-fly instance test, and generalization ability test. A brief introduction of each component is given below, while a detailed demonstration is shown in Section 4.
Intermediate Result Visualization.
The discriminability of multimodal representations is one of the crucial metrics for evaluating different fusion methods. The M-SENA platform records the final multimodal fusion results and illustrates them after decomposition with Principal Component Analysis (PCA). Training loss, binary accuracy, and F1 score curves are also provided in M-SENA for detailed analysis.
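As a rough illustration of the PCA step, the sketch below projects a set of representation vectors onto their top three principal components using power iteration on the sample covariance matrix. It is a self-contained approximation written for this text, not the platform's visualization code; the function name `pca_project`, the iteration count, and the randomly generated "fusion representations" are all assumptions.

```python
import random


def pca_project(rows, n_components=3, n_iter=200, seed=0):
    """Project rows onto their top principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            # Keep v orthogonal to the components found so far.
            for c in components:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(a * a for a in w) ** 0.5
            if norm == 0.0:  # no variance left in the remaining directions
                break
            v = [a / norm for a in w]
        components.append(v)

    projected = [[sum(a * b for a, b in zip(x, c)) for c in components]
                 for x in centered]
    return projected, components


# Dummy "fusion representations": 60 samples, 5 dimensions with unequal spread.
rng = random.Random(42)
scales = [5.0, 2.0, 1.0, 0.1, 0.1]
fusion = [[rng.gauss(0.0, s) for s in scales] for _ in range(60)]
points_3d, axes = pca_project(fusion)
print(len(points_3d), len(points_3d[0]))
```

Each row of `points_3d` can then be plotted as one point, colored by its sentiment label, to inspect how well the classes separate.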
Live Demo Module. In the hope of bridging the gap between MSA research and real-world video sentiment analysis scenarios, M-SENA provides a live demo module, which performs on-the-fly instance tests. Researchers can validate the effectiveness and robustness of the selected MSA model by uploading or live-feeding videos to the platform.
Generalization Ability Test. Compared to the provided test set of benchmark MSA datasets, real-world scenarios are often more complicated. Future MSA models need to be robust against modality noise as well as effective on the test set. Driven by the demand from real-world applications and observations, the M-SENA platform provides a generalization ability test dataset (consisting of 68 Chinese and 61 English samples), simulating as many complicated and diverse real-world scenarios as possible. The statistics of the proposed dataset are shown in Table 2. In general, the dataset contains three scenarios and five instance types. Specifically, the three scenarios refer to films, variety shows, and user-uploaded vlogs, while the five instance types refer to easy samples, common samples, difficult samples, samples with modality noise, and samples with missing modalities. In addition, the dataset is balanced in terms of gender and scenario to avoid irrelevant factors. Examples of the generalization ability test dataset are shown in Appendix C.
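The per-cell counts in Table 2 can be tallied to confirm the stated language split. The short sketch below copies the (en, ch) pairs from the table, one pair per scenario column, and sums them; the dictionary layout and variable names are only for illustration.

```python
# Counts per instance type, copied from Table 2: one (en, ch) pair per scenario column.
table2 = {
    "Easy":      [(4, 6), (4, 4), (4, 4)],
    "Common":    [(4, 5), (6, 5), (4, 4)],
    "Difficult": [(4, 5), (5, 4), (4, 4)],
    "Noise":     [(4, 5), (4, 4), (2, 5)],
    "Missing":   [(4, 5), (5, 4), (3, 4)],
}

english = sum(en for cells in table2.values() for en, _ in cells)
chinese = sum(ch for cells in table2.values() for _, ch in cells)
# Total number of samples in each of the three scenario columns.
per_scenario = [sum(en + ch for en, ch in column)
                for column in zip(*table2.values())]

print(english, chinese, english + chinese)  # 61 68 129
print(per_scenario)                         # [46, 45, 38]
```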
In this section, we report experiments conducted on the M-SENA platform. A comparison of different modality features is shown in Section 3.1, and a comparison of different fusion models is shown in Section 3.2. All reported results are the mean performance over five different seeds.
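The reporting protocol (compute Acc-2 and F1 per seed, then average) can be sketched as follows. This is a simplified illustration, not the platform's metric code: the threshold at 0 with non-negative scores counted as positive, the use of plain binary F1 on the positive class, the seed values, and all labels and predictions are assumptions or dummy data.

```python
from statistics import fmean


def acc2_and_f1(preds, labels):
    """Binary accuracy and F1 after thresholding regression scores at 0."""
    p = [x >= 0 for x in preds]     # assumption: non-negative counts as positive
    y = [x >= 0 for x in labels]
    tp = sum(a and b for a, b in zip(p, y))
    fp = sum(a and not b for a, b in zip(p, y))
    fn = sum(b and not a for a, b in zip(p, y))
    acc = sum(a == b for a, b in zip(p, y)) / len(y)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return acc, f1


# Dummy labels in [-3, 3] and dummy predictions from five hypothetical seeds.
labels = [2.4, -1.0, 0.6, -2.2, 1.8, -0.4]
runs = {
    1111: [1.9, -0.3, -0.2, -1.5, 2.0, 0.1],
    1112: [2.1, -1.2, 0.4, -2.0, 1.1, -0.6],
    1113: [0.8, 0.5, 0.9, -1.7, 1.5, -0.1],
    1114: [1.0, -0.5, 0.3, 0.2, 0.9, -1.1],
    1115: [-0.4, -0.8, 1.2, -0.9, 0.7, -0.3],
}

scores = [acc2_and_f1(preds, labels) for preds in runs.values()]
mean_acc = fmean(acc for acc, _ in scores)
mean_f1 = fmean(f1 for _, f1 in scores)
print(f"Acc-2 = {100 * mean_acc:.1f}%, F1 = {100 * mean_f1:.1f}%")
```

Averaging over seeds reduces the influence of a single lucky or unlucky initialization on the reported number.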
In the following experiments, we take BERT [T1], eGeMAPS (LLDs) [A1], and Action Unit [V1] as default modality features, and compare them with the other six feature sets. Specifically, we utilize GloVe6B [T2] and RoBerta [T3] for text modality comparison; a customized acoustic feature [A2] (including 20-dimensional MFCC, 12-dimensional CQT, and 1-dimensional f0) and wav2vec2.0 features [A3] for acoustic modality comparison; and facial landmarks [V2] and facial landmarks with action units [V3] for visual modality comparison. Besides, we also report the model performances using the modality features provided in CMU-MultimodalSDK.
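For the customized acoustic feature [A2], the three descriptors add up to 20 + 12 + 1 = 33 dimensions per frame. The sketch below shows one plausible way to assemble such a feature, assuming the streams are frame-aligned and simply concatenated; the helper `concat_frames` is hypothetical and the zero-filled arrays are placeholders rather than real Librosa output.

```python
def concat_frames(*streams):
    """Concatenate per-frame feature vectors from several aligned streams."""
    n_frames = len(streams[0])
    if any(len(s) != n_frames for s in streams):
        raise ValueError("all streams must have the same number of frames")
    return [sum((list(frame) for frame in frames), [])
            for frames in zip(*streams)]


n_frames = 4  # dummy utterance length
mfcc = [[0.0] * 20 for _ in range(n_frames)]  # placeholder 20-dim MFCC per frame
cqt = [[0.0] * 12 for _ in range(n_frames)]   # placeholder 12-dim CQT per frame
f0 = [[0.0] for _ in range(n_frames)]         # placeholder 1-dim f0 per frame

a2 = concat_frames(mfcc, cqt, f0)
print(len(a2), len(a2[0]))  # 4 33
```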
Table 3 shows the experiment results for feature selection. For BERT-MAG, which is designed upon the BERT backbone, experiments are conducted only with BERT as the text feature. It can be observed that, in most cases, using appropriate features instead of the original features in CMU-MultimodalSDK helps to improve model performance. For the textual modality, the RoBerta feature performs best for the TFN and GMFN models, while the BERT feature performs best for the MISA model. For the acoustic modality, wav2vec2.0 embeddings (without finetuning) perform best for the GMFN and BERT-MAG models. According to the literature Chen and Rudnicky (2021); Pepino et al. (2021), finetuning wav2vec2.0 can further improve model performance, which might provide more effective acoustic features for future MSA research. For the visual modality, the combination of facial landmarks and action units achieves the overall best result, revealing the effectiveness of both landmarks and action units for sentiment classification.
Experiment results of benchmark MSA models are shown in Table 4. All models are enhanced by using BERT as text embeddings while using the original acoustic and visual features provided in CMU-MultimodalSDK. Besides recording reliable benchmark results, the M-SENA platform also provides researchers with a convenient approach to reproduce the benchmarks. Again, both the GUI and the Python API are available. We show an example of the proposed Python API in Listing 2. Detailed and advanced usage is included in our documentation on GitHub (https://github.com/thuiar/MMSA/wiki). We will continuously catch up on new MSA approaches and update their performances.
This section demonstrates model analysis results using the M-SENA platform. Intermediate result analysis is presented in Section 4.1, on-the-fly instance analysis is shown in Section 4.2, and generalization ability analysis is illustrated in Section 4.3.
The intermediate result analysis submodule is designed to monitor and visualize the training process. Figure 2 shows an example of training the TFN model on the MOSI dataset. Epoch results of binary accuracy, F1 score, and loss value are plotted. Moreover, the learned multimodal fusion representations are illustrated in an interactive 3D figure with the aim of helping users gain a better intuition about the multimodal feature representations and the fusion process. Unimodal representations of text, acoustic, and visual are also shown for models containing explicit unimodal representations.
M-SENA enables researchers to validate the proposed MSA approaches using uploaded or live-recorded instances. Figure 3 presents an example of the live demonstration. Besides model prediction results, the platform also provides feature visualization, including the short-time Fourier transform (STFT) for the acoustic modality and facial landmarks, eye gaze, and head poses for the visual modality. We will continuously update the demonstration to make it an even more intuitive and playable MSA model evaluation tool.
We use the models trained on the MOSI dataset with [T1]-[A1]-[V3] modality features in Section 3.1 for the generalization ability test. Experimental results are reported in Table 5. It can be concluded that all models present a performance gap between the original test set and real-world scenarios, especially for the instances with noisy or missing modalities. Another observation is that noisy instances are usually more challenging for MSA models than instances with missing modalities, suggesting that a noisy modality feature can be worse than none at all. In the future, to meet the demand of real-world applications, MSA researchers may consider analyzing model robustness as well as performance on the test set, and designing MSA models that are more robust against random modality noise.
|Acc-2 / F1||Acc-2 / F1||Acc-2 / F1||Acc-2 / F1|
|Easy||83.3 / 84.4||75.0 / 76.1||75.0 / 76.7||66.7 / 66.7|
|Common||71.4 / 74.5||85.7 / 82.3||71.4 / 75.8||78.6 / 78.6|
|Difficult||69.2 / 69.2||61.5 / 60.5||53.9 / 54.4||84.6 / 84.6|
|Noise||60.0 / 50.5||50.0 / 44.9||50.0 / 35.7||60.0 / 51.7|
|Missing||63.6 / 60.6||81.8 / 77.8||63.6 / 60.6||63.6 / 61.5|
|Avg||70.0 / 68.4||71.7 / 69.3||63.3 / 62.4||71.7 / 69.7|
To the best of our knowledge, there are two widely used open-source repositories, from the CMU team (https://github.com/A2Zadeh/CMU-MultimodalSDK) and the SUTD team (https://github.com/declare-lab/multimodal-deep-learning). Both of them provide tools to load well-known MSA datasets and implement several benchmark methods. So far, their works have attracted considerable attention and facilitated the birth of new MSA models such as MulT Tsai et al. (2019) and MMIM Han et al. (2021b).
In this paper, we propose M-SENA. Compared to previous works, the M-SENA platform is novel in the following aspects. For data management, previous work directly loads the extracted features, while the M-SENA platform focuses on intuitive raw video demonstration and provides users with a convenient means for private dataset construction. For modality features, the M-SENA platform is the first to provide a user-customized feature extraction toolkit and a transparent feature extraction process. Following the tutorial, users can easily reproduce the feature extraction steps and develop their research on a designed feature set. For model training, the M-SENA platform is the first to utilize a unified MSA framework and provide an easy-to-reproduce model training API integrating fourteen MSA benchmarks on three popular MSA datasets. For model evaluation, M-SENA is the first MSA platform to offer comprehensive evaluation means that stress model robustness in real-world scenarios, which aims to bridge the gap between MSA research and applications.
In this work, we introduce M-SENA, an integrated platform that contains step-by-step recipes for data management, feature extraction, model training, and model analysis for MSA researchers. The platform evaluates MSA models in an end-to-end manner and reports reliable benchmark results for future research. Moreover, we investigate comprehensive model evaluation and analysis methods and provide a series of user-friendly visualization and demonstration tools including intermediate representation visualization, on-the-fly instance test, and generalization ability test. In the future, we will continuously catch up on advanced MSA research progress and update new benchmarks on the M-SENA platform.
This paper is funded by The National Natural Science Foundation of China (Grant No. 62173195) and Beijing Academy of Artificial Intelligence (BAAI). The authors thank the anonymous reviewers for their valuable suggestions.
Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116:56–76.
Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198.
Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
Rendering of eyes for eye-shape registration and gaze estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3756–3764.
CMU-MOSI. The MOSI Zadeh et al. (2016) dataset is a widely-used dataset that consists of a collection of 2,199 video segments from 93 YouTube movie review videos.
CMU-MOSEI. The MOSEI Zadeh et al. (2018c) dataset expands the MOSI dataset by enlarging the number of utterances and enriching the variety of samples, speakers, and topics. For both MOSI and MOSEI datasets, instances are annotated with a sentiment intensity score ranging from -3 to 3 (strongly negative to strongly positive).
CH-SIMS. The SIMS dataset Yu et al. (2020) is a Chinese unimodal and multimodal sentiment analysis dataset. It contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations of a sentiment intensity score ranging from -1 to 1 (negative to positive, the score interval is 0.2).
LF-DNN. The Late Fusion Deep Neural Network Cambria et al. (2017) first extracts modality features separately and then applies a late fusion strategy for final predictions.
EF-LSTM. The Early Fusion Long Short-Term Memory Cambria et al. (2017) is based on input-level feature fusion and uses a Long Short-Term Memory (LSTM) network to learn multimodal representations.
TFN. The Tensor Fusion Network (TFN) Zadeh et al. (2017a) calculates a multi-dimensional tensor (based on outer product) to capture uni-, bi-, and tri-modal interactions.
LMF. The Low-rank Multimodal Fusion (LMF) Liu et al. (2018) is an improvement over TFN, where the low-rank multimodal tensors fusion technique is performed to improve efficiency.
MFN. The Memory Fusion Network (MFN) Zadeh et al. (2018a) accounts for continuously modeling the view specific and cross-view interactions and summarizing them through time with a Multi-view Gated Memory.
Graph-MFN. The Graph Memory Fusion Network Zadeh et al. (2018c) is an improvement of MFN, which can change the fusion structure dynamically to obtain the interaction between the modalities and improve the interpretability.
MulT. The Multimodal Transformer (MulT) Tsai et al. (2019) extends the Transformer architecture with directional pairwise cross-modal attention, which translates one modality to another.
BERT-MAG. The Multimodal Adaptation Gate for BERT (MAG-BERT) Rahman et al. (2020) is an improvement over RAVEN on aligned data, applying a multimodal adaptation gate at different layers of the BERT backbone.
MISA. The Modality-Invariant and -Specific Representations Hazarika et al. (2020) is made up of a combination of losses including similarity loss, orthogonal loss, reconstruction loss and prediction loss to learn modality-invariant and modality-specific representation.
MFM. The Multimodal Factorization Model Tsai et al. (2018) is a robust model that learns multimodal-discriminative and modality-specific generative factors, and can then reconstruct missing modalities by adjusting for independent factors.
MLF_DNN. The Multi-Task Late Fusion Deep Neural Network Yu et al. (2020) first extracts modality features separately and performs late fusion strategy for final predictions through unimodal labels training.
MTFN. The Multi-Task Tensor Fusion Network Yu et al. (2020) calculates a multi-dimensional tensor (based on outer product) to capture uni-, bi-, and tri-modal interactions through unimodal labels training.
MLMF. The Multi-Task Low-rank Multimodal Fusion Yu et al. (2020) is an improvement over MTFN, where low-rank multimodal tensors fusion technique is performed to improve efficiency through unimodal labels training.
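As an illustration of the tensor fusion idea behind TFN (and its multi-task variant MTFN) described above, the sketch below forms the three-way outer product of unimodal embeddings, each extended with a constant 1 so that the resulting tensor contains uni-, bi-, and tri-modal terms. It is a plain-Python toy, not the integrated implementation; the function name and the embedding values are made up for the example.

```python
def tensor_fusion(z_text, z_audio, z_video):
    """Three-way outer product of unimodal embeddings, each extended with a constant 1."""
    t = list(z_text) + [1.0]
    a = list(z_audio) + [1.0]
    v = list(z_video) + [1.0]
    return [[[ti * ai * vi for vi in v] for ai in a] for ti in t]


z_t, z_a, z_v = [0.5, -1.0], [2.0], [3.0, 4.0]  # dummy unimodal embeddings
fused = tensor_fusion(z_t, z_a, z_v)

print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 2 3
print(fused[-1][-1][:-1])  # unimodal slice, recovers z_v: [3.0, 4.0]
print(fused[0][0][-1])     # bimodal term z_t[0] * z_a[0]: 1.0
print(fused[0][0][0])      # trimodal term z_t[0] * z_a[0] * z_v[0]: 3.0
```

The fused tensor has (d_t + 1)(d_a + 1)(d_v + 1) entries, so its size grows multiplicatively with the embedding sizes; this cost is what low-rank variants such as LMF are designed to reduce.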
The examples of the proposed generalization ability test dataset are shown in Figure 4.