The availability of multimodal data enables many downstream tasks that use cross-modal information, such as conversation generation and multimodal sentiment analysis (MSA). In MSA, researchers have recently leveraged the rich information contained in different modalities (e.g., audio, visual, language) to design multimodal models, and existing works mainly focus on exploring cross-modal dynamics and designing sophisticated fusion methods Mai et al. (2020a); Pham et al. (2019); Poria et al. (2017a); Hazarika et al. (2020); Mai et al. (2021a).
While existing MSA models are mostly optimized with a multimodal loss, the optimization of the unimodal networks within them is often neglected. However, the quality of the unimodal networks determines the lower bound on the performance of the whole MSA model and should therefore be specifically addressed. Besides, a well-optimized network for each modality also helps preserve the performance of MSA models when a modality is absent.
Moreover, even with satisfactory unimodal networks, multimodal models do not always reach higher performance than unimodal ones Mai et al. (2021b). A possible reason is that a modality may contain no useful information in some utterances and may even carry noise, which hinders the learning of a correct multimodal embedding. Some attention-based methods leverage attention mechanisms to determine modality importance Chauhan et al. (2019); Akhtar et al. (2019), which can filter out noisy information to a certain degree, but those methods introduce a large number of parameters and increase the risk of overfitting. Moreover, although attention emphasizes informative modalities, it cannot explicitly filter out the noisy ones.
Based on the aforementioned problems, we focus on two questions: (1) how to obtain an optimal unimodal network, and (2) which modalities are informative and how to filter out the noisy ones. Our intuition is that each modality carries modality-specific information and that the importance of this information varies across modalities. Moreover, the role of the same modality also varies across utterances (the amounts of useful and noisy information differ from one utterance to another). To address these concerns, we propose a novel Modulation Model for Multimodal Sentiment Analysis to modulate the training of different modalities.
Specifically, a modulation loss and a modality filter module are designed to identify important modalities and reduce the negative impact of noisy information. To learn an optimal unimodal network, the modulation loss modulates the training of each unimodal network. The core idea is that, during training, the modulation function adjusts the loss contribution of each modality according to the confidence of all the modalities George and Marcel (2021), which enables the model to balance multimodal information and identify the importance of each modality in each utterance. In this way, the model can dynamically adjust the contribution of different modalities so as to better leverage the important information hidden within each modality when updating the unimodal networks. With the proposed modulation loss, the individual unimodal networks can be better optimized because the interference of the noisy modalities in each utterance is reduced.
Besides, to obtain a correct multimodal embedding, we design a modality filter module (MFM) to identify modality importance and explicitly filter out noisy modalities. We present two candidate filters for the MFM, a hard filter and a soft filter: the hard filter makes a binary choice to retain or filter out an individual modality, while the soft filter outputs a value in [0, 1] to filter out noisy information according to the noise level. Moreover, instead of directly removing the noisy modalities or tokens Chen et al. (2017); Zhang et al. (2019), and inspired by Schlichtkrull et al. (2020), we train a baseline embedding for each modality and replace the noisy embedding with it, so that our method can be fitted into any fusion mechanism and can compensate for the loss of unimodal information.
In brief, our contributions can be summarized as follows:
- We propose a novel framework to modulate the training of MSA models, which aims to explore optimal solutions for the unimodal networks and the multimodal embedding.
- A cross-modal modulation loss is devised to modulate the contribution of each modality based on the confidence of the individual modalities during training. It reduces the interference from noisy modalities so that the unimodal networks can be better optimized, an aspect that is often neglected in existing works.
- A modality filter module (MFM) is designed to identify noisy modalities and filter them out. A soft filter, a hard filter and unimodal baseline embeddings are proposed to minimize the negative impact of noisy information and obtain a correct multimodal embedding. Compared with attention-based methods, the MFM introduces far fewer parameters and can explicitly filter out noisy modalities.
- Our proposed method is compared with several models on public datasets and achieves state-of-the-art performance, which demonstrates its effectiveness.
2 Related Work
In the field of MSA, each sample is an utterance that captures different views with complementary information. Most previous works focus on elaborately designing various fusion strategies so that the model can explore inter-modal dynamics to sufficiently learn a joint embedding, including simple approaches like early fusion and late fusion Wöllmer et al. (2013); Rozgic et al. (2012); Poria et al. (2016, 2017b), and more advanced fusion strategies like tensor-based fusion Liu et al. (2018); Zadeh et al. (2017); Mai et al. (2019), graph fusion Mai et al. (2020a); Zadeh et al. (2018b); Mai et al. (2020b), factorization methods Tsai et al. (2019); Liang et al. (2019), and fine-tuning BERT Rahman et al. (2020); Yang et al. (2020).
The above-mentioned methods focus on exploring more advanced fusion strategies and optimize the whole network mostly with a multimodal loss so as to achieve higher performance on the MSA task. While more attention is paid to the optimization of multimodal networks, methods specifically designed for optimizing individual unimodal networks are neglected. We hold that, apart from learning cross-modal dynamics, it is also important to reach an optimal solution for the unimodal networks. To achieve this goal, we design a modulation loss that modulates the loss contribution of the unimodal networks based on their confidence. We train all unimodal networks with the modulation loss across all data points with the aim of reaching optimal parameters on the corresponding dataset.
Another problem in the field of MSA is the interference between modalities. Noisy modalities can interfere with the learning of other modalities and of the correct multimodal embedding. Some attention-based fusion methods that apply cross-modal attention mechanisms, such as Context-aware Interactive Attention (CIA) Chauhan et al. (2019), Multi-Task Learning (MTL) Akhtar et al. (2019) and Multilogue-Net Shenoy and Sardana (2020), consider the importance of different modalities and assign different weights to them. However, they focus on identifying and highlighting important modalities and cannot explicitly filter out noisy modalities. Although these models consider modality importance, we approach it from a different perspective instead of learning attention weights. Specifically, we focus on identifying and filtering out noisy modalities with a modality filter module (MFM), which introduces far fewer parameters than attention mechanisms and can explicitly filter out noisy information. There also exist works that aim to filter out noisy modalities or the tokens within a modality using reinforcement learning (RL) Chen et al. (2017); Zhang et al. (2019). However, RL is unstable in training and suffers from high variance, and control variates require auxiliary models or multiple evaluations of the network Louizos et al. (2017); Mnih and Gregor (2014). Moreover, these methods make a binary choice to retain or filter out the whole noisy modality, so modality-specific information may be lost. In contrast, our proposed MFM is much easier to train, and it uses a baseline embedding to compensate for the loss of modality-specific unimodal information.
3.1 Notations and Problem Formulation
Our task is to perform multimodal sentiment analysis on multimodal data by scoring sentiment intensity. The input to the model is an utterance Olson (1977) (i.e., a segment of a video bounded by pauses and breaths), which has three modalities: acoustic, visual, and language. Each modality is represented as a sequence of feature vectors with its own sequence length and its own feature dimensionality.
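One notation consistent with this description is sketched below. The symbols ($m$, $\boldsymbol{X}_m$, $T_m$, $d_m$) are assumptions introduced for illustration and may differ from those of the original formulation.

```latex
m \in \{a, v, l\}, \qquad
\boldsymbol{X}_a \in \mathbb{R}^{T_a \times d_a}, \quad
\boldsymbol{X}_v \in \mathbb{R}^{T_v \times d_v}, \quad
\boldsymbol{X}_l \in \mathbb{R}^{T_l \times d_l}
```

Here $a$, $v$ and $l$ index the acoustic, visual and language modalities, $T_m$ is the sequence length of modality $m$, and $d_m$ is its feature dimensionality.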
3.2 Overall Algorithm
Formally, a traditional multimodal learning system passes the raw feature sequence of each modality through a unimodal network and feeds the resulting unimodal representations to a multimodal network, which outputs the prediction. The unimodal networks and the multimodal network each have their own parameters. To update the parameters of the multimodal system, all of them are trained jointly by gradient descent, with a given learning rate, on a single loss between the prediction and the ground-truth label, where the loss is the mean absolute error (MAE).
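As a hedged sketch, the traditional system described above could be written as follows. The symbols ($F^{m}$, $F^{M}$, $\theta_m$, $\theta_M$, $\eta$, $N$) are assumptions chosen for illustration; only the structure (unimodal networks, a multimodal network, an MAE loss and a gradient step) is taken from the text.

```latex
\begin{aligned}
\hat{y} &= F^{M}\!\big(F^{a}(\boldsymbol{X}_a;\theta_a),\; F^{v}(\boldsymbol{X}_v;\theta_v),\; F^{l}(\boldsymbol{X}_l;\theta_l);\; \theta_M\big), \\
\ell &= \mathrm{MAE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert, \\
\theta &\leftarrow \theta - \eta\,\frac{\partial \ell}{\partial \theta},
\qquad \theta \in \{\theta_a, \theta_v, \theta_l, \theta_M\}.
\end{aligned}
```

In this form every parameter, unimodal or multimodal, receives its gradient from the same multimodal loss, which is the coupling that the proposed method relaxes.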
Unlike the traditional multimodal learning system, which mostly focuses on optimizing the whole multimodal framework, we decouple the learning procedures of the unimodal and multimodal networks, introduce modulation losses to specifically optimize the unimodal networks for learning better unimodal representations, and design a modality filter module (MFM) for identifying and filtering out noisy modalities. As illustrated in Fig. 1, given an input utterance with three modalities, we first obtain the unimodal representations via the unimodal networks. The modulation loss is specifically designed to train the individual unimodal networks by modulating the loss contribution of each modality. Besides, the output of each unimodal network is sent to the MFM so that noisy modalities can be identified and filtered out. With our proposed method, we can modulate the learning of correct unimodal and multimodal dynamics and minimize the negative impact of noisy information. In short, our multimodal learning system is formulated in terms of a shared classifier, modulated unimodal losses, and the MFM.
The classifier takes an encoded representation as input and outputs the sentiment prediction. It is shared across the unimodal and multimodal networks to force the learned unimodal and multimodal representations to have approximately the same distribution. As illustrated in Eq. 6, the unimodal losses are adjusted by a Modulation function, which helps to identify the contribution of each modality of the current utterance to the optimization of the respective unimodal network. The modulated loss is used to update the respective unimodal network. Moreover, in Eq. 7, the MFM is introduced to identify the uninformative modalities and replace them with the learned unimodal baseline embeddings, which filters out the noisy information that interferes with the learning of cross-modal interactions. The modulation function and the MFM are introduced in detail in Sections 3.3 and 3.4, respectively.
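A hedged sketch of the system described above is given below. The symbols ($\boldsymbol{h}_m$, $C$, $\tilde{\ell}_m$, $\boldsymbol{h}'_m$) are assumptions, and writing the modulated loss as a weight multiplied by the unimodal loss is one plausible reading of the text, not necessarily the original formulation.

```latex
\begin{aligned}
\boldsymbol{h}_m &= F^{m}(\boldsymbol{X}_m;\theta_m), \qquad m \in \{a, v, l\}, \\
\hat{y}_m &= C(\boldsymbol{h}_m), \qquad \ell_m = \mathrm{MAE}(y, \hat{y}_m), \\
\tilde{\ell}_m &= \mathrm{Modulation}(\ell_m;\, \ell_a, \ell_v, \ell_l)\cdot \ell_m, \\
\boldsymbol{h}'_m &= \mathrm{MFM}(\boldsymbol{h}_m;\, \boldsymbol{h}_a, \boldsymbol{h}_v, \boldsymbol{h}_l), \\
\hat{y} &= C\big(F^{M}(\boldsymbol{h}'_a, \boldsymbol{h}'_v, \boldsymbol{h}'_l;\theta_M)\big), \qquad \ell = \mathrm{MAE}(y, \hat{y}).
\end{aligned}
```

The same classifier $C$ appears in both the unimodal and the multimodal branch, which reflects the sharing described in the text; $\tilde{\ell}_m$ updates only the unimodal network of modality $m$, while $\ell$ updates the whole model.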
Unlike most existing works, which need sophisticatedly designed fusion methods to sufficiently explore cross-modal dynamics, our proposed model can leverage simple fusion methods to reach state-of-the-art performance with better generalization ability. Also note that our algorithm is model-agnostic, and any sequence learning network can be used as a unimodal network. In this paper, we apply Transformer-based Vaswani et al. (2017) architectures to build the unimodal networks. As for the multimodal network, we introduce different fusion mechanisms to evaluate the algorithm. Please refer to the Appendix for details about the unimodal and multimodal networks.
3.3 Modulation Loss
The cross-modal modulation function is proposed to modulate the loss contribution of each modality as a function of the confidence of the individual modalities. This is based on the assumption that each modality carries modality-specific information whose importance varies from one modality to another. The role of the same modality also varies across utterances (in some utterances the modality is important, while in others it contains only noisy information). Instead of learning a fixed attention weight for each modality as previous methods do Wang et al. (2019); Mai et al. (2020a), we seek to dynamically adjust the contribution of different modalities so as to better leverage the important information hidden within each modality when updating the network, and to effectively reduce the interference of noisy utterances. Compared to the attention mechanism, the modulation loss directly influences the optimization procedure, which is more straightforward, and it is non-parametric.
How do we dynamically determine the contribution of each modality during training? An intuitive idea is to use the value of the unimodal loss, under the assumption that the smaller the unimodal loss, the more discriminative the modality is for the task, so that a higher weight should be assigned to better leverage the discriminative information hidden in this modality when updating the network. More importantly, when assigning a weight to each unimodal loss, we should take a global view of all the modalities and consider the values of the other unimodal losses to estimate the relative importance and adjust the weight for this modality accordingly. Taking the language modality as an example, the modulation loss for the language modality is obtained by applying the weight given by the Modulation function to the language unimodal loss.
The Modulation function aims to determine the weight for a unimodal loss by estimating the discriminative information in all the modalities (this is why we call it modulation). The Modulation function could be formulated in many ways.
In practice, our formulation uses the harmonic mean of the three unimodal losses, which acts as a scale on the weights of the unimodal losses, to compute the weight for the language loss. By using the loss values of the other modalities to compute the weight for the current modality, the weight of the current modality decreases when the other modalities obtain relatively low losses (i.e., the other modalities have high confidence in their predictions). In other words, a modality that has a relatively high loss obtains a low weight when updating the corresponding unimodal network, which dynamically reduces the influence of noisy modalities on the network. This simple operation is shown to be very effective (see the experiments).
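The behaviour described above can be illustrated with a minimal Python sketch. The exact weight formula is an assumption here: each weight is taken to be the harmonic mean of all unimodal losses divided by the loss of the modality itself, which reproduces the two stated properties (a high-loss modality gets a low weight, and a modality's weight shrinks when the other modalities have low losses). The function names are hypothetical.

```python
def harmonic_mean(values):
    """Harmonic mean of strictly positive numbers."""
    return len(values) / sum(1.0 / v for v in values)


def modulation_weights(losses, eps=1e-8):
    """Per-modality weights computed from unimodal loss values (assumed form).

    losses: dict mapping modality name -> non-negative unimodal loss value.
    Assumption: weight_m = H / loss_m, where H is the harmonic mean of all
    unimodal losses. The weights are plain floats, i.e. constants.
    """
    safe = {m: max(v, eps) for m, v in losses.items()}  # avoid division by zero
    h = harmonic_mean(list(safe.values()))
    return {m: h / v for m, v in safe.items()}


def modulated_losses(losses):
    """Modulated loss of each modality: constant weight times unimodal loss."""
    weights = modulation_weights(losses)
    return {m: weights[m] * losses[m] for m in losses}


losses = {"language": 0.5, "acoustic": 2.0, "visual": 1.0}
weights = modulation_weights(losses)
# The low-loss language modality receives the largest weight,
# the high-loss acoustic modality the smallest.
```

In an automatic-differentiation framework the weight would have to be detached from the computation graph, since under this assumed form the product of weight and loss equals the harmonic mean for every modality; only with a constant weight does the gradient of each unimodal network get rescaled as intended.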
3.4 Modality Filter Module
Noisy modalities negatively affect the learning of other informative modalities and hinder existing MSA models from reaching higher performance. Many existing works try to identify modality importance with attention mechanisms Mai et al. (2020a); Liang et al. (2018), which can highlight useful tokens or modalities and filter out some noisy information. However, those methods cannot completely filter out the noisy information and only tend to assign high weights to the informative modalities. Chen et al. (2017) leverage reinforcement learning (RL) to learn a gate controller for each modality, which can shut off noisy modalities. But RL suffers from high variance and introduces more parameters and optimization objectives Louizos et al. (2017), which makes training unstable.
Unlike previous methods, we propose a modality filter module (MFM) to selectively filter out noisy modalities, so that the negative impact of noisy information is minimized. Unlike Chen et al. (2017), who only consider non-lexical modalities as possibly noisy, we aim to identify whether each of the three modalities in an utterance contains noisy information and whether it should contribute to the final prediction.
Mathematically, the MFM first takes the feature embeddings of all the modalities as inputs and calculates a feature shift of each specific unimodal embedding relative to the overall multimodal embedding. A multimodal representation is first formed by concatenating the embeddings of the three modalities. A linear transformation then maps it to a processed multimodal representation that has the same dimensionality as the individual modalities. Finally, the feature shift of a modality is computed by comparing its embedding with this processed multimodal representation. By using all the unimodal embeddings to determine the noise level of each specific modality, the model has a global view of all the modalities and can determine which are informative and which are not.
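The computation described above can be sketched in plain Python as follows. The direction of the shift (unimodal embedding minus projected multimodal representation) and the names `linear` and `feature_shifts` are assumptions made for illustration; the concatenation and the linear projection back to the unimodal dimensionality follow the text.

```python
def linear(x, weight, bias):
    """Affine map; weight is a list of rows, each of length len(x)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def feature_shifts(embeddings, weight, bias):
    """Feature shift of every unimodal embedding (assumed form).

    embeddings: dict modality -> list of d floats (same d for all modalities).
    weight, bias: parameters of a linear map from 3 * d to d dimensions.
    Assumption: shift_m = embedding_m - projected multimodal representation.
    """
    # Concatenate in a fixed (alphabetical) modality order.
    concatenated = [v for m in sorted(embeddings) for v in embeddings[m]]
    projected = linear(concatenated, weight, bias)
    return {m: [e - p for e, p in zip(emb, projected)]
            for m, emb in embeddings.items()}


embeddings = {"acoustic": [1.0, 0.0], "language": [0.0, 1.0], "visual": [1.0, 1.0]}
third = 1.0 / 3.0
# Hypothetical projection that averages the three embeddings per dimension.
weight = [[third, 0.0, third, 0.0, third, 0.0],
          [0.0, third, 0.0, third, 0.0, third]]
bias = [0.0, 0.0]
shifts = feature_shifts(embeddings, weight, bias)
```

Because every shift is computed against the same projected representation, each modality is judged with all three embeddings in view, which is the "global view" the text refers to.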
With the obtained feature shift of each modality, the MFM filters out noisy information with a Filter.
The Filter, which has its own learnable parameters, produces an output that determines whether to filter the modality out based on its noise level. The Filter is trained across all utterances so that it can identify and filter out noisy modalities. The Filter can be realized in many ways, and we put forward two candidates in Section 3.4.1 and Section 3.4.2. The output of the Filter is then used to determine the final embedding of the modality.
This final embedding contains much less noisy information. The final embeddings of the individual modalities are then leveraged to learn a correct multimodal embedding for the MSA task. Besides, we assume that filtering out too much information from a noisy modality may degrade performance, because the model may lose modality-specific information. To compensate for the modality-specific information of noisy modalities, we learn a baseline embedding for each modality. The unimodal baseline embedding is a critical part of our MFM and is trained across multiple data points in the dataset. It is assumed to integrate the general distributions and properties of each modality, and it can therefore compensate for the modality-specific information during fusion. Moreover, instead of directly removing the noisy modalities or tokens Chen et al. (2017); Zhang et al. (2019), the unimodal baseline embedding enables our model to fit into any fusion mechanism, such as tensor fusion or element-wise multiplication, which provides more generalization ability.
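A small Python sketch can show why replacing a filtered modality with a baseline embedding, rather than removing it, keeps multiplicative fusion usable. The convex-combination form and the names `filtered_embedding` and `elementwise_product` are assumptions for illustration; the baseline values are made up.

```python
def filtered_embedding(embedding, baseline, keep):
    """Blend a unimodal embedding with its learned baseline embedding.

    keep: filter output in [0, 1]; 1 keeps the embedding unchanged, 0
    replaces it with the baseline. The blending form is an assumption.
    """
    if not 0.0 <= keep <= 1.0:
        raise ValueError("keep must lie in [0, 1]")
    return [keep * e + (1.0 - keep) * b for e, b in zip(embedding, baseline)]


def elementwise_product(vectors):
    """Fuse equally sized vectors by element-wise multiplication."""
    fused = [1.0] * len(vectors[0])
    for vec in vectors:
        fused = [f * v for f, v in zip(fused, vec)]
    return fused


language = [0.5, -1.0]
visual = [2.0, 1.0]
acoustic = [0.3, 0.7]           # suppose this modality is judged noisy
acoustic_baseline = [1.0, 1.0]  # hypothetical learned baseline embedding

# Removing the noisy modality by zeroing it wipes out the fused vector ...
zeroed = elementwise_product([language, visual, [0.0, 0.0]])
# ... whereas substituting the baseline preserves the other modalities.
replaced = elementwise_product(
    [language, visual, filtered_embedding(acoustic, acoustic_baseline, keep=0.0)]
)
```

With `keep=0.0` the acoustic embedding is fully replaced by the baseline, so the fused vector still carries the language and visual information, while zeroing the modality leaves nothing for the classifier.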
With our proposed MFM, our model is capable of identifying and filtering out noisy modalities. In this way, our proposed model can dynamically retain informative modalities to modulate the learning of correct multimodal embedding for each utterance. Besides, to minimize the negative impact of the absence of modality-specific information, the learned baseline embedding of each modality helps to sufficiently learn cross-modal dynamics.
3.4.1 Soft Filter
To realize the Filter function, we first consider the soft filter mechanism, whose output value is not binary.
The soft filter involves a scale factor that widens the distance between the elements of the vector it is applied to, a fully-connected network activated by ReLU, and an assignment vector that determines the noise level of the modality.
A penalty loss encourages the elements of the assignment vector to be close to 0 or 1. Nevertheless, these elements are unlikely to be exactly binary because they are continuous. Via the soft filter, the model can learn to estimate how much information in the modality should be filtered out instead of directly filtering out all the information, which provides a more fine-grained filtering effect. Since the output of the soft filter is a 2-dimensional vector, Eq. 16 has to be rewritten accordingly for the soft filter.
The soft filter differs from the attention mechanism in the following aspects: 1) it introduces a scale factor and a penalty loss to reach a better filtering effect; 2) it introduces the unimodal baseline embedding to compensate for the filtered modality-specific information; 3) it merely modifies the unimodal embedding and can therefore be integrated with any fusion mechanism.
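The ingredients of the soft filter can be sketched in plain Python as follows. The composition (a single ReLU layer, multiplication by the scale factor, then a softmax) and the penalty form `p * (1 - p)` are assumptions chosen to match the description; the original formulation may differ, and all names and parameter values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_filter(shift, weight, bias, alpha=5.0):
    """Return a 2-dim assignment vector [keep, drop] for one modality.

    Assumed form: softmax(alpha * ReLU(W @ shift + b)), where alpha is the
    scale factor that pushes the two entries apart.
    """
    hidden = [max(0.0, sum(w * s for w, s in zip(row, shift)) + b)
              for row, b in zip(weight, bias)]
    return softmax([alpha * h for h in hidden])


def penalty(assignment):
    """Assumed penalty: sum of p * (1 - p); zero only if every p is 0 or 1."""
    return sum(p * (1.0 - p) for p in assignment)


def apply_soft_filter(embedding, baseline, assignment):
    """Keep a share of the embedding and fill the rest with the baseline."""
    keep, drop = assignment
    return [keep * e + drop * b for e, b in zip(embedding, baseline)]


shift = [1.0, -0.5]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical layer parameters
bias = [0.0, 0.0]
mild = soft_filter(shift, weight, bias, alpha=1.0)
sharp = soft_filter(shift, weight, bias, alpha=5.0)
```

With the larger scale factor the assignment vector moves closer to a 0/1 pattern and the penalty becomes smaller, which is the effect the scale factor and penalty loss are said to encourage.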
3.4.2 Hard Filter
The output of the hard filter is a scalar that is either 0 or 1. However, due to the discrete nature of this output, training such a framework with gradient-based optimization algorithms is intractable. To resolve this problem, we follow Louizos et al. (2017) and use the reparameterization trick Kingma and Welling (2013) to compute unbiased and low-variance gradients. Specifically, we utilize the Hard Concrete distribution introduced in Louizos et al. (2017), which is a mixed discrete-continuous distribution on the interval [0, 1]. Hard Concrete assigns probability mass to exact zeros and ones, while it also allows continuous outcomes in the unit interval, such that the gradient can be computed via the reparameterization trick.
The computation of the hard filter output involves a temperature, hyper-parameters that rescale an intermediate value, and a random noise variable (drawn from a Gaussian distribution). Compared to using RL Chen et al. (2017); Zhang et al. (2019) to obtain an exact binary weight, using the Hard Concrete distribution is much simpler and more stable in training, with no additional optimization objectives or components introduced. Via the hard filter, the model can completely filter out noisy modalities, which cannot be achieved with attention mechanisms. For more details about the Hard Concrete distribution, please refer to Louizos et al. (2017).
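For reference, a minimal Python sketch of Hard Concrete sampling as defined by Louizos et al. (2017) is given below. It uses uniform noise, as in that work, whereas the text above mentions a Gaussian distribution, so the noise source here is an assumption. The default values of `beta`, `gamma` and `zeta` are the ones suggested by Louizos et al., not values confirmed for this model, and the function names are hypothetical.

```python
import math
import random


def hard_concrete_sample(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1, rng=random):
    """Draw one gate in [0, 1] from the Hard Concrete distribution.

    log_alpha: location parameter (in this setting it would be produced by
    the filter network); beta: temperature; gamma, zeta: stretch limits.
    """
    # Uniform noise, kept away from 0 and 1 so that the logs are finite.
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    logits = (math.log(u) - math.log(1.0 - u) + log_alpha) / beta
    s = 1.0 / (1.0 + math.exp(-logits))      # binary Concrete sample in (0, 1)
    stretched = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return min(1.0, max(0.0, stretched))     # hard clip to [0, 1]


def hard_concrete_deterministic(log_alpha, gamma=-0.1, zeta=1.1):
    """Noise-free gate, as commonly used at test time."""
    s = 1.0 / (1.0 + math.exp(-log_alpha))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))


rng = random.Random(0)
samples = [hard_concrete_sample(0.0, rng=rng) for _ in range(1000)]
```

Stretching the sample beyond [0, 1] and clipping it back is what puts probability mass on exact zeros and ones, while the samples that fall strictly inside the interval still carry gradients with respect to `log_alpha`.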
Table 1: Comparison with baselines on CMU-MOSI.
| Model | Acc7 | Acc2 | F1 | MAE | Corr |
|---|---|---|---|---|---|
| TFN Zadeh et al. (2017) | 32.2 | 76.4 | 76.3 | 1.017 | 0.604 |
| LMF Liu et al. (2018) | 30.6 | 73.8 | 73.7 | 1.026 | 0.602 |
| MFN Zadeh et al. (2018a) | 32.1 | 78.0 | 76.0 | 1.010 | 0.635 |
| RAVEN Wang et al. (2019) | 33.8 | 78.8 | 76.9 | 0.968 | 0.667 |
| MULT Tsai et al. (2019) | 33.6 | 79.3 | 78.3 | 1.009 | 0.667 |
| QMF Li et al. (2021) | 35.5 | 79.7 | 79.6 | 0.915 | 0.696 |
| MAG-BERT Rahman et al. (2020) | 42.9 | 83.5 | 83.5 | 0.790 | 0.769 |
Table 2: Comparison with baselines on CMU-MOSEI.
| Model | Acc7 | Acc2 | F1 | MAE | Corr |
|---|---|---|---|---|---|
| TFN Zadeh et al. (2017) | 49.8 | 79.4 | 79.7 | 0.610 | 0.671 |
| LMF Liu et al. (2018) | 50.0 | 80.6 | 81.0 | 0.608 | 0.677 |
| MFN Zadeh et al. (2018a) | 49.1 | 79.6 | 80.6 | 0.618 | 0.670 |
| RAVEN Wang et al. (2019) | 50.2 | 79.0 | 79.4 | 0.605 | 0.680 |
| MULT Tsai et al. (2019) | 48.2 | 80.2 | 80.5 | 0.638 | 0.659 |
| IMR Tsai et al. (2020) | 48.7 | 80.6 | 81.0 | - | - |
| QMF Li et al. (2021) | 47.9 | 80.7 | 79.8 | 0.640 | 0.658 |
| MAG-BERT Rahman et al. (2020) | 51.9 | 85.0 | 85.0 | 0.602 | 0.778 |
4.1 Experimental Setting
We use the CMU-MOSI Zadeh et al. (2016) and CMU-MOSEI Zadeh et al. (2018b) datasets to evaluate the model. Details about the datasets, evaluation protocols, baseline methods, and other experimental settings are provided in the Appendix.
During the training stage, we first update individual unimodal sub-networks with the modulated unimodal losses, after which the whole model is updated with the multimodal loss derived from MFM.
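The ordering of the two updates can be illustrated with a deliberately tiny Python toy. Everything about the model is an assumption made for brevity: each unimodal network is a single scalar weight, the shared classifier is the identity, fusion is the mean of the unimodal outputs, the MFM is omitted, and the modulation weight uses an assumed harmonic-mean form. Only the order (modulated unimodal updates first, then a multimodal update of the whole model) follows the text.

```python
def sign(x):
    """Sign of x as -1, 0 or 1."""
    return (x > 0) - (x < 0)


def train_step(weights, features, target, lr=0.05, eps=1e-8):
    """One two-stage update on a toy model; mutates `weights` in place.

    Returns the multimodal MAE measured before the stage-2 update.
    """
    # Stage 1: update each unimodal network with its modulated MAE loss.
    outputs = {m: weights[m] * features[m] for m in weights}
    losses = {m: max(abs(outputs[m] - target), eps) for m in weights}
    harmonic = len(losses) / sum(1.0 / v for v in losses.values())
    for m in weights:
        modulation = harmonic / losses[m]  # assumed weight, held constant
        grad = sign(outputs[m] - target) * features[m]  # d|out - y| / d w_m
        weights[m] -= lr * modulation * grad

    # Stage 2: update the whole model with the multimodal MAE loss.
    outputs = {m: weights[m] * features[m] for m in weights}
    fused = sum(outputs.values()) / len(outputs)
    for m in weights:
        grad = sign(fused - target) * features[m] / len(outputs)
        weights[m] -= lr * grad
    return abs(fused - target)


weights = {"language": 0.0, "acoustic": 0.0, "visual": 0.0}
features = {"language": 1.0, "acoustic": 1.0, "visual": 1.0}
first = train_step(weights, features, target=1.0)
second = train_step(weights, features, target=1.0)
```

In a real implementation the two stages would typically be two backward passes with separate losses, and the second stage would also pass the unimodal embeddings through the MFM before fusion.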
4.2 Experimental Results
4.2.1 Comparison with Baselines
In this section, we compare our proposed model with other baselines on two datasets, CMU-MOSI Zadeh et al. (2016) and CMU-MOSEI Zadeh et al. (2018b). As shown in Tables 1 and 2, although MAG-BERT outperforms the other existing methods and sets a high baseline due to the effectiveness of BERT Devlin et al. (2019), both the (Hard) and (Soft) variants of our proposed model outperform all baselines in most cases. Specifically, on the CMU-MOSI dataset, our method achieves the best results on all metrics, and the (Soft) variant outperforms MAG-BERT by 3.5% on Acc7, 2.2% on Acc2 and 2.1% on F1 score. On the CMU-MOSEI dataset, the (Hard) variant yields improvements of 0.8% on Acc7, 0.6% on Acc2 and 0.5% on F1 score over MAG-BERT. These results demonstrate the effectiveness of our proposed model and indicate the benefit of reaching optimal unimodal networks and filtering out noisy modalities.
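For readers unfamiliar with these metrics, the following Python sketch shows one common way to compute them from real-valued sentiment scores in [-3, 3]. The conventions used here are assumptions and not necessarily the evaluation protocol of this paper: Acc7 rounds clipped scores to the nearest integer class, Acc2 and F1 separate negative from non-negative scores, and F1 is computed for the non-negative class only. The function names are hypothetical.

```python
import math


def seven_class(score):
    """Map a sentiment score to one of the integer classes -3, ..., 3."""
    return int(round(min(3.0, max(-3.0, score))))


def evaluate(predictions, labels):
    """Return Acc7, Acc2, F1, MAE and Pearson correlation as a dict."""
    n = len(labels)
    pairs = list(zip(predictions, labels))
    acc7 = sum(seven_class(p) == seven_class(y) for p, y in pairs) / n

    pred_pos = [p >= 0 for p in predictions]
    true_pos = [y >= 0 for y in labels]
    acc2 = sum(p == t for p, t in zip(pred_pos, true_pos)) / n
    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(t and not p for p, t in zip(pred_pos, true_pos))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    mae = sum(abs(p - y) for p, y in pairs) / n
    mean_p = sum(predictions) / n
    mean_y = sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in pairs)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    corr = cov / math.sqrt(var_p * var_y) if var_p > 0 and var_y > 0 else 0.0
    return {"acc7": acc7, "acc2": acc2, "f1": f1, "mae": mae, "corr": corr}


metrics = evaluate([2.6, -1.2, 0.4, -0.3], [3.0, -1.0, 1.0, 0.2])
```

Higher is better for Acc7, Acc2, F1 and correlation, while lower is better for MAE.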
4.2.2 Ablation Study
In this section, we perform ablation studies to verify the effectiveness of each component by removing it from the model.
To verify the effectiveness of the designed modulation loss, we conduct experiments in which the modulation loss is removed (see the cases of '(Hard) (W/O ML)' and '(Soft) (W/O ML)' in Table 3). The experimental results show that removing the modulation loss degrades the performance of the model. Specifically, the performance on Acc7, Acc2 and F1 score drops considerably. This indicates that our proposed modulation loss is effective and can greatly boost performance.
Meanwhile, we design two ablation experiments to investigate the contribution of the MFM (see the cases of '(Hard) (W/O MFM)' and '(Soft) (W/O MFM)' in Table 3). We can observe that without the MFM, our model sees a greater drop in performance, which may be because noisy information interferes with the learning of other useful modalities. The results suggest that it is necessary to identify and filter out noisy modalities to obtain a correct multimodal embedding, and that in this way informative modalities can also be highlighted.
We also perform an ablation study on the use of the baseline embedding in the MFM (see the cases of '(Hard) (W/O BE)' and '(Soft) (W/O BE)' in Table 3). The results show that removing the compensation of the baseline embedding in the MFM degrades the performance of the model severely compared to the other cases. Specifically, the performance drop is even greater than in the cases W/O MFM. A possible reason is that, although noisy information is removed, the modality-specific information of the noisy modality is lost as well. The results indicate that learning the baseline embedding in the MFM is necessary, because it compensates for the filtered modality-specific information.
Table 3: Ablation study.
| Model | Acc7 | Acc2 | F1 | MAE | Corr |
|---|---|---|---|---|---|
| (Hard) (W/O ML) | 44.9 | 84.2 | 84.2 | 0.743 | 0.786 |
| (Soft) (W/O ML) | 46.2 | 85.0 | 84.9 | 0.729 | 0.794 |
| (Hard) (W/O MFM) | 47.0 | 84.8 | 84.8 | 0.725 | 0.791 |
| (Soft) (W/O MFM) | 44.2 | 83.9 | 83.9 | 0.737 | 0.794 |
| (Hard) (W/O BE) | 46.1 | 84.2 | 84.2 | 0.728 | 0.788 |
| (Soft) (W/O BE) | 46.1 | 83.9 | 83.9 | 0.733 | 0.794 |
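Table 4: Results with different fusion strategies.
| Model | Acc7 | Acc2 | F1 | MAE | Corr |
|---|---|---|---|---|---|
| Tensor Fusion (Hard) | 43.1 | 84.3 | 84.3 | 0.772 | 0.786 |
| Graph Fusion (Hard) | 45.7 | 84.6 | 84.6 | 0.759 | 0.772 |
| Tensor Fusion (Soft) | 43.8 | 84.7 | 84.7 | 0.742 | 0.787 |
| Graph Fusion (Soft) | 46.6 | 84.7 | 84.6 | 0.748 | 0.775 |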
4.2.3 Analysis of Generalization Ability
We also conduct experiments to verify that our proposed method generalizes to different fusion strategies. Previous works mostly rely on sophisticated fusion methods to sufficiently learn cross-modal dynamics and reach satisfactory results. Unlike them, our proposed model can achieve state-of-the-art performance with simple fusion strategies. As shown in Table 4, even with simple and direct fusion methods like concatenation and element-wise addition of the unimodal representations, our model still outperforms all baselines in most cases. Note that regardless of the choice of (Hard) or (Soft), all the variants of our model reach state-of-the-art performance compared to the baselines. We conclude that our designed modulation loss and MFM are effective and generalize well. Also note that our proposed modulation loss and MFM can be applied to any cross-modal scenario.
As shown in the table, when all the evaluation metrics are considered together, the simple fusion method, i.e., addition, performs best. We argue that, apart from the modulation loss, which helps to learn better unimodal representations, this is partly because we use the same classifier to regularize the feature distributions of the unimodal and multimodal representations, which forces them to have the same distribution, such that direct addition is strong enough to explore the complementary information and interactions between modalities. In contrast, highly complex learnable fusion methods may introduce noise to the distribution, which degrades performance. Specifically, we can observe that tensor fusion Zadeh et al. (2017) obtains relatively unfavorable results. The reason could be that tensor fusion applies the outer product to the vectors of all modalities, which may change the distribution of high-level features and burden the deep network by introducing a large amount of computation and many parameters.
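The difference in size between these fusion operators can be made concrete with a short Python sketch. The tensor fusion variant below follows the outer-product construction of Zadeh et al. (2017), in which each embedding is extended with a constant 1; the function names and the embedding size are hypothetical, and no learnable parameters are modelled.

```python
def concat_fusion(embeddings):
    """Concatenate the unimodal embeddings."""
    return [v for emb in embeddings for v in emb]


def addition_fusion(embeddings):
    """Element-wise sum of equally sized unimodal embeddings."""
    return [sum(values) for values in zip(*embeddings)]


def tensor_fusion(embeddings):
    """Flattened outer product of all embeddings, each extended with a 1."""
    fused = [1.0]
    for emb in embeddings:
        extended = list(emb) + [1.0]
        fused = [f * x for f in fused for x in extended]
    return fused


d = 4
language, acoustic, visual = [0.1] * d, [0.2] * d, [0.3] * d
sizes = {
    "addition": len(addition_fusion([language, acoustic, visual])),
    "concatenation": len(concat_fusion([language, acoustic, visual])),
    "tensor": len(tensor_fusion([language, acoustic, visual])),
}
# sizes == {"addition": 4, "concatenation": 12, "tensor": 125}
```

For three modalities of dimension d, addition keeps d features and concatenation 3d, whereas the outer product grows to (d + 1)^3, so the layer that consumes the fused vector needs correspondingly more parameters.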
4.2.4 Analysis on the Modality Importance
We provide a visualization of the mask values learned by the soft filter for the testing utterances, aiming to verify the effectiveness of the MFM in identifying and filtering out noisy modalities. Note that the values of 'Mask1' and 'Mask2' represent the percentages of preserved and filtered information of the corresponding modality, respectively. We can infer from Fig. 2 that the language modality is the most informative modality and is rarely filtered out (this conclusion is consistent with other works Mai et al. (2021b)). In contrast, the acoustic modality is the least informative modality and is frequently identified as noisy and filtered out. It can be seen that our MFM is capable of identifying and filtering out noisy modalities, which also highlights the role of informative modalities once the noisy information is filtered. Notably, the mean mask value is 0.998, 0.012, and 0.088 for the language, acoustic, and visual modalities, respectively.
Also, the visualization results show that the learned mask values approximate a 0-1 distribution (i.e., a modality is identified as either very informative or very noisy), which differs from existing attention mechanisms. The difference is mostly due to our scale factor and penalty loss. Apart from highlighting important modalities as attention mechanisms do, our MFM reaches a better filtering effect and can be integrated with any fusion mechanism. The visualization for the (Hard) variant is similar and is not presented due to page limitations.
We propose a novel MSA framework to modulate the learning of unimodal and cross-modal dynamics, which is capable of exploring an optimal solution for the unimodal networks and filtering out noisy modalities. Specifically, the modulation loss modulates the learning of the unimodal networks based on the confidence of their predictions, while the modality filter module filters out noisy modalities to obtain a correct multimodal embedding. Experiments demonstrate that our model outperforms state-of-the-art methods on two datasets.
- Multi-task learning for multi-modal emotion recognition and sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 370–379. Cited by: §1, §2.
Context-aware interactive attention for multi-modal sentiment and emotion analysis.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5651–5661. Cited by: §1, §2.
- Multimodal sentiment analysis with word-level fusion and reinforcement learning. In 19th ACM International Conference on Multimodal Interaction (ICMI’17), pp. 163–171. Cited by: §1, §2, §3.4.2, §3.4, §3.4, §3.4.
- COVAREP: a collaborative voice analysis repository for speech technologies. In ICASSP, pp. 960–964. Cited by: §C.4.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Cited by: Appendix A, §4.2.1.
- Cross modal focal loss for rgbd face anti-spoofing. In , pp. 7882–7891. Cited by: §1.
- What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis. Information Fusion 66, pp. 184–197. Cited by: §C.4.
- MISA: modality-invariant and -specific representations for multimodal sentiment analysis. Proceedings of the 28th ACM International Conference on Multimedia. Cited by: §1.
- Deep multimodal multilinear fusion with high-order polynomial pooling. In Advances in Neural Information Processing Systems, pp. 12113–12122. Cited by: Appendix B.
- IMotions. Facial expression analysis. Cited by: §C.4.
- Adam: a method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §C.4.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.4.2.
- Quantum-inspired multimodal fusion for video sentiment analysis. Information Fusion 65, pp. 58–71. Cited by: §C.3, §C.4, Table 1, Table 2.
- Strong and simple baselines for multimodal utterance embeddings. In NAACL, pp. 2599–2609. Cited by: §2.
- Multimodal language analysis with recurrent multistage fusion. In EMNLP, pp. 150–161. Cited by: §3.4.
- Efficient low-rank multimodal fusion with modality-specific factors. In ACL, pp. 2247–2256. Cited by: Appendix B, §C.3, §2, Table 1, Table 2.
- Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312. Cited by: §2, §3.4.2, §3.4.
- Divide, conquer and combine: hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In ACL, Cited by: Appendix B, §2.
- Modality to modality translation: an adversarial representation learning and graph fusion network for multimodal fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 164–172. Cited by: Appendix B, §1, §2, §3.3, §3.4, Table 4.
- A unimodal representation learning and recurrent decomposition fusion structure for utterance-level multimodal embedding learning. IEEE Transactions on Multimedia. Cited by: §1.
- Analyzing unaligned multimodal sequence via graph convolution and graph pooling fusion. Cited by: §2.
- Analyzing multimodal sentiment via acoustic- and visual-lstm with channel-aware temporal convolution network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1424–1437. Cited by: §1, §4.2.4.
- Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pp. 1791–1799. Cited by: §2.
- From utterance to text: the bias of language in speech and writing. Harvard Educational Review 47 (3), pp. 257–281. Cited by: §3.1.
- Found in translation: learning robust joint representations by cyclic translations between modalities. In AAAI, pp. 6892–6899. Cited by: §1.
- A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion 37, pp. 98–125. Cited by: §1.
- Context-dependent sentiment analysis in user-generated videos. In ACL, pp. 873–883. Cited by: §2.
- Convolutional mkl based multimodal emotion recognition and sentiment analysis. In Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 439–448. Cited by: §2.
- Integrating multimodal information in large pretrained transformers. Proceedings of the conference. Association for Computational Linguistics. Meeting 2020, pp. 2359–2369. Cited by: §C.3, §2, Table 1, Table 2.
- Ensemble of svm trees for multimodal emotion recognition. In Signal and Information Processing Association Summit and Conference, pp. 1–4. Cited by: §2.
- Interpreting graph neural networks for nlp with differentiable edge masking. arXiv preprint arXiv:2010.00577. Cited by: §1.
- Multilogue-net: a context aware rnn for multi-modal emotion detection and sentiment analysis in conversation. arXiv preprint arXiv:2002.08267. Cited by: §2.
- Learning factorized multimodal representations. In ICLR, Cited by: §2.
- Multimodal transformer for unaligned multimodal language sequences. In ACL, Cited by: §C.3, Table 1, Table 2.
- Multimodal routing: improving local and global interpretability of multimodal language analysis. arXiv preprint arXiv:2001.08735. Cited by: §C.3, Table 2.
- Attention is all you need. In NIPS, pp. 5998–6008. Cited by: Appendix A, §3.2.
- YouTube movie reviews: sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28 (3), pp. 46–53. Cited by: §2.
- Words can shift: dynamically adjusting word representations using nonverbal behaviors. In AAAI, Vol. 33, pp. 7216–7223. Cited by: §C.3, §3.3, Table 1, Table 2.
- CM-bert: cross-modal bert for text-audio sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 521–528. Cited by: §2.
- Tensor fusion network for multimodal sentiment analysis. In EMNLP, pp. 1114–1125. Cited by: Appendix B, §C.3, §2, §4.2.3, Table 1, Table 2, Table 4.
- Memory fusion network for multi-view sequential learning. In AAAI, pp. 5634–5641. Cited by: §C.3, Table 1, Table 2.
- Multimodal language analysis in the wild: cmu-mosei dataset and interpretable dynamic fusion graph. In ACL, pp. 2236–2246. Cited by: §C.1, §2, §4.1, §4.2.1.
- Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intelligent Systems 31 (6), pp. 82–88. Cited by: §4.2.1.
- MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. IEEE Intelligent Systems 31 (6), pp. 82–88. Cited by: §C.1, §4.1.
- Effective sentiment-relevant word selection for multi-modal sentiment analysis in spoken language. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 148–156. Cited by: §1, §2, §3.4.2, §3.4.
Appendix A Unimodal Network:
Since the Transformer Vaswani et al. (2017) enables parallel computation along the time dimension and can learn long-range temporal dependencies in long sequences, we use Transformer-based architectures to build the unimodal learning networks. Specifically, for the acoustic and visual modalities, we apply the standard Transformer to extract high-level unimodal representations. For the language modality, a large pre-trained Transformer model, i.e., BERT Devlin et al. (2019), is applied to extract the language representation. The equations are shown below:
Here, a temporal convolution operation with a fixed kernel size is used to map the output dimensionality of BERT to the shared dimensionality, which is equal for all modalities. Note that we only use the feature embedding of the last time step to conduct fusion and prediction, so that our model can handle the fusion of unimodal sequences of various lengths. For the acoustic and visual modalities, the equations are presented as follows:
Unlike in the language processing procedure, the temporal convolution operation for the other modalities is applied before the Transformer to map the feature dimensionality to the shared one.
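The dimensionality mapping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the kernel size `k`, the shared dimensionality `d_shared`, the sequence lengths, and the function name `temporal_conv` are all assumptions chosen for illustration, and the Transformer encoder that would sit between the projection and the last-step selection is omitted.

```python
import numpy as np

def temporal_conv(x, weight, bias):
    """1D convolution over time (valid padding, stride 1).

    x:      (T, d_in) unimodal feature sequence
    weight: (k, d_in, d_out) kernel spanning k time steps
    bias:   (d_out,)
    returns (T - k + 1, d_out)
    """
    k = weight.shape[0]
    steps = x.shape[0] - k + 1
    out = np.empty((steps, weight.shape[2]))
    for t in range(steps):
        # Contract the k-step window with the kernel over time and feature axes.
        out[t] = np.tensordot(x[t:t + k], weight, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
d_shared, k = 32, 3                      # hypothetical values
acoustic = rng.normal(size=(50, 74))     # 50 time steps, 74 acoustic features
visual = rng.normal(size=(40, 35))       # 40 time steps, 35 visual features

w_a, b_a = rng.normal(size=(k, 74, d_shared)), np.zeros(d_shared)
w_v, b_v = rng.normal(size=(k, 35, d_shared)), np.zeros(d_shared)

acoustic_proj = temporal_conv(acoustic, w_a, b_a)
visual_proj = temporal_conv(visual, w_v, b_v)

# Only the last time step is kept, so sequences of different lengths
# end up as vectors of identical size.
acoustic_last = acoustic_proj[-1]
visual_last = visual_proj[-1]
```

Whatever the input length or feature size, each modality ends up as a single vector of size `d_shared`, which is what allows a shared classifier to be applied later.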
Appendix B Multimodal Network:
Our algorithm is independent of the concrete fusion mechanism, and we can plug various fusion methods into our multimodal learning structure. In this paper, we mainly investigate four fusion methods to verify the effectiveness of our algorithm. Note that since the unimodal and multimodal representations share the same classifier, the dimensionality of the fused multimodal representation must be the same as that of the unimodal representations. The fusion methods are described as follows:
1) Direct Addition:
Here the multimodal representation is obtained by adding the unimodal representations. Since the addition does not change the feature dimensionality, we do not need to apply a learnable layer, such as a fully-connected layer, to change the feature dimensionality of the multimodal representation. Therefore, this fusion method has no learnable parameters. In our experiments, we show that even with such a simple fusion method, our algorithm can still reach very competitive performance.
2) Concatenation:
Here a fully-connected network maps the feature dimensionality of the concatenated unimodal representations to the shared dimensionality. This method is learnable, as it uses fully-connected layers to project the multimodal representation into the same embedding space as the unimodal representations. Together with Direct Addition, it serves as a baseline fusion method throughout research on multimodal learning.
3) Tensor Fusion: Tensor fusion Zadeh et al. (2017) is a widely used fusion algorithm that has attracted significant attention Mai et al. (2019); Liu et al. (2018); Hou et al. (2019). By applying the outer product over the unimodal representations, the generated multimodal representation has the highest expressive power but is also high-dimensional. The equations for tensor fusion are shown below:
Here the outer product is taken over the set of unimodal representation vectors, and a fully-connected network maps the resulting feature dimensionality to the shared dimensionality. In Eq. 26, each unimodal representation is padded with 1s to retain the interactions of any subset of modalities, as in Zadeh et al. (2017).
4) Graph Fusion: Graph fusion Mai et al. (2020a) regards each modality as one node and conducts message passing between nodes to explore unimodal, bimodal, and trimodal dynamics. The final graph representation is obtained by averaging the node embeddings. For more details, please refer to the Graph Fusion Network in Mai et al. (2020a).
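As a rough illustration of how the fused representation keeps the same dimensionality as the unimodal ones, the sketch below implements three of the fusion methods in NumPy. It is a simplified sketch under stated assumptions rather than the authors' code: the shared dimensionality `d`, the random weights, and all function names are hypothetical, the fully-connected mapping is reduced to a single linear layer, the second method is assumed to be concatenation followed by that layer, and graph fusion is omitted.

```python
import numpy as np

def direct_addition(reps):
    """Element-wise sum; the output keeps dimensionality d and needs no parameters."""
    return np.sum(reps, axis=0)

def concat_fusion(reps, weight, bias):
    """Concatenate to 3d features, then a linear layer maps them back to d."""
    return np.concatenate(reps) @ weight + bias

def fusion_tensor(reps):
    """Outer product of the 1-padded representations, shape (d+1, d+1, d+1)."""
    padded = [np.concatenate(([1.0], r)) for r in reps]
    return np.einsum("i,j,k->ijk", *padded)

def tensor_fusion(reps, weight, bias):
    """Flatten the fusion tensor and map it to d with a linear layer."""
    return fusion_tensor(reps).reshape(-1) @ weight + bias

rng = np.random.default_rng(0)
d = 4                                                # hypothetical shared size
x_l, x_a, x_v = (rng.normal(size=d) for _ in range(3))
reps = [x_l, x_a, x_v]

w_cat, b_cat = rng.normal(size=(3 * d, d)), np.zeros(d)
w_tf, b_tf = rng.normal(size=((d + 1) ** 3, d)), np.zeros(d)

fused_add = direct_addition(reps)
fused_cat = concat_fusion(reps, w_cat, b_cat)
fused_tf = tensor_fusion(reps, w_tf, b_tf)
tensor = fusion_tensor(reps)
```

Because of the padding with 1s, slices of `tensor` such as `tensor[1:, 0, 0]` and `tensor[1:, 1:, 0]` hold the unimodal language features and the language-acoustic bimodal interactions, respectively, alongside the trimodal terms.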
Appendix C Experimental Setting
C.1 Datasets
1) CMU-MOSI is a widely used dataset for multimodal sentiment analysis, consisting of 2,199 opinion video clips. Each clip is annotated with a sentiment score in the range [-3, 3]. To be consistent with prior works, we use 1,284 utterances for training, 229 utterances for validation, and 686 utterances for testing.
2) CMU-MOSEI is a large dataset for multimodal sentiment analysis and emotion recognition. The dataset consists of 23,454 video utterances from more than 1,000 YouTube speakers, covering 250 distinct topics. All the utterances are randomly chosen from various topics and monologue videos, and each utterance is annotated from two views: emotion with six different values, and sentiment in the range [-3, 3]. In our work, we use the sentiment label to perform MSA. We use 16,265 utterances as the training set, 1,869 utterances as the validation set, and 4,643 utterances as the testing set.
C.2 Evaluation Protocol
In our experiments, the evaluation metrics for CMU-MOSEI are the same as those for the CMU-MOSI dataset. We adopt the following metrics to evaluate the performance of each model: 1) Acc7: 7-way accuracy for sentiment score classification; 2) Acc2: binary accuracy (positive or negative); 3) F1 score; 4) MAE: mean absolute error; and 5) Corr: the correlation between the model’s predictions and the human annotations.
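For concreteness, the five metrics can be computed from continuous predictions roughly as in the sketch below. Several details are assumptions rather than the protocol used in the paper: scores are rounded to the nearest integer in [-3, 3] for Acc7, a score of zero is counted as positive for Acc2 and F1, F1 is the plain binary F1 of the positive class, and Corr is taken to be the Pearson correlation. The function name `msa_metrics` and the example values are hypothetical.

```python
import numpy as np

def msa_metrics(preds, labels):
    """Compute Acc7, Acc2, F1, MAE and Corr from continuous sentiment scores."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Acc7: round both scores to one of the seven integer classes in [-3, 3].
    pred_cls = np.round(np.clip(preds, -3, 3))
    true_cls = np.round(np.clip(labels, -3, 3))
    acc7 = float(np.mean(pred_cls == true_cls))

    # Acc2 / F1: positive vs. negative (assumption: zero counts as positive).
    pred_pos, true_pos = preds >= 0, labels >= 0
    acc2 = float(np.mean(pred_pos == true_pos))
    tp = int(np.sum(pred_pos & true_pos))
    fp = int(np.sum(pred_pos & ~true_pos))
    fn = int(np.sum(~pred_pos & true_pos))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0

    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(np.corrcoef(preds, labels)[0, 1])  # Pearson correlation
    return {"acc7": acc7, "acc2": acc2, "f1": f1, "mae": mae, "corr": corr}

# Made-up scores for illustration only.
example_preds = [2.4, -1.2, 0.3, -2.8, 1.1]
example_labels = [3.0, -1.0, -0.4, -3.0, 1.0]
metrics = msa_metrics(example_preds, example_labels)
```

Other conventions exist in the MSA literature, for example excluding neutral utterances from the binary metrics or reporting a weighted F1, so these choices should be matched to the protocol of the works being compared against.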
C.3 Baselines
We compare our proposed model with the following state-of-the-art models:
1) Early Fusion LSTM (EF-LSTM), which is the baseline fusion approach that concatenates the input features of different modalities at word-level, and then sends the concatenated features to an LSTM layer. EF-LSTM is an RNN-based word-level fusion model.
2) Late Fusion LSTM (LF-LSTM), which is another baseline method that uses an LSTM network for each modality to extract unimodal features and infer a decision, and then combines the unimodal decisions by a voting mechanism, etc.
3) Recurrent Attended Variation Embedding Network (RAVEN) Wang et al. (2019), which models human language by shifting word representations based on the features of the facial expressions and vocal patterns. It is an RNN-based word-level fusion approach.
4) Memory Fusion Network (MFN) Zadeh et al. (2018a), which is also an RNN-based word-level fusion method and includes three components. The first component is a system of LSTMs used to model unimodal dynamics. The other two components are the delta-attention module and the multi-view gated memory network, which are used for discovering cross-modal dynamics through time.
5) Multimodal Transformer (MULT) Tsai et al. (2019), which learns joint multimodal representation by translating source modality into target modality. It is a transformer-based model.
6) Interpretable Modality Fusion (IMR) Tsai et al. (2020), which improves the interpretability of MULT by introducing the multimodal routing mechanism. IMR is also a transformer-based model.
7) Tensor Fusion Network (TFN) Zadeh et al. (2017), which applies a 3-fold outer product over the modality embeddings to jointly learn unimodal, bimodal, and trimodal interactions.
8) Low-rank Multimodal Fusion (LMF) Liu et al. (2018), which leverages low-rank weight tensors to reduce the complexity of tensor fusion without compromising performance.
9) Quantum-inspired Multimodal Fusion (QMF) Li et al. (2021), which addresses the interpretability problem of multimodal fusion by taking inspiration from quantum theory.
10) Multimodal Adaptation Gate BERT (MAG-BERT) Rahman et al. (2020), which proposes an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG) that allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. The feature extraction method of MAG-BERT is the same as that of our method, which ensures a fair comparison. MAG-BERT is currently the state-of-the-art algorithm for multimodal sentiment analysis.
C.4 Experimental Details
For each baseline (except for QMF Li et al. (2021), whose code is unavailable), following Gkoumas et al. (2021), we first perform a random grid search with fifty trials over the hyper-parameters to tune the model, and save the hyper-parameter setting that reaches the best performance. We then train each model five times with the best hyper-parameter setting and report the mean of the results.
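The tuning procedure above can be outlined as follows. This is only a sketch under stated assumptions: the search space, the seeds, and the function names are hypothetical, a higher score is assumed to be better, and `train_and_evaluate` is a synthetic stand-in for actually training a baseline and measuring its performance.

```python
import math
import random
import statistics

# Hypothetical search space; the actual hyper-parameters and ranges are not
# specified in the text.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.3, 0.5],
    "hidden_dim": [32, 64, 128],
}

def train_and_evaluate(config, seed):
    """Stand-in for training one model; returns a synthetic score."""
    rng = random.Random(seed)
    score = 0.8
    score -= 0.05 * abs(math.log10(config["learning_rate"]) + 4)
    score -= 0.2 * abs(config["dropout"] - 0.1)
    return score + rng.uniform(-0.01, 0.01)  # run-to-run noise

def random_search(space, n_trials=50, seed=0):
    """Sample n_trials random grid points and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for trial in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_evaluate(config, seed=trial)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def mean_over_runs(config, n_runs=5):
    """Retrain with the chosen setting n_runs times and average the scores."""
    scores = [train_and_evaluate(config, seed=1000 + run) for run in range(n_runs)]
    return statistics.mean(scores), scores

best_config, best_score = random_search(SEARCH_SPACE)
final_score, run_scores = mean_over_runs(best_config)
```

Averaging over several runs with different seeds reduces the influence of a single lucky initialization on the reported number.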
For the CMU-MOSEI dataset, the input dimensionalities of the language, audio, and visual modalities are 768, 74, and 35, respectively, while for CMU-MOSI they are 768, 74, and 47, respectively. For feature extraction, Facet iMotions (2017) (https://imotions.com/) is used for the visual modality to extract a set of features composed of facial action units, facial landmarks, head pose, etc. These visual features are extracted from the video utterance at a frequency of 30 Hz to form a sequence of facial gestures over time. COVAREP Degottex et al. (2014) is used to extract the features of the acoustic modality, including 12 Mel-frequency cepstral coefficients, pitch tracking, speech polarity, glottal closure instants, spectral envelope, etc. These acoustic features are extracted from the full audio clip of each utterance at 100 Hz to form a sequence that represents variations in the tone of voice across the utterance.