Introduction and Related Works
Multimodal data fusion is desirable for many machine learning tasks where information is available from multiple source modalities, and it typically yields better predictions. Multimodal integration can handle missing data from one or more modalities. Since individual modalities can be noisy, combining them can also lead to more robust predictions. Moreover, since some information may not be visible in some modalities, or a single modality may not be powerful enough for a specific task, considering multiple modalities often improves performance [Potamianos et al.2003, Soleymani, Pantic, and Pun2012, Kampman et al.2018].
For example, humans assign personality traits to each other, as well as to virtual characters by inferring personality from diverse cues, both behavioral and verbal, suggesting that a model to predict personality should take into account multiple modalities such as language, speech, and visual cues.
Multimodal fusion has a very broad range of applications, including audio-visual speech recognition [Potamianos et al.2003], multimodal emotion recognition [Soleymani, Pantic, and Pun2012], medical image analysis [James and Dasarathy2014], multimedia event detection [Lan et al.2014], personality trait detection [Kampman et al.2018], and sentiment analysis [Zadeh et al.2017].
According to the recent work by [Baltrušaitis, Ahuja, and Morency2018], the techniques for multimodal fusion can be divided into early, late and hybrid approaches. Early approaches combine the multimodal features immediately by simply concatenating them [D’mello and Kory2015]. Late fusion combines the decision for each modality (either classification, or regression), by voting [Morvant, Habrard, and Ayache2014], averaging [Shutova, Kiela, and Maillard2016] or weighted sum of the outputs of the learned models [Glodek et al.2011, Shutova, Kiela, and Maillard2016]. The hybrid approach combines the prediction by early fusion and unimodal predictions.
It has been observed that early fusion (feature level fusion) concentrates on the inter-modality information rather than intra-modality information [Zadeh et al.2017] due to the fact that inter-modality information can be more complicated at the feature level and dominates the learning process. On the other hand, these fusion approaches are not powerful enough to extract the inter-modality integration model and they are limited to some simple combining methods [Zadeh et al.2017].
[Zadeh et al.2017] proposed combining n modalities by computing an n-dimensional tensor as the tensor product of the n modality representations, followed by a flattening operation, in order to include 1st-order to n-th-order inter-modality relations. This is then fed to a neural network model to make predictions. The authors show that their proposed method improves accuracy by considering both inter-modality and intra-modality relations. However, the generated representation has a very large dimension, which leads to a very large hidden layer and therefore a huge number of parameters. Recently, [Liu et al.2018] proposed a factorization approach that yields a factorized version of the weight matrix, leading to fewer parameters while maintaining model accuracy. They use a CANDECOMP/PARAFAC decomposition [Carroll and Chang1970, Harshman1970], which follows Eq. 1, in order to decompose a tensor $\mathcal{W}$ into several 1-dimensional vectors $w_m^{(i)}$:
$$\mathcal{W} = \sum_{i=1}^{r} \lambda_i \, w_1^{(i)} \otimes w_2^{(i)} \otimes \cdots \otimes w_M^{(i)} \qquad (1)$$
where $\otimes$ is the outer product operator and the $\lambda_i$ are scalar weights that combine the rank-1 decompositions. The authors' approach uses the same factorization rate for all modalities, i.e. $r$ is shared across modalities, resulting in the same compression rate for every modality; it cannot accommodate varying compression rates between modalities. Previous studies have found that some modalities are more informative than others [De Silva, Miyasato, and Nakatsu1997, Kampman et al.2018], suggesting that allowing different compression rates for different modalities should improve performance.
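The CANDECOMP/PARAFAC form described above can be illustrated with a minimal NumPy sketch. The function name `cp_reconstruct`, the array layout, and the sizes are assumptions made for illustration; they are not taken from [Liu et al.2018]. The sketch rebuilds a tensor as a weighted sum of rank-1 outer products and compares the storage needed by the two forms.

```python
import numpy as np
from functools import reduce

def cp_reconstruct(lams, factors):
    """Rebuild a tensor from a rank-r CP decomposition.

    lams:    array of shape (r,) holding the scalar weights.
    factors: list of M arrays; factors[m] has shape (r, d_m) and its
             i-th row is the i-th vector for mode m.
    """
    shape = tuple(f.shape[1] for f in factors)
    out = np.zeros(shape)
    for i in range(len(lams)):
        # Outer product of the i-th vector of every mode: a rank-1 tensor.
        rank_one = reduce(np.multiply.outer, [f[i] for f in factors])
        out += lams[i] * rank_one
    return out

rng = np.random.default_rng(0)
r, dims = 4, (5, 6, 7)                 # one shared rank r for all modes
lams = rng.normal(size=r)
factors = [rng.normal(size=(r, d)) for d in dims]
W = cp_reconstruct(lams, factors)

n_full = int(np.prod(dims))            # entries in the full tensor
n_cp = r + r * sum(dims)               # numbers stored by the CP form
```

Note that a single rank `r` governs every mode here, which is the restriction the modality-specific approach is meant to lift.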
Our method, Modality-based Redundancy Reduction multimodal Fusion (MRRF), instead uses Tucker's tensor decomposition (see the Methodology section), which allows a different factorization rate for each modality and thus accommodates variations in the amount of useful information between modalities. Modality-specific factors are chosen by maximising performance on a validation set. Applying a modality-based factorization method removes the redundant information in the aforementioned high-order dependency structure and leads to fewer parameters with minimal information loss. Our method works as a regularizer, which leads to a less complicated model and reduces overfitting. In addition, our modality-based factorization approach helps to determine the amount of useful information in each modality for the task at hand.
We evaluate the performance of our approach on sentiment analysis, personality detection, and emotion recognition from audio, text and video frames. The method reduces the number of parameters, which in turn requires fewer training samples, provides efficient training on smaller datasets, and accelerates both training and prediction. Our experimental results demonstrate that the proposed approach can make notable improvements in terms of accuracy, mean absolute error (MAE), correlation, and F1 score, especially for applications with more complicated inter-modality relations.
We further study the effect of different factorization rates for different modalities. Our results on the importance of each modality for each task support previous results on the usefulness of each modality for personality recognition, emotion recognition and sentiment analysis. Moreover, our results demonstrate that our factorization approach avoids underfitting and overfitting for very simple and very large models, respectively.
In the sequel, we first explain the notation used in this paper. We then elaborate on the details of our proposed method in the Methodology section. In the following section we describe our experimental setup. In the Results section, we compare the performance of MRRF and state-of-the-art baselines on three different datasets and discuss the effect of the factorization rate on each modality. Finally, we provide a brief conclusion on the approach and the results.
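The operator $\otimes$ is the outer product operator, which for vectors $z_1, \ldots, z_M$ with $z_i \in \mathbb{R}^{d_i}$ leads to an M-dimensional tensor $z_1 \otimes \cdots \otimes z_M$ in $\mathbb{R}^{d_1 \times \cdots \times d_M}$. The operator $\times_k$, for a given $k$, is the k-mode product of a tensor $\mathcal{R} \in \mathbb{R}^{r_1 \times \cdots \times r_M}$ and a matrix $W \in \mathbb{R}^{d_k \times r_k}$, written $\mathcal{R} \times_k W$, which results in a tensor $\bar{\mathcal{R}} \in \mathbb{R}^{r_1 \times \cdots \times r_{k-1} \times d_k \times r_{k+1} \times \cdots \times r_M}$. This operator first flattens the tensor $\mathcal{R}$ along mode $k$ and converts it to a matrix $R_{(k)} \in \mathbb{R}^{r_k \times \prod_{i \neq k} r_i}$. The next step is a simple matrix product, $\bar{R}_{(k)} = W R_{(k)}$. By unflattening the resulting matrix, we can convert it to a tensor in $\mathbb{R}^{r_1 \times \cdots \times r_{k-1} \times d_k \times r_{k+1} \times \cdots \times r_M}$.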
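The three steps of the k-mode product (flatten, matrix product, unflatten) can be sketched in NumPy as follows. The function name `mode_k_product` and the 0-based mode index are choices made for this illustration only.

```python
import numpy as np

def mode_k_product(tensor, matrix, k):
    """k-mode product of `tensor` with `matrix` (0-based mode index k).

    tensor: array of shape (r_1, ..., r_M)
    matrix: array of shape (d_k, r_k)
    Returns an array whose k-th dimension is d_k instead of r_k.
    """
    # Flatten: move mode k to the front and merge all other modes.
    moved = np.moveaxis(tensor, k, 0)
    rest_shape = moved.shape[1:]
    flat = moved.reshape(tensor.shape[k], -1)       # (r_k, product of the rest)
    # Plain matrix product.
    product = matrix @ flat                          # (d_k, product of the rest)
    # Unflatten and move the new mode back to position k.
    unflat = product.reshape((matrix.shape[0],) + rest_shape)
    return np.moveaxis(unflat, 0, k)

rng = np.random.default_rng(0)
R = rng.normal(size=(2, 3, 4))
W = rng.normal(size=(5, 3))
out = mode_k_product(R, W, k=1)     # mode of size 3 is replaced by size 5
```

The same result can be obtained with a single `np.einsum` call; the explicit version is shown because it mirrors the description step by step.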
We propose Modality-based Redundancy Reduction Fusion (MRRF), a tensor fusion and factorisation method allowing for modality specific compression rates, combining the power of tensor fusion methods with a reduced parameter complexity.
Instead of simply concatenating the feature vectors of each modality, or using expensive fusion approaches such as the tensor fusion method by [Zadeh et al.2017], our aim is to use a compressed tensor fusion method.
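The operator $\times_k$ is the k-mode product of a tensor $\mathcal{R} \in \mathbb{R}^{r_1 \times \cdots \times r_M}$ and a matrix $W \in \mathbb{R}^{d_k \times r_k}$, written $\mathcal{R} \times_k W$, which results in a tensor in $\mathbb{R}^{r_1 \times \cdots \times r_{k-1} \times d_k \times r_{k+1} \times \cdots \times r_M}$.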
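For $M$ modalities with representations $z_1$, $z_2$, ..., and $z_M$ of size $d_1$, $d_2$, ..., and $d_M$, an $M$-modal tensor fusion approach as proposed by the authors of [Zadeh et al.2017] leads to a tensor $\mathcal{Z} = z_1 \otimes z_2 \otimes \cdots \otimes z_M \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_M}$. The authors proposed flattening the tensor layer in the deep network, which results in loss of the information included in the tensor structure. In this paper, we propose to avoid the flattening and follow Eq. 3 with weight tensor $\mathcal{W} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_M \times h}$ and bias vector $b \in \mathbb{R}^{h}$, which leads to an output layer $H$ of size $h$:
$$H_j = \sum_{i_1, \ldots, i_M} \mathcal{W}_{i_1 \cdots i_M j} \, \mathcal{Z}_{i_1 \cdots i_M} + b_j, \qquad j = 1, \ldots, h \qquad (3)$$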
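As a rough illustration of an unflattened tensor fusion layer, the sketch below builds the trimodal fusion tensor and contracts it with a weight tensor that has one extra mode for the outputs. The sizes, variable names, and the placement of the output mode last are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_t, d_v, h = 3, 4, 5, 2           # illustrative sizes only

z_a = rng.normal(size=d_a)              # audio representation
z_t = rng.normal(size=d_t)              # text representation
z_v = rng.normal(size=d_v)              # video representation

# Trimodal fusion tensor: Z[i, j, k] = z_a[i] * z_t[j] * z_v[k]
Z = np.einsum('i,j,k->ijk', z_a, z_t, z_v)

# Weight tensor with one extra mode for the h outputs, plus a bias vector.
W = rng.normal(size=(d_a, d_t, d_v, h))
b = rng.normal(size=h)

# Contract over the three modality modes; the result has size h.
H = np.einsum('ijk,ijko->o', Z, W) + b

n_params = W.size + b.size              # grows with d_a * d_t * d_v * h
```

Even at these toy sizes the weight tensor holds 120 entries, which shows how quickly the parameter count grows with the feature dimensions.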
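The above equation suffers from a large number of parameters ($O(h \prod_{i=1}^{M} d_i)$), which requires a large number of training samples and a great deal of time and memory, and easily overfits. In order to reduce the number of parameters, we propose to use Tucker's tensor decomposition [Tucker1966, Hitchcock1927] as shown in Eq. 4, which works as a low-rank regularizer [Fazel2002].
$$\mathcal{W} = \mathcal{R} \times_1 W_1 \times_2 W_2 \cdots \times_M W_M \times_{M+1} W_{M+1} \qquad (4)$$
Here the weight tensor $\mathcal{W}$ of Eq. 3, of size $d_1 \times \cdots \times d_M \times h$, is expressed through a core tensor $\mathcal{R} \in \mathbb{R}^{r_1 \times \cdots \times r_M \times r_{M+1}}$, modality-specific factor matrices $W_i \in \mathbb{R}^{d_i \times r_i}$ for $i \le M$, and a factor $W_{M+1} \in \mathbb{R}^{h \times r_{M+1}}$ for the output mode.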
The non-diagonal core tensor maintains inter-modality information after compression, unlike the factorization proposed by [Liu et al.2018], which loses part of the inter-modality information.
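The effect of a Tucker-factorized weight tensor can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the sizes and ranks are arbitrary, and the output-mode factor is assumed to be absorbed into the core tensor. The sketch checks that projecting each modality to its own rank and applying the small core gives the same output as expanding the full weight tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = (6, 7, 8)        # d_1, d_2, d_3: modality feature sizes (illustrative)
ranks = (2, 3, 4)       # r_1, r_2, r_3: a different rank per modality
h = 5                   # output size

z = [rng.normal(size=d) for d in dims]
factors = [rng.normal(size=(d, r)) for d, r in zip(dims, ranks)]  # W_i
core = rng.normal(size=ranks + (h,))                              # core tensor

# Compressed path: project each modality to size r_i, fuse, apply the core.
p = [W.T @ z_i for W, z_i in zip(factors, z)]
H_small = np.einsum('a,b,c,abco->o', p[0], p[1], p[2], core)

# Reference path: expand the full weight tensor, then contract with Z.
W_full = np.einsum('abco,ia,jb,kc->ijko', core, *factors)
Z = np.einsum('i,j,k->ijk', *z)
H_full = np.einsum('ijk,ijko->o', Z, W_full)

n_mrrf = sum(W.size for W in factors) + core.size   # factorized parameters
n_full = W_full.size                                # unfactorized parameters
```

Because the core is a dense tensor rather than a diagonal one, every combination of the projected modality features can interact, while the full weight tensor never has to be materialized during the forward pass.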
For a multimodal deep neural network architecture consisting of three separate channels for audio, text, and video, we can represent the method as seen in Fig. 2. It is worth mentioning that a simple outer product of the input features leads only to the high-order trimodal dependencies. In order to overcome this drawback, the input feature vectors for each modality are padded with a 1, and thus the unimodal and bimodal dependencies are obtained as well. Algorithm 1 shows the whole MRRF algorithm.
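The role of the padding can be checked on a small example. In the sketch below (feature values and the helper name `pad_one` are illustrative), a constant 1 is prepended to each feature vector, and the resulting fusion tensor is sliced to recover the unimodal, bimodal and trimodal terms.

```python
import numpy as np

a = np.array([2.0, 3.0])      # audio features (illustrative values)
t = np.array([5.0])           # text features
v = np.array([7.0, 11.0])     # video features

def pad_one(x):
    """Prepend a constant 1 to a feature vector."""
    return np.concatenate(([1.0], x))

Z = np.einsum('i,j,k->ijk', pad_one(a), pad_one(t), pad_one(v))

constant = Z[0, 0, 0]          # product of the three padded ones
unimodal_a = Z[1:, 0, 0]       # the audio features on their own
unimodal_t = Z[0, 1:, 0]       # the text features on their own
unimodal_v = Z[0, 0, 1:]       # the video features on their own
bimodal_at = Z[1:, 1:, 0]      # outer product of audio and text
trimodal = Z[1:, 1:, 1:]       # outer product of all three modalities
```

Without the padding, only the `trimodal` block would be present in the fusion tensor.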
It is worth noting that the factorization step is task dependent, is included in the deep network structure, and is learned during network training. Thus, for follow-up learning tasks, we would learn a new factorization specific to the task at hand, typically also estimating optimal compression ratios as described in the discussion section. In this process, any shared, helpful information is retained, as demonstrated by our results.
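Following our proposed approach, we have decomposed the trainable tensor $\mathcal{W}$ into four substantially smaller trainable matrices (the three modality factors $W_1$, $W_2$, $W_3$ with $W_i \in \mathbb{R}^{d_i \times r_i}$, and the flattened core $R$), leading to $O(\sum_{i=1}^{3} d_i r_i + h \prod_{i=1}^{3} r_i)$ parameters.
For feature-level information of size $d_1$, $d_2$ and $d_3$ for three different modalities, concat fusion (CF) leads to a layer size of $O(\sum_{i=1}^{3} d_i)$ and $O(h \sum_{i=1}^{3} d_i)$ parameters.
The tensor fusion approach (TF), which applies the flattening method directly to Eq. 3, leads to a layer size of $O(\prod_{i=1}^{3} d_i)$ and $O(h \prod_{i=1}^{3} d_i)$ parameters. The LMF approach [Liu et al.2018] requires training $O(r h \sum_{i=1}^{3} d_i)$ parameters, where $r$ is the rank used for all the modalities.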
It can be seen that the number of parameters in the proposed approach is substantially smaller than in the simple tensor fusion (TF) approach and comparable to the LMF approach. For example, most often is presumable which leads to much fewer parameters for MRRF than LMF. However, although the simple concatenation approach has fewer parameters, it performs worse because it is biased toward intra-modality information representation rather than inter-modality information fusion.
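To make the comparison concrete, the following sketch evaluates leading-order parameter counts for the four fusion layers. The formulas are assumptions based on the standard form of each layer (a dense layer on concatenated features, a dense layer on the flattened fusion tensor, a rank-r factorization shared by all modalities, and modality factors plus a core tensor); the feature sizes and ranks are hypothetical.

```python
from math import prod

def param_counts(dims, ranks, h, lmf_rank):
    """Leading-order parameter counts of the fusion layer (bias ignored).

    dims:     feature size d_i of each modality
    ranks:    modality-specific rank r_i used by MRRF
    h:        output size
    lmf_rank: single rank r shared by all modalities in LMF
    """
    return {
        "CF": sum(dims) * h,
        "TF": prod(dims) * h,
        "LMF": lmf_rank * h * sum(dims),
        "MRRF": sum(d * r for d, r in zip(dims, ranks)) + prod(ranks) * h,
    }

# Hypothetical sizes, chosen only to show the orders of magnitude.
counts = param_counts(dims=(32, 64, 32), ranks=(4, 8, 4), h=1, lmf_rank=4)
for name, value in counts.items():
    print(f"{name:5s} {value}")
```

How MRRF compares to LMF depends on the chosen ranks, since the core tensor contributes a term that grows with the product of the modality ranks.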
We perform our experiments on the following multimodal datasets: CMU-MOSI [Zadeh et al.2016], POM [Park et al.2014], and IEMOCAP [Busso et al.2008] for sentiment analysis, speaker traits recognition, and emotion recognition, respectively. These tasks can be done by integrating both verbal and nonverbal behaviors of the persons.
The CMU-MOSI dataset is annotated on a seven-step scale as highly negative, negative, weakly negative, neutral, weakly positive, positive, highly positive, which can be considered a 7-class classification problem with 7 labels in the range [-3, +3]. The dataset consists of 2199 annotated opinion utterances from 93 distinct YouTube movie reviews, each containing several opinion segments. Segments average 4.2 seconds in length.
The POM dataset is composed of 903 movie review videos. Each video is annotated with the following speaker traits: confident, passionate, voice pleasant, dominant, credible, vivid, expertise, entertaining, reserved, trusting, relaxed, outgoing, thorough, nervous, persuasive and humorous.
The IEMOCAP dataset is a collection of 151 videos of recorded dialogues, with 2 speakers per session for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disgust and neutral).
Each dataset consists of three modalities, namely language, visual, and acoustic. The visual and acoustic features are calculated by taking the average of their feature values over the word time interval [Chen et al.2017]. In order to perform time alignment across modalities, the three modalities are aligned using P2FA [Yuan and Liberman2008] at the word level.
Pre-trained 300-dimensional GloVe word embeddings [Chen et al.2017] were used to extract the language feature representations, which encode a sequence of transcribed words as a sequence of vectors.
Visual features for each frame (sampled at 30 Hz) are extracted using the Facet library (goo.gl/1rh1JN), which includes 20 facial action units, 68 facial landmarks, head pose, gaze tracking and HOG features [Zhu et al.2006].
The COVAREP acoustic analysis framework [Degottex et al.2014] is used to extract low-level acoustic features, including 12 Mel frequency cepstral coefficients (MFCCs), pitch, voiced/unvoiced segmentation, glottal source, peak slope, and maxima dispersion quotient features.
To evaluate model generalization, all datasets are split into training, validation, and test sets such that the splits are speaker independent, i.e., no speakers from the training set are present in the test sets. Table 1 illustrates the data splits for all the datasets in detail.
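A speaker-independent split can be produced by assigning whole speakers, rather than individual samples, to each partition. The sketch below is a minimal illustration; the function name, the split fractions and the speaker identifiers are hypothetical and do not correspond to the splits reported in Table 1.

```python
import random

def speaker_independent_split(speakers, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole speakers to train/valid/test and return sample indices.

    speakers:  list with the speaker id of every sample
    fractions: share of speakers (not samples) for train and valid;
               the remaining speakers go to test
    """
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    n_train = int(round(fractions[0] * len(unique)))
    n_valid = int(round(fractions[1] * len(unique)))
    groups = {
        "train": set(unique[:n_train]),
        "valid": set(unique[n_train:n_train + n_valid]),
        "test": set(unique[n_train + n_valid:]),
    }
    return {name: [i for i, s in enumerate(speakers) if s in ids]
            for name, ids in groups.items()}

speakers = [f"spk{i % 10}" for i in range(50)]   # 10 hypothetical speakers
split = speaker_independent_split(speakers)
train_ids = {speakers[i] for i in split["train"]}
test_ids = {speakers[i] for i in split["test"]}
```

Since every speaker appears in exactly one partition, `train_ids` and `test_ids` cannot overlap.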
Similarly to [Liu et al.2018], we use a simple model architecture for extracting the representations for each modality. We used three unimodal sub-embedding networks to extract a representation for each modality. For the acoustic and visual modalities, the sub-embedding network is a simple 2-layer feed-forward neural network, and for language, we used a long short-term memory (LSTM) network [Hochreiter and Schmidhuber1997].
We compared our proposed method with three baseline methods. Concat fusion (CF) [Baltrušaitis, Ahuja, and Morency2018] proposes a simple concatenation of the different modalities followed by a linear combination. The tensor fusion approach (TF) [Zadeh et al.2017] computes a tensor including uni-modal, bi-modal, and tri-modal combination information. LMF [Liu et al.2018] is a tensor fusion method that performs tensor factorization using the same rank for all the modalities in order to reduce the number of parameters. Our proposed method aims to use different factors for each modality.
In Table 2, we present the mean absolute error (MAE), the correlation between predicted and true scores, accuracy, and F1 measure. The proposed approach outperforms the baseline approaches in nearly all metrics, with marked improvements in Happy and Neutral recognition. The reason is that the inter-modality information for these emotions is more complicated than for the other emotions and requires a non-diagonal core tensor to extract the complicated information.
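For reference, the four reported measures can be computed as in the following sketch. The correlation is assumed here to be the Pearson coefficient and the F1 measure is shown for the binary case; the function names and toy values are illustrative and are not results from the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def pearson(y_true, y_pred):
    """Pearson correlation between true and predicted scores."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def f1_binary(y_true, y_pred):
    """F1 measure for binary labels, with 1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy values for illustration only.
scores_true = [1.0, -2.0, 0.5, 3.0]
scores_pred = [0.5, -1.0, 0.0, 2.0]
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 0, 0, 1, 1]

results = {
    "MAE": mae(scores_true, scores_pred),
    "Corr": pearson(scores_true, scores_pred),
    "Acc": accuracy(labels_true, labels_pred),
    "F1": f1_binary(labels_true, labels_pred),
}
```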
Investigating the Effect of Factorization Rate on Each Modality
In this section, we aim to investigate the amount of redundant information in each modality. To do this, after obtaining a tensor which includes the combinations of all modalities with the equivalent size, we factorize a single dimension of the tensor while keeping the size for the other modalities fixed. By observing how the performance changes by factorization rate, one can find how much redundant information is contained in the corresponding modality relative to the other modalities.
The results can be seen in Figs. 3, 4 and 5. The horizontal axis is the ratio of the compressed size to the original size for a single modality, and the vertical axis shows the accuracy for each modality.
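The sweep over a single modality can be organised as in the sketch below. Everything here is hypothetical scaffolding: `evaluate` stands in for training a model with the given ranks and returning its validation accuracy, the other modalities are assumed to be kept at their original size, and the toy scoring function exists only to make the example runnable.

```python
def sweep_modality(dims, modality, evaluate,
                   ratios=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Vary the rank of one modality while the others stay fixed.

    dims:     feature size of each modality, e.g. {"audio": 32, ...}
    modality: key of the modality whose rank is swept
    evaluate: hypothetical callback mapping a rank dict to a score
    """
    results = []
    for ratio in ratios:
        ranks = dict(dims)                       # other modalities unchanged
        ranks[modality] = max(1, round(ratio * dims[modality]))
        results.append((ratio, ranks[modality], evaluate(ranks)))
    best = max(results, key=lambda item: item[2])
    return results, best

def toy_evaluate(ranks):
    """Toy stand-in for training and validation; not a real experiment."""
    return -abs(ranks["audio"] - 8)

dims = {"audio": 32, "text": 64, "video": 32}
results, best = sweep_modality(dims, "audio", toy_evaluate)
```

Plotting the third element of each tuple in `results` against the first gives a curve of the kind described above, with `best` marking its maximum.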
Fig. 3 shows the results for the IEMOCAP emotion recognition dataset in 4 columns for the four emotional categories: happy, angry, sad, and neutral. The first point that can be seen clearly in the different modality diagrams is that each modality has a different optimum compression rate (the maximum accuracy is highlighted in each of the diagrams), which means each contains a different amount of redundant information. In other words, a high accuracy at a small factorization rate means that there is a lot of redundant information in that modality. The information loss resulting from this factorization can be compensated by the other modalities, thus avoiding a reduction in performance. For example, looking at the sad category, we see higher optimal factor sizes than for the angry category, apart from the video modality, which is more informative for the angry category than for the sad category. This observation is supported by the results obtained in [De Silva, Miyasato, and Nakatsu1997].
Moving on to the neutral category, the optimal factorization rates are smaller (a higher compression rate) for the video and language modalities. The neutral category is known to be very difficult to predict from these modalities in comparison to the other categories, which means that these modalities are not that informative for the neutral category. The happy category suffers considerably from smaller factors (higher compression rates), which we interpret to mean that all the modalities include some useful information for this category.
Fig. 4 shows the results for the CMU-MOSI sentiment analysis dataset. For this dataset too, the first point that can be seen clearly is that each modality has a different optimum compression rate, which means there is a different amount of non-redundant information in each modality. In addition, we can see that the language modality cannot be compressed very much and includes little redundant information for the current task.
Fig. 5 shows the results for the POM personality trait recognition dataset. For this dataset also, each of the modalities has a different optimum compression rate (the maximum accuracy is highlighted in each of the diagrams), meaning they have differing levels of the non-redundant information. Moreover, we can see that the visual modality includes more non-redundant information for the personality recognition, which is supported by other recent publications [Kampman et al.2018].
In Figs. 3, 4 and 5, the accuracy curves under different compression rates show a trend of first increasing and then decreasing. There is a logical reason for this. If the factor is too small, the resulting model is too simple and tends to underfit. On the other hand, if the factor is too large and the model is not compressed, it has many parameters and is prone to overfitting. Therefore, the accuracy is lower at the beginning and the end of the factorization spectrum. This supports the supposition that our factorization method functions as a regularizer.
We proposed a tensor fusion method for multimodal media analysis that obtains an (M+1)-dimensional tensor to consider the high-order relationships between M input modalities and the output layer. Our modality-based factorization method removes the redundant information in this high-order dependency structure and leads to fewer parameters with minimal loss of information.
The Modality-based Redundancy Reduction multimodal Fusion (MRRF) works as a regularizer which leads to a less complicated model and avoids overfitting. In addition, a modality-based factorization approach helps to figure out the amount of non-redundant useful information in each individual modality through investigation of optimal modality-specific compression rates.
We have provided experimental results for combining acoustic, text, and visual modalities for three different tasks: sentiment analysis, personality trait recognition, and emotion recognition. We have seen that the modality-based tensor compression approach improves the results in comparison to the simple concatenation method, the tensor fusion method and tensor fusion using the same factorization rank for all modalities, as proposed in the LMF method. In other words, the proposed method enjoys the same benefits as the tensor fusion method and avoids suffering from having a large number of parameters, which leads to a more complex model, needs many training samples and is prone to overfitting. We have evaluated our method by comparing the results with state-of-the-art methods, achieving a 1% to 4% improvement across multiple measures for the different tasks.
Moreover, we have investigated the effect of the compression rate on single modalities while fixing the other modalities helping to understand the amount of useful non-redundant information in each modality.
In future, as the availability of data with more and more modalities increases, both finding a trade-off between cost and performance and making effective and efficient use of the available modalities will be vital.
To be specific, does adding more modalities result in new information?
If so, is the performance improvement worth the resulting computational and memory cost?
Accordingly, exploring compression and factorization methods could help to remove highly redundant modalities.
- [Baltrušaitis, Ahuja, and Morency2018] Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [Busso et al.2008] Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation 42(4):335.
- [Carroll and Chang1970] Carroll, J. D., and Chang, J.-J. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika 35(3):283–319.
- [Chen et al.2017] Chen, M.; Wang, S.; Liang, P. P.; Baltrušaitis, T.; Zadeh, A.; and Morency, L.-P. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, 163–171. ACM.
- [De Silva, Miyasato, and Nakatsu1997] De Silva, L. C.; Miyasato, T.; and Nakatsu, R. 1997. Facial emotion recognition using multi-modal information. In Information, Communications and Signal Processing, 1997. ICICS., Proceedings of 1997 International Conference on, volume 1, 397–401. IEEE.
- [Degottex et al.2014] Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; and Scherer, S. 2014. Covarep—a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 960–964. IEEE.
- [D’mello and Kory2015] D’mello, S. K., and Kory, J. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR) 47(3):43.
- [Fazel2002] Fazel, M. 2002. Matrix rank minimization with applications. Ph.D. Dissertation, Stanford University.
- [Glodek et al.2011] Glodek, M.; Tschechne, S.; Layher, G.; Schels, M.; Brosch, T.; Scherer, S.; Kächele, M.; Schmidt, M.; Neumann, H.; Palm, G.; et al. 2011. Multiple classifier systems for the classification of audio-visual emotional states. In Affective Computing and Intelligent Interaction. Springer. 359–368.
- [Harshman1970] Harshman, R. 1970. Foundations of the parafac procedure: Models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics 16:1–84.
- [Hitchcock1927] Hitchcock, F. L. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6(1-4):164–189.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- [James and Dasarathy2014] James, A. P., and Dasarathy, B. V. 2014. Medical image fusion: A survey of the state of the art. Information Fusion 19:4–19.
- [Kampman et al.2018] Kampman, O.; Barezi, E. J.; Bertero, D.; and Fung, P. 2018. Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, 606–611.
- [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Lan et al.2014] Lan, Z.-Z.; Bao, L.; Yu, S.-I.; Liu, W.; and Hauptmann, A. G. 2014. Multimedia classification and event detection using double fusion. Multimedia tools and applications 71(1):333–347.
- [Liu et al.2018] Liu, Z.; Shen, Y.; Bharadhwaj Lakshminarasimhan, V.; Liang, P. P.; Zadeh, A.; and Morency, L.-P. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.
- [Morvant, Habrard, and Ayache2014] Morvant, E.; Habrard, A.; and Ayache, S. 2014. Majority vote of diverse classifiers for late fusion. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 153–162. Springer.
- [Park et al.2014] Park, S.; Shim, H. S.; Chatterjee, M.; Sagae, K.; and Morency, L.-P. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, 50–57. ACM.
- [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
- [Potamianos et al.2003] Potamianos, G.; Neti, C.; Gravier, G.; Garg, A.; and Senior, A. W. 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9):1306–1326.
- [Shutova, Kiela, and Maillard2016] Shutova, E.; Kiela, D.; and Maillard, J. 2016. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 160–170.
- [Soleymani, Pantic, and Pun2012] Soleymani, M.; Pantic, M.; and Pun, T. 2012. Multimodal emotion recognition in response to videos. IEEE transactions on affective computing 3(2):211–223.
- [Tucker1966] Tucker, L. R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31(3):279–311.
- [Yuan and Liberman2008] Yuan, J., and Liberman, M. 2008. Speaker identification on the scotus corpus. Journal of the Acoustical Society of America 123(5):3878.
- [Zadeh et al.2016] Zadeh, A.; Zellers, R.; Pincus, E.; and Morency, L.-P. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
- [Zadeh et al.2017] Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; and Morency, L.-P. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
- [Zhu et al.2006] Zhu, Q.; Yeh, M.-C.; Cheng, K.-T.; and Avidan, S. 2006. Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, 1491–1498. IEEE.