
Modality To Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction and classification losses. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performances on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.




Multimodal machine learning, where multiple data sources are available, generally fares better than the setting where only a single data source is utilized. For example, sentiment analysis of text has been widely explored for a long time [13]. But recent research shows that information from text alone is not sufficient for mining human opinion [24], especially when sarcasm or ambiguity is involved. If, however, a machine is able to understand a speaker’s facial expression and how spoken language is uttered, it would be much easier to figure out the real sentiment. In this paper, we focus on analyzing human multimodal language from videos, where acoustic, language and visual modalities are included [11, 34]. Our work can be easily extended to other machine learning tasks in which multiple modalities are present, such as audio-visual speech recognition [27] and cross-modal retrieval [30].

How to fuse representations of various modalities has long been an open research problem. A key issue in multimodal fusion lies in the heterogeneous data distributions of different modalities [1], which make it difficult to mine the complementary information across modalities that is critical for a comprehensive interpretation of multimodal information. The majority of previous works do not attempt to learn a joint embedding space in which the multimodal distributions are matched. Instead, they apply a subnetwork to each modality and then conduct fusion immediately [31, 25, 14]. However, due to the heterogeneity across divergent modalities [12], the transformed representations learnt by these approaches can still follow unknown yet complex distributions.

Generative Adversarial Networks (GANs) [5] have achieved significant progress, and adversarial training can explicitly map one distribution to another prior distribution [18]. This property makes adversarial training well suited to modality distribution translation. Motivated by this, we propose an adversarial encoder framework to match the transformed distributions of all modalities and learn a modality-invariant embedding space. Specifically, the encoders of the source modalities, which transform the unimodal raw features into the embedding space, try to fool the discriminator into classifying their encoded representations as coming from the target modality, while the discriminator aims to classify the encoded representations from the target modality as true and all others as false. We also define one decoder for each modality that seeks to reconstruct the raw features, which prevents unimodal information from being lost. Moreover, a classifier is built to classify the encoded representations into the true category, which ensures that the embedding space is discriminative for the learning task.

Furthermore, many prior methods fail to conduct fusion in a hierarchical way and are unable to explicitly model the interaction between each subset of multiple modalities [25, 14, 33]. In contrast, we interpret multimodal fusion as a hierarchical interaction learning procedure in which bimodal interactions are first generated from unimodal dynamics, and trimodal dynamics are then generated from bimodal and unimodal dynamics. Drawing inspiration from the recent success of graph-style neural networks [34], we propose a hierarchical fusion network, i.e., a graph fusion network, which is highly interpretable in terms of the fusion procedure. Our graph fusion network consists of three layers containing unimodal, bimodal and trimodal dynamics respectively. The vertices in a lower layer deliver their information to the higher layers, where the information is fused to form multimodal information. In this way, we can explore cross-modal interactions hierarchically while still maintaining the original interactions.

In brief, the main contributions are listed below:

  • We propose Adversarial Representation Graph Fusion framework (ARGF) for multimodal fusion. We address the importance of matching distributions before fusion and introduce adversarial training to learn a discriminative joint embedding space for various modalities, which can significantly narrow modality gap by matching multimodal distributions. Moreover, we define decoders and classifier to retain unimodal information and enhance discrimination of the embedding space respectively.

  • We propose a hierarchical graph fusion network which can explicitly model unimodal, bimodal and trimodal dynamics successively and is highly interpretable due to explicit meanings of vertices. It is flexible in fusion structure due to the changeable weights of edges and the ability to assign different importance to various interactions.

  • ARGF achieves state-of-the-art performance on three multimodal learning datasets. We also provide visualizations for embedding space and graph fusion to prove their characteristics and contributions.

Related Work

Earlier works on multimodal fusion focus on early fusion and late fusion. Early fusion approaches extract features of various modalities and fuse them at the input level by simple concatenation [29, 26], but they cannot fully explore intra-modality dynamics, as stated in [31]. In contrast, late fusion first makes a decision for each modality and then fuses the decisions by weighted averaging [20, 8], but it fails to model cross-modal interactions because the features cannot interact with each other.

Recently, local fusion, which can model time-dependent cross-modal interactions, has become the mainstream strategy [6, 33]. For instance, [32] propose the Memory Fusion Network, which fuses memories of LSTMs over time, and [34] extend it with a Dynamic Fusion Graph (DFG) to fuse features. Our graph fusion network differs from DFG mainly in that: 1) we use the inner product as part of the edge weights to estimate the similarity between interactions; 2) in addition to fusing bimodal and unimodal dynamics, we also fuse each pair of bimodal dynamics to obtain more complete trimodal representations; 3) we first determine the importance of each vertex to obtain each layer’s output, rather than directly adding weighted information from all vertices. Tensor fusion has also become a new trend. Tensor Fusion Network [31] adopts an outer product of multimodal embeddings to conduct fusion, followed by LMF [14], which tries to improve efficiency and reduce redundant information. More recently, HFFN [16] proposes a ‘Divide, Conquer and Combine’ strategy to conduct local tensor and global fusion, which is extended in [15] using a more efficient feature segmentation method and an elaborately designed BM-LSTM. However, the methods above do not explicitly learn a joint embedding space before fusion. Consequently, the modality gap still seriously degrades the fusion.

For modality translation methods, Multimodal Transformer (MulT) [28] transforms the source modality into the target modality using directional pairwise cross-modal attention. Moreover, Multimodal Cyclic Translation Network (MCTN) [23] learns joint representations via an encoder-decoder framework by translating the encoder’s input (source modality) into the decoder’s output (target modality). MCTN uses the encoder’s output as the joint representation of source and target modalities without exploring information from the target modality, aiming to solve the modality loss problem. In contrast, we address the problem of distribution translation and learn a joint embedding space by translating the distributions of source modalities to that of the target modality via adversarial training. Graph fusion is then conducted using the encoded representations of both target and source modalities.

GANs [5] have been successfully adopted in many learning tasks [30, 12] and have attracted significant interest for matching different distributions. Specifically, the Adversarial Autoencoder aims to match the aggregated posterior of the hidden vector with a prior distribution. [21] seek to learn common representations for correlating heterogeneous data of various modalities, while we learn a modality-invariant embedding space for multimodal fusion by distribution translation. Besides, we add more constraints to the transformed representations and separate the embedding space learning and cross-modal interaction learning procedures.

Model Architecture

An overview of ARGF is given in Fig. 1. ARGF is comprised of two stages: a joint embedding space learning stage and a graph fusion stage. In the first stage, we learn an embedding space for all modalities via adversarial framework. In the second stage, we utilize the representations output by encoders to conduct graph fusion (see Fig. 2).

Figure 1: Schematic Diagram of ARGF. Input is a video of a speaker consisting of language, acoustic and visual modalities.

Joint Embedding Space Learning

In this section we provide an effective approach for translating distributions of source modalities to that of the target modality so as to learn a modality-invariant embedding space. Matching distributions in embedding space is crucial for modality fusion because the divergent statistical properties among modalities might seriously prevent cross-modal dynamics from being effectively explored. Drawing inspiration from GANs [5], we regularize the network by adversarial training. We also exert additional constraints like reconstruction and classification losses to optimize the learnt embedding space.

We start with notation: assume we have three modalities as input: acoustic $x^a \in \mathbb{R}^{d_a}$, language $x^l \in \mathbb{R}^{d_l}$ and visual $x^v \in \mathbb{R}^{d_v}$, with $d_a$ being the dimensionality of $x^a$ and so on. Assume that the language modality $l$ is the target modality and the other modalities are source modalities, and that $p(x^l)$ represents the prior data distribution of the language modality. Similar to [18], we define the transformed distributions of these three modalities as:


where is known as the encoder function of language modality with being the parameters, and is the transformed language distribution in learnt embedding space restricted by (given specific input and , the encoded representations are and respectively). Language encoder is a deep neural network that is denoted as for simplicity, where is the dimensionality of encoded representations, so are other encoders (all encoded representations share the same dimensionality ).

We hope that, by optimizing the parameters of the acoustic and visual encoders, we can explicitly map the transformed acoustic and visual distributions to the transformed language distribution. However, the distributions of different modalities are very complex and vary in nature, so they are extremely difficult to match with simple encoder networks. Therefore, we utilize adversarial training to add constraints to the transformed distributions. Specifically, a discriminator is defined which aims to classify the encoded language representations as true but the encoded acoustic and visual representations as false, while the generators (the acoustic and visual encoders) seek to fool the discriminator into classifying the encoded acoustic and visual representations as true. The generators and the discriminator compete against each other in a min-max game to learn the joint embedding space. The loss function here can be divided into two parts, a fake adversarial loss and a true adversarial loss, as shown below:


and try to fool which results in while aims to determine the distribution of target modality as true but others as false, leading to . In practice, we define and as:


where denotes the predicted distribution value of ranging from 0 to 1, and is the learnt weight for adversarial loss. If discriminator can not tell the target modality from all modalities (in the case where ), then the distributions of various modalities are successfully mapped into a modality-invariant embedding space. Adversarial training strategy puts restrictions on the statistical properties of encoded representations. By adversarial training, modality gap can be reduced effectively so that representations from various modalities can be directly fused.
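A minimal sketch of how the two adversarial losses could be computed, assuming a standard binary cross-entropy GAN objective on scalar discriminator outputs. The function names, the assignment of "true" to the discriminator objective and "fake" to the source encoders, and the plain averaging without a learnt weight are illustrative assumptions, not the paper's exact formulation.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def true_adversarial_loss(d_target, d_sources):
    """Discriminator objective (assumed): target codes -> 1, source codes -> 0."""
    losses = [bce(p, 1.0) for p in d_target] + [bce(p, 0.0) for p in d_sources]
    return sum(losses) / len(losses)


def fake_adversarial_loss(d_sources):
    """Source-encoder objective (assumed): make source codes be judged as true."""
    return sum(bce(p, 1.0) for p in d_sources) / len(d_sources)


# Hypothetical discriminator outputs for a small batch.
d_language = [0.9, 0.8]                    # target modality
d_acoustic_visual = [0.2, 0.3, 0.1, 0.4]   # source modalities

print(true_adversarial_loss(d_language, d_acoustic_visual))
print(fake_adversarial_loss(d_acoustic_visual))
```

In this sketch, a discriminator that outputs 0.5 for every input (that is, one that cannot tell the modalities apart) gives both losses the value log 2.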

Transforming distributions might lead to the loss of unimodal information needed for mining complementary information between modalities. Therefore, to retain modality-specific information in learnt embedding space, we define decoders as:


where is the language decoder with being the parameters, and is the distributions of reconstructed representations for language modality. Given specific input to the encoders, we hope that the regenerated outputs of decoders approximate to minimize information loss. To do so, we define a reconstruction loss as:


By minimizing , encoded representations can retain the unimodal information for further fusion.

Furthermore, to render the learnt embedding space more discriminative with respect to learning task, we also define a classifier which takes as input the concatenated encoded representations. The classifier is defined as:


where denotes vector concatenation and denotes the classifier with being the number of categories to be classified, and is the predicted label based on encoded representations of modality . To minimize predicted error, we define a classification loss as:


where is the true one-hot label. Classification loss enables the encoded representations to carry the needed label information and thus the embedding space is discriminative for learning tasks.

In conclusion, the total loss of this section is:


where is a hyper-parameter that determines the importance of loss functions whose value is determined by grid search. During gradient update, firstly, we apply and to update encoders and decoders; secondly, we use to update the discriminator to improve its discriminative power with respect to fake/true distributions; lastly, is utilized to update encoders and classifier to improve discrimination of joint embedding space.

Graph Fusion Network

We explore cross-modality interactions by fusing encoded representations of all modalities in this section. Assuming that multimodal fusion is in multi-stage [11] and considering the need to preserve all -modal interactions, we introduce graph fusion network (GFN), a hierarchical neural network, to model unimodal, bimodal and trimodal interactions successively. As shown in Fig. 2, the network consists of unimodal, bimodal and trimodal dynamic learning layers. GFN regards each interaction as a vertex and the similarities between interacted vertices as well as the interacted vertices’ importance as weights of edges, which is highly interpretable and flexible in terms of fusion structure.

Figure 2: Schematic Diagram of GFN.

The first layer is the unimodal dynamic learning layer consisting of unimodal vertices of three modalities whose information vectors are denoted as and , corresponding to the encoded representations and respectively. In first layer, we apply a Modality Attention Network to process each vertex and determine the importance of these unimodal interactions since not all modalities contribute equally. Then we obtain final unimodal dynamics by calculating the weighted average of information from all unimodal vertices, as shown below:


where is the final unimodal vector and is the weight of modality . In practice, consists of a -activated dense layer parameterized by .
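The unimodal layer can be sketched as follows, assuming the Modality Attention Network is a single dense unit with a sigmoid activation that is shared across modalities, and that the weighted average divides by the number of modalities. Both choices, along with all names and the toy values, are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modality_attention(vertex, weights, bias):
    """One dense unit followed by a sigmoid: a scalar importance in (0, 1)."""
    return sigmoid(sum(w * v for w, v in zip(weights, vertex)) + bias)


def unimodal_dynamics(vertices, weights, bias):
    """Importance-weighted average of the unimodal vertices."""
    alphas = {m: modality_attention(v, weights, bias) for m, v in vertices.items()}
    dim = len(next(iter(vertices.values())))
    fused = [
        sum(alphas[m] * vertices[m][i] for m in vertices) / len(vertices)
        for i in range(dim)
    ]
    return fused, alphas


# Toy encoded representations for acoustic (a), language (l) and visual (v).
vertices = {"a": [0.1, 0.4], "l": [0.9, -0.2], "v": [0.3, 0.3]}
fused, alphas = unimodal_dynamics(vertices, weights=[1.0, -1.0], bias=0.0)
print(alphas)
print(fused)
```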

In the second layer, i.e., bimodal dynamic learning layer, each two unimodal vertices are fused using a multi-layer neural fusion network: to obtain each bimodal vertex:


where denotes vector concatenation, is the information vector of vertex and is the parameter matrix of . In practice, is composed of two dense layers activated by and respectively. As for the weights of those edges linking the first and second layer, we firstly estimate the similarity of each two -uniformed unimodal information vectors of first layer using inner product. We argue that the more similar two information vectors are, the less important their bimodal interaction would be. This is premised on the assumption that provided two information vectors are close to each other, then little complementary information lies between them and their information has been well explored in the first layer. The similarity of two information vectors is defined as:


where represents -normalized vector of ( ensures that the computed similarity is between 0 to 1), means vector transpose operation and is a scalar that denotes the similarity of and vertices. The weight of edge that links vertex in first layer and vertex in second layer is defined as , which varies proportionally with but grows in inverse proportion to , so are the other weights of edges. Then, the weight for vertex is defined as:


where is a scalar that represents the weight of vertex (the value 0.5 on the denominator is applied to control the relative importance of similarity and vertex weights to ), and Eq.(14) is a softmax-uniformed operation of weight vector in second layer. The output of the second layer is the weighted average of information in bimodal vertices:


where is the combined bimodal dynamic.
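The bimodal weighting described above can be sketched as below. The sketch assumes the vectors are softmax-normalized before the inner product (which keeps the similarity between 0 and 1), and that the raw weight of a bimodal vertex is the sum of its two parent unimodal weights divided by the similarity plus 0.5, followed by a softmax over the bimodal vertices. These readings, and all names and toy values, are assumptions.

```python
import math
from itertools import combinations


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(u, v):
    """Inner product of softmax-normalized vectors; lies between 0 and 1."""
    return sum(a * b for a, b in zip(softmax(u), softmax(v)))


def bimodal_weights(vertices, alphas):
    """Raw weight (alpha_i + alpha_j) / (S_ij + 0.5), then softmax over pairs."""
    pairs = list(combinations(sorted(vertices), 2))
    raw = [
        (alphas[i] + alphas[j]) / (similarity(vertices[i], vertices[j]) + 0.5)
        for i, j in pairs
    ]
    return dict(zip(pairs, softmax(raw)))


# Toy example: acoustic and language vertices are identical, visual differs.
vertices = {"a": [2.0, 0.0, 0.0], "l": [2.0, 0.0, 0.0], "v": [0.0, 0.0, 2.0]}
alphas = {"a": 0.5, "l": 0.5, "v": 0.5}
weights = bimodal_weights(vertices, alphas)
print(weights)
```

With equal unimodal weights, the pair of identical vertices receives the smallest bimodal weight, matching the argument that similar vectors carry little complementary information.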

Each pair of bimodal vertices is further fused, using the same network that generates the bimodal dynamics, to obtain trimodal vertices in the third layer, i.e., the trimodal dynamic learning layer. In addition, as shown in Fig. 2, each bimodal vertex is also fused with the unimodal vertex that did not contribute to its formation in the previous fusion stage, resulting in three other trimodal vertices. Therefore, there are six trimodal vertices in total. We apply the same weight computing method for edges that link to the third layer and the same importance measure for trimodal vertices as applied in the second layer. Then we add up the weighted information from each trimodal vertex to obtain the final trimodal information.
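The count of six trimodal vertices can be checked by enumerating the two construction rules. The modality labels and variable names are illustrative.

```python
from itertools import combinations

modalities = ["a", "l", "v"]  # acoustic, language, visual
bimodal = [frozenset(pair) for pair in combinations(modalities, 2)]

# Rule 1: fuse every pair of bimodal vertices.
from_bimodal_pairs = list(combinations(bimodal, 2))

# Rule 2: fuse each bimodal vertex with the unimodal vertex it does not contain.
from_bimodal_unimodal = [
    (b, m) for b in bimodal for m in modalities if m not in b
]

trimodal = from_bimodal_pairs + from_bimodal_unimodal
print(len(from_bimodal_pairs), len(from_bimodal_unimodal), len(trimodal))
```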

The final output of the hierarchical graph fusion network is the concatenation of trimodal, bimodal and unimodal dynamics, defined as: . Finally, a decision neural network is utilized to infer the final decision:


where ( is the number of classes). is the decision inference module containing two dense layers activated by and respectively.



CMU-MOSI [35] is a widely-used dataset that includes a collection of 93 opinion videos from different speakers which have been divided into 2199 segments in total. We report the binary (positive and negative) sentiment accuracy and F1 score on this dataset. We use 1141 segments for training, 306 segments for validation and 752 segments for testing. CMU-MOSEI [34], consisting of 2928 videos (20802 segments in total), is the largest multimodal language analysis dataset so far. We report positive, negative and neutral sentiment classification accuracy and F1 score on this dataset. We use 13168 segments for training, 3020 segments for validation and 4614 segments for testing. IEMOCAP [2] is an emotion recognition dataset that contains 151 videos and the videos have been segmented into 7433 segments. The dataset contains 9 emotional labels. We analyze the anger, happiness, sadness and neutral emotions so as to compare with previous approaches. The training set consists of 4674 segments, while the validation and test set has 1136 and 1623 segments, respectively.

Figure 3: Schematic Diagram of Encoder, Decoder, Discriminator and Classifier.

Experimental details

We develop our model in PyTorch. The implementation details of the encoder, decoder, classifier and discriminator are shown in Fig. 3. Specifically, the number inside the dense layer is the output dimensionality, is the dimensionality of encoded representations, is the number of classes and denotes the dimensionality of input feature vector ( is 50 for CMU-MOSI and CMU-MOSEI datasets and 100 for IEMOCAP dataset). We apply Mean Square Error as the loss function for the graph fusion network, with Adam [10] as the optimizer. The framework is trained end to end. The hyper-parameters such as batch size, learning rate and are chosen by grid search to optimize the performance.

In the feature pre-extraction stage, for CMU-MOSI and IEMOCAP, we follow the setting of [25]. A text-CNN consisting of word2vec embedding [19] followed by CNNs [9], openSMILE [4] and 3D-CNN [7] are applied for language, acoustic and visual feature extraction respectively. For CMU-MOSEI, the features are extracted using CMU-MultimodalSDK. GloVe word embeddings [22], Facet (iMotions 2017) and COVAREP [3] are applied for extracting language, visual and acoustic features respectively (please refer to [16] for more details).

After pre-extraction, similar to HFFN [16], we develop a Unimodal Feature Extraction Network (UFEN): , which is composed of a bidirectional LSTM layer followed by a dense layer, for each separate modality. Here, denotes the number of segments that constitute a video and is the dimensionality of raw feature vector for modality. Through UFEN, feature vectors of all modalities are mapped into the same dimensionality (). UFEN for each modality is individually trained followed by a dense layer: using Adadelta [36] as optimizer and categorical cross-entropy as loss function. The processed feature vectors output by UFEN are then sent into ARGF for modality fusion.

Results and Discussions

Methods Avg Acc Avg F1
LMFN [15] 80.9 80.9
TFN [31] 78.3 78.3
LMF [14] 75.8 75.9
CHFusion [17] 80.0 -
BC-LSTM [25] 77.9 78.1
HFFN [16] 80.2 80.3
ARGF () 81.38 81.52
ARGF () 81.25 81.32
ARGF () 81.12 81.25
Table 1: Performance on CMU-MOSI dataset.
Models Happy Sad Neutral Angry Excited Frustrated Avg Acc Avg F1
LMFN [15] 31.25 64.90 58.33 62.94 47.83 69.55 58.10 57.88
CHFusion [17] 36.11 62.04 56.25 70.00 55.52 63.52 58.35 58.39
BC-LSTM [25] 35.42 58.37 53.91 64.71 54.18 60.63 55.70 55.78
TFN [31] 29.86 55.51 48.81 60.59 57.86 63.25 54.28 54.19
LMF [14] 26.39 49.39 56.77 61.18 47.16 63.25 53.17 53.02
HFFN [16] 31.25 63.67 54.69 61.18 48.83 69.82 57.12 56.82
ARGF () 15.28 68.98 59.11 68.24 72.58 61.42 60.69 59.53
ARGF () 22.92 62.86 57.55 65.29 67.22 67.45 60.20 59.75
ARGF () 26.39 68.98 54.95 62.35 64.21 68.50 60.20 59.81
Table 2: Performance on IEMOCAP dataset. The evaluation index for each emotion is the recognition accuracy.
Models Positive Acc Positive F1 Negative Acc Negative F1 Neutral Acc Neutral F1 Avg Acc Avg F1
LMFN [15] 61.88 59.98 26.21 31.42 75.54 71.49 60.77 59.42
TFN [31] 60.46 58.01 18.30 25.08 77.70 70.91 59.69 57.13
LMF [14] 68.91 59.85 11.77 18.31 75.06 70.15 59.41 55.80
CHFusion [17] 59.19 56.61 22.55 27.96 73.69 69.56 58.28 56.69
BC-LSTM [25] 64.20 60.97 24.53 29.74 74.53 71.19 60.58 59.14
HFFN [16] 59.49 59.05 26.61 31.35 75.85 71.35 60.32 59.03
ARGF () 64.57 60.55 21.66 28.01 76.20 71.77 60.88 58.92
ARGF () 61.29 59.61 28.78 33.31 74.30 71.16 60.55 59.52
ARGF () 62.33 60.04 20.87 26.99 77.88 71.99 60.88 58.66
Table 3: Performance on CMU-MOSEI dataset.

Comparison with Baselines: As presented in Table 1, ARGF shows improvement over typical approaches and outperforms the SOTA method LMFN by about 0.5% on the CMU-MOSI dataset. Moreover, compared to the tensor fusion methods TFN and LMF, ARGF achieves improvements of about 3% and 6% respectively. We argue that this is partly because these approaches do not explore a modality-invariant embedding space, while we adopt adversarial training to learn a joint embedding space before fusion. We also report ARGF’s performance on the more challenging datasets IEMOCAP and CMU-MOSEI to evaluate its robustness. For IEMOCAP, Table 2 shows that ARGF achieves the best performance and outperforms SOTA methods by about 2% in the most important index, Avg Acc. For CMU-MOSEI, as shown in Table 3, the average accuracy and F1 score of ARGF are still the highest among all methods. In addition, the performance of all models on the ‘Negative’ class is weaker than on the other classes by a large margin. We speculate that one of the reasons is that the number of samples belonging to the ‘Negative’ class is far smaller than that of the other classes. Consequently, the models tend to predict the other classes, which is less ‘risky’.

Methods Parameters FLOPs
BC-LSTM [25] 1,383,902 1,322,044
TFN [31] 4,245,986 8,491,844
ARGF (ours) 1,270,770 2,017,645
Table 4: The Comparison of Model Complexity.

Complexity Analysis: We use the amount of trainable parameters as a proxy for the space complexity. As shown in Table 4, our model has 1,270,770 trainable parameters ( is set to 50), which is approximately 29.93% and 91.83% of the number of parameters of TFN and BC-LSTM, respectively. To explore the time complexity of ARGF and make a comparison against baselines, we compute FLOPs for each method during testing. We empirically find that ARGF needs 2,017,645 FLOPs in testing, whereas the number of FLOPs in TFN and BC-LSTM is 8,491,844 and 1,322,044, respectively. The reason for fewer trainable parameters and moderate number of FLOPs is that despite having multiple components in the architecture, our model entails low dimensions for unimodal and multimodal representations. In addition, every component in our architecture has only several layers, which guarantees a reasonable computational load.
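The percentages quoted above follow directly from Table 4; the short check below reproduces them (the helper name is illustrative).

```python
params = {"BC-LSTM": 1_383_902, "TFN": 4_245_986, "ARGF": 1_270_770}
flops = {"BC-LSTM": 1_322_044, "TFN": 8_491_844, "ARGF": 2_017_645}


def percent_of(value, reference):
    """Return value as a percentage of reference, rounded to two decimals."""
    return round(100.0 * value / reference, 2)


print(percent_of(params["ARGF"], params["TFN"]))      # 29.93
print(percent_of(params["ARGF"], params["BC-LSTM"]))  # 91.83
print(percent_of(flops["ARGF"], flops["TFN"]))
print(percent_of(flops["ARGF"], flops["BC-LSTM"]))
```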

Choice of Target Modality: As presented in Tables 1, 2 and 3, the language modality is the best choice of target for CMU-MOSI, while acoustic and visual arrive in second and third place respectively by a slight margin. In fact, the three target modalities perform very closely across the three datasets, so there is no hard and fast rule for choosing an optimal target modality. This is reasonable because the target modality only serves as a distribution provider, so that the source modalities can map their transformed distributions into that of the target modality. During the distribution mapping procedure, the information of all modalities is well preserved by the decoders and the classifier. Therefore, in theory it makes little difference which modality is chosen as the target.

Acc F1 score
W/O AT 80.11 80.23
W/O C 80.92 80.98
W/O D 79.72 79.76
ARGF 81.38 81.52
Table 5: Comparison of Different Architectures on CMU-MOSI dataset. W/O AT means removing adversarial training, C denotes classifier and D represents Decoders.

Ablation Studies: From Table 5 we can see that the removal of any component in ARGF leads to a decline in performance. Specifically, the removal of decoders leads to the most significant decline in accuracy, more than 1.5%, closely followed by the removal of adversarial training (about 1.3%). In addition, experiments without the classifier achieve relatively acceptable results, with a slight decrease of around 0.5%, but the classifier is critical for learning a discriminative embedding space, as demonstrated in the visualization part. The ablation studies show that these three components are crucial factors contributing to the competitive results of ARGF.
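The accuracy drops discussed above can be recomputed from Table 5 as follows (variable names are illustrative).

```python
full_acc = 81.38  # ARGF with all components (Table 5)
ablated_acc = {"W/O AT": 80.11, "W/O C": 80.92, "W/O D": 79.72}

# Accuracy drop caused by removing each component, in percentage points.
drops = {name: round(full_acc - acc, 2) for name, acc in ablated_acc.items()}

# List the components from most to least harmful to remove.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(name, drop)
```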

Figure 4: Schematic Diagram of ‘Concatenation + FC’, ‘Multiplication + FC’ and ‘Weighted Average’.
Methods CMU-MOSI Acc IEMOCAP Acc
Concatenation + FC 80.19 57.98
Multiplication + FC 80.05 58.60
Weighted Average 78.86 59.64
Tensor Fusion 79.65 59.46
Low-rank Modality Fusion (LMF) 78.72 59.21
Dynamic Fusion Graph (DFG) 79.92 57.79
Graph Fusion Network (GFN) 81.38 60.20
Table 6: Performance of various fusion structures. The diagrams of some fusion methods for comparison are given in Fig. 4. Please refer to the corresponding papers [31, 14, 34] for the diagrams of Tensor Fusion, LMF and DFG respectively.
Figure 5: Visualization for Graph Fusion in CMU-MOSI sentiment analysis task. Each row represents one test sample and each column denotes weight for one vertex.

Comparison between fusion strategies: To demonstrate that our fusion method is indeed effective, we conduct a contrast experiment comparing other fusion strategies with our GFN. Table 6 shows that GFN brings a clear improvement in performance over the other fusion methods. Specifically, DFG [34] achieves comparable results on the two datasets, but our GFN outperforms it by over 2% on IEMOCAP and over 1% on CMU-MOSI. We argue that this is because we explicitly model unimodal, bimodal and trimodal dynamics and obtain trimodal representations in a more comprehensive way (as also revealed in the following visualization part). From the above analysis, we conclude that the interpretable, comprehensive and hierarchical fusion strategy is a crucial factor that leads to the marked improvement of ARGF. In addition, all the fusion strategies fitted into our framework obtain relatively good results compared with their original frameworks. For instance, our version of Tensor Fusion [31] achieves an accuracy of 59.46% on IEMOCAP, which is much higher than TFN’s accuracy under the same experimental setting (54.28%, see Table 2). These comparisons suggest that learning a joint embedding space before feature fusion is indeed important and effective.

Visualization for Weights of Vertices in Graph Fusion: This visualization is conducted to illustrate the interpretability of graph fusion. As shown in Fig. 5, for unimodal interactions (the first three columns), the language modality is the most predictive for the majority of samples, which is reasonable since language is the most important clue for sentiment analysis. But its importance varies greatly across samples, which indicates the advantage of analyzing sentiment in a multimodal form, since other modalities can play a dominant role whenever the language modality is uninformative for a specific sample. For bimodal interactions (from to columns), weights for vertex () and vertex () are very close, followed closely by , possibly because the language modality plays a huge role in bimodal fusion. For trimodal interactions, to our surprise, vertices obtained by fusing a bimodal vertex and a unimodal vertex hardly make any difference, while vertices obtained by fusing two bimodal vertices dominate the trimodal information. This supports the necessity of exploring interactions between each pair of bimodal vertices, which are not modeled by DFG [34].

Figure 6: Visualization for distributions of different sentiments in learnt embedding space. The cyan and purple dots represent positive and negative sentiment respectively.

Visualization of Embedding Space: We visualize the distributions of sentiments in the embedding space. The left sub-figure of Fig. 6 illustrates the embedding space learnt by ARGF, while the right sub-figure presents the embedding space learnt without the classifier. The visualization is obtained from the concatenated features of all modalities output by the encoders: we use the t-SNE algorithm to transform the high-dimensional concatenated feature vectors into 2-dimensional points. Fig. 6 shows that, in the embedding space learnt by ARGF, the dots tend to form two separate clusters that mainly correspond to positive and negative sentiment respectively. The distance between the two clusters is large while the dots within the same cluster lie close together, which demonstrates the discriminability of our embedding space. Nevertheless, some dots that are extremely difficult to classify correctly fall into the wrong cluster, which motivates advanced fusion strategies that explore cross-modal dynamics. In contrast, the embedding space learnt without the classifier clearly cannot separate positive and negative sentiments effectively, which highlights the necessity of the classifier in learning a discriminative embedding space.
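The projection step described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `TSNE` and NumPy are available; the function name `project_embeddings`, the feature dimensions, the perplexity value, and the synthetic data standing in for encoder outputs are all assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_embeddings(per_modality_features, perplexity=10.0, seed=0):
    """Concatenate per-modality encoder outputs and project them to 2-D."""
    # Shape: (num_samples, sum of the per-modality feature dimensions).
    joint = np.concatenate(per_modality_features, axis=1)
    tsne = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    )
    return tsne.fit_transform(joint)


# Synthetic stand-in for encoder outputs: 60 samples, 3 modalities, 8 dims each.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)  # 0 = negative, 1 = positive (illustrative)
features = [
    rng.normal(loc=labels[:, None] * 4.0, scale=1.0, size=(60, 8))
    for _ in range(3)
]
points = project_embeddings(features)
print(points.shape)
```

The resulting 2-D points can then be drawn as a scatter plot colored by sentiment label to obtain a figure of the kind shown in Fig. 6. Note that t-SNE layouts depend on the perplexity and the random seed, so distances between clusters should be read qualitatively.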


We propose a novel fusion framework that learns a discriminative joint embedding space and then performs graph fusion. By introducing adversarial training to match distributions, the modality gap can be significantly narrowed and the representations can be fused directly. With the aid of GFN, we can explore unimodal, bimodal and trimodal dynamics successively and dynamically change the fusion structure.


This work was supported in part by the National Natural Science Foundation of China (61673402, 61273270, 60802069), and the Natural Science Foundation of Guangdong Province (2017A030311029).


  • [1] T. Baltrušaitis, C. Ahuja, and L. Morency (2019-02) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443. ISSN 0162-8828.
  • [2] C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335–359.
  • [3] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer (2014) COVAREP: a collaborative voice analysis repository for speech technologies. In ICASSP, pp. 960–964.
  • [4] F. Eyben (2010) openSMILE: the Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia, pp. 1459–1462.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS.
  • [6] Y. Gu, K. Yang, S. Fu, S. Chen, X. Li, and I. Marsic (2018) Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In ACL, pp. 2225–2235.
  • [7] S. Ji, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231.
  • [8] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung (2018) Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. In ACL short paper.
  • [9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li (2014) Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, pp. 1725–1732.
  • [10] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR).
  • [11] P. P. Liang, Z. Liu, A. Zadeh, and L. P. Morency (2018) Multimodal language analysis with recurrent multistage fusion. In EMNLP, pp. 150–161.
  • [12] Z. Lin, G. Ding, M. Hu, and J. Wang (2015) Semantics-preserving hashing for cross-view retrieval. In CVPR, pp. 3864–3872.
  • [13] B. Liu and L. Zhang (2012) A survey of opinion mining and sentiment analysis. In Mining Text Data, C. C. Aggarwal and C. Zhai (Eds.), pp. 415–463. ISBN 978-1-4614-3223-4.
  • [14] Z. Liu, Y. Shen, P. P. Liang, A. Zadeh, and L. P. Morency (2018) Efficient low-rank multimodal fusion with modality-specific factors. In ACL, pp. 2247–2256.
  • [15] S. Mai, S. Xing, and H. Hu (2019) Locally confined modality fusion network with a global perspective for multimodal human affective computing. IEEE Transactions on Multimedia, pp. 1–1. ISSN 1520-9210.
  • [16] S. Mai, H. Hu, and S. Xing (2019-07) Divide, conquer and combine: hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In ACL, pp. 481–492.
  • [17] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria (2018-06) Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems 161, pp. 124–133.
  • [18] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow (2016) Adversarial autoencoders. In ICLR Workshop.
  • [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR Workshop.
  • [20] B. Nojavanasghari, D. Gopinath, J. Koushik, and L. P. Morency (2016) Deep multimodal fusion for persuasiveness prediction. In Proceedings of ACM International Conference on Multimodal Interaction, pp. 284–288.
  • [21] Y. Peng, J. Qi, and Y. Yuan (2017-10) CM-GANs: cross-modal generative adversarial networks for common representation learning. IEEE Transactions on Multimedia.
  • [22] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [23] H. Pham, P. P. Liang, T. Manzini, L. P. Morency, and B. Póczos (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In AAAI, pp. 6892–6899.
  • [24] S. Poria, E. Cambria, R. Bajpai, and A. Hussain (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion 37, pp. 98–125.
  • [25] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L. P. Morency (2017) Context-dependent sentiment analysis in user-generated videos. In ACL, pp. 873–883.
  • [26] S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain (2016) Convolutional MKL based multimodal emotion recognition and sentiment analysis. In ICDM, pp. 439–448.
  • [27] R. Su, W. Lan, and X. Liu (2018) Multimodal learning using 3D audio-visual data for audio-visual speech recognition. In International Conference on Asian Language Processing.
  • [28] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov (2019-07) Multimodal transformer for unaligned multimodal language sequences. In ACL, pp. 6558–6569.
  • [29] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L. P. Morency (2013) YouTube movie reviews: sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28 (3), pp. 46–53.
  • [30] X. Xing, H. Li, H. Lu, L. Gao, and Y. Ji (2018) Deep adversarial metric learning for cross-modal retrieval. In World Wide Web, pp. 1–16.
  • [31] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. P. Morency (2017) Tensor fusion network for multimodal sentiment analysis. In EMNLP, pp. 1114–1125.
  • [32] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. P. Morency (2018) Memory fusion network for multi-view sequential learning. In AAAI, pp. 5634–5641.
  • [33] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L. P. Morency (2018) Multi-attention recurrent network for human communication comprehension. In AAAI, pp. 5642–5649.
  • [34] A. Zadeh, P. P. Liang, J. Vanbriesen, S. Poria, E. Tong, E. Cambria, M. Chen, and L. P. Morency (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL, pp. 2236–2246.
  • [35] A. Zadeh, R. Zellers, E. Pincus, and L. P. Morency (2016-11) Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intelligent Systems 31 (6), pp. 82–88.
  • [36] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. Preprint, arXiv:1212.5701v1.