Introduction
Multimodal machine learning, where multiple data sources are available, generally fares better than learning from a single data source [1]. For example, sentiment analysis of text has long been widely explored [13]. But recent research shows that information from text alone is not sufficient for mining human opinion [24], especially when sarcasm or ambiguity is involved. If a machine is also able to understand a speaker’s facial expression and how the spoken language is uttered, it becomes much easier to figure out the real sentiment. In this paper, we focus on analyzing human multimodal language from videos, where acoustic, language and visual modalities are included [11, 34]. Our work can be extended to other machine learning tasks as long as multiple modalities are present, such as audiovisual speech recognition [27] and crossmodal retrieval [30].
How to fuse representations of various modalities remains an open research problem. A key issue in multimodal fusion lies in the heterogeneous data distributions of different modalities [1], which make it difficult to mine the complementary information across modalities that is critical for a comprehensive interpretation of multimodal information. The majority of previous works do not devote effort to learning a joint embedding space in which the multimodal distributions are matched. Instead, they apply a subnetwork to each modality and then conduct fusion immediately [31, 25, 14]. However, due to the heterogeneity across divergent modalities [12], the transformed representations learnt by these approaches can still follow unknown yet complex distributions.
Generative Adversarial Networks (GANs) [5] have achieved significant progress and can explicitly map one distribution to another prior distribution using adversarial training [18]. This property makes adversarial training suitable for modality distribution translation. Motivated by this, we propose an adversarial encoder framework to match the transformed distributions of all modalities and learn a modality-invariant embedding space. Specifically, the source modalities’ encoders, which transform the unimodal raw features into the embedding space, try to fool the discriminator into classifying their encoded representations as those from the target modality, while the discriminator aims to classify the encoded representations from the target modality as true but others as false. We also define one decoder for each modality that seeks to reconstruct the raw features to prevent unimodal information from being lost. Moreover, a classifier is built to classify the encoded representations into the true category, which ensures that the embedding space is discriminative for the learning task.
Furthermore, many prior methods fail to conduct fusion in a hierarchical way and are unable to explicitly model the interaction between each subset of multiple modalities [25, 14, 33]. In contrast, we interpret multimodal fusion as a hierarchical interaction learning procedure where bimodal interactions are first generated based on unimodal dynamics, and trimodal dynamics are then generated based on bimodal and unimodal dynamics. Drawing inspiration from the recent success of graph-style neural networks [34], we propose a hierarchical fusion network, i.e., the graph fusion network, which is highly interpretable in terms of the fusion procedure. Our graph fusion network consists of three layers containing unimodal, bimodal and trimodal dynamics respectively. The vertices in a lower layer deliver their information to the higher layers, where the information is fused to form multimodal information. In this way, we can explore crossmodal interactions hierarchically while still maintaining the original interactions.
In brief, the main contributions are listed below:

We propose the Adversarial Representation Graph Fusion framework (ARGF) for multimodal fusion. We highlight the importance of matching distributions before fusion and introduce adversarial training to learn a discriminative joint embedding space for various modalities, which can significantly narrow the modality gap by matching multimodal distributions. Moreover, we define decoders and a classifier to retain unimodal information and enhance the discrimination of the embedding space respectively.

We propose a hierarchical graph fusion network which can explicitly model unimodal, bimodal and trimodal dynamics successively and is highly interpretable due to explicit meanings of vertices. It is flexible in fusion structure due to the changeable weights of edges and the ability to assign different importance to various interactions.

ARGF achieves state-of-the-art performance on three multimodal learning datasets. We also provide visualizations of the embedding space and graph fusion to illustrate their characteristics and contributions.
Related Work
Earlier works on multimodal fusion focus on early fusion and late fusion. Early fusion approaches extract features of various modalities and fuse them at the input level by simple concatenation [29, 26], but they cannot fully explore intra-modality dynamics, as stated in [31]. In contrast, late fusion first makes a decision according to each modality and then fuses the decisions by weighted averaging [20, 8], but it fails to model crossmodal interactions because the features cannot interact with each other.
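To make the distinction concrete, the following is a minimal sketch of the two strategies on toy data. The feature dimensions, the per-modality predictions and the averaging weights are illustrative assumptions, not values from any of the cited systems.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors for one utterance.
acoustic = rng.normal(size=4)
language = rng.normal(size=6)
visual = rng.normal(size=3)

# Early fusion: concatenate features at the input level, then feed one model.
early_input = np.concatenate([acoustic, language, visual])

# Late fusion: each modality produces its own class-probability vector,
# and the decisions are combined by weighted averaging.
decisions = np.array([
    [0.7, 0.3],  # acoustic-only prediction (assumed)
    [0.2, 0.8],  # language-only prediction (assumed)
    [0.6, 0.4],  # visual-only prediction (assumed)
])
weights = np.array([0.25, 0.5, 0.25])  # assumed modality weights, summing to 1
late_decision = weights @ decisions
```

In the late-fusion branch the feature vectors never meet, which is why crossmodal interactions cannot be modeled there.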
Recently, local fusion has become the mainstream strategy, which can model time-dependent crossmodal interactions [6, 33]. For instance, [32] propose Memory Fusion Network that fuses memories of LSTMs over time, which is extended by [34] using a Dynamic Fusion Graph (DFG) to fuse features. Our graph fusion network differs from DFG mainly in that: 1) we use the inner product as part of the edges’ weights to estimate similarity between interactions; 2) in addition to fusing bimodal and unimodal dynamics, we also fuse every pair of bimodal dynamics to obtain more complete trimodal representations; 3) we first determine the importance of each vertex to obtain each layer’s output, rather than directly adding weighted information from all vertices. Tensor fusion has also become a new trend. Tensor Fusion Network [31] adopts an outer product of multimodal embeddings to conduct fusion, followed by LMF [14], which tries to improve efficiency and reduce redundant information. More recently, HFFN [16] proposes a ‘Divide, Conquer and Combine’ strategy to conduct local tensor fusion and global fusion, which is extended in [15] using a more efficient feature segmentation method and an elaborately designed BM-LSTM. But the methods above do not explicitly explore a joint embedding space before fusion. Consequently, the modality gap still seriously affects the effect of fusion.
For modality translation methods, Multimodal Transformer (MulT) [28] transforms the source modality to the target modality using directional pairwise crossmodal attention. Moreover, Multimodal Cyclic Translation Network (MCTN) [23] learns joint representations via an encoder-decoder framework by translating the encoder’s input (source modality) into the decoder’s output (target modality). MCTN uses the encoder’s output as the joint representation of source and target modalities without exploring information from the target modality, aiming to solve the modality loss problem. In contrast, we address the problem of distribution translation and learn a joint embedding space by translating the distributions of source modalities to that of the target modality via adversarial training. Graph fusion is then conducted using the encoded representations of both target and source modalities.
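As an illustration of the outer-product fusion mentioned above for Tensor Fusion Network, here is a minimal sketch. Appending a constant 1 to each embedding, so that unimodal and bimodal terms survive in the product, follows the usual description of that method. The function name `tensor_fusion` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def tensor_fusion(a, l, v):
    """Three-way outer product of embeddings, each extended with a constant 1."""
    a1 = np.append(a, 1.0)
    l1 = np.append(l, 1.0)
    v1 = np.append(v, 1.0)
    return np.einsum("i,j,k->ijk", a1, l1, v1)

a = np.array([0.5, -1.0])      # toy acoustic embedding
l = np.array([2.0, 0.0, 1.0])  # toy language embedding
v = np.array([3.0])            # toy visual embedding
fused = tensor_fusion(a, l, v)
```

The fused tensor has (2+1)(3+1)(1+1) = 24 entries, so its size grows multiplicatively with the embedding dimensions. This growth motivates the efficiency-oriented follow-ups cited in the text.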
GANs [5] have been successfully adopted in many learning tasks [30, 12] and have attracted significant interest for matching different distributions. Specifically, Adversarial Autoencoder [18] aims to match the aggregated posterior of the hidden vector with a prior distribution. [21] seek to learn common representations for correlating heterogeneous data of various modalities, while we learn a modality-invariant embedding space for multimodal fusion by distribution translation. Besides, we add more constraints to the transformed representations and separate the embedding space learning and crossmodal interaction learning procedures.
Model Architecture
An overview of ARGF is given in Fig. 1. ARGF is comprised of two stages: a joint embedding space learning stage and a graph fusion stage. In the first stage, we learn an embedding space for all modalities via adversarial framework. In the second stage, we utilize the representations output by encoders to conduct graph fusion (see Fig. 2).
Joint Embedding Space Learning
In this section we provide an effective approach for translating the distributions of source modalities to that of the target modality so as to learn a modality-invariant embedding space. Matching distributions in the embedding space is crucial for modality fusion because the divergent statistical properties among modalities might seriously prevent crossmodal dynamics from being effectively explored. Drawing inspiration from GANs [5], we regularize the network by adversarial training. We also impose additional constraints, namely reconstruction and classification losses, to optimize the learnt embedding space.
We start with notation: assume we have three modalities as input: acoustic , language and visual with being the dimensionality of and so on. Assume that is the target modality and other modalities are known as source modalities, and represents prior data distribution of language modality. Similar to [18], we define the transformed distributions of these three modalities as:
(1) 
where is known as the encoder function of language modality with being the parameters, and is the transformed language distribution in learnt embedding space restricted by (given specific input and , the encoded representations are and respectively). Language encoder is a deep neural network that is denoted as for simplicity, where is the dimensionality of encoded representations, so are other encoders (all encoded representations share the same dimensionality ).
We hope that through optimizing and , we can explicitly map transformed distributions and to . However, the distributions of different modalities are very complex and they vary in nature which are extremely difficult to be matched by simple encoder networks. Therefore, we utilize adversarial training to add constraints to transformed distributions. Specifically, a discriminator is defined which aims to classify as true but and as false, while the generators (which are encoders and ) seek to fool discriminator into classifying and
as true. The generators and discriminator beat each other as a minmax game to learn joint embedding space. The loss function here can be divided into two parts: fake adversarial loss
and true adversarial loss , as shown below:(2) 
and try to fool which results in while aims to determine the distribution of target modality as true but others as false, leading to . In practice, we define and as:
(3) 
(4) 
where denotes the predicted distribution value of ranging from 0 to 1, and is the learnt weight for adversarial loss. If discriminator can not tell the target modality from all modalities (in the case where ), then the distributions of various modalities are successfully mapped into a modalityinvariant embedding space. Adversarial training strategy puts restrictions on the statistical properties of encoded representations. By adversarial training, modality gap can be reduced effectively so that representations from various modalities can be directly fused.
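The adversarial objective described above can be sketched as follows. This is only an illustration under stated assumptions: the losses are taken to be standard binary cross-entropy terms, the learnt weight on the adversarial loss is omitted, and the discriminator outputs are made-up numbers. The names `bce`, `disc_loss` and `gen_loss` are hypothetical.

```python
import numpy as np

def bce(pred, target):
    """Mean binary cross-entropy between predictions in (0, 1) and 0/1 targets."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

# Hypothetical discriminator outputs for a batch of encoded representations.
d_target = np.array([0.9, 0.8])  # codes from the target modality
d_source = np.array([0.3, 0.4])  # codes from a source modality

# Discriminator side: target codes are "true" (1), source codes are "false" (0).
disc_loss = bce(d_target, np.ones(2)) + bce(d_source, np.zeros(2))

# Encoder (generator) side: make source codes look "true" to the discriminator.
gen_loss = bce(d_source, np.ones(2))

# Assumed equilibrium: a discriminator that outputs 0.5 everywhere can no
# longer tell the target modality from the source modalities.
confused = bce(np.full(2, 0.5), np.ones(2)) + bce(np.full(2, 0.5), np.zeros(2))
```

Under these assumptions the confused discriminator incurs a loss of 2 ln 2, which is the usual reference point for a GAN whose two distributions have been matched.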
Transforming distributions might lead to the loss of unimodal information needed for mining complementary information between modalities. Therefore, to retain modality-specific information in the learnt embedding space, we define decoders as:
(5) 
where is the language decoder with being the parameters, and is the distributions of reconstructed representations for language modality. Given specific input to the encoders, we hope that the regenerated outputs of decoders approximate to minimize information loss. To do so, we define a reconstruction loss as:
(6) 
By minimizing , encoded representations can retain the unimodal information for further fusion.
Furthermore, to render the learnt embedding space more discriminative with respect to learning task, we also define a classifier which takes as input the concatenated encoded representations. The classifier is defined as:
(7) 
where denotes vector concatenation and denotes the classifier with being the number of categories to be classified, and is the predicted label based on encoded representations of modality . To minimize predicted error, we define a classification loss as:
(8) 
where is the true one-hot label. Classification loss enables the encoded representations to carry the needed label information and thus the embedding space is discriminative for learning tasks.
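The two auxiliary losses can be illustrated with a small sketch. The exact forms of Eq. (6) and Eq. (8) are not reproduced here, so mean squared error for reconstruction and categorical cross-entropy for classification are assumptions, and the function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(x, x_rec):
    """Mean squared error between raw features and decoder outputs (assumed form)."""
    return float(np.mean((x - x_rec) ** 2))

def classification_loss(logits, onehot):
    """Categorical cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean((onehot * log_probs).sum(axis=-1)))

x = np.array([[1.0, 2.0, 3.0]])      # toy raw feature vector
x_rec = np.array([[1.0, 2.5, 2.0]])  # toy decoder output
rec = reconstruction_loss(x, x_rec)

logits = np.array([[2.0, 0.0]])      # toy classifier output before softmax
label = np.array([[1.0, 0.0]])       # one-hot ground truth
cls = classification_loss(logits, label)
```

A small reconstruction loss indicates that the encoded representation still carries enough unimodal information to regenerate the input. A small classification loss indicates that the code is informative about the label.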
In conclusion, the total loss of this section is:
(9) 
where is a hyperparameter that determines the importance of loss functions whose value is determined by grid search. During gradient update, firstly, we apply and to update encoders and decoders; secondly, we use to update the discriminator to improve its discriminative power with respect to fake/true distributions; lastly, is utilized to update encoders and classifier to improve discrimination of joint embedding space.
Graph Fusion Network
We explore crossmodality interactions by fusing encoded representations of all modalities in this section. Assuming that multimodal fusion is in multistage [11] and considering the need to preserve all modal interactions, we introduce graph fusion network (GFN), a hierarchical neural network, to model unimodal, bimodal and trimodal interactions successively. As shown in Fig. 2, the network consists of unimodal, bimodal and trimodal dynamic learning layers. GFN regards each interaction as a vertex and the similarities between interacted vertices as well as the interacted vertices’ importance as weights of edges, which is highly interpretable and flexible in terms of fusion structure.
The first layer is the unimodal dynamic learning layer consisting of unimodal vertices of three modalities whose information vectors are denoted as and , corresponding to the encoded representations and respectively. In first layer, we apply a Modality Attention Network to process each vertex and determine the importance of these unimodal interactions since not all modalities contribute equally. Then we obtain final unimodal dynamics by calculating the weighted average of information from all unimodal vertices, as shown below:
(10) 
where is the final unimodal vector and is the weight of modality . In practice, consists of a activated dense layer parameterized by .
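A minimal sketch of the unimodal layer is given below. The activation of the dense layer is not recoverable from the text, so a sigmoid is assumed. Normalising the scores so that they sum to one is likewise an assumption about what "weighted average" means here, and `unimodal_layer`, `W` and `b` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unimodal_layer(vertices, W, b):
    """vertices: (3, d) encoded representations of the three modalities.

    Returns the per-modality weights and the fused unimodal vector.
    """
    scores = sigmoid(vertices @ W + b)  # one scalar importance per modality
    weights = scores / scores.sum()     # normalise so the result is an average
    fused = weights @ vertices          # weighted average, shape (d,)
    return weights, fused

vertices = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
W = np.zeros(2)  # with zero parameters every modality gets the same score
b = 0.0
weights, fused = unimodal_layer(vertices, W, b)
```

With zero parameters the layer reduces to a plain mean of the three vertices. Training `W` and `b` lets the network emphasise whichever modality is most informative for a given sample.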
In the second layer, i.e., bimodal dynamic learning layer, each two unimodal vertices are fused using a multilayer neural fusion network: to obtain each bimodal vertex:
(11) 
where denotes vector concatenation, is the information vector of vertex and is the parameter matrix of . In practice, is composed of two dense layers activated by and respectively. As for the weights of those edges linking the first and second layer, we firstly estimate the similarity of each two uniformed unimodal information vectors of first layer using inner product. We argue that the more similar two information vectors are, the less important their bimodal interaction would be. This is premised on the assumption that provided two information vectors are close to each other, then little complementary information lies between them and their information has been well explored in the first layer. The similarity of two information vectors is defined as:
(12) 
where represents normalized vector of ( ensures that the computed similarity is between 0 to 1), means vector transpose operation and is a scalar that denotes the similarity of and vertices. The weight of edge that links vertex in first layer and vertex in second layer is defined as , which varies proportionally with but grows in inverse proportion to , so are the other weights of edges. Then, the weight for vertex is defined as:
(13) 
(14) 
where is a scalar that represents the weight of vertex (the value 0.5 in the denominator controls the relative importance of similarity and vertex weights to ), and Eq. (14) is a softmax normalization of the weight vector in the second layer. The output of the second layer is the weighted average of information in bimodal vertices:
(15) 
where is the combined bimodal dynamic.
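The weighting of bimodal vertices can be sketched as follows. Since the equations themselves are not reproduced here, the concrete forms are assumptions chosen to match the prose: similarity is the inner product of softmax-normalised vectors (which stays between 0 and 1), the raw vertex weight is the sum of the two unimodal weights divided by the similarity plus 0.5, and the raw weights are then softmax-normalised. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def similarity(x_i, x_j):
    """Inner product of softmax-normalised vectors; always within (0, 1]."""
    return float(softmax(x_i) @ softmax(x_j))

def bimodal_weights(vertices, uni_weights):
    """Normalised weights for the bimodal vertices (0,1), (0,2), (1,2)."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    raw = np.array([
        (uni_weights[i] + uni_weights[j])
        / (similarity(vertices[i], vertices[j]) + 0.5)
        for i, j in pairs
    ])
    return pairs, softmax(raw)

# Toy unimodal vertices: the first two are identical, the third differs.
vertices = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
uni_weights = np.array([1 / 3, 1 / 3, 1 / 3])
pairs, bi_weights = bimodal_weights(vertices, uni_weights)
```

In this toy case the pair built from the two identical vertices receives the smallest weight. This matches the stated intuition that similar vectors carry little complementary information.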
Every pair of bimodal vertices is further fused, using the same network that generates the bimodal dynamics, to obtain trimodal vertices in the third layer, i.e., the trimodal dynamic learning layer. In addition, as shown in Fig. 2, each bimodal vertex is also fused with the unimodal vertex that did not contribute to its formation in the previous fusion stage, resulting in three other trimodal vertices. Therefore, there are six trimodal vertices in total. We apply the same weight computation for the edges that link to the third layer and the same importance measure for trimodal vertices as in the second layer. Then we add up the weighted information from each trimodal vertex to obtain the final trimodal information .
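The count of six trimodal vertices can be checked by enumerating both construction rules. The single-letter modality labels below are illustrative.

```python
from itertools import combinations

modalities = ["a", "l", "v"]  # acoustic, language, visual
bimodal = [frozenset(p) for p in combinations(modalities, 2)]

# Rule 1: fuse every pair of bimodal vertices.
from_bimodal_pairs = list(combinations(bimodal, 2))

# Rule 2: fuse each bimodal vertex with the one unimodal vertex it lacks.
from_bi_and_uni = [(b, m) for b in bimodal for m in modalities if m not in b]

trimodal_vertices = from_bimodal_pairs + from_bi_and_uni
```

Each rule yields three vertices, and every resulting vertex covers all three modalities, so both kinds are genuinely trimodal.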
The final output of the hierarchical graph fusion network is the concatenation of trimodal, bimodal and unimodal dynamics, defined as: . Finally, a decision neural network is utilized to infer the final decision:
(16) 
where ( is the number of classes). is the decision inference module containing two dense layers activated by and respectively.
Experiments
Datasets
CMU-MOSI [35] is a widely used dataset that includes a collection of 93 opinion videos from different speakers, which have been divided into 2199 segments in total. We report the binary (positive and negative) sentiment accuracy and F1 score on this dataset. We use 1141 segments for training, 306 segments for validation and 752 segments for testing. CMU-MOSEI [34], consisting of 2928 videos (20802 segments in total), is the largest multimodal language analysis dataset so far. We report positive, negative and neutral sentiment classification accuracy and F1 score on this dataset. We use 13168 segments for training, 3020 segments for validation and 4614 segments for testing. IEMOCAP [2] is an emotion recognition dataset that contains 151 videos, which have been segmented into 7433 segments. The dataset contains 9 emotional labels. We analyze the anger, happiness, sadness and neutral emotions so as to compare with previous approaches. The training set consists of 4674 segments, while the validation and test sets have 1136 and 1623 segments, respectively.
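As a quick consistency check, the split sizes quoted above can be summed against the stated totals. The dictionary layout is only a convenience for this check.

```python
splits = {
    "CMU-MOSI": {"train": 1141, "valid": 306, "test": 752, "total": 2199},
    "CMU-MOSEI": {"train": 13168, "valid": 3020, "test": 4614, "total": 20802},
    "IEMOCAP": {"train": 4674, "valid": 1136, "test": 1623, "total": 7433},
}

# Each split must partition the dataset exactly.
consistent = {
    name: s["train"] + s["valid"] + s["test"] == s["total"]
    for name, s in splits.items()
}

# Share of segments held out for testing, per dataset.
test_fraction = {name: s["test"] / s["total"] for name, s in splits.items()}
```

All three splits add up to the reported totals. CMU-MOSI holds out a noticeably larger share for testing than the other two datasets.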
Experimental details
We develop our model in PyTorch. The implementation details of the encoder, decoder, classifier and discriminator are shown in Fig. 3. Specifically, the number inside the dense layer is the output dimensionality, is the dimensionality of encoded representations, is the number of classes and denotes the dimensionality of input feature vector ( is 50 for CMU-MOSI and CMU-MOSEI datasets and 100 for IEMOCAP dataset). We apply Mean Square Error as the loss function for the graph fusion network with Adam [10] as the optimizer. The framework is trained end to end. The hyperparameters such as batch size, learning rate and are chosen by grid search to optimize the performance.
In the feature pre-extraction stage, for CMU-MOSI and IEMOCAP, we follow the setting of [25]. A textCNN consisting of word2vec embedding [19] followed by CNNs [9], openSMILE [4] and 3D-CNN [7] are applied for language, acoustic and visual feature extraction respectively. For CMU-MOSEI, the features are extracted using CMU-MultimodalSDK (https://github.com/A2Zadeh/CMUMultimodalSDK). GloVe word embeddings [22], Facet (iMotions 2017, https://imotions.com/) and COVAREP [3] are applied for extracting language, visual and acoustic features respectively (please refer to [16] for more details).
After pre-extraction, similar to HFFN [16], we develop a Unimodal Feature Extraction Network (UFEN): , which is composed of a bidirectional LSTM layer followed by a dense layer, for each separate modality. Here, denotes the number of segments that constitute a video and is the dimensionality of raw feature vector for modality. Through UFEN, feature vectors of all modalities are mapped into the same dimensionality (). UFEN for each modality is individually trained followed by a dense layer: using Adadelta [36] as optimizer and categorical cross-entropy as loss function. The processed feature vectors output by UFEN are then sent into ARGF for modality fusion.
Results and Discussions
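Table 1: Results on CMU-MOSI. The term in parentheses after ARGF is the target modality.
| Methods | Avg Acc | Avg F1 |
| --- | --- | --- |
| LMFN [15] | 80.9 | 80.9 |
| TFN [31] | 78.3 | 78.3 |
| LMF [14] | 75.8 | 75.9 |
| CHFusion [17] | 80.0 | - |
| BC-LSTM [25] | 77.9 | 78.1 |
| HFFN [16] | 80.2 | 80.3 |
| ARGF (language) | 81.38 | 81.52 |
| ARGF (acoustic) | 81.25 | 81.32 |
| ARGF (visual) | 81.12 | 81.25 |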
Table 2: Results on IEMOCAP.
| Models | Happy | Sad | Neutral | Angry | Excited | Frustrated | Avg Acc | Avg F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMFN [15] | 31.25 | 64.90 | 58.33 | 62.94 | 47.83 | 69.55 | 58.10 | 57.88 |
| CHFusion [17] | 36.11 | 62.04 | 56.25 | 70.00 | 55.52 | 63.52 | 58.35 | 58.39 |
| BC-LSTM [25] | 35.42 | 58.37 | 53.91 | 64.71 | 54.18 | 60.63 | 55.70 | 55.78 |
| TFN [31] | 29.86 | 55.51 | 48.81 | 60.59 | 57.86 | 63.25 | 54.28 | 54.19 |
| LMF [14] | 26.39 | 49.39 | 56.77 | 61.18 | 47.16 | 63.25 | 53.17 | 53.02 |
| HFFN [16] | 31.25 | 63.67 | 54.69 | 61.18 | 48.83 | 69.82 | 57.12 | 56.82 |
| ARGF () | 15.28 | 68.98 | 59.11 | 68.24 | 72.58 | 61.42 | 60.69 | 59.53 |
| ARGF () | 22.92 | 62.86 | 57.55 | 65.29 | 67.22 | 67.45 | 60.20 | 59.75 |
| ARGF () | 26.39 | 68.98 | 54.95 | 62.35 | 64.21 | 68.50 | 60.20 | 59.81 |
Table 3: Results on CMU-MOSEI.
| Models | Positive Acc | Positive F1 | Negative Acc | Negative F1 | Neutral Acc | Neutral F1 | Avg Acc | Avg F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMFN [15] | 61.88 | 59.98 | 26.21 | 31.42 | 75.54 | 71.49 | 60.77 | 59.42 |
| TFN [31] | 60.46 | 58.01 | 18.30 | 25.08 | 77.70 | 70.91 | 59.69 | 57.13 |
| LMF [14] | 68.91 | 59.85 | 11.77 | 18.31 | 75.06 | 70.15 | 59.41 | 55.80 |
| CHFusion [17] | 59.19 | 56.61 | 22.55 | 27.96 | 73.69 | 69.56 | 58.28 | 56.69 |
| BC-LSTM [25] | 64.20 | 60.97 | 24.53 | 29.74 | 74.53 | 71.19 | 60.58 | 59.14 |
| HFFN [16] | 59.49 | 59.05 | 26.61 | 31.35 | 75.85 | 71.35 | 60.32 | 59.03 |
| ARGF () | 64.57 | 60.55 | 21.66 | 28.01 | 76.20 | 71.77 | 60.88 | 58.92 |
| ARGF () | 61.29 | 59.61 | 28.78 | 33.31 | 74.30 | 71.16 | 60.55 | 59.52 |
| ARGF () | 62.33 | 60.04 | 20.87 | 26.99 | 77.88 | 71.99 | 60.88 | 58.66 |
Comparison with Baselines: As presented in Table 1, ARGF improves over typical approaches and outperforms the SOTA method LMFN by about 0.5% on the CMU-MOSI dataset. Moreover, compared to the tensor fusion methods TFN and LMF, ARGF achieves improvements of about 3% and 6% respectively. We argue that this is probably partly because these approaches do not explore a modality-invariant embedding space, while we adopt adversarial training to learn a joint embedding space before fusion. These results demonstrate the superiority of ARGF. We also report ARGF’s performance on the more challenging datasets IEMOCAP and CMU-MOSEI to evaluate its robustness. For IEMOCAP, from Table 2 we can infer that ARGF achieves the best performance and outperforms SOTA methods by about 2% in the most important index, Avg Acc. For CMU-MOSEI, as shown in Table 3, the average accuracy and F1 score of ARGF are still the highest among all methods. In addition, all the models’ performance on the ‘Negative’ class is weaker than on the other classes by a large margin. We speculate that one reason is that the ‘Negative’ class has far fewer samples than the other classes. Consequently the model tends to predict the other classes, which is less ‘risky’.
Table 4: Number of trainable parameters and FLOPs during testing.
| Methods | Parameters | FLOPs |
| --- | --- | --- |
| BC-LSTM [25] | 1,383,902 | 1,322,044 |
| TFN [31] | 4,245,986 | 8,491,844 |
| ARGF (ours) | 1,270,770 | 2,017,645 |
Complexity Analysis: We use the number of trainable parameters as a proxy for space complexity. As shown in Table 4, our model has 1,270,770 trainable parameters ( is set to 50), which is approximately 29.93% and 91.83% of the number of parameters of TFN and BC-LSTM, respectively. To explore the time complexity of ARGF and compare it against baselines, we compute FLOPs for each method during testing. We empirically find that ARGF needs 2,017,645 FLOPs in testing, whereas the number of FLOPs in TFN and BC-LSTM is 8,491,844 and 1,322,044, respectively. The reason for the fewer trainable parameters and the moderate number of FLOPs is that, despite having multiple components in the architecture, our model uses low dimensions for unimodal and multimodal representations. In addition, every component in our architecture has only several layers, which keeps the computational load reasonable.
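The quoted percentages follow directly from the counts in Table 4, as the short computation below shows. The dictionary names are illustrative.

```python
params = {"BC-LSTM": 1_383_902, "TFN": 4_245_986, "ARGF": 1_270_770}
flops = {"BC-LSTM": 1_322_044, "TFN": 8_491_844, "ARGF": 2_017_645}

# ARGF's parameter count as a percentage of each baseline's.
param_ratio = {
    name: round(100 * params["ARGF"] / count, 2)
    for name, count in params.items()
    if name != "ARGF"
}

# ARGF's test-time FLOPs relative to each baseline (1.0 means equal cost).
flop_ratio = {
    name: flops["ARGF"] / count
    for name, count in flops.items()
    if name != "ARGF"
}
```

The parameter ratios reproduce the 29.93% and 91.83% figures. The FLOP ratios show that ARGF is cheaper than TFN but costlier than BC-LSTM at test time, which is what "moderate" refers to.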
Choice of Target Modality: As presented in Tables 1, 2 and 3, the language modality is the best choice for CMU-MOSI, while the acoustic and visual modalities come second and third respectively by a slight margin. In fact, all three modalities perform very closely across the three datasets. There is no hard and fast rule for choosing an optimal target modality since their performances are rather similar. This is reasonable because the target modality only serves as a distribution provider so that the source modalities can map their transformed distributions into that of the target modality. During the distribution mapping procedure, all the modalities’ information is well preserved by the use of decoders and classifier. Therefore, in theory the choice of target modality should make no difference.
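Table 5: Ablation studies (W/O AT: without adversarial training; W/O C: without classifier; W/O D: without decoders).
| Setting | Acc | F1 score |
| --- | --- | --- |
| W/O AT | 80.11 | 80.23 |
| W/O C | 80.92 | 80.98 |
| W/O D | 79.72 | 79.76 |
| ARGF | 81.38 | 81.52 |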
Ablation Studies: From Table 5 we can see that the removal of any component in ARGF leads to a decline in performance. Specifically, the removal of decoders leads to the most significant decline of more than 1.5%, closely followed by the removal of adversarial training (a decline of about 1.3%). In addition, experiments without the classifier achieve relatively acceptable results, with a slight decrease of around 0.5%, but the classifier is critical for learning a discriminative embedding space as demonstrated in the visualization part. The ablation studies show that these three components are crucial factors contributing to the competitive results of ARGF.
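Table 6: Accuracy of different fusion strategies within our framework.
| Fusion strategy | CMU-MOSI | IEMOCAP |
| --- | --- | --- |
| Concatenation + FC | 80.19 | 57.98 |
| Multiplication + FC | 80.05 | 58.60 |
| Weighted Average | 78.86 | 59.64 |
| Tensor Fusion | 79.65 | 59.46 |
| Low-rank Modality Fusion (LMF) | 78.72 | 59.21 |
| Dynamic Fusion Graph (DFG) | 79.92 | 57.79 |
| Graph Fusion Network (GFN) | 81.38 | 60.20 |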
Comparison between fusion strategies: To demonstrate that our fusion method is indeed effective, we conduct a contrast experiment that compares our GFN with other fusion strategies. We can infer from Table 6 that GFN improves performance over the other fusion methods on both datasets. Specifically, DFG [34] achieves comparable results on the two datasets, but our GFN outperforms it by over 2% on IEMOCAP and over 1% on CMU-MOSI. We argue that this is because we explicitly model unimodal, bimodal and trimodal dynamics and obtain trimodal representations in a more comprehensive way (as also revealed in the following visualization part). From the above analysis, we conclude that the interpretable, comprehensive and hierarchical fusion strategy is a crucial factor that leads to the marked improvement of ARGF. In addition, all the fusion strategies fitted into our framework obtain relatively good results compared with the original frameworks. For instance, our version of Tensor Fusion [31] achieves an accuracy of 59.46% on IEMOCAP, which is much higher than TFN’s accuracy under the same experimental setting (54.28%, see Table 2). These comparisons suggest that learning a joint embedding space before feature fusion is indeed important and effective.
Visualization for Weights of Vertices in Graph Fusion: This visualization is conducted to prove graph fusion’s interpretability. As shown in Fig. 5, for unimodal interactions (the first three columns), obviously language modality is the most predictive for the majority of samples, which is reasonable since language is the most important clue for sentiment analysis. But its importance varies greatly across samples, indicating the advantage of analyzing sentiment in a multimodal form for other modalities can play a dominant role whenever language modality is trivial for a specific sample. For bimodal interactions (from to columns), weights for vertex () and vertex () are very close, followed closely by , possibly because language modality plays a huge role in bimodal fusion. For trimodal interactions, to our surprise, vertices obtained by fusing a bimodal vertex and a unimodal vertex hardly make any difference, but vertices obtained by fusing two bimodal vertices dominate trimodal information. It proves the necessity to explore interactions for each two bimodal vertices, which are not modeled by DFG [34].
Visualization of Embedding Space: We provide a visualization of the distributions of sentiments in the embedding space, where the left subfigure of Fig. 6 illustrates the embedding space learnt by ARGF while the right subfigure presents the embedding space learnt without the classifier. The visualization is obtained by transforming the concatenated features of all modalities output by the encoders. We use the t-SNE algorithm to transform the high-dimensional concatenated feature vectors into 2-dimensional feature points. We can infer from Fig. 6 that in the embedding space learnt by ARGF, the dots tend to form two separate clusters which mainly belong to positive and negative sentiment respectively. The distance between the two clusters is large while the dots belonging to the same cluster are close together, which demonstrates the discrimination of our embedding space. Nevertheless, some dots that are extremely difficult to classify correctly are found in the wrong clusters, which drives the need for advanced fusion strategies to explore crossmodal dynamics. In contrast, it is clear that the embedding space learnt without the classifier cannot distinguish positive and negative sentiments effectively, which highlights the necessity of the classifier in learning a discriminative embedding space.
Conclusions
We propose a novel fusion framework to learn a discriminative joint embedding space and then perform graph fusion. By introducing adversarial training to match distributions, modality gap can be significantly narrowed and the representations can be directly fused. With the aid of GFN, we can explore unimodal, bimodal and trimodal dynamics successively and dynamically change the fusion structure.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (61673402,61273270, 60802069), and the Natural Science Foundation of Guangdong Province (2017A030311029).
References
 [1] (201902) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443. External Links: Document, ISSN 01628828 Cited by: Introduction, Introduction.
 [2] (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335–359. Cited by: Datasets.
 [3] (2014) COVAREP: a collaborative voice analysis repository for speech technologies. In ICASSP, pp. 960–964. Cited by: Experimental details.

[4]
(2010)
Opensmile: the munich versatile and fast opensource audio feature extractor
. In ACM International Conference on Multimedia, pp. 1459–1462. Cited by: Experimental details.  [5] (2014) Generative adversarial nets. In NeurIPS, Cited by: Introduction, Related Work, Joint Embedding Space Learning.
 [6] (2018) Multimodal affective analysis using hierarchical attention strategy with wordlevel alignment. In ACL, pp. 2225–2235. Cited by: Related Work.

 [7] (2013) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231.
 [8] (2018) Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. In ACL short paper.
 [9] (2014) Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, pp. 1725–1732.
 [10] (2015) Adam: a method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR).
 [11] (2018) Multimodal language analysis with recurrent multistage fusion. In EMNLP, pp. 150–161.
 [12] (2015) Semantics-preserving hashing for cross-view retrieval. In CVPR, pp. 3864–3872.
 [13] (2012) A survey of opinion mining and sentiment analysis. In Mining Text Data, C. C. Aggarwal and C. Zhai (Eds.), pp. 415–463. ISBN 9781461432234.
 [14] (2018) Efficient low-rank multimodal fusion with modality-specific factors. In ACL, pp. 2247–2256.
 [15] (2019) Locally confined modality fusion network with a global perspective for multimodal human affective computing. IEEE Transactions on Multimedia, pp. 1–1. ISSN 1520-9210.
 [16] (2019-07) Divide, conquer and combine: hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In ACL, pp. 481–492.
 [17] (2018-06) Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems 161, pp. 124–133.
 [18] (2016) Adversarial autoencoders. In ICLR Workshop.
 [19] (2013) Efficient estimation of word representations in vector space. In ICLR Workshop, pp. 1725–1732.
 [20] (2016) Deep multimodal fusion for persuasiveness prediction. In Proceedings of ACM International Conference on Multimodal Interaction, pp. 284–288.
 [21] (2017-10) CM-GANs: cross-modal generative adversarial networks for common representation learning. IEEE Transactions on Multimedia.

 [22] (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
 [23] (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In AAAI, pp. 6892–6899.
 [24] (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion 37, pp. 98–125.
 [25] (2017) Context-dependent sentiment analysis in user-generated videos. In ACL, pp. 873–883.
 [26] (2016) Convolutional MKL based multimodal emotion recognition and sentiment analysis. In ICDM, pp. 439–448.
 [27] (2018) Multimodal learning using 3D audio-visual data for audio-visual speech recognition. In International Conference on Asian Language Processing.
 [28] (2019-07) Multimodal transformer for unaligned multimodal language sequences. In ACL, pp. 6558–6569.
 [29] (2013) YouTube movie reviews: sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28 (3), pp. 46–53.
 [30] (2018) Deep adversarial metric learning for cross-modal retrieval. In World Wide Web, pp. 1–16.
 [31] (2017) Tensor fusion network for multimodal sentiment analysis. In EMNLP, pp. 1114–1125.
 [32] (2018) Memory fusion network for multi-view sequential learning. In AAAI, pp. 5634–5641.
 [33] (2018) Multi-attention recurrent network for human communication comprehension. In AAAI, pp. 5642–5649.
 [34] (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL, pp. 2236–2246.
 [35] (2016-11) Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intelligent Systems 31 (6), pp. 82–88.
 [36] (2012) ADADELTA: an adaptive learning rate method. preprint, arXiv:1212.5701v1.