1 Introduction
Multimodal research, an emerging field of artificial intelligence, has shown great progress in a variety of tasks. Tasks such as speech recognition Yuhas et al. (1989), emotion recognition De Silva et al. (1997); Chen et al. (1998); Wöllmer et al. (2013), sentiment analysis Morency et al. (2011), as well as speaker trait analysis and media description Park et al. (2014a), have seen a great boost in performance with developments in multimodal research.
However, a core research challenge yet to be solved in this domain is multimodal fusion. The goal of fusion is to combine multiple modalities to leverage the complementarity of heterogeneous data and provide more robust predictions. In this regard, an important challenge has been scaling up fusion to multiple modalities while maintaining reasonable model complexity. Some recent attempts at multimodal fusion Fukui et al. (2016); Zadeh et al. (2017) investigate the use of tensors for multimodal representation and show significant improvements in performance. Unfortunately, they are often constrained by the exponential increase in computational and memory cost introduced by the tensor representations. This heavily restricts the applicability of these models, especially when the dataset has more than two modalities.
In this paper, we propose Low-rank Multimodal Fusion (LMF), a method leveraging low-rank weight tensors to make multimodal fusion efficient without compromising performance. The overall architecture is shown in Figure 1. We evaluate our approach with experiments on three multimodal tasks using public datasets and compare its performance with state-of-the-art models. We also study how different low-rank settings impact the performance of our model and show that it performs robustly within a wide range of rank settings. Finally, we analyze the impact of our method on the number of parameters and runtime in comparison to other fusion methods. Through theoretical analysis, we show that our model scales linearly in the number of modalities, and our experiments show a corresponding speedup in training when compared with other tensor-based models.
The main contributions of our paper are as follows:

We propose the Low-rank Multimodal Fusion method, a multimodal fusion method that scales linearly in the number of modalities.

We show that our model is competitive with state-of-the-art models on three multimodal tasks evaluated on public datasets.

We show that our model is computationally efficient and has fewer parameters than previous tensor-based methods.
2 Related Work
Multimodal fusion enables us to leverage complementary information present in multimodal data, thus discovering the dependency of information on multiple modalities. Previous studies have shown that more effective fusion methods translate to better model performance, and a wide range of fusion methods has been proposed.
Early fusion uses feature concatenation as the method of fusing different views. Several works that adopt this approach Poria et al. (2016); Wang et al. (2016) use input-level feature concatenation and feed the concatenated features to the model, sometimes even removing the temporal dependency present in the modalities Morency et al. (2011). The drawback of this class of methods is that although it achieves fusion at an early stage, intra-modal interactions are potentially suppressed, thus losing out on the context and temporal dependencies within each modality.
On the other hand, late fusion builds separate models for each modality and then integrates their outputs using a method such as majority voting or weighted averaging Wortwein and Scherer (2017); Nojavanasghari et al. (2016). Since separate models are built for each modality, inter-modal interactions are usually not modeled effectively.
Given these shortcomings, more recent work focuses on intermediate approaches that model both intra- and inter-modal dynamics. Fukui et al. (2016) propose Compact Bilinear Pooling over the outer product of visual and linguistic representations to exploit the interactions between vision and language for visual question answering. Similarly exploiting interactions, Zadeh et al. (2017) propose the Tensor Fusion Network, which computes the outer product between unimodal representations from three different modalities to obtain a tensor representation. These methods exploit tensor representations to model inter-modality interactions and have shown great success. However, such methods suffer from exponentially increasing computational complexity, as the outer product over multiple modalities results in extremely high-dimensional tensor representations.
For unimodal data, low-rank tensor approximation has been used in a variety of applications to implement more efficient tensor operations. Razenshteyn et al. (2016) propose a modified weighted version of low-rank approximation, and Koch and Lubich (2010) apply the method to temporally dependent data to obtain low-rank approximations. As for applications, Lei et al. (2014) propose a low-rank tensor technique for dependency parsing, while Wang and Ahuja (2008) apply low-rank approximation directly to multidimensional image data (Datum-as-is representation) to enhance computer vision applications. Hu et al. (2017) propose a low-rank tensor-based fusion framework to improve face recognition performance using the fusion of facial attribute information. However, none of these previous works aims to apply low-rank tensor techniques to multimodal fusion.
Our Low-rank Multimodal Fusion method provides a much more efficient way to compute tensor-based multimodal representations, with far fewer parameters and lower computational complexity. The efficiency and performance of our approach are evaluated on different downstream tasks, namely sentiment analysis, speaker-trait recognition, and emotion recognition.
3 Low-rank Multimodal Fusion
In this section, we start by formulating the problem of multimodal fusion and introducing fusion methods based on tensor representations. Tensors are powerful in their expressiveness but do not scale well to a large number of modalities. Our proposed model decomposes the weights into low-rank factors, which reduces the number of parameters in the model. This decomposition can be performed efficiently by exploiting the parallel decomposition of the low-rank weight tensor and the input tensor to compute tensor-based fusion. As a result, our method scales linearly with the number of modalities.
3.1 Multimodal Fusion using Tensor Representations
In this paper, we formulate multimodal fusion as a multilinear function $f : V_1 \times V_2 \times \cdots \times V_M \rightarrow H$, where $V_1, \ldots, V_M$ are the vector spaces of the input modalities and $H$ is the output vector space. Given a set of vector representations $\{z_m\}_{m=1}^{M}$ encoding unimodal information of the different modalities, the goal of multimodal fusion is to integrate the unimodal representations into one compact multimodal representation for downstream tasks.

Tensor representation is one successful approach for multimodal fusion. It first requires a transformation of the input representations into a high-dimensional tensor, which is then mapped back to a lower-dimensional output vector space. Previous works have shown that this method is more effective than simple concatenation or pooling in terms of capturing multimodal interactions Zadeh et al. (2017); Fukui et al. (2016). Tensors are usually created by taking the outer product over the input modalities. In addition, in order to model the interactions between any subset of modalities using one tensor, Zadeh et al. (2017) proposed a simple extension to append 1s to the unimodal representations before taking the outer product. The input tensor formed by the unimodal representations is computed by:

$\mathcal{Z} = \bigotimes_{m=1}^{M} z_m, \quad z_m \in \mathbb{R}^{d_m}$ (1)

where $\bigotimes_{m=1}^{M}$ denotes the tensor outer product over a set of vectors indexed by $m$, and $z_m$ is the input representation of modality $m$ with an appended 1.
The input tensor $\mathcal{Z}$ is then passed through a linear layer $g(\cdot)$ to produce a vector representation:

$h = g(\mathcal{Z}; \mathcal{W}, b) = \mathcal{W} \cdot \mathcal{Z} + b, \quad h, b \in \mathbb{R}^{d_h}$ (2)

where $\mathcal{W}$ is the weight of this layer and $b$ is the bias. With $\mathcal{Z}$ being an order-$M$ tensor (where $M$ is the number of input modalities), the weight $\mathcal{W}$ will naturally be a tensor of order $M+1$ in $\mathbb{R}^{d_1 \times \cdots \times d_M \times d_h}$. The extra $(M+1)$-th dimension corresponds to the size of the output representation, $d_h$. In the tensor dot product $\mathcal{W} \cdot \mathcal{Z}$, the weight tensor $\mathcal{W}$ can then be viewed as $d_h$ order-$M$ tensors. In other words, $\mathcal{W}$ can be partitioned into $\bar{\mathcal{W}}_k \in \mathbb{R}^{d_1 \times \cdots \times d_M}$, $k = 1, \ldots, d_h$. Each $\bar{\mathcal{W}}_k$ contributes to one dimension of the output vector $h$, i.e. $h_k = \bar{\mathcal{W}}_k \cdot \mathcal{Z}$. This interpretation of tensor fusion is illustrated in Figure 2 for the bimodal case.
One of the main drawbacks of tensor fusion is that we have to explicitly create the high-dimensional tensor $\mathcal{Z}$. The dimensionality of $\mathcal{Z}$ increases exponentially with the number of modalities as $\prod_{m=1}^{M} d_m$, and the number of parameters to learn in the weight tensor $\mathcal{W}$ increases accordingly. This not only introduces a lot of computation but also exposes the model to risks of overfitting.
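To make the scaling issue concrete, the following numpy sketch (with made-up modality dimensions) carries out the tensor fusion of Equations 1 and 2 explicitly:

```python
import numpy as np

# Hypothetical unimodal representations for three modalities
# (dimensions chosen purely for illustration).
z_a = np.random.randn(8)   # acoustic
z_v = np.random.randn(16)  # visual
z_l = np.random.randn(32)  # language

# Append a constant 1 to each representation so that the outer product
# also contains all unimodal and bimodal sub-tensors (Zadeh et al., 2017).
z_a1, z_v1, z_l1 = (np.append(z, 1.0) for z in (z_a, z_v, z_l))

# Equation 1: the input tensor Z as the outer product of the modalities.
Z = np.einsum('i,j,k->ijk', z_a1, z_v1, z_l1)
print(Z.shape)  # (9, 17, 33) -- grows as the product of the modality dims

# Equation 2: a linear layer mapping the tensor to an output vector h.
d_h = 4
W = np.random.randn(9, 17, 33, d_h)  # order-(M+1) weight tensor
b = np.random.randn(d_h)
h = np.einsum('ijk,ijkh->h', Z, W) + b
print(h.shape)  # (4,)
```

Even in this toy setting the fused tensor has 9 × 17 × 33 = 5,049 entries and the weight tensor $d_h$ times that many parameters; adding a fourth modality would multiply both by yet another factor.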
3.2 Low-rank Multimodal Fusion with Modality-Specific Factors
As a solution to the problems of tensor-based fusion, we propose Low-rank Multimodal Fusion (LMF). LMF parameterizes $\mathcal{W}$ from Equation 2 with a set of modality-specific low-rank factors that can be used to recover a low-rank weight tensor, in contrast to the full tensor $\mathcal{W}$. Moreover, we show that by decomposing the weight into a set of low-rank factors, we can exploit the fact that the tensor $\mathcal{Z}$ actually decomposes into $\{z_m\}_{m=1}^{M}$, which allows us to directly compute the output $h$ without explicitly tensorizing the unimodal representations. LMF reduces the number of parameters as well as the computational complexity of tensorization from being exponential in $M$ to linear.
3.2.1 Low-rank Weight Decomposition
The idea of LMF is to decompose the weight tensor $\mathcal{W}$ into $M$ sets of modality-specific factors. However, since $\mathcal{W}$ itself is an order-$(M+1)$ tensor, commonly used decomposition methods would result in $M+1$ parts. Hence, we still adopt the view introduced in Section 3.1 that $\mathcal{W}$ is formed by $d_h$ order-$M$ tensors $\bar{\mathcal{W}}_k$ stacked together, and decompose each $\bar{\mathcal{W}}_k$ separately.
For an order-$M$ tensor $\bar{\mathcal{W}} \in \mathbb{R}^{d_1 \times \cdots \times d_M}$, there always exists an exact decomposition into vectors of the form:

$\bar{\mathcal{W}} = \sum_{i=1}^{R} \bigotimes_{m=1}^{M} w_m^{(i)}, \quad w_m^{(i)} \in \mathbb{R}^{d_m}$ (3)

The minimal $R$ that makes the decomposition valid is called the rank of the tensor. The vector sets $\{\{w_m^{(i)}\}_{m=1}^{M}\}_{i=1}^{R}$ are called the rank-$R$ decomposition factors of the original tensor.
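Equation 3 is the canonical polyadic (CP) form of a tensor. As a small numerical illustration (all factor values are arbitrary), a rank-2 order-3 tensor can be built from its factors and reconstructed equivalently in a single contraction:

```python
import numpy as np

# Build a rank-2 order-3 tensor from its decomposition factors
# (Equation 3): W_bar = sum_i  w1^(i) (x) w2^(i) (x) w3^(i).
rng = np.random.default_rng(0)
r = 2
w1 = rng.normal(size=(r, 5))
w2 = rng.normal(size=(r, 6))
w3 = rng.normal(size=(r, 7))

# Sum of r outer products, one rank-1 term at a time.
W_bar = sum(np.einsum('i,j,k->ijk', w1[i], w2[i], w3[i]) for i in range(r))

# The same reconstruction in one einsum over the rank index.
W_bar2 = np.einsum('ri,rj,rk->ijk', w1, w2, w3)
assert np.allclose(W_bar, W_bar2)
```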
In LMF, we start with a fixed rank $r$ and parameterize the model with $r$ decomposition factors per modality that can be used to reconstruct a low-rank version of these $\bar{\mathcal{W}}_k$.

We can regroup and concatenate these vectors into $M$ modality-specific low-rank factors. Let $w_m^{(i)} = [w_{m,1}^{(i)}, w_{m,2}^{(i)}, \ldots, w_{m,d_h}^{(i)}]$; then for modality $m$, $\{w_m^{(i)}\}_{i=1}^{r}$ is its corresponding set of low-rank factors, and we can recover a low-rank weight tensor by:

$\mathcal{W} = \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)}$ (4)

Hence Equation 2 can be computed by:

$h = \left( \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \right) \cdot \mathcal{Z}$ (5)
Note that for all $m$, $w_m^{(i)}$ shares the same size $d_h$ for its second dimension. We define their outer product to be over only the dimensions that are not shared, e.g. $w_m^{(i)} \otimes w_n^{(i)} \in \mathbb{R}^{d_m \times d_n \times d_h}$. A bimodal example of this procedure is illustrated in Figure 3.
Nevertheless, by introducing the low-rank factors, we now have to reconstruct $\mathcal{W}$ for the forward computation, which by itself introduces even more computation.
3.2.2 Efficient Low-rank Fusion Exploiting Parallel Decomposition
In this section, we introduce an efficient procedure for computing $h$ that exploits the fact that the tensor $\mathcal{Z}$ naturally decomposes into the original inputs $\{z_m\}_{m=1}^{M}$, in parallel with the modality-specific low-rank factors. This is, in fact, the main reason for decomposing the weight tensor into modality-specific factors.
Using the fact that $\mathcal{Z} = \bigotimes_{m=1}^{M} z_m$, we can simplify Equation 5:

$h = \left( \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \right) \cdot \mathcal{Z} = \sum_{i=1}^{r} \left( \bigotimes_{m=1}^{M} w_m^{(i)} \cdot \bigotimes_{m=1}^{M} z_m \right) = \sum_{i=1}^{r} \Lambda_{m=1}^{M} \left( w_m^{(i)} \cdot z_m \right)$ (6)

where $\Lambda_{m=1}^{M}$ denotes the element-wise product over a sequence of tensors: $\Lambda_{m=1}^{3} x_m = x_1 \circ x_2 \circ x_3$.
An illustration of the trimodal case of Equation 6 is shown in Figure 1. We can also write out Equation 6 for the bimodal case to clarify what it does:

$h = \left( \sum_{i=1}^{r} w_a^{(i)} \otimes w_v^{(i)} \right) \cdot \mathcal{Z} = \sum_{i=1}^{r} \left( w_a^{(i)} \cdot z_a \right) \circ \left( w_v^{(i)} \cdot z_v \right)$ (7)
An important aspect of this simplification is that it exploits the parallel decomposition of both $\mathcal{Z}$ and $\mathcal{W}$, so that we can compute $h$ without actually creating the tensor $\mathcal{Z}$ from the input representations $z_m$. In addition, the different modalities are decoupled in the simplified computation of $h$, which allows for easy generalization of our approach to an arbitrary number of modalities: adding a new modality simply requires adding another set of modality-specific factors and extending Equation 6. Last but not least, Equation 6 consists of fully differentiable operations, which enables the parameters to be learned end-to-end via backpropagation.
Using Equation 6, we can compute $h$ directly from the input unimodal representations and their modality-specific decomposition factors, avoiding the heavy lifting of computing the large input tensor $\mathcal{Z}$ and the weight $\mathcal{W}$, as well as the subsequent linear transformation. Instead, the input tensor and the linear projection are computed implicitly together in Equation 6, which is far more efficient than the original method described in Section 3.1. Indeed, LMF reduces the computational complexity of tensorization and fusion from $O(d_h \prod_{m=1}^{M} d_m)$ to $O(d_h \times r \times \sum_{m=1}^{M} d_m)$.
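The equivalence underlying this simplification can be checked numerically. The sketch below, with assumed shapes, computes the bimodal output both ways: by explicitly reconstructing the low-rank weight tensor and contracting it with the input tensor, and via the decomposed form of Equation 7:

```python
import numpy as np

rng = np.random.default_rng(1)
d_a, d_v, d_h, r = 8, 16, 4, 3

# Modality-specific low-rank factors (shapes assumed for illustration):
# each w_m[i] maps modality m into the d_h-dimensional output space.
w_a = rng.normal(size=(r, d_a, d_h))
w_v = rng.normal(size=(r, d_v, d_h))
z_a = rng.normal(size=d_a)
z_v = rng.normal(size=d_v)

# Explicit route: reconstruct the low-rank weight tensor (Equation 4)
# and contract it with the input tensor Z = z_a (x) z_v (Equation 5).
W = np.einsum('rah,rvh->avh', w_a, w_v)
Z = np.outer(z_a, z_v)
h_explicit = np.einsum('av,avh->h', Z, W)

# Efficient route (Equation 7): never form Z or W; project each modality
# with its own factors, multiply element-wise, and sum over the rank.
h_lmf = sum((w_a[i].T @ z_a) * (w_v[i].T @ z_v) for i in range(r))

assert np.allclose(h_explicit, h_lmf)
```

The explicit route allocates a d_a × d_v × d_h weight tensor; the efficient route touches only the factors themselves.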
In practice, we use a slightly different form of Equation 6, where we concatenate the low-rank factors of each modality into an order-3 tensor $\mathcal{W}_m = [w_m^{(1)}; w_m^{(2)}; \ldots; w_m^{(r)}] \in \mathbb{R}^{r \times d_m \times d_h}$, perform the element-wise products on the stacked projections, and only then sum along the rank dimension:

$h = \sum_{i=1}^{r} \left[ \Lambda_{m=1}^{M} \mathcal{W}_m \cdot z_m \right]_{i}$ (8)

where the summation is done along the first dimension of the bracketed tensor and $[\cdot]_i$ indicates its $i$-th slice. In this way, we can parameterize the model with $M$ order-3 tensors instead of $M$ sets of $r$ vectors.
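A minimal numpy sketch of this stacked formulation, with illustrative dimensions (the names and shapes here are assumptions, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
r, d_h = 3, 4
dims = {'a': 8, 'v': 16, 'l': 32}  # hypothetical modality dimensions

# Each modality's r factors kept in a single order-3 array of shape
# (r, d_m, d_h), as in Equation 8.
factors = {m: rng.normal(size=(r, d, d_h)) for m, d in dims.items()}
z = {m: rng.normal(size=d) for m, d in dims.items()}

# Project every modality with all r factors at once -> (r, d_h) each,
# take the element-wise product across modalities, then sum over rank.
projected = [np.einsum('rdh,d->rh', factors[m], z[m]) for m in dims]
h = np.prod(projected, axis=0).sum(axis=0)
print(h.shape)  # (4,)
```

Because each modality is projected independently before the element-wise product, a fourth modality only adds one more `(r, d_m, d_h)` factor array.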
4 Experimental Methodology
We compare LMF with previous state-of-the-art baselines. In particular, we use the Tensor Fusion Network (TFN) Zadeh et al. (2017) as the representative tensor-based approach, since it has the structure most similar to ours except that it explicitly forms the large multi-dimensional tensor for fusion across the different modalities.
We design our experiments to better understand the characteristics of LMF. Our goal is to answer the following four research questions:
(1) Impact of Low-rank Multimodal Fusion: a direct comparison between our proposed LMF model and the previous TFN model.
(2) Comparison with the State-of-the-art: We evaluate the performance of LMF and state-of-the-art baselines on three different tasks and datasets.
(3) Complexity Analysis: We study the model complexity of LMF and compare it with the TFN model.
(4) Rank Settings: We explore the performance of LMF under different rank settings.
The results of these experiments are presented in Section 5.
4.1 Datasets


Table 1: Data splits of the three datasets.

Dataset | CMU-MOSI | IEMOCAP | POM
Level   | Segment  | Segment | Video
# Train | 1284     | 6373    | 600
# Valid | 229      | 1775    | 100
# Test  | 686      | 1807    | 203
We perform our experiments on the following multimodal datasets: CMU-MOSI Zadeh et al. (2016a) for sentiment analysis, POM Park et al. (2014b) for speaker traits recognition, and IEMOCAP Busso et al. (2008) for emotion recognition, where the goal is to identify speakers' emotions based on their verbal and nonverbal behaviors.
CMU-MOSI The CMU-MOSI dataset is a collection of 93 opinion videos from YouTube movie reviews. Each video consists of multiple opinion segments, and each segment is annotated with a sentiment score in the range [-3, 3], where -3 indicates highly negative and +3 highly positive.
POM The POM dataset is composed of 903 movie review videos. Each video is annotated with the following speaker traits: confident, passionate, voice pleasant, dominant, credible, vivid, expertise, entertaining, reserved, trusting, relaxed, outgoing, thorough, nervous, persuasive and humorous.
IEMOCAP The IEMOCAP dataset is a collection of 151 videos of recorded dialogues, with 2 speakers per session for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral).
To evaluate model generalization, all datasets are split into training, validation, and test sets such that the splits are speaker independent, i.e., no identical speakers from the training set are present in the test sets. Table 1 illustrates the data splits for all datasets in detail.
4.2 Features
Each dataset consists of three modalities, namely the language, visual, and acoustic modalities. To achieve the same time alignment across modalities, we perform word alignment using P2FA Yuan and Liberman (2008), which allows us to align the three modalities at the word granularity. We calculate the visual and acoustic features by taking the average of their feature values over each word's time interval Chen et al. (2017).
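As a toy illustration of this averaging step (all feature values and word intervals below are made up):

```python
import numpy as np

# Average frame-level features over each word's time interval, as done
# for the visual and acoustic modalities after forced alignment.
frame_times = np.arange(0.0, 1.0, 0.1)             # 10 frames at 10 Hz
frame_feats = np.random.randn(10, 3)               # e.g. 3 visual features
word_spans = [(0.0, 0.3), (0.3, 0.7), (0.7, 1.0)]  # from the aligner

word_feats = np.stack([
    frame_feats[(frame_times >= start) & (frame_times < end)].mean(axis=0)
    for start, end in word_spans
])
print(word_feats.shape)  # (3, 3): one averaged feature vector per word
```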
Language We use pretrained 300-dimensional GloVe word embeddings Pennington et al. (2014) to encode a sequence of transcribed words into a sequence of word vectors.
Visual The library Facet (goo.gl/1rh1JN) is used to extract a set of visual features for each frame (sampled at 30 Hz), including 20 facial action units, 68 facial landmarks, head pose, gaze tracking, and HOG features Zhu et al. (2006).
Acoustic We use the COVAREP acoustic analysis framework Degottex et al. (2014) to extract a set of low-level acoustic features, including 12 Mel-frequency cepstral coefficients (MFCCs), pitch, voiced/unvoiced segmentation, glottal source, peak slope, and maxima dispersion quotient features.
4.3 Model Architecture
In order to compare our fusion method with previous work, we adopt a simple and straightforward model architecture for extracting unimodal representations (the source code of our model is available on GitHub at https://github.com/Justin1904/Low-rank-Multimodal-Fusion). Since we have three modalities for each dataset, we design three unimodal sub-embedding networks to extract unimodal representations from the unimodal input features. For the acoustic and visual modalities, the sub-embedding network is a simple 2-layer feed-forward neural network; for the language modality, we use an LSTM Hochreiter and Schmidhuber (1997) to extract representations. The model architecture is illustrated in Figure 1.

4.4 Baseline Models
We compare the performance of LMF to the following baselines and stateoftheart models in multimodal sentiment analysis, speaker trait recognition, and emotion recognition.
Support Vector Machines Support Vector Machines (SVM) Cortes and Vapnik (1995) is a widely used non-neural classifier. This baseline is trained on the concatenated multimodal features for classification or regression tasks Pérez-Rosas et al. (2013); Park et al. (2014a); Zadeh et al. (2016b).

Deep Fusion The Deep Fusion model (DF) Nojavanasghari et al. (2016) trains one deep neural model for each modality and then combines the outputs of the modality networks with a joint neural network.
Tensor Fusion Network The Tensor Fusion Network (TFN) Zadeh et al. (2017) explicitly models view-specific and cross-view dynamics by creating a multi-dimensional tensor that captures unimodal, bimodal, and trimodal interactions across the three modalities.
Memory Fusion Network The Memory Fusion Network (MFN) Zadeh et al. (2018a) accounts for view-specific and cross-view interactions, continuously modeling them through time with a special attention mechanism and summarizing them through time with a Multi-view Gated Memory.
Bidirectional Contextual LSTM The Bidirectional Contextual LSTM (BC-LSTM) performs context-dependent fusion of multimodal data.
Multi-View LSTM The Multi-View LSTM (MV-LSTM) Rajagopalan et al. (2016) aims to capture both modality-specific and cross-modality interactions from multiple modalities by partitioning the memory cell and the gates corresponding to multiple modalities.
Multi-attention Recurrent Network The Multi-attention Recurrent Network (MARN) Zadeh et al. (2018b) explicitly models interactions between modalities through time using a neural component called the Multi-attention Block (MAB), storing them in a hybrid memory called the Long-short Term Hybrid Memory (LSTHM).
4.5 Evaluation Metrics
Two types of evaluation tasks are performed in our experiments: multi-class classification and regression. The multi-class classification task is applied to all three multimodal datasets, and the regression task is applied to CMU-MOSI and POM. For binary and multi-class classification, we report the F1 score and the accuracy Acc-k, where k denotes the number of classes; in particular, Acc-2 stands for binary classification accuracy. For regression, we report Mean Absolute Error (MAE) and Pearson correlation (Corr). Higher values denote better performance for all metrics except MAE.
5 Results and Discussion
In this section, we present and discuss the results from the experiments designed to study the research questions introduced in section 4.
5.1 Impact of Lowrank Multimodal Fusion
In this experiment, we compare our model directly with the TFN model, since it has the structure most similar to ours except that TFN explicitly forms the multimodal fusion tensor. The comparison reported in the last two rows of Table 2 demonstrates that our model outperforms TFN across all datasets and metrics, emphasizing the advantage of Low-rank Multimodal Fusion.
5.2 Comparison with the Stateoftheart


Table 2: Comparison with baselines on CMU-MOSI, POM, and IEMOCAP.

Dataset |            CMU-MOSI            |        POM        |              IEMOCAP
Metric  | MAE   Corr  Acc-2 F1    Acc-7  | MAE   Corr  Acc   | F1-Happy F1-Sad F1-Angry F1-Neutral
SVM     | 1.864 0.057 50.2  50.1  17.5   | 0.887 0.104 33.9  | 81.5     78.8   82.4     64.9
DF      | 1.143 0.518 72.3  72.1  26.8   | 0.869 0.144 34.1  | 81.0     81.2   65.4     44.0
BC-LSTM | 1.079 0.581 73.9  73.9  28.7   | 0.840 0.278 34.8  | 81.7     81.7   84.2     64.1
MV-LSTM | 1.019 0.601 73.9  74.0  33.2   | 0.891 0.270 34.6  | 81.3     74.0   84.3     66.7
MARN    | 0.968 0.625 77.1  77.0  34.7   | -     -     39.4  | 83.6     81.2   84.2     65.9
MFN     | 0.965 0.632 77.4  77.3  34.1   | 0.805 0.349 41.7  | 84.0     82.1   83.7     69.2
TFN     | 0.970 0.633 73.9  73.4  32.1   | 0.886 0.093 31.6  | 83.6     82.8   84.2     65.4
LMF     | 0.912 0.668 76.4  75.7  32.8   | 0.796 0.396 42.8  | 85.8     85.9   89.0     71.7

We compare our model with the baselines and state-of-the-art models for sentiment analysis, speaker traits recognition, and emotion recognition. Results are shown in Table 2. LMF achieves competitive and consistent results across all datasets.
On the multimodal sentiment regression task, LMF outperforms the previous state-of-the-art model on MAE and Corr. Note that the multi-class accuracy is calculated by mapping the range of continuous sentiment values into a set of intervals that are used as discrete classes.
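One plausible bucketing scheme for such a mapping is sketched below; the exact rounding convention used in the experiments is an assumption here:

```python
import numpy as np

# Map a continuous sentiment score in [-3, 3] to one of 7 discrete
# classes for Acc-7 (rounding scheme is an assumption, not the paper's).
def to_class7(score):
    return int(np.clip(round(score), -3, 3)) + 3  # class index 0..6

assert to_class7(-3.0) == 0
assert to_class7(0.2) == 3
assert to_class7(2.7) == 6
```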
On the multimodal speaker traits recognition task, we report the average evaluation score over the 16 speaker traits. The results show that our model achieves state-of-the-art performance on all three evaluation metrics on the POM dataset.
On the multimodal emotion recognition task, our model achieves better F1 scores than the state-of-the-art models across all emotions. F1-emotion in the evaluation metrics denotes the F1 score for a given emotion class.
5.3 Complexity Analysis
Theoretically, the model complexity of our fusion method is $O(d_h \times r \times \sum_{m=1}^{M} d_m)$, compared to $O(d_h \prod_{m=1}^{M} d_m)$ for the TFN approach from Section 3.1. In practice, we calculate the total number of parameters used in each model under the same hyper-parameter settings. Under this setting, our model contains about 1.1e6 parameters, while TFN contains about 12.5e6 parameters, nearly 11 times more. Note that these numbers count not only the parameters of the multimodal fusion stage but also the parameters of the sub-networks.
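As a back-of-the-envelope illustration of the two growth rates, restricted to the fusion stage alone and using assumed dimensions rather than the experiments' exact settings:

```python
# Fusion-stage parameter counts under assumed dimensions (illustrative
# values only; the paper's exact hyper-parameters may differ).
d_a, d_v, d_l, d_h, r = 9, 17, 129, 64, 4

tfn_fusion_params = d_a * d_v * d_l * d_h        # full order-4 weight tensor
lmf_fusion_params = r * (d_a + d_v + d_l) * d_h  # modality-specific factors

print(tfn_fusion_params)  # 1263168: product of the modality dimensions
print(lmf_fusion_params)  # 39680: linear in the sum of the dimensions
```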


Table 3: Training and testing speeds of TFN and LMF.

Model | Training Speed (IPS) | Testing Speed (IPS)
TFN   | 340.74               | 1177.17
LMF   | 1134.82              | 2249.90

Furthermore, we evaluate the computational efficiency of LMF by measuring the training and testing speeds of LMF and TFN. Table 3 illustrates the impact of Low-rank Multimodal Fusion on the training and testing speeds compared with the TFN model. Here we set the rank to 4, since this generally achieves fairly competitive performance.
Based on these results, performing a lowrank multimodal fusion with modalityspecific lowrank factors significantly reduces the amount of time needed for training and testing the model. On an NVIDIA Quadro K4200 GPU, LMF trains with an average frequency of 1134.82 IPS (data point inferences per second) while the TFN model trains at an average of 340.74 IPS.
5.4 Rank Settings
To evaluate the impact of different rank settings on our LMF model, we measure the change in performance on the CMU-MOSI dataset while varying the rank. The results are presented in Figure 4. We observe that as the rank increases, training becomes more and more unstable, and that a very low rank is already enough to achieve fairly competitive performance.
6 Conclusion
In this paper, we introduced the Low-rank Multimodal Fusion method, which performs multimodal fusion with modality-specific low-rank factors. LMF scales linearly in the number of modalities and achieves competitive results across different multimodal tasks. Furthermore, LMF demonstrates a significant decrease in computational complexity from exponential to linear, and in practice it improves training and testing efficiency compared to TFN, which performs multimodal fusion with full tensor representations.
Future work on similar topics could explore applications of low-rank tensors to attention models over tensor representations, as these can be even more memory- and computation-intensive.
Acknowledgements
This material is based upon work partially supported by the National Science Foundation (Award # 1833355) and Oculus VR. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of National Science Foundation or Oculus VR, and no official endorsement should be inferred.
References
 Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation 42(4):335–359. https://doi.org/10.1007/s10579-008-9076-6.
 Chen et al. (1998) Lawrence S Chen, Thomas S Huang, Tsutomu Miyasato, and Ryohei Nakatsu. 1998. Multimodal human emotion/expression recognition. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on. IEEE, pages 366–371.
 Chen et al. (2017) Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI 2017, pages 163–171. https://doi.org/10.1145/3136755.3136801.
 Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297.
 De Silva et al. (1997) Liyanage C De Silva, Tsutomu Miyasato, and Ryohei Nakatsu. 1997. Facial emotion recognition using multimodal information. In Information, Communications and Signal Processing, 1997. ICICS., Proceedings of 1997 International Conference on. IEEE, volume 1, pages 397–401.
 Degottex et al. (2014) Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP — a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pages 960–964.
 Fukui et al. (2016) Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
 Hu et al. (2017) Guosheng Hu, Yang Hua, Yang Yuan, Zhihong Zhang, Zheng Lu, Sankha S Mukherjee, Timothy M Hospedales, Neil M Robertson, and Yongxin Yang. 2017. Attribute-enhanced face recognition with neural tensor fusion networks.
 Koch and Lubich (2010) Othmar Koch and Christian Lubich. 2010. Dynamical tensor approximation. SIAM Journal on Matrix Analysis and Applications 31(5):2360–2375.
 Lei et al. (2014) Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1381–1391.
 Morency et al. (2011) Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interactions. ACM, pages 169–176.
 Nojavanasghari et al. (2016) Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, pages 284–288.
 Park et al. (2014a) Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014a. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction. ACM, pages 50–57.
 Park et al. (2014b) Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014b. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI ’14, pages 50–57. https://doi.org/10.1145/2663204.2663260.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation.
 Pérez-Rosas et al. (2013) Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 973–982.
 Poria et al. (2016) Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, pages 439–448.

 Rajagopalan et al. (2016) Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision.
 Razenshteyn et al. (2016) Ilya Razenshteyn, Zhao Song, and David P Woodruff. 2016. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing. ACM, pages 250–263.
 Wang et al. (2016) Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. 2016. Select-additive learning: Improving cross-individual generalization in multimodal sentiment analysis. arXiv preprint arXiv:1609.05244.
 Wang and Ahuja (2008) Hongcheng Wang and Narendra Ahuja. 2008. A tensor approximation approach to dimensionality reduction. International Journal of Computer Vision 76(3):217–229.
 Wöllmer et al. (2013) Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and LouisPhilippe Morency. 2013. Youtube movie reviews: Sentiment analysis in an audiovisual context. IEEE Intelligent Systems 28(3):46–53.
 Wortwein and Scherer (2017) Torsten Wortwein and Stefan Scherer. 2017. What really matters—an information gain analysis of questions and reactions in automated ptsd screenings. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, pages 15–20.
 Yuan and Liberman (2008) Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the scotus corpus. Journal of the Acoustical Society of America 123(5):3878.
 Yuhas et al. (1989) Ben P Yuhas, Moise H Goldstein, and Terrence J Sejnowski. 1989. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine 27(11):65–71.

 Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Empirical Methods in Natural Language Processing, EMNLP.
 Zadeh et al. (2018a) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.
 Zadeh et al. (2018b) Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923.
 Zadeh et al. (2016a) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016a. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
 Zadeh et al. (2016b) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016b. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31(6):82–88.

 Zhu et al. (2006) Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, and Shai Avidan. 2006. Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. IEEE, volume 2, pages 1491–1498.