Multimodal deep learning is an active field of research in which a single event is described by information from multiple modalities, such as video, speech, and text, which can be combined to gain a better contextual understanding. Combining, or more precisely, fusing information from multiple modalities is thus a vital step for any multimodal task. However, multimodal data is highly heterogeneous in nature, making fusion a challenging task. Moreover, the extent to which signals from complementary modalities are helpful for a downstream task is not always clear.
The most common fusion technique used in the literature is concatenation, where representations from all the modalities are simply concatenated. However, this results in a shallow network (Ngiam et al., 2011), where the model focuses more on learning intra-modal features than inter-modal features. Later, Zadeh et al. (2017) proposed the tensor fusion network (TFN), in which the unimodal, bimodal, and trimodal interactions are modelled using a 3-fold Cartesian product. However, it imposes high computational requirements, as information from all the modalities is used as-is, without any prior information extraction. Liu et al. (2018) proposed a low-rank multimodal fusion technique (LMF) to address this problem. Such fusion techniques are effective but often result in a complex architecture with a lot of computation.
In this paper, we propose dynamic fusion techniques which allow the model to decide “how” to combine multimodal data for an event in the best possible manner. The first technique, transfusion, learns to compress multimodal information while preserving as much meaning as possible. Our second technique employs an adversarial network which regularizes the learned latent space for a target modality (text, in our case) according to information presented by the remaining complementary modalities. Since our models are generic in nature, the need to specify a pre-determined fusion operation such as concatenation or Cartesian product is alleviated, and the network is incentivized to model intermodal interactions by itself. Moreover, our models are based on lightweight components such as linear transformation layers, thereby avoiding unnecessary computational load.
We evaluate our models on three benchmark datasets, namely, How2 (Sanabria et al., 2018), Multi30K (Specia et al., 2016), and IEMOCAP (Busso et al., 2008). Quantitative evaluation shows that our models outperform the existing state-of-the-art methods. The rest of the paper is structured as follows: Section 2 covers relevant work, Section 3 discusses the proposed architectures in detail, Section 4 describes the experimental setup, Section 5 shows results, and Section 6 contains our concluding remarks.
Figure 1: (a) Transfusion: the video, speech, and text latent vectors are first concatenated. The concatenated vector is then passed through the transfusion layer, which outputs the “transfused” vector. Finally, we optimize the loss between the transfused vector (augmented with a null vector of appropriate shape) and the concatenated vector. (b) GAN for latent-space regularization: given the latent speech, video, and text vectors, we first transfuse the speech and video vectors to give a transfused vector. Simultaneously, we pass the text vector through the generator to give a generated vector. When training the generator, we minimize the loss between the generated vector and the transfused vector, and when training the discriminator, we use the generated and transfused vectors as the two different sources of input.
2 Related Work
Ngiam et al. (2011) train end-to-end deep graph neural networks to reconstruct missing modalities at inference time. They demonstrate that better features for one modality can be learned if relevant data from different modalities is available at training time; however, they employ simple concatenation for fusion. Hence, the joint representation learned is shallow and is not guaranteed to capture inter-modal connections. Their findings were later verified by Srivastava and Salakhutdinov (2012), who use a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009) to generate/map data from the image and text modalities.
Huang et al. (2018) construct a multilingual common semantic space to achieve better machine translation performance by extending correlation networks (Chandar et al., 2016). They use multiple non-linear transformations to repeatedly reconstruct sentences from one language to another and finally build a common semantic space for all the different languages. Fusion techniques, such as TFN (Zadeh et al., 2017) and LMF (Liu et al., 2018), were also proposed but the problem of efficiently modelling context in multimodal samples still remains unsolved.
|Model||Source modalities||BLEU 1||BLEU 2||BLEU 3||BLEU 4|
|Lal et al. (2019)||s-v-t||-||-||-||51.0|
|Raunak et al. (2019)||t||-||-||-||55.5|
|Wu et al. (2019)||s-v-t||-||-||-||56.2|
|Seq2Seq + attn.||t||79.21||67.34||52.67||47.34|
|Transfusion Net (Ours)||s-t||56.31||33.82||24.63||21.45|
|Transfusion Net + attn (Ours)||s-t||80.34||67.83||61.27||55.01|
|Adversarial Net (Ours)||s-t||60.65||37.43||30.01||28.87|
|Adversarial Net + attn (Ours)||s-t||82.25||69.43||64.33||56.5|
|Transfusion + attn (Ours)||42.31||61.7|
|Adv. Network (Ours)||44.23||63.8|
|LSTM + attn (t)||53.2||40.6||43.4||43.6|
|LSTM + attn ([s;t])||66.1||65.0||64.7||64.2|
|Transfusion + attn (Ours)||75.3||77.4||76.3||77.8|
|Adversarial + attn (Ours)||77.3||79.1||78.2||79.2|
3 Proposed methods
In this section, we describe the two methods developed for efficiently combining multimodal inputs.
Most fusion methods proposed in the past are either shallow, such as concatenation, where the network learns more intramodal features than intermodal features, or computationally expensive, such as tensor fusion (Zadeh et al., 2017), which performs no intelligent feature extraction before combining the modalities. In both cases, the fusion operation is specified by the user, and the network does not have the freedom to learn/relate intermodal features on its own.
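To make the cost difference concrete, the following minimal sketch compares the size of the fused representation under concatenation and under a TFN-style outer product. It assumes the TFN convention of extending each unimodal vector with a constant 1 before taking the outer product; the function names and the latent size of 256 per modality are illustrative choices, not taken from any library or from the experiments reported here.

```python
from itertools import product
from math import prod


def concat_dim(dims):
    """Size of the fused vector when unimodal vectors are concatenated."""
    return sum(dims)


def tensor_fusion_dim(dims):
    """Size of the fused tensor when each vector is extended with a
    constant 1 and all vectors are combined by an outer product."""
    return prod(d + 1 for d in dims)


def tensor_fusion(vectors):
    """Flattened outer product of 1-augmented vectors (toy sizes only)."""
    augmented = [list(v) + [1.0] for v in vectors]
    return [prod(combo) for combo in product(*augmented)]


if __name__ == "__main__":
    dims = [256, 256, 256]  # hypothetical latent sizes for three modalities
    print(concat_dim(dims))         # 768
    print(tensor_fusion_dim(dims))  # 16974593
```

The fused size grows additively under concatenation but multiplicatively under the outer product, which is the source of the computational cost discussed above.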
In order to mitigate the “staticness” of previous methods, we propose a dynamic yet simple fusion technique, called transfusion, where the model learns to extract intermodal features by itself. In this method, we first concatenate the latent vectors from different modalities, and then pass them through a transformation layer to get a transfused latent vector whose dimension is much lower than that of the concatenated input vector. We then minimize the Euclidean distance between the transfused vector and the previously obtained concatenated vector. Note that in order to do so, we need to augment the transfused vector with a null vector of appropriate shape to match the concatenated vector’s dimension. Augmentation with a null vector provides another important advantage: it ensures that the transformation layer is not arbitrarily outputting signals from the previously concatenated latent vector. Instead, it is incentivized to “compress” the information while losing as few important cues as possible. In other words, it increases the correlation between the transfused and the concatenated latent vectors. Such a method is applicable to any scenario where multiple features need to be combined. For example, it can be used to combine the forward and backward hidden states of an LSTM, instead of pooling methods such as 1D pooling, max pooling, or sum pooling, or even simple concatenation.
We now discuss the transfusion network in detail. We pose fusion of multimodal inputs as a compression problem, where we must retain as much information from the individual modalities as possible. Given $n$ ($n = 3$ in our case) $d$-dimensional multimodal latent vectors, $z_1, \dots, z_n$, we first concatenate them to obtain a vector, $z_k = z_1 \oplus \dots \oplus z_n$, where $k = nd$. Then, we apply a transformation, $T$, to $z_k$, reducing its dimension to $m$, with $m < k$. Finally, we calculate the loss, $J_{tr}$, between the transfused latent vector, $z_{tr} = T(z_k)$, and $z_k$. We use MSE as our loss function for this case. These steps can be followed in Figure 1(a), and the loss for the transfusion network is given by:

$$J_{tr} = \mathrm{MSE}\left(z_{tr} \oplus \mathbf{0}_{k-m},\; z_k\right)$$

Here $\oplus$ represents the concatenation operator and $\mathbf{0}_{k-m}$ is the null vector of dimension $k - m$.
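The forward pass described above (concatenate, transform to a lower dimension, pad with a null vector, compute MSE) can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation used in the experiments: the transformation is taken to be a single randomly initialized linear layer, the latent vectors are toy values, and all function and variable names are hypothetical.

```python
import random


def concat(vectors):
    """Concatenate equal-length latent vectors into one flat list."""
    return [x for v in vectors for x in v]


def init_linear(in_dim, out_dim, seed=0):
    """Random weights (out_dim rows of in_dim values) and a zero bias."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(weights, bias, x):
    """Apply y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def transfusion_loss(z_tr, z_k):
    """MSE between the zero-padded transfused vector and the concatenation."""
    padded = list(z_tr) + [0.0] * (len(z_k) - len(z_tr))
    return sum((p - q) ** 2 for p, q in zip(padded, z_k)) / len(z_k)


# Toy latent vectors for three modalities (4 values each, 12 after concatenation).
z_video = [0.5, -1.0, 0.25, 0.0]
z_speech = [1.0, 0.0, -0.5, 0.75]
z_text = [-0.25, 0.5, 1.0, -1.0]

z_k = concat([z_video, z_speech, z_text])
weights, bias = init_linear(in_dim=len(z_k), out_dim=4)
z_tr = linear(weights, bias, z_k)   # transfused vector of lower dimension
loss = transfusion_loss(z_tr, z_k)  # quantity that training would minimize
```

Only the forward computation is shown; in a real model the layer's parameters would be updated by gradient descent as part of end-to-end training.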
3.2 GAN for Latent Space Regularization
In addition to the “staticness” of existing methods, there is also the challenge of distinguishing between ambiguous cases. For instance, the sentence “Kevin, this is hilarious,” could be said in a funny, sarcastic, or angry manner. Currently, the existing methods, even when fed with the corresponding speech vector, cannot efficiently distinguish between similar but different emotions such as the sarcastic and the angry setting. We hypothesize that this is due to the fact that they are not learning the conditional distribution of sentiment given an utterance (an utterance includes input from all available modalities).
In order to mitigate this issue, we propose an adversarial training regime that is incentivized to learn the desired conditional distribution. For a task such as emotion recognition, this would be sentiment given an utterance, and for a more challenging generation task, the model could learn to model a more complex behaviour, such as the association of different sentences based on how similar they sound, in addition to their polarity. We show in our experiments that our GAN-based approach is better able to relate intermodal features as compared to the existing methods.
We now describe the GAN-based architecture in detail. For a given multimodal sample $x$, we first encode the inputs from each modality (speech, visual and text) to get the respective latent vectors, $z_s$, $z_v$, and $z_t$. Fixing a target modality (text in our case), we pass $z_t$ through a generator $G$ to obtain $z_g = G(z_t)$, and simultaneously transfuse the remaining latent vectors, $z_s$ and $z_v$, to obtain $z_{tr}$. In the event where we have input from only one modality in addition to text, we do not need any transfusion, and can simply treat the other modality’s vector as $z_{tr}$. Finally, we train the network in adversarial fashion, labelling $z_{tr}$ as positive samples and $z_g$ as negative samples. The adversarial loss, $J_{adv}$, is given below:

$$J_{adv} = \min_G \max_D \; \mathbb{E}_{z_{tr}}\left[\log D(z_{tr})\right] + \mathbb{E}_{z_t}\left[\log\left(1 - D(G(z_t))\right)\right]$$

Here $D$ denotes the discriminator.
Overall, the generator is incentivized to align features of the target modality (text, in our case) with features from the complementary modalities (speech and video, in our case), and the discriminator tries to identify the type of its input. Learning the latent space in such an adversarial manner induces a clustering effect on the latent space, where texts attached to similar sounds and visuals are grouped together.
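The two training signals can be illustrated with a small sketch of the loss computations. It assumes the standard minimax GAN objective with a logistic discriminator; the discriminator form, the function names, and the toy vectors are all hypothetical and chosen only to show which vectors act as positive and negative samples.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def discriminator(weights, bias, z):
    """Logistic discriminator: probability that z is a transfused vector."""
    return sigmoid(sum(w * zi for w, zi in zip(weights, z)) + bias)


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with transfused vectors as positives
    and generated vectors as negatives."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, eps=1e-12):
    """Minimax generator objective: lower when D scores the generated
    vector as more likely to be a transfused one."""
    return math.log(1.0 - d_fake + eps)


# Toy vectors: transfused speech/video features and a generated text vector.
z_tr = [0.2, -0.4, 0.9]
z_g = [0.1, 0.3, -0.7]
weights, bias = [0.5, -0.5, 1.0], 0.0

d_real = discriminator(weights, bias, z_tr)
d_fake = discriminator(weights, bias, z_g)
d_loss = discriminator_loss(d_real, d_fake)  # minimized when updating D
g_loss = generator_loss(d_fake)              # minimized when updating G
```

Training would alternate between the two updates, so that the generated text vectors gradually become harder to tell apart from the transfused vectors.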
4 Experimental Setup
We evaluate our methods on the tasks of machine translation and emotion recognition. In this section we describe the datasets used and the details of the training process.
4.1.1 Emotion Recognition: IEMOCAP
We use the IEMOCAP dataset (Busso et al., 2008) released by researchers from the University of Southern California (USC). It contains five recorded sessions of conversations from ten speakers and amounts to nearly 12 hours of audio-visual information along with transcriptions. It is annotated with eight categorical emotion labels, namely, angry, happy, sad, neutral, surprised, fear, frustrated and excited. It also contains dimensional labels such as values of the activation and valence from 1 to 5; however, they are not used in this work. The dataset is already split into multiple utterances for each session and we further split each utterance file to obtain wav files for each sentence. This was done using the start timestamp and end timestamp provided for the transcribed sentences. This results in a total of 10K audio files which are then used to extract features.
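The sentence-level splitting step can be sketched with Python's standard `wave` module. This is an illustrative sketch that assumes uncompressed PCM wav input and timestamps given in seconds; the function name and the file names in the usage comment are hypothetical and do not reflect the actual IEMOCAP file layout.

```python
import wave


def slice_wav(src_path, dst_path, start_sec, end_sec):
    """Copy the [start_sec, end_sec) span of a PCM wav file into a new file.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp both boundaries to the length of the recording.
        start_frame = min(int(start_sec * rate), total)
        end_frame = min(int(end_sec * rate), total)
        n_frames = max(end_frame - start_frame, 0)
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # header frame count is corrected on close
    return n_frames


# Hypothetical usage: cut one transcribed sentence out of an utterance file.
# slice_wav("utterance.wav", "sentence_001.wav", start_sec=6.29, end_sec=8.24)
```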
We identify the task as an emotion recognition problem, where, given a sentence and its audio, we wish to predict the correct emotion.
4.1.2 Machine Translation: How2
We evaluate our models on the multimodal How2 dataset (Sanabria et al., 2018), which comprises 79,114 instructional videos in addition to word-level time alignments to the ground-truth English subtitles and their respective crowd-sourced Portuguese translations. A brief description of the video clip is also included to encourage future work on image captioning. This dataset was created by scraping videos along with their metadata from YouTube using a keyword-based spider, and manually extracting and processing visual, auditory and textual features.
Unlike other popular multimodal datasets that are frequently featured in multimodal deep learning literature, such as CUAVE (Patterson et al., 2002) and AVLetters (Matthews et al., 2002), the How2 data is in fact trimodal, which makes it suitable for evaluating the contribution of each modality towards different tasks. The speech features are 43-dimensional vectors extracted from 16 kHz raw speech using the Kaldi toolkit (https://github.com/kaldi-asr/kaldi). A 2048-dimensional video feature vector is also derived per 16 frames in each video. Further, as a large-scale multilingual dataset, it provides a convenient testbed for neural machine translation in our project.
4.1.3 Machine Translation: Multi30K
We also run experiments on the Multi30K dataset (Specia et al., 2016). The dataset contains pairs of sentences in English and several other languages, such as French, German, and Czech. Each sample in the dataset has an image, its description in the source language, and the translated version of that description. We only run experiments on the En-Fr version of the dataset.
4.2 Hyperparameters and Training Details
We use bidirectional LSTM units (Hochreiter and Schmidhuber, 1997) of size 256 to encode text and a unidirectional LSTM of size 256 as the decoder. We preprocess the text by lowercasing all words and removing punctuation and non-ASCII characters. We train a word2vec model on all our datasets with embeddings of dimension 300. We finetune a pre-trained VGG (Simonyan and Zisserman, 2014) to encode images in the Multi30K dataset. For experiments on the How2 dataset, we use the already provided feature vectors for speech and video.
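The text normalization step can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, non-ASCII characters are simply dropped rather than transliterated (the text does not say which was done), and punctuation is taken to mean the characters in Python's `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(sentence):
    """Lowercase, drop non-ASCII characters and punctuation, squeeze spaces."""
    lowered = sentence.lower()
    ascii_only = lowered.encode("ascii", errors="ignore").decode("ascii")
    no_punct = ascii_only.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


if __name__ == "__main__":
    print(normalize_text("Kevin, this is HILARIOUS!"))  # kevin this is hilarious
```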
For training, the desired fusion network is used before the final step of each of the tasks, i.e., before the final classification layer for emotion recognition, and before the decoder for machine translation. All the models are trained in an end-to-end manner.
For the relatively easier task of emotion recognition, we observe that our models perform well across all the evaluation metrics. For the more difficult task of machine translation, we note that our best performing model beats the existing methods in terms of BLEU scores, despite being much lighter than the previous models.
6 Conclusions and Future Work
In this paper, we proposed two dynamic fusion techniques that allow for better multimodal fusion with the added advantage of being very lightweight. Instead of using a pre-defined fusion operation, we let the model decide the best way to extract signals from multiple modalities. Our results indicate that such adaptive models perform better and are computationally more efficient for the given tasks.
One interesting aspect to study would be that of multimodal feature alignment that could help reduce the heterogeneity in multimodal inputs.
References
- Busso et al. (2008). IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335.
- Chandar et al. (2016). Correlational neural networks. Neural Computation 28 (2), pp. 257–285.
- Cortes and Vapnik (1995). Support-vector networks. Machine Learning 20 (3), pp. 273–297.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Huang et al. (2018). Multi-lingual common semantic space construction via cluster-consistent word embedding. arXiv preprint arXiv:1804.07875.
- Liu et al. (2018). Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.
- Matthews et al. (2002). Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, pp. 198–213.
- Comparison of classifiers for lip reading with CUAVE and TULIPS database. Optik - International Journal for Light and Electron Optics 126 (24), pp. 5753–5761.
- Ngiam et al. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696.
- Patterson et al. (2002). CUAVE: a new audio-visual database for multimodal human-computer interface research. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. II–2017.
- Salakhutdinov and Hinton (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455.
- Sanabria et al. (2018). How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Specia et al. (2016). A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 543–553.
- Zadeh et al. (2017). Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.