Maximum Likelihood Estimation for Multimodal Learning with Missing Modality

08/24/2021
by   Fei Ma, et al.
MIT
Tsinghua University

Multimodal learning has achieved great success in many scenarios. Compared with unimodal learning, it can effectively combine information from different modalities to improve the performance of learning tasks. In reality, multimodal data may have missing modalities due to various reasons, such as sensor failure and data transmission error. In previous works, the information of the modality-missing data has not been well exploited. To address this problem, we propose an efficient approach based on maximum likelihood estimation to incorporate the knowledge in the modality-missing data. Specifically, we design a likelihood function to characterize the conditional distribution of the modality-complete data and the modality-missing data, which is theoretically optimal. Moreover, we develop a generalized form of the softmax function to effectively implement maximum likelihood estimation in an end-to-end manner. This training strategy guarantees the computability of our algorithm. Finally, we conduct a series of experiments on real-world multimodal datasets. Our results demonstrate the effectiveness of the proposed approach, even when 95% of the training data has missing modalities.


1 Introduction

Multimodal learning is an important research area, which builds models to process and relate information across different modalities Baltrušaitis et al. (2018). Compared with unimodal learning, multimodal learning can effectively utilize multimodal data to achieve better performance. It has been successfully used in many applications, such as multimodal emotion recognition Soleymani et al. (2011), multimedia event detection Gan et al. (2015), and visual question answering Antol et al. (2015). With the emergence of big data, multimodal learning is becoming increasingly important for combining multimodal data from different sources.

A number of previous works Tzirakis et al. (2017); Zhang et al. (2017); Elliott et al. (2017); Kim et al. (2020); Zhang et al. (2020) have achieved great success based on complete observations during the training process. In practice, however, multimodal data may have missing modalities Du et al. (2018); Ma et al. (2021a, b). This can happen for various reasons, for instance, when the sensor that collects the multimodal data is damaged or the network transmission fails. Examples of such multimodal data are shown in Figure 1.

In the past years, researchers have proposed a few approaches to deal with missing modalities. A simple and typical way Hastie et al. (2009) is to directly discard the data with missing modalities. Since the information contained in the modality-missing data is neglected, such a method often has limited performance. There are also approaches that heuristically combine the information of the modality-missing data Ma et al. (2021b); Tran et al. (2017); Chen and Zhang (2020); Liu et al. (2021). However, most of these works lack theoretical explanations, and these empirical methods are often implemented in multiple training stages rather than in an end-to-end manner, so the information of the modality-missing data is still not well exploited.

To tackle this problem, we propose an efficient approach based on maximum likelihood estimation to effectively utilize the modality-missing data. To be specific, we present a likelihood function to characterize the conditional distribution of the modality-complete data and the modality-missing data, which is theoretically optimal. Furthermore, we adopt a generalized form of the softmax function to efficiently implement our maximum likelihood estimation algorithm. This training strategy makes our framework computable in an end-to-end scheme. In this way, our approach can effectively leverage the information of the modality-missing data during the learning process, with higher efficiency than previous works. Finally, we perform several experiments on real-world multimodal datasets, including eNTERFACE’05 Martin et al. (2006) and RAVDESS Livingstone and Russo (2018). The results show the effectiveness of our approach in handling missing modalities. To summarize, our contribution is three-fold:

  • We design a likelihood function to learn the conditional distribution of the modality-complete data and the modality-missing data, which is theoretically optimal.

  • We develop a generalized form of the softmax function to implement our maximum likelihood estimation framework in an end-to-end manner, which is more effective than previous works.

  • We conduct a series of experiments on real-world multimodal datasets. The results validate the effectiveness of our approach, even when 95% of the training data has missing modality.

Figure 1: Examples of the multimodal data: (a) complete observations, (b) observations which may have missing visual modality, and (c) observations which may have missing audio modality.

2 Methodology

Our goal is to handle modality missing based on maximum likelihood estimation for effective multimodal learning. In the following, we first introduce the problem formulation, and then describe the details of our framework.

2.1 Problem Formulation

In this paper, without loss of generality, we consider multimodal data with two modalities. We denote the random variables corresponding to these two modalities by $X$ and $Y$, and their category label by $Z$. In the training process, we assume that there are two independently observed datasets: a modality-complete dataset and a modality-missing dataset. We use $\mathcal{D}_c = \{(x_i, y_i, z_i)\}_{i=1}^{n_c}$ to represent the modality-complete dataset, where $x_i$ and $y_i$ represent the two modalities of the $i$-th sample of $\mathcal{D}_c$, $z_i$ is their corresponding category label, and the size of $\mathcal{D}_c$ is $n_c$. We then use $\mathcal{D}_m = \{(\bar{x}_j, \bar{z}_j)\}_{j=1}^{n_m}$ to represent the modality-missing dataset, in which modality $Y$ is unobserved and whose size is $n_m$. In addition, we adopt $x^{n_c}$ to represent the sequence $(x_1, \ldots, x_{n_c})$; $y^{n_c}$, $z^{n_c}$, $\bar{x}^{n_m}$, and $\bar{z}^{n_m}$ are expressed in the same way. The multimodal data of $\mathcal{D}_c$ and $\mathcal{D}_m$ are assumed to be i.i.d. generated from an underlying distribution $P_{XYZ}$. By utilizing the knowledge of the modality-complete data and the modality-missing data, we hope our framework can predict the category labels correctly.

Figure 2: Our proposed system for multimodal learning with missing modality. In the training process, we propose a log-likelihood function, as shown in Equation (2), to learn the conditional distributions of the modality-complete data and the modality-missing data. By developing a generalized form of the softmax function, we implement our maximum likelihood estimation algorithm in an end-to-end manner, which has high efficiency.

2.2 Maximum Likelihood Estimation for Missing Modality

In this section, we first present how to design a likelihood function to learn the conditional distribution of the modality-complete data and the modality-missing data. Then, we show how adopting a generalized form of the softmax function leads to a training strategy that effectively implements our algorithm.

2.2.1 Likelihood Function Analyses

Maximum likelihood estimation is a statistical method that uses the observed data to estimate the underlying distribution by maximizing the likelihood function; the estimated distribution makes the observed data most likely Myung (2003). With this idea, we study the likelihood function on the datasets $\mathcal{D}_c$ and $\mathcal{D}_m$. For the classification task, the conditional likelihood is commonly used. Inspired by this, we analyze the conditional likelihood, which can be represented as:

$$P\big(z^{n_c}, \bar{z}^{n_m} \,\big|\, x^{n_c}, y^{n_c}, \bar{x}^{n_m}\big) \overset{(a)}{=} P\big(z^{n_c} \,\big|\, x^{n_c}, y^{n_c}\big)\, P\big(\bar{z}^{n_m} \,\big|\, \bar{x}^{n_m}\big) \overset{(b)}{=} \prod_{i=1}^{n_c} P_{Z|XY}(z_i \mid x_i, y_i) \prod_{j=1}^{n_m} P_{Z|X}(\bar{z}_j \mid \bar{x}_j), \qquad (1)$$

where step (a) follows from the fact that the datasets $\mathcal{D}_c$ and $\mathcal{D}_m$ are observed independently, and step (b) is due to the samples in each dataset being i.i.d. and $P_{Z|XY}$ and $P_{Z|X}$ being conditional distributions of $P_{XYZ}$. In this way, we obtain the likelihood function using the information of $\mathcal{D}_c$ and $\mathcal{D}_m$. Then, we use the negative log-likelihood as the loss function $L$ to train our deep learning network, i.e.,

$$L = -\sum_{i=1}^{n_c} \log P_{Z|XY}(z_i \mid x_i, y_i) - \sum_{j=1}^{n_m} \log P_{Z|X}(\bar{z}_j \mid \bar{x}_j). \qquad (2)$$

It is worth noting that maximum likelihood estimation is proved in Daniels (1961) to be an asymptotically efficient strategy. Therefore, our method has a theoretical guarantee of optimality when dealing with missing modalities.

To optimize $L$, we use deep neural networks to extract $d$-dimensional feature representations from the observations $x$, $y$, and $z$, which are represented as $f(x)$, $g(y)$, and $h(z)$, respectively. We then utilize these features to learn $P_{Z|XY}$ and $P_{Z|X}$ in $L$. Our framework is shown in Figure 2.

In this way, we obtain the log-likelihood objective behind the loss $L$ in Equation (2). By characterizing the conditional distributions of the modality-complete data and the modality-missing data, it efficiently leverages the underlying structural information of the multimodal data, which constitutes the theoretical basis of our framework.
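As an illustration of this setup, the sketch below shows one way the three feature extractors could be built in PyTorch, assuming ResNet-50 backbones for the two modalities (as in Section 3.4) and a fully connected layer on one-hot labels for the label features. This is our own minimal sketch, not the authors' released code; FEATURE_DIM, NUM_CLASSES, and all class names are placeholder choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

FEATURE_DIM = 128   # d, the common feature dimension (our choice)
NUM_CLASSES = 6     # six basic emotions

class ModalityEncoder(nn.Module):
    """Maps one image-like modality (frame or spectrogram) to a d-dimensional feature."""
    def __init__(self, dim=FEATURE_DIM):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.net = backbone

    def forward(self, x):          # x: (batch, 3, H, W)
        return self.net(x)         # (batch, d)

class LabelEncoder(nn.Module):
    """Maps a one-hot label to a d-dimensional label feature h(z)."""
    def __init__(self, num_classes=NUM_CLASSES, dim=FEATURE_DIM):
        super().__init__()
        self.fc = nn.Linear(num_classes, dim)

    def forward(self, z_onehot):   # (batch, num_classes)
        return self.fc(z_onehot)

# f extracts features from the visual frames, g from the audio spectrograms,
# and h from the one-hot labels.
f, g, h = ModalityEncoder(), ModalityEncoder(), LabelEncoder()
```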

2.2.2 Maximum Likelihood Estimation Implementation

However, the log-likelihood function in Equation (2) cannot be used directly, mainly due to two facts. Firstly, the high-dimensional data are difficult to represent and model. Secondly, since $P_{Z|XY}$ and $P_{Z|X}$ in $L$ are related, it is difficult to design models that learn their relationship. To address these two issues, we develop a generalized form of the softmax function to describe $\tilde{P}_{XYZ}$ as follows:

$$\tilde{P}_{XYZ}(x, y, z) = \frac{P_X(x)\, P_Y(y)\, P_Z(z)\, \exp\!\big(k(f(x), g(y))^\top h(z)\big)}{\sum_{x', y', z'} P_X(x')\, P_Y(y')\, P_Z(z')\, \exp\!\big(k(f(x'), g(y'))^\top h(z')\big)}, \qquad (3)$$

where $P_X$, $P_Y$, and $P_Z$ represent the empirical distributions obtained by using all observed samples of the variables $X$, $Y$, and $Z$, respectively, and $k(\cdot, \cdot)$ represents the function used to fuse the features $f(x)$ and $g(y)$. We study three forms of $k$ to investigate its effect in our framework, as shown in Figure 3.

In this way, we model the underlying distribution $P_{XYZ}$ by adopting a generalized form of the softmax function, which has the following two benefits. Firstly, by depicting the representation of $\tilde{P}_{XYZ}$, we can further deduce $\tilde{P}_{Z|XY}$ and $\tilde{P}_{Z|X}$ directly, which guarantees that our algorithm can be effectively implemented in an end-to-end manner. Secondly, it avoids specifying the marginal distributions $P_{XY}$ and $P_X$ and modeling the relationship between them. In fact, it is hard to compute these marginalized distributions, since the correlation between the high-dimensional data can be rather complex. In addition, it has been shown in Xu et al. (2018) that, for the case with two random variables, the generalized version of the softmax function we adopt is equivalent to the standard softmax function.

The conditional distributions $\tilde{P}_{Z|XY}$ and $\tilde{P}_{Z|X}$ can then be easily obtained from Equation (3) as:

$$\tilde{P}_{Z|XY}(z \mid x, y) = \frac{P_Z(z)\, \exp\!\big(k(f(x), g(y))^\top h(z)\big)}{\sum_{z'} P_Z(z')\, \exp\!\big(k(f(x), g(y))^\top h(z')\big)} \qquad (4)$$

and

$$\tilde{P}_{Z|X}(z \mid x) = \frac{\sum_{y} P_Y(y)\, P_Z(z)\, \exp\!\big(k(f(x), g(y))^\top h(z)\big)}{\sum_{y, z'} P_Y(y)\, P_Z(z')\, \exp\!\big(k(f(x), g(y))^\top h(z')\big)}. \qquad (5)$$

It is worth pointing out that when we compute $\tilde{P}_{Z|X}$ in Equation (5), we need the information of modality $Y$. Since modality $Y$ of the dataset $\mathcal{D}_m$ is missing in the training process, we query all possible values of modality $Y$ observed in $\mathcal{D}_c$ to compute the sums over $y$. This can be regarded as using the modality-complete dataset to complement the modality-missing dataset.

We then plug Equation (4) and Equation (5) into Equation (2). In this way, we can use neural networks to learn the features $f(x)$, $g(y)$, and $h(z)$ from $\mathcal{D}_c$ and $\mathcal{D}_m$ for the classification task, without needing to complement the data before performing classification. Additionally, our objective function has a unified structure: unlike previous works Ma et al. (2021a, b), it does not introduce hyperparameters that need to be manually adjusted. These factors make the implementation of our approach more efficient than previous methods.
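To make the computation concrete, here is a minimal PyTorch sketch of the loss obtained by plugging Equations (4) and (5) into Equation (2). It is our own reconstruction, not the authors' implementation: it assumes the form of Equation (3) given above, i.e., that the fused modality features interact with the label features through an inner product, that the empirical label distribution enters as a log prior, and that the sum over the missing modality $y$ in Equation (5) runs over the $y$ samples of the current modality-complete batch (which act as a uniform empirical $P_Y$). All function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mle_missing_modality_loss(fx_c, fy_c, z_c, fx_m, z_m, H, log_pz, fuse):
    """Negative log-likelihood of Eq. (2) for one batch (our reconstruction).

    fx_c, fy_c : (n_c, d)  features f(x), g(y) of the modality-complete batch
    z_c        : (n_c,)    integer labels of the complete batch
    fx_m       : (n_m, d)  features f(x) of the modality-missing batch (Y missing)
    z_m        : (n_m,)    integer labels of the modality-missing batch
    H          : (C, D)    label features h(z) for all C classes; D must match
                           the output dimension of `fuse`
    log_pz     : (C,)      log of the empirical label distribution P_Z
    fuse       : callable  the fusion k(., .), e.g. elementwise addition
    """
    # Eq. (4): P(z | x, y) is a softmax over classes of k(f(x), g(y))^T h(z) + log P_Z(z).
    logits_c = fuse(fx_c, fy_c) @ H.t() + log_pz             # (n_c, C)
    loss_c = F.cross_entropy(logits_c, z_c, reduction="sum")

    # Eq. (5): P(z | x) marginalizes the missing modality Y over the y's
    # observed in the complete batch (the uniform empirical P_Y cancels out).
    n_c = fy_c.shape[0]
    fused = fuse(fx_m.unsqueeze(1).expand(-1, n_c, -1),      # (n_m, n_c, d)
                 fy_c.unsqueeze(0).expand(fx_m.shape[0], -1, -1))
    scores = fused @ H.t() + log_pz                          # (n_m, n_c, C)
    log_num = torch.logsumexp(scores, dim=1)                 # sum over y      -> (n_m, C)
    log_den = torch.logsumexp(log_num, dim=1, keepdim=True)  # sum over y, z'  -> (n_m, 1)
    log_p_z_given_x = log_num - log_den                      # (n_m, C)
    loss_m = -log_p_z_given_x.gather(1, z_m.unsqueeze(1)).sum()

    return loss_c + loss_m
```

With a single objective of this form, the feature extractors and the fusion function can be optimized jointly by standard gradient descent, without a separate imputation stage.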

Figure 3: Three forms of $k(f(x), g(y))$ are studied: (a) addition, i.e., $f(x) + g(y)$, (b) concatenation, i.e., $[f(x); g(y)]$, and (c) outer product, i.e., $\mathrm{vec}\big(f(x)\, g(y)^\top\big)$, where $\mathrm{vec}(\cdot)$ represents the vectorization of the outer product.
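For concreteness, the three fusion forms in Figure 3 can be written as below (a sketch with our own function names); note that their output dimensions differ ($d$, $2d$, and $d^2$, respectively), so the label feature $h(z)$ has to be sized to match the chosen fusion.

```python
import torch

def fuse_add(fx, gy):
    """(a) addition: k(f(x), g(y)) = f(x) + g(y); output dimension d."""
    return fx + gy

def fuse_concat(fx, gy):
    """(b) concatenation: k(f(x), g(y)) = [f(x); g(y)]; output dimension 2d."""
    return torch.cat([fx, gy], dim=-1)

def fuse_outer(fx, gy):
    """(c) outer product: k(f(x), g(y)) = vec(f(x) g(y)^T); output dimension d*d."""
    outer = fx.unsqueeze(-1) * gy.unsqueeze(-2)   # (..., d, d)
    return outer.flatten(start_dim=-2)            # (..., d*d)
```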

3 Experiments

In this section, we first describe the real-world multimodal datasets used in our experiment, then explain the data preprocessing and baseline methods, and finally give the experimental results to show the effectiveness of our approach.

3.1 Datasets

We perform experiments on two public real-world multimodal datasets: eNTERFACE’05 Martin et al. (2006) and RAVDESS Livingstone and Russo (2018). eNTERFACE’05 is an audio-visual emotion database in English. It contains 42 subjects eliciting the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. There are 213 samples for happiness, and 216 samples for each of the remaining emotions. Each recorded sample is in the video form, where the frame rate is 25 frames per second and the audio sampling rate is 48 kHz.

RAVDESS is a multimodal database of emotional speech and song, which consists of 24 professional actors in a neutral North American accent. Here, we use the speech part, which includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each recording is also in the video form. The frame rate is 30 frames per second. The audio is sampled at 48 kHz. Similar to the eNTERFACE’05 dataset, we only consider six basic emotions, each of which has 192 samples.

3.2 Data Preprocessing and Experimental Settings

We perform data preprocessing on these two datasets. We split each video sample into segments of the same length: for each video sample, we obtain 30 segments, each with a duration of 0.5 seconds. We then extract visual and audio data from these segments. For the visual data, we take the central frame of each segment. For the audio data, we extract the log Mel-spectrogram of each segment; this spectrogram representation is similar to an RGB image. We feed these segmented data into our model to obtain segment-level classification results, and then average the results of all segments belonging to the same video to predict the video-level category label.
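A rough sketch of this preprocessing is given below (our own code, not the authors' pipeline): it cuts a clip into 0.5-second segments, keeps the central frame of each segment as the visual data, and computes a log Mel-spectrogram per segment with librosa as the audio data. The n_fft, hop_length, and n_mels values are placeholder choices of ours.

```python
import numpy as np
import librosa

SEG_SECONDS = 0.5   # each video is split into 0.5-second segments

def preprocess_clip(frames, fps, audio, sr, n_mels=64):
    """Return (central_frames, log_mel_specs) for the segments of one clip.

    frames : sequence of video frames (H, W, 3), sampled at `fps`
    audio  : 1-D waveform sampled at `sr` (48 kHz for both datasets)
    """
    frames_per_seg = int(round(SEG_SECONDS * fps))
    samples_per_seg = int(round(SEG_SECONDS * sr))
    n_segments = min(len(frames) // frames_per_seg, len(audio) // samples_per_seg)

    central_frames, log_mels = [], []
    for s in range(n_segments):
        # Visual data: the central frame of the segment.
        f0 = s * frames_per_seg
        central_frames.append(frames[f0 + frames_per_seg // 2])

        # Audio data: log Mel-spectrogram of the segment (image-like 2-D array).
        a0 = s * samples_per_seg
        seg_audio = audio[a0:a0 + samples_per_seg]
        mel = librosa.feature.melspectrogram(y=seg_audio, sr=sr,
                                             n_fft=1024, hop_length=256,
                                             n_mels=n_mels)
        log_mels.append(librosa.power_to_db(mel, ref=np.max))

    return np.stack(central_frames), np.stack(log_mels)
```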

On each dataset, we split all data into three parts: training set, validation set, and test set, with proportions of 70%, 15%, and 15%. The cases of incomplete audio modality and incomplete visual modality are studied separately. In these two scenarios, we investigate the following missing rates: 50%, 80%, 90%, and 95%. It is worth noting that, to the best of our knowledge, we are the first to consider settings with such high missing rates. We conduct experiments under different conditions to verify that our approach generalizes well in dealing with incomplete modalities, whereas previous works only assume that a certain modality is incomplete. Following Du et al. (2018); Yu et al. (2020); Du et al. (2021), we assume that the test data is modality-complete in the inference phase. Therefore, we can directly use Equation (4) to predict the class label of the given test data. Finally, we run each experiment five times and report the average test accuracy to evaluate the performance of our method and the baseline methods. All experiments are implemented in PyTorch Paszke et al. (2019) on an NVIDIA TITAN V GPU.
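The modality-missing training set can be simulated by randomly choosing a fraction of the training samples equal to the missing rate and dropping one modality from them; a small sketch of this split (our own, with a hypothetical function name) is shown below.

```python
import numpy as np

def split_by_missing_rate(n_train, missing_rate, seed=0):
    """Return indices of modality-complete and modality-missing training samples.

    A `missing_rate` fraction of the training samples has one modality
    (audio or visual, depending on the scenario) removed.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_train)
    n_missing = int(round(missing_rate * n_train))
    missing_idx = perm[:n_missing]      # these samples keep only one modality
    complete_idx = perm[n_missing:]     # these samples keep both modalities
    return complete_idx, missing_idx

# Example: 95% of the training data has a missing modality.
complete_idx, missing_idx = split_by_missing_rate(n_train=1000, missing_rate=0.95)
```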

3.3 Baseline Methods

To show the effectiveness of our method, we compare our approach with the following methods which can also handle missing modalities to some extent.

  • Discarding Modality-incomplete Data (Lower Bound): One of the simplest strategies to handle missing modalities is to directly discard the modality-incomplete data and only use the modality-complete data for the classification task. This method does not use the information of the data with missing modalities. In our maximum likelihood estimation model, this is equivalent to optimizing only the $P_{Z|XY}$ term of Equation (2) and discarding the $P_{Z|X}$ term. Therefore, this method can also be used as an ablation study of our method.

  • Hirschfeld-Gebelein-Rényi (HGR) Maximal Correlation Hirschfeld (1935); Gebelein (1941); Rényi (1959): HGR maximal correlation is a multimodal learning method based on the statistical dependence between different modalities. It has been successfully used for semi-supervised learning Ma et al. (2021a, 2020); Wang et al. (2019). Here, we further use it to deal with incomplete modalities. For the data in $\mathcal{D}_c$, we extract the maximal correlation between $X$, $Y$, and $Z$. For the data in $\mathcal{D}_m$, we extract the maximal correlation between $X$ and $Z$.

  • Zero Padding (ZP): Padding the feature representations of the missing modality with zeros is a widely used way to cope with incomplete modalities Jo et al. (2019); Chen et al. (2020); Shen et al. (2020). For this method, we consider two forms of $k$ to fuse the features $f(x)$ and $g(y)$: addition and concatenation. The outer product is not studied here because, if the feature of one modality is zero, its outer product with the non-zero feature of the other modality is also zero, which would make the modality-missing data useless. A minimal sketch of this baseline is given after this list.

  • Autoencoder1 (AE1): The autoencoder is an architecture for learning feature representations from training data in an unsupervised way. Some previous approaches use autoencoders to complement missing modalities Tran et al. (2017); Liu et al. (2021); Jaques et al. (2017); Pereira et al. (2020). Following these works, on the modality-complete dataset $\mathcal{D}_c$, we use modality $X$ as the input of the autoencoder to reconstruct modality $Y$. Then we use the trained autoencoder to predict the missing modality $Y$ on $\mathcal{D}_m$ and impute the training data, and finally use the imputed data to perform the classification task. It is worth noting that this autoencoder-based treatment of missing modalities requires several training stages, while our method is end-to-end.

  • Autoencoder2 (AE2): In AE1, the data of $\mathcal{D}_m$ are not involved in the training process of the autoencoder. Inspired by the self-training approach Yarowsky (1995); McClosky et al. (2006), in each iteration we predict the missing modality $Y$ on $\mathcal{D}_m$ and use it as a pseudo value for the next iteration. In this way, the information of the modality-missing dataset can be integrated into the autoencoder to a certain extent. We call this structure AE2.
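As referenced in the Zero Padding item above, here is a minimal sketch of that baseline (our own code with hypothetical names): the feature of the missing modality is replaced by a zero vector, and the padded pair is then fused and classified exactly like a modality-complete sample.

```python
import torch

def zero_padded_features(fx, gy, feature_dim):
    """Zero Padding baseline: replace the missing modality's feature with zeros.

    fx : (batch, d) features of the observed modality X
    gy : (batch, d) features of modality Y, or None if Y is missing
    """
    if gy is None:
        gy = torch.zeros(fx.shape[0], feature_dim, device=fx.device)
    return fx, gy

# The padded pair is then fused by addition or concatenation and fed to the
# classifier exactly as for modality-complete samples.
fx = torch.randn(8, 128)
fx, gy = zero_padded_features(fx, None, feature_dim=128)
fused = torch.cat([fx, gy], dim=-1)   # concatenation form of the fusion
```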

3.4 Experimental Results

In this section, we demonstrate the effectiveness of our method in two aspects. Firstly, we show that our method achieves high performance in tackling modality missing, even when the missing rate reaches 95%. Secondly, we show that our method has higher efficiency than the autoencoder methods.

Method | Visual Missing 80% / 90% / 95% | Audio Missing 80% / 90% / 95%
Lower Bound Hastie et al. (2009) (Addition) | 46.91 / 35.26 / 26.39 | 50.93 / 35.26 / 27.53
Lower Bound Hastie et al. (2009) (Concatenation) | 46.49 / 36.39 / 27.11 | 46.29 / 33.71 / 27.84
Lower Bound Hastie et al. (2009) (Outer product) | 42.78 / 37.53 / 26.91 | 48.14 / 34.95 / 28.56
HGR Maximal Correlation Ma et al. (2021a, 2020); Wang et al. (2019) (Addition) | 58.97 / 59.69 / 41.34 | 77.32 / 74.12 / 54.95
HGR Maximal Correlation Ma et al. (2021a, 2020); Wang et al. (2019) (Concatenation) | 63.51 / 57.84 / 41.34 | 79.18 / 75.67 / 57.42
HGR Maximal Correlation Ma et al. (2021a, 2020); Wang et al. (2019) (Outer product) | 64.64 / 59.69 / 49.90 | 77.94 / 76.29 / 55.46
ZP Jo et al. (2019); Chen et al. (2020); Shen et al. (2020) (Addition) | 69.07 / 67.84 / 58.66 | 80.41 / 78.35 / 76.49
ZP Jo et al. (2019); Chen et al. (2020); Shen et al. (2020) (Concatenation) | 68.76 / 67.11 / 60.21 | 80.93 / 78.25 / 76.70
Ours (Addition) | 72.37 / 71.65 / 66.29 | 81.24 / 80.31 / 79.38
Ours (Concatenation) | 72.27 / 70.82 / 64.74 | 81.24 / 80.21 / 79.79
Ours (Outer product) | 72.06 / 71.13 / 66.08 | 81.65 / 81.03 / 80.31
Table 1: The classification performance with missing modality on the eNTERFACE’05 dataset.

We first conduct emotion classification experiments on the eNTERFACE’05 dataset to compare our method with other end-to-end methods. We make the audio modality and the visual modality missing, respectively, and in each of these two scenarios we set the missing rate to 80%, 90%, and 95%. The raw data and the corresponding labels are used as the input of our network. We adopt ResNet-50 He et al. (2016) as backbones to extract features from the audio and visual modalities. In addition, we transform the label into its one-hot form and obtain the corresponding label features with a fully connected layer. The whole network is trained jointly. For a fair comparison, the different methods are set to have the same structure. We report the classification accuracy of each method in each setting; the results are shown in Table 1. In particular, when the audio modality is missing, we analyze the tendency of ZP and our method as the missing rate increases, as shown in Figure 4.

Figure 4: The tendency of ZP and ours as the missing rate increases when audio modality is missing on the eNTERFACE’05 dataset.

We have the following observations from Table 1 and Figure 4: (1) HGR maximal correlation, ZP, and our method all improve the classification performance compared with the Lower Bound method, which only uses the modality-complete data. Our method achieves the best performance, especially with $k$ in the forms of outer product and addition. This shows that our method based on maximum likelihood estimation can overcome missing modalities more effectively than the other methods. (2) When the visual modality is missing, the classification accuracy is lower than when the audio modality is missing, indicating that the visual modality contributes more to the classification performance, which is consistent with previous works Zhang et al. (2017); Ma et al. (2020). (3) As the missing rate increases, the classification accuracies of all methods decrease, but the accuracy of our method decreases more slowly than that of the other methods. This shows that our method is more capable of coping with missing modalities. (4) For our approach, fusing the features $f(x)$ and $g(y)$ with the outer product or addition performs better than with concatenation. This indicates that the discrimination ability of the learned feature representations differs between scenarios, so the appropriate form of $k$ needs to be chosen to fuse features in our framework. (5) The HGR maximal correlation method can deal with missing modalities to a certain extent. However, it only focuses on the statistical dependence between different modalities and does not make full use of the information of the different types of data, so its classification performance is worse than that of ZP and our method.

Figure 5: The confusion matrices of different methods on the eNTERFACE’05 dataset.

In addition, we show the classification confusion matrices of the Lower Bound, HGR maximal correlation, ZP, and our methods when the missing rate of the visual modality reaches 95% on the eNTERFACE’05 dataset, as shown in Figure 5. The classification accuracy of each type of emotion with the Lower Bound method is not high, since it only uses the information from the modality-complete data. Compared with the Lower Bound method, HGR maximal correlation and ZP improve the recognition accuracy of each type of emotion. The overall classification performance of ZP is lower than ours, although its classification accuracy on “happiness” is slightly higher than ours. This shows that our method exploits the information of most emotions more effectively for the classification task.

Figure 6: The performance comparison of different methods when 50% of the training data has missing audio modality on the RAVDESS dataset.

We then compare our method with the autoencoder-based methods on the RAVDESS dataset to demonstrate that our method has high efficiency. The autoencoder-based methods need to be designed to reconstruct one modality from another, which is difficult if the raw data is used directly as the input for this kind of cross-modal generation task. Therefore, we use pre-trained networks, including VGG-16 Simonyan and Zisserman (2014), ResNet-34, and ResNet-50, to extract audio features and visual features from the raw data as the input of both our model and the autoencoders. In other words, we reconstruct the features of the missing modality rather than the raw data. After reconstructing the features with the autoencoder methods, we use the imputed features for classification. We conduct experiments with AE1 and AE2, respectively, and correspondingly use our method to classify the extracted features. We report the classification accuracy of each method within the same number of epochs to compare the efficiency of the different methods. The experimental results are shown in Table 2 and Figure 6.
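For illustration, a simplified feature-level version of AE1 could look like the sketch below (our own code, not the authors' implementation, with hypothetical names): a small autoencoder is trained on the modality-complete features to map the observed modality's features to the missing modality's features, and is then used to impute the features of the modality-missing samples before a separate classification stage.

```python
import torch
import torch.nn as nn

class CrossModalAE(nn.Module):
    """Maps features of the observed modality to features of the missing one."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, out_dim)

    def forward(self, feat_x):
        return self.decoder(self.encoder(feat_x))

def train_ae1(feat_x_c, feat_y_c, epochs=100, lr=1e-3):
    """Train on modality-complete features (X -> Y) by minimizing the MSE."""
    ae = CrossModalAE(feat_x_c.shape[1], feat_y_c.shape[1])
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(feat_x_c), feat_y_c)
        loss.backward()
        opt.step()
    return ae

# Imputation: predict the missing modality's features for the incomplete samples,
# then train an ordinary classifier on the completed feature set (a later stage).
# feat_y_m_hat = train_ae1(feat_x_c, feat_y_c)(feat_x_m)
```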

Features | Method | Visual Missing 50% / 80% / 90% | Audio Missing 50% / 80% / 90%
VGG-16 | AE1 | 62.66 / 57.92 / 48.67 | 69.48 / 68.67 / 66.59
VGG-16 | AE2 | 66.24 / 58.15 / 48.90 | 70.64 / 69.48 / 66.59
VGG-16 | Ours | 78.84 / 61.16 / 49.71 | 91.45 / 89.36 / 87.51
ResNet-34 | AE1 | 60.92 / 53.64 / 47.17 | 67.40 / 62.66 / 62.43
ResNet-34 | AE2 | 62.31 / 54.10 / 49.60 | 68.67 / 63.47 / 62.77
ResNet-34 | Ours | 80.46 / 64.62 / 52.37 | 92.14 / 90.06 / 86.59
ResNet-50 | AE1 | 67.05 / 60.12 / 54.45 | 72.60 / 70.40 / 68.32
ResNet-50 | AE2 | 69.13 / 61.16 / 50.64 | 73.76 / 70.06 / 66.94
ResNet-50 | Ours | 84.05 / 68.90 / 57.11 | 91.91 / 89.48 / 88.79
Table 2: The classification performance with missing modality on the RAVDESS dataset.

We have the following observations from Table 2 and Figure 6: (1) In each scenario, the classification accuracy of our method is higher than that of AE1 or AE2 within the same number of epochs, which shows that our method has higher efficiency. (2) In most cases, the classification accuracy of AE2 is higher than that of AE1, especially when the modality missing is not severe. This shows that, when there is more modality-complete data for training, the autoencoder with self-training can handle missing modalities more effectively for classification. (3) When the size of the modality-complete data increases, the classification accuracy of our method increases faster than that of AE1 and AE2. This may be because our method is more efficient than AE1 and AE2 at exploiting the modality-complete data for classification. (4) For our method, when the visual modality is missing, the classification accuracy using the features extracted by ResNet-50 is higher than that using VGG-16 or ResNet-34, whereas when the audio modality is missing, the accuracy using ResNet-34 is higher than that using VGG-16 or ResNet-50 in most settings. This indicates that, in different settings with missing modalities, appropriate networks should be adopted to extract features with high discrimination ability.

4 Related Works

Multimodal learning has achieved great successes in many applications. An important topic in this field is multimodal representations Baltrušaitis et al. (2018); Zhu et al. (2020), which learn feature representations from the multimodal data by using the information of different modalities. How to learn good representations is investigated in Ngiam et al. (2011); Wu et al. (2014); Pan et al. (2016); Xu et al. (2015). Another important topic is multimodal fusion Atrey et al. (2010); Poria et al. (2017), which combines the information from different modalities to make predictions. Feature-based fusion is one of the most common types of multimodal fusion. It concatenates the feature representations extracted from different modalities. This fusion approach is adopted by previous works Tzirakis et al. (2017); Zhang et al. (2017); Castellano et al. (2008); Zhang et al. (2016).

To cope with the problem of missing modalities in multimodal learning, a few methods have been proposed. For example, in Ma et al. (2021b), Ma et al. propose a Bayesian meta-learning framework to perturb the latent feature space so that embeddings of a single modality can approximate embeddings of the full modality. In Tran et al. (2017), Tran et al. propose a cascaded residual autoencoder for imputation with missing modalities, which is composed of a set of stacked residual autoencoders that iteratively model the residuals. In Chen and Zhang (2020), Chen et al. propose a heterogeneous graph-based multimodal fusion approach to enable multimodal fusion of incomplete data within a heterogeneous graph structure. In Liu et al. (2021), Liu et al. propose an autoencoder framework to complement the missing data in the kernel space while taking into account the structural information of the data and the inherent association between multiple views.

The above methods can combine the information of the modality-missing data to a certain extent, but our method is more effective for two reasons. Firstly, by exploiting the likelihood function to learn the conditional distributions of the modality-complete data and the modality-missing data, our method has a theoretical guarantee that previous works lack. Secondly, the training process of our approach is more concise and flexible, while the training processes of the above methods are relatively cumbersome.

In addition, it is worth noting that in Tsai et al. (2018); Peng et al. (2021); Meyer et al. (2020); Pham et al. (2019), the multimodal data is assumed to be complete during the training process, and modality missing only occurs during the testing stage. Such approaches therefore cannot handle missing modalities in the training phase, which may lead to limited performance.

5 Conclusion

Multimodal learning is a hot topic in the research community, and a key challenge in it is missing modalities. In practice, the multimodal data may not be complete due to various reasons, and previous works usually cannot effectively utilize the modality-missing data for the learning task. To address this problem, we propose an efficient approach to leverage the knowledge in the modality-missing data. Specifically, we present a system based on maximum likelihood estimation to characterize the conditional distributions of the modality-complete data and the modality-missing data, which has a theoretical guarantee. Furthermore, we develop a generalized form of the softmax function to effectively implement our maximum likelihood estimation framework in an end-to-end way. We conduct experiments on the eNTERFACE’05 and RAVDESS datasets to demonstrate the effectiveness of our approach. In the future, we will extend our approach to more complex multimodal learning scenarios, for example, scenarios where missing modalities exist in both the training and testing phases, or where both modalities and labels are missing.

References

  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433. Cited by: §1.
  • [2] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia systems 16 (6), pp. 345–379. Cited by: §4.
  • [3] T. Baltrušaitis, C. Ahuja, and L. Morency (2018) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443. Cited by: §1, §4.
  • [4] G. Castellano, L. Kessous, and G. Caridakis (2008) Emotion recognition through multiple modalities: face, body gesture, speech. In Affect and emotion in human-computer interaction, pp. 92–103. Cited by: §4.
  • [5] J. Chen and A. Zhang (2020) HGMF: heterogeneous graph-based fusion for multimodal data with incompleteness. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1295–1305. Cited by: §1, §4.
  • [6] Z. Chen, S. Wang, and Y. Qian (2020) Multi-modality matters: a performance leap on voxceleb. Proc. Interspeech 2020, pp. 2252–2256. Cited by: 3rd item, Table 1.
  • [7] H. Daniels (1961) The asymptotic efficiency of a maximum likelihood estimator. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 151–163. Cited by: §2.2.1.
  • [8] C. Du, C. Du, and H. He (2021) Multimodal deep generative adversarial models for scalable doubly semi-supervised learning. Information Fusion 68, pp. 118–130. Cited by: §3.2.
  • [9] C. Du, C. Du, H. Wang, J. Li, W. Zheng, B. Lu, and H. He (2018) Semi-supervised deep generative modelling of incomplete multi-modality emotional data. In Proceedings of the 26th ACM international conference on Multimedia, pp. 108–116. Cited by: §1, §3.2.
  • [10] D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pp. 215–233. Cited by: §1.
  • [11] C. Gan, N. Wang, Y. Yang, D. Yeung, and A. G. Hauptmann (2015) Devnet: a deep event network for multimedia event detection and evidence recounting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2568–2577. Cited by: §1.
  • [12] H. Gebelein (1941) Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 21 (6), pp. 364–379. Cited by: 2nd item.
  • [13] T. Hastie, R. Tibshirani, and J. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. Cited by: §1, Table 1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.4.
  • [15] H. O. Hirschfeld (1935) A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 31, pp. 520–524. Cited by: 2nd item.
  • [16] N. Jaques, S. Taylor, A. Sano, and R. Picard (2017) Multimodal autoencoder: a deep learning approach to filling in missing sensor data and enabling better mood prediction. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 202–208. Cited by: 4th item.
  • [17] D. U. Jo, B. Lee, J. Choi, H. Yoo, and J. Y. Choi (2019) Cross-modal variational auto-encoder with distributed latent spaces and associators. arXiv preprint arXiv:1905.12867. Cited by: 3rd item, Table 1.
  • [18] E. Kim, W. Y. Kang, K. On, Y. Heo, and B. Zhang (2020) Hypergraph attention networks for multimodal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14581–14590. Cited by: §1.
  • [19] Y. Liu, L. Fan, C. Zhang, T. Zhou, Z. Xiao, L. Geng, and D. Shen (2021) Incomplete multi-modal representation learning for alzheimer’s disease diagnosis. Medical Image Analysis 69, pp. 101953. Cited by: §1, 4th item, §4.
  • [20] S. R. Livingstone and F. A. Russo (2018) The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one 13 (5), pp. e0196391. Cited by: §1, §3.1.
  • [21] F. Ma, S. Huang, and L. Zhang (2021) An efficient approach for audio-visual emotion recognition with missing labels and missing modalities. In 2021 IEEE International Conference on Multimedia and Expo (ICME). Cited by: §1, §2.2.2, 2nd item, Table 1.
  • [22] F. Ma, W. Zhang, Y. Li, S. Huang, and L. Zhang (2020) Learning better representations for audio-visual emotion recognition with common information. Applied Sciences 10 (20). Cited by: 2nd item, §3.4, Table 1.
  • [23] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng (2021) SMIL: multimodal learning with severely missing modality. arXiv preprint arXiv:2103.05677. Cited by: §1, §1, §2.2.2, §4.
  • [24] O. Martin, I. Kotsia, B. Macq, and I. Pitas (2006) The eNTERFACE’05 audio-visual emotion database. In 22nd International Conference on Data Engineering Workshops (ICDEW’06), pp. 8–8. Cited by: §1, §3.1.
  • [25] D. McClosky, E. Charniak, and M. Johnson (2006) Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 152–159. Cited by: 5th item.
  • [26] J. Meyer, A. Eitel, T. Brox, and W. Burgard (2020) Improving unimodal object recognition with multimodal contrastive learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5656–5663. Cited by: §4.
  • [27] I. J. Myung (2003) Tutorial on maximum likelihood estimation. Journal of mathematical Psychology 47 (1), pp. 90–100. Cited by: §2.2.1.
  • [28] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In ICML, Cited by: §4.
  • [29] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui (2016) Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4594–4602. Cited by: §4.
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In NeurIPS, pp. 8026–8037. Cited by: §3.2.
  • [31] W. Peng, X. Hong, and G. Zhao (2021) Adaptive modality distillation for separable multimodal sentiment analysis. IEEE Intelligent Systems, pp. 1–1. Cited by: §4.
  • [32] R. C. Pereira, M. S. Santos, P. P. Rodrigues, and P. H. Abreu (2020) Reviewing autoencoders for missing data imputation: technical trends, applications and outcomes. Journal of Artificial Intelligence Research 69, pp. 1255–1285. Cited by: 4th item.
  • [33] H. Pham, P. P. Liang, T. Manzini, L. Morency, and B. Póczos (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6892–6899. Cited by: §4.
  • [34] S. Poria, E. Cambria, R. Bajpai, and A. Hussain (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion 37, pp. 98–125. Cited by: §4.
  • [35] A. Rényi (1959) On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica 10 (3-4), pp. 441–451. Cited by: 2nd item.
  • [36] G. Shen, X. Wang, X. Duan, H. Li, and W. Zhu (2020) MEmoR: a dataset for multimodal emotion reasoning in videos. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 493–502. Cited by: 3rd item, Table 1.
  • [37] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.4.
  • [38] M. Soleymani, M. Pantic, and T. Pun (2011) Multimodal emotion recognition in response to videos. IEEE transactions on affective computing 3 (2), pp. 211–223. Cited by: §1.
  • [39] L. Tran, X. Liu, J. Zhou, and R. Jin (2017) Missing modalities imputation via cascaded residual autoencoder. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4971–4980. Cited by: §1, 4th item, §4.
  • [40] Y. H. Tsai, P. P. Liang, A. Zadeh, L. Morency, and R. Salakhutdinov (2018) Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176. Cited by: §4.
  • [41] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou (2017) End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1301–1309. Cited by: §1, §4.
  • [42] L. Wang, J. Wu, S. Huang, L. Zheng, X. Xu, L. Zhang, and J. Huang (2019) An efficient approach to informative feature extraction from multimodal data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5281–5288. Cited by: 2nd item, Table 1.
  • [43] Z. Wu, Y. Jiang, J. Wang, J. Pu, and X. Xue (2014) Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 167–176. Cited by: §4.
  • [44] R. Xu, C. Xiong, W. Chen, and J. Corso (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §4.
  • [45] X. Xu, S. Huang, L. Zheng, and L. Zhang (2018) The geometric structure of generalized softmax learning. In 2018 IEEE Information Theory Workshop (ITW), Vol. , pp. 1–5. Cited by: §2.2.2.
  • [46] D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pp. 189–196. Cited by: 5th item.
  • [47] G. Yu, Q. Li, D. Shen, and Y. Liu (2020) Optimal sparse linear prediction for block-missing multi-modality data without imputation. Journal of the American Statistical Association 115 (531), pp. 1406–1419. Cited by: §3.2.
  • [48] H. Zhang, L. Dong, G. Gao, H. Hu, Y. Wen, and K. Guan (2020) DeepQoE: a multimodal learning framework for video quality of experience (QoE) prediction. IEEE Transactions on Multimedia 22 (12), pp. 3210–3223. Cited by: §1.
  • [49] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian (2017) Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 3030–3043. Cited by: §1, §3.4, §4.
  • [50] S. Zhang, S. Zhang, T. Huang, and W. Gao (2016) Multimodal deep convolutional neural network for audio-visual emotion recognition. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 281–284. Cited by: §4.
  • [51] W. Zhu, X. Wang, and W. Gao (2020) Multimedia intelligence: when multimedia meets artificial intelligence. IEEE Transactions on Multimedia 22 (7), pp. 1823–1835. Cited by: §4.