Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions

by Huan Ma et al.
Tianjin University

Multimodal regression is a fundamental task that integrates information from different sources to improve the performance of follow-up applications. However, existing methods mainly focus on improving performance and often ignore the confidence of the prediction in diverse situations. In this study, we are devoted to trustworthy multimodal regression, which is critical in cost-sensitive domains. To this end, we introduce a novel Mixture of Normal-Inverse Gamma distributions (MoNIG) algorithm, which efficiently estimates uncertainty in a principled manner to adaptively integrate different modalities and produce a trustworthy regression result. Our model is dynamically aware of the uncertainty of each modality and robust to corrupted modalities. Furthermore, the proposed MoNIG explicitly represents both (modality-specific/global) epistemic and aleatoric uncertainties. Experimental results on both synthetic and different real-world data demonstrate the effectiveness and trustworthiness of our method on various multimodal regression tasks (e.g., temperature prediction for superconductivity, relative location prediction for CT slices, and multimodal sentiment analysis).




1 Introduction

There are plenty of multimodal data in the real world, and we experience the world through different modalities Baltrušaitis et al. (2018). For example, autonomous driving systems are usually equipped with multiple sensors to collect information from different perspectives Cho et al. (2014). In medical diagnosis Perrin et al. (2009), multimodal data usually come from different types of examinations, typically including a variety of clinical data. Effectively exploiting the information from different sources to improve learning performance is a long-standing and challenging goal in machine learning.

Most multimodal regression methods Gan et al. (2017); Ye et al. (2012); Dzogang et al. (2012); Gunes and Piccardi (2006) focus on improving regression performance by exploiting the complementary information among multiple modalities. Despite their effectiveness, it is quite risky to deploy these methods in cost-sensitive applications due to their lack of reliability and interpretability. One underlying deficiency is that traditional models usually assume the quality of each modality to be basically stable, which prevents them from producing reliable predictions, especially when some modalities are noisy or even corrupted Liang et al. (2019); Lee et al. (2020). Moreover, existing models only output (often over-confident) predictions Snoek et al. (2019); Ulmer and Cinà (2021), which cannot well support safe decision making and might be disastrous for safety-critical applications.

Uncertainty estimation provides a way toward trustworthy prediction Abdar et al. (2020); Han et al. (2021). Decisions made by models without uncertainty estimation are untrustworthy because such models are prone to being affected by noise or limited training data. Therefore, it is highly desirable to characterize uncertainty during learning for AI-based systems. More specifically, when a model is given an input that has never been seen or is severely contaminated, it should be able to express “I don’t know”. Untrustworthy models are vulnerable to attack and may also lead to wrong decisions, whose cost is often unbearable in critical domains Kendall and Gal (2017).

Dynamically modeling uncertainty endows a model with trustworthiness. Therefore, we propose a novel algorithm that conducts multimodal regression in a trustworthy manner. Specifically, our proposed algorithm models uncertainty within a unified, fully probabilistic framework. Our model integrates multiple modalities by introducing a Mixture of Normal-inverse Gamma distributions (MoNIG), which hierarchically characterizes the uncertainties and accordingly promotes both regression accuracy and trustworthiness. In summary, the contributions of this work include: (1) We propose a novel trustworthy multimodal regression algorithm. Our method effectively fuses multiple modalities under the evidential regression framework, equipped with both modality-specific and global uncertainty. (2) To integrate different modalities, a novel MoNIG is designed to be dynamically aware of modality-specific noise/corruption with the estimated uncertainty, which promisingly supports trustworthy decision making and also significantly promotes robustness. (3) We conduct extensive experiments on both synthetic and real-application data, which validate the effectiveness, robustness, and reliability of the proposed model on different multimodal regression tasks (e.g., critical temperature prediction for superconductivity Hamidieh (2018), relative location of CT slices, and human multimodal sentiment analysis).

2 Related Work

2.1 Uncertainty Estimation

Quantifying the uncertainty of machine learning models has received extensive attention Hafner et al. (2019); Qaddoum and Hines (2012), especially when the systems are deployed in safety-critical domains, such as autonomous vehicle control Khodayari et al. (2010) and medical diagnosis Perrin et al. (2009). Bayesian neural networks Neal (2012); MacKay (1992) model uncertainty by placing a distribution over model parameters and marginalizing over these parameters to form a predictive distribution. Due to the huge parameter space of modern neural networks, Bayesian neural networks are highly non-convex and difficult to perform inference in. To tackle this problem, Molchanov et al. (2017) extends Variational Dropout Hinton et al. (2012) to the case of unbounded dropout rates and proposes a way to reduce the variance of the gradient estimator. A more scalable alternative is MC Dropout Gal and Ghahramani (2016), which is simple to implement and has been successfully applied to downstream tasks Kendall and Gal (2017); Mukhoti and Gal (2018). Deep ensembles Lakshminarayanan et al. (2017) have shown strong power in both classification accuracy and uncertainty estimation; it has been observed that deep ensembles consistently outperform Bayesian neural networks trained using variational inference Snoek et al. (2019). However, their memory and computational cost are quite high. To address this issue, different deep sub-networks with shared parameters can be trained for integration Antorán et al. (2020). Deterministic uncertainty methods are designed to directly output the uncertainty and alleviate overconfidence. Built upon RBF networks, van Amersfoort et al. (2020) is able to identify out-of-distribution samples. Corbière et al. (2019) introduces a new target criterion for model confidence, known as True Class Probability (TCP), to ensure low confidence for failure predictions. A recent approach employs the focal loss to calibrate deep neural networks Mukhoti et al. (2020). Sensoy et al. (2018) places Dirichlet priors over discrete classification predictions and regularizes divergence to a well-defined prior. Our model is inspired by deep evidential regression Amini et al. (2020), which is designed for single-modal data.

2.2 Multimodal Learning

Multimodal machine learning aims to build models that can jointly exploit information from multiple modalities Baltrušaitis et al. (2018); Jiang et al. (2019). Existing multimodal learning methods have achieved significant progress by integrating different modalities at different stages, namely early, intermediate, and late fusion Baltrušaitis et al. (2018); Ramachandram and Taylor (2017). Early fusion methods usually integrate the original data or preprocessed features by simple concatenation Kiela et al. (2018); Poria et al. (2015). Intermediate fusion provides a more flexible strategy to exploit multimodal data for diverse practical applications Gan et al. (2017); Antol et al. (2015); Xia et al. (2020); Yi et al. (2015); Ouyang et al. (2014); Zhang et al. (2020a, b). In late fusion, each modality is used to train a separate model, and the final output is obtained by combining the predictions of these multiple models Ye et al. (2012); Gunes and Piccardi (2006); Xia et al. (2020).

Although these multimodal learning approaches exploit the complementary information of multiple modalities from different perspectives, they are generally weak in modeling uncertainty, which is important for trustworthy prediction, especially when they are deployed in safety-critical domains.

Regression has been widely used across a wide spectrum of applications. Given a dataset D = {(x_i, y_i)}_{i=1}^N, one principled way to approach regression is from a maximum likelihood perspective, where the likelihood of the parameters w given the observed data is as follows:

    p(y | x, w) = ∏_{i=1}^{N} p(y_i | x_i, w).    (1)
In practice, a widely used maximum likelihood estimation is based on the Gaussian distribution, which assumes that the target y_i is drawn from a Gaussian distribution N(μ, σ²). Then we have the following expression of the likelihood function:

    p(y_i | μ, σ²) = (2πσ²)^{-1/2} exp( −(y_i − μ)² / (2σ²) ).    (2)
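For concreteness, the Gaussian negative log-likelihood corresponding to Eq. 2 can be written in a few lines; this is a minimal sketch (the function name is ours, not the paper's):

```python
import math

def gaussian_nll(y, mu, sigma2):
    """Negative log-likelihood of y under N(mu, sigma2), i.e. -log of Eq. 2."""
    return 0.5 * math.log(2.0 * math.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)

# The NLL is minimized when mu matches the target:
print(gaussian_nll(1.0, 1.0, 1.0))  # 0.5 * log(2*pi) ≈ 0.9189
```

Minimizing this NLL over a dataset recovers the usual least-squares fit when σ² is fixed.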
The mean and variance indicate the prediction and the corresponding predictive uncertainty, where the predictive uncertainty consists of two parts Kiureghian and Ditlevsen (2009): epistemic uncertainty (EU) and aleatoric uncertainty (AU). To learn both the aleatoric and epistemic uncertainties explicitly, the mean and variance are assumed to be drawn from Gaussian and Inverse-Gamma distributions Amini et al. (2020), respectively. Then the Normal-Inverse Gamma (NIG) distribution with parameters (γ, υ, α, β) can be considered as a higher-order conjugate prior of the Gaussian distribution parameterized with (μ, σ²):

    μ ∼ N(γ, σ² υ^{-1}),    σ² ∼ Γ^{-1}(α, β),    (3)

where Γ(·) is the gamma function. In this case, the distribution of (μ, σ²) takes the form of a NIG distribution NIG(γ, υ, α, β):

    p(μ, σ² | γ, υ, α, β) = (β^α √υ) / (Γ(α) √(2πσ²)) · (1/σ²)^{α+1} · exp( −(2β + υ(γ − μ)²) / (2σ²) ),    (4)

where γ ∈ ℝ, υ > 0, α > 1, and β > 0.
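The NIG density can be evaluated directly with the standard library; a sketch under the parameterization of Amini et al. (2020), with a function name of our own choosing:

```python
import math

def nig_pdf(mu, sigma2, gamma, v, alpha, beta):
    """Density of the NIG(gamma, v, alpha, beta) prior at the point (mu, sigma2)."""
    coef = (beta ** alpha) * math.sqrt(v) / (math.gamma(alpha) * math.sqrt(2.0 * math.pi * sigma2))
    return coef * sigma2 ** (-(alpha + 1)) * math.exp(-(2.0 * beta + v * (gamma - mu) ** 2) / (2.0 * sigma2))
```

As expected of a prior centered at γ, the density decays as μ moves away from γ for a fixed σ².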

During the training stage, the following loss is induced to minimize the negative log-likelihood:

    L^NLL(w) = ½ log(π/υ) − α log(Ω) + (α + ½) log( (y − γ)² υ + Ω ) + log( Γ(α) / Γ(α + ½) ),    (5)

where Ω = 2β(1 + υ), and Γ(·) is the gamma function.

The total loss, L(w), consists of two terms, one maximizing the likelihood and one regularizing the evidence:

    L(w) = L^NLL(w) + λ L^R(w),    (6)

where L^R(w) is the penalty for incorrect evidence (more details are shown in the supplement), and the coefficient λ balances these two loss terms.
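The two training terms can be sketched in a few lines of code. This assumes the evidence penalty L^R = |y − γ| · (2υ + α) used in deep evidential regression Amini et al. (2020); the function names are ours:

```python
import math

def nig_nll(y, gamma, v, alpha, beta):
    """Negative log-likelihood of a target y under the NIG evidential prior (Eq. 5)."""
    omega = 2.0 * beta * (1.0 + v)
    return (0.5 * math.log(math.pi / v)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log((y - gamma) ** 2 * v + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidential_loss(y, gamma, v, alpha, beta, lam=0.01):
    """Total loss of Eq. 6, assuming the penalty L^R = |y - gamma| * (2v + alpha)."""
    return nig_nll(y, gamma, v, alpha, beta) + lam * abs(y - gamma) * (2.0 * v + alpha)

# The loss grows as the prediction gamma moves away from the target y:
assert evidential_loss(0.0, 3.0, 1.0, 2.0, 1.0) > evidential_loss(0.0, 1.0, 1.0, 2.0, 1.0)
```

The penalty term discourages the network from claiming high evidence (large υ and α) on samples it predicts poorly.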

For multimodal regression Ye et al. (2012); Gunes and Piccardi (2006); Dzogang et al. (2012), some approaches have achieved impressive performance. However, existing multimodal regression algorithms mainly focus on improving accuracy and provide a deterministic prediction without any information about uncertainty, which limits these models for trustworthy decision making. To address this issue, we develop a novel algorithm which captures the modality-specific uncertainty and accordingly induces a reliable overall uncertainty for the final output. Beyond trustworthy decision making, our approach automatically alleviates the impact of a heavily noisy or corrupted modality. Compared with intermediate fusion Gan et al. (2017); Li et al. (2012); Bahrami et al. (2021), our method can effectively distinguish which modality is noisy or corrupted for different samples, and accordingly takes the modality-specific uncertainty into account for robust integration.

Figure 1: Strategies for multimodal regression. Feature fusion (a), decision fusion (b), the proposed MoNIG (c), and an instance of MoNIG on the human multimodal sentiment analysis task (d).

3 Trustworthy Multimodal Regression

In this section, we introduce the proposed algorithm, which fuses multiple modalities at both the feature and predictive-distribution levels. Specifically, we train deep neural networks to model the hyperparameters of the higher-order evidential distributions of the multiple modalities (and the pseudo modality), and then merge these predicted distributions into one by the mixture of NIGs (MoNIG). In the following subsections, we first discuss existing fusion strategies for multimodal regression and our own. Then we introduce MoNIG to integrate the NIG distributions of different modalities in a principled way. Finally, we elaborate the overall training pipeline of our model and define the epistemic and aleatoric uncertainties, respectively.

3.1 Overview of Fusion Strategy

Consider the following multimodal regression problem: given a dataset D = {({x_i^m}_{m=1}^M, y_i)}_{i=1}^N, where x_i^m is the input feature vector of the m-th modality of the i-th sample, y_i is the corresponding target, and M is the total number of modalities, the intuitive goal is to learn for each modality a function f^m reaching the target f^m(x_i^m) ≈ y_i. To model the uncertainty, we assume that the observed target is drawn from a Gaussian distribution, i.e., y_i ∼ N(μ, σ²). Meanwhile, inspired by Amini et al. (2020), as shown in Eq. 3, we assume that the mean and variance are drawn from Gaussian and Inverse-Gamma distributions, respectively.

Considering multimodal fusion, there are typically two representative strategies, i.e., feature fusion and decision fusion. Feature fusion (Fig. 1(a)) jointly learns a unified representation, based on which regression is conducted. Although simple, its limitation is also obvious: if some modalities of some samples are noisy or corrupted, algorithms based on this strategy may fail. In other words, these methods do not take the modality-specific uncertainty into account and cannot adaptively be aware of quality variation of different modalities across samples. The other representative strategy is known as decision fusion (Fig. 1(b)), which trains one regression model for each modality and obtains the final prediction as a function of the predictions of all modalities. Although flexible, it is also risky in cases where each single modality is insufficient to make a trustworthy decision due to partial observation.

We propose MoNIG, which can elegantly address the above issues in a unified framework. Firstly, we explicitly represent both modality-specific and global uncertainties to endow our model with the ability to adapt to dynamic modality variation. The modality-specific uncertainty dynamically guides the integration for different samples, and the global uncertainty represents the uncertainty of the final prediction. Secondly, we introduce feature-level fusion producing a pseudo modality to make full use of the complementary information and accordingly generate an enhanced branch to supplement the decision fusion (Fig. 1(c)).

3.2 Mixture of Normal-inverse Gamma Distributions

In this section, we focus on fusing multiple NIG distributions from the original modalities and the generated pseudo modality. The key challenge is how to reasonably integrate multiple NIG distributions into a unified NIG. The widely used product of experts (PoE) Hinton (2002) is not suitable for our case, since there is no guarantee that the product of multiple NIGs is still a NIG. Moreover, integration with PoE tends to be affected by noisy modalities Shi et al. (2019). Given these issues, another natural way is to integrate a set of NIGs with the following additive strategy:

    NIG(γ, υ, α, β) = 1/M ( NIG(γ_1, υ_1, α_1, β_1) + ⋯ + NIG(γ_M, υ_M, α_M, β_M) ).    (7)

Although simple in form, Eq. 7 has two main limitations. First, simply averaging over all NIGs does not take the uncertainties of the different NIGs into account, and thus the result may be seriously affected by noisy modalities. Second, it is intractable to infer the parameters of the fused NIG distribution in practice, since there is no closed-form solution. Therefore, we introduce the NIG summation operation Qian and others (2018) to approximately solve this problem. Specifically, the NIG summation operator defines a novel operation that guarantees the fusion of two NIG distributions is again a NIG distribution.

Definition 3.1.

(Summation of NIG distributions) Given two NIG distributions NIG(γ_1, υ_1, α_1, β_1) and NIG(γ_2, υ_2, α_2, β_2), their summation

    NIG(γ, υ, α, β) = NIG(γ_1, υ_1, α_1, β_1) ⊕ NIG(γ_2, υ_2, α_2, β_2)    (8)

is defined by

    γ = (υ_1 γ_1 + υ_2 γ_2) / (υ_1 + υ_2),    υ = υ_1 + υ_2,
    α = α_1 + α_2 + ½,    β = β_1 + β_2 + ½ υ_1 (γ_1 − γ)² + ½ υ_2 (γ_2 − γ)².    (9)
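Definition 3.1 translates directly into code; a minimal sketch (the tuple layout and function name are our own convention):

```python
def nig_sum(nig1, nig2):
    """Summation of two NIG distributions (Def. 3.1).

    Each NIG is represented as a (gamma, v, alpha, beta) tuple.
    """
    g1, v1, a1, b1 = nig1
    g2, v2, a2, b2 = nig2
    v = v1 + v2
    gamma = (v1 * g1 + v2 * g2) / v            # confidence-weighted mean
    alpha = a1 + a2 + 0.5
    beta = (b1 + b2
            + 0.5 * v1 * (g1 - gamma) ** 2     # deviation of branch 1 from the fused mean
            + 0.5 * v2 * (g2 - gamma) ** 2)    # deviation of branch 2 from the fused mean
    return (gamma, v, alpha, beta)

# A branch with a larger v (more confidence in its mean) dominates the fused mean:
print(nig_sum((0.0, 9.0, 2.0, 1.0), (10.0, 1.0, 2.0, 1.0))[0])  # 1.0
```

Note how β accumulates both the branch-level β terms and the disagreement between each branch's mean and the fused mean.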
Therefore, fusing NIG distributions in this way endows our model with several promising properties (shown in Pro. 3.1), which allow us to use it for trustworthy multimodal regression based on uncertainty. We then substitute Eq. 7 with the following operation:

    NIG(γ, υ, α, β) = NIG(γ_p, υ_p, α_p, β_p) ⊕ NIG(γ_1, υ_1, α_1, β_1) ⊕ ⋯ ⊕ NIG(γ_M, υ_M, α_M, β_M),    (10)

where ⊕ represents the summation operation of two NIG distributions and the subscript p denotes the pseudo modality.

The NIG summation can reasonably make use of modalities of different quality. Specifically, the parameter υ indicates the confidence of a NIG distribution in its mean γ Jordan (2009). As shown in Def. 3.1, if one modality is more confident in its prediction, it will contribute more to the final prediction. Moreover, β directly reflects both the aleatoric and the epistemic uncertainty (Eq. 12) and consists of two parts, i.e., the sum of the β_m over the multiple modalities and the deviation between the final prediction and that of every single modality. Intuitively, the final uncertainty is determined jointly by the modality-specific uncertainties and the prediction deviation among the different modalities.

Proposition 3.1.

The summation operation of NIG distributions in Definition 3.1 has the following properties:
1. Commutativity: NIG_1 ⊕ NIG_2 = NIG_2 ⊕ NIG_1;
2. Associativity: (NIG_1 ⊕ NIG_2) ⊕ NIG_3 = NIG_1 ⊕ (NIG_2 ⊕ NIG_3).

The above two properties can be easily proved (refer to the supplement). Based on the summation operation, our multimodal regression algorithm has the following advantages. (1) Flexibility: according to Def. 3.1 and Pro. 3.1, we can fuse an arbitrary number of NIG distributions conveniently. (2) Explainability: we can explain the detailed fusion process for multiple NIG distributions and observe the belief degree of each NIG distribution according to Eq. 9. (3) Trustworthiness: using Def. 3.1 allows us to fuse multiple NIG distributions into a new NIG distribution, which is critical to explicitly provide both epistemic uncertainty and aleatoric uncertainty for trustworthy decision making. (4) Optimizability: compared with struggling to seek a closed-form solution and conducting complex optimization, it is much more efficient to optimize using Def. 3.1 for multiple NIGs fusion.
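The two properties of Pro. 3.1 can be checked numerically, and together they let us fold any number of branch NIGs in any order. A self-contained sketch (parameter tuples and function names are ours):

```python
from functools import reduce

def nig_sum(nig1, nig2):
    """Summation (Def. 3.1) of two NIGs, each given as a (gamma, v, alpha, beta) tuple."""
    g1, v1, a1, b1 = nig1
    g2, v2, a2, b2 = nig2
    v = v1 + v2
    gamma = (v1 * g1 + v2 * g2) / v
    alpha = a1 + a2 + 0.5
    beta = b1 + b2 + 0.5 * v1 * (g1 - gamma) ** 2 + 0.5 * v2 * (g2 - gamma) ** 2
    return (gamma, v, alpha, beta)

def close(p, q, tol=1e-9):
    """Component-wise equality up to floating-point error."""
    return all(abs(x - y) < tol for x, y in zip(p, q))

a, b, c = (1.0, 2.0, 3.0, 1.0), (2.0, 1.0, 2.0, 0.5), (0.5, 4.0, 2.5, 2.0)

assert close(nig_sum(a, b), nig_sum(b, a))                          # commutativity
assert close(nig_sum(nig_sum(a, b), c), nig_sum(a, nig_sum(b, c)))  # associativity

# Hence an arbitrary number of branch NIGs can be folded in any order (cf. Eq. 10):
fused = reduce(nig_sum, [a, b, c])
```

This is the "Flexibility" property in practice: a single binary operator extends to M modalities by folding.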

3.3 Overall Learning Framework

(a) Prediction
(b) Uncertainty estimation
Figure 2: (a) Regressed curves for each modality and the fused ones (with MoNIG), where gray shadow areas indicate there is no training data available; (b) the left and right subfigures demonstrate the estimated aleatoric and epistemic uncertainties, respectively.

Inspired by multi-task learning, we define the final loss function as the sum of the losses of the multiple modalities (including the pseudo modality) and the loss of the predictive-level fused distribution:

    L(w) = Σ_{m=1}^{M} L_m(w) + L_p(w) + L_f(w),    (11)

where L_m(w) is the loss of the m-th modality, obtained according to Eq. 6. Specifically, L_p(w) is the pseudo modality loss, defined as Eq. 6 computed on the NIG distribution of the pseudo modality, and L_f(w) is the fused distribution loss, defined as Eq. 6 computed on the fused MoNIG distribution. A more detailed derivation of the overall loss is given in the supplement. The proposed method is a general uncertainty-aware fusion module, so the pseudo modality may be obtained in different ways (e.g., feature concatenation or concatenation after representation learning).

Given the MoNIG distribution, the aleatoric and epistemic uncertainties are defined as:

    E[σ²] = β / (α − 1)  (aleatoric),    Var[μ] = β / (υ(α − 1))  (epistemic).    (12)
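These two quantities, together with the prediction E[μ] = γ, follow directly from the NIG parameters; a sketch (function name is ours):

```python
def uncertainties(gamma, v, alpha, beta):
    """Prediction and uncertainties of a NIG(gamma, v, alpha, beta), cf. Eq. 12."""
    assert alpha > 1.0, "alpha > 1 is required for the expectations to exist"
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: inherent data noise
    epistemic = beta / (v * (alpha - 1.0))   # Var[mu]: shrinks as the evidence v grows
    return prediction, aleatoric, epistemic

pred, au, eu = uncertainties(0.0, 10.0, 3.0, 4.0)
print(au, eu)  # 2.0 0.2
```

Note that the epistemic term is the aleatoric term divided by υ, so branches with more accumulated evidence report lower epistemic uncertainty for the same data noise.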
For clarification, we provide the following remarks: (1) The objective is a unified learnable framework, and thus the branches of the original modalities, the pseudo modality, and the fused global branch can improve each other under the multi-task learning strategy; (2) The final output is explainable in terms of the local (modality-specific) and global (fused) uncertainties, as well as the aleatoric and epistemic uncertainties.

4 Experiments

To demonstrate the effectiveness of our model, we conduct experiments on synthetic and real-world data, including physical (Superconductivity), medical (CT slices), and multimodal sentiment analysis tasks. Furthermore, we conduct experiments to validate the effectiveness of uncertainty estimation under a variety of conditions. Finally, an ablation study further assesses the effectiveness of the approach in terms of uncertainty quantification and investigates whether the promising performance is due to the novel fusion strategy.

4.1 Experiment on Synthetic Data

To intuitively illustrate the properties of our model, we first conduct experiments on synthetic data. Following Amini et al. (2020); Hernández-Lobato and Adams (2015), we sample targets from a fixed ground-truth curve with additive Gaussian noise, and construct the two modality inputs such that one modality carries stronger input noise than the other on part of the sampled data points. The visualization and corresponding analysis can be found in Fig. 2, where the left and right subfigures demonstrate the prediction ability and the estimated aleatoric/epistemic uncertainties, respectively. According to Fig. 2(a), all curves fit the ground-truth curve well given enough training data. From the left subfigure of Fig. 2(b), we find that the estimated aleatoric uncertainty (AU) of the noisier modality is much higher than that of the cleaner one, which is consistent with the basic assumption. The right subfigure demonstrates that the epistemic uncertainty (EU) is much higher where there is no training data, validating that our model can well characterize the epistemic uncertainty.

4.2 Temperature Prediction for Superconductivity

(a) Epistemic uncertainty
(b) Superconductivity
(c) CT Slices
Figure 3: Epistemic uncertainty with different numbers of training samples (a), and the relationship between the uncertainty and the noise degree (b)-(c).
Table 1: Multimodal regression on Superconductivity (RMSE ↓).

Mod1   Mod2   GS (IF/DF)    EVD (IF/DF)   Ours (Pseudo)
12.93  12.82  13.93/13.78   14.18/12.79   12.19 (11.70)

Table 2: OOD (out-of-distribution) detection on Superconductivity (AUROC ↑) using the estimated uncertainty.

Modality  Noise  GS     EVD (AU/EU)  Ours (AU/EU)
Mod1      0.1    0.503  0.503/0.494  0.586/0.536
          0.5    0.511  0.486/0.424  0.782/0.551
Mod2      0.1    0.503  0.507/0.486  0.839/0.577
          0.5    0.512  0.506/0.439  0.858/0.539
RandMod   0.1    0.502  0.502/0.488  0.699/0.549
          0.5    0.508  0.492/0.426  0.812/0.539
AllMod    0.1    0.507  0.509/0.484  0.838/0.584
          0.5    0.518  0.494/0.424  0.887/0.546

Superconductivity contains 21263 samples, each described by two modalities. The first modality contains 81 features of the material's experimental properties, and the second contains 86 features extracted from the chemical formula. The goal is to predict the critical temperature (values in the range [0, 185]) from these two modalities. In our experiment, we use 10633, 4000, and 6600 samples as training, validation, and test data, respectively. We compare our model with existing approaches, including Gaussian (GS) in Eq. 2 and Evidential (EVD) Amini et al. (2020), where “IF” and “DF” denote intermediate-level (concatenating the features from a hidden layer) and data-level (concatenating the original features) fusion, respectively. “Pseudo” indicates that the pseudo modality (concatenating the features from a hidden layer) is involved. Our approach (trained with the Adam optimizer for 400 iterations) outperforms the compared methods, achieving the lowest Root Mean Squared Error (RMSE). We also impose different degrees of noise under different conditions on half of the test samples, treating them as out-of-distribution (OOD) samples, and try to distinguish them using the estimated uncertainty. “AU” and “EU” indicate the uncertainties used to distinguish OOD samples. According to Table 2, our model achieves clearly better performance in terms of AUROC under different noise conditions (a higher value indicates better uncertainty estimation).

4.3 Relative Location Prediction for CT Slices

Table 3: Multimodal regression on CT slices (RMSE ↓).

Mod1  Mod2  GS (IF/DF)  EVD (IF/DF)  Ours (Pseudo)
1.67  4.49  1.64/1.70   2.97/3.27    0.91 (0.79)

Table 4: OOD (out-of-distribution) detection on CT slices (AUROC ↑) using the estimated uncertainty.

Modality  Noise  GS     EVD (AU/EU)  Ours (AU/EU)
Mod1      0.1    0.504  0.504/0.502  0.532/0.543
          0.5    0.582  0.638/0.610  0.771/0.806
Mod2      0.1    0.503  0.503/0.500  0.519/0.511
          0.5    0.529  0.600/0.584  0.870/0.866
RandMod   0.1    0.503  0.504/0.505  0.569/0.582
          0.5    0.570  0.619/0.601  0.835/0.852
AllMod    0.1    0.507  0.509/0.508  0.600/0.615
          0.5    0.618  0.672/0.631  0.808/0.839

CT Slices are retrieved from a set of 53500 CT images from 74 different patients. Each CT slice is described by two modalities. The first modality describes the location of bone structures in the image, containing 240 attributes. The second modality describes the location of air inclusions inside the body, containing 144 attributes. The goal is to predict the relative location of the image on the axial axis. The target values are in the range [0, 180], where 0 denotes the top of the head and 180 the soles of the feet. We divide the dataset into 26750/10000/16750 samples as training, validation, and test data, respectively. The other settings are the same as those on the Superconductivity dataset. We also test the ability of OOD detection, and the results are reported in Table 4. It is clear that our model achieves much better performance than the compared methods.

4.4 Human Multimodal Sentiment Analysis

CMU-MOSI & MOSEI. CMU-MOSI Zadeh et al. (2016) is a multimodal sentiment analysis dataset consisting of 2,199 short monologue video clips, while MOSEI Liang et al. (2018) is a sentiment and emotion analysis dataset consisting of 23,454 movie review video clips taken from YouTube. Each task consists of a word-aligned and an unaligned version. For both versions, the multimodal features are extracted from the textual Pennington et al. (2014), visual and acoustic Degottex et al. (2014) modalities.

Table 5: Results on CMU-MOSI (Acc7, Acc2, F1, Corr: higher is better; MAE: lower is better).

(Word Aligned) CMU-MOSI Sentiment
Method                               Acc7   Acc2   F1     MAE    Corr
EF-LSTM                              33.7   75.3   75.2   1.023  0.608
LF-LSTM                              35.3   76.8   76.7   1.015  0.625
RAVEN Wang et al. (2019)             33.2   78.0   76.6   0.915  0.691
MCTN Pham et al. (2019)              35.6   79.3   79.1   0.909  0.676
Gaussian                             32.9   78.4   78.4   0.982  0.657
Evidential                           33.4   77.6   77.6   0.974  0.655
MoNIG                                33.0   80.2   80.4   0.959  0.664
MoNIG (pseudo)                       34.1   80.6   80.6   0.951  0.680

(Unaligned) CMU-MOSI Sentiment
CTC + EF-LSTM Graves et al. (2006)   31.0   73.6   74.5   1.078  0.542
LF-LSTM                              33.7   77.6   77.8   0.988  0.624
CTC + MCTN Pham et al. (2019)        32.7   75.9   76.4   0.991  0.613
CTC + RAVEN Wang et al. (2019)       31.7   72.7   73.1   1.076  0.544
Gaussian                             32.4   77.6   77.5   1.005  0.634
Evidential                           32.9   78.7   78.7   0.988  0.651
MoNIG                                35.8   79.3   79.3   0.972  0.664
MoNIG (pseudo)                       34.7   79.1   79.1   0.958  0.669

Table 6: Results on CMU-MOSEI (same metrics as Table 5).

(Word Aligned) CMU-MOSEI Sentiment
Method                               Acc7   Acc2   F1     MAE    Corr
EF-LSTM                              47.4   78.2   77.9   0.642  0.616
LF-LSTM                              48.8   80.6   80.6   0.619  0.659
RAVEN Wang et al. (2019)             50.0   79.1   79.5   0.614  0.662
MCTN Pham et al. (2019)              49.6   79.8   80.6   0.609  0.670
Gaussian                             49.3   81.6   81.9   0.613  0.677
Evidential                           48.9   81.0   81.2   0.612  0.671
MoNIG                                50.2   81.8   82.0   0.602  0.682
MoNIG (pseudo)                       50.0   81.0   81.5   0.600  0.688

(Unaligned) CMU-MOSEI Sentiment
CTC + EF-LSTM Graves et al. (2006)   46.3   76.1   75.9   0.680  0.585
LF-LSTM                              48.8   77.5   78.2   0.624  0.656
CTC + RAVEN Wang et al. (2019)       45.5   75.4   75.7   0.664  0.599
CTC + MCTN Pham et al. (2019)        48.2   79.3   79.7   0.631  0.645
Gaussian                             48.8   81.0   81.3   0.618  0.676
Evidential                           49.2   81.3   81.7   0.608  0.676
MoNIG                                50.7   81.7   81.9   0.598  0.693
MoNIG (pseudo)                       49.5   81.7   82.0   0.612  0.673

For the CMU-MOSEI dataset, similarly to existing work, 16326, 1871, and 4659 samples are used as training, validation, and test data, respectively. For the CMU-MOSI dataset, we use 1284, 229, and 686 samples as training, validation, and test data, respectively. Similar to previous work Liang et al. (2018); Pham et al. (2019), we employ diverse metrics for evaluation: 7-class accuracy (Acc7), binary accuracy (Acc2), F1 score, mean absolute error (MAE), and the correlation (Corr) of the model's predictions with human annotations. We directly concatenate the features extracted from the temporal convolutional layers of the networks corresponding to the different modalities as the pseudo modality. Our method achieves competitive performance even compared with state-of-the-art multimodal sentiment classification methods.

4.5 Uncertainty Estimation

Figure 4: Sensitivity in identifying the noisy modality, with panels (a)-(d) corresponding to noise variance 0.1, 0.3, 0.5, and 1.0. We randomly select one of the two modalities and add different degrees of noise to it.

To validate the epistemic uncertainty estimation of our model on real data, we gradually increase the proportion of the training data used. According to Fig. 3(a), the overall epistemic uncertainty declines steadily as the number of training samples increases.

We illustrate the relationship between the uncertainty and different degrees of noise in Fig. 3(b)-(c). It is observed that as the noise becomes stronger, both the aleatoric and epistemic uncertainties become larger, which implies that our model can adaptively be aware of possible noise in practical tasks. It should be pointed out that data noise (AU) will also significantly affect the EU under limited training data.

To investigate the sensitivity to a noisy modality, we add different degrees of Gaussian noise (i.e., zero mean and varying variance) to one of the two modalities, selected at random. There are 500 samples associated with a noisy modality 1 and a noisy modality 2, respectively. As shown in Fig. 4, our algorithm can effectively distinguish which modality is noisy for different samples. Overall, the proposed method can capture the global uncertainty (Fig. 3) and also has a very effective perception of modality-specific noise. Accordingly, the proposed algorithm can provide a potential explanation for erroneous predictions.

4.6 Ablation Study

Dataset  Method  var = 0.01  var = 0.05  var = 0.1
Sup      EVD     15.75       28.91       44.80
         Ours    13.95       17.38       21.82
CT       EVD     2.45        3.40        5.35
         Ours    0.97        1.34        2.23
Table 7: Comparison between EVD Amini et al. (2020) and ours under different noise variances (RMSE ↓).

Robustness to noise. We add different degrees of Gaussian noise (i.e., zero mean and varying variance) to one of the two modalities, selected at random, and then compare our method with EVD Amini et al. (2020) (concatenating the original features). The results in Table 7 validate that our method, by taking the modality-specific uncertainty into account during integration, estimates the uncertainty well and maintains promising performance, while this is difficult for the counterpart, i.e., EVD with concatenated features.

Dataset  Average  Weighted (AU/EU)  Ours
Sup      13.01    13.17/13.03       12.19
CT       2.87     5.71/1.11         0.91
Table 8: Comparison between different decision fusion strategies (RMSE ↓).

Comparison with different decision fusion strategies. To clarify which part contributes to the improvements, we compare different decision fusion strategies: average, weighted average with AU, and weighted average with EU. Since our method employs the modality-specific uncertainty to dynamically guide the integration for different samples, it performs best, as shown in Table 8.

Method    AU (EVD/Ours)  EU (EVD/Ours)
UEIR (%)  13.73/12.65    12.37/11.70
Table 9: Quantitative evaluation of the estimated uncertainty (UEIR ↓).

Effectiveness of uncertainty estimation. To quantitatively investigate the uncertainty estimation, we define a rank-based criterion that directly measures the inconsistency between the estimated uncertainty and the predictive error. The uncertainty-error inconsistency rate (UEIR) is defined as UEIR = P/Q, where P is the number of sample pairs whose RMSE and uncertainty are in the opposite order (e.g., a pair (i, j) with a smaller error for i but a larger uncertainty for i), and Q is the total number of pairs of test samples. A smaller value implies a better uncertainty estimation. A more detailed description of Table 8 and Table 9 is given in the supplementary material.
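The UEIR is easy to compute from per-sample errors and uncertainties; a sketch with a function name of our own (multiply by 100 for the percentage reported in Table 9):

```python
from itertools import combinations

def ueir(errors, uncerts):
    """Uncertainty-error inconsistency rate: the fraction of test-sample pairs whose
    predictive error and estimated uncertainty are ordered oppositely."""
    pairs = list(combinations(range(len(errors)), 2))
    inconsistent = sum(
        1 for i, j in pairs
        if (errors[i] - errors[j]) * (uncerts[i] - uncerts[j]) < 0
    )
    return inconsistent / len(pairs)

# Perfectly aligned uncertainty gives UEIR = 0; one swapped pair out of three raises it:
print(ueir([1.0, 2.0, 3.0], [0.1, 0.2, 0.3]))  # 0.0
print(ueir([1.0, 2.0, 3.0], [0.2, 0.1, 0.3]))  # 0.3333333333333333
```

In this sketch, ties in either error or uncertainty are counted as consistent.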

5 Conclusion

In this paper, we propose a novel trustworthy multimodal regression model, which elegantly characterizes both modality-specific and fused uncertainties and can thus provide trustworthiness for the final output. This is quite important for real-world (especially cost-sensitive) applications. The introduced pseudo modality further enhances the exploration of complementarity among multiple modalities, producing more stable performance. The proposed model is a unified learnable framework and can thus be efficiently optimized. Extensive experimental results on diverse applications and conditions validate that our model achieves impressive regression accuracy and also obtains reasonable uncertainty estimates to support trustworthy decision making. It would be interesting to apply the proposed algorithm to more real-world applications in the future. Moreover, it is also important to extend the regression algorithm to trustworthy multimodal classification.


This work was supported in part by the National Natural Science Foundation of China under Grant 61976151, 61732011, and the Natural Science Foundation of Tianjin of China under Grant 19JCYBJC15200. We thank Alexander Amini (Massachusetts Institute of Technology), the author of the paper Amini et al. (2020), for his generous help in explaining his work and providing useful reference materials.


  • [1] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. Rajendra Acharya, V. Makarenkov, and S. Nahavandi (2020) A review of uncertainty quantification in deep learning: techniques, applications and challenges. arXiv:2011.06225.
  • [2] A. Amini, W. Schwarting, A. Soleimany, and D. Rus (2020) Deep evidential regression. In NeurIPS.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, and D. Parikh (2015) VQA: visual question answering. International Journal of Computer Vision 123 (1), pp. 4–31.
  • [4] J. Antorán, J. U. Allingham, and J. M. Hernández-Lobato (2020) Depth uncertainty in neural networks. In NeurIPS.
  • [5] S. Bahrami, F. Dornaika, and A. Bosaghzadeh (2021) Joint auto-weighted graph fusion and scalable semi-supervised learning. Information Fusion 66 (1), pp. 213–228.
  • [6] T. Baltrušaitis, C. Ahuja, and L. Morency (2018) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [7] C. M. Bishop (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc.
  • [8] H. Cho, Y. Seo, B. V. Kumar, and R. R. Rajkumar (2014) A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In ICRA.
  • [9] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019) Addressing failure prediction by learning model confidence. In NeurIPS.
  • [10] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer (2014) COVAREP: a collaborative voice analysis repository for speech technologies. In ICASSP.
  • [11] F. Dzogang, M. Lesot, M. Rifqi, and B. Bouchon-Meunier (2012) Early fusion of low level features for emotion mining. Biomedical Informatics Insights.
  • [12] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML.
  • [13] Q. Gan, S. Wang, L. Hao, and Q. Ji (2017) A multimodal deep regression Bayesian network for affective video content analyses. In ICCV.
  • [14] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
  • [15] H. Gunes and M. Piccardi (2006) Affect recognition from face and body: early fusion vs. late fusion. In IEEE International Conference on Systems, Man and Cybernetics.
  • [16] D. Hafner, D. Tran, T. P. Lillicrap, A. Irpan, and J. Davidson (2019) Noise contrastive priors for functional uncertainty. In UAI.
  • [17] K. Hamidieh (2018) A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science 154, pp. 346–354.
  • [18] Z. Han, C. Zhang, H. Fu, and J. T. Zhou (2021) Trusted multi-view classification. In ICLR.
  • [19] J. M. Hernández-Lobato and R. P. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML.
  • [20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
  • [21] G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
  • [22] Y. Jiang, Q. Xu, Z. Yang, X. Cao, and Q. Huang (2019) DM2C: deep mixed-modal clustering. In NeurIPS.
  • [23] M. I. Jordan (2009) The exponential family: conjugate priors.
  • [24] A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS.
  • [25] A. Khodayari, A. Ghaffari, S. Ameli, and J. Flahatgar (2010) A historical review on lateral and longitudinal control of autonomous vehicle motions. In International Conference on Mechanical and Electrical Technology.
  • [26] D. Kiela, E. Grave, A. Joulin, and T. Mikolov (2018) Efficient large-scale multi-modal classification. In AAAI.
  • [27] A. D. Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? Does it matter? Structural Safety 31 (2), pp. 105–112.
  • [28] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS.
  • [29] M. A. Lee, M. Tan, Y. Zhu, and J. Bohg (2020) Detect, reject, correct: crossmodal compensation of corrupted sensors. arXiv:2012.00201.
  • [30] X. Li, C. Shen, Q. Shi, A. Dick, and A. Hengel (2012) Non-sparse linear representations for visual tracking with online reservoir metric learning. In CVPR.
  • [31] P. P. Liang, Z. Liu, A. Zadeh, and L. P. Morency (2018) Multimodal language analysis with recurrent multistage fusion. In Meeting of the Association for Computational Linguistics.
  • [32] P. P. Liang, Z. Liu, Y. H. Tsai, Q. Zhao, R. Salakhutdinov, and L. Morency (2019) Learning representations from imperfect time series data via tensor rank regularization. arXiv:1907.01011.
  • [33] D. J. MacKay (1992) Bayesian interpolation. Neural Computation 4 (3), pp. 415–447.
  • [34] D. Molchanov, A. Ashukha, and D. P. Vetrov (2017) Variational dropout sparsifies deep neural networks. In ICML.
  • [35] J. Mukhoti and Y. Gal (2018) Evaluating Bayesian deep learning methods for semantic segmentation. CoRR.
  • [36] J. Mukhoti, V. Kulharia, A. Sanyal, S. Golodetz, P. Torr, and P. Dokania (2020) Calibrating deep neural networks using focal loss. In NeurIPS.
  • [37] R. M. Neal (2012) Bayesian Learning for Neural Networks. Springer Science & Business Media.
  • [38] W. Ouyang, C. Xiao, and X. Wang (2014) Multi-source deep learning for human pose estimation. In CVPR.
  • [39] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • [40] R. J. Perrin, A. M. Fagan, and D. M. Holtzman (2009) Multimodal techniques for diagnosis and prognosis of Alzheimer's disease. Nature 461 (7266), pp. 916–922.
  • [41] H. Pham, P. P. Liang, T. Manzini, L. Morency, and B. Póczos (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In AAAI.
  • [42] S. Poria, E. Cambria, and A. Gelbukh (2015) Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In EMNLP.
  • [43] K. Qaddoum and E. L. Hines (2012) Reliable yield prediction with regression neural networks. In WSEAS International Conference on Systems Theory and Scientific Computation.
  • [44] H. Qian et al. (2018) Big data Bayesian linear regression and variable selection by normal-inverse-gamma summation. Bayesian Analysis 13 (4), pp. 1011–1035.
  • [45] D. Ramachandram and G. W. Taylor (2017) Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Processing Magazine.
  • [46] M. Sensoy, L. M. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In NeurIPS.
  • [47] Y. Shi, S. Narayanaswamy, B. Paige, and P. H. S. Torr (2019) Variational mixture-of-experts autoencoders for multi-modal deep generative models. In NeurIPS.
  • [48] J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. V. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In NeurIPS.
  • [49] D. Ulmer and G. Cinà (2021) Know your limits: uncertainty estimation with ReLU classifiers fails at reliable OOD detection. arXiv:2012.05329.
  • [50] J. van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal (2020) Uncertainty estimation using a single deep deterministic neural network. In ICML.
  • [51] Y. Wang, Y. Shen, Z. Liu, P. P. Liang, A. Zadeh, and L. Morency (2019) Words can shift: dynamically adjusting word representations using nonverbal behaviors. In AAAI.
  • [52] Y. Xia, F. Liu, Y. Dong, J. Cai, and H. Roth (2020) 3D semi-supervised learning with uncertainty-aware multi-view co-training. In WACV.
  • [53] G. Ye, D. Liu, I. Jhuo, and S. Chang (2012) Robust late fusion with rank minimization. In CVPR.
  • [54] D. Yi, Z. Lei, and S. Z. Li (2015) Shared representation learning for heterogenous face recognition. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).
  • [55] A. Zadeh, R. Zellers, E. Pincus, and L. Morency (2016) MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv:1606.06259.
  • [56] C. Zhang, Y. Cui, Z. Han, J. T. Zhou, H. Fu, and Q. Hu (2020) Deep partial multi-view learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [57] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu (2020) Generalized latent multi-view subspace clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Appendix A Detailed derivation of the loss function

In this section, we elaborate on the training of neural networks to estimate the evidential distribution for each modality. From Bayesian probability theory, we marginalize over the likelihood parameters to obtain the model evidence. Specifically, we should maximize the following likelihood function:

$$p(y\,|\,\mathbf{m}) = \int_{\sigma^2=0}^{\infty}\!\int_{\mu=-\infty}^{\infty} p\big(y\,|\,\mu,\sigma^{2}\big)\, p\big(\mu,\sigma^{2}\,|\,\mathbf{m}\big)\, d\mu\, d\sigma^{2}, \tag{13}$$

where $(\mu,\sigma^{2})$ are the likelihood parameters and $\mathbf{m}=(\gamma,\upsilon,\alpha,\beta)$ are the evidential distribution parameters. Since exact posterior inference is in general intractable, Eq. 13 is difficult to obtain directly. Fortunately, this is a scale mixture of a Gaussian with respect to an inverse-gamma density [23]. This scale mixture is intrinsically a Student t distribution, derived as follows:

$$\begin{aligned}
p(y\,|\,\mathbf{m}) &= \int_{0}^{\infty}\!\!\int_{-\infty}^{\infty} \mathcal{N}\big(y;\,\mu,\sigma^{2}\big)\,\mathcal{N}\big(\mu;\,\gamma,\sigma^{2}/\upsilon\big)\,\Gamma^{-1}\big(\sigma^{2};\,\alpha,\beta\big)\, d\mu\, d\sigma^{2} \\
&= \int_{0}^{\infty} \mathcal{N}\big(y;\,\gamma,\sigma^{2}(1+\upsilon)/\upsilon\big)\,\Gamma^{-1}\big(\sigma^{2};\,\alpha,\beta\big)\, d\sigma^{2}.
\end{aligned}$$

Accordingly we have:

$$p(y\,|\,\mathbf{m}) = \mathrm{St}\!\left(y;\;\gamma,\;\frac{\beta(1+\upsilon)}{\upsilon\,\alpha},\;2\alpha\right).$$
For the $m$-th modality, the negative log likelihood loss is defined as in Eq. 5. Maximizing the likelihood function with the standard parameterization of the Student t distribution makes our model fit the data. In order to regularize the total evidence, we minimize the evidence of incorrect predictions by adding an incorrect-evidence penalty scaled by the error of the predictions. A well-known interpretation of the parameters of this kind of conjugate prior distribution is in terms of "virtual observations" [23]. For a sample, the mean of a NIG distribution can be interpreted as being estimated from $\upsilon$ virtual observations, and the variance as being estimated from $2\alpha$ virtual observations with the sum of squared deviations $2\beta$ [7]. Thus, the total evidence can be denoted as the sum of virtual observations, $\Phi = \upsilon + 2\alpha$, and the evidence penalty is defined as:

$$\mathcal{L}^{R} = |y-\gamma|\cdot\Phi = |y-\gamma|\,(\upsilon+2\alpha).$$

The total loss for each modality (Eq. 6) consists of these two terms, maximizing the likelihood function and regularizing the evidence.
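As a concrete sketch, the two loss terms can be implemented as follows. The parameterization follows the Student t evidence above (as in Amini et al. (2020)); the function names and the use of NumPy/SciPy are illustrative, and the total-evidence term $\upsilon + 2\alpha$ follows the virtual-observation interpretation in the text.

```python
import numpy as np
from scipy.special import gammaln

def nig_nll(y, gamma, upsilon, alpha, beta):
    """Negative log-likelihood of y under the Student t predictive that
    results from marginalizing a NIG prior over (mu, sigma^2)."""
    omega = 2.0 * beta * (1.0 + upsilon)
    return (
        0.5 * np.log(np.pi / upsilon)
        - alpha * np.log(omega)
        + (alpha + 0.5) * np.log(upsilon * (y - gamma) ** 2 + omega)
        + gammaln(alpha)
        - gammaln(alpha + 0.5)
    )

def evidence_penalty(y, gamma, upsilon, alpha):
    """Regularizer: prediction error scaled by the total virtual evidence
    Phi = upsilon + 2*alpha, penalizing confident wrong predictions."""
    return np.abs(y - gamma) * (upsilon + 2.0 * alpha)
```

The per-modality loss is then the NLL plus the penalty weighted by a trade-off coefficient, as in Eq. 6.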

Appendix B Experimental details

For the experiment on synthetic data, 800 data points are sampled from the training range, and we present the test data over a wider region. The model consists of 4 hidden layers with 100 neurons each and is trained with the Adam optimizer for 60 iterations.

For the physical and medical tasks, the model consists of 6 hidden layers and is trained with the Adam optimizer for 400 iterations. Then we impose different degrees of Gaussian noise (i.e., zero mean and varying variance σ²) in different ways to half of the test samples, which are considered OOD samples, and distinguish them by aleatoric and epistemic uncertainties. Our model still achieves relatively promising performance compared with EVD applied to the concatenated original features, validating the advantage of the uncertainty estimation of the proposed MoNIG. For human multimodal sentiment analysis, the experimental settings of all methods are consistent: a batch size of 32, trained with the Adam optimizer for 400 iterations.

Baselines. We compare our model with existing approaches, including Gaussian (GS) in Eq. 2 and Evidential (EVD) [2], where both methods are applied by concatenating features from multiple modalities in two different ways, i.e., early fusion and intermediate fusion. For early fusion, we simply concatenate the preprocessed features (data level) as a new representation, while for intermediate fusion, we train each modality with several individual neural layers and then concatenate the last layer of each modality (intermediate level) as a new representation. The new representation is then used as the multimodal input for prediction.

Comparison with different decision fusion strategies. We compare different decision fusion (Fig. 1(b)) strategies: average, weighted average with AU, and weighted average with EU. "Average" denotes the plain average of the predictions from all modalities. "Weighted average with AU" denotes that the final prediction is a weighted average of the modality-specific predictions, where each weight is derived from the aleatoric uncertainty of the corresponding modality; "weighted average with EU" derives the weights from the epistemic uncertainties instead.
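Under a NIG distribution the two uncertainties have standard closed forms, and a natural choice for the weighted strategies is inverse-uncertainty weighting. The sketch below makes that choice explicit; the exact weighting function is an illustrative assumption, not fixed by the text.

```python
import numpy as np

def nig_uncertainties(upsilon, alpha, beta):
    """Aleatoric (E[sigma^2]) and epistemic (Var[mu]) uncertainties of a
    NIG(gamma, upsilon, alpha, beta) distribution; requires alpha > 1."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (upsilon * (alpha - 1.0))
    return aleatoric, epistemic

def weighted_fusion(preds, uncertainties):
    """Weighted-average decision fusion: each modality's prediction is
    weighted by the inverse of its estimated uncertainty, so less
    uncertain modalities dominate the fused result."""
    w = 1.0 / np.asarray(uncertainties, dtype=float)
    return float(np.sum(w * np.asarray(preds, dtype=float)) / np.sum(w))
```

Plugging in AU or EU for `uncertainties` yields the two weighted baselines compared in Table 8.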

Effectiveness of uncertainty estimation. We define a rank-based criterion to measure the consistency between uncertainty and prediction error. All test samples are compared in pairs with respect to the rank relationship between uncertainty and RMSE, where the total number of pairs is $N = n(n-1)/2$ for $n$ test samples. The numbers in Table 9 indicate the proportion of pairs in which RMSE and uncertainty are in the opposite order; a smaller value implies better uncertainty estimation.

Appendix C Experiments on the rationality and effectiveness of uncertainty estimation

(a) Superconductivity
(b) CT Slices
Figure 5: Relationship between the uncertainty and RMSE.

We evaluate the aleatoric and epistemic uncertainties of each sample on the test data. According to Fig. 5, both the aleatoric and epistemic uncertainties are non-decreasing as the RMSE grows, which is quite important for evaluating the trustworthiness of the regression.

We show the modality-specific uncertainty estimation on 3-modality data (Fig. 6), which also demonstrates the effectiveness of our model in uncertainty estimation.

Figure 6: Perception of noisy modality on 3-modality data.

Overall, the experimental results shown in Fig. 6 and Fig. 7(a) clearly indicate that our method achieves promising performance in estimating modality-specific uncertainty. This uncertainty can dynamically guide the integration for different samples, while the counterpart EVD cannot be aware of the differing quality of the modalities. We also intuitively show why our method makes improvements in Fig. 7(b).

(a) Uncertainty evaluation.
(b) Integration weights.
Figure 7: Left: The relationship between the modality-specific uncertainty and the prediction error, which validates the rationality of the modality-specific uncertainty. Right: The relationship between the weight of each modality in the final prediction and the comparison of the two modalities' absolute prediction errors. "Error1 - Error2" indicates the normalized difference between Mod1's and Mod2's prediction errors, i.e., "-1" indicates that Mod1 gives a much more accurate prediction than Mod2, and vice versa. We can find that the modality with a smaller error has a larger weight in the final prediction and vice versa, which potentially reduces the impact of a noisy modality.

Appendix D Analysis of the training time and space complexity

We report the training time on the sentiment analysis task (Table 10). Although the NIG summation operation is introduced for multimodal integration, our method remains at the same level of computational complexity as the baselines.

Method GS EVD Ours
Time (s) 2819.85 2869.59 2981.52
Table 10: Training time on MOSI (platform: RTX 2080).
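The extra cost over EVD comes from the NIG summation used for fusion, which is only a handful of scalar operations per sample. The sketch below transcribes the summation as we read Definition 3.1 (treat the exact formulas as an assumption here, since the definition appears in the main text rather than this appendix).

```python
def nig_sum(p, q):
    """Sum of two NIG distributions, each given as (gamma, upsilon,
    alpha, beta). Fuses two modality-specific evidential predictions
    into one NIG distribution."""
    g1, u1, a1, b1 = p
    g2, u2, a2, b2 = q
    u = u1 + u2                                  # evidence accumulates
    g = (u1 * g1 + u2 * g2) / u                  # evidence-weighted mean
    a = a1 + a2 + 0.5
    b = b1 + b2 + 0.5 * u1 * (g1 - g) ** 2 + 0.5 * u2 * (g2 - g) ** 2
    return g, u, a, b
```

This operation is commutative and associative (Appendix E), so the fusion result does not depend on the order in which modalities are combined.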

Appendix E Proof of the proposition

The commutativity and associativity of summation in Def. 3.1 can be proved as follows: