1 Introduction
Multimodal data analysis, where the input data comes from a wide range of sources, is a relatively common task. For instance, autonomous driving vehicles may take actions based on the fusion of information provided by multiple sensors. In the medical domain, automated diagnosis often relies on data from multiple complementary modalities. Recently, we have seen the development of successful multimodal techniques, such as vision-and-sound classification [5], sound source localization [4], vision-and-language navigation [35] and organ segmentation from multiple medical imaging modalities [9, 40, 38]. However, current multimodal models typically rely on complex structures that neglect the uncertainty present in each modality. Although they can obtain promising results under specific scenarios, they are fragile when facing situations where modalities contain high uncertainty due to noise in the data or the presence of abnormal information. Such issues can reduce their prediction accuracy and limit their applicability in safety-critical applications [14].
Uncertainty is a crucial issue in many machine learning tasks because of the inherent randomness of machine learning processes. For instance, the randomness of data collection, data labeling, model initialization and training are sources of uncertainty that can result in large disagreements between models trained under similar conditions. According to
[13, 1, 18], total uncertainty comprises: 1) aleatoric uncertainty (also known as data uncertainty), representing inherent noise in the data due to issues in data acquisition or labeling; and 2) epistemic uncertainty (i.e., model or knowledge uncertainty), which is related to the model estimation of the input data and may be inaccurate due to insufficient training steps/data, poor convergence, etc. Total uncertainty is defined as:

$$\mathcal{U}_{total}(x) = \underbrace{\mathbb{E}_{\theta \sim p(\theta|\mathcal{D})}\big[\mathcal{H}\big(p(y|x,\theta)\big)\big]}_{\text{aleatoric}} + \underbrace{\mathcal{H}\big(\mathbb{E}_{\theta \sim p(\theta|\mathcal{D})}\big[p(y|x,\theta)\big]\big) - \mathbb{E}_{\theta \sim p(\theta|\mathcal{D})}\big[\mathcal{H}\big(p(y|x,\theta)\big)\big]}_{\text{epistemic}}, \quad (1)$$
where $\mathcal{D}$ indicates the given dataset, $x$ and $y$ are the inputs and outputs of the model, and $\mathcal{H}$ represents the measurement of disagreement (e.g., entropy). The aleatoric uncertainty is estimated as the expectation of the predicted disagreement of each model parameterized by $\theta$ drawn from the posterior $p(\theta|\mathcal{D})$, while the epistemic uncertainty is reflected by the disagreement between the different models parameterized by $\theta$ sampled from the posterior. In this paper, we focus on estimating total uncertainty.
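As an illustration of the decomposition in (1), the total, aleatoric and epistemic terms can be computed from a finite set of posterior samples (e.g., an ensemble or Monte Carlo dropout passes). This is a generic numpy sketch, not the paper's implementation:

```python
import numpy as np

def uncertainty_decomposition(probs):
    """Decompose total uncertainty into aleatoric and epistemic parts.

    probs: array of shape (S, C) -- class probabilities predicted by S
    models sampled from the posterior. Entropy is used as the
    disagreement measure H.
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                       # E_theta[p(y|x,theta)]
    total = -(mean_p * np.log(mean_p + eps)).sum()    # H(E_theta[p])
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=1).mean()  # E_theta[H(p)]
    epistemic = total - aleatoric                     # disagreement between models
    return total, aleatoric, epistemic

# Models that agree -> epistemic ~ 0; models that disagree -> epistemic > 0
agree = np.array([[0.7, 0.3], [0.7, 0.3]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
```

Note that with identical ensemble members the epistemic term vanishes and total uncertainty reduces to the aleatoric entropy of a single prediction.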
Existing multimodal methods typically assume that each modality contributes equally to the prediction outcome [27, 33, 9]. This strong assumption may not hold if one of the modalities leads to a highly uncertain prediction, which can damage the model performance. In general, deep learning models that can estimate uncertainty [20, 19, 2] were not designed to deal with multimodal data. These models are usually based on Bayesian learning, which suffers from slow inference and poor training convergence, or on abstention mechanisms [32], which may suffer from the low representational power of characterizing all types of uncertainty with a single abnormal class. Recently, a couple of methods have been designed to model multimodal uncertainty [14, 26], but they are limited to very specific classification and segmentation problems, or they show numerical instabilities.

In this paper, we propose a novel approach to estimate the total uncertainty present in multimodal data by measuring feature density via Cross-modal Random Network Prediction (CRNP). CRNP measures uncertainty for multimodal learning using random network predictions (RNP) [3], where the model is designed to be easily adaptable to disparate tasks (e.g., classification and segmentation) and training is based on a stable optimization that mitigates numerical instabilities. To summarize, the main contributions of this paper are:

We propose a new uncertainty-aware multimodal learning model based on a feature distribution learner built on RNP, named Cross-modal Random Network Prediction (CRNP). CRNP is designed to be easily adapted to disparate tasks (e.g., classification and segmentation) and to be robust to numerical instabilities during optimization.

This paper introduces a novel uncertainty estimation based on fitting the output of an RNP, which, from a technical viewpoint, represents a departure from the more common uncertainty estimation methods based on Bayesian learning or abstention mechanisms.
The adaptability of CRNP is shown by its application to two 3D multimodal medical image segmentation tasks and three 2D multimodal computer vision classification tasks, where the proposed model achieves state-of-the-art results on all problems. We perform a thorough analysis of multiple CRNP fusion strategies and present visualizations to validate the effectiveness of the proposed model.
2 Related Work
2.1 Multimodal Learning
Multimodal learning has attracted increasing attention in computer vision (CV) and medical image analysis (MIA). In MIA, Jia et al. [16] introduced a shared-and-specific feature representation learning method for semi-supervised multi-view learning. Dou et al. [9] proposed a chilopod-shaped multimodal learning architecture with separate feature normalization for each modality and a knowledge distillation loss function. In CV, Shen et al. [4] defined a trusted middle ground for video-and-sound source localization. In video-and-sound classification, Chen et al. [5] proposed to distill multimodal image and sound knowledge into a video backbone network through compositional contrastive learning. Also in video-and-sound classification, Patrick et al. [29, 30] brought self-supervised learning into the multimodal setting by training the networks on external data, which greatly boosted classification accuracy. Wang et al. [39] showed that multimodal features can be fused in a better manner by exchanging channels. Even though existing multimodal learning methods are successful on several tasks, they do not consider that, when reaching a decision, some modalities may be more reliable than others, which can damage the accuracy of the model.

2.2 Uncertainty-based Learning Models
Uncertainty has also been widely studied in deep learning. Corbiere et al. [7] proposed to predict a single uncertainty value with an external confidence network trained on the ground-truth class. Sensoy et al. [32] introduced the Dirichlet distribution for an overall, evidence-based classification uncertainty measurement. Kohl et al. [20] proposed a probabilistic UNet segmentation architecture that optimizes a variant of the evidence lower bound (ELBO) objective. Based on the probabilistic UNet model, Kohl et al. [19] and Baumgartner et al. [2] further extended the model in a hierarchical manner from either the backbone network or the prior/posterior networks. Jungo et al. [17] used two medical datasets to compare several uncertainty measurement models, namely: softmax entropy [12], Monte Carlo dropout [12], aleatoric uncertainty [18], ensemble methods [21] and auxiliary networks [8, 31]. In MIA, multiple uncertainty measurements have been proposed as well [36, 37, 24, 22]. However, none of the methods above are designed for multimodal tasks, and some of them contain long and complex pipelines that are not easily adaptable to new tasks. Bayesian or ensemble-based methods demand long training and inference times and converge slowly. Evidential methods have drawbacks too, the main issue being the limited representational power of the abstention class. In contrast, our proposed model, by introducing random network fitting for cross-modal uncertainty measurement, is not only technically novel, but also simple and easily adaptable to many tasks without requiring any restrictive assumption about uncertainty representation.
2.3 Combining Uncertainty and Multimodal Analysis
Some methods have studied the combination of uncertainty modeling and multimodal learning. For example, a trusted multi-view classification model has been developed by modeling multi-view uncertainties through a Dirichlet distribution and merging multimodal features via Dempster's rule [14]. However, it is rigidly designed for classification problems and cannot be easily translated to other tasks, such as segmentation. Monteiro et al. [26] took pixel-wise coherence into account by optimizing low-rank covariance matrices, applied to lung nodule and brain tumor segmentation. Nevertheless, the method by Monteiro et al. [26] requires a time-consuming step to generate binary brain masks to remove blank areas, and it is numerically unstable when training on areas of infinite covariance, such as the air outside the segmentation target (as stated in the SSN implementation [26] at https://github.com/biomediamira/stochastic_segmentation_networks). From an implementation perspective, this method [26] is also memory-intensive when indexing the identity matrix to create one-hot encodings. In contrast, in our model the uncertainty is measured by modeling the overall distribution directly from the features without constructing any second-order relation matrix, leading to a numerically more stable optimization and smaller memory consumption.
3 Cross-modal Random Network Prediction
Below, we first introduce Random Network Prediction (RNP), with a theoretical justification for its use in measuring uncertainty. Then we present CRNP training and inference with the cross-modal uncertainty measuring mechanism, which uses the RNP uncertainty prediction from one modality to enhance or suppress the outputs of the other modalities when producing a classification or segmentation prediction.
3.1 Random Network Prediction
The uncertainty of a particular modality is estimated with the RNP depicted in Fig. 1. Specifically, for each RNP, we train a prediction network to fit the outputs of a weight-fixed, randomly-initialized network for feature density modeling. The intuition is that the prediction network will fit the random network outputs better (i.e., low uncertainty) for samples populating denser regions of the feature space, and worse (i.e., high uncertainty) for samples belonging to sparser regions. This phenomenon is depicted in the graph inside Fig. 1.
Formally, we consider input images $\mathbf{x}_m$ and $\mathbf{x}_n$ from two modalities, where $m$ and $n$ represent the modalities. After the input image $\mathbf{x}_m$ passes through the encoder $f_m$ (similarly for $\mathbf{x}_n$), the features $\mathbf{z}_m$ and $\mathbf{z}_n$ of the two modalities are analyzed by each RNP module. The RNP module feeds $\mathbf{z}_m$ and $\mathbf{z}_n$ to a randomly initialized neural network $r(\cdot;\phi_r)$ with fixed weights $\phi_r$. Meanwhile, $\mathbf{z}_m$ and $\mathbf{z}_n$ are fed to a learnable prediction network $p(\cdot;\phi_p)$ with parameters $\phi_p$. The prediction network has the same output space but a different structure from the random network, where the capacity of $p$ is smaller than that of $r$ to prevent potential trivial solutions. The cost function used to train the RNP module is based on the mean squared error (MSE) between the outputs of the prediction and random networks:

$$\ell_{RNP}(\phi_p) = \frac{1}{N}\sum_{i=1}^{N}\left\| p(\mathbf{z}_i;\phi_p) - r(\mathbf{z}_i;\phi_r)\right\|_2^2, \quad (2)$$

where $N$ denotes the number of training samples and $\mathbf{z}_i$ is a feature from either modality. The cost function in (2) provides a simple yet powerful supervisory signal that enables the prediction network to learn the uncertainty measuring function.
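A minimal numpy sketch of the RNP idea follows. The network sizes and the closed-form least-squares fit are our simplifications of the learned prediction network; the point is that the prediction error is low for features from the dense training region and high for features far from it:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_network(z, W1, W2):
    # fixed, randomly-initialized target network r(z; phi_r); weights never trained
    return np.maximum(z @ W1, 0.0) @ W2   # one-hidden-layer ReLU MLP

# Features of one modality: training samples from a dense region of feature space
d, k = 16, 8
Z_train = rng.normal(0.0, 1.0, size=(500, d))
W1 = rng.normal(size=(d, 32))
W2 = rng.normal(size=(32, k))
targets = random_network(Z_train, W1, W2)

# Prediction network p(z; phi_p): smaller capacity (linear), fit in closed
# form by least squares -- the minimizer of the MSE loss in Eq. (2)
Wp, *_ = np.linalg.lstsq(Z_train, targets, rcond=None)

def rnp_uncertainty(z):
    # per-sample squared error ||p(z) - r(z)||^2: low in dense regions,
    # high for samples from sparse / unseen regions of the feature space
    err = z @ Wp - random_network(z, W1, W2)
    return (err ** 2).sum(axis=-1)

u_in = rnp_uncertainty(rng.normal(0.0, 1.0, size=(200, d))).mean()
u_out = rnp_uncertainty(rng.normal(5.0, 1.0, size=(200, d))).mean()  # shifted, off-distribution
```

The off-distribution samples produce a much larger mean prediction error, which is exactly the density signal CRNP uses as an uncertainty map.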
3.2 Theoretical Support for Uncertainty Measurement
The RNP has a strong relation with uncertainty measurement. Let us consider a regression process on a set of perturbed data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. In a Bayesian setting, the objective is to minimize the distance between the ground truth $y_i$ and the sum of a prior $r(\cdot)$, randomly sampled from a Gaussian, and an additive posterior term $f_\theta(\cdot)$, with a regularization on $\theta$. Formally, the optimization is as follows:

$$\min_\theta \sum_{i=1}^{N} \left\| y_i - \big( f_\theta(x_i) + r(x_i) \big) \right\|_2^2 + \lambda \|\theta\|_2^2, \quad (3)$$

where, according to Lemma 3 in [28], the sum $f_\theta + r$ is an approximator of the genuine posterior. If we fix the targets $y_i$ at zero, then the objective is equivalent to minimizing the distance between the posterior $f_\theta$ and the randomly sampled prior $r$. Thus, each output element of the randomized function or the prediction function can be viewed as a member of a set of weight-shared ensemble functions [3]. The prediction error can therefore be viewed as an estimate of the variance of the ensemble, i.e., an uncertainty estimate.
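To make the zero-target step explicit (a sketch in the notation above, with the regularizer dropped): setting $y_i = 0$ in (3) gives

```latex
\min_{\theta} \sum_{i=1}^{N} \left\| 0 - \big( f_{\theta}(x_i) + r(x_i) \big) \right\|_2^2
  \;=\; \min_{\theta} \sum_{i=1}^{N} \left\| f_{\theta}(x_i) - \big( {-r(x_i)} \big) \right\|_2^2 ,
```

and since the Gaussian prior is symmetric, $-r$ has the same distribution as $r$. Minimizing the objective with zero targets is therefore the same problem as training a prediction network to fit a fixed, randomly sampled network, which is exactly the RNP objective in (2).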
3.3 Training and Inference of CRNP
This section introduces our proposed CRNP, which fuses the multiple modalities with their inferred uncertainties to produce the final predictions (e.g., classification or segmentation), as shown in Fig. 2. During the multimodal fusion phase, the features $\mathbf{z}_m$ and $\mathbf{z}_n$ of the two modalities are cross-attended by the uncertainty maps produced by the RNP modules of both modalities. The uncertainty map for modality $m$ is represented as:

$$\mathbf{u}_m = \big( p(\mathbf{z}_m;\phi_p) - r(\mathbf{z}_m;\phi_r) \big)^2, \quad (4)$$

and similarly for $\mathbf{u}_n$ for modality $n$. The feature cross-attended by the uncertainty maps is represented by:

$$\tilde{\mathbf{z}}_m = g\big(\mathbf{z}_m,\ \hat{\mathbf{u}}_n \odot \mathbf{z}_m\big), \quad (5)$$

where $g(\cdot,\cdot)$ represents the operator that fuses the original and cross-attended features, $\hat{\mathbf{u}}_n$ is the channel-wise normalized CRNP uncertainty map, and $\odot$ is the element-wise product operator. $\tilde{\mathbf{z}}_n$ is similarly defined as in (5). Different fusion operations are thoroughly discussed in Sec. 4.5.1.
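A minimal numpy sketch of the uncertainty-guided fusion in (4)-(5). The tensor shapes and the min-max normalization are our assumptions (the paper does not spell out the normalization function), and the three fusion modes anticipate the variants compared in Sec. 4.5.1:

```python
import numpy as np

def normalize(u):
    # channel-wise min-max normalization of the uncertainty map to [0, 1]
    u_min = u.min(axis=0, keepdims=True)
    u_max = u.max(axis=0, keepdims=True)
    return (u - u_min) / (u_max - u_min + 1e-12)

def cross_attend(z_m, u_n, mode="residual"):
    """Fuse features z_m of modality m with the normalized uncertainty
    map u_n predicted for the *other* modality n."""
    attended = normalize(u_n) * z_m              # element-wise product
    if mode == "residual":                       # default CRNP fusion: addition
        return z_m + attended
    if mode == "replace":                        # naive replacement
        return attended
    if mode == "concat":                         # channel concatenation
        return np.concatenate([z_m, attended], axis=-1)
    raise ValueError(mode)

rng = np.random.default_rng(0)
z_m = rng.normal(size=(64, 8))       # 64 spatial positions, 8 channels (illustrative)
u_n = rng.normal(size=(64, 8)) ** 2  # squared prediction error from modality n
fused = cross_attend(z_m, u_n)       # same shape as z_m
```

With the "residual" mode, regions where the other modality is uncertain receive an extra boost on top of the original features rather than replacing them.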
We utilize self-attention to further fuse the features $\tilde{\mathbf{z}}_m$ and $\tilde{\mathbf{z}}_n$, taking both unimodal and cross-modal relations between feature elements into consideration. As shown in Fig. 2, we first concatenate $\tilde{\mathbf{z}}_m$ and $\tilde{\mathbf{z}}_n$ to form the query, key and value inputs for the self-attention module with $\mathbf{z} = [\tilde{\mathbf{z}}_m, \tilde{\mathbf{z}}_n]$. Then the output of the self-attention is denoted by:

$$\mathbf{a} = \mathrm{softmax}\!\left( \frac{(\mathbf{z} W_Q)(\mathbf{z} W_K)^\top}{\sqrt{d_k}} \right) \mathbf{z} W_V, \quad (6)$$

where $W_Q$, $W_K$ and $W_V$ are linear projection weights for queries, keys and values, respectively, and $d_k$ refers to the dimension of the queries, keys and values. The decoder after the multimodal fusion is denoted by $d_m$ (similarly for $d_n$), whose input is the space of the output from the cross-modal RNP module and whose output is the classification simplex (output from softmax). Note that although the annotations of multimodal data are similar, they can have significant differences, particularly in segmentation tasks. Hence, without loss of generality, we may need multiple separate decoders, one for each modality; multiple decoders are not needed in tasks where the multimodal annotations are exactly the same. For segmentation problems, the output of $d_m$ is a simplex per pixel. The training of CRNP alternates between training the RNP modules using (2) and training the whole model. During RNP training, only the weights of the prediction network inside the RNP are updated by minimizing (2), while all other CRNP weights are kept fixed. During the training of the whole model, all CRNP weights are updated, except for the weights of the prediction network of the RNP. The whole-model training minimizes the multi-class cross-entropy loss for classification, or the Dice and element-wise cross-entropy losses for segmentation.
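The self-attention fusion in (6) can be sketched as follows (single head, numpy; the token count and dimensions are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the concatenated
    cross-attended features z = [z_m; z_n]."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k))  # relations across all tokens,
    return attn @ v                         # i.e., unimodal and cross-modal

rng = np.random.default_rng(0)
tokens, c, d_k = 6, 8, 8                    # e.g., 3 tokens per modality, concatenated
z = rng.normal(size=(tokens, c))
Wq, Wk, Wv = (rng.normal(size=(c, d_k)) for _ in range(3))
out = self_attention(z, Wq, Wk, Wv)
```

Because the tokens of both modalities sit in one sequence, the attention matrix contains both within-modality and across-modality entries, which is the stated motivation for using self-attention over cross-attention alone.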
During inference, CRNP receives multimodal inputs, where each modality branch estimates an uncertainty output that weights the other modality, and the results of both modalities are fused to produce the final prediction. CRNP works by assigning large weights to the other modality when the current modality is uncertain. When both modalities have large uncertainties, the final prediction relies on a balanced analysis of both modalities. For the analysis of more than two modalities, the uncertainty map for a particular modality, say $m$, in (4) is computed by summing the MSE results produced by all other modalities $n$, with $n \neq m$. The decoders $d_m$ and $d_n$ of the two modalities can be separate or weight-shared, depending on the corresponding output requirements.
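For more than two modalities, the aggregation described above reduces to a sum over the other modalities' MSE maps. A minimal sketch (the modality names and toy map values are hypothetical):

```python
import numpy as np

def multiway_uncertainty(mse_maps, m):
    """Uncertainty map used to attend modality m when more than two
    modalities are present: sum of the MSE maps of all modalities n != m."""
    return sum(u for n, u in mse_maps.items() if n != m)

# hypothetical per-modality MSE maps, two feature positions each
mse_maps = {"flair": np.array([0.1, 0.4]),
            "t1":    np.array([0.2, 0.1]),
            "t2":    np.array([0.3, 0.2])}
u_flair = multiway_uncertainty(mse_maps, "flair")  # = t1 map + t2 map
```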
4 Experiments
4.1 Datasets
Medical Image Segmentation Datasets. We conduct experiments on two publicly available multimodal 3D segmentation datasets: the Multi-Modality Whole Heart Segmentation dataset (MMWHS) and the Multimodal Brain Tumor Segmentation Challenge 2020 dataset (BraTS2020). The MMWHS dataset contains 20 CTs and 20 MRs for training/validation and another 40 CTs and 40 MRs for testing [41]. Seven classes (background excluded) are considered for each pixel, and the two modalities have individual ground truth (GT) for each CT or MR. The BraTS2020 dataset has 369 cases for training/validation and another 125 cases for evaluation, where each case (with four modalities, namely Flair, T1, T1CE and T2) shares one segmentation GT. The evaluation is performed online (https://ipp.cbica.upenn.edu/categories/brats2020). Four classes (background included) are considered for each pixel.
Computer Vision Classification Datasets. We also validate our method on three computer vision classification datasets, namely Handwritten (https://archive.ics.uci.edu/ml/datasets/Multiple+Features), CUB [34] and Scene15 [10]. The Handwritten dataset contains 2,000 samples from six views and poses a ten-class classification problem. CUB contains 11,788 bird images from 200 different categories; following Han et al. [14], we adopt the first ten classes and two modalities (image and text features) extracted by GoogleNet and doc2vec. Scene15 includes three modalities and contains 4,485 images from 15 indoor and outdoor classes.
4.2 Implementation Details
Medical Image Segmentation Tasks. To keep the comparison fair, all models evaluated on MMWHS and BraTS2020 use the 3D UNet (with 3D convolution and normalization) as the backbone network. On MMWHS, we adopt the official test set proposed by Zhuang et al. [41] (40 CTs and 40 MRs) for testing; on BraTS2020, we evaluate all models on the online validation set. For the overall performance evaluation, the models were trained for 100,000 iterations on MMWHS and 180,000 iterations on BraTS2020 without model selection. Following Dou et al. [9], our hyperparameter tuning and ablations are conducted on MMWHS with 16 CTs and 16 MRs for training and 4 CTs and 4 MRs for validation. The batch size is set to 2. A stochastic gradient descent optimizer with a momentum of 0.99 is used for model training. The same initial learning rate is used on both datasets, with a cosine annealing [23] learning rate schedule. For the reproduction of Probability UNet [20], we use the prior/posterior mean instead of randomly sampling a latent variable for prediction. The methods are evaluated with the Dice score and Jaccard index on MMWHS, and with the Dice score and Hausdorff95 index on BraTS2020. For the cross-modal RNP modules, the randomized network is made up of 3 depth-wise convolutional hidden layers and the prediction network has 2 depth-wise convolutional hidden layers. Between every two layers, both the randomized and prediction networks adopt LeakyReLU as their activation function, with a small negative slope. The RNP output dimension is set to 256 for both tasks. For the performance evaluation, the CRNP is placed at the bottleneck of the 3D UNet backbone. For the ensemble version of CRNP on both datasets, following Wang et al. [40], we average the logits of 3 CRNP models to reduce the prediction variance.
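The cosine annealing schedule [23] mentioned above follows the standard form; this is a sketch with an illustrative initial value, since the paper's exact learning rate is not reproduced here:

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init, lr_min=0.0):
    """Standard cosine annealing: decay lr_init to lr_min over total_steps."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos_factor

# e.g., over the 100,000 MMWHS iterations with a hypothetical lr_init of 0.01:
lrs = [cosine_annealing_lr(t, 100_000, 0.01) for t in (0, 50_000, 100_000)]
```

The learning rate starts at `lr_init`, passes through the midpoint of the range at half the schedule, and ends at `lr_min`.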
Computer Vision Classification Tasks. For the evaluation on the computer vision datasets, we follow [14] and split the data into 80% for training and 20% for testing. To keep the comparison fair, we uniformly trained all models for 500 epochs without model selection and then evaluated them on the test set. The Adam optimizer with weight decay and coefficients (0.9, 0.999) is adopted. Following Han et al. [14], we use accuracy and multi-class AUROC as evaluation metrics. We use similar setups for the cross-modal RNP modules as on the medical data, with the following differences: the RNP output dimension is set to 32 and the CRNP is placed at the layer before the fully connected layer. The CRNP model is trained end-to-end without any pre-training or post-processing, and the hyperparameters do not require much effort to tune.
4.3 Medical Image Segmentation Model Performance
Performance on MMWHS Dataset. We compare our approach with: Individual (CT or MR single-modality segmentation with separate 3D UNets), 3D UNet (multimodal fusion by concatenation), the multimodal learning model Ummkd [9], and the uncertainty model Probability UNet [20], which proposes a prior net to approximate the posterior distribution in a latent space, combining the knowledge of inputs and ground truth. (We also tried SSN [26], but it requires the creation of one-hot encodings that are memory-intensive for seven classes on the MMWHS dataset.) The evaluation is based on the Dice scores of the segmentation of the left ventricle blood cavity (LV), the myocardium of the left ventricle (Myo), the right ventricle blood cavity (RV), the left atrium blood cavity (LA), the right atrium blood cavity (RA), the ascending aorta (AA), the pulmonary artery (PA) and the Whole Heart (WH). All results on MMWHS data are obtained using the official evaluation toolkit (http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs/).
Table 1: Dice scores on the MMWHS test set (CRNP* denotes the ensemble version of CRNP).
Models  LV  Myo  RV  LA  RA  AA  PA  WH  
CT  Individual  0.9297  0.8943  0.8597  0.9254  0.8701  0.9335  0.7833  0.8989 
3D UNet  0.9138  0.8781  0.8822  0.9274  0.8680  0.9088  0.8239  0.8957  
Ummkd  0.9145  0.9066  0.8410  0.9157  0.8853  0.8928  0.7579  0.8734  
ProbUNet  0.9071  0.8775  0.8978  0.9262  0.8657  0.9318  0.8425  0.8997  
CRNP (Ours)  0.9369  0.9036  0.9076  0.9375  0.8885  0.9538  0.8628  0.9187  
CRNP* (Ours)  0.9373  0.9060  0.9085  0.9366  0.8910  0.9503  0.8629  0.9193  
MR  Individual  0.8777  0.7923  0.6146  0.5686  0.7528  0.5854  0.3993  0.6729 
3D UNet  0.8850  0.7723  0.8559  0.8548  0.8676  0.8551  0.7964  0.8535  
Ummkd  0.8721  0.7966  0.8086  0.8577  0.8278  0.7998  0.7224  0.8211  
ProbUNet  0.8742  0.7389  0.8332  0.8495  0.8531  0.8537  0.7895  0.8386  
CRNP (Ours)  0.8962  0.7787  0.8605  0.8637  0.8748  0.8736  0.7969  0.8615  
CRNP* (Ours)  0.8963  0.7811  0.8742  0.8850  0.8688  0.8692  0.8329  0.8758 
Table 2: Comparison with the methods from the official MMWHS challenge report [41] on whole heart segmentation.
CT  MR  
Models  Dice  Jaccard  Dice  Jaccard 
GUT  0.9080  0.8320  0.8630  0.7620 
KTH  0.8940  0.8100  0.8550  0.7530 
CUHK1  0.8900  0.8050  0.7830  0.6530 
CUHK2  0.8860  0.7980  0.8100  0.6870 
UCF  0.8790  0.7920  0.8180  0.7010 
SIAT  0.8490  0.7420  0.6740  0.5320 
UT  0.8380  0.7420  0.8170  0.6950 
UB1  0.8870  0.7980  0.8690  0.7730 
UB2      0.8740  0.7780 
UOE  0.8060  0.6970  0.8320  0.7200 
Ours  0.9193  0.8486  0.8758  0.7814 
As shown in Tab. 1, our proposed CRNP and its ensemble version achieve 7 out of the 8 best Dice results on both CT and MR. On CT (Tab. 1), compared to the second-best models, CRNP raises the LV Dice score from 0.9297 to 0.9369 and the PA Dice score from 0.8425 to 0.8628. On the whole-heart Dice score, CRNP outperforms the second-best model by 1.9%. The ensemble version of CRNP further improves the segmentation accuracy. A similar result is observed on MR: compared to the second-best models, CRNP raises the LV Dice score from 0.8850 to 0.8962 and the AA Dice score from 0.8551 to 0.8736, and increases the whole-heart MR Dice from 0.8535 to 0.8615. The model ensemble further improves the performance.
Interestingly, the Individual model obtains accurate results on CT (0.8989 WH Dice). However, its performance drops drastically on MR (0.6729 WH Dice), with particularly poor accuracy on RV, AA and PA. When both modalities are considered (3D UNet model), performance increases substantially, which shows the limitations of relying on a single modality, especially for MR segmentation. The proposed CRNP outperforms the 3D UNet by a large margin. Ummkd [9] performs consistently well on Myo on both CT and MR; we hypothesize that its domain-specific normalization and knowledge distillation loss contribute more to Myo segmentation than to other organs. Probability UNet models a posterior latent space rather than a deterministic prediction, which may explain its performance. In general, the CT segmentation results are better than MR, which resonates with the conclusion from [41].
From the perspective of the number of parameters, the randomized network is made up of 3 convolutional hidden layers and the prediction network has 2 convolutional hidden layers, so the increase in parameters is minimal. More specifically, the parameter counts of the competing methods are: 1) UNet: 41.05M; 2) Ummkd (with UNet backbone for a fair comparison): 41.05M; and 3) ProbUNet: 57.44M. Our CRNP has 42.18M parameters, of which the RNP module accounts for 0.29M and the attention module for 0.84M.
We also compare the proposed CRNP model with the state-of-the-art models reported by the official challenge report [41]. The results are shown in Tab. 2. On whole heart segmentation, CRNP achieves particularly accurate Dice scores and Jaccard indexes for both CT and MR. Compared to the second-best models, our CRNP model increases the Dice score from 0.9080 to 0.9193 on CT and from 0.8740 to 0.8758 on MR. Similar results are observed for the Jaccard index.
Performance on BraTS2020 dataset. Developing automated segmentation models to delineate intrinsically heterogeneous brain tumors is the main goal of the BraTS2020 Challenge. Following [38], we compare the proposed CRNP model with several strong methods, including 3D UNet [6], Basic VNet [25], Deeper VNet [25], Residual 3D UNet, ModalPairing [40] and TransBTS [38], as well as the uncertainty-aware models ProbUNet [20] and SSN [26], which models aleatoric uncertainty by considering spatial coherence. We evaluate the Dice and Hausdorff95 indexes of all models on three tumor regions: the enhancing tumor (ET); the tumor core (TC), which consists of the ET and the necrotic and non-enhancing tumor core; and the whole tumor (WT), which contains the TC and the peritumoral edema.
Table 3: Dice and Hausdorff95 results on the BraTS2020 online validation set (* denotes the ensemble version).
Dice  Hausdorff95  
Models  ET  WT  TC  ET  WT  TC 
3D UNet [6]  0.6876  0.8411  0.7906  50.9830  13.3660  13.6070 
Basic VNet [25]  0.6179  0.8463  0.7526  47.7020  20.4070  12.1750 
Deeper VNet [25]  0.6897  0.8611  0.7790  43.5180  14.4990  16.1530 
Residual 3D UNet  0.7163  0.8246  0.7647  37.4220  12.3370  13.1050 
ProbUNet [20]  0.7392  0.8782  0.7955  36.2458  6.9518  7.7183 
SSN [26]  0.6795  0.8420  0.7866  43.6574  14.6945  19.5171 
ModalPairing* [40]  0.7850  0.9070  0.8370  35.0100  4.7100  5.7000 
TransBTS [38]  0.7873  0.9009  0.8173  17.9470  4.9640  9.7690 
CRNP (Ours)  0.7887  0.9086  0.8372  26.5972  4.0490  6.0040 
CRNP* (Ours)  0.7902  0.9109  0.8550  26.4682  4.1096  5.3337 
In Tab. 3, our models achieve 5 out of the 6 best results. Compared with the second-best model, CRNP improves the ET Dice score from 0.7873 to 0.7887 and the WT Dice score from 0.9070 to 0.9086. Similar results are observed on the Hausdorff95 indexes. Note that the ModalPairing model adopts an ensemble strategy; when applying an ensemble strategy to CRNP, the results improve even further: the WT Dice of CRNP* reaches 0.9109, the TC Dice reaches 0.8550 (an increment of more than one percentage point), and the TC Hausdorff95 improves to 5.3337. These performance improvements show the effectiveness of the proposed CRNP model.
4.4 Computer Vision Classification Model Performance
In this section, we show results that demonstrate the effectiveness of CRNP on multiple CV classification tasks. The evaluation metrics are accuracy and multi-class AUROC on the Handwritten, CUB and Scene15 datasets. Following Han et al. [14], the comparison includes multiple uncertainty-aware models: Monte Carlo dropout (MCDO) [11], which applies dropout at inference as a Bayesian approximation; deep ensemble (DE) [21], which uses an ensemble strategy to reduce uncertainty; uncertainty-aware attention (UA) [15], which creates uncertainty attention maps from a learned Gaussian distribution; evidential deep learning (EDL) [32], which predicts an extra Dirichlet distribution for all logits based on evidence; and trusted multi-view classification (TMC) [14], a multi-view version of EDL.

Table 4: Accuracy and multi-class AUROC on the three computer vision classification datasets.
Data  Metric  MCDO [11]  DE [21]  UA [15]  EDL [32]  TMC [14]  CRNP 
Handwritten  Acc  0.9737  0.9830  0.9745  0.9767  0.9851  0.9925 
AUROC  0.9970  0.9979  0.9967  0.9983  0.9997  0.9996  
CUB  Acc  0.8978  0.9019  0.8975  0.8950  0.9100  0.9167 
AUROC  0.9929  0.9877  0.9869  0.9871  0.9906  0.9961  
Scene15  Acc  0.5296  0.3912  0.4120  0.4641  0.6774  0.7057 
AUROC  0.9290  0.7464  0.8526  0.9141  0.9594  0.9734 
As shown in Tab. 4, the CRNP model outperforms its counterparts on 5 out of 6 measures across the datasets. CRNP performs particularly well on Scene15, increasing the accuracy from 0.6774 to 0.7057 (a 2.83% improvement) and the AUROC from 0.9594 to 0.9734 (a 1.4% improvement). CRNP also achieves promising results on the Handwritten and CUB data. On the AUROC of Handwritten, CRNP obtains slightly worse but comparable results to TMC (0.9996 vs. 0.9997).
4.5 Ablation Study
4.5.1 Effectiveness of Each Component
In the ablation study, we examine each component of the proposed CRNP. The “Base” model is the plain multimodal 3D UNet with dual branches; “CA” means cross-attention, assigning the query from one modality while keeping the key and value from the other modality; “SA” means applying self-attention as we propose. We conducted the ablation on the validation split of the MMWHS dataset and measured the average Dice score of each organ over CT and MR. As shown in Tab. 5, compared with the Base 3D UNet model, the CRNP model improves performance across multiple organs (around 1% increment in Dice scores), with especially obvious improvements on Myo, LA, RA, AA and WH. From the table, we observe that either cross-attention or self-attention can further boost the model performance, but applying the self-attention described in Sec. 3.3 produces the best results (6 best results out of 8) across multiple organs. This is mainly because the self-attention on the fused multimodal features not only models the cross-modal relations, but also considers unimodal attention.
Table 5: Ablation study of the CRNP components on the MMWHS validation set (average Dice over CT and MR).
Models  LV  Myo  RV  LA  RA  AA  PA  WH 
Base  0.9334  0.8596  0.8876  0.8932  0.8794  0.8239  0.8168  0.8706 
CRNP  0.9324  0.8685  0.8644  0.9007  0.8957  0.9216  0.8225  0.8865 
CRNP+CA  0.9323  0.8683  0.8802  0.9147  0.9116  0.9098  0.8194  0.8909 
CRNP+SA  0.9356  0.8891  0.8814  0.9232  0.8987  0.9148  0.8277  0.8958 
4.5.2 Discussion of Different CRNP Fusion functions
In terms of the different CRNP fusion functions that can be applied as the fusion operator in (5) (Sec. 3.3), we compare and discuss three types, as shown in Tab. 6: (a) “Replace” represents a naive replacement of the original modality features by the uncertainty-map-attended features; (b) “Concat” applies a concatenation operation on the original modality features and the uncertainty-map-attended features; and (c) “Residual”, the default fusion strategy of the proposed CRNP, denotes an addition performed between the two feature tensors. This experiment is conducted on the MMWHS dataset, averaging the CT and MR Dice results. From the results, we note that all three fusion functions have pros and cons. However, the “Residual” model performs better (4 best results out of 8) than the other functions. This advantage is more noticeable on RA, AA and PA, where more than 1% improvement in Dice score is gained.
Table 6: Comparison of different CRNP fusion functions on MMWHS (average Dice over CT and MR).
Models  LV  Myo  RV  LA  RA  AA  PA  WH 
Replace  0.9342  0.8688  0.8688  0.8970  0.8812  0.9074  0.8128  0.8815 
Concat  0.9327  0.8676  0.8798  0.9031  0.8781  0.9098  0.8042  0.8822 
Residual  0.9324  0.8685  0.8644  0.9007  0.8957  0.9216  0.8225  0.8865 
4.6 Visualization
We also conduct a visualization experiment in Fig. 3 that shows the MMWHS segmentation visualization (Sub Fig.1), TSNE visualization of in and out of distribution data points produced by the uncertainty maps from the RNP module on the CT images from MMWHS (Sub Fig.2), and the CRNP uncertainty heatmaps for BraTS2020 images (Sub Fig.3). As the two cases from validation set shown in Sub Fig.(1), (a) (c) are segmented by the Base model and (b) (d) are from CRNP. The color masks denote the segmentation results (e.g., pink) overlaid on the ground truth (e.g., purple). The obvious segmentation differences are highlighted by yellow boxes. When comparing segmentation from two models, we can notice that our CRNP has better segmentation results, especially on the organ edges. This is mainly because organ edges contain more uncertain regions. The proposed CRNP can perceive uncertain segmented regions within one modality and assign more weights to the other one. By leveraging this information, CRNP is able to alleviate segmentation uncertainties in organ edges. Moreover, we visualize the in and out of distribution uncertainty maps processed by TSNE in Sub Fig.(2). Following Han et al. [14], we consider the original features as the in distribution data and noisy features modified by additive Gaussian noise as the out of distribution data. Then, these samples are fed into the crossmodal RNP modules to get the uncertainty map predictions. The TSNE is able to clearly split these uncertainty predictions into two clusters. This shows further evidence of the effectiveness of our CRNP model to estimate uncertainties. In Sub Fig.(3), we show the CRNP uncertainty heatmaps for a BraTS image, where the maps are estimated in the feature space and mapped back to the original image space. 
In this figure, (a), (c), (e) and (g) are the FLAIR, T1, T1ce and T2 modalities; (b), (d), (f) and (h) are the corresponding CRNP uncertainty maps (brighter pixels indicate higher uncertainty); and (i) and (j) are the ground-truth (GT) segmentation and the CRNP prediction. Note that the high-uncertainty regions are concentrated around the areas with brain tumors, which is reasonable since tumors are sparsely represented in the feature space, resulting in a large disagreement between RNP's random and prediction networks. Also note that the FLAIR image has a stronger tumor signal than the other modalities, producing larger uncertainties for those modalities. In turn, this larger uncertainty prompts the fusion to pay more attention to these areas in the other modalities.
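The mapping of feature-space uncertainty back to image space can be sketched as below. The sizes (a 24x24 feature map, a 240x240 BraTS-style slice) and the nearest-neighbour interpolation are assumptions for illustration; the paper does not specify how the maps are resampled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a feature-space uncertainty map from the RNP module
# and the image resolution it is mapped back to (BraTS-style 240x240 slice).
feat_unc = rng.random((24, 24))
H, W = 240, 240

# Nearest-neighbour upsampling back to image space (the exact interpolation
# is an assumption), then min-max normalization so brighter = more uncertain.
sh, sw = H // feat_unc.shape[0], W // feat_unc.shape[1]
heatmap = np.kron(feat_unc, np.ones((sh, sw)))
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
```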
5 Conclusions
In this paper, we proposed an Uncertainty-aware Multimodal Learning model, named Cross-modal Random Network Prediction (CRNP). CRNP measures the total uncertainty in the feature space of each modality to better guide multimodal fusion. To the best of our knowledge, CRNP is the first approach to explore random network prediction to estimate uncertainty and fuse multimodal data. CRNP has a stable training process compared with a recent multimodal approach that relies on potentially unstable covariance measures to estimate uncertainty [26], and it can be easily transferred between different prediction tasks. Experiments on two medical image segmentation datasets and three computer vision classification datasets verify the effectiveness of the proposed CRNP model, and ablation and visualization studies further validate CRNP as an effective multimodal analysis method.
References
 [1] (2021) A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion 76, pp. 243–297.
 [2] (2019) PhiSeg: capturing uncertainty in medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 119–127.
 [3] (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
 [4] (2021) Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16867–16876.
 [5] (2021) Distilling audio-visual knowledge by compositional contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7016–7025.
 [6] (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
 [7] (2019) Addressing failure prediction by learning model confidence. Advances in Neural Information Processing Systems 32.
 [8] (2018) Leveraging uncertainty estimates for predicting segmentation quality. arXiv preprint arXiv:1807.00502.
 [9] (2020) Unpaired multi-modal segmentation via knowledge distillation. IEEE Transactions on Medical Imaging.
 [10] (2005) A Bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 2, pp. 524–531.
 [11] (2015) Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
 [12] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
 [13] (2021) A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342.
 [14] (2021) Trusted multi-view classification. arXiv preprint arXiv:2102.02051.
 [15] (2018) Uncertainty-aware attention for reliable interpretation and prediction. Advances in Neural Information Processing Systems 31.
 [16] (2020) Semi-supervised multi-view deep discriminant representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (7), pp. 2496–2509.
 [17] (2019) Assessing reliability and challenges of uncertainty estimations for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 48–56.
 [18] (2017) What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems 30.
 [19] (2019) A hierarchical probabilistic U-Net for modeling multi-scale ambiguities. arXiv preprint arXiv:1905.13077.
 [20] (2018) A probabilistic U-Net for segmentation of ambiguous images. Advances in Neural Information Processing Systems 31.
 [21] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30.
 [22] (2021) Dual-consistency semi-supervised learning with uncertainty quantification for COVID-19 lesion segmentation from CT images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 199–209.
 [23] (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
 [24] (2021) Efficient semi-supervised gross target volume of nasopharyngeal carcinoma segmentation via uncertainty rectified pyramid consistency. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 318–329.
 [25] (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
 [26] (2020) Stochastic segmentation networks: modelling spatially correlated aleatoric uncertainty. Advances in Neural Information Processing Systems 33, pp. 12756–12767.
 [27] (2016) Fully convolutional networks for multi-modality isointense infant brain image segmentation. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 1342–1345.
 [28] (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31.
 [29] (2020) Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298.
 [30] (2021) Space-time crop & attend: improving cross-modal video representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10560–10572.
 [31] (2018) Real-time prediction of segmentation quality. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 578–585.
 [32] (2018) Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems 31.
 [33] (2018) Multimodal learning from unpaired images: application to multi-organ segmentation in CT and MRI. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 547–556.
 [34] (2011) The Caltech-UCSD Birds-200-2011 dataset.
 [35] (2020) Soft expert reward learning for vision-and-language navigation. In European Conference on Computer Vision, pp. 126–141.
 [36] (2021) Tripled-uncertainty guided mean teacher model for semi-supervised medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 450–460.
 [37] (2021) Medical matting: a new perspective on medical segmentation with uncertainty. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 573–583.
 [38] (2021) TransBTS: multimodal brain tumor segmentation using transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 109–119.
 [39] (2020) Deep multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems 33, pp. 4835–4845.
 [40] (2020) Modality-pairing learning for brain tumor segmentation. In International MICCAI Brainlesion Workshop, pp. 230–240.
 [41] (2019) Evaluation of algorithms for multi-modality whole heart segmentation: an open-access grand challenge. Medical Image Analysis 58, pp. 101537.