Multidimensional Uncertainty-Aware Evidential Neural Networks

12/26/2020 ∙ by Yibo Hu, et al. ∙ Virginia Polytechnic Institute and State University ∙ The University of Texas at Dallas

Traditional deep neural networks (NNs) have significantly contributed to the state-of-the-art performance in classification tasks across various application domains. However, NNs have not considered the inherent uncertainty in data associated with the class probabilities, where misclassification under uncertainty may easily introduce high risk in decision making in real-world contexts (e.g., misclassifying objects on roads can lead to serious accidents). Unlike Bayesian NNs, which indirectly infer uncertainty through weight uncertainties, evidential NNs (ENNs) have recently been proposed to explicitly model the uncertainty of class probabilities and use it for classification tasks. An ENN formulates the predictions of an NN as subjective opinions and uses a deterministic NN to learn, from data, a function that collects the evidence forming those opinions. However, the ENN is trained as a black box without explicitly considering the inherent uncertainty in data with its different root causes, such as vacuity (i.e., uncertainty due to a lack of evidence) or dissonance (i.e., uncertainty due to conflicting evidence). By considering multidimensional uncertainty, we proposed a novel uncertainty-aware evidential NN called WGAN-ENN (WENN) for solving an out-of-distribution (OOD) detection problem. We took a hybrid approach that combines a Wasserstein Generative Adversarial Network (WGAN) with ENNs to jointly train a model with prior knowledge of a certain class, which has high vacuity for OOD samples. Via extensive empirical experiments based on both synthetic and real-world datasets, we demonstrated that the uncertainty estimated by WENN can significantly help distinguish OOD samples from boundary samples. WENN outperformed other competitive counterparts in OOD detection.




Code Repositories


Multidimensional Uncertainty-Aware Evidential Neural Networks ( AAAI2021)



Deep Learning (DL) models have recently gained tremendous attention in the data science community. Despite their superior performance in various decision making tasks, inherent uncertainty derived from data based on different root causes has not been sufficiently explored. Predictive uncertainty estimation using Bayesian neural networks (BNNs) has been explored for classification and regression in computer vision applications Kendall and Gal (2017). That work considered well-known uncertainty types, such as aleatoric uncertainty (AU) and epistemic uncertainty (EU), where AU only considers data uncertainty caused by statistical randomness (e.g., observation noise) while EU refers to model uncertainty introduced by limited knowledge or ignorance in collected data. On the other hand, in belief/evidence theory, Subjective Logic (SL) Jøsang et al. (2018) considered vacuity, which is caused by a lack of evidence, as the key dimension of uncertainty. In addition to vacuity, SL also defines other types of uncertainty, such as dissonance (e.g., uncertainty due to conflicting evidence) or vagueness (e.g., uncertainty due to multiple beliefs on the same observation).

Although conventional deep NNs (DNNs) have been commonly used to solve classification tasks, uncertainty associated with the predicted classes has received far less attention in NNs, even though the risk introduced by misclassification may bring disastrous consequences in real-world situations, such as a car crash due to the misclassification of objects on roads. Recently, techniques using evidential neural networks (ENNs) Sensoy et al. (2018) have been proposed to explicitly model the uncertainty of class probabilities. An ENN treats the predictions of an NN as subjective opinions and uses a deterministic NN to learn, from data, a function that collects the evidence forming those opinions. However, the ENN is trained as a black box without explicitly considering different types of uncertainty in the data (e.g., vacuity or dissonance), which often results in overconfidence when tested with out-of-distribution (OOD) samples. We measure the extent of confidence in a given classification decision by the class probability of the predicted class (i.e., a belief in SL). Overconfidence refers to a high class probability in an incorrect class prediction. To mitigate the overconfidence issue, regularization methods have been proposed that hand-pick auxiliary OOD samples to train the model Malinin and Gales (2018); Zhao et al. (2019). However, regularization methods with such prior knowledge require a large number of OOD samples to ensure that the model’s behaviour generalizes well to the whole data space.

In this work, we propose a model called WGAN-ENN (WENN) that combines ENNs with Wasserstein GAN (WGAN) Arjovsky et al. (2017) to jointly train a model with prior knowledge of a certain class (e.g., high vacuity OOD or high dissonance in-class boundary samples) to reinforce and achieve high prediction confidence only for in-distribution (ID) regions, high vacuity only for OOD samples, and high dissonance only for ID boundary samples.

To briefly demonstrate the performance of uncertainty estimation by conventional NNs, ENNs, and our proposed WENN model, we use a simple three-class classification problem in Fig 1. We measure the entropy Shannon (1948) of the predictive class probabilities obtained by three different approaches: (i) Fig 1 (a) shows the prediction for boundary or OOD samples by traditional NNs using the softmax (i.e., BNNs that indirectly infer uncertainty through weight uncertainties) and demonstrates overconfidence; (ii) Fig 1 (b) shows the overconfidence in the prediction of OOD samples by the ENN; and (iii) Fig 1 (c) shows that WENN has high confidence only in the prediction of the ID region.

Figure 1: Uncertainty (entropy) estimation based on synthetic data: (a) using standard NNs with softmax function; (b) using ENNs; and (c) using WENN. Only (c) shows high uncertainty in both boundary and OOD regions.

This work provides the following key contributions:

  • We considered inherent uncertainties derived from different root causes by taking a hybrid approach that leverages both deep learning and a belief model (i.e., Subjective Logic or SL). Although both fields have studied uncertainty-aware approaches to tackle various kinds of decision making problems, there has been a lack of effort to leverage the merits of both. We believe this work sheds light on the direction of incorporating both fields.

  • We considered ENNs to quantify multidimensional uncertainty types in data and learn subjective opinions. In particular, the subjective opinions formulated by the SL can be easily leveraged for the quantification of multidimensional uncertainties where we measured vacuity and dissonance based on SL.

  • Our proposed WENN, combining WGAN and ENNs, can generate a sufficient amount of auxiliary OOD samples for training and use the Wasserstein distance to measure the variety of those samples. Our proposed alternating algorithm can leverage all the intermediate samples more efficiently than other regularized methods.

  • We demonstrated that WENN outperforms competitive state-of-the-art counterparts in OOD detection, achieving an AUROC 7 percentage points higher than the best counterpart in the most difficult scenario, CIFAR10 vs. CIFAR100.

Related Work

Uncertainty Quantification in Bayesian Deep Learning (BDL): Machine/deep learning (ML/DL) researchers considered aleatoric uncertainty (AU) and epistemic uncertainty (EU) based on Bayesian Neural Networks (BNNs) for computer vision applications. AU consists of homoscedastic uncertainty (i.e., constant errors for different inputs) and heteroscedastic uncertainty (i.e., different errors for different inputs) Gal (2016). A BDL framework was presented to estimate both AU and EU simultaneously in regression settings (e.g., depth regression) and classification settings (e.g., semantic segmentation) Kendall and Gal (2017). Dropout variational inference Gal and Ghahramani (2016) was proposed as one of the key approximate inference techniques in BNNs Blundell et al. (2015); Pawlowski et al. (2017). Later, distributional uncertainty was defined based on the distributional mismatch between the test and training data distributions Malinin and Gales (2018).

Uncertainty Quantification in Belief/Evidence Theory: In belief/evidence theory, uncertainty reasoning has been substantially explored in Fuzzy Logic De Silva (2018), Dempster-Shafer Theory (DST) Sentz et al. (2002), or Subjective Logic (SL) Jøsang (2016). Unlike the efforts in ML/DL above, belief/evidence theory focused on reasoning about inherent uncertainty in information resulting from unreliable, incomplete, deceptive, and/or conflicting evidence. SL considered uncertainty in subjective opinions in terms of vacuity (i.e., a lack of evidence) and vagueness (i.e., failure to discriminate a belief state) Jøsang (2016). Recently, other dimensions of uncertainty have been studied, such as dissonance (due to conflicting evidence) and consonance (due to evidence about composite subsets of state values) Jøsang et al. (2018). In DNNs, Sensoy et al. (2018) proposed ENN models to explicitly model uncertainty using SL. However, that work only considered predictive entropy to quantify uncertainty.

Out-of-Distribution Detection: Recent OOD detection approaches began to use NNs in a supervised fashion that outperformed traditional models, such as kernel density estimation and one-class support vector machines, in handling complex datasets Hendrycks and Gimpel (2016). Many of these models Liang et al. (2017); Hendrycks et al. (2018) integrated auxiliary datasets to adjust the estimated scores derived from prediction probabilities. These OOD detection models were specifically designed to detect OOD samples. In addition, well-designed uncertainty estimation models can be leveraged for OOD detection. Recently, uncertainty models Sensoy et al. (2020) have shown preliminary results on OOD detection through performance comparisons of various OOD detection models.


This section provides the background knowledge to understand this work, including: (1) subjective opinions in SL; (2) uncertainty characteristics of subjective opinion; and (3) ENNs to predict subjective opinions.

Subjective Opinions in SL

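A multinomial opinion in a given proposition is represented by $\omega_Y = (\mathbf{b}_Y, u_Y, \mathbf{a}_Y)$, where the domain is $\mathbb{Y} \equiv \{1, \ldots, K\}$, a random variable $Y$ takes values in $\mathbb{Y}$, and $K = |\mathbb{Y}| \geq 2$. The additivity requirement of $\omega_Y$ is given as $\sum_{y \in \mathbb{Y}} \mathbf{b}_Y(y) + u_Y = 1$. Each parameter indicates,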

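  • $\mathbf{b}_Y$: belief mass function over $\mathbb{Y}$;

  • $u_Y$: uncertainty mass representing vacuity of evidence;

  • $\mathbf{a}_Y$: base rate distribution over $\mathbb{Y}$, with $\sum_{y \in \mathbb{Y}} \mathbf{a}_Y(y) = 1$.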

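The projected probability distribution of a multinomial opinion is given by:

$$\mathbf{p}_Y(y) = \mathbf{b}_Y(y) + \mathbf{a}_Y(y)\, u_Y, \quad \forall y \in \mathbb{Y}. \tag{1}$$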

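Multinomial probability density over a domain of cardinality $K$ is represented by the $K$-dimensional Dirichlet PDF, where the special case with $K = 2$ is the Beta PDF as a binomial opinion. Denote by $\mathbb{Y}$ a domain of $K$ mutually disjoint elements, by $\boldsymbol{\alpha}_Y$ the strength vector over $y \in \mathbb{Y}$, and by $\mathbf{p}_Y$ the probability distribution over $\mathbb{Y}$. The Dirichlet PDF with $\mathbf{p}_Y$ as the $K$-dimensional variable is defined by:

$$\mathrm{Dir}(\mathbf{p}_Y; \boldsymbol{\alpha}_Y) = \frac{\Gamma\left(\sum_{y \in \mathbb{Y}} \boldsymbol{\alpha}_Y(y)\right)}{\prod_{y \in \mathbb{Y}} \Gamma(\boldsymbol{\alpha}_Y(y))} \prod_{y \in \mathbb{Y}} \mathbf{p}_Y(y)^{\boldsymbol{\alpha}_Y(y) - 1}, \tag{2}$$

where $\boldsymbol{\alpha}_Y(y) \geq 0$, $\mathbf{p}_Y(y) \geq 0$, and $\mathbf{p}_Y(y) \neq 0$ if $\boldsymbol{\alpha}_Y(y) < 1$.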

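We term evidence a measure of the amount of supporting observations collected from data in favor of a sample being classified into a certain class. Let $\mathbf{r}_Y(y) \geq 0$ be the evidence derived for the singleton $y \in \mathbb{Y}$. The total strength $\boldsymbol{\alpha}_Y(y)$ for the belief of each singleton $y \in \mathbb{Y}$ is given by:

$$\boldsymbol{\alpha}_Y(y) = \mathbf{r}_Y(y) + \mathbf{a}_Y(y)\, W, \tag{3}$$

where $W$ is a non-informative weight representing the amount of uncertain evidence and $\mathbf{a}_Y(y)$ is the base rate distribution. Given the Dirichlet PDF, the expected probability distribution over $\mathbb{Y}$ is:

$$\mathbb{E}_Y(y) = \frac{\boldsymbol{\alpha}_Y(y)}{\sum_{y_i \in \mathbb{Y}} \boldsymbol{\alpha}_Y(y_i)} = \frac{\mathbf{r}_Y(y) + \mathbf{a}_Y(y)\, W}{W + \sum_{y_i \in \mathbb{Y}} \mathbf{r}_Y(y_i)}. \tag{4}$$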

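The observed evidence in the Dirichlet PDF can be mapped to the multinomial opinions by:

$$\mathbf{b}_Y(y) = \frac{\mathbf{r}_Y(y)}{S}, \qquad u_Y = \frac{W}{S}, \tag{5}$$

where $S = \sum_{y_i \in \mathbb{Y}} \boldsymbol{\alpha}_Y(y_i)$. We set the base rate $\mathbf{a}_Y(y) = 1/K$ and the non-informative prior weight $W = K$, and hence $\mathbf{a}_Y(y) \cdot W = 1$ for each $y \in \mathbb{Y}$, as these are default values considered in subjective logic.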
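The mapping from evidence to a subjective opinion can be illustrated with a short NumPy sketch. It assumes the default uniform base rate and a prior weight equal to the number of classes, as stated above; the function name `opinion_from_evidence` and the example evidence vector are hypothetical and chosen only for illustration.

```python
import numpy as np


def opinion_from_evidence(evidence, prior_weight=None):
    """Map non-negative per-class evidence r to a multinomial opinion.

    Assumes a uniform base rate a(y) = 1/K and, by default, W = K,
    which are the default values stated in the text.
    """
    r = np.asarray(evidence, dtype=float)
    if r.ndim != 1 or np.any(r < 0):
        raise ValueError("evidence must be a 1-D non-negative vector")
    K = r.size
    W = float(K) if prior_weight is None else float(prior_weight)
    a = np.full(K, 1.0 / K)      # base rate distribution
    alpha = r + a * W            # total Dirichlet strength per class
    S = alpha.sum()
    b = r / S                    # belief masses
    u = W / S                    # uncertainty mass (vacuity)
    p = b + a * u                # projected probability, equals alpha / S
    return {"belief": b, "vacuity": u, "alpha": alpha, "prob": p}


# Hypothetical evidence for a three-class sample.
opinion = opinion_from_evidence([8.0, 1.0, 1.0])
```

For this example the belief masses and the vacuity sum to one, and the projected probability coincides with the Dirichlet mean, which is the additivity requirement and the expectation described above.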

Uncertainty Characteristics of Subjective Opinions

The multidimensional uncertainty dimensions of a subjective opinion based on the formalism of SL are discussed in  Jøsang et al. (2018). As we deal with a multinomial opinion in this work, we discuss two main types of uncertainty dimensions, which are vacuity and dissonance.

The main cause of vacuity (a.k.a. ignorance) is a lack of evidence or knowledge, which corresponds to the uncertainty mass, $u_Y$, of an opinion in SL as:

$$\mathrm{Vac}(\omega_Y) \equiv u_Y = K / S. \tag{6}$$

This type of uncertainty refers to uncertainty caused by insufficient information or knowledge to understand or analyze a given opinion.

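Dissonance of an opinion arises when there is an insufficient amount of evidence that can clearly support a particular belief. For example, when the same amount of evidence supports multiple extremes of belief, high dissonance is observed. Hence, the dissonance is estimated from the differences between singleton belief masses (e.g., class labels), and it leads to ‘inconclusiveness’ in decision making situations. Given a multinomial opinion with non-zero belief masses, the measure of dissonance can be obtained by:

$$\mathrm{Diss}(\omega_Y) = \sum_{y_i \in \mathbb{Y}} \frac{\mathbf{b}_Y(y_i) \sum_{y_j \in \mathbb{Y} \setminus y_i} \mathbf{b}_Y(y_j)\, \mathrm{Bal}(y_j, y_i)}{\sum_{y_j \in \mathbb{Y} \setminus y_i} \mathbf{b}_Y(y_j)}, \tag{7}$$

where the relative mass balance between a pair of belief masses $\mathbf{b}_Y(y_j)$ and $\mathbf{b}_Y(y_i)$ is expressed by:

$$\mathrm{Bal}(y_j, y_i) = \begin{cases} 1 - \dfrac{|\mathbf{b}_Y(y_j) - \mathbf{b}_Y(y_i)|}{\mathbf{b}_Y(y_j) + \mathbf{b}_Y(y_i)}, & \text{if } \mathbf{b}_Y(y_i)\, \mathbf{b}_Y(y_j) \neq 0, \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$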

Figure 2: Illustration of different vacuity and dissonance of subjective opinions based on their evidence measures.

The above two uncertainty measures (i.e., vacuity and dissonance) can be interpreted using class-level evidence measures of subjective opinions. As in Fig. 2, given three classes (1, 2, and 3), we have three subjective opinions $\{\omega_1, \omega_2, \omega_3\}$, each represented by its three-class evidence measures: $\omega_1$ represents low uncertainty (entropy, dissonance and vacuity), which implies high certainty (often represented as high confidence in a decision making context); $\omega_2$ indicates high inconclusiveness due to highly conflicting evidence, which gives high entropy and high dissonance; and $\omega_3$ shows the case of high vacuity, which is commonly observed in OOD samples. Based on our observations from Fig 2 (b) and (c), we found that entropy cannot distinguish uncertainty due to vacuity from uncertainty due to dissonance, which naturally results in an inability to distinguish boundary samples from OOD samples. However, vacuity can effectively detect OOD samples because its cause of uncertainty is a lack of evidence.
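A minimal NumPy sketch of the three uncertainty measures makes this comparison concrete. It assumes the default setting of a uniform base rate with prior weight equal to the number of classes. The three evidence vectors are hypothetical stand-ins for a confident, a conflicting, and a vacuous opinion; they are not the values used in Fig. 2.

```python
import numpy as np


def _belief(evidence):
    r = np.asarray(evidence, dtype=float)
    return r / (r.sum() + r.size)       # b = r / S with S = sum(r) + K


def vacuity(evidence):
    r = np.asarray(evidence, dtype=float)
    return r.size / (r.sum() + r.size)  # u = K / S


def dissonance(evidence):
    b = _belief(evidence)
    total = 0.0
    for i in range(b.size):
        others = np.delete(b, i)
        denom = others.sum()
        if b[i] == 0.0 or denom == 0.0:
            continue                    # zero belief contributes nothing
        # Relative mass balance between b_j and b_i (0 when b_j is zero).
        bal = np.zeros_like(others)
        nz = others > 0.0
        bal[nz] = 1.0 - np.abs(others[nz] - b[i]) / (others[nz] + b[i])
        total += b[i] * np.sum(others * bal) / denom
    return total


def normalized_entropy(evidence):
    r = np.asarray(evidence, dtype=float)
    p = (r + 1.0) / (r.sum() + r.size)  # expected class probabilities
    return float(-(p * np.log(p)).sum() / np.log(r.size))


# Hypothetical evidence vectors, not the values used in Fig. 2.
examples = {
    "confident": [50.0, 1.0, 1.0],
    "conflicting": [50.0, 50.0, 50.0],
    "vacuous": [0.0, 0.0, 0.0],
}
summary = {
    name: (normalized_entropy(r), vacuity(r), dissonance(r))
    for name, r in examples.items()
}
```

In this sketch the conflicting and the vacuous opinions both yield the maximum normalized entropy, while vacuity and dissonance separate them: the conflicting opinion has high dissonance and low vacuity, and the vacuous opinion has vacuity one and dissonance zero.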

Evidential Neural Networks (ENNs)

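ENNs Sensoy et al. (2018) are similar to classical NNs except that the softmax layer is replaced by an activation layer (e.g., ReLU) to ensure non-negative output, which is taken as the evidence vector for the predicted Dirichlet distribution. Given sample $i$, let $f(\mathbf{x}_i | \Theta)$ represent the evidence vector predicted by the network for the classification, where $\mathbf{x}_i$ is the input feature and $\Theta$ denotes the network parameters. Then, the corresponding Dirichlet distribution has parameters $\boldsymbol{\alpha}_i = f(\mathbf{x}_i | \Theta) + 1$. Let $\mathbf{y}_i$ be the ground-truth label; the Dirichlet density $\mathrm{Dir}(\mathbf{p}_i; \boldsymbol{\alpha}_i)$ is the prior on the Multinomial distribution, $\mathrm{Multi}(\mathbf{y}_i | \mathbf{p}_i)$. The following sum of squared loss is used to estimate the parameters $\Theta$ based on the sample $i$:

$$\mathcal{L}(f(\mathbf{x}_i | \Theta), \mathbf{y}_i) = \int \frac{\|\mathbf{y}_i - \mathbf{p}_i\|_2^2}{B(\boldsymbol{\alpha}_i)} \prod_{j=1}^{K} p_{ij}^{\alpha_{ij} - 1}\, d\mathbf{p}_i = \sum_{j=1}^{K} \left( y_{ij}^2 - 2 y_{ij}\, \mathbb{E}[p_{ij}] + \mathbb{E}[p_{ij}^2] \right), \tag{9}$$

where $B(\cdot)$ is the multivariate Beta function. Eq. (9) is based on class labels of training samples. However, it does not directly measure the quality of the predicted Dirichlet distributions, so the uncertainty estimates may not be accurate.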
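The sum of squared loss has a closed form under the Dirichlet distribution, because the expectation of each squared error splits into a squared bias and a variance term. The following NumPy sketch evaluates it for a single sample or a batch; it assumes Dirichlet parameters equal to evidence plus one, and the function name `enn_sos_loss` is hypothetical.

```python
import numpy as np


def enn_sos_loss(evidence, y_onehot):
    """Expected squared error between a one-hot label and p ~ Dir(evidence + 1).

    Uses E[(y_j - p_j)^2] = (y_j - E[p_j])^2 + Var(p_j), with
    E[p_j] = alpha_j / S and Var(p_j) = E[p_j] (1 - E[p_j]) / (S + 1).
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0
    y = np.asarray(y_onehot, dtype=float)
    S = alpha.sum(axis=-1, keepdims=True)
    p_hat = alpha / S
    squared_bias = (y - p_hat) ** 2
    variance = p_hat * (1.0 - p_hat) / (S + 1.0)
    return np.sum(squared_bias + variance, axis=-1)


# With no evidence the loss is large; evidence for the true class lowers it.
loss_no_evidence = enn_sos_loss([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
loss_with_evidence = enn_sos_loss([100.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Because the loss only depends on the labels of training samples, it rewards evidence for the correct class but places no direct constraint on the evidence assigned to inputs far from the training data.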

Training ENNs with Wasserstein GAN

Given the various characteristics of uncertainty based on SL, we propose a novel model that combines ENNs and WGAN to quantify multiple types of uncertainty (i.e., vacuity and dissonance) and solve classification tasks.

Regularized ENNs

Given a set of samples $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, let $P_{out}(\mathbf{x}, \mathbf{y})$ and $P_{in}(\mathbf{x}, \mathbf{y})$ be the distributions of the OOD and ID samples respectively. We propose a training method using a regularized ENN to minimize the following loss function over the parameters $\Theta$ of the model’s function $f$:

$$\mathcal{L}(\Theta) = \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim P_{in}(\mathbf{x}, \mathbf{y})} \left[ \mathcal{L}(f(\mathbf{x} | \Theta), \mathbf{y}) \right] - \beta\, \mathbb{E}_{\mathbf{x} \sim P_{out}(\mathbf{x})} \left[ \mathrm{Vac}(f(\mathbf{x} | \Theta)) \right]. \tag{10}$$

The first item (Eq. (9)) ensures a good estimation of the class probabilities of the ID samples. Since it assigns large confidence on training samples during the classification process, it also contributes to reducing the vacuity of ID samples. The second item increases the vacuity estimated by the model on OOD samples. $\beta$ is the trade-off parameter. Therefore, minimizing Eq. (10) achieves high classification accuracy, low vacuity output for ID samples, and high vacuity output for OOD samples. To ensure the model’s generalization to the whole data space, the choice of an effective $P_{out}$ is important. While some methods Lee et al. (2017); Hein et al. (2019); Sensoy et al. (2020) only use close or adversarial samples, we found that both close and far-away samples are equally important. Instead of using a hand-picked auxiliary dataset, which requires a lot of tuning Zhao et al. (2019), we used generative models to provide a sufficient variety of OOD samples.
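As a rough sketch of Eq. (10), the regularized loss can be evaluated on a batch of ID evidence vectors with labels and a batch of OOD evidence vectors. The helper names, the example batches, and the trade-off value used below are all hypothetical; in practice the evidence would come from the network output.

```python
import numpy as np


def enn_sos_loss(evidence, y_onehot):
    # Closed-form expected squared error under Dir(evidence + 1).
    alpha = np.asarray(evidence, dtype=float) + 1.0
    y = np.asarray(y_onehot, dtype=float)
    S = alpha.sum(axis=-1, keepdims=True)
    p_hat = alpha / S
    return np.sum((y - p_hat) ** 2 + p_hat * (1.0 - p_hat) / (S + 1.0), axis=-1)


def vacuity(evidence):
    r = np.asarray(evidence, dtype=float)
    K = r.shape[-1]
    return K / (r.sum(axis=-1) + K)


def regularized_enn_loss(id_evidence, id_onehot, ood_evidence, beta):
    """Batch estimate of Eq. (10): ID fit minus beta times mean OOD vacuity."""
    fit = enn_sos_loss(id_evidence, id_onehot).mean()
    ood_vacuity = vacuity(ood_evidence).mean()
    return fit - beta * ood_vacuity


# Hypothetical batches: the ID evidence is the same in both cases.
id_evidence = np.array([[20.0, 0.0, 0.0], [0.0, 15.0, 1.0]])
id_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ood_no_evidence = np.zeros((2, 3))                               # vacuity 1
ood_overconfident = np.array([[30.0, 0.0, 0.0], [0.0, 30.0, 0.0]])

loss_desired = regularized_enn_loss(id_evidence, id_onehot, ood_no_evidence, beta=1.0)
loss_overconfident = regularized_enn_loss(id_evidence, id_onehot, ood_overconfident, beta=1.0)
```

The model that assigns no evidence to OOD inputs attains the lower loss, which is the behaviour the second item is meant to encourage.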

Wasserstein Generators for OOD

We chose WGAN Arjovsky et al. (2017) as our generator because (i) it provides higher stability than the original GAN Goodfellow et al. (2014b); and (ii) it offers a meaningful loss metric by leveraging the Wasserstein distance, which can measure the distance between the generated samples and the ID region.

WGAN consists of two main components: a discriminator $D$ and a generator $G$. $G$ maps an input latent variable $\mathbf{z}$ into the generator output $G(\mathbf{z})$, and $D(\mathbf{x})$ scores how likely an input sample $\mathbf{x}$ is to come from the ID. The objective is to recover $P_{in}$ from $G(\mathbf{z})$. WGAN uses the Wasserstein distance instead of the original divergence in the GAN’s loss function, since it is continuous and differentiable for optimization. The Wasserstein distance between two distributions $P_{in}$ and $P_G$ is informally defined as the minimum cost of transporting mass to transform $P_G$ into $P_{in}$. Under the Lipschitz constraint, the loss function of WGAN can be written as:

$$\min_G \max_D\; \mathbb{E}_{\mathbf{x} \sim P_{in}} [D(\mathbf{x})] - \mathbb{E}_{\hat{\mathbf{x}} \sim P_G} [D(\hat{\mathbf{x}})], \tag{11}$$

where $P_{in}$ is the ID and $P_G$ is the generated distribution defined by $\hat{\mathbf{x}} = G(\mathbf{z})$ and $\mathbf{z} \sim p(\mathbf{z})$, which is usually sampled from a uniform or Gaussian noise distribution. WGAN employs weight clipping or gradient penalty (WGAN-GP) Gulrajani et al. (2017) to enforce the Lipschitz constraint and keep the training stable. We can estimate the Wasserstein distance $\hat{W}$ at the step after $D$ updates and before $G$ updates during the alternating training process:

$$\hat{W}(P_{in}, P_G) = \mathbb{E}_{\mathbf{x} \sim P_{in}} [D(\mathbf{x})] - \mathbb{E}_{\hat{\mathbf{x}} \sim P_G} [D(\hat{\mathbf{x}})]. \tag{12}$$

The estimated curve of $\hat{W}$ during WGAN training correlates well with the visual quality of the generated samples Arjovsky et al. (2017). When training WGAN from scratch, the initially large distance indicates that the generated samples have very low quality, i.e., they are far away from the ID region. As training progresses, the distance decreases continuously, which leads to higher sample quality. This implies that the samples are getting close to the ID region. Therefore, we adopted $\hat{W}$ to measure the variety of generated samples, which are used as prior knowledge of our model.
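The batch estimate of the Wasserstein distance is simply the difference between the mean critic score on ID samples and the mean critic score on generated samples. The toy sketch below uses one-dimensional data and a fixed 1-Lipschitz function as a stand-in for a trained discriminator; both `toy_critic` and the shifted batches are hypothetical and only illustrate how the estimate shrinks as generated samples approach the ID region.

```python
import numpy as np


def wasserstein_estimate(critic, id_batch, generated_batch):
    """Mean critic score on ID samples minus mean score on generated samples."""
    return float(np.mean(critic(id_batch)) - np.mean(critic(generated_batch)))


def toy_critic(x):
    # 1-Lipschitz stand-in for a trained D; hypothetical, for illustration only.
    return -np.asarray(x, dtype=float)


id_batch = np.array([-1.0, 0.0, 1.0])
# Generated batches that move progressively closer to the ID batch.
estimates = [
    wasserstein_estimate(toy_critic, id_batch, id_batch + shift)
    for shift in (4.0, 2.0, 0.5)
]
```

In this toy setting the estimate equals the shift between the two batches, so it decreases monotonically as the generated samples move from far away toward the ID samples.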

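However, the original WGAN is designed to generate ID samples. To make the generator recover OOD samples instead, we propose the following new WGAN loss with uncertainty regularization:

$$\min_G \max_D\; \mathbb{E}_{\mathbf{x} \sim P_{in}} [D(\mathbf{x})] - \mathbb{E}_{\hat{\mathbf{x}} \sim P_G} \left[ D(\hat{\mathbf{x}}) + \beta\, \mathrm{Vac}(f(\hat{\mathbf{x}} | \Theta)) \right], \tag{13}$$

where $\beta$ is a trade-off parameter and $\mathrm{Vac}(f(\hat{\mathbf{x}} | \Theta))$ is the uncertainty estimation from a classifier trained on ID. This regularization item enforces the generated samples to have high vacuity uncertainty.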

Jointly Training ENNs and WGAN

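To improve the OOD detection accuracy, we developed a joint training algorithm which allows ENNs to utilize various types of OOD samples generated from WGAN:

$$\min_{\Theta, G} \max_D\; \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim P_{in}} \left[ \mathcal{L}(f(\mathbf{x} | \Theta), \mathbf{y}) \right] - \beta\, \mathbb{E}_{\hat{\mathbf{x}} \sim P_G} \left[ \mathrm{Vac}(f(\hat{\mathbf{x}} | \Theta)) \right] + \mathbb{E}_{\mathbf{x} \sim P_{in}} [D(\mathbf{x})] - \mathbb{E}_{\hat{\mathbf{x}} \sim P_G} [D(\hat{\mathbf{x}})], \tag{14}$$

where the sum of the first two items is the regularized ENN loss (Eq. (10)) and the sum of the latter two represents WGAN-GP with uncertainty regularization (Eq. (13)). We omit the gradient penalty item here for simplicity. We pretrain the ENN classifier to a good classification accuracy and then feed it into Algorithm 1 to accelerate the training.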

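Require: Pretrained ENN with weights $\Theta$, initial $D$’s weights $w$, initial $G$’s weights $\theta$. $n_d$, $n_e$: the number of iterations of $D$ and ENN per iteration. $\beta$: the trade-off weight. $m$: the batch size.
repeat
    for $t = 1, \ldots, n_d$ do
        Sample $\{\mathbf{x}_i\}_{i=1}^{m} \sim P_{in}$ and $\{\mathbf{z}_i\}_{i=1}^{m} \sim p(\mathbf{z})$
        Update $D$ by descending its gradient (with penalty):
            $\nabla_w \frac{1}{m} \sum_{i=1}^{m} \left[ D(G(\mathbf{z}_i)) - D(\mathbf{x}_i) \right]$
    end for
    Get the approximated Wasserstein distance:
        $\hat{W} = \frac{1}{m} \sum_{i=1}^{m} \left[ D(\mathbf{x}_i) - D(G(\mathbf{z}_i)) \right]$
    Update $G$ once by ascending its gradient:
        $\nabla_\theta \frac{1}{m} \sum_{i=1}^{m} \left[ D(G(\mathbf{z}_i)) + \beta\, \mathrm{Vac}(f(G(\mathbf{z}_i) | \Theta)) \right]$
    for $t = 1, \ldots, n_e$ do
        Sample $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{m} \sim P_{in}$ and $\{\mathbf{z}_i\}_{i=1}^{m} \sim p(\mathbf{z})$
        Update ENN by descending the gradient:
            $\nabla_\Theta \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f(\mathbf{x}_i | \Theta), \mathbf{y}_i)$
        Update ENN by ascending the gradient:
            $\nabla_\Theta \frac{\beta}{m} \sum_{i=1}^{m} \mathrm{Vac}(f(G(\mathbf{z}_i) | \Theta))$
    end for
until convergence
Algorithm 1 Alternating minimization for WGAN and ENN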

Each batch of OOD samples corresponds to a different, decreasing $\hat{W}$. This enables the ENN to utilize a wide range of OOD samples. The ENN improves its uncertainty estimation based on the OOD samples from $G$, and $G$ simultaneously achieves a better OOD quality thanks to the uncertainty estimation from the ENN. We stop the training when $\hat{W}$ converges, since otherwise the ENN may forget the effect of previous OOD samples far away from the ID region. Fig 6 illustrates the change of $\hat{W}$ and the output vacuity during the training process.


We first illustrated the advantage of evidential uncertainty in (1) a synthetic experiment. Then we compared our approach with the recent uncertainty estimation models on (2) predictive uncertainty estimation and (3) adversarial uncertainty estimation. (4) We also investigated the effect of different types of uncertainties on the OOD detection.

Figure 3: A synthetic experiment that shows the benefit of using evidential uncertainties (vacuity, dissonance) over entropy ranged in [0, 1]. Entropy cannot distinguish ID and OOD samples at class boundaries.


Fig 3 shows three Gaussian distributed classes with equidistant means and tied isotropic variance (a large degree of class overlap). Using our proposed WENN method, a small NN with 2 hidden layers of 500 neurons each was trained on this data. Fig 3 demonstrates that entropy and the two evidential uncertainties, vacuity and dissonance, exhibit distinctive behaviors. Entropy is high both in overlapping regions and in regions far away from the training data, which makes it hard to distinguish ID and OOD samples at class boundaries. In contrast, vacuity is low over the whole region of training data and high outside that region. This allows the ID region to be clearly distinguished from the OOD region. In addition, high dissonance is observed along the decision boundaries, which indicates a high chance of misclassification.
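A dataset of this kind can be generated with a few lines of NumPy. The sketch below places the three class means at the vertices of an equilateral triangle so that they are equidistant, and it uses one shared isotropic standard deviation. The radius, standard deviation, sample count, and seed are hypothetical defaults, not the settings used for Fig 3.

```python
import numpy as np


def make_three_gaussians(n_per_class=200, radius=1.0, sigma=1.0, seed=0):
    """Three 2-D Gaussian classes with equidistant means and tied isotropic variance."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(3) / 3.0
    # Means on a circle, 120 degrees apart, hence pairwise equidistant.
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    X = np.concatenate(
        [rng.normal(loc=m, scale=sigma, size=(n_per_class, 2)) for m in means]
    )
    y = np.repeat(np.arange(3), n_per_class)
    return X, y, means


X, y, means = make_three_gaussians()
```

Choosing a standard deviation that is large relative to the distance between means produces the heavy class overlap described above, which is the regime where entropy alone confuses boundary samples with OOD samples.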

(a) MNIST: ID vs OOD
(b) CIFAR10: ID vs OOD
(c) CIFAR10: Successful vs Fail predictions
Figure 4: Boxplots of predictive uncertainty of different models on ID/OOD testing datasets: (a) MNIST test set (ID) vs notMNIST (OOD); (b) Samples from the first five (ID) vs the last five (OOD) categories of CIFAR10 test set; (c) Successful and failed predictions in CIFAR10 test set (ID). Our model uses entropy (Ent), vacuity (Vac), dissonance (Dis) as a measure of uncertainty, while other models use entropy.

Predictive Uncertainty Estimation

Comparing Schemes: We compared our model with the following schemes: (i) L2 refers to the standard deterministic NNs with softmax output and weight decay; (ii) DP uses Dropout as the uncertainty estimation model (i.e., BNNs) Gal and Ghahramani (2016); (iii) DE refers to Deep Ensembles Lakshminarayanan et al. (2017); (iv) BBB refers to Bayes by Backprop Blundell et al. (2015); (v) BBH refers to Bayes by Hypernet Pawlowski et al. (2017), a Bayesian model based on implicit weight uncertainty; (vi) MNF refers to the variational approximation based model in Louizos and Welling (2017); (vii) ENN uses the evidential DL model Sensoy et al. (2018); (viii) GEN combines ENNs and an Adversarial Autoencoder Sensoy et al. (2020); and (ix) Ent, Vac and Dis are the entropy, vacuity and dissonance of our proposed model WENN.

Setup: We followed the same experiments in Sensoy et al. (2020) on MNIST LeCun et al. (1998) and CIFAR10 Krizhevsky et al. (2009): (1) For the MNIST dataset, we used the same LeNet-5 architecture from Sensoy et al. (2020). We trained the model on the MNIST training set and tested on the MNIST testing set as ID samples and notMNIST Bulatov (2011) as OOD samples; and (2) For the CIFAR10 dataset, we used ResNet-20 He et al. (2016) as the classifier in all the models considered in this work. We trained on the samples for the first five categories {airplane, automobile, bird, cat, deer} in the CIFAR10 training set (i.e., ID), while using the other five categories {ship, truck, dog, frog, horse} as testing OODs. We used the source code of BBH and GEN, which also contained implementations of the other approaches, but we changed all the classifiers to the same LeNet-5 and ResNet-20, respectively. All the baselines were trained with their default best parameters, and we report the average results. For WENN, we used the same hyperparameter settings of Algorithm 1 in all the experiments, which were tuned considering both the OOD detection performance and the ID classification accuracy. For more details, refer to the Appendix and our source code.

Metrics: Our proposed model used vacuity and dissonance estimated based on Eq. (6) and Eq. (7). To be consistent with other works that used entropy as a measure of uncertainty, we also compared the predictive entropy, normalized to the range $[0, 1]$. We used boxplots to show the distribution of predictive uncertainty.

Results: To evaluate OOD uncertainty quantification, Fig 4 (a) and (b) show the boxplots of the predictive uncertainty under all models trained with MNIST and CIFAR10 and tested on their corresponding ID and OOD datasets. The ideal model is expected to have a low ID box and a high OOD box, i.e., the model is certain about the ID inputs while totally uncertain about the OOD inputs. To measure ID uncertainty quantification, Fig 4 (c) shows the boxplots of different models’ predictive uncertainty for correctly classified and misclassified examples in the CIFAR10 ID testing set. The figure indicates that a standard network is overconfident on any input. BBH performs the best among all the Bayesian models on MNIST but fails to separate ID from OOD on CIFAR10. ENN and GEN perform well on MNIST. However, Fig 4 (b) and (c) show that on CIFAR10 they assign the same high uncertainty to misclassified ID samples as to OOD samples. Sensoy et al. (2020) acknowledge that ENN and GEN may classify boundary ID samples as OOD because of their high entropy. The above results all indicate the limitation of entropy in uncertainty estimation.

WENN using entropy beats other counterparts in estimating OOD uncertainty because it benefits from our algorithm using vacuity. However, WENN is more powerful when using vacuity and dissonance to measure OOD and ID uncertainty, respectively. For ID uncertainty, Fig 4 (c) illustrates that high dissonance implies conflicting evidence, which can result in misclassification. For OOD uncertainty, Fig 4 (b) and (c) show that all the ID samples, even the misclassified ones, have extremely low vacuity compared to the high vacuity of OOD samples. However, WENN still assigns medium entropy to boundary ID samples. This is consistent with the synthetic experiment’s result, showing the advantage of adopting vacuity in distinguishing boundary ID samples from OOD samples.

Adversarial Uncertainty Estimation

We also evaluated these models on CIFAR10 using adversarial examples generated by FGSM Goodfellow et al. (2014a) with different perturbation values $\epsilon$. DE is excluded because it is trained on adversarial samples. Fig 5 shows that as $\epsilon$ increases, WENN’s accuracy immediately drops to random and the uncertainty simultaneously increases to maximum entropy, i.e., WENN assigns the highest uncertainty to inputs for which it cannot make easy predictions. In this sense, it knows what it doesn’t know and does not become overconfident. We observe the same behavior on the MNIST dataset (in Appendix).
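FGSM perturbs each input by a step of fixed size in the direction of the sign of the loss gradient with respect to the input. The sketch below applies it to a linear softmax classifier, for which the input gradient of the cross-entropy loss has a simple analytic form. The linear model, its weights, and the example input are hypothetical stand-ins for the ResNet-20 classifier, and clipping to a valid pixel range is omitted.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fgsm(x, y_onehot, W, b, eps):
    """FGSM step for a linear softmax model with logits x @ W + b.

    x: (n, d) inputs, y_onehot: (n, K) labels, W: (d, K), b: (K,).
    """
    p = softmax(x @ W + b)
    grad_x = (p - y_onehot) @ W.T            # d(cross-entropy) / d(x)
    return x + eps * np.sign(grad_x)


# Hypothetical two-class model in which only the first feature matters.
W = np.array([[1.0, -1.0], [0.0, 0.0]])
b = np.zeros(2)
x = np.array([[1.0, 0.0]])
y = np.array([[1.0, 0.0]])

x_adv = fgsm(x, y, W, b, eps=2.0)
pred_clean = int(np.argmax(x @ W + b, axis=-1)[0])
pred_adv = int(np.argmax(x_adv @ W + b, axis=-1)[0])
```

With a sufficiently large perturbation the prediction flips to the wrong class, which is the regime in which an uncertainty-aware model is expected to report high uncertainty.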

Figure 5: Accuracy (solid) vs entropy (dashed) as a function of the adversarial perturbation on CIFAR10.

Out-of-Distribution Detection

Comparing Schemes: We compared with several recent methods specifically designed for OOD detection, together with the uncertainty models ENN and GEN: (i) MSP refers to maximum softmax probability, a baseline of OOD detection in Hendrycks and Gimpel (2016); (ii) CCC Lee et al. (2017) uses a GAN to generate boundary OOD samples as regularizers; (iii) ODIN calibrates the estimated confidence by scaling the logits before the softmax layer Liang et al. (2017); (iv) ACET uses adversarial examples to enhance the confidence Hein et al. (2019); (v) OE refers to Outlier Exposure Hendrycks et al. (2018), which enforces uniform confidence on the 80 Million Tiny Images dataset Torralba et al. (2008); (vi) CCU integrates Gaussian mixture models in OOD detection DL models Meinke and Hein (2019); and (vii) Ent and Vac refer to our WENN model using entropy or vacuity as scores.

Setup: We used FashionMNIST Xiao et al. (2017), notMNIST, CIFAR10, CIFAR100 Krizhevsky et al. (2009), SVHN Netzer et al. (2011), the classroom class of LSUN (i.e., LSUN_CR) Yu et al. (2015), and uniform noise as ID or OOD datasets. We used the source code in Meinke and Hein (2019), which contained implementations of the other baselines, but we used ResNet-20 for all the models except CCC. We used VGG-13 Simonyan and Zisserman (2014) for CCC because we couldn’t achieve an acceptable accuracy using ResNet-20. ODIN, OE and CCU were directly trained or calibrated on the 80 Million Tiny Images dataset. Other settings were the same as in the previous uncertainty estimation experiments.

Metrics: Our model uses vacuity to distinguish between ID and OOD samples. ENN and GEN use the entropy of the predictive probabilities as recommended in their papers. The other methods use their own OOD scores. We use the area under the ROC curve (AUROC) to evaluate the performance of the different types of uncertainty.
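AUROC can be read as the probability that a randomly chosen OOD sample receives a higher uncertainty score than a randomly chosen ID sample, with ties counted as one half. The sketch below computes it directly from this pairwise definition; the function name `auroc` and the example scores are hypothetical, and the quadratic pairwise comparison is only meant for small illustrative inputs.

```python
import numpy as np


def auroc(id_scores, ood_scores):
    """Probability that an OOD score exceeds an ID score (ties count 1/2)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (ood_s > id_s).sum() + 0.5 * (ood_s == id_s).sum()
    return float(wins / (id_s.size * ood_s.size))


# Hypothetical vacuity scores: low on ID samples, high on OOD samples.
perfect = auroc([0.05, 0.10, 0.20], [0.80, 0.90])
chance = auroc([0.10, 0.90], [0.50])
```

A score of 1.0 means the uncertainty measure ranks every OOD sample above every ID sample, whereas 0.5 corresponds to a ranking no better than chance.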

Results: Table 1 shows the AUROC performance of different approaches. WENN’s vacuity achieves the highest or near-highest AUROC in every setting and is never worse than its own entropy. ENN and GEN are not originally designed for OOD detection, and they assign the same high entropy to misclassified ID samples as to OOD samples. CCC doesn’t generalize well, and it lacks the scalability to recent deep architectures like ResNet that would ensure a better classification accuracy. The result of ACET indicates that the effect of using purely close adversarial examples is limited. Hein et al. (2019) acknowledge that ACET will yield high-confidence predictions far away from the training data. ODIN, OE and CCU are directly trained or tuned using a large auxiliary dataset, which should contain both far-away and close samples. The outperformance of WENN indicates that our algorithm using vacuity can generate and utilize sufficient OOD samples more effectively.

ID        OOD        MSP    CCC    ODIN   ACET   OE     CCU    ENN    GEN    Ent    Vac
FMNIST    MNIST      96.9   90.1   96.2   99.9   99.2   95.6   92.0   96.2   100.0  100.0
FMNIST    notMNIST   87.5   87.8   93.6   96.4   90.2   92.4   93.0   96.7   99.9   100.0
FMNIST    Uniform    93.0   91.6   96.9   95.4   94.9   100.0  99.3   100.0  99.9   100.0
CIFAR10   CIFAR100   86.3   75.0   84.0   84.0   87.1   85.2   86.0   92.5   98.6   99.5
CIFAR10   SVHN       88.9   78.6   85.4   80.5   85.1   89.6   92.1   98.9   100.0  100.0
CIFAR10   LSUN_CR    88.8   64.4   98.0   99.7   92.8   89.1   92.7   98.6   100.0  100.0
CIFAR10   Uniform    93.8   84.6   82.4   82.4   99.3   100.0  100.0  100.0  100.0  100.0
SVHN      CIFAR10    95.2   52.9   50.2   98.6   95.8   96.3   100.0  100.0  99.3   100.0
SVHN      CIFAR100   94.9   51.8   51.0   98.2   95.3   95.6   100.0  100.0  99.3   100.0
SVHN      LSUN_CR    94.9   55.5   53.9   100.0  95.6   97.0   100.0  100.0  99.8   99.9
SVHN      Uniform    95.8   53.9   53.2   100.0  96.6   100.0  100.0  100.0  99.8   100.0
  • ID: In-distribution training sets, OOD: Out-of-distribution testing sets

Table 1: AUROC for OOD detection.

To further explain how our model generates and utilizes a variety of OOD samples, Fig 6 illustrates the alternating optimization process when the model is trained on the CIFAR10 training set. The initial ENN classifier is overconfident and assigns a low vacuity of 0.2 to arbitrary inputs. As the Wasserstein distance decreases gradually, implying that the generated samples keep moving from far away to closer to the ID region, the model learns to output low vacuity on ID samples from the CIFAR10 testing set and high vacuity on OOD samples from CIFAR100 and SVHN. The output vacuity of CIFAR100 is lower than that of SVHN. This indicates that vacuity is a reasonable uncertainty metric because CIFAR100 is often considered a near-distribution outlier set for CIFAR10. However, the medium vacuity of CIFAR100 is still sufficient to separate it almost perfectly from the ID samples.

Figure 6: The change of Wasserstein distance (Dist) (normalized to [0, 1]) and the vacuity of ID samples from CIFAR10 testing set and the vacuity of OOD samples from CIFAR100 and SVHN datasets when the model is trained on CIFAR10 training set.


We proposed a novel DL model, called WENN, that combines ENNs with WGAN to jointly train a model with prior knowledge of a certain class (i.e., high vacuity OOD samples). Via extensive experiments based on both synthetic and real datasets, we showed that: (1) vacuity can distinguish boundary samples from OOD samples; and (2) the proposed model with vacuity regularization can produce and utilize various types of OOD samples successfully. Our model achieved state-of-the-art performance in both uncertainty estimation and OOD detection benchmarks.


This work is supported by NSF awards IIS-1815696 and IIS-1750911.