1 Introduction
Likelihood models are naturally regarded as having an ideal capability for detecting out-of-distribution (OOD) inputs, owing to the intuitive assumption bishop1994novelty that such models assign lower likelihoods to OOD inputs than to in-distribution (ID) inputs. However, emerging works nalisnick2018deep; hendrycks2018deep; choi2018waic; lee2017training; nalisnick2019detecting; maaloe2019biva have reported that deep generative models, such as variational autoencoders (VAEs) kingma2013auto; rezende2014stochastic, PixelCNN van2016conditional and Glow kingma2018glow, all based on likelihood models, are unable to correctly detect OOD inputs. Counterintuitively, OOD inputs are assigned higher likelihoods than ID inputs, which contradicts the acknowledged assumption. Hence, when we employ a likelihood model as a detector in OOD detection tasks or in general generation tasks, it is necessary to ensure that the adopted model possesses a good understanding of, and performance on, OOD inputs.
The phenomenon that VAEs assign higher likelihoods to OOD inputs than to ID inputs was first reported by nalisnick2018deep. Since then, it has become an increasingly hot topic in the generative model field. Several studies serra2019input; nalisnick2019detecting; butepage2019modeling have made great efforts to interpret this empirical phenomenon. For instance, butepage2019modeling demonstrated that it is caused by model assumptions and evaluation schemes: the oversimplified likelihood function assumed in the VAE model (e.g., i.i.d. Bernoulli or i.i.d. Gaussian) distorts the estimated data distribution of ID inputs, while the true likelihood function is often unknown and deviates from the assumed one; on some datasets, local evaluation under the approximated posterior also causes overconfidence. nalisnick2019detecting conjectured that regions of high likelihood may conflict with the model's typical set. serra2019input posited that input complexity has a strong impact on likelihood-based models.
To solve the OOD detection problem, various studies choi2018waic; nalisnick2018deep have suggested that likelihood models with reliable uncertainty estimation may contribute to improving OOD detection. Additionally, noise contrastive priors (NCPs) hafner2018reliable, a specific prior for neural networks in data space, encourage network weights not only to explain ID inputs but also to capture the high uncertainty of OOD samples. Inspired by these pioneering studies, we introduce a novel method, named Improved Noise Contrastive Priors Variational Autoencoder (INCPVAE), to obtain reliable uncertainty estimates from VAEs and thereby solve the OOD detection problem. However, the original NCP method is designed for classifier models and cannot be directly applied to the VAE framework. Hence, we improve the NCP loss function to make it suitable for the VAE framework. Moreover, we adapt the improved NCP to the encoder of the VAE and generate OOD samples by adding Gaussian noise to the original inputs. Furthermore, because the plain VAE likelihood fails on the OOD detection task, we exploit the INCP-KL divergence of INCPVAE, rather than the likelihood, for detecting OOD inputs. Our experiments show that INCPVAE obtains better performance than the traditional VAE and reduces overconfidence when facing OOD data. In brief, the contributions of this work are:

We improve the noise contrastive prior to make it suitable for the VAE framework (Section 3.3). To the best of our knowledge, this is the first work applying a noise contrastive prior to an unsupervised generative model to obtain reliable uncertainty estimation.

We apply tailored metrics to uncertainty estimation (Section 3.4), with which our INCPVAE framework achieves reliable uncertainty estimation and enhanced robustness.

We propose a novel OOD detection method via the INCP-KL divergence of INCPVAE (Section 3.5). Experiments demonstrate that INCPVAE gains an excellent understanding of OOD inputs and that our detection method achieves state-of-the-art (SOTA) performance on the challenging cases raised by nalisnick2018deep.
2 Related Work
OOD detection: The capability of OOD detection is vital to machine learning models. From an algorithmic perspective, there are two categories of mainstream approaches to detecting OOD samples: supervised/discriminative methods and unsupervised/generative methods daxberger2019bayesian. Most existing methods belong to the former, aiming to acquire a decision boundary or likelihood ratio between ID and OOD inputs by combining a dataset of anomalies with the training data. The supervised approaches devries2018learning; liang2017enhancing; hendrycks2016baseline; lakshminarayanan2017simple can make full use of deep discriminative models and, to some extent, prevent poorly calibrated neural networks guo2017calibration from mistakenly making high-confidence predictions on OOD inputs, benefiting various applications (such as anomaly detection vyas2018out; PidhorskyiGenerative; hendrycks2016baseline and adversarial defense song2018constructing). However, these methods only suit task-dependent scenarios, which is a severe limitation because anomalous data are usually rare or not known ahead of time in real-world settings. In contrast, the unsupervised approaches aim to solve the problem by training deep generative models in a more general manner, where density estimation is widely applied oord2016pixel; kingma2018glow. However, as mentioned in the Introduction, likelihood estimates in popular deep generative models are not reliable for OOD detection; many studies have attempted to explain and tackle this problem serra2019input; nalisnick2019detecting; butepage2019modeling, but an efficient and robust solution is still missing and urgently needed.
Uncertainty estimation: Uncertainty estimation is closely bound up with OOD detection, with the goal of yielding calibrated confidence measures for the predictive distribution. The uncertainty estimation in MC Dropout gal2016dropout, Deep Ensembles lakshminarayanan2017simple and ODIN liang2017enhancing involves presenting a calibrated predictive distribution for classifiers. An alternative solution, the variational information bottleneck (VIB) alemi2018uncertainty, conducts OOD detection via divergence estimation in latent space. However, these existing methods are model-dependent and rely heavily on task-specific information to obtain an integrated estimate of uncertainty; a more general method is highly needed. On the other hand, recent studies choi2018waic; nalisnick2018deep suggested that likelihood models with reliable uncertainty estimation may help mitigate the high OOD likelihood problem of generative models in a task-independent manner. Here, we provide a novel hybrid scheme linking uncertainty estimation with noise contrastive priors and Gaussian noise, to aid both the reliability of uncertainty estimation in VAEs and model independence in OOD detection.
3 Method
3.1 Improved Noise Contrastive Priors
hafner2018reliable proposed NCPs, a kind of data prior applied to both ID inputs and OOD inputs, where the OOD inputs are usually generated by imposing noise. In this work, to obtain the VAE's uncertainty, we modify the loss function (see below) to make the original NCPs suitable for the VAE framework. Since it is hard to generate OOD data exactly, we add Gaussian noise to ID images to generate OOD data.
Generating OOD Inputs: lee2017training reported that OOD samples can be produced by sampling, with high uncertainty, from the boundary of the ID distribution. hafner2018reliable advanced an algorithm inspired by noise contrastive estimation gutmann2010noise; mnih2013learning, in which a complement distribution is approximated using random noise. For continuous ID inputs $x$, we add Gaussian noise $\epsilon \sim \mathcal{N}(\mu_x, \sigma_x^2 I)$ to obtain OOD inputs $\tilde{x} = x + \epsilon$. The distribution density of the OOD inputs is formulated as

(1) $p_{ood}(\tilde{x}) = \int p_{id}(x)\,\mathcal{N}(\tilde{x} - x;\ \mu_x, \sigma_x^2 I)\,dx,$

where $p_{id}(x)$ is the distribution density of the ID inputs, and $\mu_x$ and $\sigma_x^2$ are the mean and variance of the Gaussian noise. In order to make the noise contrastive prior equal in all directions of the data manifold, we set $\mu_x = 0$. The variance $\sigma_x^2$ is a hyperparameter that tunes the sampling distance from the boundary of the training distribution; the complexity of the OOD inputs is correlated with this variance.

Data Priors: The data priors consist of an input prior and an output prior. To obtain a reliable uncertainty estimate from the VAE, an appropriate input prior should include OOD inputs, so that the model can obtain better performance than the baseline beyond the training distribution. A good output prior should be a high-entropy distribution that encodes high uncertainty about the VAE's target outputs given OOD inputs. The data priors are listed as follows:
(2) $\tilde{x} \sim p_{ood}(\tilde{x}), \qquad p_{ood}(z \mid \tilde{x}) = \mathcal{N}(z;\ \mu_{ood}, \sigma_{ood}^2 I),$

where $p_{ood}(\tilde{x})$ is the distribution of the OOD inputs, $\mu_{ood}$ and $\sigma_{ood}^2$ are the parameters of the OOD data output prior, and $\sigma_{ood}^2$ serves as a hyperparameter tuning the level of target-output uncertainty.
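As a concrete sketch, the OOD-generation step above amounts to a single noise injection; the function name, seed, and noise level below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def make_ood_inputs(x, sigma, rng=None):
    """Generate OOD samples by adding isotropic Gaussian noise to ID inputs.

    The noise mean is fixed to 0, matching the choice of making the noise
    contrastive prior equal in all directions of the data manifold; sigma
    tunes the sampling distance from the training distribution.
    """
    rng = np.random.default_rng(rng)
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

# Example: perturb a batch of flattened 28x28 images at noise level 0.1.
x_id = np.random.default_rng(0).uniform(0.0, 1.0, size=(64, 784))
x_ood = make_ood_inputs(x_id, sigma=0.1, rng=1)
```

Larger sigma pushes the samples further from the training distribution, mirroring the role of the variance hyperparameter above.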
Loss Function: Improved Noise Contrastive Priors (INCPs) have the merit of estimating the model's uncertainty in a way that generalizes easily to OOD samples. To train INCPs, we modify the loss function as follows:
(3) $\mathcal{L}(\theta) = \mathrm{KL}\big[p(z \mid x)\,\big\|\,q_\theta(z \mid x)\big] + \gamma\,\mathrm{KL}\big[p_{ood}(z \mid \tilde{x})\,\big\|\,q_\theta(z \mid \tilde{x})\big],$

where $p_{ood}(z \mid \tilde{x})$ denotes the OOD data output prior and $\theta$ denotes the parameters of the neural network. The hyperparameter $\gamma$ represents the trade-off between the two terms. INCPs can be trained by minimizing this loss. Notice that in Eq. 3 the first term fits the neural network to the true ID data output prior by minimizing the KL divergence, while the second term is the analogous term on the OOD data output prior. This loss function simultaneously optimizes the ID and OOD posteriors towards two distinct targets (the true ID data output prior and the assumed OOD data output prior), whereas the original NCP loss hafner2018reliable drives the ID and OOD conditional distributions towards a single target.
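When the output priors and the model's outputs are diagonal Gaussians, both KL terms in Eq. 3 have a closed form, so the loss can be sketched directly; the function names, argument layout, and the default gamma are our own assumptions:

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def incp_loss(id_prior, id_post, ood_prior, ood_post, gamma=1.0):
    """INCP loss (Eq. 3): fit the true ID output prior, plus the analogous
    term on the assumed high-uncertainty OOD output prior, traded off by gamma.

    Each argument is a (mean, variance) pair of diagonal-Gaussian parameters.
    """
    kl_id = gaussian_kl(*id_prior, *id_post)
    kl_ood = gaussian_kl(*ood_prior, *ood_post)
    return kl_id + gamma * kl_ood
```

When the model matches both priors exactly, each KL term vanishes and the loss is zero.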
3.2 Variational Autoencoder
VAEs rezende2014stochastic; kingma2013auto are a class of latent variable models optimized by maximizing the marginal likelihood of an observed variable $x$. The marginal likelihood can be written as follows:
(4) $\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(z)\,p_\theta(x \mid z)}{q_\phi(z \mid x)}\right] + \mathrm{KL}\big[q_\phi(z \mid x)\,\big\|\,p_\theta(z \mid x)\big],$
where $p(z)$ and $p_\theta(z \mid x)$ are the prior (e.g., VampPrior tomczak2017vae, Resampled Prior bauer2018resampled; here a standard normal distribution) and the true posterior, respectively. $q_\phi(z \mid x)$ is the variational posterior (encoder), modeled as a Gaussian distribution, and $p_\theta(x \mid z)$ is the generative model (decoder), modeled as a Bernoulli distribution. Both are parameterized by neural networks with parameters $\phi$ and $\theta$, respectively. However, the true posterior cannot be computed analytically. Assuming the variational posterior has arbitrarily high capacity for modeling, $q_\phi(z \mid x)$ approximates the intractable $p_\theta(z \mid x)$ and the KL divergence between them approaches zero. Thus, we train the VAE on ID (or OOD) samples to maximize the following objective, the variational evidence lower bound, called IELBO (or OELBO):
(5) $\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid x)\,\big\|\,p(z)\big],$
where $q_\phi(z \mid x)$ and $q_\phi(z \mid \tilde{x})$ are the variational posteriors matching the true posteriors $p_\theta(z \mid x)$ and $p_\theta(z \mid \tilde{x})$, which are given by the ID inputs $x$ and OOD inputs $\tilde{x}$ respectively. For a given dataset, the marginal likelihood is a constant. Substituting Eq. 5 into Eq. 4, we get
(6) $\mathrm{KL}\big[q_\phi(z \mid x)\,\big\|\,p_\theta(z \mid x)\big] = \log p_\theta(x) - \mathrm{ELBO}(x),$

which means that maximizing the IELBO is equivalent to minimizing the KL divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$.
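For reference, the single-sample ELBO of Eq. 5 for the Bernoulli-decoder, diagonal-Gaussian-encoder setup described above can be computed in a few lines; this is a minimal sketch with our own function names:

```python
import numpy as np

def elbo(x, x_logits, mu, logvar):
    """Evidence lower bound for a VAE with a Bernoulli decoder and a
    N(mu, diag(exp(logvar))) Gaussian encoder against a N(0, I) prior.

    ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); the reconstruction term is
    approximated with the single posterior sample that produced x_logits.
    """
    # Bernoulli log-likelihood from logits: x*l - log(1 + exp(l)), stably.
    log_px_z = np.sum(x * x_logits - np.logaddexp(0.0, x_logits))
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over latents.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return log_px_z - kl
```

With a posterior equal to the prior (mu = 0, logvar = 0) the KL term vanishes and the ELBO reduces to the reconstruction log-likelihood.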
3.3 INCP Variational Autoencoder
INCPVAE consists of an encoder and a decoder, with the improved NCPs imposed on the encoder network of the VAE. INCPVAE is trained on both ID and OOD inputs by jointly optimizing the IELBO and OELBO. From Eq. 5, the full ELBO of INCPVAE is:
(7) $\mathrm{ELBO}_{INCP}(x, \tilde{x}) = \mathrm{IELBO}(x) + \mathrm{OELBO}(\tilde{x}).$
Assuming the ID variational posterior has sufficiently high capacity for modeling, the true posterior $p_\theta(z \mid x)$ can be replaced by $q_\phi(z \mid x)$. Considering the definition of the OOD output prior (Eq. 2), the true OOD data posterior is:
(8) $p(z \mid \tilde{x}) = \mathcal{N}(z;\ \mu_{ood}, \sigma_{ood}^2 I),$

where $\mu_{ood}$ and $\sigma_{ood}^2$ are given by Eq. 2, and $\sigma_{ood}^2$ is the hyperparameter that determines how large we want the output uncertainty to be. The KL divergence between $q_\phi(z \mid \tilde{x})$ and $p(z \mid \tilde{x})$ (called INCP-KL) then becomes tractable and can be analytically computed. Maximizing the ELBO of INCPVAE can be replaced by minimizing the following loss function:
(9) $\mathcal{L}(\theta, \phi) = -\mathrm{IELBO}(x) + \gamma\,\mathrm{KL}\big[q_\phi(z \mid \tilde{x})\,\big\|\,p(z \mid \tilde{x})\big],$

where the hyperparameter $\gamma$ sets the trade-off between the two terms.
3.4 Metrics for Uncertainty Estimation
We propose the ELBO Ratio, based on the objective variational evidence lower bound (ELBO), for quantitative uncertainty evaluation of a variational autoencoder. From Eq. 5, we evaluate the ELBO (IELBO) on all ID test samples and take the maximum (called $\mathrm{ELBO}_{max}$). The ELBO Ratio is defined as
(10) $R(x) = \frac{\mathrm{ELBO}(x)}{\mathrm{ELBO}_{max}},$

where $R(x)$ measures the VAE's degree of uncertainty on the data $x$: the greater the scalar $R(x)$, the higher the uncertainty.
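A minimal sketch of one plausible reading of Eq. 10, assuming the ratio divides a sample's ELBO by the maximum ID ELBO (the numeric values are hypothetical):

```python
import numpy as np

def elbo_ratio(elbo_x, elbo_id_max):
    """ELBO Ratio (Eq. 10): a sample's ELBO relative to the maximum ELBO over
    ID test samples. ELBOs are negative, so a sample far below the ID maximum
    yields a ratio well above 1, i.e. higher estimated uncertainty."""
    return elbo_x / elbo_id_max

# Hypothetical values: ID samples sit near the ID maximum, OOD far below it.
id_elbos = np.array([-95.0, -100.0, -110.0])
ratio_id = elbo_ratio(id_elbos, id_elbos.max())
ratio_ood = elbo_ratio(-300.0, id_elbos.max())
```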
3.5 INCPKL Ratios for OOD Detection
The density estimate of a VAE is often used for OOD detection, but on some dataset pairs (e.g., FashionMNIST vs. MNIST, CIFAR10 vs. SVHN) the OOD inputs receive higher likelihoods than the ID inputs. To address this problem, ren2019likelihood proposed Likelihood Ratios for OOD detection. In Eq. 9, the second term of the INCPVAE loss is the INCP-KL, the KL divergence between the OOD variational posterior and the true OOD posterior; we therefore hypothesize that the INCP-KL divergence of test samples drawn from the OOD distribution (e.g., Baseline+Noise, Baseline; see Fig 1) is smaller than that of samples from other distributions. Inspired by ren2019likelihood, we propose an INCP-KL Ratio for OOD detection. We evaluate the INCP-KL on all OOD samples and take the maximum (called $\mathrm{KL}_{max}$). The INCP-KL Ratio is defined as
(11) $R_{KL}(x) = \frac{\mathrm{KL}\big[q_\phi(z \mid x)\,\big\|\,p(z \mid \tilde{x})\big]}{\mathrm{KL}_{max}},$

where if $R_{KL}(x) \le 1$, the test sample $x$ is classified as OOD data; if $R_{KL}(x) > 1$, $x$ is not OOD data.
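The detection rule can be sketched as follows, under the assumption that the ratio divides a test sample's INCP-KL by the maximum INCP-KL observed on known OOD samples (all numeric values are hypothetical):

```python
def incpkl_ratio(kl_x, kl_ood_max):
    """INCP-KL Ratio (Eq. 11): a test sample's INCP-KL divergence relative to
    the maximum INCP-KL observed on known OOD samples."""
    return kl_x / kl_ood_max

def detect_ood(kl_x, kl_ood_max):
    """Decision rule: ratios at most 1 are flagged as OOD, since genuine OOD
    samples are hypothesized to have the smaller INCP-KL divergences."""
    return incpkl_ratio(kl_x, kl_ood_max) <= 1.0

# Hypothetical divergences: OOD calibration samples give kl_ood_max = 2.0.
assert detect_ood(kl_x=1.5, kl_ood_max=2.0)      # flagged as OOD
assert not detect_ood(kl_x=9.0, kl_ood_max=2.0)  # not OOD
```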
4 Experiments and Results
4.1 Experimental settings
In this section, we design experiments on multiple datasets to evaluate our method and compare with other baseline methods. The experiments involve uncertainty estimation and OOD detection.
Firstly, we conduct experiments to generate OOD samples and evaluate the uncertainty estimates of VAE and INCPVAE on different datasets (see details in Appendix A). For the uncertainty evaluation experiments (details in Section 4.2), we train VAE and INCPVAE with samples only from the pure training sets, and then run inference on test samples with different noise levels (see Fig 1). We quantify the uncertainty of VAE and INCPVAE using the ELBO Ratio metric (see details in Section 3.4). To estimate the uncertainty reliably, we compute the likelihoods of the traditional VAE and INCPVAE on 1000 random samples from the ID test sets (see the results in Fig 2(a-d)).
Secondly, we follow the settings of nalisnick2018deep and conduct the following two experiments (details in Appendix B). We train the traditional VAE on the training set and compute the likelihoods of 1000 random samples from the ID test set and of their corresponding OOD samples. We exhibit the histogram of likelihoods for each test (see Fig 2(e-h)).
Finally, we apply the INCP-KL Ratio (details in Section 4.3) to OOD detection tasks on four dataset pairs. In the OOD detection experiments, we train INCPVAE with samples only from the training set and then compute the INCP-KL Ratios of 1000 random samples from the OOD test sets (details in Appendix B). We quantify the OOD detection performance with INCP-KL Ratios and plot the histograms of likelihoods (see Fig 3(b,c,e,f)). Using the INCPVAE trained on the training set, we test OOD sample detection on a variety of datasets via the INCP-KL and the likelihood of the VAE, as well as other baseline methods. The area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC) are used as evaluation metrics.
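The AUROC used here can be computed from raw detection scores via its rank interpretation; below is a minimal sketch (in practice a library routine such as scikit-learn's roc_auc_score would typically be used):

```python
import numpy as np

def auroc(scores_ood, scores_id):
    """AUROC via the Mann-Whitney formulation: the probability that a random
    OOD sample scores higher than a random ID sample, counting ties as 1/2."""
    ood = np.asarray(scores_ood, dtype=float)[:, None]
    idd = np.asarray(scores_id, dtype=float)[None, :]
    return float(np.mean((ood > idd) + 0.5 * (ood == idd)))

# Perfectly separated scores give AUROC 1.0; identical scores give 0.5.
```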
More details related to the datasets and experimental settings are given in Appendix C. All the code will be made available on GitHub.
4.2 Uncertainty Estimation Results
To verify the effectiveness of our model, we impose Gaussian noise at different levels on the input images during testing (more details in Appendix A). Note that we use the standard deviation of the noise as the noise level, which controls the degree of distribution shift from the original data.
We run experiments on the FashionMNIST, MNIST, CIFAR10 and SVHN datasets. From Fig 2(a-d), we obtain reliable patterns across these four datasets. When the test data are drawn without additional perturbation (noise level 0), INCPVAE and VAE present similar uncertainty, suggesting that our model is consistent with the standard VAE when applied to ID data. As the noise level increases from 0.01 to 0.1, the INCPVAE-estimated uncertainty of the OOD samples gradually increases in all four datasets, whereas the VAE-estimated uncertainty shows only a slight increase on FashionMNIST and remains unchanged on the other three datasets (MNIST, CIFAR10 and SVHN). These results demonstrate that our INCPVAE model has a strong capability of capturing the essential characteristics of ID and OOD data with outstanding robustness.
The likelihood distributions from the VAE and INCPVAE are illustrated in Fig 2(e-h). The standard VAE and our INCPVAE were trained as above, and no noise was imposed during testing. In all four datasets, INCPVAE and VAE present coincident likelihood distributions, which shows that our model not only distinguishes OOD samples from the ID distribution, but also retains the same generative ability as the traditional VAE.
4.3 OOD Detection Performance
We carry out OOD detection experiments on the FashionMNIST and CIFAR10 datasets. Fig 3(a) and Fig 3(d) show that, for the standard VAE model, OOD data obtain higher likelihoods than training samples with very high probability on both FashionMNIST and CIFAR10, which has been a nerve-wracking and tricky problem in the likelihood-model field. Here, we conduct two types of comparison experiments for OOD detection. Taking FashionMNIST as an example: first, we choose FashionMNIST data with Gaussian noise as ID and the original FashionMNIST data as OOD. After training, we use the pretrained models to compute the INCP-KL divergence between the variational posterior of the test data (e.g., FashionMNIST, MNIST) and the true posterior of the OOD data. The evaluation results on FashionMNIST and CIFAR10 are shown in Fig 3(b) and Fig 3(e). In addition, pure FashionMNIST data were used as ID and FashionMNIST test images with noise as OOD, so as to measure the INCP-KL divergence from the test data (e.g., FashionMNIST+Noise, MNIST). Similar operations are conducted on CIFAR10. The results of this second experiment are shown in Fig 3(c) and Fig 3(f). Evidently, interchanging the ID and OOD data does not influence the final performance, while INCPVAE exhibits an obvious disparity between the two different OOD sets, which shows that our model is equipped with strong robustness and detection capability.

Besides, we show comprehensive performance comparisons with baselines and other outstanding models. Table 1 and Table 2 present AUROC and AUPRC metrics on FashionMNIST vs. MNIST and CIFAR10 vs. SVHN, respectively. Our model achieves the highest AUROC and AUPRC scores on both test datasets, compared with all the listed work.
Table 1: AUROC and AUPRC for OOD detection on FashionMNIST (ID) vs. MNIST (OOD).

Model | AUROC | AUPRC
INCP-KL Ratio (Baseline+Noise) | |
INCP-KL Ratio (Baseline) | |
Likelihood | 0.035 | 0.313
Likelihood Ratio() ren2019likelihood | 0.973 | 0.951
Likelihood Ratio(, ) ren2019likelihood | 0.994 | 0.993
ODIN liang2017enhancing | 0.752 | 0.763
Mahalanobis distance lee2018simple | 0.942 | 0.928
Ensemble, 20 classifiers lakshminarayanan2017simple | 0.857 | 0.849
WAIC, 5 models choi2018waic | 0.221 | 0.401
Table 2: AUROC and AUPRC for OOD detection on CIFAR10 (ID) vs. SVHN (OOD).

Model | AUROC | AUPRC
INCP-KL Ratio (Baseline+Noise) | |
INCP-KL Ratio (Baseline) | |
Likelihood | 0.057 | 0.314
Likelihood Ratio() ren2019likelihood | 0.931 | 0.888
Likelihood Ratio(, ) ren2019likelihood | 0.930 | 0.881
5 Discussion and Conclusion
We investigate model-independent methods for reliable uncertainty estimation and OOD detection in deep generative models. We adapt the noise contrastive prior to unsupervised models and propose a hybrid method integrating INCPs with the encoder of the VAE framework, trained jointly on ID and OOD data. In our INCPVAE model, OOD samples are generated by adding Gaussian noise, endowing the VAE with reliable uncertainty estimation for inputs and the ability to distinguish OOD data. We reproduce the result that the traditional VAE easily assigns higher likelihoods to OOD samples than to ID samples, consistent with nalisnick2018deep; hendrycks2018deep; choi2018waic; lee2017training; nalisnick2019detecting. Using INCP-KL Ratios, our model achieves SOTA performance in differentiating OOD from ID data, compared with the baseline methods (Table 1 and Table 2). As a model-independent method for OOD detection, INCPVAE paves the way for future VAE applications and has significant potential for anomaly detection and adversarial example detection.
Although OOD detection methods for other deep generative models are hard to transfer to VAEs xiao2020likelihood, we successfully introduce INCPs into the VAE framework to address this. Here we focused only on the VAE model; in future work we will extend INCPs to other generative models. Besides, alternative methods for generating appropriate OOD inputs for training INCPVAE are worth investigating. For example, lee2017training applies a GAN to generate OOD data, which could potentially be used to generate priors for INCPVAE. INCPVAE can also be trained with adversarial examples (AEs) goodfellow2014explaining to enhance the robustness of the VAE.
Appendix A Settings for uncertainty estimation
In this section, we introduce the detailed settings for uncertainty estimation. To evaluate uncertainty estimates from the traditional variational autoencoder (VAE) and from the Improved Noise Contrastive Priors VAE (INCPVAE), we train the VAE on the in-distribution (ID) training set and the INCPVAE on the ID and out-of-distribution (OOD) training sets. We then test both the VAE and INCPVAE on the ID testing set and on OOD testing set0/set1/set2, respectively. See the full lists in Table 3. The OOD training set and testing set0/set1/set2 are generated by adding three levels of Gaussian noise to the baseline (see Table 4).
Table 3: Training and testing sets for the uncertainty estimation experiments.

Dataset / Model | VAE | INCPVAE
ID training set | Baseline | Baseline
OOD training set | | Baseline+Noise()
ID testing set | Baseline | Baseline
OOD testing set0 | Baseline+Noise() | Baseline+Noise()
OOD testing set1 | Baseline+Noise() | Baseline+Noise()
OOD testing set2 | Baseline+Noise() | Baseline+Noise()
Table 4: Gaussian noise levels (standard deviations) used to generate the OOD sets for each dataset.

Noise level / Dataset | FashionMNIST | MNIST | CIFAR10 | SVHN
 | 0.0001 | 0.001 | 0.01 | 0.001
 | 0.00028 | 0.008 | 0.05 | 0.009
 | 0.1000 | 0.010 | 0.10 | 0.010
Table 5: Uncertainty level (variance of the OOD output prior) assumed for each dataset.

Dataset | Uncertainty level
FashionMNIST |
MNIST |
CIFAR10 |
SVHN |
For each image dataset, the true OOD posterior of INCPVAE (i.e., the OOD data output prior) is assumed to be a Gaussian distribution with a dataset-specific variance (see Table 5), reflecting that these four datasets have different uncertainties.
Appendix B Settings for OOD Detection
In this section, we introduce the detailed settings of the OOD detection experiments. Firstly, following the most challenging experiment reported by Nalisnick et al., we train the VAE on the ID training set and test on the ID and OOD testing sets (see Table 6). Secondly, to evaluate the OOD detection of INCPVAE, we train INCPVAE on the ID and OOD training sets, and test it on the OOD testing set and OOD testing set1 (see Table 7). The ID and OOD training sets, as well as the OOD testing set, are generated by adding Gaussian noise to the baseline (see Table 8).
Table 6: Datasets for the VAE OOD detection experiments.

Dataset / Exp | Exp1 | Exp2
ID training set | FashionMNIST | CIFAR10
ID testing set | FashionMNIST | CIFAR10
OOD testing set | MNIST | SVHN
Table 7: Datasets for the INCPVAE OOD detection experiments.

Exp | ID training set | OOD training set | OOD testing set | OOD testing set1
Exp1 | Fashion | Fashion+Noise() | Fashion+Noise() | MNIST
Exp2 | Fashion+Noise() | Fashion | Fashion | MNIST
Exp3 | CIFAR10 | CIFAR10+Noise() | CIFAR10+Noise() | SVHN
Exp4 | CIFAR10+Noise() | CIFAR10 | CIFAR10 | SVHN
Table 8: Gaussian noise levels used in the OOD detection experiments.

Noise level / Dataset | FashionMNIST | CIFAR10
 | 0.00028 | 0.05
 | 0.00050 | 0.09
For each dataset, the true OOD posterior of INCPVAE (i.e., the OOD data output prior) is a Gaussian distribution with a dataset-specific variance (see Table 9), reflecting that different datasets have different uncertainties.
Table 9: Uncertainty level assumed for each dataset.

Dataset | Uncertainty level
FashionMNIST |
CIFAR10 |
Appendix C Settings for Implementation Detail
In the experiments, VAE and INCPVAE are trained on FashionMNIST and CIFAR10. All models are trained with normalized images on 1 NVIDIA TITAN RTX GPU. In all experiments, VAE and INCPVAE consist of an encoder with the architecture given in Table 10 and a decoder shown in Table 11. Both VAE and INCPVAE use the Leaky ReLU activation function. We train the VAE for 200 epochs with a constant learning rate, using the Adam optimizer and a batch size of 64 in each experiment.

Operation | Kernel | Stride | Features | Padding
Input | | | |
Dense | | | 3136/4096 |
Dense | | | 1568/2048 |
Transposed Convolution | | | 32 | 0
Transposed Convolution | | | 256 | 0
Transposed Convolution | | | 3 | 0