Likelihood models are naturally regarded as well suited to detecting out-of-distribution (OOD) inputs, owing to the intuitive assumption bishop1994novelty that such models assign lower likelihoods to OOD inputs than to in-distribution (ID) inputs. However, recent works nalisnick2018deep; hendrycks2018deep; choi2018waic; lee2017training; nalisnick2019detecting; maaloe2019biva have reported that likelihood-based deep generative models, such as variational auto-encoders (VAEs) kingma2013auto; rezende2014stochastic, PixelCNN van2016conditional and Glow kingma2018glow, fail to detect OOD inputs correctly. Counter-intuitively, OOD inputs are assigned higher likelihoods than ID inputs, which contradicts the accepted assumption. Hence, when a likelihood model is employed as a detector in OOD detection tasks or in general generation tasks, it is necessary to ensure that the adopted model handles OOD inputs appropriately.
The phenomenon that VAEs assign higher likelihoods to OOD inputs than to ID inputs was first reported by nalisnick2018deep. Since then, it has become an increasingly active topic in the generative modeling field. Several studies serra2019input; nalisnick2019detecting; butepage2019modeling have sought to interpret this empirical phenomenon. For instance, butepage2019modeling argued that it is caused by model assumptions and evaluation schemes, where the oversimplified likelihood function assumed in the VAE (e.g., iid Bernoulli or iid Gaussian) distorts the estimate of the ID data distribution. However, the true likelihood function is often unknown and deviates from the assumed one. On some datasets, local evaluation under the approximate posterior causes overconfidence. nalisnick2019detecting conjectured that the high-likelihood region does not coincide with the model’s typical set. serra2019input posited that input complexity strongly affects likelihood-based models.
To solve the OOD detection problem, several studies choi2018waic; nalisnick2018deep have suggested that likelihood models with reliable uncertainty estimation may improve OOD detection. Additionally, noise contrastive priors (NCPs) hafner2018reliable, a specific prior for neural networks in data space, encourage network weights not only to explain ID inputs but also to capture the high uncertainty of OOD samples. Inspired by these pioneering studies, we introduce a novel method, named Improved Noise Contrastive Priors Variational Auto-encoder (INCPVAE), to obtain reliable uncertainty estimation for VAEs and thereby solve the OOD detection problem. The original NCP method, however, is typically applied to classifier models and cannot be applied directly to the VAE framework. Hence, we modify the NCP loss function to make it suitable for the VAE framework. Moreover, we apply the improved NCP to the encoder of the VAE and generate OOD samples by adding Gaussian noise to the original inputs. Furthermore, because the simple likelihood function of the VAE fails on the OOD detection task, we use the INCP-KL divergence of INCPVAE, rather than the likelihood, to detect OOD inputs. Our experiments show that INCPVAE outperforms the traditional VAE and reduces overconfidence on OOD data. In brief, the contributions of this work are:
We adapt the noise contrastive prior to the VAE framework (Section 3.3). To the best of our knowledge, this is the first work to apply a noise contrastive prior to an unsupervised generative model to obtain reliable uncertainty estimation.
We apply tailored metrics to uncertainty estimation (Section 3.4), with which our INCPVAE framework achieves reliable uncertainty estimation and enhanced robustness.
We propose a novel OOD detection method based on the INCP-KL divergence of INCPVAE (Section 3.5). Experiments demonstrate that INCPVAE handles OOD inputs well and that our detection method achieves state-of-the-art (SOTA) performance on the challenging cases raised by nalisnick2018deep.
2 Related Work
The capability of OOD detection is vital to machine learning models. From an algorithmic perspective, there are two mainstream categories of approaches to detecting OOD samples: supervised/discriminative methods and unsupervised/generative methods daxberger2019bayesian. Most existing methods belong to the former, aiming to acquire a decision boundary or likelihood ratio between ID and OOD inputs by combining a dataset of anomalies with the training data. The supervised approaches devries2018learning; liang2017enhancing; hendrycks2016baseline; lakshminarayanan2017simple can make full use of deep discriminative models and, to some extent, prevent poorly calibrated neural networks guo2017calibration from making high-confidence predictions on OOD inputs, which benefits various applications (such as anomaly detection vyas2018out; PidhorskyiGenerative; hendrycks2016baseline and adversarial defense song2018constructing). However, these methods only suit task-dependent scenarios, which is a severe limitation because anomalous data is usually rare or not known ahead of time in real-world scenarios.
In contrast, the unsupervised approaches aim to solve the problem in a more general manner by training deep generative models, where density estimation is widely applied oord2016pixel; kingma2018glow. However, as mentioned in the Introduction, likelihood estimates from popular deep generative models are not reliable for OOD detection. Many studies have attempted to explain and tackle this problem serra2019input; nalisnick2019detecting; butepage2019modeling, but an efficient and robust solution is still missing and urgently needed.
Uncertainty estimation: Uncertainty estimation is closely tied to OOD detection, with the goal of yielding calibrated confidence measures for the predictive distribution. Uncertainty estimation in MC Dropout gal2016dropout, Deep-Ensemble lakshminarayanan2017simple and ODIN liang2017enhancing involves producing a calibrated predictive distribution for classifiers. An alternative solution, the variational information bottleneck (VIB) alemi2018uncertainty, conducts OOD detection via divergence estimation in latent space. However, these existing methods are model-dependent and rely heavily on task-specific information to obtain an integrated estimate of uncertainty, so a more general method is needed. On the other hand, recent studies choi2018waic; nalisnick2018deep suggested that likelihood models with reliable uncertainty estimation may help mitigate the high OOD likelihood problem for generative models in a task-independent manner. Here, we provide a novel hybrid scheme linking uncertainty estimation with noise contrastive priors and Gaussian noise, to improve both the reliability of uncertainty estimation in VAEs and model independence in OOD detection.
3.1 Improved Noise Contrastive Priors
hafner2018reliable proposed NCPs, a kind of data prior applied to both ID inputs and OOD inputs. The OOD inputs are usually generated by imposing noise. In this work, to obtain the VAE’s uncertainty, we modify the loss function (see below) to make the original NCPs suitable for the VAE framework. Because it is hard to generate OOD data exactly, we add Gaussian noise to ID images to generate OOD data.
Generating OOD Inputs lee2017training reported that OOD samples can be produced by sampling from the boundary of the ID distribution with high uncertainty. hafner2018reliable advanced an algorithm inspired by noise contrastive estimation gutmann2010noise; mnih2013learning, where a complement distribution is approximated using random noise. For continuous ID inputs , we add Gaussian noise to obtain OOD inputs, which is . The distribution density of OOD inputs is formulated as,
where is the distribution density of ID inputs, and. The variance is a hyper-parameter to tune the sampling distance from the boundary of training distribution. The complexity of OOD inputs is correlated with the variance.
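The noise-based construction of OOD inputs described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `make_ood_inputs`, the zero-mean noise, and the absence of clipping to a valid pixel range are all assumptions made here. The argument `sigma` plays the role of the noise standard deviation that controls how far samples move from the training distribution.

```python
import numpy as np


def make_ood_inputs(x_id, sigma, rng=None):
    """Perturb ID inputs with zero-mean Gaussian noise of standard deviation `sigma`.

    Hypothetical helper: zero-mean noise and no clipping are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw one independent noise value per input element.
    noise = rng.normal(loc=0.0, scale=sigma, size=x_id.shape)
    return x_id + noise


# Example: a batch of 200 blank 28x28 "images" perturbed at noise level 0.1.
rng = np.random.default_rng(0)
x_batch = np.zeros((200, 28, 28))
x_ood_batch = make_ood_inputs(x_batch, sigma=0.1, rng=rng)
```

With `sigma = 0` the inputs are returned unchanged, which corresponds to unperturbed ID data.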
Data Priors The data priors consist of inputs prior and outputs prior . To obtain a reliable uncertainty estimation for the VAE, an appropriate input prior should include OOD inputs so that it can obtain better performance than the baseline under the training distribution. A good output prior should be a high-entropy distribution that expresses high uncertainty about the VAE’s target outputs given OOD inputs. The data priors are listed as follows:
where is the distribution of OOD inputs, and are the parameter of OOD data outputs priors, is a hyper-parameter tuning the level of target outputs uncertainty.
Loss Function Improved Noise Contrastive Priors (INCPs) estimate the model’s uncertainty in a way that generalizes readily to OOD samples. To train INCPs, we modify the loss function as follows:
where denote OOD data priors, is the parameter of the neural network. The hyper-parameter represents the trade-off between them. INCPs can be trained by minimizing this loss. Note that in Eq. 3, the first term fits the neural network to the true ID data output prior by minimizing the KL divergence, and the second term is the analogous term for the OOD data output prior. This loss function optimizes the ID and OOD posteriors toward two distinct targets simultaneously (the true ID data output prior and the assumed OOD data output prior), whereas the original NCP loss hafner2018reliable fits the ID and OOD conditional distributions to a single target.
3.2 Variational Autoencoder
VAEs rezende2014stochastic; kingma2013auto are latent variable models trained by maximizing the marginal likelihood of an observation variable . The marginal likelihood can be written as follows:
where and are the prior (e.g., Vamp Prior tomczak2017vae, Resampled Prior bauer2018resampled), here a standard normal distribution, and the true posterior, respectively. is the variational posterior (encoder), modeled as a Gaussian distribution, and is the generative model (decoder), modeled as a Bernoulli distribution. Both are modeled by neural networks with their parameters , respectively.
However, the true posterior cannot be computed analytically. Assuming the variational posterior has arbitrarily high modeling capacity, approximates the intractable and the KL-divergence between and will be zero. Thus, we train the VAE with ID (or OOD) samples to maximize the following objective, the variational evidence lower bound (called I-ELBO (or O-ELBO)):
where and are variational posteriors for matching the true posteriors ( and ) which are given by and respectively. For a given dataset, the marginal likelihood is a constant. Substituting Eq. 5 into Eq. 4, we get
which means maximizing I-ELBO is equivalent to minimizing the KL-divergence between and .
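For reference, the standard decomposition behind this argument can be written out explicitly. The notation below ($x$ for the observation, $z$ for the latent variable, $q_\phi(z|x)$ for the encoder, $p_\theta(x|z)$ for the decoder, $p(z)$ for the prior) is the conventional VAE notation and is assumed here; it may differ from the symbols used in Eq. 4 to Eq. 6.

```latex
\begin{align}
\log p(x) &= \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
             - \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z)\right]}_{\mathrm{ELBO}(x)}
             + \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z|x)\right] \\
\log p(x) - \mathrm{ELBO}(x) &= \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z|x)\right] \;\geq\; 0
\end{align}
```

Because $\log p(x)$ does not depend on the encoder parameters $\phi$, any increase of the ELBO with respect to $\phi$ lowers the KL term by the same amount.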
3.3 INCP Variational Autoencoder
INCPVAE consists of an encoder and a decoder, and the improved NCPs are imposed on the encoder network of the VAE. INCPVAE is trained on both ID and OOD inputs by maximizing the I-ELBO and O-ELBO. From Eq. 5, the full ELBO of INCPVAE is as follows:
Assuming that the variational posterior for ID inputs has high modeling capacity, the true posterior can be replaced by . Considering the definition of the OOD output prior (Eq. 2), the true OOD data posterior is:
where , is hyper-parameter that determines how large we want the outputs uncertainty to be. And the KL-divergence between and (called INCP-KL) becomes tractable and can be analytically computed. Maximizing the ELBO of INCPVAE can be replaced by minimizing the following loss function:
The hyper-parameter is a setting for trade-off between them.
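Because both the variational posterior and the assumed true OOD posterior are Gaussian, the INCP-KL term has a closed form. The sketch below shows one way the loss could be assembled. It assumes diagonal Gaussians, an already-computed per-sample negative I-ELBO, and a weight named `gamma` for the trade-off hyper-parameter; the function names and argument layout are illustrative and not taken from the paper.

```python
import numpy as np


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2))], summed over the last axis."""
    mu_q, sigma_q, mu_p, sigma_p = map(np.asarray, (mu_q, sigma_q, mu_p, sigma_p))
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    # Closed-form KL between two univariate Gaussians, applied per latent dimension.
    per_dim = np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return per_dim.sum(axis=-1)


def incpvae_loss(neg_id_elbo, mu_ood, sigma_ood, target_mu, target_sigma, gamma):
    """Hypothetical assembly: mean negative I-ELBO plus gamma times the mean INCP-KL.

    `mu_ood` and `sigma_ood` are encoder outputs for OOD inputs, with shape (batch, latent_dim).
    `target_mu` and `target_sigma` describe the assumed Gaussian OOD posterior.
    """
    incp_kl = kl_diag_gaussians(mu_ood, sigma_ood, target_mu, target_sigma)
    return float(np.mean(neg_id_elbo) + gamma * np.mean(incp_kl))
```

A larger `target_sigma` asks the encoder to spread its posterior more widely on OOD inputs, which is how the output uncertainty level enters the objective in this sketch.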
3.4 Metrics for Uncertainty Estimation
We propose ELBO Ratios for the quantitative evaluation of the uncertainty of a variational auto-encoder. Using Eq. 5, we compute the ELBO of all ID samples (I-ELBO) and take the maximum (called ). The ELBO Ratio is defined as
where is the VAE’s degree of uncertainty on the data. The greater the scalar , the higher the uncertainty.
3.5 INCP-KL Ratios for OOD Detection
The density estimate of a VAE is commonly used for OOD detection, but on some dataset pairs (e.g., FashionMNIST vs MNIST, CIFAR10 vs SVHN) OOD inputs receive higher likelihoods than ID inputs. To solve this problem, ren2019likelihood proposed Likelihood Ratios for OOD detection. In Eq. 9, the second term of the INCPVAE loss is the INCP-KL, which is the KL divergence between the OOD variational posterior and the true OOD posterior. We therefore hypothesize that the INCP-KL divergence of test samples drawn from the OOD distribution (e.g., Baseline+Noise, Baseline; see Fig 1) will be smaller than that of samples from other distributions. Based on this hypothesis, we propose INCP-KL Ratios for OOD detection. We compute the INCP-KL of all OOD samples and take the maximum (called ). The INCP-KL Ratio is defined as
where , the test sample is OOD data; , is not OOD data ( does not belong to OOD data).
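The decision rule can be illustrated with a short sketch. It assumes that the ratio divides a test sample's INCP-KL by the maximum INCP-KL observed over reference OOD samples, and that a ratio of at most 1 flags the sample as OOD. Both the form of the ratio and the threshold are assumptions made for illustration, as are the function names.

```python
import numpy as np


def incp_kl_ratio(incp_kl_test, incp_kl_ood_reference):
    """Divide each test INCP-KL by the largest INCP-KL among the reference OOD samples."""
    kl_max = np.max(incp_kl_ood_reference)
    return np.asarray(incp_kl_test, dtype=float) / kl_max


def is_ood(ratio, threshold=1.0):
    """Assumed rule: a ratio at or below the threshold marks the sample as OOD."""
    return np.asarray(ratio) <= threshold


# Example: two test samples scored against three reference OOD samples.
reference_kl = np.array([0.5, 1.0, 2.0])
test_ratio = incp_kl_ratio(np.array([1.0, 4.0]), reference_kl)
test_flags = is_ood(test_ratio)
```

Under these assumptions, the first test sample falls inside the range of INCP-KL values seen on OOD data and is flagged as OOD, while the second lies outside it and is not.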
4 Experiments and Results
4.1 Experimental settings
In this section, we design experiments on multiple datasets to evaluate our method and compare with other baseline methods. The experiments involve uncertainty estimation and OOD detection.
First, we generate OOD samples and evaluate the uncertainty estimation of VAE and INCPVAE on different datasets (see details in Appendix A). For the uncertainty evaluation (details are presented in Section 4.2), we train VAE and INCPVAE with samples only from the pure training sets, and then run inference on test samples with different noise levels (see Fig 1). We quantify the uncertainty of VAE and INCPVAE using the ELBO metrics (see details in Section 3.4). To estimate the uncertainty reliably, we compute the likelihoods of the traditional VAE and INCPVAE on 1000 random samples from the ID testing sets (see the results in Fig 2(a-d)).
Second, we follow the settings of nalisnick2018deep and conduct the following two experiments (details are given in Appendix B). We train the traditional VAE on the training set and compute the likelihoods of 1000 random samples from the ID testing set and of their corresponding OOD samples. We show the histogram of likelihoods for each test (see Fig 2(e-h)).
Finally, we apply the INCP-KL Ratios (details are in Section 4.3) to OOD detection tasks on four pairs of datasets. In the OOD detection experiments, we train INCPVAE with samples only from the training set and then compute the INCP-KL Ratios of 1000 random samples from the OOD testing sets (details in Appendix B). We quantify the OOD detection performance with the INCP-KL Ratios and plot the histograms of likelihoods (see Fig 3(b,c,e,f)). Using INCPVAE trained on the training set, we test OOD sample detection on a variety of datasets with INCP-KL and the likelihood of the VAE, as well as other baseline methods. The area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC) are used as evaluation metrics.
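Both metrics can be computed directly from per-sample scores and binary labels. The following is a small NumPy sketch, not the evaluation code used in the experiments. It assumes that higher scores mean "more likely OOD" (for example, a negated INCP-KL Ratio) and that label 1 marks OOD samples. AUPRC is approximated by the usual average-precision sum, without special handling of tied scores.

```python
import numpy as np


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive sample."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].sum() / hits.sum())


# Example with four samples, two of which are OOD (label 1).
example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
example_auroc = auroc(example_scores, example_labels)
example_ap = average_precision(example_scores, example_labels)
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for test sets of about 1000 samples per class; a rank-based formulation would scale better for larger sets.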
More details related to the datasets and experimental settings are shown in Appendix C. All the code will be available at GitHub.
4.2 Uncertainty Estimation Results
To verify the effectiveness of our model, we add Gaussian noise of different levels to the input images during testing (more details in Appendix A). Note that the noise level here is the standard deviation of the noise, which controls the degree of shift from the original data distribution.
We run experiments on the FashionMNIST, MNIST, CIFAR10 and SVHN datasets. Fig 2(a-d) shows consistent patterns across these four datasets. When the testing data is drawn without additional perturbation (noise level 0), INCPVAE and VAE present similar uncertainty, suggesting that our model is consistent with the standard VAE on ID data. As the noise level increases from 0.01 to 0.1, the INCPVAE-estimated uncertainty of the OOD samples gradually increases on all four datasets, whereas the VAE-estimated uncertainty shows only a slight increase on FashionMNIST and remains unchanged on the other three datasets (MNIST, CIFAR10 and SVHN). These results indicate that INCPVAE robustly captures the difference between ID and OOD data.
The likelihood distributions of VAE and INCPVAE are illustrated in Fig 2(e-h). Both the standard VAE and our INCPVAE were trained, and no noise was imposed during testing. On all four datasets, INCPVAE and VAE present coincident likelihood distributions, which indicates that our model not only distinguishes OOD samples from the ID distribution but also retains the same generative ability as the traditional VAE.
4.3 OOD Detection Performance
For the standard VAE model, OOD data can obtain higher likelihoods than training samples with very high probability on both FashionMNIST and CIFAR10, which has been a persistent and difficult problem for likelihood models. Here, we conduct two types of comparison experiments on OOD detection, taking FashionMNIST as an example. In the first experiment, we use FashionMNIST data with Gaussian noise as ID and the original FashionMNIST data as OOD. After training, we use the trained models to compute the INCP-KL divergence between the variational posterior of the test data (e.g., FashionMNIST, MNIST) and the true posterior of the OOD data. The evaluation results on FashionMNIST and CIFAR10 are shown in Fig 3(b) and Fig 3(e). In the second experiment, pure FashionMNIST data are used as ID and FashionMNIST testing images with noise are used as OOD, so as to measure the INCP-KL divergence of the test data (e.g., FashionMNIST+Noise, MNIST). Similar operations are conducted on CIFAR10. The results of the second experiment are shown in Fig 3(c) and Fig 3(f). Interchanging the roles of the ID and OOD data does not affect the final performance, while INCPVAE exhibits a clear disparity between the two different OOD sets, which indicates that our model is robust and has strong detection capability.
In addition, we show comprehensive performance comparisons with other baselines and strong models. Table 1 and Table 2 present AUROC and AUPRC metrics on FashionMNIST vs. MNIST, and CIFAR10 vs. SVHN, respectively. Our model achieves the highest AUROC and AUPRC scores on both test datasets, compared with all the listed methods.
Table 1: AUROC and AUPRC on FashionMNIST vs. MNIST.
|Model||AUROC||AUPRC|
|Likelihood Ratio(μ) ren2019likelihood||0.973||0.951|
|Likelihood Ratio(μ, λ) ren2019likelihood||0.994||0.993|
|Mahalanobis distance lee2018simple||0.942||0.928|
|Ensemble, 20 classifiers lakshminarayanan2017simple||0.857||0.849|
|WAIC, 5 models choi2018waic||0.221||0.401|
Table 2: AUROC and AUPRC on CIFAR10 vs. SVHN.
|Model||AUROC||AUPRC|
|Likelihood Ratio(μ) ren2019likelihood||0.931||0.888|
|Likelihood Ratio(μ, λ) ren2019likelihood||0.930||0.881|
5 Discussion and Conclusion
We investigate deep generative model-independent methods for reliable uncertainty estimation and OOD detection. We adapt the noise contrastive prior for unsupervised models and propose a hybrid method that integrates INCP with the encoder of the VAE framework, trained jointly with ID and OOD data. In our INCPVAE model, OOD samples are generated by adding Gaussian noise, endowing the VAE with reliable uncertainty estimation for its inputs and the ability to distinguish OOD data. We reproduce the result that the traditional VAE readily assigns higher likelihoods to OOD samples than to ID samples, consistent with nalisnick2018deep; hendrycks2018deep; choi2018waic; lee2017training; nalisnick2019detecting. Using INCP-KL Ratios, our model achieves SOTA performance in differentiating OOD from ID data, compared with baseline methods (Table 1 and Table 2). As a model-independent method for OOD detection, INCPVAE opens a promising path for future VAE applications and has significant potential for anomaly detection and adversarial example detection.
Although OOD detection methods for other deep generative models are hard to transfer to VAEs xiao2020likelihood, we successfully introduce INCP into the VAE framework to address the problem. Here we focused only on the VAE model; in future work we will extend INCP to other generative models. In addition, alternative methods for generating appropriate OOD inputs for training INCPVAEs are worth investigating. For example, lee2017training applies a GAN to generate OOD data, which could potentially be used to generate priors for INCPVAE. INCPVAE can also be trained with adversarial examples (AEs) goodfellow2014explaining to enhance the robustness of the VAE.
Appendix A Settings for uncertainty estimation
In this section, we introduce the detailed settings for uncertainty estimation. To evaluate uncertainty estimation from the traditional Variational Auto-encoder (VAE) and from the Improved Noise Contrastive Priors VAE (INCPVAE), we train VAE on the in-distribution (ID) training set and INCPVAE on the ID and out-of-distribution (OOD) training sets. We then test both VAE and INCPVAE on the ID testing set and on OOD testing set0/set1/set2. See the full list in Table 3. The OOD training set and testing set0/set1/set2 are generated by adding three levels of Gaussian noise to the baseline (see Table 4).
|Dataset / Model||VAE||INCPVAE|
|ID training set||Baseline||Baseline|
|OOD training set||-||Baseline+Noise()|
|ID testing set||Baseline||Baseline|
|OOD testing set0||Baseline+Noise()||Baseline+Noise()|
|OOD testing set1||Baseline+Noise()||Baseline+Noise()|
|OOD testing set2||Baseline+Noise()||Baseline+Noise()|
|Noise level / Datasets||FashionMNIST||MNIST||CIFAR10||SVHN|
For each image dataset, the true OOD posterior of INCPVAE (or OOD data output prior) is assumed to be a Gaussian distribution with a specific variance (see Table 5), reflecting that these four datasets have different uncertainties.
Appendix B Settings for OOD Detection
In this section, we introduce the detailed settings of the OOD detection experiments. First, following the most challenging experiment reported by Nalisnick et al., we train VAE on the ID training set and test on the ID and OOD testing sets (see Table 6). Second, to evaluate the OOD detection of INCPVAE, we train INCPVAE on the ID and OOD training sets, and test INCPVAE on the OOD testing set and OOD testing set1 (see Table 7). The ID and OOD training sets, as well as the OOD testing set, are generated by adding Gaussian noise at three levels to the baseline (see Table 8).
|ID training set||FashionMNIST||CIFAR10|
|ID testing set||FashionMNIST||CIFAR10|
|OOD testing set||MNIST||SVHN|
|Exp||ID training set||OOD training set||OOD testing set||OOD testing set1|
|Noise level / Datasets||FashionMNIST||CIFAR10|
For different datasets, the true OOD posterior of INCPVAE (or OOD data output prior) is a Gaussian distribution with a different variance (see Table 9), reflecting that different datasets have different uncertainties.
Appendix C Settings for Implementation Detail
In the experiments, VAE and INCPVAE are trained on FashionMNIST and CIFAR10. All models are trained with images normalized to on 1 NVIDIA TITAN RTX GPU. In all experiments, VAE and INCPVAE consist of an encoder with the architecture given in Table 10 and a decoder shown in Table 11, and are trained with the Adam optimizer and a batch size of 64.