DOI: Divergence-based Out-of-Distribution Indicators via Deep Generative Models

by   Wenxiao Chen, et al.

To ensure robust and reliable classification results, OoD (out-of-distribution) indicators based on deep generative models have recently been proposed and shown to work well on small datasets. In this paper, we build the first large collection of benchmarks (92 dataset pairs, one order of magnitude more than in previous work) for existing OoD indicators and observe that none of them performs well. We thus advocate that a large collection of benchmarks is mandatory for evaluating OoD indicators. We propose a novel theoretical framework, DOI, for divergence-based out-of-distribution indicators (instead of traditional likelihood-based ones) in deep generative models. Following this framework, we further propose a simple and effective OoD detection algorithm: Single-shot Fine-tune. It significantly outperforms past works by 5 to 8% in AUROC, and its performance is close to optimal. The likelihood criterion has recently been shown to be ineffective for detecting OoD. Single-shot Fine-tune introduces a novel fine-tune criterion: a testing sample is judged by whether its likelihood improves after a well-trained model is fine-tuned on it. The fine-tune criterion is clear and easy to follow, and we expect it to lead the OoD domain into a new stage.




1 Introduction

Footnote 1: This paper was developed independently of, and at the same time as, Zhisheng et al. (2020). Its key idea is very similar to that of Zhisheng et al. (2020), and it adds only incremental content (theorems, methods, and experiments). Since Zhisheng et al. (2020) has been published in NIPS 2020, this paper is provided here to share our large-scale experiments and theorems with the community. The code is in

Machine learning has achieved impressive success in classification through deep neural network classifiers Szegedy et al. (2016); He et al. (2016); Zagoruyko and Komodakis (2016). Knowing when a machine learning (ML) model is qualified to make predictions on an input is critical to the safe deployment of ML technology in the real world Choi et al. (2018). When the training distribution (called in-distribution) differs from the testing distribution (called out-of-distribution), neural networks may provide arbitrary predictions, with high confidence, on inputs that they are unaccustomed to seeing. This is known as the out-of-distribution (OoD) problem Choi et al. (2018). For example, a classifier trained on CIFAR-10 Krizhevsky et al. (2009) may recognize a house number in SVHN Netzer et al. (2011) as a horse, which might lead to potential risks.

Therefore, it is crucial to develop OoD indicators that detect whether a testing sample comes from the in-distribution or the out-of-distribution, so that applications based on classifiers are robust and reliable. The common belief Bishop (1994) is that OoD indicators can be based on a density model: train a density model $p_\theta(x)$ (as an OoD indicator) to approximate the empirical distribution of the training data, and reject a sample $x$ when $p_\theta(x)$ is sufficiently low. However, recent works Nalisnick et al. (2019); Choi et al. (2018); Hendrycks et al. (2018) show that deep generative models Dinh et al. (2016); Tomczak and Welling (2018); Takahashi et al. (2019); Van den Oord et al. (2016), which generate realistic samples, can assign higher density to samples from the out-of-distribution. For example, according to Nalisnick et al. (2019), this phenomenon occurs in CIFAR-10 (as in-distribution) vs SVHN (as out-of-distribution) for different likelihood-based models, even though the data in CIFAR-10 and SVHN have significantly different semantics.

To alleviate the aforementioned phenomenon, more advanced OoD indicators Serrà et al. (2019); Song et al. (2017); Choi et al. (2018); Ren et al. (2019); Song et al. (2019); Che et al. (2019) have recently been proposed based on deep generative models and shown to perform well, but only on a few dataset pairs (in-distribution dataset, out-of-distribution dataset). However, a robust OoD indicator should detect samples from any out-of-distribution Chen et al. (2020). Thus it should be evaluated on a large collection of benchmarks. In this paper, we first build a large collection of benchmarks with 92 dataset pairs (based on 14 popular image datasets: MNIST, FASHION-MNIST, KMNIST, NOT-MNIST, Omniglot, CIFAR-10, CIFAR-100, TinyImagenet, SVHN, iSUN, CelebA, LSUN, Noise and Constant), whose scale is one order of magnitude larger than in all of the above works. We observe that none of the above OoD indicators performs well on the large collection of benchmarks (see Table 1). Based on this observation, we advocate that experiments on a few datasets are unreliable, and that a large collection of benchmarks is mandatory for evaluating OoD indicators.

Another interesting observation, which we discovered by accident because we enumerate the dataset pairs in both directions when possible (e.g., (CIFAR-10, SVHN) and (SVHN, CIFAR-10) are both in our experiment setting), is that, on all dataset pairs, $\log p_\theta(x) > \log p_\omega(x)$ when $x$ is from the in-distribution and $\log p_\theta(x) < \log p_\omega(x)$ when $x$ is from the out-of-distribution, where $p_\theta$ is a generative model trained on the in-distribution data and $p_\omega$ is a generative model trained on the out-of-distribution data. Inspired by this intuitive (in retrospect) observation, we propose a fundamental theoretical framework, DOI, for Divergence-based Out-of-Distribution Indicators in deep generative models. Following this framework, we further propose a simple and effective out-of-distribution detection algorithm, Single-shot Fine-tune, with three mainstream deep generative models (VAE, PixelCNN, and RNVP). In our experiments, Single-shot Fine-tune significantly outperforms existing works by 5 to 8% in AUROC, and its performance is close to the theoretically optimal results.

The main contributions of this paper are as follows:

  • We build the first large collection of benchmarks (one order of magnitude more dataset pairs than previous ones) for existing OoD indicators and observe that none of them performs well. We thus advocate that experiments on a few datasets are unreliable, and that a large collection of benchmarks is mandatory for evaluating OoD indicators.

  • We propose a novel theoretical framework, DOI, for divergence-based out-of-distribution indicators (instead of traditional likelihood-based ones) in deep generative models. Following DOI, we further propose a simple and effective algorithm, Single-shot Fine-tune. We believe that the DOI framework could guide development in the OoD domain.

  • The Single-shot Fine-tune algorithm significantly outperforms past works by 5 to 8% in AUROC. Single-shot Fine-tune is the first fine-tune-based inductive OoD method, which shows that the fine-tune criterion is a practical and effective criterion for detecting OoD. In the fine-tune criterion, the use of both knowledge and cognition improves the performance significantly, which will draw attention to the cognition ability of deep generative models.

2 Background

Likelihood-based generative models are widely viewed as naturally suited to detecting out-of-distribution samples through the model density. However, the densities of common likelihood-based models, e.g., RealNVP Dinh et al. (2016), VAE Tomczak and Welling (2018); Takahashi et al. (2019) and PixelCNN Van den Oord et al. (2016), have been shown to be problematic for detecting out-of-distribution data Nalisnick et al. (2019). These likelihood-based models assign a higher likelihood to samples from SVHN (out-of-distribution) than to samples from CIFAR-10 (in-distribution).

To solve this problem, some researchers proposed some variants of these models for detecting out-of-distribution data Che et al. (2019) and some researchers proposed improved indicators to replace log-likelihood on common models Serrà et al. (2019). Common models are widely applied in the images domain, but variants are not. Moreover, evaluating numerous variants on the large collection of benchmarks is more expensive, while common models are easy to train, and many indicators can share one well-trained model. Furthermore, it is necessary to check the generality of indicators on common models. By the above motivations, this paper focuses on the indicators based on common models.

Song et al. (2017) proposed permutation test statistics as the indicator to detect OoD. The rank of the likelihood of a testing sample among the likelihoods of the training set is used as the OoD indicator. Both low-likelihood and high-likelihood samples are identified as OoD. This is particularly useful for the counterexample of CIFAR-10 vs SVHN Nalisnick et al. (2019).
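The rank-based idea can be sketched as follows. This is a minimal illustration and not the exact statistic of Song et al. (2017): the function name `rank_indicator` and the two-sided form `|rank - 0.5|` are assumptions chosen only to show how both unusually low and unusually high likelihoods end up flagged.

```python
import numpy as np

def rank_indicator(train_loglik, test_loglik):
    """Two-sided rank score (assumed form): 0 at the training median, 0.5 at either extreme."""
    train_sorted = np.sort(np.asarray(train_loglik, dtype=float))
    test_loglik = np.atleast_1d(np.asarray(test_loglik, dtype=float))
    # Fraction of training samples whose log-likelihood is <= each test value
    rank = np.searchsorted(train_sorted, test_loglik, side="right") / len(train_sorted)
    # Distance from the median rank: large for both tails
    return np.abs(rank - 0.5)

# Hypothetical log-likelihoods: 100 training values and three test values
scores = rank_indicator(np.arange(100), [-5.0, 49.0, 200.0])
```

A test sample whose log-likelihood sits near the middle of the training values gets a score near 0, while samples in either tail get a score near 0.5.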

Choi et al. (2018) used Watanabe Akaike Information Criterion (WAIC) based on model ensembles.


Ren et al. (2019) proposed a likelihood ratio indicator for deep generative models. They proposed a background model to capture the general background statistics and a likelihood ratio indicator to capture the significance of semantics compared to the background model.


Serrà et al. (2019) observed that input complexity excessively affects the likelihoods of generative models. An estimate of input complexity, $L(x)$, is then proposed to derive a parameter-free OoD indicator that subtracts $L(x)$ from the negative log-likelihood of the model.
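A complexity estimate of this kind can be sketched with a general-purpose compressor. The sketch below is only illustrative: `zlib` is a stand-in (an assumption, not the compressor used in the cited work), the function names are hypothetical, and the sign convention (higher score means more likely OoD) is also an assumption.

```python
import zlib
import numpy as np

def complexity_bits_per_dim(image_uint8):
    """Compressed length in bits per dimension; zlib is a stand-in compressor."""
    image_uint8 = np.ascontiguousarray(image_uint8, dtype=np.uint8)
    compressed = zlib.compress(image_uint8.tobytes(), 9)
    return 8.0 * len(compressed) / image_uint8.size

def complexity_adjusted_score(neg_loglik_bits_per_dim, image_uint8):
    """Negative log-likelihood minus the complexity estimate (assumed sign convention)."""
    return neg_loglik_bits_per_dim - complexity_bits_per_dim(image_uint8)

rng = np.random.default_rng(0)
constant_image = np.zeros((32, 32, 3), dtype=np.uint8)
noise_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
low_complexity = complexity_bits_per_dim(constant_image)
high_complexity = complexity_bits_per_dim(noise_image)
```

A constant image compresses to almost nothing while a noise image barely compresses, which is the kind of difference the complexity term is meant to offset.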


Song et al. (2019) observed that generative models with batch normalization assign a lower likelihood to OoD samples than to in-distribution samples. Meanwhile, as the ratio of test samples in a batch increases, the corresponding log-likelihood decreases dramatically for OoD samples but stays relatively stable for in-distribution samples. Based on this insight, an indicator that measures the difference of log-likelihood between two settings with different ratios of test samples in a batch is proposed for OoD detection.

Some researchers also proposed to use labels (for classification tasks) to solve OoD. Che et al. (2019) proposed deep verifier networks for OoD detection, which use conditional deep generative models to verify the predictions of a classifier. Alemi et al. (2018) use the variational information bottleneck (VIB), whose objective is defined through mutual information, to model the bottleneck. Hendrycks and Gimpel (2016); Hendrycks et al. (2018); Hsu et al. (2020); Lee et al. (2018); Lakshminarayanan et al. (2017) proposed indicators based on classifiers for detecting OoD.

3 Problem Statement

The out-of-distribution detection problem can be formulated as a special binary classification problem. In the canonical binary classification problem, a set of images with label 1 (denoted by $A_{train}$) and a set of images with label 0 (denoted by $B_{train}$) are given in training; in testing, a set of images without labels (denoted by $X_{test}$) is given, and the algorithm needs to predict the label of each image in $X_{test}$. $X_{test}$ consists of $A_{test}$ and $B_{test}$. $A_{train}$ and $A_{test}$ are sampled from dataset $A$, and $B_{train}$ and $B_{test}$ are sampled from another dataset $B$.

The key difference between OoD detection and canonical binary classification is that $B_{train}$ is unknown in the OoD detection problem. Moreover, $A$ is quite distinct from $B$ in the OoD problem, e.g., $A$ contains animals and $B$ contains house numbers. $p_A$ and $p_B$ denote the corresponding data distributions of $A$ and $B$, where $p_A$ is called in-distribution and $p_B$ is called out-of-distribution.

It is important to decide whether two datasets are distinct when building the large collection of benchmarks. Two datasets A and B are called simply-classified if a common classifier (e.g., ResNet34) trained for 2-class classification, when the training data of A and B are both known, can predict accurate labels for the testing images, i.e., it reaches a sufficiently high AUROC. If two datasets A, B are simply-classified, both the A vs B and B vs A dataset pairs are considered in our experiments.

This paper considers the OoD detection based on common deep generative models, i.e., VAE, PixelCNN, flow-based models, and GANs. We focus on searching for a simple and effective indicator for OoD. More common datasets (shown in Section 6) are used to validate the generality of indicators.

All common metrics, including AUROC, AUPR, AP, FPR@TPR95, are considered in this paper. AUROC is selected as the major metric, and other metrics are shown in appendix B. AUROC is a threshold-independent metric Davis and Goadrich (2006) and is widely used in the OoD domain.
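For reference, AUROC can be computed directly from ranks, without choosing any threshold. The sketch below uses the standard Mann-Whitney U formulation; the function name `auroc` and the convention that label 1 marks the positive class are assumptions made for the illustration.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; tied scores receive average ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    # Replace the ranks of tied scores by their average
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_statistic = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)

example = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The value equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, so an indicator that ranks the two classes in the wrong order gets an AUROC below 0.5.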

4 Motivating Observations

Figure 1: Histograms of the log-likelihood of VAE on four dataset pairs: (a) CIFAR-10 vs SVHN, (b) KMNIST vs Omniglot, (c) SVHN vs CIFAR-10, (d) MNIST vs Omniglot. The green and red parts denote the log-likelihood of out-of-distribution and in-distribution, respectively. Intuitively, the log-likelihood of out-of-distribution is expected to be lower than in-distribution. However, in the above experiments, the likelihood of out-of-distribution might be higher, lower, or nearly the same as in-distribution. In panels (a) to (d), the AUROC of log-likelihood is 8%, 9%, 99%, 59% and the AUROC of is 84%, 82%, 98%, 66%.

4.1 Counterexamples

Intuitively, the log-likelihood of the sample from out-of-distribution is expected to be lower than the in-distribution because models are trained on in-distribution. However, Nalisnick et al. (2019) observes that VAE, PixelCNN, and RealNVP all assign the higher log-likelihood to samples from out-of-distribution in experiments CIFAR-10 vs SVHN and NotMNIST vs MNIST. The number of datasets in Nalisnick et al. (2019) is quite small, and we suspect that there are more counterexamples on the large collection of benchmarks.

Therefore, we reproduce the experiments on a large collection of benchmarks and find 28 counterexamples in 92 dataset pairs, as shown in Figure 1 and appendix A. These experiments show that the log-likelihood is unpredictable on out-of-distribution data, i.e., it might be lower than, higher than, or the same as on in-distribution data. Moreover, the methods based on the log-likelihood might also have counterexamples on the large collection of benchmarks. We reproduce the indicators of Alemi et al. (2018); Song et al. (2017); Ren et al. (2019); Song and Ermon (2019); Nalisnick et al. (2019); Che et al. (2019) on common generative models and find counterexamples, shown in appendix A. In particular, Nalisnick et al. (2019) observed that there is a clear correlation between likelihoods and complexity estimates. We checked their observation on the large collection of benchmarks, as shown in Figure 2.

Furthermore, counterexamples for OoD indicators not based on deep generative models are shown in appendix A. For example, Lee et al. (2018) reaches 98.24% AUROC on SVHN vs CIFAR-10, but only 38.22% AUROC on Omniglot vs FashionMNIST. These counterexamples indicate a critical generality problem in the OoD domain. An important reason for this phenomenon is that OoD indicators are usually designed from motivating observations on only a few datasets, and it is not guaranteed that these observations also hold on the large collection of benchmarks. These counterexamples encourage evaluation on the large collection of benchmarks.

Figure 2: Omniglot is in-distribution, and other datasets are out-of-distribution. Panels: (a) Correlation, (b) AUROC = 0.9867, (c) AUROC = 0.7770, (d) AUROC = 0.9999. (a) shows the correlation between the likelihoods of a model trained on Omniglot and the complexity estimate. (b) shows the histogram of log-likelihood. (c) shows that the complexity-based indicator might perform worse than log-likelihood. It means that the complexity estimate is rough, and we need a more precise, stable, and interpretable estimate to assist log-likelihood for detecting OoD. (d) shows that $\log p_\theta(x) - \log p_\omega(x)$ might be a good choice as a theoretical indicator (NOT a practical indicator), where $p_\theta$ is trained on the in-distribution training set and $p_\omega$ is trained on the out-of-distribution training set.

4.2 Performance on large collection of benchmarks

For the following reasons, a large collection of benchmarks is used:

1) Check observations. OoD indicators of past works are based on motivating observations on a few datasets. However, we find that observations on a few datasets are not reliable on the large collection of benchmarks, as shown in Section 4.1. In practice, OoD indicators need to handle arbitrary images, and it is harmful if they only work for a few datasets. Therefore, it is necessary to validate the generality of motivating observations on the large collection of benchmarks.

2) Average performance. Average performance on the large collection of benchmarks is better for assessing indicators. In CelebA vs LSUN, log-likelihood reaches 98% AUROC. However, it only reaches 2% in CelebA vs SVHN (appendix A). The average performance takes such low-AUROC experiments into account. It is more meaningful to improve the average performance of indicators than to improve a single experiment by a small margin (e.g., 99.1% to 99.2%).

Indicators of previous works via common deep generative models do not perform well on the large collection of benchmarks, as shown in Table 1, where the indicators not based on likelihood models (including DeConf-C, MCMC Recon, entropy, Mahalanobis, ODIN and disagreement) are proposed by past works Hendrycks and Gimpel (2016); Hsu et al. (2020); Lee et al. (2018); Alemi et al. (2018); Liang et al. (2018); Kumar et al. (2019); Xu et al. (2018); Chen et al. (2019); Lakshminarayanan et al. (2017). Thanks to the assistance of the complexity estimate, the complexity-based indicator of Serrà et al. (2019) performs well among past works, which encourages us to develop better assistance.

Indicator   VAE     PixelCNN   RNVP
[…]         70.11   72.26      69.19
[…]         89.71   84.28      89.72
[…]         53.95   NA         24.27
[…]         69.47   77.46      64.03
[…]         74.59   82.19      83.74
[…]         83.11   82.21      86.06
[…]         53.13   69.22      71.27
[…]         67.26   56.98      77.38
[…] BN      85.15   64.10      82.00
[…]         81.88   88.33      80.11
SF(x)       95.78   97.64      94.34

Indicator        Model    AUROC
Recon            VAE      69.26
ELBO             VAE      69.44
ELBO - Recon     VAE      52.85
MCMC Recon       VAE      67.43
MCMC […]         VAE      67.45
Volume           RNVP     62.46
[…]              RNVP     74.58
H                VIB      66.79
R                VIB      58.78
[…]              WGAN     79.15
[…]              WGAN     60.55
Disagreement     ResNet   69.25
Mahalanobis      ResNet   83.02
Entropy of […]   ResNet   62.74
[…]              ResNet   61.59
ODIN             ResNet   60.68
DeConf-C         ResNet   68.59
DeConf-C*        ResNet   71.09
Table 1: The average AUROC of past works and our method Single-shot Fine-tune (SF(x)) with VAE, PixelCNN, and RNVP on the large collection of benchmarks. SF(x) outperforms past works.

4.3 Observation of KL-based indicator

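As shown in Figure 2, the complexity estimate $L(x)$ is unstable, and sometimes it might lower the performance. From the view of the complexity estimate, $L(x)$ is the log-likelihood of a universal model Serrà et al. (2019). Therefore, we try to find another likelihood-based term to replace $L(x)$. In experiments on the large collection of benchmarks, we observe a common phenomenon (for 99.815% of the data in all experiments): $\log p_\theta(x) > \log p_\omega(x)$ for most in-distribution testing samples $x$ and $\log p_\theta(x) < \log p_\omega(x)$ for most out-of-distribution testing samples $x$, where $p_\theta$ is the model trained on the in-distribution training set and $p_\omega$ is a likelihood-based model trained on the out-of-distribution training set. The average AUROC of $\log p_\theta(x) - \log p_\omega(x)$ reaches nearly 100% on all datasets in Table 2. $p_\omega$ can be seen as an improvement of the universal model behind $L(x)$: it is not a universal model but a particular model for OoD detection. $\log p_\theta(x) - \log p_\omega(x)$ is called the KL-based indicator.

However, the out-of-distribution training set is unknown in the OoD problem. Therefore, the KL-based indicator is only a theoretical indicator, not a practical one. Next, we develop an indicator that approximates the KL-based indicator without training on out-of-distribution data, and explain why the KL-based indicator is always effective theoretically.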

5 Algorithm

Based on the observation of the KL-based indicator, we propose a novel theoretical framework, DOI. Through DOI, a strawman algorithm, Naive Fine-tune, is proposed to introduce a novel OoD criterion, called the fine-tune criterion. Naive Fine-tune has an obvious weakness: it needs the whole testing set, which is not allowed in the OoD domain. To solve this problem, we propose the Single-shot Fine-tune algorithm, which fine-tunes the model on a single testing sample. It is an inductive method.

5.1 Divergence-based OoD Indicators

We propose a fundamental theoretical framework for Divergence-based OoD Indicators called DOI. The key idea of DOI is to use the divergence between in-distribution and out-of-distribution to detect OoD instead of likelihood.

To achieve this idea, the Kullback-Leibler divergence is chosen as the divergence in DOI. Based on 5 fundamental assumptions (also observed in experiments), several theorems are proved in appendix C. We state the theorems here without proofs.

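Theorem. 1 The KL-based indicator $\log p_\theta(x) - \log p_\omega(x)$, where $p_\theta$ and $p_\omega$ are models trained on the in-distribution and out-of-distribution data, and a second indicator built on a map of the input into a Gaussian distribution are effective symmetric indicators, i.e., the two indicators both reach the same performance in experiments A vs B and B vs A, with threshold 0.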

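Theorem. 2 For any mixture distribution $p_\gamma$ of the in-distribution and the out-of-distribution with nonzero mixture weights, the performance of the indicator $\log p_\theta(x) - \log p_\gamma(x)$ and of the KL-based indicator $\log p_\theta(x) - \log p_\omega(x)$ is equal for OoD detection.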

Theorem. 3 On any dataset pair on which log-likelihood works well, i.e., the log-likelihood of most in-distribution samples exceeds that of out-of-distribution samples, the KL-based indicator can reach better performance.

Theorem. 4 For any likelihood-ratio indicator $\log p_\theta(x) - \log q(x)$, where $q$ is a continuous differentiable probability distribution, the KL-based indicator outperforms it.

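Theorem. 5 The indicator $\log p_\theta(x) - \log p_\gamma(x)$, where $p_\gamma$ is a model trained on the testing set, can reach better AUROC than the KL-based indicator when $p_\gamma$ is well-trained, i.e., when it reaches a better likelihood on the testing set.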

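By Theorem 1, the indicator $\log p_\theta(x) - \log p_\omega(x)$ can nearly perfectly solve the OoD problem, as shown in Figure 2. By Theorem 1, $\log p_\theta(x) - \log p_\omega(x) > 0$ for in-distribution $x$ and $\log p_\theta(x) - \log p_\omega(x) < 0$ for out-of-distribution $x$, as shown in Figure 3. $\log p_\theta(x) - \log p_\omega(x)$ is called the KL-based indicator. However, it is not a practical method, since training $p_\omega$ needs the out-of-distribution training set, which is not allowed in the OoD domain. Through Theorem 5, DOI finds a method that does NOT need the out-of-distribution training set to approximate the KL-based indicator, as shown in the next section.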
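The intuition behind the mixture result (Theorem 2) can be checked with a short calculation. The sketch below is not the proof from appendix C; it assumes idealized models that equal the true densities $p_A$ (in-distribution) and $p_B$ (out-of-distribution), a mixture weight $\lambda \in (0, 1)$ (notation introduced here), and $p_B(x) > 0$.

```latex
\[
\frac{p_A(x)}{p_\gamma(x)}
  = \frac{p_A(x)}{\lambda\, p_A(x) + (1-\lambda)\, p_B(x)}
  = \frac{r(x)}{\lambda\, r(x) + (1-\lambda)},
\qquad r(x) = \frac{p_A(x)}{p_B(x)},
\]
\[
\frac{d}{dr}\,\frac{r}{\lambda r + (1-\lambda)}
  = \frac{1-\lambda}{\bigl(\lambda r + 1 - \lambda\bigr)^{2}} > 0
\quad \text{for } 0 < \lambda < 1 .
\]
```

Under these assumptions, $\log p_A(x) - \log p_\gamma(x)$ is a strictly increasing function of $\log p_A(x) - \log p_B(x)$, so both indicators rank samples identically and give the same AUROC. The threshold is preserved as well: $r(x) = 1$ maps to a ratio of exactly 1, i.e., to threshold 0 in log space.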

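Figure 3: Diagrammatic sketch of the KL-based indicator. The in-distribution is in [0, 10] and the out-of-distribution is in [10, 20]. Intuitively, $p_\theta$ (trained on in-distribution data) assigns higher density to in-distribution samples and lower density to out-of-distribution samples than $p_\omega$ (trained on out-of-distribution data), i.e., $p_\theta(x) > p_\omega(x)$ for $x \in [0, 10]$ and $p_\theta(x) < p_\omega(x)$ for $x \in [10, 20]$.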
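The one-dimensional picture of Figure 3 can be reproduced with a toy sketch. Everything below is an illustrative assumption: single Gaussians fitted by maximum likelihood stand in for the deep generative models, and the names `fit_gaussian`, `log_density`, and `kl_based_indicator` are hypothetical.

```python
import numpy as np

def fit_gaussian(samples):
    """Maximum-likelihood 1-D Gaussian: returns (mean, std)."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(), samples.std()

def log_density(x, params):
    """Log-density of a 1-D Gaussian with params = (mean, std)."""
    mean, std = params
    x = np.asarray(x, dtype=float)
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def kl_based_indicator(x, theta, omega):
    """log p_theta(x) - log p_omega(x); a positive value suggests in-distribution."""
    return log_density(x, theta) - log_density(x, omega)

rng = np.random.default_rng(0)
theta = fit_gaussian(rng.uniform(0.0, 10.0, size=5000))    # model of the in-distribution
omega = fit_gaussian(rng.uniform(10.0, 20.0, size=5000))   # model of the OoD data (theoretical only)

x_in = rng.uniform(0.0, 10.0, size=1000)
x_out = rng.uniform(10.0, 20.0, size=1000)
frac_in_positive = np.mean(kl_based_indicator(x_in, theta, omega) > 0)
frac_out_negative = np.mean(kl_based_indicator(x_out, theta, omega) < 0)
```

With threshold 0, nearly all in-distribution samples get a positive indicator and nearly all out-of-distribution samples a negative one. The sketch also makes the practical obstacle visible: `omega` can only be fitted because out-of-distribution samples were available.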

5.2 Naive Fine-tune

By Theorem 5, the indicator $\log p_\theta(x) - \log p_\gamma(x)$ could reach the same performance as the KL-based indicator, while $p_\gamma$ only needs the testing set for training (note that training $p_\gamma$ does NOT need the labels of the testing set). From another perspective, using this indicator can be treated as a fine-tune process: a model is well-trained on the training set, and then it is fine-tuned on the testing set; if a testing sample $x$ gets a worse likelihood after fine-tuning, $x$ is detected as in-distribution, otherwise as out-of-distribution. The fine-tune criterion is significantly different from the previous likelihood criteria in Table 1.

To introduce the fine-tune criterion, the Naive Fine-tune algorithm is proposed in Algorithm 1, which is only a strawman algorithm since Naive Fine-tune requires testing set . In Algorithm 1, is used to detect whether the is too high or too low ( will be treated as OoD directly), by Theorem 1. Fine-tune criterion is shown effective in Table 2.

However, Naive Fine-tune is of little use in realistic scenarios, since the whole testing set is hard to obtain Xu et al. (2018) and using it is not allowed in the OoD domain. To solve this problem, we study Naive Fine-tune in detail in Section 6.2 and develop an inductive method in the following subsection.

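  Input: The training set $A_{train}$, the testing set $X_{test}$, and a flag pretrain, which represents whether to use $\theta$ to initialize $\gamma$.
  Output: Predicted label for each image in $X_{test}$
  $\theta \leftarrow$ Maximize log-likelihood $\log p_\theta$ on $A_{train}$
  $\gamma \leftarrow \theta$ if pretrain is True else random initialize
  $\gamma \leftarrow$ Maximize log-likelihood $\log p_\gamma$ on $X_{test}$
  return the label given by the fine-tune criterion $\log p_\theta(x) - \log p_\gamma(x)$ for each $x \in X_{test}$
Algorithm 1 Naive Fine-tune Algorithm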
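The control flow of Algorithm 1 can be sketched on toy data. This is a minimal sketch under loud assumptions: a 1-D Gaussian trained by gradient ascent stands in for the deep generative model, and the names `maximize_loglik` and `naive_fine_tune`, the learning rate, and the step counts are all hypothetical.

```python
import numpy as np

def log_density(x, params):
    """Log-density of a 1-D Gaussian with params = (mean, log_std)."""
    mean, log_std = params
    return -0.5 * np.log(2 * np.pi) - log_std - (x - mean) ** 2 / (2 * np.exp(2 * log_std))

def maximize_loglik(data, params, steps, lr=0.05):
    """Gradient ascent on the mean log-likelihood, starting from params."""
    mean, log_std = params
    for _ in range(steps):
        var = np.exp(2 * log_std)
        grad_mean = np.mean(data - mean) / var
        grad_log_std = np.mean((data - mean) ** 2) / var - 1.0
        mean, log_std = mean + lr * grad_mean, log_std + lr * grad_log_std
    return mean, log_std

def naive_fine_tune(x_train, x_test, rng, pretrain=True, steps=500):
    """Score log p_theta(x) - log p_gamma(x); positive suggests in-distribution."""
    theta = maximize_loglik(x_train, (0.0, 0.0), steps)         # train on the training set
    init = theta if pretrain else (float(rng.normal()), 0.0)    # random start if not pretrained
    gamma = maximize_loglik(x_test, init, steps)                # fine-tune on the unlabeled testing set
    return log_density(x_test, theta) - log_density(x_test, gamma)

rng = np.random.default_rng(0)
x_train = rng.normal(1.0, 1.0, size=2000)
x_in, x_out = rng.normal(1.0, 1.0, size=500), rng.normal(7.0, 1.0, size=500)
scores = naive_fine_tune(x_train, np.concatenate([x_in, x_out]), rng)
# Probability that an in-distribution sample outscores an OoD sample (AUROC)
toy_auroc = np.mean(scores[:500, None] > scores[None, 500:])
```

In this toy setting, fine-tuning on the mixed testing set makes the model broader, so in-distribution samples lose likelihood while out-of-distribution samples gain it. Note that the whole testing set is an input, which is exactly the weakness discussed above.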

5.3 Single-shot Fine-tune

By Theorem 5 and Figure 4, Naive Fine-tune reaches a promising performance in a few epochs. Inspired by this, we propose the Single-shot Fine-tune algorithm in Algorithm 2. The key idea of Single-shot Fine-tune is to fine-tune the model on a single testing sample $x$, instead of the whole testing set. Data augmentation generates more samples from $x$ to enhance the generality of the fine-tuning samples.

Specifically, if $x$ is in-distribution, the model is fine-tuned on a sample from the in-distribution (note that the model is already well-trained on the in-distribution training set), and then the likelihood of $x$ after fine-tuning will be close to its likelihood before fine-tuning; if $x$ is out-of-distribution, the model is fine-tuned on a sample from the out-of-distribution, and then the likelihood of $x$ after fine-tuning will be much larger than before.

By fine-tuning the model on a single testing sample, Single-shot Fine-tune removes the weakness of Naive Fine-tune. The only input required by Single-shot Fine-tune is the testing sample $x$, and obviously, every method needs $x$ as input. Finally, Algorithm 2 might be confusing since it uses back-propagation, which is usually used in training instead of testing.

Why we can use back-propagation. In canonical deep learning domains, e.g., classification, back-propagation is usually used in training instead of testing because of the following 3 major reasons:

Labels. In the canonical deep learning domain, e.g., classification, the loss function usually uses the labels, and then back-propagation needs labels as input, which is not allowed in testing.

Samples. Back-propagation needs enough samples. However, in the testing stage, especially in the inductive learning domain, it is not allowed to obtain many testing samples.

Time. Back-propagation usually needs many steps to train the model, which is time-consuming. However, especially in some online-system, the testing time should be short enough.

The above problems lead to an inherent impression that back-propagation can not be used in the testing stage. However, Single-shot Fine-tune has solved the above problems in the OoD domain:

Labels. The loss functions of deep generative models do not use the labels.

Samples. Single-shot Fine-tune algorithm only uses the single testing sample as input and uses data-augmentation to enhance the generality of samples. Only popular data-augmentation methods are used (containing shift and rotation) instead of special-designed data-augmentation.

Time. Single-shot Fine-tune only uses a few steps (64 steps in 7s per testing sample, while Naive Fine-tune costs 60k steps) to fine-tune the model. The fastest method, log-likelihood, costs 0.36s per testing sample, but its performance is much lower (24%) than that of Single-shot Fine-tune.

In conclusion, we argue that back-propagation should be allowed to be used in Single-shot Fine-tune.

  Input: The training set $A_{train}$, the testing sample $x$, and fine-tuning steps $T$.
  Output: Predicted label for $x$
  $\theta \leftarrow$ Maximize log-likelihood $\log p_\theta$ on $A_{train}$
  for $t = 1$ to $T$ do
     Generate a batch through data-augmentation for $x$
     Maximize log-likelihood on the batch for one step
  end for
Algorithm 2 Single-shot Fine-tune Algorithm
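The loop of Algorithm 2 can be sketched on toy data as follows. The sketch makes several assumptions that are not from the paper: a 1-D Gaussian replaces the deep generative model, additive jitter replaces shift and rotation as the augmentation, the fine-tuned model starts from a copy of the trained one, and the score is the likelihood gain of the sample after fine-tuning. The names `ascent_step` and `single_shot_score` are hypothetical.

```python
import numpy as np

def log_density(x, params):
    """Log-density of a 1-D Gaussian with params = (mean, log_std)."""
    mean, log_std = params
    return -0.5 * np.log(2 * np.pi) - log_std - (x - mean) ** 2 / (2 * np.exp(2 * log_std))

def ascent_step(batch, params, lr):
    """One gradient-ascent step on the mean log-likelihood of the batch."""
    mean, log_std = params
    var = np.exp(2 * log_std)
    grad_mean = np.mean(batch - mean) / var
    grad_log_std = np.mean((batch - mean) ** 2) / var - 1.0
    return mean + lr * grad_mean, log_std + lr * grad_log_std

def single_shot_score(x, theta, rng, steps=64, batch_size=32, jitter=0.1, lr=0.05):
    """Likelihood gain of x after fine-tuning a copy of theta on augmented copies of x."""
    gamma = theta  # start from the trained model (assumption)
    for _ in range(steps):
        batch = x + jitter * rng.standard_normal(batch_size)  # stand-in augmentation
        gamma = ascent_step(batch, gamma, lr)
    return log_density(x, gamma) - log_density(x, theta)

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=2000)
theta = (x_train.mean(), np.log(x_train.std()))  # closed-form fit stands in for training
scores_in = np.array([single_shot_score(x, theta, rng) for x in rng.normal(0.0, 1.0, size=50)])
scores_out = np.array([single_shot_score(x, theta, rng) for x in rng.normal(8.0, 1.0, size=50)])
# Probability that an OoD sample gains more likelihood than an in-distribution sample
toy_auroc = np.mean(scores_in[:, None] < scores_out[None, :])
```

Only the single testing sample and the trained model enter `single_shot_score`, which is what makes the procedure inductive. With such a simple model the toy cannot show an advantage over plain likelihood; it only illustrates the control flow and the 64-step budget.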

6 Experiments

This section demonstrates the effectiveness of KL-based indicator, Naive Fine-tune, and Single-shot Fine-tune, on computer vision benchmark datasets. Detailed setup is shown in Appendix B.

6.1 Major Results

Indicator VAE PixelCNN RNVP
KL-based Indicator 99.08 99.85 99.81
Naive Fine-tune 98.68 97.80 98.55
Single-shot Fine-tune 95.78 97.64 94.34
Table 2: The average AUROC of the KL-based indicator (only theoretical, NOT practical), Naive Fine-tune (transductive), and Single-shot Fine-tune (inductive) on VAE, PixelCNN, and RNVP. The KL-based indicator uses the out-of-distribution training set. Thus it is not practical. Naive Fine-tune reaches nearly the same performance as the KL-based indicator, which validates Theorem 5. Single-shot Fine-tune is slightly worse than Naive Fine-tune. However, it is inductive and only costs about 7s per image.

The main results of past works are shown in Table 1. The main results for KL-based indicator, Naive Fine-tune, and Single-shot Fine-tune are summarized in Table 2.

6.2 Addressing concerns

The key idea of this paper is to detect OoD by fine-tuning the model with testing samples. However, there are the following major concerns about this idea:

Q1. Is Naive Fine-tune data-specific? i.e., does it work for the data that have not been fine-tuned on?

A1. In Table 3, Naive Fine-tune reaches a promising performance (slightly lower than the KL-based indicator) when 20% of the testing data are used for fine-tuning. Thus Naive Fine-tune is not data-specific.

Q2. Can our method work online? i.e., the testing data is streaming.

A2. We simulate an online system in which the testing set arrives as a stream: when a sample is tested, only the samples that arrived before it are known. At each time step, Naive Fine-tune runs with the testing samples received so far. Table 3 and Figure 4 show that Naive Fine-tune reaches a promising performance under the online limitation.

Q3. Can Naive Fine-tune work if the testing set contains few data?

A3. The testing set is divided into several blocks, and Naive Fine-tune runs on each block. In this case, both the data for fine-tuning and the fine-tune epochs are fewer than in the ordinary case. Figure 4 shows that the optimization leads to unexpected results when data are insufficient. Training on past data (online) can alleviate the issue. These experiments show that Naive Fine-tune is effective, simple, online, and not data-specific. They also show its weakness: the Naive Fine-tune algorithm cannot work well when the data for fine-tuning are severely insufficient. This encourages the development of Single-shot Fine-tune.

Q4. What is the difference between pretrained (initialized with the model trained on the training set) and unpretrained models?

A4. Figure 4 shows the AUROC during Naive Fine-tune. The unpretrained model needs more epochs than the pretrained model to reach a good performance. In contrast, the pretrained model can easily reach a promising performance in a few epochs, which leads to the Single-shot Fine-tune algorithm.

Q5. Does the Single-shot Fine-tune method rely on the data augmentation and the number of steps?

A5. Algorithm 2 does not mandate a specially designed data augmentation. Appendix B shows experiments with data augmentations including shifting, rotation, blur, noise, scaling, cropping, flipping, and modifying contrast and lightness. Their performance has no significant difference (0.3%) from the basic data augmentation consisting of rotation and shift. When the number of steps is small, it significantly influences the AUROC, and once it is large enough, the AUROC is nearly the same, as shown in appendix B.2. Therefore, we set step = 64 as the default parameter of Single-shot Fine-tune. Additionally, when step = 64, the time cost is only 7s per testing sample.
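A basic shift-and-rotation augmentation of the kind mentioned in A5 can be sketched with NumPy alone. The ranges (`max_shift`, `max_degrees`), the zero fill, the nearest-neighbour rotation, and all function names are assumptions made for this illustration; the actual settings are those of appendix B.

```python
import numpy as np

def shift_image(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling exposed borders with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0], max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def rotate_image(image, degrees):
    """Rotate a 2-D image about its centre with nearest-neighbour sampling."""
    h, w = image.shape
    angle = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate the source pixel
    src_y = np.rint(np.cos(angle) * (ys - cy) + np.sin(angle) * (xs - cx) + cy).astype(int)
    src_x = np.rint(-np.sin(angle) * (ys - cy) + np.cos(angle) * (xs - cx) + cx).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(image)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out

def augment_batch(image, batch_size, rng, max_shift=2, max_degrees=10.0):
    """Build a batch of randomly shifted and rotated copies of one image."""
    batch = []
    for _ in range(batch_size):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = shift_image(image, int(dy), int(dx))
        batch.append(rotate_image(shifted, rng.uniform(-max_degrees, max_degrees)))
    return np.stack(batch)

image = np.arange(25.0).reshape(5, 5)
batch = augment_batch(image, 4, np.random.default_rng(0))
```

Each fine-tuning step would draw a fresh batch like this from the single testing sample, so the model never sees exactly the same input twice.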

Limitation None online 20% block
VAE 98.68 97.24 96.50 97.60
PixelCNN 97.80 91.77 91.47 88.80
RNVP 98.55 90.92 88.16 90.44
Table 3: Average AUROC of Naive Fine-tune with limitations introduced in Section 6.2.
Figure 4: Left: Average AUROC of pretrained model and unpretrained model during the Naive Fine-tune on CIFAR-10 vs other datasets. Mid: AUROC of Naive Fine-tune when the number of samples used by Naive Fine-tune varies on CIFAR-10 vs SVHN. ’online’, ’block’ is introduced in Section 6.2. ’direct’ indicates that model is directly trained on such few data with enough epochs. Right: ROC and PRC on MNIST vs Omniglot based on VAE model. The KL-based indicator, Naive Fine-tune algorithm, and Single-shot Fine-tune algorithm significantly outperform others.

6.3 Validation of Theorem

Theorem 1 and Theorem 3 are supported by the detailed experiments shown in appendix B. Theorem 2 is supported by Table 2, where the performance of the indicator based on a model trained on the mixed testing set is quite close to that of the KL-based indicator. Figure 4 shows the PRC and ROC of the KL-based indicator, the log-likelihood indicator, the likelihood ratio indicator, and others, which supports Theorem 4 that the KL-based indicator is the best. Theorem 5 is supported by Table 2, where the performance of Naive Fine-tune is close to that of the KL-based indicator.

6.4 Limitations of this Study

Limitation of datasets. In our paper, we use a large collection of benchmarks to show the generality of indicators. However, we only consider natural OoD datasets and do not consider attacked OoD, which are categorized by Chen et al. (2020). An important reason is that there is no universal criterion like simply-classified, to measure attacked OoD datasets.

Limitation of models. In our paper, for fair comparison and generality, we only consider the common models, including ResNet, VAE, PixelCNN, RealNVP, and WGAN. However, there are numerous models careful-designed for OoD detection. Due to the resource limitation, we can not provide the performance of them on the large collection of benchmarks.

Limitation of KL-based indicators. In Section 6, KL-based indicators depend on the model (Naive Fine-tune with PixelCNN and RNVP is more data-specific) and on the optimizer (the optimizer cannot provide the expected with insufficient data). Single-shot Fine-tune addresses these problems through data augmentation. However, Single-shot Fine-tune performs worse than Naive Fine-tune and needs further development (e.g., carefully designed data augmentation and optimizers for the single-shot setting).

7 Conclusion and Future Work

This paper first shows that none of the existing OoD indicators based on deep generative models performs well on a large collection of benchmarks. We then propose DOI, a novel theoretical framework for divergence-based out-of-distribution indicators, and the Single-shot Fine-tune algorithm, which significantly outperforms past works by 5% to 8% in AUROC.

We believe that the divergence-based theoretical framework and the fine-tune criterion presented in this paper are important steps towards more effective OoD indicators based on deep generative models. For future work, it will be interesting to derive further OoD indicators from the DOI framework and the fine-tune criterion.


References

  • A. A. Alemi, I. Fischer, and J. V. Dillon (2018) Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906. Cited by: §2, §4.1, §4.2.
  • C. M. Bishop (1994) Novelty detection and neural network validation. IEE Proceedings-Vision, Image and Signal processing 141 (4), pp. 217–222. Cited by: §1.
  • T. Che, X. Liu, S. Li, Y. Ge, R. Zhang, C. Xiong, and Y. Bengio (2019) Deep verifier networks: verification of deep discriminative models with deep generative models. arXiv preprint arXiv:1911.07421. Cited by: §1, §2, §2, §4.1.
  • J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha (2020) Robust out-of-distribution detection via informative outlier mining. arXiv preprint arXiv:2006.15207. Cited by: §1, §6.4.
  • W. Chen, H. Xu, Z. Li, D. Pei, J. Chen, H. Qiao, Y. Feng, and Z. Wang (2019) Unsupervised anomaly detection for intricate kpis via adversarial training of vae. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1891–1899. Cited by: §4.2.
  • H. Choi, E. Jang, and A. A. Alemi (2018) Waic, but why? generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392. Cited by: §1, §1, §1, §2.
  • J. Davis and M. Goadrich (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233–240. Cited by: §3.
  • L. Dinh, J. Sohl-Dickstein, et al. (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §2, §4.2.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §1, §2.
  • Y. Hsu, Y. Shen, H. Jin, and Z. Kira (2020) Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960. Cited by: §2, §4.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1.
  • R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio (2019) Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508. Cited by: §4.2.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems. Cited by: §2, §4.2.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §2, §4.1, §4.2.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations. Cited by: §4.2.
  • T. E. Nalisnick, A. Matsukawa, W. Y. Teh, D. Görür, and B. Lakshminarayanan (2019) Do deep generative models know what they don't know?. In International Conference on Learning Representations. Cited by: §1, §2, §2, §4.1, §4.1.
  • Y. Netzer, T. Wang, et al. (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §1.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pp. 14707–14718. Cited by: §1, §2, §4.1.
  • J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque (2019) Input complexity and out-of-distribution detection with likelihood-based generative models. arXiv preprint arXiv:1909.11480. Cited by: §1, §2, §2, §4.3.
  • J. Song, Y. Song, et al. (2019) Unsupervised out-of-distribution detection with batch normalization. arXiv preprint arXiv:1910.09115. Cited by: §1, §2.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907. Cited by: §4.1.
  • Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §1, §2, §4.1.
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi (2016) Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261. Cited by: §1.
  • H. Takahashi, T. Iwata, et al. (2019) Variational autoencoder with implicit optimal priors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5066–5073. Cited by: §1, §2.
  • J. Tomczak and M. Welling (2018) VAE with a vampprior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223. Cited by: §1, §2.
  • A. Van den Oord, N. Kalchbrenner, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §1, §2.
  • H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, et al. (2018) Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 World Wide Web Conference, pp. 187–196. Cited by: §4.2, §5.2.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §1.
  • Z. Xiao, Q. Yan, and Y. Amit (2020) Likelihood regret: an out-of-distribution detection score for variational auto-encoder. In Advances in Neural Information Processing Systems. Cited by: footnote 1.