1 Introduction
Assessing whether input data is novel or significantly different from the data used in training is critical for real-world machine learning applications. Such inputs are known as out-of-distribution (OOD) inputs, and detecting them should facilitate safe and reliable model operation. This is particularly necessary for deep neural network classifiers, which can be easily fooled by OOD data
(Nguyen et al., 2015). Several approaches have been proposed for OOD detection on top of or within a neural network classifier (Hendrycks & Gimpel, 2017; Lakshminarayanan et al., 2017; Liang et al., 2018; Lee et al., 2018). Nonetheless, OOD detection is not limited to classification tasks nor to labeled data sets. Two examples are novelty detection from an unlabeled data set and next-frame prediction from video sequences.
A rather obvious strategy to perform OOD detection in the absence of labels (and even in the presence of them) is to learn a density model p(x|M) that approximates the true distribution p*(x) of training inputs (Bishop, 1994). Then, if such an approximation is good enough, that is, p(x|M) ≈ p*(x), OOD inputs should yield a low likelihood under model M. With complex data like audio or images, this strategy was long thought to be unattainable due to the difficulty of learning a sufficiently good model. However, with current approaches, we start having generative models that are able to learn good approximations of the density conveyed by those complex data. Autoregressive and invertible models such as PixelCNN++ (Salimans et al., 2017) and Glow (Kingma & Dhariwal, 2018) perform well in this regard and, in addition, can approximate p*(x) with arbitrary accuracy.
Recent works, however, have shown that likelihoods derived from generative models fail to distinguish between training data and some OOD input types (Choi et al., 2018; Nalisnick et al., 2019a; Hendrycks et al., 2019). This occurs for different likelihood-based generative models, even when inputs are unrelated to the training data or have totally different semantics. For instance, when trained on CIFAR10, generative models report higher likelihoods for SVHN than for CIFAR10 itself (Fig. 1; data descriptions are available in Appendix A). Intriguingly, this behavior is not consistent across data sets, as others correctly tend to produce likelihoods lower than those of the training data (see the example of TrafficSign in Fig. 1). A number of explanations have been suggested for the root cause of this behavior (Choi et al., 2018; Nalisnick et al., 2019a; Ren et al., 2019) but, to date, a full understanding of the phenomenon remains elusive.
In this paper, we shed light on the above phenomenon, showing that likelihoods computed from generative models exhibit a strong bias towards the complexity of the corresponding inputs. We find that qualitatively complex images tend to produce the lowest likelihoods, and that simple images always yield the highest ones. In fact, we show a clear negative correlation between quantitative estimates of complexity and the likelihood of generative models. In the second part of the paper, we propose to leverage such estimates of complexity to detect OOD inputs. To do so, we introduce a widely-applicable OOD score for individual inputs that corresponds, conceptually, to a likelihood-ratio test akin to Bayesian model comparison. We show that such a score turns likelihood-based generative models into practical and effective OOD detectors, with performance comparable to, or even better than, the state of the art. We base our experiments on an extensive collection of alternatives, including a pool of 12 data sets, two conceptually different generative models, and three variants of complexity estimates.
2 Complexity Bias in Likelihood-based Generative Models
From now on, we shall consider the log-likelihood of an input x given a model M: ℓ_M(x) = log p(x|M). Following common practice in evaluating generative models, negative log-likelihoods −ℓ_M(x) will be expressed in bits per dimension (Theis et al., 2016), where dimension corresponds to the total size of x (we resize all images to 3×32×32 pixels). Note that the qualitative behavior of log-likelihoods is the same as that of likelihoods. Ideally, OOD inputs should have a low ℓ_M(x), while in-distribution data should have a larger ℓ_M(x).
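As a concrete illustration, a log-likelihood in nats can be converted into a negative log-likelihood in bits per dimension. This is a minimal sketch; the function name and the nats-based input convention are our own assumptions, not from the paper:

```python
import numpy as np

def nll_bits_per_dim(log_likelihood_nats: float, dims: int = 3 * 32 * 32) -> float:
    """Convert a log-likelihood in nats into a negative log-likelihood
    in bits per dimension, the unit used throughout the paper."""
    return -log_likelihood_nats / (dims * np.log(2.0))
```

For a 3×32×32 image, `dims` is 3072, so a log-likelihood of −3072·ln 2 nats corresponds to exactly 1 bit per dimension.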
Most literature compares likelihoods of a given model for a few data sets. However, if we consider several different data sets at once and study their likelihoods, we can get some insight. In Fig. 2, we show the log-likelihood distributions for the considered data sets (Appendix A), computed with a Glow model trained on CIFAR10. We observe that the data set with the highest log-likelihood is Constant, a data set of constant-color images, followed by Omniglot, MNIST, and FashionMNIST; all of them featuring grayscale images with a large presence of empty black background. On the other side of the spectrum, we observe that the data set with the lowest log-likelihood is Noise, a data set of uniform random images, followed by TrafficSign and TinyImageNet; both featuring colorful images with non-trivial backgrounds. This ordering is perhaps clearer when looking at the average log-likelihood of each data set (Appendix D). If we think about the visual complexity of the images in those data sets, it would seem that log-likelihoods tend to grow as images become simpler, with less information or content.
To further confirm the previous observation, we design a controlled experiment where we can set different decreasing levels of image complexity. We train a generative model with some data set, as before, but now compute likelihoods of progressively simpler inputs. Such inputs are obtained by average-pooling the uniform random Noise images by factors of 1, 2, 4, 8, 16, and 32, and rescaling the images back to the original size by nearest-neighbor upsampling. Intuitively, a noise image with a pooling size of 1 (no pooling) has the highest complexity, while a noise image with a pooling of 32 (constant-color image) has the lowest complexity. Pooling factors from 2 to 16 then account for intermediate, decreasing levels of complexity. The result of the experiment is a progressive increase of the log-likelihood (Fig. 3). Given that the only difference between inputs is the pooling factor, we can infer that image complexity plays a major role in generative models’ likelihoods.
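The pooled-noise construction can be sketched as follows. This is a hypothetical NumPy reconstruction: the function name, the RNG seed, and the rounding step are our assumptions:

```python
import numpy as np

def pooled_noise(pool: int, size: int = 32, channels: int = 3, seed: int = 0) -> np.ndarray:
    """Uniform random image, average-pooled by `pool` and upsampled back
    to `size` x `size` with nearest-neighbor (block repetition)."""
    rng = np.random.default_rng(seed)
    img = rng.integers(0, 256, size=(channels, size, size)).astype(np.float64)
    k = size // pool
    # average over non-overlapping pool x pool blocks
    blocks = img.reshape(channels, k, pool, k, pool).mean(axis=(2, 4))
    # nearest-neighbor upsampling: repeat each block value `pool` times
    up = np.repeat(np.repeat(blocks, pool, axis=1), pool, axis=2)
    return np.round(up).astype(np.uint8)
```

With `pool=1` the image is unchanged (maximal complexity); with `pool=32` every channel collapses to a single constant value (minimal complexity).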
Until now, we have consciously avoided a quantitative definition of complexity. However, to further study the observed phenomenon, and despite the difficulty in quantifying the multiple aspects that affect the complexity of an input (cf. Lloyd, 2001), we have to adopt one. A sensible choice would be to exploit the notion of Kolmogorov complexity (Kolmogorov, 1963) which, unfortunately, is non-computable. In such cases, one can instead calculate an upper bound using a lossless compression algorithm (Cover & Thomas, 2006). Given a set of inputs x coded with the same bit depth, the normalized size of their compressed versions, L(x) (in bits per dimension), can be considered a reasonable estimate of their complexity. That is, given the same coding depth, a highly complex input will require more bits per dimension, while a less complex one will be compressed with fewer bits per dimension. For images, we can use PNG, JPEG2000, or FLIF compressors (Appendix C). For other data types such as audio or text, other lossless compressors should be available to produce a similar estimate.
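A minimal version of this complexity estimate, using the standard-library zlib codec as a stand-in for the PNG/JPEG2000/FLIF compressors used in the paper:

```python
import zlib

import numpy as np

def complexity_bpd(x: np.ndarray) -> float:
    """Upper-bound complexity estimate L(x) in bits per dimension:
    length of the losslessly compressed input, normalized by its size.
    zlib stands in here for the PNG/JPEG2000/FLIF compressors."""
    raw = np.ascontiguousarray(x, dtype=np.uint8).tobytes()
    compressed = zlib.compress(raw, 9)  # maximum compression level
    return 8.0 * len(compressed) / x.size
```

A constant image compresses to well under 1 bit/dim, while uniform noise is essentially incompressible and stays near (or slightly above) 8 bits/dim, matching the intuition in the text.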
If we study the relation between generative models’ likelihoods and our complexity estimates, we observe a clear negative correlation (Fig. 4). Considering all data sets, we find Pearson’s correlation coefficients below −0.75 for models trained on FashionMNIST, and below −0.9 for models trained on CIFAR10, independently of the compressor used (Appendix D). Such significant correlations, all of them with infinitesimal p-values, indicate that likelihood-based measures are highly influenced by the complexity of the input image, and that this concept accounts for most of their variance. In fact, such strong correlations suggest that we could replace the computed likelihood values with the negative of our complexity estimate and obtain almost the same result. This implies that, in terms of detecting OOD inputs, a complexity estimate would perform as well (or as badly) as the likelihoods computed from our generative models.
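The correlation analysis can be reproduced in miniature on synthetic data. The per-image values below are fabricated purely for illustration; in the paper they come from trained Glow/PixelCNN++ models and real compressors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated per-image quantities in bits/dim: complexity estimates, and
# negative log-likelihoods that track complexity plus a little noise.
complexity = rng.uniform(0.5, 8.0, size=1000)
neg_loglik = complexity + rng.normal(0.0, 0.3, size=1000)

# Pearson correlation between the log-likelihood and the complexity:
# strongly negative, mirroring the relationship shown in Fig. 4.
r = np.corrcoef(-neg_loglik, complexity)[0, 1]
```

When likelihoods are driven mostly by complexity, as here by construction, the coefficient approaches −1.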
3 Testing Out-of-distribution Inputs
3.1 Definition
As complexity seems to account for most of the variability in generative models’ likelihoods, we propose to compensate for it when testing for possible OOD inputs. Given that both negative log-likelihoods and the complexity estimate are expressed in bits per dimension (Sec. 2), we can express our OOD score as a subtraction between the two:

S(x) = −ℓ_M(x) − L(x). (1)

Notice that, since we use negative log-likelihoods, the higher S(x), the more OOD the input x will be (see below).
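Equation 1 is a one-liner once both quantities are expressed in bits per dimension. A sketch; the function and argument names are ours:

```python
def ood_score(nll_bpd: float, complexity_bpd: float) -> float:
    """S(x) = -l_M(x) - L(x), with the negative log-likelihood -l_M(x)
    given directly in bits/dim. Larger S suggests a more OOD input."""
    return nll_bpd - complexity_bpd
```

When the model assigns an abnormally low negative log-likelihood to a simple input, the small complexity term pulls the score back toward zero; when the model predicts an input worse than a generic compressor, the score is positive.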
3.2 Interpretation: Occam’s Razor and the Out-of-distribution Problem
Interestingly, S can be interpreted as a likelihood-ratio test. For that, we take the point of view of Bayesian model comparison or the minimum description length principle (MacKay, 2003). We can think of a compressor M0 as a universal model, adjusted for all possible inputs and general enough so that it is not biased towards a particular type of data semantics. Considering the probabilistic model associated with the size of the output produced by the lossless compressor (MacKay, 2003), we have

p(x|M0) = 2^(−L(x))

and, correspondingly,

ℓ_M0(x) = −L(x). (2)
In Bayesian model comparison, we are interested in comparing the posterior probabilities of different models in light of data D. In this setting, the trained generative model M is a ‘simpler’ version of the universal model M0, targeted to a specific semantics or data type. With it, one aims to approximate the marginal likelihood or model evidence for M, which integrates out all model parameters θ:

p(x|M) = ∫ p(x|θ, M) p(θ|M) dθ.

This integral is intractable, but current generative models can approximate p(x|M) with arbitrary accuracy (Kingma & Dhariwal, 2018). Choosing between one model and the other is then reduced to a simple likelihood-ratio test:

S(x) = log p(M0|x) − log p(M|x) = log p(x|M0)p(M0) − log p(x|M)p(M).

For uniform priors p(M0) = p(M), this test reduces to

S(x) = log p(x|M0) − log p(x|M) = −L(x) − ℓ_M(x),

which recovers Eq. 1.
This test embodies the Occam’s razor principle. Consider simple inputs that can be easily compressed using a few bits, and that are not present in the training of M. These have a high probability under M0, effectively correcting the abnormally high likelihood given by the learned model M. The same effect occurs with complex inputs that are not present in the training data. In these cases, both likelihoods will be low, but the universal lossless compressor M0 will predict those better than the learned model M. Both situations lead to large values of the test statistic S. In contrast, inputs that belong to the distribution used to train the generative model will always be better predicted by M than by M0, resulting in lower values of S.
4 Related Works
Ren et al. (2019) have recently proposed the use of likelihood-ratio tests for OOD detection. They posit that “background statistics” (for instance, the number of zeros in the background of MNIST-like images) are the source of abnormal likelihoods, and propose to exploit them by learning a background model which is trained on random surrogates of input data. Such surrogates are generated according to a Bernoulli distribution, and an L2 regularization term is added to the background model, which implies that the approach has two hyperparameters. Moreover, both the background model and the model trained on in-distribution data need to capture the background information equally well. In contrast to their method, our test does not require additional training nor extra conditions on a specific background model for every type of training data.
Choi et al. (2018) and Nalisnick et al. (2019b) suggest that typicality is the culprit for likelihood-based generative models not being able to detect OOD inputs. While Choi et al. (2018) do not explicitly address the problem of typicality, their estimate of the Watanabe-Akaike information criterion using ensembles of generative models performs well in practice. Nalisnick et al. (2019b) propose an explicit test for typicality employing a Monte Carlo estimate of the empirical entropy, which limits their approach to batches of inputs of the same type.
The works of Høst-Madsen et al. (2019) and Sabeti & Høst-Madsen (2019) combine the concepts of typicality and minimum description length to perform novelty detection. Although the concepts are similar to the ones employed here, their focus is mainly on bit sequences. They consider atypical those sequences that can be described (coded) with fewer bits by themselves rather than using the (optimal) code for typical sequences. We find their implementation to rely on strong parametric assumptions, which makes it difficult to generalize to generative or other machine learning models.
A number of methods have been proposed to perform OOD detection under a classification-based framework (Hendrycks & Gimpel, 2017; Lakshminarayanan et al., 2017; Liang et al., 2018; Alemi et al., 2018; Lee et al., 2018; Hendrycks et al., 2019). Although achieving promising results, these methods do not generally apply to the more general case of unlabeled or self-supervised data. The method of Hendrycks et al. (2019) extends to such cases by leveraging generative models, but nonetheless makes use of auxiliary outlier data to learn to distinguish OOD inputs.
5 Results
We now study how S performs on the OOD detection task. For that, we train a generative model on the train partition of a given data set and compute scores for that partition and for the test partition of a different data set. With both sets of scores, we then calculate the AUROC, a common evaluation measure for the OOD detection task (Hendrycks et al., 2019).
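The AUROC can be computed directly from the two sets of scores via the rank-based Mann-Whitney formulation. A self-contained sketch avoiding external dependencies; the function name is ours:

```python
import numpy as np

def auroc(scores_in: np.ndarray, scores_out: np.ndarray) -> float:
    """AUROC for OOD detection: the probability that a randomly chosen
    OOD input receives a higher score than a randomly chosen
    in-distribution one, with ties counting 1/2. Equivalent to a
    normalized Mann-Whitney U statistic."""
    s_in = np.asarray(scores_in, dtype=float)
    s_out = np.asarray(scores_out, dtype=float)
    # compare every OOD score against every in-distribution score
    greater = (s_out[:, None] > s_in[None, :]).mean()
    ties = (s_out[:, None] == s_in[None, :]).mean()
    return greater + 0.5 * ties
```

A perfectly separable OOD set yields 1.0, indistinguishable scores yield 0.5, and systematically inverted scores (as with likelihoods on SVHN, below) fall toward 0.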
First of all, we want to assess the improvement of S over likelihoods alone (−ℓ_M). When considering likelihoods from generative models trained on CIFAR10, the problematic results reported by previous works become clearly apparent (Table 1). The counterintuitive higher likelihoods for SVHN observed in Sec. 1 now translate into a poor AUROC below 0.1. This happens not only for SVHN, but also for the Constant, Omniglot, MNIST, and FashionMNIST data sets, for which we observed consistently higher likelihoods than CIFAR10 in Sec. 2. Likelihoods for the other data sets yield AUROCs above the random baseline of 0.5, but none above 0.67. The only exception is the Noise data set, which is perfectly distinguishable from CIFAR10 using likelihood alone.
Data set  Glow −ℓM  Glow S  PixelCNN++ −ℓM  PixelCNN++ S 
Constant  0.024  1.000  0.006  1.000 
Omniglot  0.001  1.000  0.001  1.000 
MNIST  0.001  1.000  0.002  1.000 
FashionMNIST  0.010  1.000  0.013  1.000 
SVHN  0.083  0.950  0.083  0.929 
CIFAR100  0.582  0.736  0.526  0.535 
CelebA  0.621  0.863  0.624  0.776 
FaceScrub  0.646  0.859  0.643  0.760 
TinyImageNet  0.663  0.716  0.642  0.589 
TrafficSign  0.609  0.931  0.599  0.870 
Noise  1.000  1.000  1.000  1.000 
If we now look at the AUROCs obtained with S, we see not only that the results are reversed for less complex data sets like MNIST or SVHN, but also that the AUROCs for all the other data sets improve as well (Table 1). The only exception to the latter assertion among all studied combinations is TinyImageNet for PixelCNN++ with FLIF (see also Appendix D). In general, we obtain AUROCs above 0.7, with many of them approaching 0.9 or 1. Thus, we can conclude that S clearly improves over likelihoods alone in the OOD detection task, and that it is able to revert the situation for intuitively less complex data sets that previously yielded a low AUROC.
We also study how the training set and the choice of generative model or compressor affect the performance of S (Appendix D). Overall, we do not observe a large difference between the considered models and compressors, except for a few isolated cases whose investigation we defer to future work. In terms of data sets, we find the OOD detection task to be easier with FashionMNIST than with CIFAR10. We attribute this to the ease with which the generative model learns and approximates the density conveyed by the data. A similar but less marked trend is also observed for compressors, with better compressors yielding slightly higher AUROCs than, in principle, less powerful ones. A takeaway from these observations is that using better generative models and compressors should yield a more reliable score and a better AUROC. The conducted experiments seem to support this, but a more in-depth analysis should be carried out to further confirm the hypothesis.
Finally, we want to assess how S compares to previous approaches in the literature. For that, we compile a number of reported AUROCs for both classifier-based and generative-based approaches and compare them with S. Note that classifier-based approaches, as mentioned in Sec. 1, are less applicable than generative-based ones. In addition, as they exploit label information, they might have an advantage over generative-based approaches in terms of performance (some also exploit external or outlier data; Sec. 4).
Trained on:  FashionMNIST  CIFAR10  

OOD data:  MNIST  Omniglot  SVHN  CelebA  CIFAR100 
Classifier-based approaches  
ODIN (Liang et al., 2018)  0.697    0.966     
VIB (Alemi et al., 2018)  0.941  0.943  0.528  0.735   
Mahalanobis (Lee et al., 2018)  0.986    0.991     
Outlier exposure (Hendrycks et al., 2019)      0.984    0.933 
Generative-based approaches  
WAIC (Choi et al., 2018)  0.766  0.796  1.000  0.997   
Outlier exposure (Hendrycks et al., 2019)      0.758    0.685 
Typicality test (Nalisnick et al., 2019b)  0.140    0.420     
Likelihood-ratio (Ren et al., 2019)  0.997    0.912     
S using Glow and FLIF (ours)  0.998  1.000  0.950  0.863  0.736 
S using PixelCNN++ and FLIF (ours)  0.967  1.000  0.929  0.776  0.535 
We observe that S is competitive with both classifier-based and existing generative-based approaches (Table 2). When training on FashionMNIST, S achieves the best scores among all considered approaches. The results with further test sets are also encouraging, with almost all AUROCs approaching 1 (Appendix D). When training on CIFAR10, S achieves similar or better performance than existing approaches. Notably, among generative-based approaches, S is only outperformed, on two occasions, by the same approach, WAIC, which uses ensembles of generative models (Sec. 4).
On the one hand, it would be interesting to see how S could perform when using ensembles of models and compressors to produce better estimates of ℓ_M(x) and L(x), respectively. On the other hand, however, the use of a single generative model together with a single fast compression library makes S an efficient alternative compared to WAIC and some other existing approaches. It is also worth noting that a number of classifier-based and generative-based approaches have hyperparameters that need to be tuned, sometimes with the help of outlier or additional data. In contrast, S is a parameter-free measure, which makes it easy to use and deploy.
6 Conclusion
We illustrate a fundamental insight with regard to the use of generative models’ likelihoods for the task of detecting OOD data. We show that input complexity has a strong effect on those likelihoods, and posit that it is the main culprit for the puzzling results of using generative models’ likelihoods for OOD detection. In addition, we show that an estimate of input complexity can be used to compensate standard negative log-likelihoods in order to produce an efficient and reliable OOD score. We also offer an interpretation of our score as a likelihood-ratio test using Bayesian model comparison. Such a score performs comparably to, or even better than, several state-of-the-art approaches, with results that are consistent across a range of data sets, models, and compression algorithms. The proposed score has no hyperparameters besides the definition of a generative model and a compression algorithm, which makes it easy to employ in a variety of problems and situations.
References

Alemi et al. (2018) A. A. Alemi, I. Fischer, and J. V. Dillon. Uncertainty in the variational information bottleneck. In Uncertainty in Deep Learning Workshop, UAI, 2018.
 Bishop (1994) C. M. Bishop. Novelty detection and neural network validation. IEEE Proceedings – Vision, Image and Signal Processing, 141(4):217–222, 1994.
 Choi et al. (2018) H. Choi, E. Jang, and A. A. Alemi. WAIC, but why? Generative ensembles for robust anomaly detection. ArXiv, 1810.01392, 2018.
 Cover & Thomas (2006) T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, Hoboken, USA, 2nd edition, 2006.

Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, 2009.
 Hendrycks & Gimpel (2017) D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.

Hendrycks et al. (2019) D. Hendrycks, M. Mazeika, and T. G. Dietterich. Deep anomaly detection with outlier exposure. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2019.
 Høst-Madsen et al. (2019) A. Høst-Madsen, E. Sabeti, and C. Walton. Data discovery and anomaly detection using atypicality: theory. IEEE Trans. on Information Theory, in press, 2019.
 Kingma & Dhariwal (2018) D. P. Kingma and P. Dhariwal. Glow: generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems (NeurIPS), volume 31, pp. 10215–10224. Curran Associates, Inc., 2018.
 Kolmogorov (1963) A. N. Kolmogorov. On tables of random numbers. Sankhya Ser. A, 25:369–375, 1963.
 Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, University of Toronto, Toronto, Canada, 2009.
 Lake et al. (2015) B. Lake, R. Salakhutdinov, and J. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350:1332–1338, 2015.
 Lakshminarayanan et al. (2017) B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems (NeurIPS), pp. 6402–6413. Curran Associates, Inc., 2017.

LeCun et al. (2010) Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lee et al. (2018) K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems (NeurIPS), pp. 7167–7177. Curran Associates, Inc., 2018.
 Liang et al. (2018) S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2018.
 Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. of Int. Conf. on Computer Vision (ICCV), pp. 3730–3738, 2015.
 Lloyd (2001) S. Lloyd. Measures of complexity: a nonexhaustive list. IEEE Control Systems Magazine, 21(4):7–8, 2001.
 MacKay (2003) D. J. C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, Cambridge, UK, 2003.
 Nalisnick et al. (2019a) E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? In Proc. of the Int. Conf. on Learning Representations (ICLR), 2019a.
 Nalisnick et al. (2019b) E. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan. Detecting outofdistribution inputs to deep generative models using a test for typicality. ArXiv, 1906.02994, 2019b.
 Netzer et al. (2011) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Ng & Winkler (2014) H. W. Ng and S. Winkler. A datadriven approach to cleaning large face dataset. In Proc. of the IEEE Int. Conf. on Image Processing (ICIP), pp. 343–347, 2014.
 Nguyen et al. (2015) A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 427–436, 2015.

Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Workshop on The Future of Gradient-based Machine Learning Software & Techniques (NIPS-Autodiff), 2017.
 Ren et al. (2019) J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems (NeurIPS), in press. Curran Associates, Inc., 2019.
 Sabeti & Høst-Madsen (2019) E. Sabeti and A. Høst-Madsen. Data discovery and anomaly detection using atypicality for real-valued data. Entropy, 21(3):219, 2019.
 Salimans et al. (2017) T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.
 Stallkamp et al. (2011) J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pp. 1453–1460, 2011.
 Theis et al. (2016) L. Theis, A. Van den Oord, and M. Bethge. A note on the evaluation of generative models. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2016.
 Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. FashionMNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv, 1708.07747, 2017.
Appendix
Appendix A Data Sets
In our experiments, we employ well-known, publicly-available data sets. In addition to those, and to facilitate a better understanding of the problem, we create two synthetic sets of images: Noise and Constant. The Noise data set is created by uniformly randomly sampling a tensor of size 3×32×32 and quantizing the result to 8 bits. The Constant data set is created similarly, but using a tensor of size 3×1×1 and repeating the values along the last two dimensions to obtain a size of 3×32×32. The complete list of data sets is available in Table 3. In the case of data sets with different variations, such as CelebA or FaceScrub, which have both plain and aligned versions of the faces, we select the aligned versions. Note that, for models trained on CIFAR10, there is an overlap of certain classes between that and other sets, namely TinyImageNet and CIFAR100 (they overlap, for instance, in classes of certain animals or vehicles).

Data set  Original size  Num. classes  Num. images 

Constant (Synthetic)  3×32×32  1  40,000 
Omniglot (Lake et al., 2015)  1×105×105  1,623  32,460 
MNIST (LeCun et al., 2010)  1×28×28  10  70,000 
FashionMNIST (Xiao et al., 2017)  1×28×28  10  70,000 
SVHN (Netzer et al., 2011)  3×32×32  10  99,289 
CIFAR10 (Krizhevsky, 2009)  3×32×32  10  60,000 
CIFAR100 (Krizhevsky, 2009)  3×32×32  100  60,000 
CelebA (Liu et al., 2015)  3×178×218  10,177  182,732 
FaceScrub (Ng & Winkler, 2014)  3×300×300  530  91,712 
TinyImageNet (Deng et al., 2009)  3×64×64  200  100,000 
TrafficSign (Stallkamp et al., 2011)  3×32×32  43  51,839 
Noise (Synthetic)  3×32×32  1  40,000 
In order to split the data between train, validation, and test, we follow two simple rules: (1) if the data set contains predefined train and test splits, we respect them and create a validation split using a random 10% of the training data; (2) if no predefined splits are available, we create them by randomly assigning 80% of the data to the train split and 10% to each of the validation and test splits. In order to create consistent input sizes for the generative models, we work with 3-channel images of size 32×32. For those data sets which do not match this configuration, we follow a classic bilinear resizing strategy and, to simulate the three color components from a grayscale image, we triplicate the channel dimension.
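The synthetic sets and the grayscale-to-color step described above can be sketched as follows. This is a hypothetical NumPy reconstruction (function names and the RNG are ours; bilinear resizing is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise(n: int) -> np.ndarray:
    """Noise set: uniform random 3x32x32 tensors quantized to 8 bits."""
    return rng.integers(0, 256, size=(n, 3, 32, 32), dtype=np.uint8)

def make_constant(n: int) -> np.ndarray:
    """Constant set: one random 3x1x1 color per image, repeated over 32x32."""
    colors = rng.integers(0, 256, size=(n, 3, 1, 1), dtype=np.uint8)
    return np.broadcast_to(colors, (n, 3, 32, 32)).copy()

def to_three_channels(gray: np.ndarray) -> np.ndarray:
    """Simulate color from a 1-channel image by triplicating the channel."""
    return np.repeat(gray, 3, axis=0)
```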
Appendix B Models and Training
The results of this paper are obtained using two generative models of a different nature: one autoregressive and one invertible. As autoregressive model we choose PixelCNN++ (Salimans et al., 2017), which has been shown to obtain very good results in terms of likelihood for image data. As invertible model we choose Glow (Kingma & Dhariwal, 2018), which performs exact log-likelihood computation using large stacks of bijective transformations. We implement the Glow model using the default configuration of the original implementation (https://github.com/openai/glow), except that we zero-pad and do not use ActNorm inside the coupling network. The model has 3 blocks of 32 flows, using an affine coupling with a squeezing factor of 2. As for PixelCNN++, we set 5 residual blocks per stage, with 80 filters and 10 logistic components in the mixture. The non-linearity of the residual layers corresponds to an exponential linear unit (https://github.com/pclucas14/pixelcnnpp).

We train both Glow and PixelCNN++ using the Adam optimizer. We reduce the initial learning rate by a constant factor every time the validation loss does not decrease during 5 consecutive epochs, and training finishes once the learning rate has been reduced a fixed number of times. The batch size of both models is set to 50. The final model weights are the ones yielding the best validation loss. We use PyTorch version 1.2.0 (Paszke et al., 2017). All models have been trained with a single NVIDIA GeForce GTX 1080Ti GPU; training takes some hours under that setting.

Appendix C Compressors and Complexity Estimate
We explore three different options to compress input images. As a mandatory condition, they need to provide lossless compression. The first format we consider is PNG, an old, classic format which is globally used and well-known. We use OpenCV (https://opencv.org) to compress raw NumPy matrices, with compression set to the maximum possible level. The second format we consider is JPEG2000. Although not as widely known as the previous one, it is a more modern format with several new-generation features such as progressive decoding. Again, we use the default OpenCV implementation to obtain the size of an image under this compression algorithm. The third format we consider is FLIF, the most modern algorithm of the list. According to its website (https://flif.info), it promises to generate up to 53% smaller files than JPEG2000. We use the publicly-available compressor implementation from their website. Notice that we do not include the header size in the measurement of the resulting bits per dimension.
To compute our complexity estimate L(x), we compress the input x with one of the compressors above, obtaining a string of bits c(x). Its length, |c(x)|, is normalized by the size or dimensionality of x, which we denote by d, to obtain the complexity estimate:

L(x) = |c(x)| / d.
We also experimented with an improved version of L(x),

L̂(x) = min_i L_i(x),

where L_i corresponds to the estimate obtained with the i-th compression scheme. This forces L̂ to always work with the best compressor for every x. In our case, as FLIF was almost always the best compressor, we did not observe a clear difference between using L or L̂. However, in cases where it is not clear which compressor to use, or where there is no clear winner, L̂ could be of use.
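The min-over-compressors variant can be sketched with standard-library codecs standing in for the PNG/JPEG2000/FLIF compressors of the paper (the function name is ours):

```python
import bz2
import lzma
import zlib

import numpy as np

def complexity_min_bpd(x: np.ndarray) -> float:
    """L-hat(x) = min over compressors of L_i(x): for each input, take
    the smallest normalized compressed size among several lossless
    codecs. zlib/bz2/lzma stand in for PNG/JPEG2000/FLIF."""
    raw = np.ascontiguousarray(x, dtype=np.uint8).tobytes()
    sizes = [len(zlib.compress(raw, 9)),
             len(bz2.compress(raw, 9)),
             len(lzma.compress(raw))]
    return 8.0 * min(sizes) / x.size
```

By construction, the estimate is never worse than any single codec alone, which is exactly why it is useful when no single compressor dominates.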
Appendix D Additional Results
The additional results mentioned in the main paper are the following:
- In Table 4, we report the average negative log-likelihood (in bits/dim) for every data set, sorted from highest to lowest log-likelihood.
- In Table 5, we report the global Pearson’s correlation coefficient for different models, train sets, and compressors. Due to the large sample size, Scipy version 1.2.1 reports a p-value of 0 in all cases.
- In Table 6, we report the AUROC values obtained from S on the OOD detection task across different data sets, models, and compressors.
Data set  −ℓM (bits/dim) 

Constant (Test)  0.25 
Omniglot (Test)  0.43 
MNIST (Test)  0.55 
FashionMNIST (Test)  0.83 
SVHN (Test)  1.19 
CIFAR10 (Train)  2.20 
CIFAR10 (Test)  2.21 
CIFAR100 (Test)  2.27 
CelebA (Test)  2.42 
FaceScrub (Test)  2.43 
TinyImageNet (Test)  2.51 
TrafficSign (Test)  2.51 
Noise (Test)  8.22 
Model  Trained with  Compressor  

PNG  JPEG2000  FLIF  
Glow  FashionMNIST  0.77  0.75  0.77 
PixelCNN++  FashionMNIST  0.77  0.77  0.78 
Glow  CIFAR10  0.94  0.90  0.90 
PixelCNN++  CIFAR10  0.96  0.94  0.94 
Data set  Glow  PixelCNN++  

PNG  JPEG2000  FLIF  PNG  JPEG2000  FLIF  
Trained on FashionMNIST: 
Constant  1.000  1.000  1.000  1.000  1.000  1.000 
Omniglot  1.000  1.000  1.000  1.000  1.000  1.000 
MNIST  0.841  0.493  0.997  0.821  0.687  0.967 
SVHN  1.000  1.000  1.000  1.000  1.000  1.000 
CIFAR10  1.000  1.000  1.000  0.998  1.000  1.000 
CIFAR100  1.000  1.000  1.000  0.997  1.000  1.000 
CelebA  1.000  1.000  1.000  1.000  1.000  1.000 
FaceScrub  1.000  1.000  1.000  1.000  1.000  1.000 
TinyImageNet  1.000  1.000  1.000  1.000  1.000  1.000 
TrafficSign  1.000  1.000  1.000  1.000  1.000  1.000 
Noise  1.000  1.000  1.000  1.000  1.000  1.000 
Trained on CIFAR10: 
Constant  1.000  1.000  1.000  1.000  1.000  1.000 
Omniglot  1.000  0.994  1.000  1.000  0.997  1.000 
MNIST  1.000  0.996  1.000  1.000  0.995  1.000 
FashionMNIST  0.998  0.998  1.000  0.998  0.995  1.000 
SVHN  0.787  0.974  0.950  0.787  0.965  0.929 
CIFAR100  0.683  0.757  0.736  0.583  0.514  0.535 
CelebA  0.794  0.701  0.863  0.756  0.640  0.776 
FaceScrub  0.750  0.797  0.859  0.710  0.704  0.760 
TinyImageNet  0.710  0.875  0.716  0.657  0.735  0.589 
TrafficSign  0.953  0.955  0.931  0.916  0.840  0.870 
Noise  1.000  1.000  1.000  1.000  1.000  1.000 