InFlow: Robust outlier detection utilizing Normalizing Flows

by   Nishant Kumar, et al.
TU Dresden

Normalizing flows are prominent deep generative models that provide tractable probability distributions and efficient density estimation. However, they are well known to fail while detecting Out-of-Distribution (OOD) inputs as they directly encode the local features of the input representations in their latent space. In this paper, we solve this overconfidence issue of normalizing flows by demonstrating that flows, if extended by an attention mechanism, can reliably detect outliers including adversarial attacks. Our approach does not require outlier data for training and we showcase the efficiency of our method for OOD detection by reporting state-of-the-art performance in diverse experimental settings. Code available at .



1 Introduction

Rapid advancement in imaging sensor technology and machine learning (ML) techniques has led to notable breakthroughs in several real-world applications. ML models typically perform well when the training and testing data are sampled from the same distribution. However, when applied to inputs that lie far away from the training data distribution, i.e. out-of-distribution (OOD) inputs, these models can fail and their predictions are no longer reliable. This limitation prevents the safe deployment of ML models in life-sensitive real-world setups such as autonomous driving and medical diagnosis. In these setups, plenty of OOD data occurs naturally due to factors such as different image acquisition settings, noise in image scenes, and varied camera parameters. Therefore, reliable deployment of an ML model requires that the model can detect such anomalies so that it does not assign high-confidence predictions to them.

Deep generative models are commonly used for OOD detection in an unsupervised setting because of their ability to approximate the density of in-distribution samples as a probability distribution. This allows these models to assign lower likelihood to OOD inputs, marking such inputs as less likely to have been sampled from the in-distribution training set. Generative models such as normalizing flows (Dinh et al., 2015); (Dinh et al., 2017); (Sorrenson et al., 2020); (Kingma et al., 2018); (Grathwohl et al., 2019); (Durkan et al., 2019) are especially suitable candidates for OOD detection as they provide tractable likelihoods. Let us define $X$ as a random variable with input observations $x$ and probability distribution $p_X(x)$, and $Z$ as the random variable with latent observations $z$ and probability distribution $p_Z(z)$. Now, according to the change of variables formula, we can define a series of invertible bijective mappings $f_\theta = f_k \circ \dots \circ f_1$ with parameters $\theta$ and $k$ being the number of coupling blocks to get $z = f_\theta(x)$. Therefore, the log-likelihood of the posterior distribution $p_\theta(x)$ is given as,

$$\log p_\theta(x) = \log p_Z\big(f_\theta(x)\big) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| \quad (1)$$

Figure 1: Histograms of log-likelihoods for (a) an untrained RealNVP, (b) a trained RealNVP and (c) our InFlow model. (a): A histogram of log-likelihoods of in-distribution CelebA and other OOD datasets for a RealNVP model (Dinh et al., 2017) initialised with zeros. (b): A histogram of log-likelihoods for the RealNVP model after training. Both (a) and (b) show that RealNVP assigns higher likelihood to OOD inputs. (c): A histogram of log-likelihoods for our InFlow model at a fixed p-value, assigning a much higher log-likelihood to in-distribution CelebA samples than to all other OOD datasets.

Assuming the prior distribution of the latent space is a multivariate Gaussian, the series of invertible bijective transformations with parameters $\theta$ can transform the Gaussian prior into a significantly more complex posterior distribution. Hence, we can maximize the log-likelihood of the in-distribution samples with respect to the parameters of the invertible transformation and use a likelihood-based threshold to decide whether the log-likelihood of a test sample $x$ is below the threshold (classify $x$ as OOD) or above the threshold (classify $x$ as in-distribution). Additionally, AUCROC (Area Under the Curve Receiver Operating Characteristic) can be calculated to determine the performance of the flow model in terms of OOD detection. However, works such as (Nalisnick et al., 2019); (Kirichenko et al., 2020) showed that generative models such as normalizing flows assign higher likelihoods to OOD samples than to in-distribution samples, resulting in overconfident predictions on these OOD inputs as shown in Figure 1 (b). To interpret this behavior, (Kirichenko et al., 2020) argued that these models only capture low-level statistics such as local pixel correlations rather than high-level semantics, due to which these models are ineffective at separating in-distribution data from OOD samples.
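The likelihood-threshold decision described above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it uses a one-dimensional affine flow $z = (x - \text{shift}) / \text{scale}$ with a standard Gaussian prior instead of a RealNVP, and the function names and the 5% quantile threshold are hypothetical choices, not part of the original method.

```python
import numpy as np

def affine_flow_log_likelihood(x, shift, scale):
    """Exact log-density of x under z = (x - shift) / scale with z ~ N(0, 1)."""
    z = (x - shift) / scale
    log_prior = -0.5 * (z ** 2 + np.log(2.0 * np.pi))  # log N(z; 0, 1)
    log_det = -np.log(scale)                            # log |dz/dx|
    return log_prior + log_det

def classify_by_threshold(log_likelihoods, threshold):
    """True means in-distribution (score >= threshold), False means OOD."""
    return np.asarray(log_likelihoods) >= threshold

# Toy in-distribution data; mean and std are the maximum likelihood fit
# of this (assumed) affine flow.
rng = np.random.default_rng(0)
train = rng.normal(loc=2.0, scale=0.5, size=1000)
shift, scale = train.mean(), train.std()

# Hypothetical threshold: the 5% quantile of the training log-likelihoods.
threshold = np.quantile(affine_flow_log_likelihood(train, shift, scale), 0.05)

# One sample near the training data and one far away from it.
scores = affine_flow_log_likelihood(np.array([2.0, 10.0]), shift, scale)
decisions = classify_by_threshold(scores, threshold)
print(decisions)
```

In this toy setting the likelihood behaves as intended; the failure mode discussed in the text arises for high-dimensional image data, where flows can assign higher likelihoods to OOD inputs.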

In this paper, we show that this issue can be solved by extending the normalizing flow design with an attention mechanism and validate that the attention mechanism ensures a higher log-likelihood score for in-distribution samples than the log-likelihood scores of OOD samples. We suggest that there are few benefits of constructing new designs of normalizing flow models for the OOD detection task and the focus should be directed towards extending the existing flow models with robust attention mechanisms in order to develop a reliable OOD detector. In Section 2, we present the current state of research in the field of OOD detection with the main focus on deep generative models. In Section 3, we develop the representation of our model and provide theoretical evidence for the robustness of our approach along with any underlying assumptions. In Section 4, we conduct several empirical evaluations of our approach in a variety of settings and discuss its effectiveness for OOD detection along with relevant limitations.

2 Related work

(Nguyen et al., 2012) provided initial evidence that ML models have high confidence for OOD inputs. To overcome this issue, (Ren et al., 2019) presented a likelihood ratio approach for OOD detection using auto-regressive generative models and experimented with a genomics dataset. (Liang et al., 2018) employed input data perturbations to obtain a softmax score from a pre-trained model and used a threshold to determine whether the input data is in-distribution or OOD. (DeVries et al., 2018) modified a pre-existing network architecture and added a confidence estimate branch at the penultimate layer to enhance the OOD detection accuracy. (Hendrycks et al., 2018) applied a technique called outlier exposure that teaches a pre-trained model to detect unseen OOD examples. (Hendrycks et al., 2018); (Lakshminarayanan et al., 2017); (DeVries et al., 2018) proposed classification models to detect OOD inputs whereas (Rabanser et al., 2019) utilized a combination of dimensionality reduction techniques and robust test statistics like Maximum Mean Discrepancy (MMD) to develop a dataset drift detection approach.

(Lee et al., 2018) proposed a confidence estimate based on Mahalanobis distances. (Chen et al., 2020) showed that many existing OOD detection approaches such as (Liang et al., 2018); (Hendrycks et al., 2018); (Lee et al., 2018) do not work efficiently when small perturbations are added to the in-distribution samples. Hence, they trained their model on adversarial examples of in-distribution data along with the distribution from the outlier exposure developed by (Hendrycks et al., 2018). (Akcay et al., 2018) defined their OOD detection strategy based on the idea that Generative Adversarial Networks (GANs) will not reconstruct OOD samples well.

(Lee et al., 2018) developed a training mechanism that minimizes the Kullback-Leibler (KL) divergence between the predictive distributions of the OOD samples and the uniform distribution, providing a measure for confidence assessment. (Hendrycks et al., 2019) used a self-supervised learning approach that is robust in detecting adversarial attacks while (Serrà et al., 2019) showed that the likelihood scores from generative models have a bias towards the complexity of the input data, where non-smooth images tend to produce low likelihood scores while smoother samples produce higher likelihood scores. (Xiao et al., 2020) studied OOD detection for Variational Auto-Encoders (VAEs) and proposed a likelihood regret score that computes the log-likelihood improvement of the VAE configuration that maximizes the likelihood of an individual sample. (Morningstar et al., 2021) did not use likelihood-based OOD detection but utilized kernel density estimators and one-class Support Vector Machines (SVM) to differentiate between in-distribution and anomalous inputs.

(Chen et al., 2021) mined informative OOD data to improve the OOD detection performance, and subsequently generalized to unseen adversarial attacks. (Nalisnick et al., 2020) showed that the high likelihood behavior of generative models for OOD samples is due to a mismatch between the model’s typical set and its high probability density whereas (Choi et al., 2018) introduced a Watanabe–Akaike information criterion (WAIC) based score to differentiate OOD samples from in-distribution samples. (Kobyzev et al., 2020) gave an outline of several normalizing flow-based methods and discussed their suitability for different real-world applications. (Nalisnick et al., 2019) showed that Invertible Neural Networks (INNs) are especially attractive for OOD detection compared to other generative models such as VAEs and GANs since they provide an exact computation of the marginal likelihoods, thereby requiring no approximate inference techniques. Inspired by the work of (Ardizzone et al., 2019), (Ardizzone et al., 2020) utilized the Information Bottleneck (IB) as a loss function for INNs with the RealNVP (Dinh et al., 2017) architecture to provide high-quality uncertainty estimation and OOD detection. (Zisselman et al., 2020) introduced a residual flow architecture for OOD detection that learns the residual distribution from a Gaussian prior.

3 InFlow for OOD detection

Given unlabeled in-distribution samples $x$, the task is to develop a robust normalizing flow model that maximizes the log-likelihood of in-distribution samples while assigning lower log-likelihoods to OOD test samples. To achieve this, we address the following questions: (i) how can the maximum likelihood-based objective of our attention-based normalizing flow assign a higher log-likelihood to the in-distribution data than to unseen OOD outliers? (see Section 3.1); (ii) how do we define the attention mechanism that makes the normalizing flow model robust? (see Section 3.2); (iii) how do we estimate an effective likelihood-based threshold for classifying the test samples as in-distribution or OOD? (see Section 3.3).

3.1 Model definition

(Dinh et al., 2017) presented a normalizing flow architecture that is based on a sequence of high-dimensional bijective functions stacked together as affine coupling blocks. Each affine coupling block contains two transformations, scaling $s$ and translation $t$. We extend this design by forwarding a function $c$ (see also Appendix A.2) to each of the coupling blocks as,

$$z = f_\theta\big(x, c(x)\big) \quad (2)$$

For simplicity, let us assume $c(x) = r$, then Eq. 2 can be represented as $z = f_\theta(x, r)$. Now, according to the chain rule, the derivative of $z$ w.r.t. $x$ is given as,

$$\frac{dz}{dx} = \frac{\partial f_\theta(x, r)}{\partial x} + \frac{\partial f_\theta(x, r)}{\partial r} \frac{\partial r}{\partial x} \quad (3)$$


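By defining the function $c$ as the attention mechanism that maps the input $x$ to one of the two integers $\{0, 1\}$, where $c(x) = 1$ if $x$ is in-distribution and $c(x) = 0$ otherwise, the derivative of $r = c(x)$ w.r.t. $x$ is 0 everywhere except at the decision boundary of $c$. Hence, Eq. 3 becomes,

$$\frac{dz}{dx} = \frac{\partial f_\theta(x, r)}{\partial x} \quad (4)$$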


It is observable that the derivative in Eq. 4 decomposes into the partial derivatives of the output of each single coupling block with respect to the input of the same coupling block. Hence, defining $u^j$ as the input and $v^j$ as the output of the $j$-th coupling block and extending Eq. 4 to $k$ coupling blocks leads to,

$$\frac{dz}{dx} = \prod_{j=1}^{k} \frac{\partial v^j}{\partial u^j} \quad (5)$$


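At every affine coupling block, the input $u^j$ is channel-wise divided into two halves $u_1^j$ and $u_2^j$ and is transformed by the affine functions $s$ and $t$. We now multiply $r$ with the output of the transformations $s$ and $t$ in each of the coupling blocks. Therefore, the $j$-th coupling block of our model is denoted as:

$$v_1^j = u_1^j, \qquad v_2^j = u_2^j \odot \exp\big(r\, s(u_1^j)\big) + r\, t(u_1^j) \quad (6)$$

where $v_1^j$ is one part of the output, which is replicated from the input $u_1^j$, and $v_2^j$ is the other part, which is the result of applying the affine transformations $s$ and $t$ to $u_1^j$ and $u_2^j$. Therefore, the Jacobian matrix at the $j$-th coupling block is given as:

$$\frac{\partial v^j}{\partial u^j} = \begin{bmatrix} I & 0 \\ \dfrac{\partial v_2^j}{\partial u_1^j} & \mathrm{diag}\big(\exp(r\, s(u_1^j))\big) \end{bmatrix} \quad (7)$$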


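As there is no connection between $u_2^j$ and $v_1^j$ while $v_1^j$ is equal to $u_1^j$, the Jacobian matrix in Eq. 7 is triangular, which means its determinant is just the product of its main diagonal elements. The main diagonal elements of the Jacobian matrices at each coupling block are multiplied to obtain the determinant of our end-to-end InFlow model as:

$$\det \frac{dz}{dx} = \prod_{j=1}^{k} \prod_{i} \exp\big(r\, s(u_1^j)_i\big) \quad (8)$$

Since the attention mechanism $r$ is a common factor, applying the logarithm to Eq. 8 gives:

$$\log \left| \det \frac{dz}{dx} \right| = r \sum_{j=1}^{k} \sum_{i} s(u_1^j)_i \quad (9)$$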


It is to be noted that the output is still invertible and the mappings $s$ and $t$ can be arbitrary non-invertible functions such as deep neural networks. Hence, the parameters $\theta$ of the model can be optimized by minimizing the negative log-likelihood of the posterior, which is equivalent to maximizing the evidence lower bound. Therefore, the maximum likelihood objective of the in-distribution samples can be achieved using Adam optimization with gradients of the form,



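Considering input samples $x$ and the attention-based normalizing flow model $f_\theta$ that satisfies $c(x) = 0$, the model returns the prior distribution $p_Z(z)$ of the latent observation for the posterior distribution $p_\theta(x)$.


Theoretically, we have to prove that $p_\theta(x) = p_Z(z)$ for all $x$ that satisfy $c(x) = 0$. Using the change of variables formula, the forward direction of an invertible normalizing flow is,

$$p_\theta(x) = p_Z\big(f_\theta(x)\big) \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| \quad (11)$$

Now, using $c(x) = 0$, i.e. $r = 0$, in Eq. 6 will yield $v_1^j = u_1^j$ and $v_2^j = u_2^j$. This conveys that the output of the coupling block is equal to the input, since $r\, s(u_1^j) = 0$ and $r\, t(u_1^j) = 0$. Therefore, by substituting the output of each coupling block with its input, we get $z = x$. Additionally, for the reverse transformation, the change of variables formula gives,

$$p_Z(z) = p_\theta\big(f_\theta^{-1}(z)\big) \left| \det \frac{\partial f_\theta^{-1}(z)}{\partial z} \right| \quad (12)$$

Using the result that $z = x$ for $c(x) = 0$ in Eq. 12, the Jacobian is the identity and we obtain:

$$p_Z(z) = p_\theta(x) \quad (13)$$

Therefore, Eq. 13 shows that the proposition holds. With the condition $c(x) = 0$ satisfied, the posterior log-likelihood of the OOD samples $x_{ood}$ is given as:

$$\log p_\theta(x_{ood}) = \log p_Z(x_{ood}) \quad (14)$$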


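Furthermore, for the in-distribution samples $x_{in}$ that satisfy $c(x_{in}) = 1$, putting Eq. 9 in Eq. 1 results in the posterior log-likelihood of the in-distribution samples as,

$$\log p_\theta(x_{in}) = \log p_Z\big(f_\theta(x_{in})\big) + \sum_{j=1}^{k} \sum_{i} s(u_1^j)_i \quad (15)$$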


Under the assumption that the maximum likelihood objective as shown in Eq. 10 asymptotically converged, we argue that the empirical upper bound of is equal to or larger than the maximum likelihood estimate (MLE) of , where MLE of is attained when (see Eq. 20) and is not transformed by the maximum likelihood training of our InFlow model. Additionally, in our implementation, , since the sub-networks $s$ and $t$ are realized by a succession of several simple convolutional layers with ReLU activations (see Table 4 in Appendix A.2). Considering these postulations, it is noticeable from Figure 1 (c) that the log-likelihood of in-distribution samples is significantly higher than the log-likelihood of OOD samples, leading to a robust disentanglement of the posterior log-likelihoods of in-distribution samples from those of the OOD samples.
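The behavior of the attention-gated coupling block described in this section can be illustrated with a small NumPy sketch. It is not the authors' implementation: the symbols `r` (attention output, 1 for in-distribution and 0 for OOD), `s_fn` and `t_fn` (stand-ins for the scaling and translation sub-networks, here single random linear layers with ReLU) and all function names are assumptions made for illustration.

```python
import numpy as np

def gated_coupling_forward(u, r, s_fn, t_fn):
    """Affine coupling block whose scale and translation are multiplied by r."""
    u1, u2 = np.split(u, 2, axis=-1)       # split features into two halves
    s = r * s_fn(u1)                       # gated log-scale
    t = r * t_fn(u1)                       # gated translation
    v2 = u2 * np.exp(s) + t                # affine transform of the second half
    log_det = np.sum(s, axis=-1)           # log|det J| of a triangular Jacobian
    return np.concatenate([u1, v2], axis=-1), log_det

def gated_coupling_inverse(v, r, s_fn, t_fn):
    """Inverse of the block; s_fn and t_fn themselves need not be invertible."""
    v1, v2 = np.split(v, 2, axis=-1)
    s = r * s_fn(v1)
    t = r * t_fn(v1)
    u2 = (v2 - t) * np.exp(-s)
    return np.concatenate([v1, u2], axis=-1)

# Hypothetical sub-networks: one linear layer followed by ReLU.
rng = np.random.default_rng(0)
W_s = rng.normal(size=(2, 2))
W_t = rng.normal(size=(2, 2))
s_fn = lambda h: np.maximum(h @ W_s, 0.0)
t_fn = lambda h: np.maximum(h @ W_t, 0.0)

u = rng.normal(size=(3, 4))                # batch of 3 samples, 4 features
v_in, logdet_in = gated_coupling_forward(u, 1.0, s_fn, t_fn)    # r = 1
v_out, logdet_out = gated_coupling_forward(u, 0.0, s_fn, t_fn)  # r = 0

print(np.allclose(v_out, u), logdet_out)   # identity map, zero log-determinant
print(np.allclose(gated_coupling_inverse(v_in, 1.0, s_fn, t_fn), u))
```

With `r = 0` the block reduces to the identity, so the latent observation equals the input and the log-determinant vanishes; with `r = 1` it is an ordinary affine coupling block whose log-determinant is the sum of the scale outputs.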

3.2 The attention mechanism

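We utilize Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as our attention mechanism since it is an efficient metric for performing kernel two-sample tests. Assume we have two distributions $P$ and $Q$ over the sets $\mathcal{X}$ and $\mathcal{Y}$ respectively, and let $k$ be the kernel of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with feature map $\varphi$ that maps $\mathcal{X} \to \mathcal{H}$. Let $X$ be the input random variable with in-distribution observations $x$ where $x \sim P$, and $Y$ be another random variable with unknown observations $y$ where $y \sim Q$. Then the (squared) MMD between the two distributions $P$ and $Q$ is given by,

$$\mathrm{MMD}^2(P, Q) = \left\| \mathbb{E}_{x \sim P}[\varphi(x)] - \mathbb{E}_{y \sim Q}[\varphi(y)] \right\|_{\mathcal{H}}^2 \quad (16)$$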


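However, calculating the MMD has quadratic time complexity. Therefore, given a subset of in-distribution observations $x'$ where $x' \subset x$, we use an encoder function $g$ that maps the high-dimensional input spaces $\mathcal{X}$ and $\mathcal{Y}$ into lower-dimensional spaces with the new observations $\hat{x} = g(x')$ and $\hat{y} = g(y)$. The details related to the encoder architecture and the hyperparameters can be found in Appendix A.7. Now, given the kernel $k$, $M$ encoded in-distribution observations $\hat{x}_i$ and $N$ encoded test observations $\hat{y}_j$, an unbiased empirical approximation of the MMD on the lower-dimensional space is a sum of two U-statistics and a sample average, which is given by (Gretton et al., 2012),

$$\mathrm{MMD}_u^2 = \frac{1}{M(M-1)} \sum_{i=1}^{M} \sum_{j \neq i} k(\hat{x}_i, \hat{x}_j) + \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} k(\hat{y}_i, \hat{y}_j) - \frac{2}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} k(\hat{x}_i, \hat{y}_j) \quad (17)$$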


We used $\mathrm{MMD}_u^2$ as a test statistic with the null hypothesis $H_0: P = Q$ and the alternative hypothesis $H_1: P \neq Q$. Let $\alpha$ be the significance p-value that gives the maximum permissible probability of falsely rejecting the null hypothesis $H_0$. Then, under the permutation-based hypothesis test, the set of all encoded observations is used to generate randomly permuted partitions. After performing the permutations, we compute the test statistic for each permuted partition and compare it with the test statistic of the original partition, as presented in Algorithm 2 of Appendix A.4. We then calculate the mean p-value $p$ as the proportion of permutations where the permuted statistic exceeds the original one. Finally, we reject our null hypothesis $H_0$ if $p < \alpha$ and define $c(x) = 0$ for the test samples.
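The permutation-based MMD test used as the attention mechanism can be sketched as follows. This is a minimal illustration rather than the procedure of Algorithm 2 in Appendix A.4: the RBF kernel, its bandwidth, the number of permutations, the function names, and the omission of the encoder are all assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased MMD^2 estimate: two U-statistics minus twice a sample average."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))  # exclude i == j
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

def permutation_p_value(x, y, num_permutations=200, bandwidth=1.0, seed=0):
    """Share of random re-partitions whose MMD^2 reaches the observed value."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(x, y, bandwidth)
    pooled = np.concatenate([x, y], axis=0)
    m = len(x)
    count = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(pooled))
        stat = mmd2_unbiased(pooled[perm[:m]], pooled[perm[m:]], bandwidth)
        count += int(stat >= observed)
    return count / num_permutations

def attention_gate(x_ref, y_test, alpha=0.05):
    """Return 0 (OOD) if the null hypothesis P = Q is rejected, else 1."""
    return 0 if permutation_p_value(x_ref, y_test) < alpha else 1

rng = np.random.default_rng(1)
x_ref = rng.normal(size=(50, 4))              # stand-in for encoded in-distribution data
y_shift = rng.normal(loc=3.0, size=(50, 4))   # clearly shifted test batch
print(attention_gate(x_ref, y_shift))
```

The gate value returned here plays the role of the attention output that multiplies the scale and translation sub-networks in each coupling block.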

3.3 Likelihood-based threshold for OOD detection

The decision of deep generative models to classify input test samples as in-distribution or OOD is naturally grounded in a likelihood-based threshold. To realize a robust likelihood-based OOD detector, we assert that the minimum posterior log-likelihood score of an in-distribution sample should preferably be higher than the maximum posterior log-likelihood score of the OOD samples. Hence, we define our likelihood-based threshold $\tau$ for OOD detection as the maximum posterior log-likelihood of OOD samples. Moreover, to study the effect of the p-value on the performance of our approach, we relate the significance p-value $\alpha$ to the confidence bounds of the Gaussian prior distribution and infer several critical values of this likelihood-based threshold based on $\alpha$. Therefore, $1 - \alpha$ can be seen as the proportion of the data within $n$ standard deviations $\sigma$ of the mean $\mu$, with $n$ computable by the inverse of the error function, using,

$$n = \sqrt{2}\, \mathrm{erf}^{-1}(1 - \alpha) \quad (18)$$


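Now, let $D$ be the dimension of the latent observation $z$. Then, given the mean $\mu$ and variance $\sigma^2$ of the prior Gaussian distribution, the log-likelihood $\log p_Z(z)$ is,

$$\log p_Z(z) = -\frac{D}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{D} (z_i - \mu)^2 \quad (19)$$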



The maximum likelihood estimate (MLE) of can then be computed as the asymptotically unbiased upper bound that needs to satisfy the following condition,


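Given $\mu$ and $\sigma$, the values of $z_i$ should be $\mu \pm n\sigma$ to satisfy the condition in Eq. 20. Hence, substituting this proposition in Eq. 19 yields:

$$\tau = -\frac{D}{2} \log(2 \pi \sigma^2) - \frac{D\, n^2}{2} \quad (21)$$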


Eq. 21 shows that the MLE upper bound $\tau$ can be interpreted as a robust likelihood-based threshold since it is data-independent and depends only on the mean $\mu$ and standard deviation $\sigma$ of the prior distribution as well as the p-value $\alpha$. Hence, for fixed $\mu$ and $\sigma$, the critical values of the threshold $\tau$ can be controlled by changing the significance p-value $\alpha$. Our likelihood-based threshold $\tau$ therefore enables us to interpret the robustness of our approach for OOD detection w.r.t. the confidence level of our attention mechanism given a p-value $\alpha$.
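The relation between the p-value and the threshold can be made concrete with a short sketch. It rests on an assumed reading of this section: $1 - \alpha = \mathrm{erf}(n / \sqrt{2})$, and the threshold is the Gaussian log-likelihood of a latent vector whose every coordinate lies exactly $n$ standard deviations from the mean. The function names, the default prior, and the example dimension are hypothetical.

```python
import math
from statistics import NormalDist

def num_std_from_alpha(alpha):
    """Solve 1 - alpha = erf(n / sqrt(2)) for n.

    Since erf(n / sqrt(2)) = 2 * Phi(n) - 1, this equals Phi^{-1}(1 - alpha / 2),
    which the standard library provides without needing an inverse erf.
    """
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def log_likelihood_threshold(alpha, dim, mu=0.0, sigma=1.0):
    """Gaussian log-likelihood when every coordinate sits at mu + n * sigma."""
    n = num_std_from_alpha(alpha)
    z_i = mu + n * sigma                      # boundary value of each coordinate
    per_dim = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (z_i - mu) ** 2 / (2.0 * sigma ** 2))
    return dim * per_dim

# Example with a hypothetical latent dimension of 3 * 32 * 32.
for alpha in (0.01, 0.05, 0.1):
    n = num_std_from_alpha(alpha)
    tau = log_likelihood_threshold(alpha, dim=3 * 32 * 32)
    print(f"alpha={alpha}: n={n:.3f}, threshold={tau:.1f}")
```

Under these assumptions a smaller p-value gives a larger $n$ and therefore a lower threshold, which admits more test samples as in-distribution.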

4 Experimental Results

We evaluated the robustness of our method in a variety of experimental settings. For all our experiments, we fixed an in-distribution dataset for training our InFlow model and ran inference on several OOD datasets. The details related to the datasets used in our experiments can be found in Appendix A.1. The particulars related to the hyperparameters used during training and inference are given in Appendix A.3. We assess the robustness of our approach with three different categories of outlier data. The first category of test samples is generated by adding different types of visible perturbations to the in-distribution data samples (see Appendix A.10). The second category covers adversarial attacks on the in-distribution data samples with invisible perturbations (see Appendix A.9). The third category covers dataset drifts where the semantic information and object classes of the test dataset are unseen by our InFlow model during training. Under the third category, we present some of the results in Section 4 while further results related to this category are shown in Appendix A.8. We also visualized the activations of the sub-networks $s$ and $t$ as well as the input and latent observations for in-distribution and OOD samples and compared the behavior of our InFlow model with that of a RealNVP model (see Appendix A.11).

Datasets            InFlow   Likelihood Ratio   LR      ODIN    Outlier exposure   IC
MNIST               1        0.961              0.996   0.997   0.999              0.991
FashionMNIST        1        0.939              0.989   0.995   0.995              0.972
SVHN                1        0.224              0.763   0.970   0.983              0.919
CelebA              1        0.668              0.786   0.965   0.858              0.677
CIFAR 10 (train)    0.513    0.497              0.494   0.702   0.504              0.497
CIFAR 10 (test)     0.529    0.500              0.496   0.706   0.500              0.500
Tiny ImageNet       0.556    0.273              0.848   0.941   0.984              0.362
Noise               1        0.618              0.739   1       0.995              0.878
Constant            1        0.918              0.935   0.908   0.999              1
Table 1: AUCROC values of our InFlow model at a fixed p-value with CIFAR 10 training data as in-distribution samples, compared with other OOD detection methods.


We used three different metrics, namely Area Under the Curve-Receiver Operating Characteristic (AUCROC), False Positive Rate at 95% True Positive Rate (FPR95) and Area Under the Curve-Precision Recall (AUCPR), to quantitatively evaluate the likelihood-based OOD detection performance of our method compared with other approaches. A receiver operating characteristic is a plot of the true positive rate (TPR) against the false positive rate (FPR) that shows the performance of the binary classification at different threshold configurations. We assign the binary label 1 as the ground truth for the log-likelihood scores obtained from the training in-distribution samples and the binary label 0 as the ground truth for the log-likelihood scores obtained from the test samples. AUCPR is the area under the precision-recall curve computed with the same ground truth as AUCROC, while FPR95 is the false positive rate when the true positive rate is at least 95%.
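Two of these metrics can be computed directly from log-likelihood scores, as the following sketch shows. It assumes the labeling described above (in-distribution training scores are the positives, test scores the negatives); the function names are hypothetical, and the AUCROC is computed through its equivalent rank interpretation rather than by integrating the curve.

```python
import numpy as np

def auc_roc(scores_pos, scores_neg):
    """AUCROC as the probability that a positive outscores a negative.

    Ties count as one half, which matches the area under the ROC curve.
    """
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def fpr_at_tpr(scores_pos, scores_neg, tpr=0.95):
    """False positive rate at the highest threshold keeping TPR >= tpr."""
    pos = np.sort(np.asarray(scores_pos, dtype=float))
    neg = np.asarray(scores_neg, dtype=float)
    k = int(np.floor((1.0 - tpr) * len(pos)))  # positives allowed below threshold
    threshold = pos[k]                          # at least len(pos) - k positives pass
    return float(np.mean(neg >= threshold))

# Toy log-likelihood scores: label 1 (in-distribution) vs. label 0 (test).
rng = np.random.default_rng(0)
in_dist_scores = rng.normal(loc=0.0, scale=1.0, size=500)
ood_scores = rng.normal(loc=-4.0, scale=1.0, size=500)

print(auc_roc(in_dist_scores, ood_scores))
print(fpr_at_tpr(in_dist_scores, ood_scores))
```

An AUCROC near 0.5 means the two score sets are indistinguishable, which is the desired outcome when both sets come from the in-distribution dataset.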

Figure 2: The histogram of log-likelihoods of different datasets at different values of significance p-value when InFlow model was trained with CelebA training data.

Quantitative comparison with state-of-the-art:

We compare the performance of our InFlow model with other OOD detection methods present in the literature using the three mentioned metrics. The methods that we evaluated are Likelihood ratio (Ren et al., 2019), Likelihood regret (LR) (Xiao et al., 2020), ODIN (Liang et al., 2018), Outlier exposure (Hendrycks et al., 2018) and Input complexity (IC) (Serrà et al., 2019). The details related to the implementation of these methods are described in Appendix A.5. Table 1 reports the AUCROC scores obtained from our model with CIFAR 10 training data as the in-distribution dataset, compared with the other approaches. It can be observed that, except for the Tiny ImageNet test dataset, our model is robust and reaches the highest possible AUCROC score on each of the evaluated OOD datasets. The AUCROC scores of around 0.5 for the CIFAR 10 training and test sets show that our model does not distinguish between these two sets of in-distribution CIFAR 10 samples, which further verifies the robustness of our approach. Therefore, the results convey that our likelihood-based OOD detection is effective in solving the overconfidence issue of normalizing flows. The FPR95 and AUCPR scores for the same experimental setting are shown in Table 7 in Appendix A.8. The additional results for experiments related to dataset drift can be found in Table 8 in Appendix A.8, where we present the AUCROC, FPR95 and AUCPR scores for the evaluated methods with FashionMNIST as the in-distribution dataset.

CIFAR 10 vs Tiny ImageNet:

The results for the InFlow model trained with CIFAR 10 training data, as shown in Table 1, reveal that our InFlow model performs poorly when detecting the Tiny ImageNet test dataset as OOD. We attribute this empirical outcome to two different reasons. The first relates to the significant overlap in the object classes of these two datasets. It is to be noted that all 10 object classes of the CIFAR 10 testing samples are included in the object classes of the Tiny ImageNet test set, due to which the InFlow model assigns high log-likelihood scores to the test samples with overlapping classes in the Tiny ImageNet dataset. The second relates to the influence of image resolution on the log-likelihood score even if there is no class overlap between the samples from the two datasets. The test samples of CIFAR 10 are inherently $32 \times 32$ sized RGB images while Tiny ImageNet images are of higher resolution and are downsampled to $32 \times 32$ to fit our experimental settings. We believe that decreasing the image resolution eliminated significant semantic information from the Tiny ImageNet samples that was important for OOD detection. Hence, we presume that the resulting lower-resolution Tiny ImageNet samples were of similar complexity to the CIFAR 10 samples.

MNIST             1       1       1       1       1
FashionMNIST      0.986   0.986   1       1       1
SVHN              0.994   0.995   0.999   1       1
CelebA (train)    0.494   0.506   0.511   0.524   0.548
CelebA (test)     0.495   0.506   0.514   0.525   0.548
CIFAR10           0.931   0.930   0.965   0.990   0.998
Tiny ImageNet     0.926   0.925   0.969   0.993   0.998
Noise             1       1       1       1       1
Constant          0.999   0.999   1       1       1
Table 2: AUCROC scores of our InFlow model trained on CelebA images as in-distribution and compared with different datasets. Each column corresponds to a different significance p-value.

In/Out classification at the decision boundary:

We anticipate that the decision on whether a test sample is an in-distribution or OOD can change by adding extremely small and invisible perturbations to the test sample that lies at the decision boundary. These perturbations can be applied in the form of adversarial attacks and the OOD detection approach must be robust to such adversarial changes in the in-distribution samples. We performed exhaustive experiments for evaluating the robustness of our InFlow model w.r.t. such attacks. The results and the discussion related to it can be viewed in Appendix A.9. Our results convey that the usage of the MMD based hypothesis test as our attention mechanism is highly effective in detecting such adversarial changes since projecting the probability distribution of the attacked samples in higher dimensional RKHS space stretches its mean embeddings further away from the mean embeddings of the in-distribution samples.

p-value and its limitations:

We empirically evaluated the effect of the p-value on the robustness of our InFlow model for OOD detection. We trained the model on CelebA training data and ran inference on several datasets including CelebA with different significance p-values. Table 2 shows the AUCROC scores obtained for the evaluated datasets at five different p-values. It can be observed that, in general, a smaller p-value leads to a lower AUCROC score on the evaluated datasets. The visual evidence of this behavior is shown in Figure 2 (a), where a number of OOD samples from the CIFAR 10 and Tiny ImageNet test datasets attain high log-likelihood scores comparable to the log-likelihood scores of in-distribution samples. In contrast, a higher p-value leads to a number of in-distribution CelebA samples being wrongly classified as OOD. This is an apparent limitation of using a p-value: its dichotomy can significantly affect the decision-making of our InFlow model for OOD detection, as no single p-value can be interpreted as correct and foolproof for all types of data variability. This can lead to false positives and false negatives in life-sensitive real-world applications such as medical diagnosis and autonomous driving, where the tolerance for failure is low.

5 Conclusion and Discussion

In this paper, we addressed the issue of overconfident predictions of normalizing flows for outlier inputs, which has largely prevented these models from being deployed as robust likelihood-based outlier detectors. In this regard, we put forth theoretical evidence along with an exhaustive empirical investigation showing that normalizing flows can be highly effective for detecting OOD data if the sub-network activations at each of their coupling blocks are complemented by an attention mechanism. We claim that, considering the benefits of our approach, developing new flow architectures with high complexity particularly for OOD detection is not beneficial. Future work should instead focus on enhancing the attention mechanism to improve the robustness of these likelihood-based generative models for OOD detection. One approach in this direction is relating our OOD detection approach to a Generative Adversarial Network (GAN). To our understanding, the normalizing flow model can act as a generator network that learns to map the input samples into a latent space while the attention mechanism can be viewed as a discriminator that distinguishes in-distribution samples from OOD samples, thereby improving the performance of the generator for OOD detection. The development of a robust OOD detection framework has a significant societal impact since such systems are crucial for the deployment of reliable and fair machine learning models in several real-world applications including medical diagnosis and autonomous driving. However, we urge caution in relying solely on OOD detection techniques for such sensitive applications and encourage more research in the direction of attention-based normalizing flows for OOD detection to further understand the limitations and mitigate potential risks. To the best of our knowledge, we are the first to overcome the high confidence issue of normalizing flows for OOD inputs, facilitating methodological progress in this domain.


  • Dinh et al. [2015] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. In International Conference on Learning Representations (ICLR), 2015.
  • Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
  • Sorrenson et al. [2020] Peter Sorrenson, Carsten Rother, and Ullrich Köthe. Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). In International Conference on Learning Representations (ICLR), 2020.
  • Kingma et al. [2018] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • Grathwohl et al. [2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representations (ICLR), 2019.
  • Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural Spline Flows. In Advances in Neural Information Processing Systems (NIPS), 2019.
  • Nalisnick et al. [2019] Eric Nalisnick, Akihiro Matsukawa, Yee W. Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? In International Conference on Learning Representations (ICLR), 2019.
  • Kirichenko et al. [2020] Polina Kirichenko, Pavel Izmailov and Andrew G. Wilson. Why Normalizing Flows Fail to Detect Out-of-Distribution Data. In Advances in Neural Information Processing Systems (NIPS), 2020.
  • Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research (JMLR), 2012.
  • Nguyen et al. [2012] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Ren et al. [2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems (NIPS), 2019.
  • Liang et al. [2018] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations (ICLR), 2018.
  • DeVries et al. [2018] Terrance DeVries, and Graham Wr. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv preprint arXiv:1802.04865, 2018.
  • Hendrycks et al. [2018] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations (ICLR), 2019.
  • Hendrycks et al. [2018] Dan Hendrycks, and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations (ICLR), 2017.
  • Lee et al. [2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • Chen et al. [2020] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Robust Out-of-distribution Detection for Neural Networks. arXiv preprint arXiv:2003.09711, 2020.
  • Ardizzone et al. [2020] Lynton Ardizzone, Radek Mackowiak, Carsten Rother, and Ullrich Köthe. Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification. In Advances in Neural Information Processing Systems (NIPS), 2020.
  • Ardizzone et al. [2019] Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W. Pellegrini, Ralf S. Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • Lee et al. [2018] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. In International Conference on Learning Representations (ICLR), 2018.
  • Hendrycks et al. [2019] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Advances in Neural Information Processing Systems (NIPS), 2019.
  • Serrà et al. [2019] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations (ICLR), 2019.
  • Kobyzev et al. [2020] Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing Flows: An Introduction and Review of Current Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • Xiao et al. [2020] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder. In Advances in Neural Information Processing Systems (NIPS), 2020.
  • Morningstar et al. [2021] Warren R. Morningstar, Cusuh Ham, Andrew G. Gallagher, Balaji Lakshminarayanan, Alexander A. Alemi, and Joshua V. Dillon. Density of States Estimation for Out-of-Distribution Detection. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
  • Chen et al. [2021] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Informative Outlier Matters: Robustifying Out-of-distribution Detection Using Outlier Mining. In International Conference on Learning Representations (ICLR), 2021.
  • Zisselman et al. [2020] Ev Zisselman, and Aviv Tamar. Deep Residual Flow for Out of Distribution Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Rabanser et al. [2019] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In Advances in Neural Information Processing Systems (NIPS), 2019.
  • Nalisnick et al. [2020] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, and Balaji Lakshminarayanan. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality. In International Conference on Learning Representations (ICLR), 2020.
  • Guo et al. [2018] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input Transformations. In International Conference on Learning Representations (ICLR), 2018.
  • Choi et al. [2018] Hyunsun Choi, Eric Jang, Alexander A. Alemi. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv preprint arXiv:1810.01392, 2019.
  • Liu et al. [2018] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision (ICCV), 2015.
  • LeCun et al. [2010] Yann LeCun, Corinna Cortes, and Christopher J. Burges. Mnist handwritten digit database. 2010.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Krizhevsky et al. [2019] Alex Krizhevsky, Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009
  • Pouransari et al. [2019] Tiny ImageNet Visual Recognition Challenge.
  • Mnih et. al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare et. al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Hasselt et. al. [2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
  • Wang et. al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In 33rd International Conference on Machine Learning (ICML), 2016.
  • Zhang et. al. [2020] Chaoning Zhang, Philipp Benz, Tooba Imtiaz, and In So Kweon. CD-UAP: Class Discriminative Universal Adversarial Perturbation. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • Hendrycks et. al. [2019] Dan Hendrycks, Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations (ICLR), 2019.
  • Lakshminarayanan et. al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • DeVries et. al. [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
  • Akcay et. al. [2018] Samet Akcay, Amir A. Abarghouei, and Toby P. Breckon. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision (ACCV), 2018.

Appendix A Experimental Settings

a.1 Datasets

We evaluated our model by carrying out experiments on publicly available datasets such as CelebA [Liu et al., 2018], MNIST [LeCun et al., 2010], FashionMNIST [Xiao et al., 2017], SVHN [Netzer et al., 2011], CIFAR-10 [Krizhevsky et al., 2019], Tiny ImageNet [Pouransari et al., 2019]. To maintain consistency, we created all input as sized RGB images. Considering some of the evaluated datasets were of different resolution, we also resized those datasets as dimensional RGB images. For grayscale datasets such as MNIST and FashionMNIST, we concatenated the grayscale pixel values from the single-channel into three RGB channels. In addition to the publicly available datasets, we also synthetically generated two new datasets namely Noise and Constant to evaluate our method on the feature boundaries. For the Noise dataset, we performed a random sampling of integers between the range for each of the data points in all three RGB channels to obtain an RGB noise image while for the Constant dataset, we randomly sampled three different integers from the range and assigned it to each pixel of the three RGB channels respectively. Each of the evaluated datasets was normalized between the range before using it in our experiments. Table 3 shows the original size of the datasets along with the number of images present in each of these datasets and the segregation of training and test sets. We keep the training set empty for all the datasets which were not used for training the InFlow model. Figure 3 (a) - (h) shows nine examples of resized images in a setting for each of the datasets with Noise being of highest feature complexity and Constant with the least.
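The construction of the two synthetic datasets and the replication of grayscale images into three channels can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the image side length (`SIZE`), the integer range (`LOW`, `HIGH`), the normalization to [0, 1], and all function names are assumptions, since the exact values are not recoverable from the text above.

```python
import random

SIZE = 32            # assumed image side length (hypothetical)
LOW, HIGH = 0, 255   # assumed integer pixel range (hypothetical)

def noise_image(rng, size=SIZE):
    """Noise: every pixel of every RGB channel is sampled independently."""
    return [[[rng.randint(LOW, HIGH) for _ in range(size)]
             for _ in range(size)] for _ in range(3)]

def constant_image(rng, size=SIZE):
    """Constant: one random integer per channel, copied to all its pixels."""
    values = [rng.randint(LOW, HIGH) for _ in range(3)]
    return [[[v] * size for _ in range(size)] for v in values]

def gray_to_rgb(gray):
    """Replicate a single-channel image into three identical channels."""
    return [[row[:] for row in gray] for _ in range(3)]

def normalize(image):
    """Rescale integer pixels to [0, 1] (assumed normalization range)."""
    return [[[p / HIGH for p in row] for row in channel] for channel in image]

rng = random.Random(0)
noise = normalize(noise_image(rng))
const = normalize(constant_image(rng))
```

Images are stored channel-first as nested lists here only to keep the sketch free of dependencies; an array library would be the natural choice in practice.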

Dataset Actual size Total images Training set Test set
MNIST 70,000 60,000 10,000
FashionMNIST 70,000 - 10,000
SVHN 99,289 - 26,032
CelebA 202,599 150,000 52,599
CIFAR 10 60,000 50,000 10,000
Tiny ImageNet 120,000 - 10,000
Noise 10,000 - 10,000
Constant 10,000 - 10,000
Table 3: Details of the evaluated datasets such as size and the train-test partitions.
Figure 3: The resized RGB images from different datasets used in our experiments. Panels: (a) Noise, (b) CelebA, (c) Tiny ImageNet, (d) CIFAR 10, (e) SVHN, (f) FashionMNIST, (h) Constant.

a.2 Model

As mentioned in Section 3.1, the central part of our normalizing flow model, InFlow, is an affine coupling block inspired by [Dinh et al., 2017]. Figure 4 shows the architecture of our model at the coupling block with input and output . The input was split channel-wise into two parts, containing a single channel of the input RGB image and with the remaining two channels.

Figure 4: A single coupling block of our InFlow framework.

The sub-part is transformed with learnable functions and respectively, which we formulated as a neural network whose architectural details are in Table 4. The network consists of two convolutional layers with the ReLU unit as the non-linear activation function. The resolution of the input and output features in each of these convolutional layers is not changed. As we defined ReLU as the last layer of our and sub-networks, we empirically ensured that the .

Operation In-channel Out-channel Kernel Stride Padding
Conv2D + ReLU 1 256 (3,3) (1,1) (1,1)
Conv2D + ReLU 256 1 (3,3) (1,1) (1,1)
Table 4: Network details for and transformations.

In addition to the use of learnable functions and , we extended the design with an attention mechanism by element-wise multiplying it with the output of the and networks as discussed in Section 3.1. Hence, each of these coupling blocks is stacked together to form our end-to-end InFlow framework as shown in Figure 5. We pass our attention mechanism using a conditional node to each of the coupling blocks. Additionally, we perform random permutations of the variables between the two subsequent coupling blocks to ensure that the ordering of the sub-parts and are randomly changed across the channel dimension so that each channel is getting transformed using the and sub-networks at a particular coupling block of InFlow framework.

Figure 5: The end-to-end InFlow framework with coupling blocks.
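To make the role of the attention mechanism inside a coupling block concrete, the following toy sketch applies an affine coupling step in the style of Real NVP to flat lists of numbers. It is a sketch under stated assumptions rather than the authors' implementation: the affine form, the scalar attention value `c`, and the stand-in functions `s_fn` and `t_fn` (which end in a ReLU, as described above) are illustrative choices, and both halves are given the same length for simplicity.

```python
import math

def coupling_forward(u1, u2, s_fn, t_fn, c):
    """Affine coupling step with an attention value c.

    u1 passes through unchanged; u2 is scaled and shifted by functions of u1.
    The attention value multiplies the outputs of s_fn and t_fn, so c = 0
    turns the block into the identity map with zero log-determinant.
    """
    s = [c * v for v in s_fn(u1)]
    t = [c * v for v in t_fn(u1)]
    v2 = [x * math.exp(si) + ti for x, si, ti in zip(u2, s, t)]
    log_det = sum(s)  # log |det J| of an affine coupling block
    return u1, v2, log_det

def coupling_inverse(v1, v2, s_fn, t_fn, c):
    """Exact inverse of coupling_forward for the same attention value."""
    s = [c * v for v in s_fn(v1)]
    t = [c * v for v in t_fn(v1)]
    u2 = [(y - ti) * math.exp(-si) for y, si, ti in zip(v2, s, t)]
    return v1, u2

# Toy stand-ins for the learned sub-networks; both end in a ReLU.
s_fn = lambda u: [max(0.0, 0.5 * x) for x in u]
t_fn = lambda u: [max(0.0, x + 1.0) for x in u]
```

With `c = 1` the block behaves like a regular affine coupling layer, while `c = 0` leaves the input untouched, which is one way to read the effect of multiplying the sub-network outputs by the attention value.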

a.3 Training details

We performed three different types of experiments for evaluating our model for robustness with OOD inputs as mentioned in Appendix A.6. We present the details related to the attention mechanism setup in Appendix A.7. The InFlow model was trained in a comparable setting for each of the experiments where we used the Adam optimizer with initial learning rate of , momentum and and an exponential decay rate of . Our model was trained on a single NVIDIA Tesla V100 GPU for 200 epochs with each epoch containing 100 training steps and a batch size of

a.4 The Pseudo Code

The pseudo-code for training the InFlow model is described in Algorithm 1.

Algorithm 1 The maximum likelihood objective: InFlow
Require: In-distribution samples , normalizing flow model , number of iterations , learning rate
Choose a subset with and assign
for  iterations do
    minimize negative log-likelihood
end for
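The training loop of Algorithm 1 can be illustrated with a deliberately tiny flow. The sketch below is not the InFlow model: it assumes a one-dimensional affine flow `z = exp(w) * x + b` with a standard normal latent, plain mini-batch gradient descent instead of Adam, and hypothetical hyperparameters, and it leaves out the attention input entirely. It only shows what minimizing the negative log-likelihood under the change-of-variables formula looks like.

```python
import math
import random

def nll(xs, w, b):
    """Mean negative log-likelihood of xs under z = exp(w) * x + b, z ~ N(0, 1)."""
    a = math.exp(w)
    total = 0.0
    for x in xs:
        z = a * x + b
        log_pz = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
        total += -(log_pz + w)  # log |dz/dx| = w
    return total / len(xs)

def train(xs, iterations=500, lr=0.05, batch_size=64, seed=0):
    """Mini-batch gradient descent on the negative log-likelihood."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        batch = rng.sample(xs, batch_size)
        a = math.exp(w)
        grad_w = grad_b = 0.0
        for x in batch:
            z = a * x + b
            grad_w += z * a * x - 1.0  # d(-log p(x)) / dw
            grad_b += z                # d(-log p(x)) / db
        w -= lr * grad_w / batch_size
        b -= lr * grad_b / batch_size
    return w, b

data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 2.0) for _ in range(1000)]
w, b = train(data)
```

For this toy data the optimum is a scale near 0.5 and a shift near -1.5, which standardizes the samples; a full flow replaces the affine map with the stacked coupling blocks and sums their log-determinants.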

The pseudo-code for using the trained InFlow model for OOD detection is described in Algorithm 2.

Algorithm 2 OOD detection using InFlow model
Require: Unknown set of samples , trained model and p-value threshold .
Use an encoder function    (dimensionality reduction)
Take all encoded observations as () and perform permutations with .
Partition the set () into () for each of the permutations
for  permutations do    (perform p-value permutation tests)
    if  then
        assign the empirical p-value at permutation as 0
    else
        assign the empirical p-value at permutation as 1
Compute the mean p-value from the p-values for each of the permutations
if  then
    reject null hypothesis and assign
    Estimate the log-likelihood
else
    reject alternate hypothesis and assign
    Estimate the log-likelihood
Estimate the likelihood-based threshold with
if  then
    Assign samples as Out-of-Distribution (OOD)
else
    Assign samples as in-distribution
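The permutation-test part of Algorithm 2 can be sketched in plain Python as follows. The sketch rests on assumptions: `permutation_p_value`, `attention_from_p_value` and `mean_gap` are hypothetical names, the default number of permutations and threshold are placeholders, the mapping of the test outcome to an attention value of 1 or 0 is one reading of the algorithm, and `mean_gap` is a simple stand-in statistic for one-dimensional data where the paper uses MMD on encoded samples.

```python
import random

def permutation_p_value(x, y, statistic, n_permutations=100, seed=0):
    """Fraction of random re-partitions of the pooled samples whose statistic
    is at least as large as the statistic of the original split (x, y)."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        # Each permutation contributes 0 or 1; the p-value is their mean.
        hits += 1 if statistic(px, py) >= observed else 0
    return hits / n_permutations

def attention_from_p_value(p_value, alpha=0.05):
    """Return 1 when the null hypothesis (same distribution) is kept, else 0."""
    return 1 if p_value > alpha else 0

def mean_gap(x, y):
    """Placeholder statistic: absolute difference of the sample means."""
    return abs(sum(x) / len(x) - sum(y) / len(y))
```

A test batch drawn far from the in-distribution batch yields a p-value close to 0 and hence an attention value of 0, after which the log-likelihood is estimated with that attention value.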

a.5 Implementing the state of the art

We quantitatively compared the robustness of our InFlow model with other popular OOD detection methods such as ODIN [Liang et al., 2018], Likelihood ratio [Ren et al., 2019], Outlier exposure [Hendrycks et al., 2018], Likelihood regret (LR) [Xiao et al., 2020], and Input Complexity (IC) [Serrà et al., 2019] using AUCROC, FPR95 and AUCPR scores. During the evaluation, we fixed the same in-distribution samples for our approach as well as other competing methods. For ODIN, Outlier Exposure, and LR, we followed similar meta-parameter settings as recommended in the code documentation of these methods. For the IC method, we followed the implementation as provided by [Xiao et al., 2020]. We first calculated the input data complexity using the length of the binary string provided by a PNG-based lossless compression algorithm and subtracted this input data complexity from the negative log-likelihood scores. For the Likelihood ratio method, we computed the scores by subtracting the log-likelihood of the background model from the log-likelihood of the main model. The background model was trained on perturbed input data by corrupting the input semantics with random pixel values. The number of pixels that were perturbed was of the total number of pixels for the model trained with FashionMNIST data and for the model trained with CIFAR 10.
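The two likelihood-based baseline scores described above can be written down in a few lines. The following is a hedged sketch, not the reference implementation of either method: `zlib` (DEFLATE) is swapped in for the PNG compressor, the scores are assumed to be expressed in bits per dimension, and the function names are hypothetical.

```python
import random
import zlib

def complexity_bits_per_dim(pixels):
    """Compressed length in bits per pixel value.

    pixels is a flat sequence of integers in 0..255. zlib is used here as a
    stand-in for the PNG-based lossless compressor.
    """
    raw = bytes(pixels)
    return 8.0 * len(zlib.compress(raw, 9)) / len(pixels)

def ic_score(nll_bits_per_dim, pixels):
    """Input-complexity score: negative log-likelihood minus complexity."""
    return nll_bits_per_dim - complexity_bits_per_dim(pixels)

def likelihood_ratio(log_p_main, log_p_background):
    """Likelihood-ratio score: main-model minus background-model log-likelihood."""
    return log_p_main - log_p_background

rng = random.Random(0)
flat_noise = [rng.randrange(256) for _ in range(3072)]
flat_const = [128] * 3072
```

A constant image compresses to almost nothing while a noise image barely compresses at all, so for the same negative log-likelihood the complexity term shifts their scores in opposite directions.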

a.6 Evaluating robustness to OOD inputs

It is difficult to find a consensus in the current literature on a single definition for estimating the robustness of a machine learning model. Ideally, a model can be said to be robust if it is able to distinguish OOD inputs from the in-distribution samples. Therefore, it is essential to define which kinds of OOD outliers a robust model should be able to detect. In this regard, we identify three scenarios that are generally used for assessing the robustness of an OOD detection approach:

  • Dataset drift: A robust OOD detection model should detect those input test samples that do not contain any of the object classes present in the in-distribution samples. These test samples have a complete shift in the semantic information of the data, and ideally, the model should not provide predictions with high confidence on such data since it has not observed such data during training (see Appendix A.8).

  • Adversarial attacks: A robust OOD detection model should be aware of adversarial attacks on the in-distribution samples. In these attacks, the magnitude of the perturbation is kept low so that the attacked sample is indistinguishable from the in-distribution samples, yet the perturbation is enough to trick a model (e.g. a classifier) that interprets the attacked input sample with high confidence. Therefore, such attacks have the potential to significantly degrade the performance of ML models (see Appendix A.9).

  • Visible perturbations: We conduct another kind of robustness test where we corrupt the in-distribution samples with different types of perturbations. Additionally, we test the performance of our model on different levels of corruption severity. The corruptions are visible for all severity levels even though the inherent semantic information is still preserved (see Appendix A.10).

Operation In-channel Out-channel Kernel Stride Padding
Conv2D + ReLU 3 64 (4,4) (2,2) (0,0)
Conv2D + ReLU 64 128 (4,4) (2,2) (0,0)
Conv2D + ReLU 128 256 (4,4) (2,2) (0,0)
Conv2D + ReLU 256 512 (4,4) (2,2) (0,0)
Operation In-features Out-features
Flatten + Linear 2048 32 - - -
Encoder network architecture for experiments with adversarial attacks.
Operation In-channel Out-channel Kernel Stride Padding
Conv2D + ReLU 3 64 (4,4) (2,2) (0,0)
Conv2D + ReLU 64 128 (4,4) (2,3) (0,0)
Conv2D + ReLU 128 256 (5,5) (2,2) (0,0)
Conv2D + ReLU 256 512 (4,5) (2,2) (0,0)
Conv2D + ReLU 512 512 (5,2) (2,2) (0,0)
Operation In-features Out-features
Flatten + Linear 4096 32 - - -
Table 5: Encoder network architecture for experiments without adversarial attacks.
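The number of input features of the final linear layer follows from the convolution settings through the usual output-size formula, `floor((size + 2 * padding - kernel) / stride) + 1`. The sketch below checks this for the four-layer encoder with square 4 x 4 kernels, stride 2 and no padding. The input resolution `INPUT_SIDE = 64` is an assumption made only for this illustration, since the input size is not stated in the text; it is one resolution for which the flattened output has 2048 features.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_features(side, layers, channels):
    """Flattened feature count after a stack of square convolutions."""
    for kernel, stride, padding in layers:
        side = conv_out(side, kernel, stride, padding)
    return channels * side * side

# Four Conv2D layers with kernel 4, stride 2, padding 0, ending in 512 channels.
four_layer = [(4, 2, 0)] * 4
INPUT_SIDE = 64  # hypothetical input resolution, not stated in the text

features = encoder_features(INPUT_SIDE, four_layer, 512)
```

The same helper can be applied per axis to the five-layer encoder, whose kernels and strides are not square.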

a.7 Dimensionality reduction

MMD as a test statistic has considerable time and memory complexity for high-dimensional data. To overcome this challenge, we used an encoder to reduce the number of features per sample. The first part of Table 5 shows the encoder network architecture used for the experiments that do not involve adversarial attacks. The second part of Table 5 shows the encoder architecture used for experiments conducted for evaluating robustness to adversarial attacks. It can be observed that the final dimension obtained for samples consists of just 32 features for each of the encoder architectures. For MMD computation, we defined the exponential quadratic function, or Radial Basis Function (RBF), as the kernel given by . The RBF kernel is positive definite due to which applying it on the input samples and that are dependent on variance produce a smooth estimate in the RKHS space. This aids in better interpretation of the mean embeddings for the respective input distributions. For all the experiments involving permutation tests, the significance p-value was set at and the number of permutations as . The average p-value was estimated for a batch of in-distribution samples and test samples.
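As an illustration of the test statistic, the sketch below computes a biased estimate of the squared MMD between two small sets of encoded feature vectors. It assumes the common RBF parametrization `k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))` with a hypothetical bandwidth `sigma`; the exact kernel form and bandwidth used in the experiments are not recoverable from the text.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel between two feature vectors (assumed parametrization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The double loops make the quadratic cost in the number of samples explicit, and each kernel evaluation is linear in the feature dimension, which is why reducing every sample to 32 features keeps the repeated evaluations of a permutation test affordable.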

a.8 Robustness to dataset drift


We utilized the CelebA training dataset as in-distribution and interpreted the influence of the p-value on the OOD detection performance of our InFlow model. The AUCPR and FPR95 scores for such a setting are shown in Table 6. It can be observed that as we lower the significance p-value , we obtain worse AUCPR and FPR95 scores, a behavior which is also discussed in Table 2. Since a smaller p-value is statistically significant, a lower mean p-value is required to reject the null hypothesis that the test samples are from in-distribution.

FPR95
MNIST 0 0 0 0 0
FashionMNIST 0.010 0.010 0 0 0
SVHN 0.001 0.001 0 0 0
CelebA (train) 0.051 0.049 0.049 0.046 0.045
CelebA (test) 0.051 0.049 0.047 0.042 0.041
CIFAR 10 0.006 0.007 0.005 0.002 0
Tiny ImageNet 0.008 0.009 0.005 0.002 0
Noise 0 0 0 0 0
Constant 0 0 0 0 0
AUCPR
MNIST 1 1 1 1 1
FashionMNIST 0.990 0.990 1 1 1
SVHN 0.996 0.996 0.999 1 1
CelebA (train) 0.488 0.517 0.531 0.559 0.599
CelebA (test) 0.487 0.517 0.530 0.556 0.599
CIFAR 10 0.970 0.970 0.986 0.996 0.999
Tiny ImageNet 0.912 0.910 0.966 0.993 0.998
Noise 1 1 1 1 1
Constant 0.999 0.999 1 1 1
Table 6: FPR95 and AUCPR values of our InFlow model trained on CelebA images as in-distribution and compared with different OOD datasets at different significance threshold values.

CIFAR 10:

We showed the efficiency of our InFlow model for detecting dataset drifts in Section 4. We further present the FPR95 and AUCPR scores (see Table 7) for the setting where CIFAR 10 was fixed as in-distribution samples and the model was evaluated on other datasets. The interpretation of the achieved FPR95 and AUCPR scores is similar to the AUCROC scores as shown in Table 1. It can be noticed that the FPR95 and AUCPR values of the Tiny ImageNet dataset are poor compared to other evaluated datasets that achieve the best possible values. We relate this behavior to the overlapping object classes and the influence of image resolution as explained in Section 4.

FPR95
Datasets InFlow Likelihood Ratio LR ODIN Outlier exposure IC
MNIST 0 0.150 0 0.014 0.006 0
FashionMNIST 0 0.200 0 0.028 0.027 0
SVHN 0 0.960 0 0.155 0.076 0
CelebA 0 0.720 0 0.181 0.517 0
CIFAR 10 (train) 0.047 0.032 0.031 0.941 0.945 0.042
CIFAR 10 (test) 0.049 0.035 0.035 0.950 0.950 0.050
Tiny ImageNet 0.139 0.575 0 0.412 0.076 0.075
Noise 0 0 0 0 0.008 0
Constant 0 0 0 0.413 0.001 0
AUCPR
Datasets InFlow Likelihood Ratio LR ODIN Outlier exposure IC
MNIST 1 0.910 0.969 0.997 0.992 0.942
FashionMNIST 1 0.909 0.914 0.994 0.976 0.849
SVHN 1 0.403 0.489 0.956 0.908 0.774
CelebA 1 0.794 0.532 0.946 0.562 0.355
CIFAR 10 (train) 0.567 0.511 0.503 0.509 0.170 0.503
CIFAR 10 (test) 0.561 0.506 0.497 0.500 0.165 0.499
Tiny ImageNet 0.515 0.129 0.498 0.950 0.935 0.126
Noise 1 0.189 0.888 1 0.934 0.407
Constant 1 0.656 0.673 0.871 0.998 1
Table 7: FPR95 and AUCPR values of our InFlow model with CIFAR 10 training data as in-distribution samples compared with other OOD detection methods.
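For readers who want to reproduce such numbers from raw detection scores, the sketch below computes AUCROC and FPR95 in plain Python. It assumes that OOD samples are the positive class and that a higher score means "more likely OOD"; the paper's exact convention is not restated here, and the function names are hypothetical. AUCPR is omitted for brevity.

```python
def auroc(in_scores, ood_scores):
    """Probability that an OOD sample scores higher than an in-distribution
    sample, with ties counted as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            wins += 1.0 if o > i else 0.5 if o == i else 0.0
    return wins / (len(in_scores) * len(ood_scores))

def fpr_at_95_tpr(in_scores, ood_scores):
    """False-positive rate on in-distribution data at the threshold that
    still flags at least 95% of the OOD samples."""
    ranked = sorted(ood_scores, reverse=True)
    k = -(-95 * len(ranked) // 100)  # ceil(0.95 * n) using integers only
    threshold = ranked[k - 1]
    flagged = sum(1 for i in in_scores if i >= threshold)
    return flagged / len(in_scores)
```

Under this convention, perfectly separated scores give an AUCROC of 1 and an FPR95 of 0, while two score sets drawn from the same distribution give an AUCROC near 0.5.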


We present further results where we evaluate our model trained on the FashionMNIST dataset and tested on other datasets. Table 8 provides the AUCROC, FPR95, and AUCPR results of our InFlow model when trained with FashionMNIST as the in-distribution dataset and evaluated against the other datasets. Our method can detect the dataset drift and provides robust results for detecting the OOD samples from all evaluated datasets. Additionally, we do not observe inferior performance of our model when detecting Tiny ImageNet as OOD, since the object classes in the Tiny ImageNet dataset are mutually exclusive from the object classes in the FashionMNIST dataset.

AUCROC
Datasets InFlow Likelihood Ratio LR Outlier exposure IC
MNIST 1 0.978 1 1 0.769
FashionMNIST (train) 0.554 0.510 0.503 0.529 0.506
FashionMNIST (test) 0.549 0.494 0.494 0.522 0.499
SVHN 1 0.981 1 0.891 0.927
CelebA 1 0.958 1 0.823 0.378
CIFAR 10 1 0.995 1 0.814 0.611
Tiny ImageNet 1 1 1 0.817 0.254
Noise 1 1 1 0.890 0.531
Constant 1 0.935 0.716 0.999 1
FPR95
Datasets InFlow Likelihood Ratio LR Outlier exposure IC
MNIST 0 0.175 0 0 0
FashionMNIST (train) 0.040 0.971 0.019 0.012 0.048
FashionMNIST (test) 0.044 0.975 0.025 0.015 0.050
SVHN 0 0.020 0 0.575 0
CelebA 0 0.053 0 0.601 0.227
CIFAR 10 0 0.010 0 0.637 0.075
Tiny ImageNet 0 0 0 0.639 0.4
Noise 0 0 0 0.040 0
Constant 0 0.075 0.25 0.001 0
AUCPR
Datasets InFlow Likelihood Ratio LR Outlier exposure IC
MNIST 1 0.692 1 1 0.756
FashionMNIST (train) 0.602 0.473 0.501 0.548 0.507
FashionMNIST (test) 0.595 0.467 0.493 0.546 0.496
SVHN 1 0.487 1 0.871 0.971
CelebA 1 0.525 0.999 0.834 0.612
CIFAR 10 1 0.354 1 0.823 0.892
Tiny ImageNet 1 0.693 1 0.780 0.408
Noise 1 0.693 1 0.908 0.464
Constant 1 0.653 0.839 0.997 1
Table 8: AUCROC, FPR95 and AUCPR values of our InFlow model trained on FashionMNIST training data compared with other OOD detection methods.

a.9 Robustness to adversarial attacks

We are interested in examining whether the InFlow model is robust to adversarial attacks. Adversarial attacks are tiny perturbations to the in-distribution samples that are completely hidden from human observation but severely impact the performance of a deep learning model in several real-world applications. The methods to generate adversarial attacks are commonly targeted at fooling supervised learning models. For performing such attacks, we focused on fooling a Reinforcement Learning (RL) agent that is playing Atari games. In RL, multiple actions (labels) might be considered correct or appropriate. Generated attacks should therefore not only be imperceptible perturbations but also ones that worsen the overall performance of the model.

Training RL Agents:

The task of playing Atari games using RL agents is well established in the RL literature. Hence, detecting attacks on Atari images acts as a useful evaluation of our model and provides a use case for more complex applications such as autonomous driving. Therefore, in preparation for the adversarial attacks, three agents were trained, one for each of three Atari games, namely Enduro, RoadRunner, and Breakout. They were implemented as Dueling Deep Q-Networks (DQN) [Wang et. al., 2016] trained on observations, while the optimization was done using the Double DQN [Hasselt et. al., 2016] algorithm.

Over a certain number of time steps, an agent interacts with an environment. In our case, the environment returns a grayscaled image of the game at each step. These so-called observations need to be preprocessed further. To keep recent history, the agent is presented not only with the current image but also with the last three images. All images are stacked to create an observation on which the agent bases its decision for the next step.
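The frame stacking described above can be sketched with a fixed-length queue. `FrameStack` is a hypothetical helper, not taken from the authors' code; filling the history with copies of the first frame at the start of an episode is a common convention that is assumed here.

```python
from collections import deque

class FrameStack:
    """Keeps the current frame plus the three previous ones."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At the start of an episode, fill the history with the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.observation()

    def observation(self):
        return list(self.frames)   # oldest first, newest last
```

Each call to `step` returns the stacked observation that would be passed to the agent, so the agent always sees four consecutive frames.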

The primary metric to evaluate an agent’s performance during training is the average episode reward over 100 episodes. [Mnih et. al., 2015] provided the scores of a human expert player for the games. If the average episode reward was significantly higher than these scores, the agent was considered reliable. Our empirical evaluation showed that the trained agents performed reliably on their tasks and are therefore suitable attack targets.

Adversarial attack algorithm:

Now given three such reliable Atari agents, the goal was to find a perturbation vector to be added to the observations that is able to fool the RL agent into outputting erroneous predictions. The perturbations are restricted to stay within a set range of . The main purpose of the attack was to lower the average episode reward of the agents. It was considered successful if the produced adversarial perturbations are unrecognisable to humans, they lower the overall performance of the agent, and they lead to predictions that are unfit for the current state of the environment. We utilized the algorithm of class-discriminative universal adversarial perturbations (CD-UAP) introduced by [Zhang et. al., 2020] to calculate perturbations that fulfill these criteria. In our case, the perturbations are not universal but input-specific. With this alteration, the algorithm produces a new perturbation for each observation. Although a universal perturbation is more complicated to calculate, it would be easily detected by InFlow if added repeatedly to the in-distribution data.
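The range restriction and the third success criterion can be expressed compactly. The sketch below is not the CD-UAP algorithm itself: it only shows how a candidate perturbation would be clamped to a symmetric bound `epsilon` and how an attacked prediction could be checked against a set of acceptable actions. The pixel range [0, 1] and all names are assumptions made for illustration.

```python
def project(delta, epsilon):
    """Clamp every perturbation entry to [-epsilon, epsilon]."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]

def apply_attack(observation, delta, epsilon, low=0.0, high=1.0):
    """Add the bounded perturbation and keep pixels in the valid range."""
    bounded = project(delta, epsilon)
    return [max(low, min(high, o + d)) for o, d in zip(observation, bounded)]

def is_unfit(clean_action, attacked_action, acceptable_actions):
    """True if the attack moved the agent to an action that is not among
    those considered appropriate for the current state."""
    return (attacked_action != clean_action
            and attacked_action not in acceptable_actions)
```

A perturbation that only switches the agent between two acceptable actions would not count as successful under this check, which reflects that several actions can be appropriate in the same state.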

Dataset creation:

There were a total of unattacked observations for each of the Atari agents for Breakout, Enduro and RoadRunner. We produced adversarially attacked samples with and which were used as the test samples during inference of the model. Note that not all of the calculated perturbations met the above three requirements for our attacks; the number of samples actually included in the dataset for each is listed in Table 9.

Unattacked 0.0008 0.0009 0.001 0.002 0.003 0.004 0.005
Breakout 10,000 3626 3735 3816 4464 8216 8487 10,000
Enduro 10,000 6410 6730 6405 8758 7105 9336 10,000
RoadRunner 10,000 8700 8882 9065 9867 9980 9993 10,000
Table 9: Number of samples that passed all the criteria for producing a natural adversarial example.
Figure 6: CD-UAP on Breakout, Enduro and RoadRunner.

Training and Results:

We trained three separate instances of our InFlow model, one for each of the games, with the original unattacked observations set as the in-distribution samples. Table 10 shows the AUCROC, FPR95 and AUCPR values obtained for the adversarially attacked samples with different values of . The scores reveal that our model is robust in detecting adversarial examples for the Breakout and Enduro games, where we observe an overall tendency for the scores to be better for higher , as expected. However, for RoadRunner, we do not observe comparable AUCROC scores even for perturbations at . A reason for this behavior could be the larger action space of RoadRunner in comparison to the two other games. Utilizing the CD-UAP algorithm, we determined multiple favorable actions for the current observation and shifted the predictions away from them. For Breakout and Enduro, the different actions contradict each other considerably more, and hence the perturbations need to include more distinct features to fool the agent. Additionally, we believe that the created perturbations can be disguised more easily in the detailed images of RoadRunner.

AUCROC
unattacked 0.0008 0.0009 0.001 0.002 0.003 0.004 0.005
Breakout 0.524 0.985 0.994 0.983 0.991 0.977 0.998 0.973
Enduro 0.500 0.938 0.880 0.918 0.928 0.981 0.959 0.939
RoadRunner 0.524 0.598 0.629 0.617 0.604 0.606 0.639 0.654
FPR95
unattacked 0.0008 0.0009 0.001 0.002 0.003 0.004 0.005
Breakout 0.048 0.001 0 0 0 0 0 0
Enduro 0.050 0.007 0.011 0.009 0.006 0.001 0.003 0.003
RoadRunner 0.049 0.022 0.019 0.019 0.035 0.039 0.036 0.030
AUCPR
unattacked 0.0008 0.0009 0.001 0.002 0.003 0.004 0.005
Breakout 0.549 0.971 0.980 0.971 0.981 0.982 0.992 0.983
Enduro 0.500 0.989 0.979 0.985 0.990 0.995 0.994 0.993
RoadRunner 0.556 0.761 0.788 0.781 0.782 0.776 0.802 0.810
Table 10: AUCROC, FPR95 and AUCPR values of our InFlow model trained on original Atari images as in-distribution and compared with the adversarially attacked images at different levels of attack with p-value fixed at .

a.10 Robustness to visible perturbations

Data generation:

To evaluate the robustness of our InFlow model on visibly perturbed images, we generated corrupted versions of CIFAR 10 test samples with 19 different types of perturbation at five separate severity levels as proposed in [Hendrycks et. al., 2019]. These 19 perturbation types were chosen since several existing ML models become unstable in predicting the object classes after the samples undergo such perturbations. Figure 7 shows two CIFAR 10 test examples with five different levels of perturbation severity applied to them. The perturbation types fall into four broad categories (Noise, Blur, Weather and Digital effects) that cover a broad spectrum of real-world perturbations.
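As an illustration of how a single perturbation type is applied at increasing severity, the sketch below adds Gaussian noise to pixels in [0, 1]. The standard deviations in `SIGMAS` are hypothetical placeholders chosen for this example and should not be read as the values of the benchmark; the function names are likewise assumptions.

```python
import random

# Hypothetical noise standard deviations for severity levels 1 to 5.
SIGMAS = [0.04, 0.06, 0.08, 0.09, 0.10]

def gaussian_noise(pixels, severity, rng):
    """Add zero-mean Gaussian noise to pixels in [0, 1] and clip the result."""
    sigma = SIGMAS[severity - 1]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def mean_abs_change(a, b):
    """Average absolute pixel difference between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

clean = [0.5] * 2000
mild = gaussian_noise(clean, 1, random.Random(0))
severe = gaussian_noise(clean, 5, random.Random(0))
```

Higher severity moves the corrupted image further from the clean one on average, which matches the expectation that more severe corruptions are easier to flag as OOD.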

Perturbation Type Severity 1 Severity 2 Severity 3 Severity 4 Severity 5
Gaussian Noise 0.229 0.260 0.705 0.851 1.0
Impulse Noise 0.314 0.566 0.901 1.0 1.0
Shot Noise 0.343 0.272 0.358 0.571 0.951
Speckle Noise 0.344 0.239 0.317 0.584 1.0
Defocus Blur 0.541 0.628 0.909 0.978 1.0
Gaussian Blur 0.540 0.909 0.978 1.0 1.0
Glass Blur 0.514 0.466 0.565 0.464 0.601
Motion Blur 0.602 0.955 1.0 1.0 1.0
Zoom Blur 0.815 0.910 0.977 0.978 0.978
Snow 1.0 1.0 1.0 1.0 1.0
Spatter 0.560 0.809 1.0 0.593 0.806
Frost 1.0 1.0 1.0 1.0 1.0
Fog 1.0 1.0 1.0 1.0 1.0
Brightness 1.0 1.0 1.0 1.0 1.0
Contrast 1.0 1.0 1.0 1.0 1.0
Saturate 1.0 1.0 1.0 1.0 1.0
Elastic Transform 0.571 0.649 0.791 0.884 0.860
Pixelate 0.500 0.525 0.550 0.547 0.729
JPEG Compression 0.485 0.509 0.509 0.509 0.491
Table 11: AUCROC values of our InFlow model trained on CIFAR 10 training images as in-distribution and evaluated with different types of visible perturbations at increasing severity levels.


We trained our InFlow model on the original CIFAR 10 training images and tested how well the model detects increasing severity levels of corruption. For training, we kept the significance p-value of our attention mechanism fixed, retained the hyperparameters set in Appendix A.3, and used the encoder architecture shown in Table LABEL:encoder. We then calculated the AUCROC scores shown in Table 11 for all 19 perturbation types at each severity level, with the CIFAR 10 training samples as in-distribution.
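For readers who want to reproduce this kind of score, a minimal sketch of an AUCROC computation from log-likelihoods is given below. It uses the rank interpretation of AUCROC: the probability that a randomly drawn in-distribution sample receives a higher log-likelihood than a randomly drawn perturbed sample, with ties counted as one half. The function name and the pairwise formulation are choices made for this example, not a description of the original evaluation code.

```python
import numpy as np

def aucroc(loglik_in, loglik_ood):
    """AUCROC as P(loglik_in > loglik_ood), counting ties as 0.5.

    Uses an explicit pairwise comparison, which needs O(n * m) memory
    and is therefore only suitable for small evaluation sets.
    """
    a = np.asarray(loglik_in, dtype=float)[:, None]
    b = np.asarray(loglik_ood, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))
```

Under this reading, a value of 1.0 means every perturbed sample scored below every in-distribution sample, while a value near 0.5 means the two sets are indistinguishable by log-likelihood.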


Our model detects OOD samples produced by weather effects such as frost, fog, and brightness at all severity levels with perfect AUCROC scores. For the spatter effect, we notice a different color pattern at severity levels 4 and 5 than at severity levels 1 to 3, which explains the drop in the performance of the InFlow model when the severity increases from level 3 to level 4. For perturbations in the noise and blur categories, increasing corruption generally results in higher AUCROC scores, since more severely perturbed test samples move further away from the in-distribution data. However, for shot noise, speckle noise, and glass blur, the AUCROC scores dip when the severity is increased from 1 to 2. We argue that at severity level 2, the perturbed samples resembled in-distribution samples. Overall, these results suggest that our model robustly detects several types of visible perturbations of in-distribution samples as OOD, and that its performance generally improves as the severity of the corruption increases.

Figure 7: The different perturbations on two CIFAR 10 images with increasing level of severity.

a.11 Visualization of sub-network activations and latent space

We visually compared how in-distribution and OOD samples behave in our model and in a RealNVP-based flow model [Dinh et al., 2017]. We visualized the sub-network activations, the output of each coupling block, and the latent space. We used two coupling blocks (K = 2) for both RealNVP and our model and adopted non-shared weights for the two sub-networks at each coupling block. Figure 8 shows the visualization results for RealNVP and for our InFlow model on in-distribution CelebA and on other OOD datasets.

Figure 8: The input, the activations of the two sub-networks, the activations of the coupling blocks, and the output in latent space for in-distribution and several OOD datasets, obtained from the RealNVP model and our InFlow model at K = 2.

Figure 8 (a) shows that the InFlow model transforms the in-distribution CelebA samples into a complex latent space after training. Figures 8 (b), (d), and (f) reveal that the RealNVP model, which has no attention mechanism, also transforms the OOD datasets into a much more complex distribution in the latent space. This is visual evidence that the RealNVP model learns the local pixel interactions of the input space for both in-distribution and OOD samples, so it cannot distinguish the semantic information of an in-distribution sample from that of an OOD outlier. As a consequence, the RealNVP model increases the log-likelihood of both in-distribution and OOD samples. Figures 8 (c), (e), and (g) show the visualizations of our InFlow model for different OOD datasets. Our model reproduces the semantic input features in its latent space, and the color of the coupling-block outputs changes as a consequence of the permutation. For grayscale images such as MNIST, the color change is not visible because all three channels contain the same grayscale image. Figure 8 therefore provides visual evidence that the InFlow model transfers OOD inputs directly to its latent space, whereas it transforms the in-distribution samples into a much more complex distribution.
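The quantities visualized here (sub-network activations, coupling-block outputs, and the latent output) can be collected with a small tracing loop. The following sketch implements a plain RealNVP-style affine coupling block in NumPy with K = 2 blocks and a fixed permutation between blocks. The sub-network names `s` and `t` follow the usual scale and translation naming of RealNVP and are an assumption here, as are the tiny one-layer sub-networks; the attention mechanism of InFlow is not modelled.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnet(dim_in, dim_out):
    """Hypothetical one-layer sub-network with a bounded tanh output."""
    w = rng.normal(0.0, 0.1, size=(dim_in, dim_out))
    return lambda h: np.tanh(h @ w)

class AffineCoupling:
    """RealNVP-style block: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""

    def __init__(self, dim):
        self.d = dim // 2
        self.s = make_subnet(self.d, dim - self.d)  # scale sub-network
        self.t = make_subnet(self.d, dim - self.d)  # translation sub-network

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.s(x1), self.t(x1)
        y = np.concatenate([x1, x2 * np.exp(s) + t], axis=1)
        # The log-determinant of the Jacobian is the sum of the scale outputs.
        return y, s.sum(axis=1), {"s": s, "t": t}

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.s(y1), self.t(y1)
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)

def run_flow(blocks, x):
    """Pass x through all blocks and record every intermediate activation."""
    trace, logdet = [], np.zeros(len(x))
    for block in blocks:
        x, ld, acts = block.forward(x)
        x = x[:, ::-1]  # fixed permutation: reverse the feature order
        logdet += ld
        trace.append({**acts, "block_output": x})
    return x, logdet, trace

K = 2
blocks = [AffineCoupling(8) for _ in range(K)]
x = rng.normal(size=(4, 8))
z, logdet, trace = run_flow(blocks, x)
```

Each entry of `trace` holds the `s` and `t` activations and the permuted block output, which are the kinds of arrays one would reshape into images for a figure of this type.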

a.12 InFlow architecture vs Robustness

To analyze how altering the architecture of our model affects OOD detection performance, we defined two strategies. The first evaluates the effect of increasing the number of coupling blocks in the model. The second estimates the effect of learning the weights of the two sub-networks jointly or separately. We therefore trained four separate instances of the InFlow model with the CelebA training images as the in-distribution samples. In both the joint and non-joint learning settings, we evaluated two and four coupling blocks, with K = 2 and K = 4 respectively.
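One possible reading of joint versus separate learning is sketched below: in the separate (non-shared) case each sub-network owns all of its layers, whereas in the joint (shared) case a single network produces an output that is split into the two parts. This interpretation, the names `s` and `t` (borrowed from RealNVP's scale and translation naming), and the layer sizes are all assumptions made for illustration and may differ from the actual InFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(d_in, d_out):
    """Random weight matrix standing in for a trained layer."""
    return rng.normal(0.0, 0.1, size=(d_in, d_out))

def make_non_shared(d_in, d_hidden, d_out):
    """Separate learning: s and t each own a hidden and an output layer."""
    ws1, ws2 = dense(d_in, d_hidden), dense(d_hidden, d_out)
    wt1, wt2 = dense(d_in, d_hidden), dense(d_hidden, d_out)

    def subnet(h):
        s = np.tanh(np.maximum(h @ ws1, 0.0) @ ws2)
        t = np.maximum(h @ wt1, 0.0) @ wt2
        return s, t

    n_params = ws1.size + ws2.size + wt1.size + wt2.size
    return subnet, n_params

def make_shared(d_in, d_hidden, d_out):
    """Joint learning: one network whose output is split into s and t."""
    w1, w2 = dense(d_in, d_hidden), dense(d_hidden, 2 * d_out)

    def subnet(h):
        out = np.maximum(h @ w1, 0.0) @ w2
        return np.tanh(out[:, :d_out]), out[:, d_out:]

    return subnet, w1.size + w2.size

h = rng.normal(size=(5, 4))
separate, n_separate = make_non_shared(4, 16, 4)
joint, n_joint = make_shared(4, 16, 4)
s_sep, t_sep = separate(h)
s_joint, t_joint = joint(h)
```

Both variants return a scale and a translation of the same shape, so they are interchangeable inside a coupling block; under this reading the shared variant simply uses fewer parameters because the hidden layer is common to both outputs.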

Figure 9: Histograms of log-likelihoods for (a) non-shared weights of the two sub-networks at K = 4, (b) shared weights of the two sub-networks at K = 2, and (c) shared weights of the two sub-networks at K = 4.
Dataset           K = 4 (non-shared)   K = 2 (shared)   K = 4 (shared)
MNIST             1                    1                1
FashionMNIST      1                    1                1
SVHN              1                    1                1
CelebA (train)    0.522                0.527            0.526
CelebA (test)     0.523                0.520            0.523
CIFAR10           1                    1                1
Tiny ImageNet     1                    1                1
Noise             1                    1                1
Constant          1                    1                1
Table 12: AUCROC values of our InFlow model trained on CelebA images as in-distribution and compared with several OOD datasets for different numbers of coupling blocks and weight-sharing settings.

Figure 9 shows the histograms of log-likelihoods obtained after training the InFlow model in three of the four architectural settings, and Table 12 shows the corresponding AUCROC values at a fixed p-value. The AUCROC values for the remaining setting, non-shared weights of the two sub-networks with K = 2 coupling blocks and a fixed p-value, can be found in Table 2. The results reveal that in each of the modified architectural settings, our InFlow model assigned lower log-likelihoods to the OOD datasets than to the in-distribution CelebA dataset. We therefore conclude that, in the settings we tested, the internal architecture of the InFlow model does not affect its OOD detection performance. This empirical observation suggests that further effort on refining the design of normalizing flows specifically for OOD detection may be unnecessary.