1 Introduction
Rapid advancement in imaging sensor technology and machine learning (ML) techniques has led to notable breakthroughs in several real-world applications. ML models typically perform well when the training and test data are sampled from the same distribution. However, when applied to inputs that lie far from the training data distribution, i.e. out-of-distribution (OOD) data, these models can fail and their predictions are no longer reliable. This limitation prevents the safe deployment of ML models in life-sensitive real-world setups such as autonomous driving and medical diagnosis. In these setups, plenty of OOD data occurs naturally due to factors such as different image acquisition settings, noise in image scenes, and varied camera parameters. Therefore, reliable deployment of an ML model requires that the model can detect anomalies so that it does not provide high-confidence predictions for such inputs.
Deep generative models are commonly used for OOD detection in an unsupervised setting because of their ability to approximate the density of in-distribution samples as a probability distribution. This allows them to assign lower likelihoods to OOD inputs, rendering such inputs less likely to have been sampled from the in-distribution training set. Generative models such as normalizing flows (Dinh et al., 2015); (Dinh et al., 2017); (Sorrenson et al., 2020); (Kingma et al., 2018); (Grathwohl et al., 2019); (Durkan et al., 2019) are especially suitable candidates for OOD detection as they provide tractable likelihoods. Let us define $X$ as a random variable with input observations $x$ and probability distribution $p_X$, and $Z$ as the random variable with latent observations $z$ and probability distribution $p_Z$. Now, according to the change of variables formula, we can define a series of invertible bijective mappings $f = f_N \circ \dots \circ f_1$ with parameters $\theta$, where $N$ is the number of coupling blocks, to get $z = f(x)$. Therefore, the log-likelihood of the posterior distribution is given as

$\log p_X(x) = \log p_Z(f(x)) + \sum_{i=1}^{N} \log \left| \det \frac{\partial f_i(x_{i-1})}{\partial x_{i-1}} \right|$,  (1)

where $x_{i-1}$ denotes the input to the $i$-th coupling block, with $x_0 = x$ and $x_N = z$.
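As a toy illustration of the change-of-variables likelihood in Eq. 1 (a minimal sketch, not the paper's model: a single invertible elementwise affine map with a standard-normal prior, all names illustrative):

```python
import numpy as np

# Log-likelihood of x under the invertible map f(x) = a*x + b with an N(0, I)
# prior, via log p_X(x) = log p_Z(f(x)) + log|det df/dx| (cf. Eq. 1).
def log_likelihood(x, a=2.0, b=0.5):
    z = a * x + b                       # forward transform f(x)
    d = x.size                          # input dimensionality
    log_pz = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(z ** 2)
    log_det = d * np.log(abs(a))        # log|det df/dx| of an elementwise map
    return log_pz + log_det

x = np.zeros(4)
print(log_likelihood(x))  # ≈ -1.403
```

Maximizing this quantity over the training set is exactly the maximum-likelihood objective referred to below.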
Taking the prior distribution of the latent space to be a multivariate Gaussian, the series of invertible bijective transformations with parameters $\theta$ can transform this Gaussian prior into a significantly more complex posterior distribution. Hence, we can maximize the log-likelihood of the in-distribution samples with respect to the parameters of the invertible transformation and use a likelihood-based threshold to decide whether the log-likelihood of a test sample is below the threshold (classify as OOD) or above it (classify as in-distribution). Additionally, the AUC-ROC (Area Under the Receiver Operating Characteristic Curve) can be calculated to quantify the OOD detection performance of the flow model. However, works such as (Nalisnick et al., 2019) and (Kirichenko et al., 2020) showed that generative models such as normalizing flows can assign higher likelihoods to OOD samples than to in-distribution samples, resulting in overconfident predictions on these OOD inputs, as shown in Figure 1 (b). To interpret this behavior, (Kirichenko et al., 2020) argued that these models only capture low-level statistics such as local pixel correlations rather than high-level semantics, which makes them inefficient at separating in-distribution data from OOD samples.

In this paper, we show that this issue can be solved by extending the normalizing flow design with an attention mechanism, and we validate that the attention mechanism ensures a higher log-likelihood score for in-distribution samples than for OOD samples. We argue that there is little benefit in constructing new normalizing flow designs for the OOD detection task; instead, the focus should be directed towards extending existing flow models with robust attention mechanisms in order to develop a reliable OOD detector. In Section 2, we present the current state of research in OOD detection with a main focus on deep generative models. In Section 3, we develop the representation of our model and provide theoretical evidence for the robustness of our approach along with its underlying assumptions. In Section 4, we conduct several empirical evaluations of our approach in a variety of settings and discuss its effectiveness for OOD detection along with relevant limitations.
2 Related work
(Nguyen et al., 2015) provided initial evidence that ML models make high-confidence predictions for OOD inputs. To overcome this issue, (Ren et al., 2019) presented a likelihood-ratio approach for OOD detection using autoregressive generative models and experimented with a genomics dataset. (Liang et al., 2018) employed input data perturbations to obtain a softmax score from a pretrained model and used a threshold to determine whether the input data is in-distribution or OOD. (DeVries et al., 2018) modified a pre-existing network architecture and added a confidence-estimate branch at the penultimate layer to enhance OOD detection accuracy. (Hendrycks et al., 2018) applied a technique called outlier exposure that teaches a pretrained model to detect unseen OOD examples. (Hendrycks et al., 2018); (Lakshminarayanan et al., 2017); (DeVries et al., 2018) proposed classification models to detect OOD inputs, whereas (Rabanser et al., 2019) utilized a combination of dimensionality reduction techniques and robust test statistics such as Maximum Mean Discrepancy (MMD) to develop a dataset drift detection approach.
(Lee et al., 2018) proposed a confidence estimate based on Mahalanobis distances. (Chen et al., 2020) showed that many existing OOD detection approaches, such as (Liang et al., 2018); (Hendrycks et al., 2018); (Lee et al., 2018), do not work efficiently when small perturbations are added to the in-distribution samples. Hence, they trained their model on adversarial examples of in-distribution data along with the outlier-exposure distribution developed by (Hendrycks et al., 2018). (Akcay et al., 2018) based their OOD detection strategy on the idea that generative adversarial networks (GANs) do not reconstruct OOD samples well.
(Lee et al., 2018) developed a training mechanism that minimizes the Kullback-Leibler (KL) divergence between the predictive distributions of OOD samples and the uniform distribution, providing a measure for confidence assessment. (Hendrycks et al., 2019) used a self-supervised learning approach that is robust in detecting adversarial attacks, while (Serrà et al., 2019) showed that the likelihood scores of generative models are biased towards the complexity of the input data: non-smooth images tend to produce low likelihood scores while smoother samples produce higher ones. (Xiao et al., 2020) studied OOD detection for Variational Autoencoders (VAEs) and proposed a likelihood-regret score that computes the log-likelihood improvement of the VAE configuration maximizing the likelihood of an individual sample. (Morningstar et al., 2021) did not use likelihood-based OOD detection but instead utilized density estimators and one-class Support Vector Machines (SVMs) to differentiate between in-distribution and anomalous inputs.
(Chen et al., 2021) mined informative OOD data to improve OOD detection performance and subsequently generalized to unseen adversarial attacks. (Nalisnick et al., 2020) showed that the high-likelihood behavior of generative models on OOD samples is due to a mismatch between the model's typical set and its region of high probability density, whereas (Choi et al., 2018) introduced a Watanabe–Akaike information criterion (WAIC) based score to differentiate OOD samples from in-distribution samples. (Kobyzev et al., 2020) gave an outline of several normalizing flow methods and discussed their suitability for different real-world applications. (Nalisnick et al., 2019) showed that invertible neural networks (INNs) are especially attractive for OOD detection compared to other generative models such as VAEs and GANs, since they provide an exact computation of the marginal likelihoods and thereby require no approximate inference techniques. Inspired by the work of (Ardizzone et al., 2019), (Ardizzone et al., 2020) utilized the Information Bottleneck (IB) as a loss function for INNs with the RealNVP (Dinh et al., 2017) architecture to provide high-quality uncertainty estimation and OOD detection. (Zisselman et al., 2020) introduced a residual flow architecture for OOD detection that learns the residual distribution from a Gaussian prior.

3 InFlow for OOD detection
Given unlabeled in-distribution samples, the task is to develop a robust normalizing flow model that maximizes the log-likelihood of in-distribution data while assigning lower log-likelihoods to OOD test samples. To achieve this, we explore the following questions: (i) how can the maximum-likelihood objective of our attention-based normalizing flow assign a higher log-likelihood to in-distribution data than to unseen OOD outliers? (see Section 3.1); (ii) how do we define the attention mechanism that makes the normalizing flow model robust? (see Section 3.2); (iii) how do we estimate an effective likelihood-based threshold for classifying test samples as in-distribution or OOD? (see Section 3.3).
3.1 Model definition
(Dinh et al., 2017) presented a normalizing flow architecture based on a sequence of high-dimensional bijective functions stacked together as affine coupling blocks. Each affine coupling block contains a scaling transformation $s$ and a translation transformation $t$. We extend this design by forwarding a function $\beta(x)$ (see also Appendix A.2) to each of the coupling blocks as

$x_i = f_i(x_{i-1}, \beta(x)), \quad i = 1, \dots, N$.  (2)

For simplicity, let us consider a single coupling block and write $y = f_i(x, \beta(x))$. Then, according to the chain rule, the derivative of $y$ w.r.t. $x$ is given as

$\frac{dy}{dx} = \frac{\partial f_i}{\partial x} + \frac{\partial f_i}{\partial \beta(x)} \cdot \frac{\partial \beta(x)}{\partial x}$.  (3)

By defining the function $\beta(x)$ as the attention mechanism that maps the input $x$ to the two integers $\{0, 1\}$, where $\beta(x) = 1$ if $x$ is in-distribution and $\beta(x) = 0$ otherwise, the derivative of $\beta(x)$ w.r.t. $x$ is 0 everywhere except at the decision boundary of $\beta$. Hence, Eq. 3 becomes

$\frac{dy}{dx} = \frac{\partial f_i}{\partial x}$.  (4)
It is observable that each derivative in Eq. 4 is the partial derivative of the output of a single coupling block with respect to the input of the same coupling block. Hence, defining $x_{i-1}$ as the input and $x_i$ as the output of the $i$-th coupling block and extending Eq. 4 to $N$ coupling blocks leads to

$\frac{\partial z}{\partial x} = \prod_{i=1}^{N} \frac{\partial f_i(x_{i-1})}{\partial x_{i-1}}, \quad x_0 = x, \; x_N = z$.  (5)
At every affine coupling block, the input is channel-wise divided into two halves $v_1$ and $v_2$, and $v_1$ is transformed by the affine functions $s_i$ and $t_i$ respectively. We now multiply $\beta(x)$ with the output of the transformations $s_i$ and $t_i$ in each of the coupling blocks. Therefore, the $i$-th coupling block of our model is denoted as:

$u_1 = v_1, \qquad u_2 = v_2 \odot \exp\big(\beta(x) \cdot s_i(v_1)\big) + \beta(x) \cdot t_i(v_1)$.  (6)
where $u_1$ is the part of the output that is replicated from the input and $u_2$ is the other part, which results from applying the affine transformations to $v_2$. Therefore, the Jacobian matrix at the $i$-th coupling block is given as:

$J_i = \begin{pmatrix} I & 0 \\ \frac{\partial u_2}{\partial v_1} & \mathrm{diag}\big(\exp(\beta(x) \cdot s_i(v_1))\big) \end{pmatrix}$.  (7)
As there is no connection between $u_1$ and $v_2$, while $u_1$ is equal to $v_1$, the Jacobian matrix in Eq. 7 is triangular, which means its determinant is just the product of its main diagonal elements. These main diagonal elements of the Jacobian matrices at each coupling block are multiplied to obtain the determinant of our end-to-end InFlow model as:

$\det \frac{\partial z}{\partial x} = \prod_{i=1}^{N} \exp\Big(\beta(x) \sum_j s_i(v_1)_j\Big)$.  (8)
Since the attention mechanism $\beta(x)$ is a common factor, applying the logarithm to Eq. 8 gives:

$\log \left| \det \frac{\partial z}{\partial x} \right| = \beta(x) \sum_{i=1}^{N} \sum_j s_i(v_1)_j$.  (9)
It is to be noted that the model remains invertible even though the mappings $s_i$ and $t_i$ can be arbitrarily non-invertible functions such as deep neural networks. Hence, the parameters $\theta$ of the model can be optimized by minimizing the negative log-likelihood of the posterior, which for a normalizing flow amounts to exact maximum-likelihood estimation. Therefore, the maximum-likelihood objective on the in-distribution samples can be optimized using Adam with gradients of the form

$\nabla_\theta \Big[ -\log p_Z(f_\theta(x)) - \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| \Big]$.  (10)
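The attention-gated coupling block of Eq. 6 and its log-determinant (Eq. 9) can be sketched as follows (a minimal numpy sketch: toy fully-connected `s` and `t` subnetworks stand in for the paper's convolutional ones, and all names are illustrative):

```python
import numpy as np

# One attention-scaled affine coupling block (cf. Eq. 6).
rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
s = lambda v: np.tanh(v @ W_s)   # scaling subnetwork (toy stand-in)
t = lambda v: v @ W_t            # translation subnetwork (toy stand-in)

def coupling(x, beta):
    v1, v2 = np.split(x, 2)                         # channel-wise split
    u1 = v1                                         # identity half
    u2 = v2 * np.exp(beta * s(v1)) + beta * t(v1)   # attention-gated affine half
    log_det = np.sum(beta * s(v1))                  # log|det J| of this block
    return np.concatenate([u1, u2]), log_det

x = rng.normal(size=4)
y0, ld0 = coupling(x, beta=0.0)   # beta = 0: block reduces to the identity
y1, ld1 = coupling(x, beta=1.0)   # beta = 1: ordinary affine coupling
print(np.allclose(y0, x), ld0)    # True 0.0
```

The `beta = 0.0` case illustrates the proposition proved next: with the attention switched off, the block is the identity map and contributes nothing to the log-determinant.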
Proposition:
Consider input samples $x$ and the attention-based normalizing flow model $f$ satisfying $\beta(x) = 0$. Then the model returns the prior distribution $p_Z$ of the latent observations for the posterior distribution $p_X$.
Proof:
Theoretically, we have to prove that $p_X(x) = p_Z(x)$ for all $x$ that satisfy $\beta(x) = 0$. Using the change of variables formula, the forward direction of an invertible normalizing flow is

$p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|$.  (11)
Now, using $\beta(x) = 0$ in Eq. 6 yields $u_1 = v_1$ and $u_2 = v_2$. This conveys that the output of each coupling block is equal to its input, i.e. $x_i = x_{i-1}$. Therefore, by substituting the output of each coupling block with its input, we get $f(x) = x$. Additionally, for the reverse transformation, the change of variables formula gives

$p_Z(z) = p_X(f^{-1}(z)) \left| \det \frac{\partial f^{-1}(z)}{\partial z} \right|$.  (12)
Using the result that $f(x) = x$ for $\beta(x) = 0$ in Eq. 12, we obtain:

$p_X(x) = p_Z(x)$.  (13)
Therefore, Eq. 13 shows that the proposition holds. With the condition $\beta(x) = 0$ satisfied, the posterior log-likelihood of the OOD samples is given as:

$\log p_X(x) = \log p_Z(x)$.  (14)
Furthermore, for the in-distribution samples that satisfy $\beta(x) = 1$, substituting Eq. 9 into Eq. 1 gives the posterior log-likelihood of the in-distribution samples as

$\log p_X(x) = \log p_Z(f(x)) + \sum_{i=1}^{N} \sum_j s_i(v_1)_j$.  (15)
Under the assumption that the maximum-likelihood objective of Eq. 10 has asymptotically converged, we argue that the empirical upper bound of the in-distribution log-likelihood is equal to or larger than the maximum-likelihood estimate (MLE) of the prior log-likelihood of the OOD samples, where the MLE is attained under the condition of Eq. 20 and is not transformed by the maximum-likelihood training of our InFlow model. Additionally, in our implementation, the log-determinant term of Eq. 15 is non-negative, since the subnetworks $s_i$ and $t_i$ are realized by a succession of several simple convolutional layers with ReLU activations (see Table 4 in Appendix A.2). Considering these postulations, it is noticeable from Figure 1 (c) that the log-likelihood of in-distribution samples is significantly higher than that of OOD samples, leading to a robust disentanglement of the posterior log-likelihoods of in-distribution samples from those of OOD samples.

3.2 The attention mechanism
We utilize Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as our attention mechanism, since it is an efficient metric for two-sample kernel tests. Assume we have two distributions $P$ and $Q$ over the sets $\mathcal{X}$ and $\mathcal{Y}$ respectively, a kernel $k$ in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ given by $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$ with feature map $\phi$, an input random variable $X$ with in-distribution observations $\{x_1, \dots, x_m\} \sim P$, and another random variable $Y$ with unknown observations $\{y_1, \dots, y_n\} \sim Q$. Then the MMD between the two distributions $P$ and $Q$ is given by

$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x'}[k(x, x')] - 2\,\mathbb{E}_{x, y}[k(x, y)] + \mathbb{E}_{y, y'}[k(y, y')]$.  (16)
However, calculating $\mathrm{MMD}^2$ on the raw inputs is computationally expensive, given its quadratic time complexity. Therefore, given a subset of the in-distribution observations, we use an encoder function that maps the high-dimensional input spaces $\mathcal{X}$ and $\mathcal{Y}$ into lower-dimensional spaces $\hat{\mathcal{X}}$ and $\hat{\mathcal{Y}}$ with the new observations $\{\hat{x}_1, \dots, \hat{x}_m\}$ and $\{\hat{y}_1, \dots, \hat{y}_n\}$. The details of the encoder architecture and the hyperparameters can be found in Appendix A.7. Now, given the kernel $k$, an unbiased empirical approximation of $\mathrm{MMD}^2$ on the lower-dimensional space is a sum of two U-statistics and a sample average, which is given by (Gretton et al., 2012)

$\widehat{\mathrm{MMD}}^2 = \frac{1}{m(m-1)} \sum_{i \neq j} k(\hat{x}_i, \hat{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\hat{y}_i, \hat{y}_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(\hat{x}_i, \hat{y}_j)$.  (17)
We use $\widehat{\mathrm{MMD}}^2$ as a test statistic with the null hypothesis $H_0: P = Q$ and the alternative hypothesis $H_1: P \neq Q$. Let $\alpha$ be the significance level that gives the maximum permissible probability of falsely rejecting the null hypothesis $H_0$. Under the permutation-based hypothesis test, the set of all encoded observations, i.e. $\{\hat{x}_1, \dots, \hat{x}_m, \hat{y}_1, \dots, \hat{y}_n\}$, is used to generate randomly permuted partitions. After performing the permutations, we compute $\widehat{\mathrm{MMD}}^2$ for each permuted partition and compare it with the observed statistic, as presented in Algorithm 2 of Appendix A.4. We then calculate the p-value as the proportion of permutations in which the permuted statistic is at least as large as the observed one. Finally, we reject the null hypothesis if the p-value is below $\alpha$, in which case we define $\beta(x) = 0$ for the test samples; otherwise $\beta(x) = 1$.

3.3 Likelihood-based threshold for OOD detection
The decision of deep generative models to classify input test samples as in-distribution or OOD naturally rests on a likelihood-based threshold. To realize a robust likelihood-based OOD detector, we assert that the minimum posterior log-likelihood score of an in-distribution sample should preferably be higher than the maximum posterior log-likelihood score of the OOD samples. Hence, we define our likelihood-based threshold $\lambda$ for OOD detection as the maximum posterior log-likelihood of the OOD samples. Moreover, to study the effect of the p-value on the performance of our approach, we relate the significance level $\alpha$ to the confidence bounds of the Gaussian prior distribution and infer several critical values of this likelihood-based threshold based on $\alpha$. Therefore, $\alpha$ can be seen as the proportion of the data within $n_\sigma$ standard deviations of the mean $\mu$, with $n_\sigma$ computable by the inverse of the error function using

$n_\sigma = \sqrt{2} \, \operatorname{erf}^{-1}(\alpha)$.  (18)
Now, let us assume $D$ is the dimension of the latent observation $z$. Then, given the mean $\mu$ and variance $\sigma^2$ of the prior Gaussian distribution, the log-likelihood $\log p_Z(z)$ is

$\log p_Z(z) = -\frac{D}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{d=1}^{D} (z_d - \mu)^2$.  (19)
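The Gaussian prior log-likelihood of Eq. 19 can be sketched directly (a minimal sketch with illustrative defaults):

```python
import numpy as np

# Log-likelihood of a D-dimensional latent vector under an isotropic Gaussian
# prior (cf. Eq. 19); it peaks when every component of z equals the mean mu.
def gaussian_log_likelihood(z, mu=0.0, sigma=1.0):
    d = z.size
    return -0.5 * d * np.log(2 * np.pi * sigma ** 2) \
           - np.sum((z - mu) ** 2) / (2 * sigma ** 2)

z = np.zeros(3)
print(gaussian_log_likelihood(z))  # maximum attainable value for D = 3
```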
The maximum likelihood estimate (MLE) of $\log p_Z(z)$ can then be computed as the asymptotically unbiased upper bound $\lambda$ that needs to satisfy the following condition:

$\lambda \geq \log p_Z(z) \quad \text{for all } z \text{ with } |z_d - \mu| \geq n_\sigma \sigma, \; d = 1, \dots, D$.  (20)
Given $\mu$ and $\sigma$, the values of $z$ should satisfy $|z_d - \mu| = n_\sigma \sigma$ in every dimension $d$ to attain the bound in Eq. 20. Hence, substituting this into Eq. 19 yields:

$\lambda = -\frac{D}{2} \log(2\pi\sigma^2) - \frac{D}{2} n_\sigma^2$.  (21)
Eq. 21 shows that the MLE upper bound $\lambda$ can be interpreted as a robust likelihood-based threshold, since it is data-independent and constrained only by the mean $\mu$ and standard deviation $\sigma$ of the prior distribution as well as the significance level $\alpha$. Hence, for fixed $\mu$ and $\sigma$, the critical values of the threshold can be controlled by changing the significance level $\alpha$. Our likelihood-based threshold $\lambda$ therefore enables us to interpret the robustness of our approach for OOD detection w.r.t. the confidence level of our attention mechanism at a given significance level $\alpha$.
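Relating the significance level to the threshold can be sketched as follows (a hedged sketch: the threshold formula is my reconstruction of Eqs. 18-21 and is labeled as an assumption; the inverse error function is recovered by numerical bisection using only the standard library):

```python
import math

def erf_inv(a, lo=0.0, hi=6.0):
    # Invert the monotone error function erf by bisection on [lo, hi].
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if math.erf(mid) < a else (lo, mid)
    return (lo + hi) / 2

def threshold(alpha, d, mu=0.0, sigma=1.0):
    n_sigma = math.sqrt(2) * erf_inv(alpha)  # cf. Eq. 18
    # Assumed form of the threshold: Gaussian prior log-likelihood evaluated
    # n_sigma standard deviations away from mu in every dimension.
    return -0.5 * d * math.log(2 * math.pi * sigma ** 2) - 0.5 * d * n_sigma ** 2

print(round(math.sqrt(2) * erf_inv(0.95), 3))  # ~1.96 for alpha = 0.95
```

A larger significance level widens the confidence bound and therefore lowers the threshold, which matches the trade-off discussed in Section 4.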
4 Experimental Results
We evaluated the performance of our method for its robustness in a variety of experimental settings. For all experiments, we fixed an in-distribution dataset for training our InFlow model and ran inference with several OOD datasets. The details of the datasets used in our experiments can be found in Appendix A.1, and the hyperparameters used during training and inference are given in Appendix A.3. We assessed our approach by evaluating its robustness on three different categories of outlier data. The first category of test samples is generated by adding different types of visible perturbations to the in-distribution data samples (see Appendix A.10). The second category concerns adversarial attacks on the in-distribution data samples with invisible perturbations (see Appendix A.9). The third category is associated with dataset drifts, where the semantic information and object classes of the test dataset are unseen by our InFlow model during training. For the third category, we present some of the results in Section 4, while further results are shown in Appendix A.8. We also visualized the subnetwork activations as well as the input and latent observations for in-distribution and OOD samples and compared the behavior of our InFlow model with that of a RealNVP model (see Appendix A.11).
Table 1: AUC-ROC scores with CIFAR-10 (train) as the in-distribution dataset.

| Datasets | InFlow | Likelihood Ratio | LR | ODIN | Outlier exposure | IC |
|---|---|---|---|---|---|---|
| MNIST | 1 | 0.961 | 0.996 | 0.997 | 0.999 | 0.991 |
| FashionMNIST | 1 | 0.939 | 0.989 | 0.995 | 0.995 | 0.972 |
| SVHN | 1 | 0.224 | 0.763 | 0.970 | 0.983 | 0.919 |
| CelebA | 1 | 0.668 | 0.786 | 0.965 | 0.858 | 0.677 |
| CIFAR-10 (train) | 0.513 | 0.497 | 0.494 | 0.702 | 0.504 | 0.497 |
| CIFAR-10 (test) | 0.529 | 0.500 | 0.496 | 0.706 | 0.500 | 0.500 |
| Tiny ImageNet | 0.556 | 0.273 | 0.848 | 0.941 | 0.984 | 0.362 |
| Noise | 1 | 0.618 | 0.739 | 1 | 0.995 | 0.878 |
| Constant | 1 | 0.918 | 0.935 | 0.908 | 0.999 | 1 |
Metrics:
We used three different metrics, namely Area Under the Receiver Operating Characteristic Curve (AUC-ROC), False Positive Rate at 95% True Positive Rate (FPR95), and Area Under the Precision-Recall Curve (AUC-PR), to quantitatively evaluate the likelihood-based OOD detection performance of our method compared with other approaches. A receiver operating characteristic curve plots the true positive rate (TPR) against the false positive rate (FPR), showing the performance of the binary classification at different threshold configurations. We assign the binary label 1 as the ground truth for the log-likelihood scores obtained from the training in-distribution samples and the binary label 0 as the ground truth for the log-likelihood scores obtained from the test samples. AUC-PR is computed from the plot of precision against recall with the same ground truth as AUC-ROC, while FPR95 is the false positive rate when the true positive rate is at least 95%.
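The FPR95 metric described above can be sketched as follows (a minimal sketch on synthetic log-likelihood scores; names and values are illustrative):

```python
import numpy as np

# FPR at 95% TPR: pick the largest threshold keeping >= 95% of in-distribution
# scores above it, then measure the fraction of OOD scores that also clear it.
def fpr_at_95_tpr(in_scores, out_scores):
    tau = np.quantile(in_scores, 0.05)              # 95% of in-dist scores >= tau
    return float(np.mean(np.asarray(out_scores) >= tau))

in_ll = np.linspace(-1.0, 0.0, 100)    # synthetic in-distribution scores
ood_ll = np.linspace(-3.0, -1.5, 100)  # synthetic OOD scores
print(fpr_at_95_tpr(in_ll, ood_ll))    # 0.0: no OOD score clears the threshold
```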
Quantitative comparison with state-of-the-art:
We study the performance of our InFlow model by comparing it with other likelihood-based OOD detection methods from the literature using the three metrics mentioned above. The evaluated methods are Likelihood Ratio (Ren et al., 2019), Likelihood Regret (LR) (Xiao et al., 2020), ODIN (Liang et al., 2018), Outlier Exposure (Hendrycks et al., 2018), and Input Complexity (IC) (Serrà et al., 2019). The implementation details of these methods are described in Appendix A.5. Table 1 reports the AUC-ROC scores obtained from our model with CIFAR-10 training data as the in-distribution dataset, compared with the other approaches. It can be observed that, except for the Tiny ImageNet test dataset, our model is robust and reaches the highest possible AUC-ROC scores on each of the evaluated OOD datasets. The AUC-ROC scores of around 0.5 for the CIFAR-10 training and test sets show that our model is unable to distinguish between in-distribution CIFAR-10 samples, which further verifies the robustness of our approach. Therefore, the results convey that our likelihood-based OOD detection is effective in solving the overconfidence issue of normalizing flows. The FPR95 and AUC-PR scores for the same experimental setting are shown in Table 7 in Appendix A.8. Additional results for experiments related to dataset drift can be found in Table 8 in Appendix A.8, where we present the AUC-ROC, FPR95, and AUC-PR scores for the evaluated methods with FashionMNIST as the in-distribution dataset.

CIFAR-10 vs Tiny ImageNet:
The results for the InFlow model trained on CIFAR-10 training data, as shown in Table 1, reveal that our model performs poorly at detecting the Tiny ImageNet test dataset as OOD. We attribute this empirical outcome to two factors. Our first argument relates to the significant overlap in object classes between the two datasets: all 10 object classes of the CIFAR-10 test samples are included in the object classes of the Tiny ImageNet test set, due to which the InFlow model assigns high log-likelihood scores to Tiny ImageNet test samples with overlapping classes. The second factor concerns the influence of image resolution on the log-likelihood score, even when there is no class overlap between samples from the two datasets. The CIFAR-10 test samples are inherently 32×32 RGB images, whereas the Tiny ImageNet samples are of higher resolution and were downsampled to 32×32 to fit our experimental settings. We believe that decreasing the image resolution eliminated significant semantic information from the Tiny ImageNet samples that was important for OOD detection. Hence, we presume the resulting lower-resolution Tiny ImageNet samples were of similar complexity to the CIFAR-10 samples.
Table 2: AUC-ROC scores with CelebA (train) as the in-distribution dataset at five significance levels α, increasing from left (α1) to right (α5).

| Datasets | α1 | α2 | α3 | α4 | α5 |
|---|---|---|---|---|---|
| MNIST | 1 | 1 | 1 | 1 | 1 |
| FashionMNIST | 0.986 | 0.986 | 1 | 1 | 1 |
| SVHN | 0.994 | 0.995 | 0.999 | 1 | 1 |
| CelebA (train) | 0.494 | 0.506 | 0.511 | 0.524 | 0.548 |
| CelebA (test) | 0.495 | 0.506 | 0.514 | 0.525 | 0.548 |
| CIFAR-10 | 0.931 | 0.930 | 0.965 | 0.990 | 0.998 |
| Tiny ImageNet | 0.926 | 0.925 | 0.969 | 0.993 | 0.998 |
| Noise | 1 | 1 | 1 | 1 | 1 |
| Constant | 0.999 | 0.999 | 1 | 1 | 1 |
In/Out classification at the decision boundary:
We anticipate that the decision on whether a test sample is in-distribution or OOD can change when extremely small, invisible perturbations are added to a test sample lying at the decision boundary. Such perturbations can be applied in the form of adversarial attacks, and an OOD detection approach must be robust to such adversarial changes in the in-distribution samples. We performed exhaustive experiments to evaluate the robustness of our InFlow model w.r.t. such attacks; the results and related discussion can be found in Appendix A.9. Our results convey that using the MMD-based hypothesis test as our attention mechanism is highly effective in detecting such adversarial changes, since projecting the probability distribution of the attacked samples into the RKHS stretches its mean embedding further away from the mean embedding of the in-distribution samples.
p-value and its limitations:
We empirically evaluated the effect of the p-value on the robustness of our InFlow model for OOD detection. We trained the model on CelebA training data and ran inference on several datasets, including CelebA, at different significance levels α. Table 2 shows the AUC-ROC scores obtained for the evaluated datasets at the different significance levels. It can be observed that, in general, a smaller α leads to a lower AUC-ROC score on the evaluated datasets. Visual evidence of this behavior is shown in Figure 2 (a), where a number of OOD samples from the CIFAR-10 and Tiny ImageNet test datasets attain high log-likelihood scores comparable to those of in-distribution samples. In contrast, a higher value of α leads to a number of in-distribution CelebA samples being wrongly classified as OOD. This is an apparent limitation of using the p-value: its dichotomy can significantly affect the decision-making of our InFlow model for OOD detection, as no single significance level can be interpreted as correct and foolproof for all types of data variability. This can lead to false positives and false negatives in life-sensitive real-world applications such as medical diagnosis and autonomous driving, where the tolerance for failure is low.
5 Conclusion and Discussion
In this paper, we addressed the issue of overconfident predictions of normalizing flows on outlier inputs, which has largely prevented these models from being deployed as robust likelihood-based outlier detectors. In this regard, we put forth theoretical evidence along with an exhaustive empirical investigation showing that normalizing flows can be highly effective at detecting OOD data if the subnetwork activations at each of their coupling blocks are complemented by an attention mechanism. We claim that, considering the benefits of our approach, developing new high-complexity flow architectures specifically for OOD detection is not beneficial. Instead, future work should focus on enhancing the attention mechanism to improve the robustness of these likelihood-based generative models for OOD detection. One approach in this direction is relating our OOD detection approach to a generative adversarial network (GAN): the normalizing flow model can act as a generator that learns to map input samples into a latent space, while the attention mechanism can be viewed as a discriminator that distinguishes in-distribution samples from OOD samples, thereby improving the performance of the generator for OOD detection. The development of a robust OOD detection framework has a significant societal impact, since such systems are crucial for the deployment of reliable and fair machine learning models in several real-world applications including medical diagnosis and autonomous driving. However, we urge caution when relying solely on OOD detection techniques for such sensitive applications and encourage more research on attention-based normalizing flows for OOD detection to further understand the limitations and mitigate potential risks. To the best of our knowledge, we are the first to overcome the high-confidence issue of normalizing flows for OOD inputs, thereby facilitating methodological progress in this domain.
References
 Dinh et al. [2015] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. In International Conference on Learning Representations (ICLR), 2015.
 Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
 Sorrenson et al. [2020] Peter Sorrenson, Carsten Rother, and Ullrich Köthe. Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). In International Conference on Learning Representations (ICLR), 2020.
 Kingma et al. [2018] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems (NIPS), 2018.
 Grathwohl et al. [2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representations (ICLR), 2019.
 Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural Spline Flows. In Advances in Neural Information Processing Systems (NIPS), 2019.
 Nalisnick et al. [2019] Eric Nalisnick, Akihiro Matsukawa, Yee W. Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? In International Conference on Learning Representations (ICLR), 2019.
 Kirichenko et al. [2020] Polina Kirichenko, Pavel Izmailov, and Andrew G. Wilson. Why Normalizing Flows Fail to Detect Out-of-Distribution Data. In Advances in Neural Information Processing Systems (NIPS), 2020.
 Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research (JMLR), 2012.

 Nguyen et al. [2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 Ren et al. [2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems (NIPS), 2019.
 Liang et al. [2018] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations (ICLR), 2018.
 DeVries et al. [2018] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv preprint arXiv:1802.04865, 2018.

 Hendrycks et al. [2018] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations (ICLR), 2019.
 Hendrycks et al. [2018] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations (ICLR), 2017.
 Lee et al. [2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Advances in Neural Information Processing Systems (NIPS), 2018.
 Chen et al. [2020] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Robust Out-of-distribution Detection for Neural Networks. arXiv preprint arXiv:2003.09711, 2020.
 Ardizzone et al. [2020] Lynton Ardizzone, Radek Mackowiak, Carsten Rother, and Ullrich Köthe. Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification. In Advances in Neural Information Processing Systems (NIPS), 2020.
 Ardizzone et al. [2019] Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W. Pellegrini, Ralf S. Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations (ICLR), 2019.
 Lee et al. [2018] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. In International Conference on Learning Representations (ICLR), 2018.
 Hendrycks et al. [2019] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Advances in Neural Information Processing Systems (NIPS), 2019.
 Serrà et al. [2019] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations (ICLR), 2019.
 Kobyzev et al. [2020] Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing Flows: An Introduction and Review of Current Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
 Xiao et al. [2020] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood Regret: An Out-of-Distribution Detection Score for Variational Autoencoders. In Advances in Neural Information Processing Systems (NIPS), 2020.

 Morningstar et al. [2021] Warren R. Morningstar, Cusuh Ham, Andrew G. Gallagher, Balaji Lakshminarayanan, Alexander A. Alemi, and Joshua V. Dillon. Density of States Estimation for Out-of-Distribution Detection. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
 Chen et al. [2021] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Informative Outlier Matters: Robustifying Out-of-Distribution Detection Using Outlier Mining. In International Conference on Learning Representations (ICLR), 2021.
 Zisselman et al. [2020] Ev Zisselman and Aviv Tamar. Deep Residual Flow for Out of Distribution Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
 Rabanser et al. [2019] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In Advances in Neural Information Processing Systems (NIPS), 2019.
 Nalisnick et al. [2020] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, and Balaji Lakshminarayanan. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality. In International Conference on Learning Representations (ICLR), 2020.
 Guo et al. [2018] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations. In International Conference on Learning Representations (ICLR), 2018.
 Choi et al. [2018] Hyunsun Choi, Eric Jang, and Alexander A. Alemi. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv preprint arXiv:1810.01392, 2018.

 Liu et al. [2018] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
 LeCun et al. [2010] Yann LeCun, Corinna Cortes, and Christopher J. Burges. MNIST Handwritten Digit Database. 2010.
 Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Krizhevsky et al. [2019] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.
 Pouransari et al. [2019] Tiny ImageNet Visual Recognition Challenge. https://tinyimagenet.herokuapp.com/.

 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, et al. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
 Hasselt et al. [2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
 Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In 33rd International Conference on Machine Learning (ICML), 2016.
 Zhang et al. [2020] Chaoning Zhang, Philipp Benz, Tooba Imtiaz, and In So Kweon. CD-UAP: Class Discriminative Universal Adversarial Perturbation. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
 Hendrycks et al. [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations (ICLR), 2019.
 Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems (NIPS), 2017.
 Akcay et al. [2018] Samet Akcay, Amir A. Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised Anomaly Detection via Adversarial Training. In Asian Conference on Computer Vision (ACCV), 2018.
Appendix A Experimental Settings
A.1 Datasets
We evaluated our model on publicly available datasets: CelebA [Liu et al., 2018], MNIST [LeCun et al., 2010], FashionMNIST [Xiao et al., 2017], SVHN [Netzer et al., 2011], CIFAR 10 [Krizhevsky et al., 2019], and Tiny ImageNet [Pouransari et al., 2019]. To maintain consistency, all inputs were converted to RGB images of a fixed resolution; datasets of a different original resolution were resized accordingly. For grayscale datasets such as MNIST and FashionMNIST, we replicated the single-channel pixel values into three RGB channels. In addition to the publicly available datasets, we synthetically generated two new datasets, Noise and Constant, to evaluate our method at the boundaries of feature complexity. For the Noise dataset, we sampled a random integer from the valid pixel range independently for every pixel in all three RGB channels to obtain an RGB noise image, while for the Constant dataset, we randomly sampled three different integers from the same range and assigned one to every pixel of each RGB channel, respectively. Each dataset was normalized to a fixed range before being used in our experiments. Table 3 shows the original size of each dataset, the total number of images, and the split into training and test sets. We keep the training set empty for all datasets that were not used to train the InFlow model. Figure 3 (a)-(h) shows nine resized example images for each dataset, with Noise having the highest feature complexity and Constant the least.
Dataset  Actual size  Total images  Training set  Test set 

MNIST  70,000  60,000  10,000  
FashionMNIST  70,000    10,000  
SVHN  99,289    26,032  
CelebA  202,599  150,000  52,599  
CIFAR 10  60,000  50,000  10,000  
Tiny ImageNet  120,000    10,000  
Noise  10,000    10,000  
Constant  10,000    10,000 
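As an illustration, the two synthetic datasets can be generated in a few lines. The 8-bit pixel range [0, 255] and the 32x32 resolution are assumptions here, since the exact values are elided in the text above:

```python
import numpy as np

def make_noise_image(size=32, rng=None):
    """Noise dataset: an independent random integer for every pixel and channel."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8)

def make_constant_image(size=32, rng=None):
    """Constant dataset: one random integer per RGB channel, broadcast to all pixels."""
    if rng is None:
        rng = np.random.default_rng(0)
    channel_values = rng.integers(0, 256, size=3, dtype=np.uint8)
    return np.broadcast_to(channel_values, (size, size, 3)).copy()

def normalize(img):
    """Scale 8-bit pixel values into the unit range used for training."""
    return img.astype(np.float32) / 255.0
```

A Noise image has maximal per-pixel entropy while a Constant image has none, which is why they bracket the feature-complexity spectrum of the evaluated datasets.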
A.2 Model
As mentioned in Section 3.1, the central component of our normalizing flow model, InFlow, is an affine coupling block inspired by [Dinh et al., 2017]. Figure 4 shows the architecture of a single coupling block with its input and output. The input is split channel-wise into two parts: one containing a single channel of the input RGB image, and the other containing the remaining two channels.
This subpart is transformed by the learnable scale and translation functions, each of which we formulated as a neural network whose architectural details are given in the table below. The network consists of two convolutional layers with ReLU as the non-linear activation function; the resolution of the input and output features is unchanged by these layers. As ReLU is defined as the last layer of our scale and translation subnetworks, we empirically ensured that their outputs satisfy the intended constraint.
Operation  In-channel  Out-channel  Kernel  Stride  Padding

Conv2D + ReLU  1  256  (3,3)  (1,1)  (1,1) 
Conv2D + ReLU  256  1  (3,3)  (1,1)  (1,1) 
In addition to the learnable scale and translation functions, we extended the design with an attention mechanism that is element-wise multiplied with the outputs of the scale and translation networks, as discussed in Section 3.1. These coupling blocks are stacked together to form our end-to-end InFlow framework, as shown in Figure 5. The attention mechanism is passed to each coupling block through a conditional node. Additionally, we randomly permute the variables between two subsequent coupling blocks so that the ordering of the two subparts changes across the channel dimension, ensuring that every channel is transformed by the scale and translation subnetworks at some coupling block of the InFlow framework.
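The forward and inverse passes of a single affine coupling block can be sketched as follows. The toy `s_net` and `t_net` functions and the scalar attention value used in the test are illustrative placeholders, not the actual InFlow subnetworks:

```python
import numpy as np

def coupling_forward(x1, x2, s_net, t_net, attention):
    """One affine coupling block (RealNVP-style): x1 passes through unchanged,
    x2 is scaled and shifted by networks conditioned on x1. The attention term
    multiplies the subnetwork outputs element-wise, as in InFlow."""
    s = s_net(x1) * attention
    t = t_net(x1) * attention
    y2 = x2 * np.exp(s) + t
    log_det = s.sum()  # log|det J| of the affine transform
    return x1, y2, log_det

def coupling_inverse(y1, y2, s_net, t_net, attention):
    """Exact inverse of the block above."""
    s = s_net(y1) * attention
    t = t_net(y1) * attention
    x2 = (y2 - t) * np.exp(-s)
    return y1, x2
```

Note that setting the attention term to zero gates the block into an identity map with zero log-determinant, which is the mechanism InFlow exploits to pass OOD inputs through unchanged.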
A.3 Training details
We performed three different types of experiments to evaluate the robustness of our model to OOD inputs, as described in Appendix A.6. Details of the attention-mechanism setup are given in Appendix A.7. The InFlow model was trained in a comparable setting for each experiment: we used the Adam optimizer with an initial learning rate of , momentum parameters and , and an exponential decay rate of . The model was trained on a single NVIDIA Tesla V100 GPU for 200 epochs, each epoch containing 100 training steps with a batch size of samples.
A.4 The Pseudocode
The pseudocode for training the InFlow model is described in Algorithm 1.
The pseudocode for using the trained InFlow model for OOD detection is described in Algorithm 2.
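Algorithms 1 and 2 are not reproduced here; the sketch below illustrates only the generic likelihood-thresholding step that such a detector builds on. Choosing the threshold as the 5th percentile of in-distribution log-likelihoods (so roughly 95% of in-distribution samples are accepted) is our assumption for illustration, not the paper's exact decision rule:

```python
import numpy as np

def fit_threshold(in_dist_loglik, tpr=0.95):
    """Choose a log-likelihood threshold that accepts a `tpr` fraction of
    in-distribution samples; everything below it is flagged as OOD."""
    return np.quantile(in_dist_loglik, 1.0 - tpr)

def is_ood(test_loglik, threshold):
    """Flag samples whose log-likelihood falls below the threshold."""
    return test_loglik < threshold
```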
A.5 Implementing the state of the art
We quantitatively compared the robustness of our InFlow model with other popular OOD detection methods, namely ODIN [Liang et al., 2018], Likelihood Ratio [Ren et al., 2019], Outlier Exposure [Hendrycks et al., 2018], Likelihood Regret (LR) [Xiao et al., 2020], and Input Complexity (IC) [Serrà et al., 2019], using AUCROC, FPR95, and AUCPR scores. During evaluation, we fixed the same in-distribution samples for our approach and all competing methods. For ODIN, Outlier Exposure, and LR, we followed the meta-parameter settings recommended in the code documentation of these methods. For the IC method, we followed the implementation provided by [Xiao et al., 2020]: we first computed the input complexity as the length of the binary string produced by a PNG-based lossless compression algorithm and subtracted this complexity from the negative log-likelihood score. For the Likelihood Ratio method, we computed the scores by subtracting the log-likelihood of the background model from the log-likelihood of the main model. The background model was trained on perturbed input data in which the input semantics were corrupted with random pixel values; the fraction of perturbed pixels was of the total number of pixels for the model trained on FashionMNIST and for the model trained on CIFAR 10.
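The IC score described above can be sketched as follows; we use `zlib` (the codec underlying PNG's lossless compression) as a stand-in for the PNG-based compressor, so the exact bit counts differ from the original implementation:

```python
import zlib
import numpy as np

def complexity_bits(img_uint8):
    """Length in bits of a lossless compression of the image; a proxy for
    the input data complexity L(x) used by the IC method."""
    return 8 * len(zlib.compress(img_uint8.tobytes(), 9))

def ic_score(nll_bits, img_uint8):
    """Input Complexity score: negative log-likelihood (in bits) minus the
    compressed length of the input. Higher scores indicate OOD."""
    return nll_bits - complexity_bits(img_uint8)
```

Intuitively, the subtraction corrects for the fact that simple images are assigned high likelihoods by flow models regardless of their semantics.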
A.6 Evaluating robustness to OOD inputs
The current literature offers no consensus on a single definition for estimating the robustness of a machine learning model. Ideally, a model can be called robust if it is able to distinguish OOD inputs from in-distribution samples. It is therefore essential to specify what kind of OOD outlier a robust model should be able to detect. To this end, we identify three scenarios that are generally used for assessing the robustness of an OOD detection approach:

Dataset drift: A robust OOD detection model should detect input test samples that do not contain any of the object classes present in the in-distribution samples. These test samples exhibit a complete shift in the semantic information of the data, and ideally, the model should not make high-confidence predictions on such data, since it has not observed anything similar during training (see Appendix A.8).

Adversarial attacks: A robust OOD detection model should be aware of adversarial attacks on the in-distribution samples. In these attacks, the magnitude of the perturbation is kept low, so that the attacked sample is indistinguishable from in-distribution samples, yet the change is enough to trick a model (e.g., a classifier) into interpreting the attacked input with high confidence. Such attacks therefore have the potential to significantly degrade the performance of ML models (see Appendix A.9).

Visible perturbations: We conduct a further robustness test in which we corrupt the in-distribution samples with different types of perturbation and test the performance of our model at different levels of corruption severity. The corruptions are visible at all severity levels, even though the inherent semantic information is preserved (see Appendix A.10).
Operation  In-channel  Out-channel  Kernel  Stride  Padding
Conv2D + ReLU  3  64  (4,4)  (2,2)  (0,0) 
Conv2D + ReLU  64  128  (4,4)  (2,2)  (0,0) 
Conv2D + ReLU  128  256  (4,4)  (2,2)  (0,0) 
Conv2D + ReLU  256  512  (4,4)  (2,2)  (0,0) 
Operation  In-features  Out-features
Flatten + Linear  2048  32
Encoder network architecture for experiments with adversarial attacks.
Operation  In-channel  Out-channel  Kernel  Stride  Padding
Conv2D + ReLU  3  64  (4,4)  (2,2)  (0,0) 
Conv2D + ReLU  64  128  (4,4)  (2,3)  (0,0) 
Conv2D + ReLU  128  256  (5,5)  (2,2)  (0,0) 
Conv2D + ReLU  256  512  (4,5)  (2,2)  (0,0) 
Conv2D + ReLU  512  512  (5,2)  (2,2)  (0,0) 
Operation  In-features  Out-features
Flatten + Linear  4096  32
A.7 Dimensionality reduction
MMD as a test statistic has considerable time and memory complexity for high-dimensional data. To overcome this, we used an encoder to reduce the number of features per sample. The first of the tables above shows the encoder architecture used for the experiments that do not involve adversarial attacks; the second shows the encoder used for evaluating robustness to adversarial attacks. Note that the final representation consists of just 32 features per sample for both encoder architectures. For the MMD computation, we defined the exponential quadratic, or Radial Basis Function (RBF), kernel.
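A minimal NumPy sketch of the RBF-kernel MMD statistic and the permutation test used in this section; the kernel bandwidth `gamma`, the sample counts, and the permutation count in the test are illustrative placeholders, since the exact values are elided above:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Exponential quadratic (RBF) kernel matrix between two sample sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def permutation_pvalue(x, y, n_perm=200, gamma=1.0, seed=0):
    """p-value for H0 'x and y come from the same distribution': the fraction
    of label permutations whose MMD is at least as large as the observed one
    (with add-one smoothing)."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if mmd2(pooled[idx[:n]], pooled[idx[n:]], gamma) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```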
The RBF kernel is positive definite, so applying it to the input samples yields a smooth estimate in the RKHS, which aids the interpretation of the mean embeddings of the respective input distributions. For all experiments involving permutation tests, the significance p-value was set at and the number of permutations at . The average p-value was estimated over a batch of in-distribution samples and test samples.
A.8 Robustness to dataset drift
CelebA:
We used the CelebA training dataset as in-distribution and examined the influence of the significance p-value on the OOD detection performance of our InFlow model. The AUCPR and FPR95 scores for this setting are shown in Table 6. As we lower the significance p-value, we obtain worse AUCPR and FPR95 scores, a behavior also discussed in Table 2. Since a smaller p-value is more statistically stringent, a lower mean p-value is required to reject the null hypothesis that the test samples are in-distribution.
FPR95  

Datasets  
MNIST  0  0  0  0  0 
FashionMNIST  0.010  0.010  0  0  0 
SVHN  0.001  0.001  0  0  0 
CelebA (train)  0.051  0.049  0.049  0.046  0.045 
CelebA (test)  0.051  0.049  0.047  0.042  0.041 
CIFAR 10  0.006  0.007  0.005  0.002  0 
Tiny ImageNet  0.008  0.009  0.005  0.002  0 
Noise  0  0  0  0  0 
Constant  0  0  0  0  0 
AUCPR  
Datasets  
MNIST  1  1  1  1  1 
FashionMNIST  0.990  0.990  1  1  1 
SVHN  0.996  0.996  0.999  1  1 
CelebA (train)  0.488  0.517  0.531  0.559  0.599 
CelebA (test)  0.487  0.517  0.530  0.556  0.599 
CIFAR 10  0.970  0.970  0.986  0.996  0.999 
Tiny ImageNet  0.912  0.910  0.966  0.993  0.998 
Noise  1  1  1  1  1 
Constant  0.999  0.999  1  1  1 
CIFAR 10:
We demonstrated the efficiency of our InFlow model for detecting dataset drift in Section 4. We further present the FPR95 and AUCPR scores (see Table 7) for the setting where CIFAR 10 was fixed as in-distribution and the model was evaluated on the other datasets. The interpretation of the FPR95 and AUCPR scores is analogous to that of the AUCROC scores in Table 1. Note that the FPR95 and AUCPR values for the Tiny ImageNet dataset are poor compared to the other evaluated datasets, which achieve the best possible values. We attribute this behavior to the overlapping object classes and the influence of image resolution, as explained in Section 4.
FPR95  

Datasets  InFlow  Likelihood Ratio  LR  ODIN  Outlier exposure  IC 
MNIST  0  0.150  0  0.014  0.006  0 
FashionMNIST  0  0.200  0  0.028  0.027  0 
SVHN  0  0.960  0  0.155  0.076  0 
CelebA  0  0.720  0  0.181  0.517  0 
CIFAR 10 (train)  0.047  0.032  0.031  0.941  0.945  0.042 
CIFAR 10 (test)  0.049  0.035  0.035  0.950  0.950  0.050 
Tiny ImageNet  0.139  0.575  0  0.412  0.076  0.075 
Noise  0  0  0  0  0.008  0 
Constant  0  0  0  0.413  0.001  0 
AUCPR  
Datasets  InFlow  Likelihood Ratio  LR  ODIN  Outlier exposure  IC 
MNIST  1  0.910  0.969  0.997  0.992  0.942 
FashionMNIST  1  0.909  0.914  0.994  0.976  0.849 
SVHN  1  0.403  0.489  0.956  0.908  0.774 
CelebA  1  0.794  0.532  0.946  0.562  0.355 
CIFAR 10 (train)  0.567  0.511  0.503  0.509  0.170  0.503 
CIFAR 10 (test)  0.561  0.506  0.497  0.500  0.165  0.499 
Tiny ImageNet  0.515  0.129  0.498  0.950  0.935  0.126 
Noise  1  0.189  0.888  1  0.934  0.407 
Constant  1  0.656  0.673  0.871  0.998  1 
FashionMNIST:
We present further results in which our model was trained on the FashionMNIST dataset and evaluated on the other datasets. Table 8 provides the AUCROC, FPR95, and AUCPR results of our InFlow model trained with the in-distribution FashionMNIST dataset. Our method detects the dataset drift and provides robust results for detecting OOD samples from all evaluated datasets. In this setting we also do not observe inferior performance when detecting Tiny ImageNet as OOD, since the object classes in Tiny ImageNet are mutually exclusive from those in FashionMNIST.
AUCROC  

Datasets  InFlow  Likelihood Ratio  LR  Outlier exposure  IC 
MNIST  1  0.978  1  1  0.769 
FashionMNIST (train)  0.554  0.510  0.503  0.529  0.506 
FashionMNIST (test)  0.549  0.494  0.494  0.522  0.499 
SVHN  1  0.981  1  0.891  0.927 
CelebA  1  0.958  1  0.823  0.378 
CIFAR 10  1  0.995  1  0.814  0.611 
Tiny ImageNet  1  1  1  0.817  0.254 
Noise  1  1  1  0.890  0.531 
Constant  1  0.935  0.716  0.999  1 
FPR95  
Datasets  InFlow  Likelihood Ratio  LR  Outlier exposure  IC 
MNIST  0  0.175  0  0  0 
FashionMNIST (train)  0.040  0.971  0.019  0.012  0.048 
FashionMNIST (test)  0.044  0.975  0.025  0.015  0.050 
SVHN  0  0.020  0  0.575  0 
CelebA  0  0.053  0  0.601  0.227 
CIFAR 10  0  0.010  0  0.637  0.075 
Tiny ImageNet  0  0  0  0.639  0.4 
Noise  0  0  0  0.040  0 
Constant  0  0.075  0.25  0.001  0 
AUCPR  
Datasets  InFlow  Likelihood Ratio  LR  Outlier exposure  IC 
MNIST  1  0.692  1  1  0.756 
FashionMNIST (train)  0.602  0.473  0.501  0.548  0.507 
FashionMNIST (test)  0.595  0.467  0.493  0.546  0.496 
SVHN  1  0.487  1  0.871  0.971 
CelebA  1  0.525  0.999  0.834  0.612 
CIFAR 10  1  0.354  1  0.823  0.892 
Tiny ImageNet  1  0.693  1  0.780  0.408 
Noise  1  0.693  1  0.908  0.464 
Constant  1  0.653  0.839  0.997  1 
A.9 Robustness to adversarial attacks
We are interested in examining whether the InFlow model is robust to adversarial attacks. Adversarial attacks are tiny perturbations of in-distribution samples that are imperceptible to human observers but severely impact the performance of a deep learning model in many real-world applications. Methods for generating adversarial attacks commonly target supervised learning models. Here, instead, we focused on fooling a Reinforcement Learning (RL) agent playing Atari games. In RL, multiple actions (labels) may be considered correct or appropriate; generated attacks should therefore not merely change individual predictions but worsen the overall performance of the agent.
Training RL Agents:
The task of playing Atari games using RL agents is well established in the RL literature. Detecting attacks on Atari images therefore serves as a useful evaluation of our model and as a proxy use case for more complex applications such as autonomous driving. In preparation for the adversarial attacks, three agents were trained, one for each of three Atari games: Enduro, RoadRunner, and Breakout. They were implemented as Dueling Deep Q-Networks (DQN) [Wang et al., 2016] trained on observations, with optimization performed using the Double DQN algorithm [Hasselt et al., 2016].
Over a number of time steps, an agent interacts with an environment; in our case, the environment returns a grayscale image of the game at each step. These so-called observations are preprocessed further. To retain recent history, the agent is presented not only with the current image but also with the last three images; all images are stacked to create the observation on which the agent bases its decision for the next step.
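The frame-stacking step can be sketched as below. The 84x84 resolution and the float conversion are assumptions based on conventional Atari preprocessing, since the exact observation shape is elided above:

```python
import numpy as np
from collections import deque

class FrameStack:
    """Keeps the last `k` grayscale observations and stacks them so the
    agent sees a short history, as described in the text."""
    def __init__(self, k=4, shape=(84, 84)):
        self.k = k
        # Start from blank history; deque drops the oldest frame automatically.
        self.frames = deque([np.zeros(shape, np.float32)] * k, maxlen=k)

    def push(self, frame):
        self.frames.append(frame.astype(np.float32))
        return self.observation()

    def observation(self):
        # Stack along a new leading axis: (k, H, W), oldest frame first.
        return np.stack(self.frames, axis=0)
```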
The primary metric for evaluating an agent's performance during training is the average episode reward over 100 episodes. [Mnih et al., 2015] provided the scores of a human expert player for these games; if the agent's average episode reward was significantly higher, the agent was considered reliable. Our empirical evaluation showed that the trained agents performed reliably on their tasks and are therefore suitable attack targets.
Adversarial attack algorithm:
Given three such reliable Atari agents, the goal was to find a perturbation vector, added to the observations, that fools the RL agent into erroneous predictions. The perturbations are restricted to stay within a set range of . The main purpose of the attack was to lower the average episode reward of the agents. An attack was considered successful if the produced perturbations are unrecognisable to humans, lower the overall performance of the agent, and lead to predictions that are unfit for the current state of the environment. We utilized the class-discriminative universal adversarial perturbation (CD-UAP) algorithm introduced by [Zhang et al., 2020] to calculate perturbations that fulfill these criteria. In our case, however, the perturbations are not universal but input-specific: with this alteration, the algorithm produces a new perturbation for each observation. Although a universal perturbation is more complicated to calculate, it would be easily detected by InFlow if added repeatedly to the in-distribution data.
Dataset creation:
There were 10,000 unattacked observations for each of the Atari agents for Breakout, Enduro, and RoadRunner. We produced adversarially attacked samples at the perturbation magnitudes listed in Table 9, which served as the test samples during inference. Note that not all of the calculated perturbations passed the three requirements above; the number of samples actually included in the dataset for each magnitude is also listed in Table 9.
Unattacked  0.0008  0.0009  0.001  0.002  0.003  0.004  0.005  

Breakout  10,000  3626  3735  3816  4464  8216  8487  10,000 
Enduro  10,000  6410  6730  6405  8758  7105  9336  10,000 
RoadRunner  10,000  8700  8882  9065  9867  9980  9993  10,000 
Training and Results:
We trained three separate instances of our InFlow model, one per game, with the original unattacked observations as in-distribution samples. Table 10 shows the AUCROC, FPR95, and AUCPR values obtained for the adversarially attacked samples at different perturbation magnitudes. The scores reveal that our model is robust in detecting adversarial examples for the Breakout and Enduro games, with an overall tendency toward better scores for larger perturbations, as expected. For RoadRunner, however, we do not observe comparable AUCROC scores even for the largest perturbations. A reason for this behavior could be the larger action space of RoadRunner compared to the two other games: using the CD-UAP algorithm, we determined multiple favorable actions for the current observation and shifted the predictions away from them. For Breakout and Enduro, the different actions are considerably more contradictory, and hence the perturbations need to include more distinct features to fool the agent. Additionally, we believe that the created perturbations can be disguised more easily in the more detailed images of RoadRunner.
AUCROC  

unattacked  0.0008  0.0009  0.001  0.002  0.003  0.004  0.005  
Breakout  0.524  0.985  0.994  0.983  0.991  0.977  0.998  0.973 
Enduro  0.500  0.938  0.880  0.918  0.928  0.981  0.959  0.939 
RoadRunner  0.524  0.598  0.629  0.617  0.604  0.606  0.639  0.654 
FPR95  
unattacked  0.0008  0.0009  0.001  0.002  0.003  0.004  0.005  
Breakout  0.048  0.001  0  0  0  0  0  0 
Enduro  0.050  0.007  0.011  0.009  0.006  0.001  0.003  0.003 
RoadRunner  0.049  0.022  0.019  0.019  0.035  0.039  0.036  0.030 
AUCPR  
unattacked  0.0008  0.0009  0.001  0.002  0.003  0.004  0.005  
Breakout  0.549  0.971  0.980  0.971  0.981  0.982  0.992  0.983 
Enduro  0.500  0.989  0.979  0.985  0.990  0.995  0.994  0.993 
RoadRunner  0.556  0.761  0.788  0.781  0.782  0.776  0.802  0.810 
A.10 Robustness to visible perturbations
Data generation:
To evaluate the robustness of our InFlow model on visibly perturbed images, we generated corrupted versions of the CIFAR 10 test samples using 19 different types of perturbation at five separate severity levels, as proposed in [Hendrycks et al., 2019]. These 19 perturbation types were chosen because several existing ML models become unstable at accurately predicting object classes once samples undergo such perturbations. Figure 7 shows two CIFAR 10 test examples with the five levels of perturbation severity applied to them. The perturbation types fall into four broad categories (noise, blur, weather, and digital effects) that together cover a broad spectrum of real-world perturbations.
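As an illustration of severity-scaled corruption, the sketch below applies Gaussian noise at five levels; the per-level noise scales are placeholders, not the benchmark's exact constants:

```python
import numpy as np

def gaussian_noise(img, severity, rng=None):
    """Apply Gaussian noise at increasing severity (1-5). The per-level
    standard deviations below are illustrative, not the benchmark's values."""
    scales = [0.04, 0.06, 0.08, 0.09, 0.10]  # assumed, one per severity level
    if rng is None:
        rng = np.random.default_rng(0)
    x = img.astype(np.float32) / 255.0
    x = x + rng.normal(0.0, scales[severity - 1], size=x.shape)
    return (np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)
```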
Perturbation Type  Severity 1  Severity 2  Severity 3  Severity 4  Severity 5 

Gaussian Noise  0.229  0.260  0.705  0.851  1.0 
Impulse Noise  0.314  0.566  0.901  1.0  1.0 
Shot Noise  0.343  0.272  0.358  0.571  0.951 
Speckle Noise  0.344  0.239  0.317  0.584  1.0 
Defocus Blur  0.541  0.628  0.909  0.978  1.0 
Gaussian Blur  0.540  0.909  0.978  1.0  1.0 
Glass Blur  0.514  0.466  0.565  0.464  0.601 
Motion Blur  0.602  0.955  1.0  1.0  1.0 
Zoom Blur  0.815  0.910  0.977  0.978  0.978 
Snow  1.0  1.0  1.0  1.0  1.0 
Spatter  0.560  0.809  1.0  0.593  0.806 
Frost  1.0  1.0  1.0  1.0  1.0 
Fog  1.0  1.0  1.0  1.0  1.0 
Brightness  1.0  1.0  1.0  1.0  1.0 
Contrast  1.0  1.0  1.0  1.0  1.0 
Saturate  1.0  1.0  1.0  1.0  1.0 
Elastic Transform  0.571  0.649  0.791  0.884  0.860 
Pixelate  0.500  0.525  0.550  0.547  0.729 
JPEG Compression  0.485  0.509  0.509  0.509  0.491 
Training:
We trained our InFlow model on the original CIFAR 10 training images and tested its ability to detect increasing severity levels of corruption. For training, we fixed the significance p-value of our attention mechanism at , kept the hyperparameters as set in Appendix A.3, and used the encoder architecture described in Appendix A.7. We then calculated the AUCROC scores shown in Table 11 for all 19 perturbation types at increasing severity levels, keeping the CIFAR 10 training samples as in-distribution.
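The AUCROC values reported here can be computed directly from per-sample OOD scores with the rank-based (Mann-Whitney) formulation; the convention that higher scores indicate OOD is an assumption of this sketch:

```python
import numpy as np

def auroc(scores_in, scores_ood):
    """AUC-ROC for 'higher score means more OOD': the probability that a
    randomly chosen OOD sample scores above a randomly chosen in-distribution
    sample (Mann-Whitney U formulation; ties are not handled specially)."""
    scores = np.concatenate([scores_in, scores_ood])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks
    rank_sum = ranks[len(scores_in):].sum()
    n_in, n_ood = len(scores_in), len(scores_ood)
    return (rank_sum - n_ood * (n_ood + 1) / 2) / (n_in * n_ood)
```

Perfectly separated score distributions give 1.0, reversed separation gives 0.0, and overlapping distributions fall in between, matching the interpretation of the tables in this appendix.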
Results:
Notably, our model detects OOD samples related to weather effects such as frost, fog, and brightness at all severity levels with perfect AUCROC scores. For the spatter effect, we observe a different color pattern at severity levels 4 and 5 than at levels 1 to 3, which explains the drop in performance of the InFlow model when the severity increases from level 3 to level 4. For perturbations in the noise and blur categories, an increasing level of corruption results in increased AUCROC scores, as more severely perturbed test samples move significantly further away from the in-distribution. However, for shot noise, speckle noise, and glass blur, there is a dip in AUCROC scores when the severity increases from 1 to 2; we argue that at severity level 2, the perturbed samples happen to resemble in-distribution samples. Overall, our model is robust in detecting several types of visible perturbation on in-distribution samples as OOD, and its performance improves as the severity of the corruption increases.
A.11 Visualization of subnetwork activations and latent space
We visually compared the behavior of in-distribution and OOD samples under our model with that under a RealNVP-based flow model [Dinh et al., 2017]. We visualized the subnetwork activations, the output of each coupling block, and the latent space, using a total of two coupling blocks for both RealNVP and our model and non-shared weights for the scale and translation activations at each coupling block. Figure 8 shows the visualization results for RealNVP and for our InFlow model, comparing in-distribution CelebA with the other, OOD, datasets.
Figure 8 (a) shows that the InFlow model transforms the in-distribution CelebA samples into a complex latent space after training. Additionally, Figures 8 (b), (d), and (f) reveal that the RealNVP model, with no attention mechanism in place, also transforms the OOD datasets into a much more complex distribution in the latent space. This provides visual evidence that the RealNVP model learns the local pixel interactions of the input space for both in-distribution and OOD samples, and therefore cannot distinguish the semantic information of an in-distribution sample from that of an OOD outlier. As a consequence, the RealNVP model increases the log-likelihood of both in-distribution and OOD samples. Figures 8 (c), (e), and (g) show the visualizations of our InFlow model for the different OOD datasets. Noticeably, our model reproduces the semantic input features in its latent space, and a color change occurs at the output of the coupling blocks as a consequence of the permutation. For grayscale images such as MNIST, the color change is not visible since all three channels contain the same grayscale image. The depiction in Figure 8 therefore provides visual evidence that the InFlow model directly transfers the input to its latent space for OOD samples, while it transforms the in-distribution samples into a much more complex distribution.
A.12 InFlow architecture vs robustness
To analyze how altering the architecture of our model affects OOD detection performance, we defined two strategies. The first evaluates the effect of increasing the number of coupling blocks on OOD detection performance. The second estimates the effect of jointly or separately learning the weights of the scale and translation subnetworks. We therefore trained four separate instances of the InFlow model with the CelebA training images as in-distribution samples, evaluating two and four coupling blocks (K = 2 and K = 4) in both the shared (joint) and non-shared weight settings.
K = 4 (nonshared)  K = 2 (shared)  K = 4 (shared)  
MNIST  1  1  1 
FashionMNIST  1  1  1 
SVHN  1  1  1 
CelebA (train)  0.522  0.527  0.526 
CelebA (test)  0.523  0.520  0.523 
CIFAR10  1  1  1 
Tiny ImageNet  1  1  1 
Noise  1  1  1 
Constant  1  1  1 
Figure 9 shows the histograms of log-likelihoods obtained after training InFlow in the four architectural settings above, and Table 12 shows the AUCROC values of these settings at . The AUCROC values for the non-shared weights of the scale and translation subnetworks with coupling blocks and p-value can be found in Table 2. The results reveal that in each of the modified architectural settings, our InFlow model assigned lower log-likelihoods to the OOD datasets than to the in-distribution CelebA dataset. We therefore conclude that the internal architecture of the InFlow model does not affect its OOD detection performance. This is a significant empirical observation, suggesting that further tuning of the internal design of normalizing flows is unnecessary, at least for OOD detection.