1 Introduction
Anomaly detection with localization (AD) is a growing area of research in computer vision with many practical applications, e.g., industrial inspection [4], road traffic monitoring [20], and medical diagnostics [44]. However, the common supervised AD [32] is not viable in practical applications for several reasons. First, it requires labeled data, which is costly to obtain. Second, anomalies are usually rare long-tail examples that have a low probability of being acquired by sensors. Lastly, consistent labeling of anomalies is subjective and requires extensive domain expertise, as illustrated in Figure 1 with industrial cable defects.

Given these limitations of supervised AD, a more appealing approach is to collect only unlabeled anomaly-free images for the train dataset, as in Figure 1 (top row). Then, any deviation from the anomaly-free images is classified as an anomaly. Such a data setup with a low rate of anomalies is generally considered to be unsupervised [4]. Hence, the AD task can be reformulated as an out-of-distribution detection (OOD) task with the AD objective.

While OOD for low-dimensional industrial sensors (e.g., power-line or acoustic sensors) can be accomplished using common nearest-neighbor or more advanced clustering methods [10], it is less trivial for high-resolution images. Recently, convolutional neural networks (CNNs) have gained popularity for extracting semantic information from images into downsampled feature maps [4]. Though feature extraction using CNNs has relatively low complexity, the post-processing of the feature maps is far from real-time in the state-of-the-art unsupervised AD methods [8].

To address this complexity drawback, we propose the CFLOW-AD model based on conditional normalizing flows. CFLOW-AD is agnostic to feature map spatial dimensions similar to CNNs, which leads to higher accuracy metrics as well as lower computational and memory requirements. We present the main idea behind our approach with a toy OOD detector example in Figure 1. The AD model learns the distribution of anomaly-free image patches with probability density function $p_X(x)$. Our translation-equivariant model is trained to transform the original distribution with density $p_X(x)$ into a Gaussian distribution with density $p_Z(z)$. Finally, this model separates in-distribution patches from out-of-distribution patches using a threshold $\tau$ computed as the Euclidean distance from the distribution mean.

2 Related work
We review models that employ the data setup from Figure 1 and provide experimental results for the popular MVTec dataset [4] with factory defects or the Shanghai Tech Campus (STC) dataset [21] with surveillance camera videos (for a comprehensive review of existing AD methods, we refer readers to the Ruff et al. [31] and Pang et al. [26] surveys). We highlight research related to the more challenging task of pixel-level anomaly localization (segmentation) rather than the simpler image-level anomaly detection.
Napoletano et al. [25] propose to use CNN feature extractors followed by principal component analysis and K-means clustering for AD. Their feature extractor is a ResNet-18 [13] pretrained on the large-scale ImageNet dataset [16]. Similarly, SPADE [7] employs a WideResNet-50 [43] with multi-scale pyramid pooling that is followed by K-nearest-neighbor clustering. Unfortunately, clustering is slow at test time with high-dimensional data. Thus, parallel convolutional methods are preferred in real-time systems.
Numerous methods are based on the natural idea of generative modeling. Unlike models with discriminatively-pretrained feature extractors [25, 7], generative models learn the distribution of anomaly-free data and, therefore, are able to estimate a proxy metric for anomaly scores even for unseen images with anomalies. Recent models employ generative adversarial networks (GANs) [35, 36] and variational autoencoders (VAEs) [3, 38].

Fully-generative models [35, 36, 3, 38] are directly applied to images in order to estimate pixel-level probability density and compute per-pixel reconstruction errors as anomaly score proxies. These fully-generative models are unable to estimate the exact data likelihoods [6, 24] and do not perform better than the traditional methods [25, 7] according to the MVTec survey in [4]. Recent works [34, 15] show that these models tend to capture only low-level correlations instead of relevant semantic information. To overcome the latter drawback, the hybrid DFR model [37] uses a pretrained feature extractor with multi-scale pyramid pooling followed by a convolutional autoencoder (CAE). However, the DFR model is unable to estimate the exact likelihoods.
Another line of research employs a student-teacher framework [5, 33, 41]. The teacher is a pretrained feature extractor, and the student is trained to estimate a scoring function for AD. Unfortunately, such frameworks underperform compared to state-of-the-art models.
Patch SVDD [42] and CutPaste [19] introduce self-supervised pretraining schemes for AD. Moreover, Patch SVDD proposes a novel method to combine multi-scale scoring masks into a final anomaly map. Unlike the nearest-neighbor search in [42], CutPaste estimates anomaly scores using an efficient Gaussian density estimator. While self-supervised pretraining can be helpful in uncommon data domains, Schirrmeister et al. [34] argue that large natural-image datasets such as ImageNet can be more representative for pretraining than small application-specific datasets such as the industrial MVTec [4].

The state-of-the-art PaDiM [8] proposes a surprisingly simple yet effective approach to anomaly localization. Similarly to [37, 7, 42], this approach relies on an ImageNet-pretrained feature extractor with multi-scale pyramid pooling. However, instead of the slow test-time clustering in [7] or nearest-neighbor search in [42], PaDiM uses the well-known Mahalanobis distance metric [23] as an anomaly score. The metric parameters are estimated for each feature vector of the pooled feature maps. PaDiM was inspired by Rippel et al. [29], who first advocated this measure for anomaly detection without localization.

DifferNet [30] uses a promising class of generative models called normalizing flows (NFLOWs) [9] for image-level AD. The main advantage of NFLOW models is the ability to estimate exact likelihoods for OOD compared to other generative models [35, 36, 3, 38, 37]. In this paper, we extend the DifferNet approach to the pixel-level anomaly localization task using our CFLOW-AD model. In contrast to the RealNVP [9] architecture with global average pooling in [30], we propose to use conditional normalizing flows [2] to make CFLOW-AD suitable for low-complexity processing of multi-scale feature maps in the localization task. We develop CFLOW-AD with the following contributions:

Our theoretical analysis shows why a multivariate Gaussian assumption is a justified prior in previous models and why a more general NFLOW framework objective converges to similar results with less compute.

We propose to use conditional normalizing flows for unsupervised anomaly detection with localization using a computational- and memory-efficient architecture.

We show that our model outperforms the previous state-of-the-art in both detection and localization due to the unique properties of the proposed CFLOW-AD model.
3 Theoretical background
3.1 Feature extraction with Gaussian prior
Consider a CNN trained for a classification task. Its parameters $\theta$ are usually found by minimizing the Kullback–Leibler (KL) divergence between the joint train data distribution $p^*(x,y)$ and the learned model distribution $\hat{p}(x,y;\theta)$, where $(x,y)$ is an input-label pair for supervised learning. Typically, the parameters $\theta$ are initialized with values sampled from a Gaussian distribution [12], and the optimization process is regularized [17] as

$\hat{\theta} = \arg\min_{\theta} D_{KL}\left[p^*(x,y)\,\|\,\hat{p}(x,y;\theta)\right] + \lambda\,\Omega(\theta),$   (1)

where $\Omega(\theta)$ is a regularization term and $\lambda$ is a hyperparameter that defines the regularization strength.
3.2 A case for Mahalanobis distance
With the same MVG prior assumption, Lee et al. [18] recently proposed to model the distribution of feature vectors by an MVG density function and to use the Mahalanobis distance [23] as a confidence score in CNN classifiers. Inspired by [18], Rippel et al. [29] adopt the Mahalanobis distance for the anomaly detection task, since this measure determines the distance of a particular feature vector to its MVG distribution. Consider an MVG distribution with density function $p(x)$ for a random variable $x \in \mathbb{R}^D$ defined as

$p(x) = (2\pi)^{-D/2} \det(\Sigma)^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)},$   (2)

where $\mu \in \mathbb{R}^D$ is a mean vector and $\Sigma \in \mathbb{R}^{D \times D}$ is a covariance matrix of the true anomaly-free density. Then, the Mahalanobis distance $M(x)$ is calculated as

$M(x) = \sqrt{(x-\mu)^{T}\Sigma^{-1}(x-\mu)}.$   (3)

Since the true anomaly-free data distribution is unknown, the mean vector $\mu$ and covariance matrix $\Sigma$ in (3) are replaced by the estimates $\hat{\mu}$ and $\hat{\Sigma}$ calculated from the empirical train dataset. At the same time, the density function of anomaly data has different $\mu$ and $\Sigma$ statistics, which allows separating out-of-distribution and in-distribution feature vectors using $M(x)$ from (3).
3.3 Relationship with the flow framework
Dinh et al. [9] introduce a class of generative probabilistic models called normalizing flows. These models apply the change-of-variables formula to fit an arbitrary density $p_X(x)$ by a tractable base distribution with density $p_Z(z)$ and a bijective invertible mapping $g^{-1}: X \to Z$. Then, the log-likelihood of any $x$ can be estimated by

$\log \hat{p}_X(x;\theta) = \log p_Z(z) + \log\left|\det J\right|,$   (4)

where the sample $z = g^{-1}(x;\theta)$ is usually from a standard MVG distribution and the matrix $J = \nabla_x\, g^{-1}(x;\theta)$ is the Jacobian of the bijective invertible flow model parameterized by the vector $\theta$.

The flow model is a set of basic layered transformations with tractable Jacobian determinants. For example, the Jacobian log-determinant of the coupling layers in RealNVP [9] is a simple sum of the layer's diagonal elements. These models are optimized using stochastic gradient descent by maximizing the log-likelihood in (4). Equivalently, the optimization can be done by minimizing the reverse KL divergence [27], where $\hat{p}_Z(z;\theta)$ is the model prediction and $p_Z(z)$ is the target density. The loss function for this objective is defined as

$\mathcal{L}(\theta) = D_{KL}\left[\hat{p}_Z(z;\theta)\,\|\,p_Z(z)\right].$   (5)

If $x$ is distributed according to the Section 3.1 MVG assumption, we can express (5) as a function of the Mahalanobis distance using its definition from (3) as

$\mathcal{L}(\theta) = \mathbb{E}\left[\frac{M^2(x)}{2} - \frac{\|z\|_2^2}{2} - \log\left|\det J\right|\right],$   (6)

where $\|z\|_2^2$ is a squared Euclidean distance of a sample $z$ (detailed proof in Appendix A). Then, the loss in (6) converges to zero when the likelihood contribution term $\log\left|\det J\right|$ of the model compensates the difference between the squared Mahalanobis distance for $x$ from the target density and the squared Euclidean distance for $z$.

This normalizing flow framework can estimate the exact likelihoods of an arbitrary distribution with density $p_X(x)$, while the Mahalanobis distance is limited to MVG distributions only. For example, CNNs trained with L1 regularization would have a Laplace prior [11], or no particular prior in the absence of regularization. Moreover, we introduce conditional normalizing flows in the next section and show that they are more compact in size and have a fully-convolutional parallel architecture compared to the [7, 8] models.
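A minimal sketch of the change-of-variables estimate in (4) with a single RealNVP-style affine coupling layer; the linear maps `w_s` and `w_t` stand in for the learned subnetworks $s(\cdot)$, $t(\cdot)$ and are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def std_normal_logpdf(z):
    """log p_Z(z) for a standard multivariate Gaussian base distribution."""
    return -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)

def affine_coupling_forward(x, w_s, w_t):
    """One coupling layer mapping x -> z.

    The first half of x passes through unchanged and parameterizes an
    affine transform of the second half; the Jacobian is triangular, so
    its log-determinant is simply the sum of the scale terms.
    """
    d = x.size // 2
    x1, x2 = x[:d], x[d:]
    s, t = np.tanh(w_s @ x1), w_t @ x1     # bounded scales for stability
    z = np.concatenate([x1, x2 * np.exp(s) + t])
    log_det = np.sum(s)
    return z, log_det

def flow_loglik(x, w_s, w_t):
    """Change-of-variables estimate: log p_X(x) = log p_Z(z) + log|det J|."""
    z, log_det = affine_coupling_forward(x, w_s, w_t)
    return std_normal_logpdf(z) + log_det

rng = np.random.default_rng(1)
w_s, w_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x = rng.normal(size=4)
print(flow_loglik(x, w_s, w_t))
```

Stacking many such layers (with permutations between them) yields an expressive yet exactly invertible model, which is what makes the likelihood in (4) tractable.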
4 The proposed CFLOW-AD model
4.1 CFLOW encoder for feature extraction
We implement a feature extraction scheme with multi-scale feature pyramid pooling similar to recent models [7, 8]. We define the discriminatively-trained CNN feature extractor as an encoder in Figure 2. The CNN encoder maps image patches into feature vectors that contain relevant semantic information about their content. CNNs accomplish this task efficiently due to their translation-equivariant architecture with shared kernel parameters. In our experiments, we use ImageNet-pretrained encoders following Schirrmeister et al. [34], who show that large natural-image datasets can serve as a representative distribution for pretraining. If large application-domain unlabeled data is available, the self-supervised pretraining from [42, 19] can be a viable option.
One important aspect of a CNN encoder is its effective receptive field [22]. Since the effective receptive field is not strictly bounded, the size of the encoded patches cannot be exactly defined. At the same time, anomalies have various sizes and shapes and, ideally, have to be processed with variable receptive fields. To address the ambiguity between CNN receptive fields and anomaly variability, we adopt the common multi-scale feature pyramid pooling approach. Figure 2 shows that the feature vectors are extracted by $K$ pooling layers. Pyramid pooling captures both local and global patch information with small and large receptive fields in the first and last CNN layers, respectively. For convenience, we number the pooling layers in last-to-first layer order.
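The multi-scale idea can be illustrated with a toy pooling routine; a real encoder taps $K$ intermediate CNN layers rather than re-pooling a single map, so this is only a shape-level sketch under that simplifying assumption:

```python
import numpy as np

def avg_pool2d(fmap, k):
    """Non-overlapping k x k average pooling over a (C, H, W) feature map."""
    c, h, w = fmap.shape
    return fmap[:, :h - h % k, :w - w % k].reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def pyramid_features(fmap, levels=(1, 2, 4)):
    """Pool one feature map at several scales.

    Each spatial position at each scale yields one feature vector with its
    own effective receptive field, mimicking multi-scale pyramid pooling.
    """
    return [avg_pool2d(fmap, k) for k in levels]

fmap = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
feats = pyramid_features(fmap)
print([f.shape for f in feats])   # (2, 8, 8), (2, 4, 4), (2, 2, 2)
```

Coarser levels summarize larger patches, so small anomalies are best seen at the fine scales and large anomalies at the coarse ones.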
4.2 CFLOW decoders for likelihood estimation
We use the general normalizing flow framework from Section 3.3 to estimate the likelihoods of feature vectors. Hence, our generative decoder model aims to fit the true density by an estimated parameterized density from (1). However, in the general framework the feature vectors are assumed to be independent of their spatial location. To increase the efficacy of distribution modeling, we propose to incorporate a spatial prior into the model using a conditional flow framework. In addition, due to the multi-scale feature pyramid pooling setup, we model the densities with $K$ independent decoder models.
Our conditional normalizing flow (CFLOW) decoder architecture is presented in Figure 2. We generate conditional vectors $c$ using a 2D form of the conventional positional encoding (PE) [39]. Each conditional vector contains $\sin$ and $\cos$ harmonics that are unique to its spatial location. We extend the unconditional flow framework to CFLOW by concatenating the intermediate vectors inside the decoder coupling layers with the conditional vectors, as in [2].
Then, the $k$th CFLOW decoder contains a sequence of conventional coupling layers with the additional conditional input. Each coupling layer comprises a fully-connected layer, a softplus activation, and output vector permutations. Usually, the conditional extension does not increase model size, since the conditional vectors are small relative to the feature vectors; we use the same fixed conditional vector size in all our experiments. Our CFLOW decoder has a translation-equivariant architecture, because it slides along the feature vectors extracted from the intermediate feature maps with kernel parameter sharing. As a result, both the encoder and the decoders have convolutional translation-equivariant architectures.
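A sketch of a 2D sinusoidal positional encoding for the conditional vectors; the exact harmonic layout in CFLOW-AD may differ, so the even split between row and column harmonics here is an assumption:

```python
import numpy as np

def positional_encoding_2d(h, w, dim):
    """2D sinusoidal positional encoding.

    Half of `dim` encodes the row index and half the column index with
    sin/cos harmonics as in [39]. Returns an (h, w, dim) array, one
    condition vector per spatial location.
    """
    assert dim % 4 == 0
    d = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,)
    ys = np.arange(h)[:, None] * freqs                  # (h, d/2)
    xs = np.arange(w)[:, None] * freqs                  # (w, d/2)
    pe = np.zeros((h, w, dim))
    pe[:, :, 0:d:2] = np.sin(ys)[:, None, :]
    pe[:, :, 1:d:2] = np.cos(ys)[:, None, :]
    pe[:, :, d::2] = np.sin(xs)[None, :, :]
    pe[:, :, d + 1::2] = np.cos(xs)[None, :, :]
    return pe

pe = positional_encoding_2d(4, 6, 8)
print(pe.shape)  # (4, 6, 8)
```

Because the encoding is a fixed deterministic function of position, it adds the spatial prior without any extra learned parameters.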
We train CFLOW-AD using a maximum likelihood objective, which is equivalent to minimizing the loss defined by

$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\|z_i\|_2^2}{2} - \log\left|\det J_i\right|\right] + \frac{D}{2}\log(2\pi),$   (7)

where the random variable $z_i = g^{-1}(v_i; c_i, \theta)$, $J_i$ is the Jacobian of the CFLOW decoder, and the expectation operation in (5) is replaced by an empirical train dataset of size $N$. For brevity, we drop the $k$th scale notation. The derivation is given in Appendix B.
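The empirical loss in (7) reduces to a few array operations once the $z_i$ and $\log|\det J_i|$ are available from the decoder; a NumPy sketch with random stand-in values in place of real decoder outputs:

```python
import numpy as np

def cflow_loss(z_batch, log_det_batch):
    """Empirical maximum-likelihood loss of (7).

    Averages ||z||^2 / 2 - log|det J| over the mini-batch and adds the
    constant Gaussian normalizer D/2 * log(2*pi) for a standard MVG base.
    """
    n, d = z_batch.shape
    nll = 0.5 * np.sum(z_batch**2, axis=1) - log_det_batch
    return np.mean(nll) + 0.5 * d * np.log(2 * np.pi)

rng = np.random.default_rng(2)
z = rng.normal(size=(8192, 4))        # stand-in decoder outputs, one scale
log_det = 0.1 * rng.normal(size=8192)
print(cflow_loss(z, log_det))
```

Minimizing this value pushes the mapped feature vectors toward the standard Gaussian base distribution while accounting for the volume change of the flow.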
After training the decoders for all scales using (7), we estimate the test dataset likelihoods as

$\log \hat{p}(v_i;\theta) = \log p_Z(z_i) + \log\left|\det J_i\right|.$   (8)
Next, we convert the likelihoods to probabilities for each $k$th scale using (8) and normalize them to be in the $[0, 1]$ range. Then, we upsample the per-scale probability maps to the input image resolution ($H \times W$) using bilinear interpolation. Finally, we calculate the anomaly score maps by aggregating all upsampled probabilities.

Model  Train  Test  Memory

SPADE [7]  
PaDiM [8]  
Ours 
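The likelihood-to-anomaly-map aggregation of Section 4.2 can be sketched as follows; nearest-neighbor upsampling and the min-max normalization here are simplifying assumptions in place of the bilinear interpolation and normalization used in the paper:

```python
import numpy as np

def upsample(p, out_h, out_w):
    """Nearest-neighbor upsampling (a simple stand-in for bilinear)."""
    h, w = p.shape
    return np.kron(p, np.ones((out_h // h, out_w // w)))

def anomaly_map(loglik_maps, out_h, out_w):
    """Aggregate per-scale log-likelihood maps into one anomaly score map.

    Convert to probabilities in [0, 1], upsample to the input resolution,
    sum the scales, and flip sign so that high score = likely anomaly.
    """
    agg = np.zeros((out_h, out_w))
    for ll in loglik_maps:
        p = np.exp(ll - ll.max())                       # probabilities in (0, 1]
        p = (p - p.min()) / (p.max() - p.min() + 1e-12)  # min-max to [0, 1]
        agg += upsample(p, out_h, out_w)
    return agg.max() - agg                              # low likelihood -> high score

maps = [np.array([[0.0, -5.0], [0.0, 0.0]]), np.zeros((4, 4))]
score = anomaly_map(maps, 8, 8)
print(score.shape)   # (8, 8)
```

The low-likelihood patch in the coarse map dominates the final score after the sign flip, which is the desired localization behavior.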
4.3 Complexity analysis
Table 1 analytically compares the complexity of CFLOW-AD and recent state-of-the-art models with the same pyramid pooling setup.
SPADE [7] performs nearest-neighbor clustering between each test point and a gallery of train data. Therefore, the method requires a large memory allocation for the gallery and a clustering procedure that is typically slow compared to convolutional methods.
PaDiM [8] estimates train-time statistics (the inverses of the covariance matrices) to calculate Mahalanobis distances at test time. Hence, it has low computational complexity, but it stores in memory the estimated matrices for every feature map position of each $k$th pooling layer.
5 Experiments
Table 2: Ablation study of CFLOW-AD design choices on MVTec (mean AUROC, %, ± standard deviation over four runs).

Encoder  WRN50  WRN50  WRN50  WRN50  WRN50  R18  R18  MNetV3  MNetV3
# of CL  4  8  8  8  8  8  8  8  8
# of PL  2  2  3  3  3  3  3  3  3
H×W  256  256  256  512  512  256  512  256  512
Type  CFLOW  CFLOW  CFLOW  CFLOW  UFLOW  CFLOW  CFLOW  CFLOW  CFLOW
Bottle  97.28±0.03  97.24±0.03  98.76±0.01  98.98±0.01  98.83±0.01  98.47±0.03  98.64±0.01  98.74  98.92
Cable  95.71±0.01  96.17±0.07  97.64±0.04  97.12±0.06  95.29±0.04  96.75±0.04  96.07±0.06  97.62  97.49
Capsule  98.17±0.02  98.19±0.05  98.98±0.00  98.64±0.02  98.40±0.12  98.62±0.02  98.28±0.05  98.89  98.75
Carpet  98.50±0.01  98.55±0.01  99.23±0.01  99.25±0.01  99.24±0.00  99.00±0.01  99.29±0.00  98.64  99.00
Grid  93.77±0.05  93.88±0.16  96.89±0.02  98.99±0.02  98.74±0.00  93.95±0.04  98.53±0.01  94.75  98.81
Hazelnut  98.08±0.01  98.13±0.02  98.82±0.01  98.89±0.01  98.88±0.01  98.81±0.01  98.41±0.01  98.88  99.00
Leather  98.92±0.02  99.00±0.06  99.61±0.01  99.66±0.00  99.65±0.00  99.45±0.01  99.51±0.02  99.50  99.64
Metal Nut  96.72±0.03  96.72±0.06  98.56±0.03  98.25±0.04  98.16±0.03  97.59±0.05  96.42±0.03  98.36  98.78
Pill  98.46±0.02  98.46±0.01  98.95±0.00  98.52±0.05  98.20±0.08  98.34±0.02  97.80±0.05  98.69  98.44
Screw  94.98±0.06  95.28±0.06  98.10±0.05  98.86±0.02  98.78±0.01  97.38±0.03  98.40±0.03  98.04  99.09
Tile  95.52±0.02  95.66±0.06  97.71±0.02  98.01±0.01  97.98±0.02  95.10±0.02  95.80±0.10  96.07  96.48
Toothbrush  98.02±0.03  97.98±0.00  98.56±0.02  98.93±0.00  98.89±0.00  98.44±0.02  99.00±0.01  98.09  98.80
Transistor  93.09±0.28  94.05±0.11  93.28±0.40  80.52±0.13  76.28±0.14  92.71±0.23  83.34±0.46  97.79  95.22
Wood  90.65±0.10  90.59±0.07  94.49±0.03  96.65±0.01  96.56±0.02  93.51±0.03  95.00±0.04  92.24  94.96
Zipper  96.80±0.02  97.01±0.05  98.41±0.09  99.08±0.02  99.06±0.01  97.71±0.06  98.98±0.01  97.50  99.07
Average  96.31  96.46  97.87  97.36  96.86  97.06  96.90  97.59  98.16
Table 3: Per-class MVTec comparison grouped by task and encoder; WideResNet50 localization entries report (AUROC, AUPRO) pairs.

Task  Localization  Detection  Detection  Localization
Encoder  ResNet18  EffNet-B4  WideResNet50  WideResNet50
Class/Model  CutPaste  Ours  CutPaste  Ours  CutPaste  Ours  SPADE  PaDiM  Ours
Bottle  97.6  98.64  98.3  100.00  100.0  100.0  (98.4, 95.5)  (98.3, 94.8)  (98.98, 96.80) 
Cable  90.0  96.75  80.6  97.62  96.2  97.59  (97.2, 90.9)  (96.7, 88.8)  (97.64, 93.53) 
Capsule  97.4  98.62  96.2  93.15  95.4  97.68  (99.0, 93.7)  (98.5, 93.5)  (98.98, 93.40) 
Carpet  98.3  99.29  93.1  98.20  100.0  98.73  (97.5, 94.7)  (99.1, 96.2)  (99.25, 97.70) 
Grid  97.5  98.53  99.9  98.97  99.1  99.60  (93.7, 86.7)  (97.3, 94.6)  (98.99, 96.08) 
Hazelnut  97.3  98.81  97.3  99.91  99.9  99.98  (99.1, 95.4)  (98.2, 92.6)  (98.89, 96.68) 
Leather  99.5  99.51  100.0  100.00  100.0  100.0  (97.6, 97.2)  (98.9, 88.8)  (99.66, 99.35) 
Metal Nut  93.1  97.59  99.3  98.45  98.6  99.26  (98.1, 94.4)  (97.2, 85.6)  (98.56, 91.65) 
Pill  95.7  98.34  92.4  93.02  93.3  96.82  (96.5, 94.6)  (95.7, 92.7)  (98.95, 95.39) 
Screw  96.7  98.40  86.3  85.94  86.6  91.89  (98.9, 96.0)  (98.5, 94.4)  (98.86, 95.30) 
Tile  90.5  95.80  93.4  98.40  99.8  99.88  (87.4, 75.9)  (94.1, 86.0)  (98.01, 94.34) 
Toothbrush  98.1  99.00  98.3  99.86  90.7  99.65  (97.9, 93.5)  (98.8, 93.1)  (98.93, 95.06) 
Transistor  93.0  97.69  95.5  93.04  97.5  95.21  (94.1, 87.4)  (97.5, 84.5)  (97.99, 81.40) 
Wood  95.5  95.00  98.6  98.59  99.8  99.12  (88.5, 97.4)  (94.9, 91.1)  (96.65, 95.79) 
Zipper  99.3  98.98  99.4  96.15  99.9  98.48  (96.5, 92.6)  (98.5, 95.9)  (99.08, 96.60) 
Average  96.0  98.06  95.2  96.75  97.1  98.26  (96.0, 91.7)  (97.5, 92.1)  (98.62, 94.60) 
5.1 Experimental setup
We conduct unsupervised anomaly detection (image-level) and localization (pixel-level segmentation) experiments using the MVTec [4] dataset with factory defects and the STC [21] dataset with surveillance camera videos. The code is implemented in PyTorch [28] with the FrEIA library [1] used for generative normalizing flow modeling.

The industrial MVTec dataset comprises 15 classes with a total of 3,629 images for training and 1,725 images for testing. The train dataset contains only anomaly-free images without any defects. The test dataset contains both images with various types of defects and defect-free images. Five classes contain different types of textures (carpet, grid, leather, tile, wood), while the remaining 10 classes represent various types of objects. We resize MVTec images without cropping according to the specified input resolution and apply augmentation rotations during the training phase only.

The STC dataset contains 274,515 training and 42,883 testing frames extracted from surveillance camera videos and divided into 13 distinct university campus scenes. Because STC is significantly larger than MVTec, we experiment only with 256×256 resolution and apply the same preprocessing and augmentation pipeline as for MVTec.
We compare CFLOW-AD with the models reviewed in Section 2 using the MVTec and STC datasets. We use widely-used threshold-agnostic evaluation metrics for localization: the area under the receiver operating characteristic curve (AUROC) and the area under the per-region-overlap curve (AUPRO) [4]. AUROC is skewed towards large-area anomalies, while the AUPRO metric ensures that both large and small anomalies are equally important in localization. Image-level anomaly detection is reported by AUROC only.
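For reference, the threshold-agnostic AUROC can be computed directly from anomaly scores via the rank-sum (Mann-Whitney U) identity; this toy sketch is not the evaluation code used in the paper:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that a random anomalous sample scores
    higher than a random normal one (ties counted as 0.5)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # All pairwise comparisons; fine for a sketch, O(n^2) memory.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect separation
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity more efficiently for pixel-level masks.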
We run each CFLOW-AD experiment four times on MVTec and report the mean of the evaluation metric and, if specified, its standard deviation. For the larger STC dataset, we conduct only a single experiment. As in other methods, we train a separate CFLOW-AD model for each MVTec class and each STC scene. All our models use the same training hyperparameters: Adam optimizer with 2e-4 learning rate, 100 train epochs, 32 mini-batch size for the encoder, and cosine learning rate annealing with 2 warm-up epochs. Since our decoders are agnostic to feature map dimensions and have low memory requirements, we train and test CFLOW-AD decoders with an 8,192 (32×256) mini-batch size for feature vector processing. During the train phase, 8,192 feature vectors are randomly sampled from 32 random feature maps. Similarly, 8,192 feature vectors are sequentially sampled during the test phase. The feature pyramid pooling setup for the ResNet-18 and WideResNet-50 encoders is identical to PaDiM [8]. The effects of other architectural hyperparameters are studied in the ablation study.

5.2 Ablation study
Table 2 presents a comprehensive study of various design choices for CFLOW-AD on the MVTec dataset using the AUROC metric. In particular, we experiment with the input image resolution (H×W), the encoder architecture (ResNet-18 [13], WideResNet-50 [43], MobileNetV3-L [14]), the type of normalizing flow (unconditional (UFLOW) or conditional (CFLOW)), the number of flow coupling layers (# of CL), and the number of pooling layers (# of PL).
Our study shows that increasing the number of the decoder's coupling layers from 4 to 8 gives on average a 0.15% gain due to more accurate distribution modeling. An even higher 1.4% AUROC improvement is achieved when processing 3-scale feature maps (layers 1, 2 and 3) compared to 2-scale only (layers 2 and 3). The additional feature map (layer 1) with larger spatial scale provides more precise spatial semantic information. The conditional normalizing flow (CFLOW) is on average 0.5% better than the unconditional one (UFLOW) due to effective encoding of the spatial prior. Finally, the larger WideResNet-50 outperforms the smaller ResNet-18 by 0.81%. MobileNetV3-L, however, could be a good design choice for both fast inference and high AUROC.
Importantly, we find that the optimal input resolution is not consistent among MVTec classes. Classes with macro objects, such as cable or pill, tend to benefit from smaller-scale processing (256×256), which effectively translates to larger CNN receptive fields. The majority of classes perform better with 512×512 inputs and, effectively, smaller receptive fields. Finally, we discover that the transistor class has even higher AUROC with images resized to 128×128. Hence, we report results with the highest-performing input resolution settings in the Section 5.3 comparisons.
Table 4: Average MVTec results for the best published models.

Model  Detection AUROC  Localization AUROC  Localization AUPRO
DifferNet [30]  94.9  –  –
DFR [37]  –  95.0  91.0
SVDD [42]  92.1  95.7  –
SPADE [7]  85.5  96.0  91.7
CutPaste [19]  97.1  96.0  –
PaDiM [8]  97.9  97.5  92.1
CFLOW-AD (ours)  98.26  98.62  94.60
5.3 Quantitative comparison
Table 4 summarizes average MVTec results for the best published models. CFLOW-AD with the WideResNet-50 encoder outperforms the state-of-the-art by 0.36% AUROC in detection, and by 1.12% AUROC and 2.5% AUPRO in localization, respectively. Table 3 contains a per-class comparison for the subset of models grouped by task and type of encoder architecture. CFLOW-AD is on par with or significantly exceeds the best models in the per-class comparison with the same encoder setups.
Table 5 presents a high-level comparison of the best recently published models on the STC dataset. CFLOW-AD outperforms the state-of-the-art SPADE [7] by 0.73% AUROC in anomaly detection and PaDiM [8] by 3.28% AUROC in anomaly localization, respectively.
Note that our CFLOW-AD models in Tables 3-4 use variable input resolution as discussed in the ablation study: 512×512, 256×256 or 128×128 depending on the MVTec class. We use a fixed 256×256 input resolution in Table 5 for the large STC dataset to decrease training time. Other reference hyperparameters in Tables 4-5 are set as follows: WideResNet-50 encoder with 3-scale pooling layers and conditional normalizing flow decoders with 8 coupling layers.
5.4 Qualitative results
Figure 3 visually shows examples from MVTec and the corresponding CFLOW-AD predictions. The top row shows ground truth masks, including examples with and without anomalies. Then, our model produces anomaly score maps (middle row) using the architecture from Figure 2. Finally, we show the predicted segmentation masks with the threshold selected to maximize the F1-score.
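The F1-maximizing threshold selection can be sketched as an exhaustive sweep over candidate score values; this is a simplified stand-in for whatever search the paper's evaluation actually uses:

```python
import numpy as np

def best_f1_threshold(scores, labels):
    """Pick the score threshold tau that maximizes pixel-level F1.

    scores: anomaly scores; labels: 1 = anomalous pixel, 0 = normal.
    Sweeps every distinct score value as a candidate threshold.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best_tau, best_f1 = None, -1.0
    for tau in np.unique(scores):
        pred = scores >= tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

tau, f1 = best_f1_threshold(np.array([0.1, 0.2, 0.7, 0.9]), np.array([0, 0, 1, 1]))
print(tau, f1)
```

The resulting threshold binarizes the anomaly score map into the segmentation masks shown in the bottom row of Figure 3.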
Figure 4 presents additional evidence that our CFLOW-AD model actually addresses the OOD task sketched in the Figure 1 toy example. We plot the distribution of output anomaly scores for anomaly-free (green) and anomalous (red) feature vectors. CFLOW-AD is able to distinguish in-distribution and out-of-distribution feature vectors and separate them using a scalar threshold $\tau$.
5.5 Complexity evaluations
Table 6: Measured complexity of the trained models. Where two values are given, they correspond to the STC / MVTec setups.

Complexity metric and Model  Inference speed, fps  Model size, MB
R18 encoder only  80 / 62  45
PaDiM-R18 [8]  4.4  210 / 170
CFLOW-AD-R18  34 / 12  96
WRN50 encoder only  62 / 30  268
SPADE-WRN50 [7]  0.1  37,000 / 1,400
PaDiM-WRN50 [8]  1.1  5,200 / 3,800
CFLOW-AD-WRN50  27 / 9  947
MNetV3 encoder only  82 / 61  12
CFLOW-AD-MNetV3  35 / 12  25
In addition to the analytical estimates in Table 1, we present actual complexity evaluations for the trained models using inference speed and model size metrics. In particular, Table 6 compares CFLOW-AD with the models from Tables 4-5 that have been studied by Defard et al. [8].
The model size in Table 6 is measured as the size of all floating-point parameters in the corresponding model, i.e., its encoder and decoder (post-processing) models. Because the encoder architectures are identical, only the post-processing models differ. Since CFLOW-AD decoders do not explicitly depend on the feature map dimensions (only on the feature vector depths), our model is significantly smaller than SPADE and PaDiM. If we exclude the encoder parameters for a fair comparison, CFLOW-AD is 1.7× to 50× smaller than SPADE and 2× to 7× smaller than PaDiM.
The inference speed in Table 6 was measured with an Intel i7 CPU for SPADE and PaDiM in the Defard et al. [8] study with 256×256 inputs. We deduce that this suboptimal CPU choice was made due to the large memory requirements of these models in Table 6, which make GPU allocation for fast inference infeasible. In contrast, our CFLOW-AD can be processed in real time with 8× to 25× faster inference speed on a 1080 8GB GPU with the same input resolution and feature extractor. In addition, the MobileNetV3-L encoder provides a good tradeoff between accuracy, model size and inference speed for practical inspection systems.
6 Conclusions
We proposed to use the conditional normalizing flow framework to estimate exact data likelihoods, which is infeasible in other generative models. Moreover, we analytically showed the relationship of this framework to previous distance-based models with a multivariate Gaussian prior.
We introduced the CFLOW-AD model, which addresses the complexity limitations of existing unsupervised AD models by employing a fully-convolutional translation-equivariant architecture. As a result, CFLOW-AD is faster and smaller by a factor of 10× than prior models with the same input resolution and feature extractor setup.
CFLOW-AD achieves new state-of-the-art results on the popular MVTec dataset with 98.26% AUROC in detection, and 98.62% AUROC and 94.60% AUPRO in localization. Our new state-of-the-art results on the STC dataset are 72.63% and 94.48% AUROC in detection and localization, respectively. Our ablation study analyzed design choices for practical real-time processing, including the feature extractor choice, the multi-scale pyramid pooling setup, and the flow model hyperparameters.
References

[1] (2019) Analyzing inverse problems with invertible neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
[2] (2019) Guided image generation with conditional invertible neural networks. arXiv:1907.02392.
[3] (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. arXiv:1804.04488.
[4] (2019) MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] (2020) Uninformed students: student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] (2019) Improving unsupervised defect segmentation by applying structural similarity to autoencoders. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications.
[7] (2021) Sub-image anomaly detection with deep pyramid correspondences. arXiv:2005.02357v3.
[8] (2021) PaDiM: a patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition (ICPR) Workshops.
[9] (2017) Density estimation using real NVP. In Proceedings of the International Conference on Learning Representations (ICLR).
[10] (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE.
[11] (2016) Deep learning. MIT Press.
[12] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2019) Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[15] (2020) Why normalizing flows fail to detect out-of-distribution data. In Advances in Neural Information Processing Systems.
[16] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[17] (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems.
[18] (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems.
[19] (2021) CutPaste: self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2020) Multi-granularity tracking with modularized components for unsupervised vehicles anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[21] (2017) A revisit of sparse coding based anomaly detection in stacked RNN framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[22] (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[23] (1936) On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta).
[24] (2019) Do deep generative models know what they don't know? In Proceedings of the International Conference on Learning Representations (ICLR).
[25] (2018) Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors.
[26] (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys.
[27] (2021) Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research.
[28] (2017) Automatic differentiation in PyTorch. In Autodiff Workshop at Advances in Neural Information Processing Systems.
[29] (2020) Modeling the distribution of normal data in pretrained deep features for anomaly detection. arXiv:2005.14140.
[30] (2021) Same same but DifferNet: semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[31] (2021) A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE.
[32] (2013) Object-centric anomaly detection by attribute-based reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] (2021) Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[34] (2020) Understanding anomaly detection with deep invertible networks through hierarchies of distributions and features. In Advances in Neural Information Processing Systems.
[35] (2019) f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis.
[36] (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging.
[37] (2021) Unsupervised anomaly segmentation via deep feature reconstruction. Neurocomputing.
[38] (2019) q-space novelty detection with variational autoencoders. In MICCAI 2019 International Workshop on Computational Diffusion MRI.
[39] (2017) Attention is all you need. In Advances in Neural Information Processing Systems.
[40] (2020) Attention guided anomaly localization in images. In Proceedings of the European Conference on Computer Vision (ECCV).
[41] (2021) Student-teacher feature pyramid matching for unsupervised anomaly detection. arXiv:2103.04257.
[42] (2020) Patch SVDD: patch-level SVDD for anomaly detection and segmentation. In Proceedings of the Asian Conference on Computer Vision (ACCV).
[43] (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC).
[44] (2020) Encoding structure-texture relation with P-Net for anomaly detection in retinal images. In Proceedings of the European Conference on Computer Vision (ECCV).