CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows

07/27/2021 ∙ by Denis Gudovskiy, et al. ∙ Panasonic Corporation of North America

Unsupervised anomaly detection with localization has many practical applications when labeling is infeasible and, moreover, when anomaly examples are completely missing in the train data. While recently proposed models for such a data setup achieve high accuracy metrics, their complexity is a limiting factor for real-time processing. In this paper, we propose a real-time model and analytically derive its relationship to prior methods. Our CFLOW-AD model is based on a conditional normalizing flow framework adopted for anomaly detection with localization. In particular, CFLOW-AD consists of a discriminatively pretrained encoder followed by multi-scale generative decoders, where the latter explicitly estimate the likelihood of the encoded features. Our approach results in a computationally and memory-efficient model: CFLOW-AD is faster and smaller by a factor of 10x than the prior state-of-the-art with the same input setting. Our experiments on the MVTec dataset show that CFLOW-AD outperforms previous methods by 0.36% AUROC in detection, and by 1.12% AUROC and 2.5% AUPRO in localization. We open-source our code with fully reproducible experiments.




1 Introduction

Anomaly detection with localization (AD) is a growing area of research in computer vision with many practical applications, e.g., industrial inspection [4], road traffic monitoring [20], and medical diagnostics [44]. However, the common supervised AD [32] is not viable in practical applications for several reasons. First, it requires labeled data, which is costly to obtain. Second, anomalies are usually rare long-tail examples and have a low probability of being acquired by sensors. Lastly, consistent labeling of anomalies is subjective and requires extensive domain expertise, as illustrated in Figure 1 with industrial cable defects.

Figure 1: An example of the proposed out-of-distribution (OOD) detector for anomaly localization trained on anomaly-free images (top row). Sliced cable images are from the MVTec dataset [4], where the bottom row illustrates ground truth masks for anomalies (red) and the middle row shows examples of anomaly-free patches (green). The OOD detector learns the distribution of anomaly-free patches $x$ with $p_X(x)$ density and transforms it into a Gaussian distribution with $p_Z(z)$ density. Threshold $\tau$ separates in-distribution patches from the OOD patches with $p_{\tilde{Z}}(z)$ density.

With these limitations of supervised AD, a more appealing approach is to collect only unlabeled anomaly-free images for the train dataset, as in Figure 1 (top row). Then, any deviation from anomaly-free images is classified as an anomaly. Such a data setup with a low rate of anomalies is generally considered to be unsupervised [4]. Hence, the AD task can be reformulated as a task of out-of-distribution detection (OOD) with the AD objective.

While OOD for low-dimensional industrial sensors (power-line or acoustic) can be accomplished using a common $k$-nearest-neighbor or more advanced clustering methods [10], it is less trivial for high-resolution images. Recently, convolutional neural networks (CNNs) have gained popularity in extracting semantic information from images into downsampled feature maps. Though feature extraction using CNNs has relatively low complexity, the post-processing of feature maps is far from real-time processing in the state-of-the-art unsupervised AD methods.


To address this complexity drawback, we propose a CFLOW-AD model that is based on conditional normalizing flows. CFLOW-AD is agnostic to feature map spatial dimensions, similar to CNNs, which leads to higher accuracy metrics as well as lower computational and memory requirements. We present the main idea behind our approach in a toy OOD detector example in Figure 1. A distribution of the anomaly-free image patches $x$ with probability density function $p_X(x)$ is learned by the AD model. Our translation-equivariant model is trained to transform the original distribution with $p_X(x)$ density into a Gaussian distribution with $p_Z(z)$ density. Finally, this model separates in-distribution patches $z$ with $p_Z(z)$ from the out-of-distribution patches with $p_{\tilde{Z}}(z)$ using a threshold $\tau$ computed as the Euclidean distance from the distribution mean.

2 Related work

We review models that employ the data setup from Figure 1 and provide experimental results for the popular MVTec dataset [4] with factory defects or the Shanghai Tech Campus (STC) dataset [21] with surveillance camera videos. We highlight the research related to the more challenging task of pixel-level anomaly localization (segmentation) rather than the simpler image-level anomaly detection. (For a comprehensive review of the existing AD methods, we refer readers to the surveys by Ruff et al. [31] and Pang et al. [26].)

Napoletano et al. [25] propose to use CNN feature extractors followed by principal component analysis and $k$-means clustering for AD. Their feature extractor is a ResNet-18 [13] pretrained on the large-scale ImageNet dataset [16]. Similarly, SPADE [7] employs a Wide-ResNet-50 [43] with multi-scale pyramid pooling that is followed by $k$-nearest-neighbor clustering. Unfortunately, clustering is slow at test time with high-dimensional data. Thus, parallel convolutional methods are preferred in real-time systems.

Numerous methods are based on the natural idea of generative modeling. Unlike models with discriminatively pretrained feature extractors [25, 7], generative models learn the distribution of anomaly-free data and, therefore, are able to estimate proxy metrics for anomaly scores even for unseen images with anomalies. Recent models employ generative adversarial networks (GANs) [35, 36] and variational autoencoders (VAEs) [3, 38].

Fully-generative models [35, 36, 3, 38] are directly applied to images in order to estimate pixel-level probability density and compute per-pixel reconstruction errors as proxies for anomaly scores. These fully-generative models are unable to estimate the exact data likelihoods [6, 24] and do not perform better than the traditional methods [25, 7] according to the MVTec survey in [4]. Recent works [34, 15] show that these models tend to capture only low-level correlations instead of relevant semantic information. To overcome the latter drawback, the hybrid DFR model [37] uses a pretrained feature extractor with multi-scale pyramid pooling followed by a convolutional autoencoder (CAE). However, the DFR model is unable to estimate the exact likelihoods.

Another line of research proposes to employ a student-teacher type of framework [5, 33, 41]. The teacher is a pretrained feature extractor and the student is trained to estimate a scoring function for AD. Unfortunately, such frameworks underperform compared to state-of-the-art models.

Patch SVDD [42] and CutPaste [19] introduce self-supervised pretraining schemes for AD. Moreover, Patch SVDD proposes a novel method to combine multi-scale scoring masks into a final anomaly map. Unlike the nearest-neighbor search in [42], CutPaste estimates anomaly scores using an efficient Gaussian density estimator. While self-supervised pretraining can be helpful in uncommon data domains, Schirrmeister et al. [34] argue that large natural-image datasets such as ImageNet can be more representative for pretraining compared to small application-specific datasets, e.g., industrial MVTec [4].


The state-of-the-art PaDiM [8] proposes a surprisingly simple yet effective approach for anomaly localization. Similarly to [37, 7, 42], this approach relies on an ImageNet-pretrained feature extractor with multi-scale pyramid pooling. However, instead of slow test-time clustering in [7] or nearest-neighbor search in [42], PaDiM uses the well-known Mahalanobis distance metric [23] as an anomaly score. The metric parameters are estimated for each feature vector from the pooled feature maps. PaDiM has been inspired by Rippel et al. [29], who first advocated using this measure for anomaly detection without localization.

DifferNet [30] uses a promising class of generative models called normalizing flows (NFLOWs) [9] for image-level AD. The main advantage of NFLOW models is their ability to estimate the exact likelihoods for OOD compared to other generative models [35, 36, 3, 38, 37]. In this paper, we extend the DifferNet approach to the pixel-level anomaly localization task using our CFLOW-AD model. In contrast to the RealNVP [9] architecture with global average pooling in [30], we propose to use conditional normalizing flows [2] to make CFLOW-AD suitable for low-complexity processing of multi-scale feature maps for the localization task. We develop our CFLOW-AD with the following contributions:

  • Our theoretical analysis shows why the multivariate Gaussian assumption is a justified prior in previous models and why a more general NFLOW framework objective converges to similar results with less compute.

  • We propose to use conditional normalizing flows for unsupervised anomaly detection with localization using a computationally and memory-efficient architecture.

  • We show that our model outperforms the previous state-of-the-art in both detection and localization due to the unique properties of the proposed CFLOW-AD model.

Figure 2: Overview of our CFLOW-AD with a fully-convolutional translation-equivariant architecture. The encoder is a CNN feature extractor with multi-scale pyramid pooling. Pyramid pooling captures both global and local semantic information with receptive fields that grow from top to bottom. Pooled feature vectors are processed by a set of decoders independently for each $k$-th scale. Our decoder is a conditional normalizing flow network with a feature input and a conditional input with spatial information from a positional encoder (PE). The estimated multi-scale likelihoods are upsampled to the input size and added up to produce the anomaly map.

3 Theoretical background

3.1 Feature extraction with Gaussian prior

Consider a CNN $h(\lambda)$ trained for a classification task. Its parameters $\lambda$ are usually found by minimizing the Kullback-Leibler ($D_{KL}$) divergence between the joint train data distribution $Q_{x,y}$ and the learned model distribution $P_{x,y}(\lambda)$, where $(x, y)$ is an input-label pair for supervised learning.

Typically, the parameters $\lambda$ are initialized by the values sampled from the Gaussian distribution [12] and the optimization process is regularized as

$$\arg\min_{\lambda} D_{KL}\left[Q_{x,y} \,\|\, P_{x,y}(\lambda)\right] + \alpha R(\lambda), \tag{1}$$

where $R(\lambda)$ is a regularization term and $\alpha$ is a hyperparameter that defines regularization strength.

The most popular CNNs [13, 43] are trained with $L_2$ weight decay [17] regularization ($R(\lambda) = \|\lambda\|_2^2$). That imposes a multivariate Gaussian (MVG) prior not only on the parameters $\lambda$, but also on the feature vectors $z$ extracted from the feature maps of the intermediate layers of $h(\lambda)$ [11].

3.2 A case for Mahalanobis distance

With the same MVG prior assumption, Lee et al. [18] recently proposed to model the distribution of feature vectors $z$ by an MVG density function and to use Mahalanobis distance [23] as a confidence score in CNN classifiers. Inspired by [18], Rippel et al. [29] adopt Mahalanobis distance for the anomaly detection task since this measure determines the distance of a particular feature vector $z$ to its MVG distribution. Consider an MVG distribution $\mathcal{N}(\mu, \Sigma)$ with a density function $p_Z(z)$ for random variable $z \in \mathbb{R}^D$ defined as

$$p_Z(z) = (2\pi)^{-D/2} (\det \Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(z-\mu)^T \Sigma^{-1} (z-\mu)\right), \tag{2}$$

where $\mu \in \mathbb{R}^D$ is a mean vector and $\Sigma \in \mathbb{R}^{D \times D}$ is a covariance matrix of a true anomaly-free density $p_Z(z)$.

Then, the Mahalanobis distance $M(z)$ is calculated as

$$M(z) = \sqrt{(z-\mu)^T \Sigma^{-1} (z-\mu)}. \tag{3}$$

Since the true anomaly-free data distribution is unknown, the mean vector and covariance matrix from (3) are replaced by the estimates $\hat{\mu}$ and $\hat{\Sigma}$ calculated from the empirical train dataset $\mathcal{D}_{\text{train}}$. At the same time, the density function $p_{\tilde{Z}}(z)$ of anomaly data has different $\tilde{\mu}$ and $\tilde{\Sigma}$ statistics, which allows separating out-of-distribution and in-distribution feature vectors using $M(z)$ from (3).

This framework with MVG distribution assumption shows its effectiveness in image-level anomaly detection task [29] and is adopted by the state-of-the-art PaDiM [8] model in pixel-level anomaly localization task.
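The distance-based scoring described in this section can be illustrated with a minimal sketch. It assumes two-dimensional feature vectors so that the covariance inverse has a closed form; the helper names `fit_gaussian_2d` and `mahalanobis_2d` and the synthetic train data are hypothetical and are not taken from any of the cited implementations.

```python
import math
import random

def fit_gaussian_2d(samples):
    """Estimate the mean vector and covariance matrix of 2-D feature vectors."""
    n = len(samples)
    mu = [sum(s[d] for s in samples) / n for d in range(2)]
    cov = [[sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in samples) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mu, cov

def mahalanobis_2d(z, mu, cov):
    """M(z) = sqrt((z - mu)^T Sigma^{-1} (z - mu)) with a closed-form 2x2 inverse."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [z[0] - mu[0], z[1] - mu[1]]
    quad = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return math.sqrt(quad)

rng = random.Random(0)
# Hypothetical anomaly-free "train" features: correlated 2-D Gaussian samples.
train = []
for _ in range(2000):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    train.append((1.0 + a, 2.0 + 0.8 * a + 0.3 * b))

mu_hat, cov_hat = fit_gaussian_2d(train)
m_normal = mahalanobis_2d((1.2, 2.1), mu_hat, cov_hat)   # close to the train mean
m_anomaly = mahalanobis_2d((1.0, 5.0), mu_hat, cov_hat)  # far from the train data
```

A point near the estimated mean receives a small distance, while a point that violates the learned correlation receives a large one, which is what makes a scalar threshold on the distance usable as an anomaly decision.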

3.3 Relationship with the flow framework

Dinh et al. [9] introduce a class of generative probabilistic models called normalizing flows. These models apply the change of variable formula to fit an arbitrary density $p_Z(z)$ by a tractable base distribution with density $p_U(u)$ and a bijective invertible mapping $g^{-1}: Z \rightarrow U$. Then, the $\log$-likelihood of any $z \in Z$ can be estimated by

$$\log \hat{p}_Z(z, \theta) = \log p_U(u) + \log\left|\det J\right|, \tag{4}$$

where a sample $u \in \mathbb{R}^D$ is usually from the standard MVG distribution ($u \sim \mathcal{N}(0, I)$) and a matrix $J = \nabla_z g^{-1}(z, \theta)$ is the Jacobian of a bijective invertible flow model ($z = g(u, \theta)$ and $u = g^{-1}(z, \theta)$) parameterized by vector $\theta$.

The flow model $g(\theta)$ is a set of basic layered transformations with tractable Jacobian determinants. For example, $\log|\det J|$ in RealNVP [9] coupling layers is a simple sum of the logarithms of the layer Jacobian's diagonal elements. These models are optimized using stochastic gradient descent by maximizing the $\log$-likelihood in (4). Equivalently, optimization can be done by minimizing the reverse $D_{KL}\left[\hat{p}_Z(z, \theta) \,\|\, p_Z(z)\right]$ [27], where $\hat{p}_Z(z, \theta)$ is the model prediction and $p_Z(z)$ is a target density. The loss function for this objective is defined as

$$\mathcal{L}(\theta) = \mathbb{E}_{\hat{p}_Z(z, \theta)}\left[\log \hat{p}_Z(z, \theta) - \log p_Z(z)\right]. \tag{5}$$

If $p_Z(z)$ is distributed according to the Section 3.1 MVG assumption, we can express (5) as a function of the Mahalanobis distance $M(z)$ using its definition from (3) as

$$\mathcal{L}(\theta) = \mathbb{E}_{\hat{p}_Z(z, \theta)}\left[\frac{M^2(z) - E^2(u)}{2} + \log\frac{\left|\det J\right|}{(\det \Sigma)^{-1/2}}\right], \tag{6}$$

where $E^2(u) = \|u\|_2^2$ is a squared Euclidean distance of a sample $u \sim \mathcal{N}(0, I)$ (detailed proof in Appendix A).

Then, the loss in (6) converges to zero when the likelihood contribution term $|\det J|$ of the model (normalized by $(\det \Sigma)^{-1/2}$) compensates the difference between a squared Mahalanobis distance for $z$ from the target density and a squared Euclidean distance for $u \sim \mathcal{N}(0, I)$.

This normalizing flow framework can estimate the exact likelihoods of any arbitrary distribution with $p_Z$ density, while Mahalanobis distance is limited to the MVG distribution only. For example, CNNs trained with $L_1$ regularization would have a Laplace prior [11] or have no particular prior in the absence of regularization. Moreover, we introduce conditional normalizing flows in the next section and show that they are more compact in size and have a fully-convolutional parallel architecture compared to the [7, 8] models.
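The link between the flow objective and the Mahalanobis distance can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions, not the Appendix A proof: it assumes a Gaussian target with mean `mu` and standard deviation `sigma`, a standard Gaussian base density, and a hypothetical affine flow $z = a u + b$. It compares the log-density difference of the flow and the target with the expression built from the squared Mahalanobis and Euclidean distances.

```python
import math

def log_normal_pdf(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

# Target density p_Z = N(mu, sigma^2): the MVG assumption in one dimension.
mu, sigma = 1.5, 2.0

def integrands(z, a, b):
    """Return both forms of the loss integrand for the flow z = a * u + b."""
    u = (z - b) / a                                # u = g^{-1}(z)
    log_abs_det_j = math.log(abs(1.0 / a))         # J = d g^{-1} / dz
    log_p_hat = log_normal_pdf(u, 0.0, 1.0) + log_abs_det_j
    density_form = log_p_hat - log_normal_pdf(z, mu, sigma)
    m2 = ((z - mu) / sigma) ** 2                   # squared Mahalanobis distance
    e2 = u ** 2                                    # squared Euclidean distance
    # In 1-D, (det Sigma)^{-1/2} equals 1 / sigma.
    distance_form = (m2 - e2) / 2 + log_abs_det_j - math.log(1.0 / sigma)
    return density_form, distance_form

points = (-2.0, 0.0, 0.4, 3.1)
max_gap = max(abs(d - m) for d, m in (integrands(z, 0.7, -0.3) for z in points))

# When the flow matches the target (a = sigma, b = mu), the integrand vanishes.
matched = [integrands(z, sigma, mu)[0] for z in points]
```

Both forms agree up to floating-point error for an arbitrary affine flow, and the integrand is zero everywhere once the flow reproduces the target density.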

4 The proposed CFLOW-AD model

4.1 CFLOW encoder for feature extraction

We implement a feature extraction scheme with multi-scale feature pyramid pooling similar to recent models [7, 8]. We define the discriminatively-trained CNN feature extractor as an encoder in Figure 2. The CNN encoder maps image patches into feature vectors that contain relevant semantic information about their content. CNNs accomplish this task efficiently due to their translation-equivariant architecture with shared kernel parameters. In our experiments, we use an ImageNet-pretrained encoder following Schirrmeister et al. [34], who show that large natural-image datasets can serve as a representative distribution for pretraining. If large unlabeled application-domain data is available, the self-supervised pretraining from [42, 19] can be a viable option.

One important aspect of a CNN encoder is its effective receptive field [22]. Since the effective receptive field is not strictly bounded, the size of encoded patches cannot be exactly defined. At the same time, anomalies have various sizes and shapes, and, ideally, they have to be processed with variable receptive fields. To address the ambiguity between CNN receptive fields and anomaly variability, we adopt the common multi-scale feature pyramid pooling approach. Figure 2 shows that the feature vectors are extracted by $K$ pooling layers. Pyramid pooling captures both local and global patch information with small and large receptive fields in the first and last CNN layers, respectively. For convenience, we number pooling layers in the last to first layer order.

4.2 CFLOW decoders for likelihood estimation

We use the general normalizing flow framework from Section 3.3 to estimate $\log$-likelihoods of feature vectors $z$. Hence, our generative decoder model $g(\theta)$ aims to fit the true density $p_Z(z)$ by an estimated parameterized density $\hat{p}_Z(z, \theta)$ from (4). However, the feature vectors are assumed to be independent of their spatial location in the general framework. To increase efficacy of distribution modeling, we propose to incorporate a spatial prior into the $g(\theta)$ model using a conditional flow framework. In addition, we model $\hat{p}^k_Z(z, c, \theta)$ densities using $K$ independent decoder models $g_k(\theta_k)$ due to the multi-scale feature pyramid pooling setup.

Our conditional normalizing flow (CFLOW) decoder architecture is presented in Figure 2. We generate a conditional vector $c^k_i$ using a 2D form of conventional positional encoding (PE) [39]. Each $c^k_i \in \mathbb{R}^{C_k}$ contains $\sin$ and $\cos$ harmonics that are unique to its spatial location $(h_k, w_k)_i$. We extend the unconditional flow framework to CFLOW by concatenating the intermediate vectors inside decoder coupling layers with the conditional vectors $c_i$ as in [2].
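A 2D positional encoding of the kind described above can be sketched as follows. The channel layout (first half for the column index, second half for the row index) and the frequency progression borrowed from the 1-D transformer encoding are assumptions made for illustration; the function name `positional_encoding_2d` is hypothetical.

```python
import math

def positional_encoding_2d(channels, height, width):
    """Return pe[h][w] as a list of `channels` sin/cos harmonics.

    Assumed layout: the first half of the channels encodes the column index w,
    the second half encodes the row index h, each with interleaved sin and cos.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2
    # Geometric frequency progression, as in the 1-D transformer encoding.
    freqs = [math.exp(-math.log(10000.0) * (2 * i) / half) for i in range(half // 2)]
    pe = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for h in range(height):
        for w in range(width):
            for i, f in enumerate(freqs):
                pe[h][w][2 * i] = math.sin(w * f)
                pe[h][w][2 * i + 1] = math.cos(w * f)
                pe[h][w][half + 2 * i] = math.sin(h * f)
                pe[h][w][half + 2 * i + 1] = math.cos(h * f)
    return pe

pe = positional_encoding_2d(channels=8, height=4, width=4)
c_00, c_23 = pe[0][0], pe[2][3]   # conditional vectors for two spatial locations
```

Every location of the 4x4 grid receives a distinct vector, so a decoder that sees the vector as a condition can, in principle, model location-dependent densities.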

Then, the $k$-th CFLOW decoder contains a sequence of conventional coupling layers with the additional conditional input. Each coupling layer comprises a fully-connected layer, softplus activation and output vector permutations. Usually, the conditional extension does not increase model size since the conditional vector is much shorter than the feature vector ($C_k \ll D_k$, where $D_k$ is the feature vector depth at the $k$-th scale). For example, we use a fixed $C_k$ in all our experiments. Our CFLOW decoder has a translation-equivariant architecture, because it slides along feature vectors extracted from the intermediate feature maps with kernel parameter sharing. As a result, both the encoder and decoders have convolutional translation-equivariant architectures.
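To make the conditional coupling idea concrete, here is a minimal sketch of one affine coupling layer that concatenates the pass-through half of the feature vector with a condition vector. It is an illustration only: the random linear maps stand in for trained fully-connected layers, the `tanh` bound on the log-scale is an assumption, and the class name `ConditionalCoupling` is hypothetical rather than part of FrEIA or the authors' code.

```python
import math
import random

def make_linear(n_in, n_out, rng):
    """Hypothetical stand-in for a trained fully-connected layer."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class ConditionalCoupling:
    """Affine coupling: the first half of z passes through unchanged and,
    together with the condition c, predicts scale and shift for the second half."""

    def __init__(self, dim, cond_dim, rng):
        self.d1 = dim // 2
        self.d2 = dim - self.d1
        self.scale_net = make_linear(self.d1 + cond_dim, self.d2, rng)
        self.shift_net = make_linear(self.d1 + cond_dim, self.d2, rng)

    def _params(self, z1, c):
        inp = z1 + c                                       # concatenate feature and condition
        s = [math.tanh(v) for v in self.scale_net(inp)]    # bounded log-scale
        t = self.shift_net(inp)
        return s, t

    def forward(self, z, c):
        z1, z2 = z[:self.d1], z[self.d1:]
        s, t = self._params(z1, c)
        u2 = [z2i * math.exp(si) + ti for z2i, si, ti in zip(z2, s, t)]
        return z1 + u2, sum(s)                             # output and log|det J|

    def inverse(self, u, c):
        u1, u2 = u[:self.d1], u[self.d1:]
        s, t = self._params(u1, c)
        z2 = [(u2i - ti) * math.exp(-si) for u2i, si, ti in zip(u2, s, t)]
        return u1 + z2

rng = random.Random(0)
layer = ConditionalCoupling(dim=4, cond_dim=2, rng=rng)
z, c = [0.3, -1.2, 0.8, 2.0], [0.0, 1.0]
u, log_det = layer.forward(z, c)
z_back = layer.inverse(u, c)
```

Because the Jacobian of such a layer is triangular, its log-determinant is just the sum of the predicted log-scales, and the layer is inverted exactly given the same condition. A real decoder would stack several such layers with permutations between them.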

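We train CFLOW-AD using a maximum likelihood objective, which is equivalent to minimizing the loss defined by

$$\mathcal{L}(\theta) = D_{KL}\left[p_Z(z) \,\|\, \hat{p}_Z(z, c, \theta)\right] \approx \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\|u_i\|_2^2}{2} - \log\left|\det J_i\right|\right] + \text{const}, \tag{7}$$

where the random variable $u_i = g^{-1}(z_i, c_i, \theta)$, the Jacobian $J_i = \nabla_z g^{-1}(z_i, c_i, \theta)$ for the CFLOW decoder and an expectation operation in $D_{KL}$ is replaced by an empirical train dataset $\mathcal{D}_{\text{train}}$ of size $N$. For brevity, we drop the $k$-th scale notation. The derivation is given in Appendix B.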
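The training objective can be sketched numerically as follows, assuming a standard Gaussian base density so that the negative log-likelihood of one feature vector is $\|u\|_2^2 / 2 - \log|\det J|$ plus a constant. The function names and the three decoder outputs are hypothetical placeholders.

```python
import math

def log_likelihood(u, log_det_j):
    """Log-likelihood of one feature vector under a standard Gaussian base density."""
    d = len(u)
    sq_norm = sum(v * v for v in u)
    return -0.5 * (sq_norm + d * math.log(2 * math.pi)) + log_det_j

def nll_loss(batch):
    """Mean of ||u||^2 / 2 - log|det J| over the batch (additive constant dropped)."""
    return sum(0.5 * sum(v * v for v in u) - ld for u, ld in batch) / len(batch)

# Hypothetical decoder outputs (u_i, log|det J_i|) for three 2-D feature vectors.
batch = [([0.1, -0.2], 0.05), ([1.5, 0.3], -0.10), ([-0.4, 0.9], 0.20)]
loss = nll_loss(batch)
mean_ll = sum(log_likelihood(u, ld) for u, ld in batch) / len(batch)
```

The loss and the mean log-likelihood differ only by sign and by the constant $\frac{D}{2}\log(2\pi)$, which is why minimizing one maximizes the other.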

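After training the decoders $g_k(\theta_k)$ for all $K$ scales using (7), we estimate test dataset $\mathcal{D}_{\text{test}}$ $\log$-likelihoods as

$$\log \hat{p}_Z(z_i, c_i, \hat{\theta}) = -\frac{\|u_i\|_2^2 + D \log(2\pi)}{2} + \log\left|\det J_i\right|. \tag{8}$$

Next, we convert $\log$-likelihoods to probabilities $p^k_i = e^{\log \hat{p}_Z(z^k_i, c^k_i, \hat{\theta}_k)}$ for each $k$-th scale using (8) and normalize them to be in the $[0:1]$ range. Then, we upsample $p^k_i$ to the input image resolution ($H \times W$) using bilinear interpolation $b(\cdot)$, i.e., $P_k = b(p^k) \in \mathbb{R}^{H \times W}$. Finally, we calculate anomaly score maps $S \in \mathbb{R}^{H \times W}$ by aggregating all upsampled probabilities as $S = \max\left(\sum_{k=1}^{K} P_k\right) - \sum_{k=1}^{K} P_k$.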
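The scoring pipeline in this paragraph can be sketched end to end on tiny grids. Several details are assumptions made for illustration: probabilities are normalized by subtracting the maximum log-likelihood before exponentiation, bilinear interpolation uses corner-aligned sampling, and the aggregated probability map is flipped into an anomaly score by subtracting it from its own maximum. All function names and the two input grids are hypothetical.

```python
import math

def to_probabilities(log_likelihoods):
    """Exponentiate after subtracting the maximum, giving values in (0, 1]."""
    top = max(max(row) for row in log_likelihoods)
    return [[math.exp(v - top) for v in row] for row in log_likelihoods]

def bilinear_upsample(grid, out_h, out_w):
    """Bilinear interpolation with corner-aligned sampling (an assumed convention)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            upper = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            lower = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(upper * (1 - fy) + lower * fy)
        out.append(row)
    return out

def anomaly_score_map(log_likelihoods_per_scale, out_h, out_w):
    """Sum the upsampled per-scale probabilities and flip them into anomaly scores."""
    total = [[0.0] * out_w for _ in range(out_h)]
    for ll in log_likelihoods_per_scale:
        up = bilinear_upsample(to_probabilities(ll), out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                total[i][j] += up[i][j]
    peak = max(max(row) for row in total)
    return [[peak - v for v in row] for row in total]

# Two hypothetical scales; the low-likelihood cell sits in the bottom-right corner.
scale_a = [[-1.0, -1.2], [-1.1, -9.0]]
scale_b = [[-2.0, -2.1, -2.0], [-2.2, -2.0, -2.3], [-2.1, -2.4, -8.0]]
score = anomaly_score_map([scale_a, scale_b], out_h=4, out_w=4)
```

The resulting map is non-negative, equals zero where the summed probability is highest, and peaks at the location that both scales consider least likely.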

Model Train Test Memory
PaDiM [8]
Table 1: Complexity estimates for SPADE [7], PaDiM [8] and our CFLOW-AD. We compare train and test complexity as well as memory requirements. All models use the same encoder setup, but diverge in the post-processing. SPADE allocates memory for a train gallery used in $k$-nearest-neighbors. PaDiM keeps large matrices for Mahalanobis distance. Our model employs trained decoders for post-processing.

4.3 Complexity analysis

Table 1 analytically compares the complexity of CFLOW-AD and recent state-of-the-art models with the same pyramid pooling setup.

SPADE [7] performs $k$-nearest-neighbor clustering between each test point and a gallery of train data. Therefore, the method requires a large memory allocation for the gallery and a clustering procedure that is typically slow compared to convolutional methods.

PaDiM [8] estimates train-time statistics, i.e., inverses of covariance matrices, to calculate the Mahalanobis distance at test time. Hence, it has low computational complexity, but it stores large matrices in memory for every $k$-th pooling layer.

Our method optimizes generative decoders using (7) during the train phase. At the test phase, CFLOW-AD simply infers data $\log$-likelihoods using (8) in a fully-convolutional fashion. Decoder parameters are relatively small as reported in Table 6.

5 Experiments

Encoder WRN50 WRN50 WRN50 WRN50 WRN50 R18 R18 MNetV3 MNetV3
# of CL 4 8 8 8 8 8 8 8 8
# of PL 2 2 3 3 3 3 3 3 3
H×W 256 256 256 512 512 256 512 256 512
Bottle 97.28±0.03 97.24±0.03 98.76±0.01 98.98±0.01 98.83±0.01 98.47±0.03 98.64±0.01 98.74 98.92
Cable 95.71±0.01 96.17±0.07 97.64±0.04 97.12±0.06 95.29±0.04 96.75±0.04 96.07±0.06 97.62 97.49
Capsule 98.17±0.02 98.19±0.05 98.98±0.00 98.64±0.02 98.40±0.12 98.62±0.02 98.28±0.05 98.89 98.75
Carpet 98.50±0.01 98.55±0.01 99.23±0.01 99.25±0.01 99.24±0.00 99.00±0.01 99.29±0.00 98.64 99.00
Grid 93.77±0.05 93.88±0.16 96.89±0.02 98.99±0.02 98.74±0.00 93.95±0.04 98.53±0.01 94.75 98.81
Hazelnut 98.08±0.01 98.13±0.02 98.82±0.01 98.89±0.01 98.88±0.01 98.81±0.01 98.41±0.01 98.88 99.00
Leather 98.92±0.02 99.00±0.06 99.61±0.01 99.66±0.00 99.65±0.00 99.45±0.01 99.51±0.02 99.50 99.64
Metal Nut 96.72±0.03 96.72±0.06 98.56±0.03 98.25±0.04 98.16±0.03 97.59±0.05 96.42±0.03 98.36 98.78
Pill 98.46±0.02 98.46±0.01 98.95±0.00 98.52±0.05 98.20±0.08 98.34±0.02 97.80±0.05 98.69 98.44
Screw 94.98±0.06 95.28±0.06 98.10±0.05 98.86±0.02 98.78±0.01 97.38±0.03 98.40±0.03 98.04 99.09
Tile 95.52±0.02 95.66±0.06 97.71±0.02 98.01±0.01 97.98±0.02 95.10±0.02 95.80±0.10 96.07 96.48
Toothbrush 98.02±0.03 97.98±0.00 98.56±0.02 98.93±0.00 98.89±0.00 98.44±0.02 99.00±0.01 98.09 98.80
Transistor 93.09±0.28 94.05±0.11 93.28±0.40 80.52±0.13 76.28±0.14 92.71±0.23 83.34±0.46 97.79 95.22
Wood 90.65±0.10 90.59±0.07 94.49±0.03 96.65±0.01 96.56±0.02 93.51±0.03 95.00±0.04 92.24 94.96
Zipper 96.80±0.02 97.01±0.05 98.41±0.09 99.08±0.02 99.06±0.01 97.71±0.06 98.98±0.01 97.50 99.07
Average 96.31 96.46 97.87 97.36 96.86 97.06 96.90 97.59 98.16
Table 2: Ablation study of CFLOW-AD using localization AUROC metric on the MVTec [4] dataset, %. We experiment with input image resolution ($H \times W$), encoder architecture (ResNet-18 (R18), WideResnet-50 (WRN50) and MobileNetV3L (MNetV3)), type of normalizing flow (unconditional (UFLOW) and conditional (CFLOW)), number of coupling (# of CL) and pooling layers (# of PL).
Task Localization Detection Localization
Encoder ResNet-18 EffNetB4 WideResNet-50
Class/Model CutPaste Ours CutPaste Ours CutPaste Ours SPADE PaDiM Ours
Bottle 97.6 98.64 98.3 100.00 100.0 100.0 (98.4, 95.5) (98.3, 94.8) (98.98, 96.80)
Cable 90.0 96.75 80.6 97.62 96.2 97.59 (97.2, 90.9) (96.7, 88.8) (97.64, 93.53)
Capsule 97.4 98.62 96.2 93.15 95.4 97.68 (99.0, 93.7) (98.5, 93.5) (98.98, 93.40)
Carpet 98.3 99.29 93.1 98.20 100.0 98.73 (97.5, 94.7) (99.1, 96.2) (99.25, 97.70)
Grid 97.5 98.53 99.9 98.97 99.1 99.60 (93.7, 86.7) (97.3, 94.6) (98.99, 96.08)
Hazelnut 97.3 98.81 97.3 99.91 99.9 99.98 (99.1, 95.4) (98.2, 92.6) (98.89, 96.68)
Leather 99.5 99.51 100.0 100.00 100.0 100.0 (97.6, 97.2) (98.9, 88.8) (99.66, 99.35)
Metal Nut 93.1 97.59 99.3 98.45 98.6 99.26 (98.1, 94.4) (97.2, 85.6) (98.56, 91.65)
Pill 95.7 98.34 92.4 93.02 93.3 96.82 (96.5, 94.6) (95.7, 92.7) (98.95, 95.39)
Screw 96.7 98.40 86.3 85.94 86.6 91.89 (98.9, 96.0) (98.5, 94.4) (98.86, 95.30)
Tile 90.5 95.80 93.4 98.40 99.8 99.88 (87.4, 75.9) (94.1, 86.0) (98.01, 94.34)
Toothbrush 98.1 99.00 98.3 99.86 90.7 99.65 (97.9, 93.5) (98.8, 93.1) (98.93, 95.06)
Transistor 93.0 97.69 95.5 93.04 97.5 95.21 (94.1, 87.4) (97.5, 84.5) (97.99, 81.40)
Wood 95.5 95.00 98.6 98.59 99.8 99.12 (88.5, 97.4) (94.9, 91.1) (96.65, 95.79)
Zipper 99.3 98.98 99.4 96.15 99.9 98.48 (96.5, 92.6) (98.5, 95.9) (99.08, 96.60)
Average 96.0 98.06 95.2 96.75 97.1 98.26 (96.0, 91.7) (97.5, 92.1) (98.62, 94.60)
Table 3: The detailed comparison of PaDiM [8], SPADE [7], CutPaste [19] and our CFLOW-AD on the MVTec [4] dataset for every class using AUROC or, if available, a tuple (AUROC, AUPRO) metric, %. CFLOW-AD model is with the best hyperparameters from Section 5.2 ablation study. For fair comparison, we group together results with the same encoder architectures such as ResNet-18 and WideResNet-50.

5.1 Experimental setup

We conduct unsupervised anomaly detection (image-level) and localization (pixel-level segmentation) experiments using the MVTec [4] dataset with factory defects and the STC [21] dataset with surveillance camera videos. The code is in PyTorch [28] with the FrEIA library [1] used for generative normalizing flow modeling.

The industrial MVTec dataset comprises 15 classes with a total of 3,629 images for training and 1,725 images for testing. The train dataset contains only anomaly-free images without any defects. The test dataset contains both images with various types of defects and defect-free images. Five classes contain different types of textures (carpet, grid, leather, tile, wood), while the remaining 10 classes represent various types of objects. We resize MVTec images without cropping according to the specified image resolution ($H \times W$) and apply rotation augmentations during the training phase only.

The STC dataset contains 274,515 training and 42,883 testing frames extracted from surveillance camera videos and divided into 13 distinct university campus scenes. Because STC is significantly larger than MVTec, we experiment only with 256×256 resolution and apply the same pre-processing and augmentation pipeline as for MVTec.

We compare CFLOW-AD with the models reviewed in Section 2 using the MVTec and STC datasets. We use common threshold-agnostic evaluation metrics for localization: area under the receiver operating characteristic curve (AUROC) and area under the per-region-overlap curve (AUPRO). AUROC is skewed towards large-area anomalies, while the AUPRO metric ensures that both large and small anomalies are equally important in localization. Image-level AD detection is reported by the AUROC only.
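For readers unfamiliar with the AUROC metric, the following sketch computes it from scores and binary labels through the rank-sum (Mann-Whitney U) identity: the result is the probability that a randomly chosen anomalous element scores higher than a randomly chosen anomaly-free one. The helper `auroc` and the toy pixel data are hypothetical; AUPRO, which additionally normalizes by connected anomaly regions, is not covered here.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous, 0 for anomaly-free. Ties receive average ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical per-pixel anomaly scores and ground-truth mask values.
pixel_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
pixel_labels = [0, 0, 1, 1, 1, 0]
value = auroc(pixel_scores, pixel_labels)
```

In the toy example, eight of the nine anomalous/anomaly-free pixel pairs are ordered correctly, so the AUROC is 8/9. Because every pixel counts equally, large anomalies dominate the pair count, which is the skew mentioned above.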

We run each CFLOW-AD experiment four times on the MVTec and report the mean ($\mu$) of the evaluation metric and, if specified, its standard deviation ($\pm\sigma$). For the larger STC dataset, we conduct only a single experiment. As in other methods, we train a separate CFLOW-AD model for each MVTec class and each STC scene. All our models use the same training hyperparameters: Adam optimizer with 2e-4 learning rate, 100 train epochs, 32 mini-batch size for the encoder and cosine learning rate annealing with 2 warm-up epochs. Since our decoders are agnostic to feature map dimensions and have low memory requirements, we train and test CFLOW-AD decoders with 8,192 (32×256) mini-batch size for feature vector processing. During the train phase, 8,192 feature vectors are randomly sampled from 32 random feature maps. Similarly, 8,192 feature vectors are sequentially sampled during the test phase. The feature pyramid pooling setup for the ResNet-18 and WideResnet-50 encoders is identical to PaDiM [8]. The effects of other architectural hyperparameters are studied in the ablation study.
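The learning rate schedule named above can be sketched as a function of the epoch index. The text only states a 2e-4 base rate, 100 epochs, cosine annealing and 2 warm-up epochs; the linear shape of the warm-up and the decay towards zero are assumptions, and `learning_rate` is a hypothetical helper rather than the authors' scheduler.

```python
import math

def learning_rate(epoch, base_lr=2e-4, total_epochs=100, warmup_epochs=2):
    """Linear warm-up followed by cosine annealing (one common formulation)."""
    if epoch < warmup_epochs:
        # Assumed: ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(100)]
```

Under these assumptions the rate starts at half the base value, reaches 2e-4 after the warm-up, and then decays smoothly to a near-zero value in the last epoch.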

5.2 Ablation study

Table 2 presents a comprehensive study of various design choices for CFLOW-AD on the MVTec dataset using the AUROC metric. In particular, we experiment with the input image resolution ($H \times W$), encoder architecture (ResNet-18 [13], WideResnet-50 [43], MobileNetV3L [14]), type of normalizing flow (unconditional (UFLOW) or conditional (CFLOW)), number of flow coupling layers (# of CL) and pooling layers (# of PL).

Our study shows that increasing the number of the decoder's coupling layers from 4 to 8 gives on average a 0.15% gain due to more accurate distribution modeling. An even higher 1.4% AUROC improvement is achieved when processing 3-scale feature maps (layers 1, 2 and 3) compared to 2-scale only (layers 2, 3). The additional feature map (layer 1) with larger scale provides more precise spatial semantic information. The conditional normalizing flow (CFLOW) is on average 0.5% better than the unconditional (UFLOW) due to effective encoding of the spatial prior. Finally, the larger WideResnet-50 outperforms the smaller ResNet-18 by 0.81%. MobileNetV3L, however, could be a good design choice for both fast inference and high AUROC.

Importantly, we find that the optimal input resolution is not consistent among MVTec classes. The classes with macro objects, e.g., cable or pill, tend to benefit from smaller-scale processing (256×256), which effectively translates to larger CNN receptive fields. The majority of classes perform better with 512×512 inputs, i.e., smaller receptive fields. Finally, we discover that the transistor class has even higher AUROC with images resized to 128×128. Hence, we report results with the highest performing input resolution settings in the Section 5.3 comparisons.
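The average gains quoted in the ablation discussion can be traced back to the "Average" row of Table 2 with a few subtractions. The sketch below keys each column by its settings; the flow-type labels are an assumption inferred from the text (the two otherwise identical WRN50 columns at 512 resolution are taken to be the conditional and unconditional variants), and the dictionary layout is purely illustrative.

```python
# Average localization AUROC (%) per Table 2 column, keyed by
# (encoder, coupling layers, pooling layers, input size, flow type).
# The flow type of each column is an assumption inferred from the text.
averages = {
    ("WRN50", 4, 2, 256, "CFLOW"): 96.31,
    ("WRN50", 8, 2, 256, "CFLOW"): 96.46,
    ("WRN50", 8, 3, 256, "CFLOW"): 97.87,
    ("WRN50", 8, 3, 512, "CFLOW"): 97.36,
    ("WRN50", 8, 3, 512, "UFLOW"): 96.86,
    ("R18", 8, 3, 256, "CFLOW"): 97.06,
}

def gain(better, baseline):
    """Difference of two column averages, rounded to two decimals."""
    return round(averages[better] - averages[baseline], 2)

coupling_gain = gain(("WRN50", 8, 2, 256, "CFLOW"), ("WRN50", 4, 2, 256, "CFLOW"))
pooling_gain = gain(("WRN50", 8, 3, 256, "CFLOW"), ("WRN50", 8, 2, 256, "CFLOW"))
conditional_gain = gain(("WRN50", 8, 3, 512, "CFLOW"), ("WRN50", 8, 3, 512, "UFLOW"))
encoder_gain = gain(("WRN50", 8, 3, 256, "CFLOW"), ("R18", 8, 3, 256, "CFLOW"))

# Per-class values of the first Table 2 column reproduce its reported average.
first_column = [97.28, 95.71, 98.17, 98.50, 93.77, 98.08, 98.92, 96.72,
                98.46, 94.98, 95.52, 98.02, 93.09, 90.65, 96.80]
first_average = round(sum(first_column) / len(first_column), 2)
```

The four differences come out as 0.15, 1.41, 0.50 and 0.81 percentage points, matching the gains stated in the text, and the per-class values of the first column average to the reported 96.31.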

Model Detection Localization
DifferNet [30] 94.9 - -
DFR [37] - 95.0 91.0
SVDD [42] 92.1 95.7 -
SPADE [7] 85.5 96.0 91.7
CutPaste [19] 97.1 96.0 -
PaDiM [8] 97.9 97.5 92.1
CFLOW-AD (ours) 98.26 98.62 94.60
Table 4: Average AUROC and AUPRO on the MVTec [4] dataset, %. Both the best detection and localization metrics are presented, if available. CFLOW-AD is with WideResNet-50 encoder.
Metric AUROC
Model Detection Localization
CAVGA [40] - 85.0
SPADE [7] 71.9 89.9
PaDiM [8] - 91.2
CFLOW-AD (ours) 72.63 94.48
Table 5: Average AUROC on the STC [21] dataset, %. Both the best available detection and localization metrics are shown. CFLOW-AD is with WideResNet-50 encoder.

5.3 Quantitative comparison

Table 4 summarizes average MVTec results for the best published models. CFLOW-AD with the WideResNet-50 encoder outperforms the state-of-the-art by 0.36% AUROC in detection, and by 1.12% AUROC and 2.5% AUPRO in localization. Table 3 contains a per-class comparison for the subset of models grouped by the task and type of encoder architecture. CFLOW-AD is on par with or significantly exceeds the best models in the per-class comparison with the same encoder setups.

Table 5 presents a high-level comparison of the best recently published models on the STC dataset. CFLOW-AD outperforms the state-of-the-art SPADE [7] by 0.73% AUROC in anomaly detection and PaDiM [8] by 3.28% AUROC in anomaly localization.

Note that our CFLOW-AD models in Tables 3-4 use variable input resolution as discussed in the ablation study: 512×512, 256×256 or 128×128 depending on the MVTec class. We used a fixed 256×256 input resolution in Table 5 for the large STC dataset to decrease training time. Other reference hyperparameters in Tables 4-5 are set as: WideResnet-50 encoder with 3-scale pooling layers, conditional normalizing flow decoders with 8 coupling layers.

Figure 3: Examples of the input images with ground truth anomaly masks (top row) for various classes of the MVTec. Our CFLOW-AD model from Table 4 estimates anomaly score maps (middle row) and generates segmentation masks (bottom row) for a threshold selected to maximize F1-score. The predicted segmentation mask should match the corresponding ground truth as closely as possible.
Figure 4: Distribution of anomaly scores for the cable class from the MVTec learned by the CFLOW-AD model from Table 4. The green density represents scores for the anomaly-free feature vectors, while the red density shows scores for feature vectors with anomalies. The threshold is selected to optimize F1-score.

5.4 Qualitative results

Figure 3 visually shows examples from the MVTec and the corresponding CFLOW-AD predictions. The top row shows ground truth masks from the test dataset, including examples with and without anomalies. Then, our model produces anomaly score maps (middle row) using the architecture from Figure 2. Finally, we show the predicted segmentation masks with the threshold selected to maximize F1-score.

Figure 4 presents additional evidence that our CFLOW-AD model actually addresses the OOD task sketched in the Figure 1 toy example. We plot the distribution of output anomaly scores for anomaly-free (green) and anomalous feature vectors (red). Then, CFLOW-AD is able to distinguish in-distribution and out-of-distribution feature vectors and separate them using a scalar threshold $\tau$.
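Selecting a threshold that maximizes the F1-score, as used for the segmentation masks and the score distributions above, can be sketched with an exhaustive search over the observed scores. The helpers `f1_score` and `best_f1_threshold` and the toy scores are hypothetical; a production evaluation would typically derive the candidates from a precision-recall curve instead.

```python
def f1_score(scores, labels, threshold):
    """F1 of the binary prediction `score >= threshold` against labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_score(scores, labels, t))

# Hypothetical anomaly scores for anomaly-free (0) and anomalous (1) feature vectors.
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.75, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
tau = best_f1_threshold(scores, labels)
mask = [int(s >= tau) for s in scores]
```

In this toy case the two score populations overlap, so the best threshold accepts one false positive in exchange for recovering every anomalous vector.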

5.5 Complexity evaluations

Complexity metric and Model Inference speed, fps Model size, MB
R18 encoder only 80 / 62 45
PaDiM-R18 [8] 4.4 210 170
CFLOW-AD-R18 34 / 12 96
WRN50 encoder only 62 / 30 268
SPADE-WRN50 [7] 0.1 37,000 1,400
PaDiM-WRN50 [8] 1.1 5,200 3,800
CFLOW-AD-WRN50 27 / 9 947
MNetV3 encoder only 82 / 61 12
CFLOW-AD-MNetV3 35 / 12 25
Table 6: Complexity comparison in terms of inference speed (fps) and model size (MB). Inference speed for CFLOW-AD models from Table 3 is measured for (256×256) / (512×512) inputs.

In addition to analytical estimates in Table 1, we present the actual complexity evaluations for the trained models using inference speed and model size metrics. Particularly, Table 1 compares CFLOW-AD with the models from Tables 4-5 that have been studied by Defard  [8].

The model size in Table 6 is measured as the size of all floating-point parameters in the corresponding model its encoder and decoder (post-processing) models. Because the encoder architectures are identical, only the post-processing models are different. Since CFLOW-AD decoders do not explicitly depend on the feature map dimensions (only on feature vector depths), our model is significantly smaller than SPADE and PaDiM. If we exclude the encoder parameters for fair comparison, CFLOW-AD is 1.7 to 50 smaller than SPADE and 2 to 7 smaller than PaDiM.

Inference speed in Table 6 for SPADE and PaDiM is taken from the study of Defard et al. [8], where it is measured on an Intel i7 CPU with 256×256 inputs. We deduce that this suboptimal CPU choice was made due to the large memory requirements of these models in Table 6, which make their GPU allocation for fast inference infeasible. In contrast, our CFLOW-AD can be processed in real time with 8× to 25× faster inference speed on a 1080 8GB GPU with the same input resolution and feature extractor. In addition, the MobileNetV3L encoder provides a good trade-off between accuracy, model size and inference speed for practical inspection systems.
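As an illustration of how a frames-per-second figure of this kind can be obtained, the sketch below times repeated calls of an inference function after a few warm-up runs. The `measure_fps` helper and the `dummy_infer` workload are hypothetical stand-ins rather than the measurement code used for Table 6. With a real PyTorch model on a GPU, the device would also have to be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock. The speedups are computed from the 256×256 entries of Table 6.

```python
import time


def measure_fps(infer, n_warmup=5, n_runs=50):
    """Average number of infer() calls per second over n_runs timed calls."""
    for _ in range(n_warmup):  # warm-up calls are excluded from the timing
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


def dummy_infer():
    # Placeholder workload standing in for one forward pass on one image.
    sum(i * i for i in range(10_000))


fps = measure_fps(dummy_infer)

# Speedup of CFLOW-AD over PaDiM with the same encoder (fps values from Table 6).
speedups = {
    "R18": round(34 / 4.4, 1),
    "WRN50": round(27 / 1.1, 1),
}
print(f"dummy workload: {fps:.0f} fps, speedups: {speedups}")
```

The two ratios, about 7.7 and 24.5, correspond to the 8× to 25× range stated above; note that the baselines were timed on a CPU, so the comparison mixes hardware as discussed in the text.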

6 Conclusions

We proposed to use a conditional normalizing flow framework to estimate exact data likelihoods, which is infeasible in other generative models. Moreover, we analytically showed the relationship of this framework to previous distance-based models with a multivariate Gaussian prior.

We introduced the CFLOW-AD model, which addresses the complexity limitations of existing unsupervised AD models by employing a fully-convolutional translation-equivariant architecture. As a result, CFLOW-AD is faster and smaller by a factor of 10× than prior models with the same input resolution and feature extractor setup.

CFLOW-AD achieves a new state of the art on the popular MVTec dataset with 98.26% AUROC in detection, and 98.62% AUROC and 94.60% AUPRO in localization. Our new state of the art on the STC dataset is 72.63% and 94.48% AUROC in detection and localization, respectively. Our ablation study analyzed design choices for practical real-time processing, including the feature extractor choice, the multi-scale pyramid pooling setup and the flow model hyperparameters.


References

  • [1] L. Ardizzone, J. Kruse, C. Rother, and U. Köthe (2019) Analyzing inverse problems with invertible neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §5.1.
  • [2] L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. arXiv:1907.02392. Cited by: §2, §4.2.
  • [3] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. arXiv:1804.04488. Cited by: §2, §2, §2.
  • [4] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger (2019) MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §1, §1, §1, §2, §2, §2, §5.1, §5.1, Table 2, Table 3, Table 4.
  • [5] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger (2020) Uninformed students: student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [6] P. Bergmann, S. Löwe, M. Fauser, D. Sattlegger, and C. Steger (2019) Improving unsupervised defect segmentation by applying structural similarity to autoencoders. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Cited by: §2.
  • [7] N. Cohen and Y. Hoshen (2021) Sub-image anomaly detection with deep pyramid correspondences. arXiv:2005.02357v3. Cited by: §2, §2, §2, §2, §3.3, §4.1, §4.3, Table 1, §5.3, Table 3, Table 4, Table 5, Table 6.
  • [8] T. Defard, A. Setkov, A. Loesch, and R. Audigier (2021) PaDiM: a patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition (ICPR) Workshops, Cited by: §1, §2, §3.2, §3.3, §4.1, §4.3, Table 1, §5.1, §5.3, §5.5, §5.5, Table 3, Table 4, Table 5, Table 6.
  • [9] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real NVP. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2, §3.3, §3.3.
  • [10] M. Goldstein and S. Uchida (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE. Cited by: §1.
  • [11] I. J. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §3.1, §3.3.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §3.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, §5.2.
  • [14] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §5.2.
  • [15] P. Kirichenko, P. Izmailov, and A. G. Wilson (2020) Why normalizing flows fail to detect out-of-distribution data. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [17] A. Krogh and J. Hertz (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, Cited by: §3.1.
  • [18] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • [19] C. Li, K. Sohn, J. Yoon, and T. Pfister (2021) CutPaste: self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.1, Table 3, Table 4.
  • [20] Y. Li, J. Wu, X. Bai, X. Yang, X. Tan, G. Li, S. Wen, H. Zhang, and E. Ding (2020) Multi-granularity tracking with modularlized components for unsupervised vehicles anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §1.
  • [21] W. Luo, W. Liu, and S. Gao (2017) A revisit of sparse coding based anomaly detection in stacked RNN framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2, §5.1, Table 5.
  • [22] W. Luo, Y. Li, R. Urtasun, and R. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §4.1.
  • [23] P. C. Mahalanobis (1936) On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta). Cited by: §2, §3.2.
  • [24] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019) Do deep generative models know what they don’t know? In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [25] P. Napoletano, F. Piccoli, and R. Schettini (2018) Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors. Cited by: §2, §2, §2.
  • [26] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys. Cited by: footnote 2.
  • [27] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan (2021) Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research. Cited by: §3.3.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Autodiff workshop at Advances in Neural Information Processing Systems, Cited by: §5.1.
  • [29] O. Rippel, P. Mertens, and D. Merhof (2020) Modeling the distribution of normal data in pre-trained deep features for anomaly detection. arXiv:2005.14140. Cited by: §2, §3.2, §3.2.
  • [30] M. Rudolph, B. Wandt, and B. Rosenhahn (2021) Same same but DifferNet: semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Cited by: §2, Table 4.
  • [31] L. Ruff, J. R. Kauffmann, R. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE. Cited by: footnote 2.
  • [32] B. Saleh, A. Farhadi, and A. Elgammal (2013) Object-centric anomaly detection by attribute-based reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [33] M. Salehi, N. Sadjadi, S. Baselizadeh, M. H. Rohban, and H. R. Rabiee (2021) Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [34] R. T. Schirrmeister, Y. Zhou, T. Ball, and D. Zhang (2020) Understanding anomaly detection with deep invertible networks through hierarchies of distributions and features. In Advances in Neural Information Processing Systems, Cited by: §2, §2, §4.1.
  • [35] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth (2019) F-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis. Cited by: §2, §2, §2.
  • [36] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, Cited by: §2, §2, §2.
  • [37] Y. Shi, J. Yang, and Z. Qi (2021) Unsupervised anomaly segmentation via deep feature reconstruction. Neurocomputing. Cited by: §2, §2, §2, Table 4.
  • [38] A. Vasilev, V. Golkov, M. Meissner, I. Lipp, E. Sgarlata, V. Tomassini, D. K. Jones, and D. Cremers (2019) q-space novelty detection with variational autoencoders. MICCAI 2019 International Workshop on Computational Diffusion MRI. Cited by: §2, §2, §2.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §4.2.
  • [40] S. Venkataramanan, K. Peng, R. V. Singh, and A. Mahalanobis (2020) Attention guided anomaly localization in images. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 5.
  • [41] G. Wang, S. Han, E. Ding, and D. Huang (2021) Student-teacher feature pyramid matching for unsupervised anomaly detection. arXiv:2103.04257. Cited by: §2.
  • [42] J. Yi and S. Yoon (2020) Patch SVDD: patch-level SVDD for anomaly detection and segmentation. In Proceedings of the Asian Conference on Computer Vision (ACCV), Cited by: §2, §2, §4.1, Table 4.
  • [43] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §2, §3.1, §5.2.
  • [44] K. Zhou, Y. Xiao, J. Yang, J. Cheng, W. Liu, W. Luo, Z. Gu, J. Liu, and S. Gao (2020) Encoding structure-texture relation with P-Net for anomaly detection in retinal images. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1.