Uninformed Students: Student–Teacher Anomaly Detection with Discriminative Latent Embeddings
We introduce a simple, yet powerful student–teacher framework for the challenging problem of unsupervised anomaly detection and pixel-precise anomaly segmentation in high-resolution images. To circumvent the need for prior data labeling, student networks are trained to regress the output of a descriptive teacher network that was pretrained on a large dataset of patches from natural images. Anomalies are detected when the student networks fail to generalize outside the manifold of anomaly-free training data, i.e., when the output of the student networks differs from that of the teacher network. Additionally, the intrinsic uncertainty in the student networks can be used as a scoring function that indicates anomalies. We compare our method to a large number of existing deep-learning-based methods for unsupervised anomaly detection. Our experiments demonstrate improvements over state-of-the-art methods on a number of real-world datasets, including the recently introduced MVTec Anomaly Detection dataset that was specifically designed to benchmark anomaly segmentation algorithms.
Unsupervised pixel-precise segmentation of regions that appear anomalous or novel to a machine learning model is an important and challenging task in many domains of computer vision. In automatic industrial inspection scenarios, it is often desirable to train models solely on a single class of anomaly-free images to segment defective regions during inference. In an active learning setting, regions that are detected as previously unknown by the current model can be included into the training set to improve the model's performance.
Recently, efforts have been made to improve anomaly detection in one-class or multi-class classification scenarios [1, 2, 9, 10, 18, 25]. However, these algorithms assume that anomalies manifest themselves in the form of images of an entirely different class, and a simple binary image-level decision of whether the image is anomalous or not must be made. Little work has been directed towards the development of methods that can segment anomalous regions that only differ in a very subtle way from the training data manifold. Recently, Bergmann et al. [6] provided benchmarks for several state-of-the-art algorithms and identified considerable room for improvement.
Existing works predominantly focus on generative algorithms such as Generative Adversarial Networks (GANs) [27, 28] or Variational Autoencoders (VAEs) [4, 32]. They detect anomalies using per-pixel reconstruction errors or by evaluating the density obtained from the model's probability distribution. This has been shown to be problematic due to inaccurate reconstructions or poorly calibrated likelihoods [7, 19].

Discriminative embeddings from pretrained networks for transfer learning improve the performance of many supervised computer vision algorithms [15, 30]. For unsupervised anomaly detection, such approaches have not been thoroughly explored so far. Recent work suggests that these feature spaces generalize well for anomaly detection and that even simple baselines outperform generative deep learning approaches [9, 23]. However, the performance of existing methods on large high-resolution image datasets is hampered by the use of shallow machine learning pipelines that require a dimensionality reduction of the used feature space. Moreover, they rely on heavy training data subsampling, since their capacity does not suffice to model highly complex data distributions with a large number of training samples.

We propose to circumvent these limitations of shallow models by implicitly modeling the distribution of training features with a student–teacher approach. This leverages the high capacity of deep neural networks and frames anomaly detection as a feature regression problem. Given a descriptive feature extractor pretrained on a large dataset of patches from natural images (the teacher), we train an ensemble of student networks on anomaly-free training data to mimic the teacher's output. During inference, the students' predictive uncertainty together with their regression error with respect to the teacher are combined to yield dense anomaly scores for each input pixel. Our intuition is that students will generalize poorly outside the manifold of anomaly-free training data and start to make wrong predictions. Figure 1 shows qualitative results of our method when applied to images from the MVTec Anomaly Detection dataset [6]. A schematic overview of the entire anomaly detection process is given in Figure 2.

Our main contributions are:
We propose a novel framework for unsupervised anomaly detection based on student–teacher learning. Local descriptors from a pretrained teacher network serve as surrogate labels for an ensemble of students. Our models can be trained end-to-end on large unlabelled image datasets and make use of all available training data.
We introduce scoring functions based on the students’ predictive variance and regression error to obtain dense anomaly maps for the segmentation of anomalous regions in natural images. We describe how to extend our approach to segment anomalies at multiple scales by adapting the students’ and teacher’s receptive fields.
We demonstrate state-of-the-art performance on three real-world computer vision datasets. We compare to a number of shallow machine learning classifiers and deep generative models that are fitted directly to the teacher's feature distribution. We also compare our method to recently introduced deep-learning-based methods for unsupervised anomaly segmentation.
There exists an abundance of literature on anomaly detection [24]. Deep-learning-based methods for the segmentation of anomalies strongly focus on generative models such as autoencoders [7] or GANs [28]. These attempt to learn representations from scratch, leveraging no prior knowledge about the nature of natural images, and segment anomalies by comparing the input image to a reconstruction in pixel space. This can result in poor anomaly detection performance due to simple per-pixel comparisons or imperfect reconstructions [7].
Promising results have been achieved by transferring discriminative embedding vectors of pretrained networks to the task of anomaly detection by fitting shallow machine learning models on the features of anomalyfree training data.
Andrews et al. [2] use activations from different layers of a pretrained VGG network and model the anomaly-free training distribution with an SVM. However, they only apply their algorithm to image classification and do not consider the segmentation of anomalous regions. Similar experiments have been performed by Burlina et al. [9]. They report superior performance of discriminative embeddings compared to feature spaces obtained from generative models.
Nazare et al. [21] investigate the performance of different off-the-shelf feature extractors pretrained on an image classification task for the segmentation of anomalies in surveillance videos. Their approach trains a 1-Nearest-Neighbor (1-NN) classifier on embedding vectors extracted from a large number of anomaly-free training patches. Prior to the training of the shallow classifier, the dimensionality of the network's activations is reduced using Principal Component Analysis (PCA). To obtain a spatial anomaly map during inference, the classifier must be evaluated for a large number of overlapping patches, which quickly becomes a performance bottleneck and results in rather coarse anomaly maps. Similarly, Napoletano et al. [20] extract activations from a pretrained ResNet-18 for a large number of cropped training patches and model their distribution using K-Means clustering after prior dimensionality reduction with PCA. They also perform strided evaluation of test images during inference. Both approaches sample training patches from the input images and therefore do not make use of all possible training features. This is necessary since, in their framework, feature extraction is computationally expensive due to the use of very deep networks that output only a single descriptor per patch. Furthermore, since shallow models are employed for learning the feature distribution of anomaly-free patches, the available training information must be strongly reduced.
To circumvent the need for cropping patches and to speed up feature extraction, Sabokrou et al. [26] extract descriptors from early feature maps of a pretrained AlexNet in a fully convolutional fashion and fit a unimodal Gaussian distribution to all available training vectors of anomaly-free images. Even though feature extraction is achieved more efficiently in their framework, pooling layers lead to a downsampling of the input image. This strongly decreases the resolution of the final anomaly map, especially when using descriptive features of deeper network layers with larger receptive fields. In addition, unimodal Gaussian distributions will fail to model the training feature distribution as soon as the problem complexity rises.
Our work draws some inspiration from the recent success of open-set recognition in supervised settings such as image classification or semantic segmentation, where uncertainty estimates of deep neural networks have been exploited to detect out-of-distribution inputs using MC Dropout [13] or deep ensembles [17]. Seeboeck et al. [29] demonstrate that uncertainties from segmentation networks trained with MC Dropout can be used to detect anomalies in retinal OCT images. Beluch et al. [5] show that the variance of network ensembles trained on an image classification task serves as an effective acquisition function for active learning. Inputs that appear anomalous to the current model are added to the training set to quickly enhance its performance.

Such algorithms, however, demand prior labeling of images for a supervised task by domain experts, which is not always possible or desirable. In our work, we utilize feature vectors of pretrained networks as surrogate labels for the training of an ensemble of student networks. The predictive variance together with the regression error of the ensemble's output mixture distribution can then be used as a scoring function to segment anomalous regions in test images.
This section describes the core principles of our proposed method. Given a training dataset D of anomaly-free images, our goal is to create an ensemble of student networks that can later detect anomalies in test images J, i.e., that can assign a score to each pixel indicating how much it deviates from the training data manifold. For this, the student models are trained against regression targets obtained from a descriptive teacher network T̂ pretrained on a large dataset of natural images. After the training, anomaly scores can be derived for each image pixel from the students' regression error and predictive variance. Given an input image I of width w, height h, and number of channels C, each student S_i in the ensemble outputs a feature map of the same spatial size. It contains a descriptor of dimension d for each input image pixel at row r and column c. By design, we limit the students' receptive field, such that each descriptor describes a square local image region of I centered at (r, c) of side length p. The teacher T̂ has the same network architecture as the student networks. However, it remains constant and extracts a descriptive embedding vector for each pixel of the input image I that serves as a deterministic regression target during student training.
We begin by describing how to efficiently construct a descriptive teacher network T̂ using metric learning and knowledge distillation techniques. In existing work for anomaly detection with pretrained networks, feature extractors only output single feature vectors for patch-sized inputs or spatially heavily downsampled feature maps [20, 26]. In contrast, our teacher network T̂ efficiently outputs descriptors for every possible square of side length p within the input image. T̂ is obtained by first training a network T to embed patch-sized images of side length p into a metric space of dimension d, using only convolution and max-pooling layers. Fast dense local feature extraction for an entire input image can then be achieved by a deterministic network transformation of T to T̂, as described in [3]. This yields significant speedups compared to previously introduced methods that perform patch-based strided evaluations. To let T output semantically strong descriptors, we investigate both self-supervised metric learning techniques as well as distilling knowledge from a descriptive but computationally inefficient pretrained network. A large number of training patches p can be obtained by random crops from any image database, e.g., ImageNet
[16].

Patch descriptors obtained from deep layers of CNNs trained on image classification tasks perform well for anomaly detection when modeling their distribution with shallow machine learning models [20, 21]. However, the architectures of such CNNs are usually highly complex and computationally inefficient for the extraction of local patch descriptors. Therefore, we attempt to distill the knowledge of a powerful pretrained network P into T by matching the output of P with a decoded version of the descriptor obtained from T:

L_k(T) = ||D(T(p)) − P(p)||²,

where D denotes a fully connected network that decodes the d-dimensional output of T to the output dimension of the pretrained network's descriptor.
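As a rough sketch of this distillation objective, the following uses a plain linear map standing in for the fully connected decoder; all variable names (t_out, p_out, decoder_w) are illustrative and not from the paper:

```python
import numpy as np

def distillation_loss(t_out, p_out, decoder_w):
    """Sketch of the knowledge-distillation term: the teacher's d-dimensional
    descriptor T(p) is decoded (here by a linear map standing in for the
    fully connected decoder D) and matched against the pretrained network's
    descriptor P(p) via a squared L2 distance."""
    decoded = decoder_w @ t_out                    # D(T(p)): d -> dim of P(p)
    return float(np.sum((decoded - p_out) ** 2))   # squared L2 distance

# toy example: d = 128, pretrained descriptor dimension 512
rng = np.random.default_rng(0)
loss = distillation_loss(rng.standard_normal(128),
                         rng.standard_normal(512),
                         rng.standard_normal((512, 128)) * 0.01)
```

In training, this term would be averaged over a minibatch of patches and combined with the metric-learning losses described next.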
If for some reason pretrained networks are unavailable, one can also learn local image descriptors in a fully self-supervised way [11]. Here, we investigate the performance of discriminative embeddings obtained using triplet learning. For every randomly cropped patch p, a triplet of patches (p, p⁺, p⁻) is augmented. The positive patch p⁺ is obtained by a small random translation around p, changes in image luminance, and the addition of Gaussian noise. The negative patch p⁻ is created by a random crop from a randomly chosen different image. In-triplet hard negative mining with anchor swap [33] is used as a loss function for learning an embedding sensitive to the ℓ2 metric:

L_m(T) = max{0, δ + δ⁺ − δ⁻},

where δ > 0 denotes the margin parameter and the in-triplet distances δ⁺ and δ⁻ are defined as:

δ⁺ = ||T(p) − T(p⁺)||²,
δ⁻ = min{ ||T(p) − T(p⁻)||², ||T(p⁺) − T(p⁻)||² }.
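A minimal sketch of this hinge loss with anchor swap, operating on precomputed embeddings (argument names are illustrative):

```python
import numpy as np

def triplet_loss_anchor_swap(e_a, e_p, e_n, margin=1.0):
    """Hinge triplet loss with in-triplet hard negative mining ("anchor
    swap"): the negative distance is the smaller of anchor-negative and
    positive-negative, so the harder of the two pairs is penalized.
    e_a, e_p, e_n are the embeddings of anchor, positive, and negative patch."""
    d_pos = np.sum((e_a - e_p) ** 2)              # delta+
    d_neg = min(np.sum((e_a - e_n) ** 2),         # delta- with anchor swap
                np.sum((e_p - e_n) ** 2))
    return float(max(0.0, margin + d_pos - d_neg))
```

When the negative is farther from both anchor and positive than the margin, the loss vanishes; otherwise the harder of the two negative pairs drives the gradient.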
As proposed by Vassileios et al. [31], we minimize the correlation between descriptors within one minibatch of inputs p in order to increase the descriptors' compactness and remove unnecessary redundancy:

L_c(T) = Σ_{i≠j} c_{ij},

where c_{ij} denotes the entries of the correlation matrix computed over all descriptors in the current minibatch.
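One plausible reading of this compactness penalty, summing the off-diagonal entries of the minibatch correlation matrix (the exact aggregation used in practice may differ, e.g. absolute or squared entries):

```python
import numpy as np

def correlation_loss(batch):
    """Compactness penalty sketch: sum of the off-diagonal entries of the
    correlation matrix over one minibatch of descriptors.
    batch: array of shape (n_samples, d), rows are descriptors."""
    c = np.corrcoef(batch, rowvar=False)   # d x d correlation matrix
    return float(c.sum() - np.trace(c))    # off-diagonal sum (diagonal is 1s)
```

Redundant (strongly correlated) descriptor dimensions inflate this term, pushing the embedding towards decorrelated components.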
The final training loss for T is then given as

L(T) = λ_k L_k(T) + λ_m L_m(T) + λ_c L_c(T),

where λ_k, λ_m, λ_c ≥ 0 are weighting factors for the individual loss terms. Figure 3 summarizes the entire learning process for the teacher's discriminative embedding.
Method | MNIST | CIFAR-10
OCGAN | 0.9750 | 0.6566
1-NN | 0.9753 | 0.8189
K-Means | 0.9457 | 0.7592
OC-SVM | 0.9463 | 0.7388
AE | 0.9832 | 0.7898
VAE | 0.9535 | 0.7502
Ours (✓ ✓) | 0.9935 | 0.8196
Ours (✓ ✓ ✓) | 0.9926 | 0.8035
Ours (✓ ✓) | 0.9935 | 0.7940
Ours (✓) | 0.9917 | 0.8021

(Checkmarks indicate which of the teacher's loss terms were active during pretraining.)
[Table II: per-region overlap (PRO) for all evaluated methods on MVTec AD, per category. Textures: Carpet, Grid, Leather, Tile, Wood. Objects: Bottle, Cable, Capsule, Hazelnut, Metal nut, Pill, Screw, Toothbrush, Transistor, Zipper. Only the Mean row is intact: 0.857, 0.640, 0.479, 0.423, 0.790, 0.639, 0.694, 0.443, 0.515 (one value per evaluated method; the column-to-method assignment of the per-category values was lost).]
Next, we describe how to train student networks S_i to predict the teacher's output on anomaly-free training data. We then derive anomaly scores from the students' predictive uncertainty and regression error during inference. First, the vectors of component-wise means μ ∈ R^d and standard deviations σ ∈ R^d over all training descriptors are computed for data normalization. Descriptors are extracted by applying T̂ to each image in the dataset D. We then train an ensemble of M randomly initialized student networks S_i, i ∈ {1, …, M}, that exhibit the identical network architecture as the teacher T̂. For an input image I, each student outputs its predictive distribution over the space of possible regression targets for each local image region centered at row r and column c. Note that the students' architecture with limited receptive field of size p allows us to obtain dense predictions for each image pixel with only a single forward pass, without having to actually crop the patches. The students' output vectors are modeled as a Gaussian distribution with constant covariance s²I, where μ_i(r, c) denotes the prediction made by S_i for the pixel at (r, c). Let y(r, c) denote the teacher's respective descriptor that is to be predicted by the students. The log-likelihood training criterion for each student network then simplifies to the squared ℓ2 distance in feature space:

L(S_i) = (1/wh) Σ_{(r,c)} || μ_i(r, c) − ( y(r, c) − μ ) diag(σ)⁻¹ ||²,

where diag(σ)⁻¹ denotes the inverse of the diagonal matrix filled with the values in σ.
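A minimal numpy sketch of this training criterion, assuming the teacher's descriptors are normalized by their component-wise training mean and standard deviation as described above (shapes and names are illustrative; in practice the maps come from convolutional networks):

```python
import numpy as np

def student_loss(student_map, teacher_map, mu, sigma):
    """Per-pixel squared feature-regression error of one student against the
    teacher's normalized descriptors. student_map and teacher_map have shape
    (h, w, d); mu and sigma are the component-wise mean and standard
    deviation of the teacher's descriptors over the training set."""
    target = (teacher_map - mu) / sigma                  # normalized labels
    err = np.sum((student_map - target) ** 2, axis=-1)   # (h, w) squared dists
    return float(err.mean())                             # average over pixels
```

Each student minimizes this loss independently; only the random initialization differs between ensemble members.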
Having trained each student to convergence, a mixture of Gaussians can be obtained at each image pixel by equally weighting the ensemble’s predictive distributions.
From it, measures of anomaly can be obtained in two ways. First, we propose to compute the regression error of the mixture's mean with respect to the teacher's surrogate label:

e(r, c) = || μ̂(r, c) − ( y(r, c) − μ ) diag(σ)⁻¹ ||²,

where μ̂(r, c) = (1/M) Σ_i μ_i(r, c) denotes the mean prediction of the ensemble and (y(r, c) − μ) diag(σ)⁻¹ the teacher's normalized descriptor.
The intuition behind this score is that the student networks will fail to regress the teacher's output within anomalous regions during inference, since the corresponding descriptors have not been observed during training. Note that this regression error is non-constant even for an ensemble of size M = 1, where only a single student is trained and anomaly scores can be efficiently obtained with only a single forward pass through the student and teacher network, respectively.
As a second measure of anomaly, we compute for each pixel the predictive uncertainty of the Gaussian mixture, as defined by Kendall et al. [13], assuming that the student networks generalize similarly for anomaly-free regions and differently in regions that contain novel information unseen during training:

v(r, c) = (1/M) Σ_i || μ_i(r, c) ||² − || μ̂(r, c) ||²,

where μ_i(r, c) is the prediction of student S_i and μ̂(r, c) the mean over all students.
To combine the two scores, the means e_mean, v_mean and standard deviations e_std, v_std of all regression errors e(r, c) and uncertainties v(r, c), respectively, over a validation set of anomaly-free images are computed. Summation of the normalized scores then yields the final anomaly score:

s̃(r, c) = ( e(r, c) − e_mean ) / e_std + ( v(r, c) − v_mean ) / v_std.
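The two scores and their combination can be sketched as follows, assuming the teacher's map is already normalized and taking validation statistics as inputs (all names are illustrative):

```python
import numpy as np

def anomaly_map(student_maps, teacher_map, e_stats=(0.0, 1.0), v_stats=(0.0, 1.0)):
    """Combine the ensemble's regression error and predictive variance into a
    dense anomaly map. student_maps: (M, h, w, d) predicted descriptors of
    each student; teacher_map: (h, w, d) normalized teacher descriptors;
    e_stats/v_stats: (mean, std) of each raw score on anomaly-free
    validation images."""
    mean_pred = student_maps.mean(axis=0)                 # ensemble mean
    e = np.sum((mean_pred - teacher_map) ** 2, axis=-1)   # regression error
    # predictive variance of the equally weighted Gaussian mixture:
    v = np.mean(np.sum(student_maps ** 2, axis=-1), axis=0) \
        - np.sum(mean_pred ** 2, axis=-1)
    (e_mu, e_sd), (v_mu, v_sd) = e_stats, v_stats
    return (e - e_mu) / e_sd + (v - v_mu) / v_sd          # normalized sum
```

On anomaly-free pixels, both terms stay near their validation statistics and the combined score remains small.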
Figure 4 illustrates the basic principles of our anomaly detection method on the MNIST dataset, where images with label 0 were treated as the normal class and all other classes were treated as anomalous. Since the images of this dataset are very small, we extracted a single feature vector for each image using T̂ and trained an ensemble of patch-sized students to regress the teacher's output. This results in a single anomaly score for each input image. Feature descriptors were embedded into 2D using multidimensional scaling [8] to preserve their relative distances.
If an anomaly only covers a small part of the teacher's receptive field of size p, the extracted feature vector predominantly describes anomaly-free traits of the local image region. Consequently, the descriptor can be predicted well by the students and anomaly detection performance will decrease. One could tackle this problem by downsampling the input image. This would, however, lead to an undesirable loss in resolution of the output anomaly map.
Our framework allows explicit control over the size p of the students' and teacher's receptive field. Therefore, we can detect anomalies at various scales by training multiple student–teacher ensemble pairs with varying values of p. At each scale, an anomaly map with the same size as the input image is computed. Given L student–teacher ensemble pairs with different receptive fields, the normalized anomaly scores of each scale can be combined by simple averaging:

s̃_multi(r, c) = (1/L) Σ_{l=1}^{L} s̃_l(r, c).
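Since each ensemble pair already yields a normalized anomaly map of the input image's size, the multiscale combination reduces to an element-wise mean (a trivial sketch; names are illustrative):

```python
import numpy as np

def multiscale_anomaly_map(score_maps):
    """Average the already-normalized anomaly maps produced by L
    student-teacher ensemble pairs with different receptive fields.
    score_maps: list of L arrays, each with the input image's spatial size."""
    return np.mean(np.stack(score_maps, axis=0), axis=0)
```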
To demonstrate the effectiveness of our approach, an extensive evaluation on a number of datasets is performed. We measure the performance of our student–teacher framework against existing pipelines that use shallow machine learning algorithms to model the feature distribution of pretrained networks. To do so, we compare to K-Means, a One-Class SVM (OC-SVM), and a 1-NN classifier. They are fitted to the distribution of the teacher's descriptors after prior dimensionality reduction using PCA. We also experiment with deterministic and variational autoencoders as deep distribution models over the teacher's discriminative embedding. The reconstruction error [12] and reconstruction probability [1] are used as anomaly scores, respectively. We further compare our method to recently introduced generative and discriminative deep-learning-based anomaly detection models and report improved performance over the state of the art. We want to stress that the teacher has not observed images of the evaluated datasets during pretraining to avoid an unfair bias.
As a first experiment and ablation study to find suitable hyperparameters, our algorithm is applied to a one-class classification setting on the MNIST and CIFAR-10 datasets. We then evaluate on the much more challenging MVTec Anomaly Detection (MVTec AD) dataset, which was specifically designed to benchmark algorithms for the segmentation of anomalous regions. It provides over 5000 high-resolution images divided into ten object and five texture categories. To highlight the benefit of our multiscale approach, an additional ablation study is performed on MVTec AD, which investigates the impact of different receptive fields on the anomaly detection performance.
For our experiments, we use identical network architectures for the student and teacher networks, with receptive field sizes p ∈ {17, 33, 65}. All architectures are simple CNNs with only convolutional and max-pooling layers, using leaky rectified linear units (LReLUs) with slope 0.005 as the activation function. Table IV shows the specific architecture used for p = 65. Similar architectures for p = 17 and p = 33 are given in Appendix A.

For the pretraining of the teacher networks T, triplets augmented from the ImageNet dataset are used. Images are zoomed to equal width and height and a patch of side length p is cropped at a random location. A positive patch p⁺ for each triplet is then constructed by randomly translating the crop location by a small offset. Gaussian noise with standard deviation 0.1 is added to p⁺. All images within a triplet are randomly converted to grayscale with a probability of 0.1. For knowledge distillation, we extract 512-dimensional feature vectors from the fully connected layer of a ResNet-18 that was pretrained for classification on the ImageNet dataset. For network optimization, we use the Adam optimizer [14] with an initial learning rate of 2 × 10⁻⁴, a weight decay of 10⁻⁵, and a batch size of 64. Each teacher network outputs descriptors of dimension d = 128 and is trained for 50 000 iterations.
Before considering the problem of anomaly segmentation, we evaluate our method on the MNIST and CIFAR-10 datasets, adapted for one-class classification. Five students are trained on only a single class of the dataset, while during inference images of the other classes must be detected as anomalous. Each image is zoomed to the students' and teacher's input size and a single feature vector is extracted for each image by passing it through the patch-sized networks. We experiment with differently pretrained teacher networks, varying the weights λ_k, λ_m, λ_c in the teacher's loss function L(T). The patch size for the experiments in this subsection is set to p = 33. As a measure of anomaly detection performance, the area under the ROC curve (AUC) is evaluated. Shallow and deep distribution models are trained on the teacher's descriptors of all available in-distribution samples. We additionally report numbers for OCGAN [22], a recently proposed generative model directly trained on the input images. Detailed information regarding training parameters for all methods on this dataset can be found in Appendix B.
Table I shows our results. Our approach outperforms the other methods for a variety of hyperparameter settings. Distilling the knowledge of the pretrained ResNet-18 into the teacher's descriptor yields slightly better performance than training the teacher in a fully self-supervised way using triplet learning. Reducing descriptor redundancy by minimizing the correlation matrix entries yields improved results. On average, shallow models and autoencoders fitted to our teacher's feature distribution outperform OCGAN but do not reach the performance of our approach. Since 1-NN stores every single training vector, it performs exceptionally well on these relatively small datasets. On average, however, our method still outperforms all evaluated approaches.
For all our experiments on MVTec AD, input images are zoomed to size 256 × 256 pixels. We train for 100 epochs with batch size 1, which is equivalent to training on a large number of patches per batch due to the limited size of the networks' receptive field. We use the Adam optimizer with an initial learning rate of 10⁻⁴ and a weight decay of 10⁻⁵. The teacher network was trained with the loss-weight configuration that performed best in our MNIST and CIFAR-10 experiments. Ensembles were trained with M = 3 students.

For training shallow classifiers on the teacher's output descriptors, a subset of 50 000 vectors is randomly sampled from the teacher's feature maps. Their dimension is then reduced by PCA, retaining 95% of the variance. The variational and deterministic autoencoders are implemented using a simple fully connected architecture and are trained on all available descriptors. In addition to the models directly fitted to the teacher's feature distribution, we benchmark our approach against the best-performing deep-learning-based methods presented by Bergmann et al. [6] on this dataset. Specifically, these methods include the CNN-Feature Dictionary [20], the SSIM-Autoencoder [7], and AnoGAN [28]. Hyperparameters for each evaluated method are detailed in Appendix C.
[Table III: PRO on MVTec AD for different receptive fields; the per-category values of the single-scale columns were lost, only the multiscale column and the Mean row survived. Column order assumed to be p = 17, 33, 65, multiscale.]

Category | p = 17 | p = 33 | p = 65 | Multiscale
Carpet | - | - | - | 0.879
Grid | - | - | - | 0.952
Leather | - | - | - | 0.945
Tile | - | - | - | 0.946
Wood | - | - | - | 0.911
Bottle | - | - | - | 0.931
Cable | - | - | - | 0.818
Capsule | - | - | - | 0.968
Hazelnut | - | - | - | 0.965
Metal nut | - | - | - | 0.942
Pill | - | - | - | 0.961
Screw | - | - | - | 0.942
Toothbrush | - | - | - | 0.933
Transistor | - | - | - | 0.666
Zipper | - | - | - | 0.951
Mean | 0.866 | 0.900 | 0.857 | 0.914
We compute a threshold-independent measure based on the per-region overlap (PRO) as the evaluation metric. It weights ground-truth regions of different size equally, which is in contrast to simple per-pixel measures, for which a single large correctly segmented region can make up for many incorrectly segmented small ones. It was also used by Bergmann et al. in [6]. For computing the PRO metric, anomaly maps are first thresholded at a given anomaly score to make a binary decision for each pixel whether an anomaly is present or not. For each connected component within the ground truth, the percentage of overlap with the thresholded anomaly region is computed. We evaluate the PRO value for a large number of increasing thresholds until an average per-pixel false positive rate of 30% for the entire dataset is reached, and integrate the area under the PRO curve as a measure of anomaly detection performance. Note that for high false positive rates, large parts of the input images would be wrongly labeled as anomalous and even perfect PRO values of 1.0 would no longer be meaningful. We normalize the integrated area to a maximum achievable value of 1.0.

Table II shows our results when training each algorithm with a receptive field of p = 65 for comparability. Our method consistently outperforms all other evaluated algorithms for almost every dataset category. The shallow machine learning algorithms fitted directly to the teacher's descriptors after applying PCA do not manage to perform satisfactorily for most of the dataset categories. This shows that their capacity does not suffice to accurately model the large number of available training samples. The same can be observed for the CNN-Feature Dictionary. As was the case in our previous experiment on MNIST and CIFAR-10, 1-NN yields the best results amongst the shallow models. Utilizing a large number of training features together with deterministic autoencoders increases the performance, but still does not match that of our approach. Current generative methods for anomaly segmentation such as AnoGAN and the SSIM-autoencoder perform similarly to the shallow methods fitted to the discriminative embedding of the teacher. This indicates that there is indeed a gap between methods that learn representations for anomaly detection from scratch and methods that leverage discriminative embeddings as prior knowledge.
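The per-threshold PRO computation described above can be sketched as follows; the interface (a list of boolean masks, one per ground-truth connected component) is illustrative:

```python
import numpy as np

def pro_at_threshold(anomaly_map, gt_components, threshold):
    """Per-region overlap at one threshold: binarize the anomaly map and
    average, over all ground-truth connected components, the fraction of
    each component covered by the prediction. gt_components is a list of
    boolean masks, one per component."""
    pred = anomaly_map >= threshold
    overlaps = [np.logical_and(pred, comp).sum() / comp.sum()
                for comp in gt_components]
    return float(np.mean(overlaps))
```

Sweeping the threshold, recording the dataset-wide false positive rate alongside each PRO value, and integrating up to 30% FPR yields the normalized area reported in the tables.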
Layer | Output Size | Kernel | Stride
Input | 65×65×3 | - | -
Conv1 | 61×61×128 | 5×5 | 1
MaxPool | 30×30×128 | 2×2 | 2
Conv2 | 26×26×128 | 5×5 | 1
MaxPool | 13×13×128 | 2×2 | 2
Conv3 | 9×9×256 | 5×5 | 1
MaxPool | 4×4×256 | 2×2 | 2
Conv4 | 1×1×256 | 4×4 | 1
Conv5 | 1×1×128 | 3×3 | 1
Decode | 1×1×512 | 1×1 | 1


Table III shows the performance of our algorithm for different receptive field sizes p ∈ {17, 33, 65} and when combining multiple scales. For some objects, such as bottle and cable, larger receptive fields yield better results. For others, such as wood and toothbrush, the inverse behavior can be observed. Combining multiple scales enhances the performance for many of the dataset categories. A qualitative example highlighting the benefit of our multiscale anomaly segmentation is visualized in Figure 5.
We have proposed a novel framework for the challenging problem of unsupervised anomaly segmentation in natural images. Anomaly scores are derived from the predictive variance and regression error of an ensemble of student networks, trained against surrogate labels obtained from a descriptive teacher network. Ensemble training can be performed end-to-end and purely on anomaly-free training data without requiring prior data annotation. Our approach can be easily extended to detect anomalies at multiple scales. We demonstrate improvements over current state-of-the-art methods on a number of real-world computer vision datasets for one-class classification and anomaly segmentation.
A description of the network architecture for a patch-sized teacher network with a receptive field of size p = 65 can be found in our main paper. Architectures for teachers with receptive field sizes p = 17 and p = 33 can be found in Tables IV(a) and IV(b), respectively. Leaky rectified linear units with slope 0.005 are used as activation function after each fully connected layer.
We give details about additional hyperparameters for our experiments on the MNIST and CIFAR-10 datasets. We additionally provide the per-class ROC-AUC values for the two datasets in Tables VI and VII, respectively.
For the deterministic autoencoder (AE) and the variational autoencoder (VAE), we use a fully connected encoder architecture of shape 128–64–32–10 with leaky rectified linear units of slope 0.005. The decoder is constructed in a manner symmetric to the encoder. Both autoencoders are trained for 100 epochs at an initial learning rate of 0.01 using the Adam optimizer and a batch size of 64. Weight decay is applied for regularization. To evaluate the reconstruction probability of the VAE, five independent forward passes are performed for each feature vector. For the OC-SVM, the radial basis function kernel is used. K-Means is trained with ten cluster centers and the distance to the single closest cluster center is evaluated as the anomaly score for each input sample. For 1-NN, the feature vectors of all available training samples are stored and tested during inference.
We give additional information on the hyperparameters used in our experiments on MVTec AD for both shallow machine learning models as well as deep learning methods.
For the 1-NN classifier, we construct a dictionary of 5000 feature vectors and take the distance to the closest training sample as the anomaly score. For the other shallow classifiers, we fit their parameters on 50 000 training samples, randomly chosen from the teacher's feature maps. The K-Means algorithm is run with ten cluster centers and measures the distance to the nearest cluster center in the feature space during inference. The OC-SVM is evaluated with a radial basis function as the kernel.
For the evaluation on MVTec AD, the architectures of the AE and VAE are identical to the ones used on the MNIST and CIFAR-10 datasets. Each fully connected autoencoder is trained for 100 epochs using the Adam optimizer, with weight decay applied for regularization. Batches are constructed from 512 randomly sampled vectors of the teacher's feature maps. The reconstruction probability of the VAE is computed by five individual forward passes through the network. For the evaluation of AnoGAN, the SSIM-Autoencoder, and the CNN-Feature Dictionary, we use the same hyperparameters as Bergmann et al. in the MVTec AD dataset paper [6]. Only a slight adaptation is applied to the CNN-Feature Dictionary by cropping patches and computing anomaly scores for overlapping patches with a stride of four pixels.
Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Mean
OCGAN | 0.998 | 0.999 | 0.942 | 0.963 | 0.975 | 0.980 | 0.991 | 0.981 | 0.939 | 0.981 | 0.9750
1-NN | 0.989 | 0.998 | 0.962 | 0.970 | 0.980 | 0.955 | 0.979 | 0.981 | 0.968 | 0.971 | 0.9753
K-Means | 0.973 | 0.995 | 0.898 | 0.948 | 0.960 | 0.920 | 0.948 | 0.948 | 0.940 | 0.927 | 0.9457
OC-SVM | 0.980 | 0.998 | 0.887 | 0.944 | 0.964 | 0.909 | 0.949 | 0.957 | 0.935 | 0.940 | 0.9463
AE | 0.992 | 0.999 | 0.967 | 0.980 | 0.988 | 0.970 | 0.988 | 0.987 | 0.978 | 0.983 | 0.9832
VAE | 0.983 | 0.998 | 0.915 | 0.941 | 0.969 | 0.925 | 0.964 | 0.940 | 0.955 | 0.945 | 0.9535
Ours (✓ ✓) | 0.999 | 0.999 | 0.990 | 0.993 | 0.992 | 0.993 | 0.997 | 0.995 | 0.986 | 0.991 | 0.9935
Ours (✓ ✓ ✓) | 0.999 | 0.999 | 0.988 | 0.992 | 0.988 | 0.993 | 0.997 | 0.995 | 0.984 | 0.991 | 0.9926
Ours (✓ ✓) | 0.999 | 0.999 | 0.992 | 0.992 | 0.988 | 0.993 | 0.997 | 0.995 | 0.988 | 0.992 | 0.9935
Ours (✓) | 0.999 | 0.999 | 0.989 | 0.990 | 0.990 | 0.990 | 0.997 | 0.993 | 0.981 | 0.989 | 0.9917
Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Mean
OCGAN | 0.757 | 0.531 | 0.640 | 0.620 | 0.723 | 0.620 | 0.723 | 0.575 | 0.820 | 0.554 | 0.6566
1-NN | 0.792 | 0.860 | 0.746 | 0.729 | 0.815 | 0.797 | 0.876 | 0.836 | 0.856 | 0.882 | 0.8189
K-Means | 0.673 | 0.822 | 0.665 | 0.676 | 0.742 | 0.746 | 0.828 | 0.780 | 0.817 | 0.843 | 0.7592
OC-SVM | 0.651 | 0.785 | 0.618 | 0.679 | 0.733 | 0.730 | 0.797 | 0.760 | 0.799 | 0.836 | 0.7388
AE | 0.747 | 0.862 | 0.690 | 0.698 | 0.788 | 0.759 | 0.849 | 0.824 | 0.812 | 0.869 | 0.7898
VAE | 0.705 | 0.819 | 0.605 | 0.700 | 0.734 | 0.731 | 0.797 | 0.751 | 0.801 | 0.859 | 0.7502
Ours (✓ ✓) | 0.789 | 0.849 | 0.734 | 0.748 | 0.851 | 0.793 | 0.892 | 0.830 | 0.862 | 0.848 | 0.8196
Ours (✓ ✓ ✓) | 0.784 | 0.836 | 0.706 | 0.742 | 0.826 | 0.768 | 0.870 | 0.815 | 0.857 | 0.831 | 0.8035
Ours (✓ ✓) | 0.804 | 0.855 | 0.706 | 0.709 | 0.798 | 0.738 | 0.860 | 0.797 | 0.849 | 0.824 | 0.7940
Ours (✓) | 0.766 | 0.817 | 0.715 | 0.736 | 0.855 | 0.763 | 0.885 | 0.819 | 0.838 | 0.827 | 0.8021