
Anomaly Detection via Multi-Scale Contrasted Memory

by   Loic Jezequel, et al.

Deep anomaly detection (AD) aims to provide robust and efficient classifiers for one-class and unbalanced settings. However, current AD models still struggle on edge-case normal samples and often fail to maintain high performance across different scales of anomalies. Moreover, there is currently no unified framework that efficiently covers both one-class and unbalanced learning. In light of these limitations, we introduce a new two-stage anomaly detector which memorizes multi-scale normal prototypes during training to compute an anomaly deviation score. First, we simultaneously learn representations and memory modules on multiple scales using a novel memory-augmented contrastive learning scheme. Then, we train an anomaly distance detector on the spatial deviation maps between prototypes and observations. Our model substantially improves the state-of-the-art performance on a wide range of object, style and local anomalies, with up to 35% relative error improvement on CIFAR-10. It is also the first model to maintain high performance across the one-class and unbalanced settings.





1 Introduction

Detecting observations that stray from a well-defined normal baseline lies at the center of many modern machine learning challenges. Given the complexity of the anomalous class and the high cost of obtaining labeled anomalies, anomaly detection differs substantially from classical binary classification. This has given rise to many deep anomaly detection (AD) methods that produce more stable results on extremely unbalanced training datasets. Deep AD has been successful in various applications such as fraud detection [DBLP:conf/cvpr/FeiDYSX022], medical imaging [DBLP:conf/cvpr/SalehiSBRR21], video surveillance [DBLP:conf/cvpr/DoshiY21] and defect detection [DBLP:conf/cvpr/RothPZSBG22].

However, existing anomaly detection models still present some limitations. (1) There is a hard trade-off between remembering edge-case normal samples and remaining generalizable enough toward anomalies. This lack of normal long-tail memorization often leads to high false reject rates on harder samples. (2) These models tend to focus on either local low-scale anomalies or global object oriented anomalies but fail to combine both. Current models often remain highly dataset-dependent and do not explicitly use multi-scaling. (3) Anomaly detection lacks an efficient unified framework which could easily tackle one-class and semi-supervised detection. Indeed existing methods are either introduced as one-class or semi-supervised detectors, with different specialized approaches and set of hyper-parameters.

In light of these limitations, we introduce in this paper a novel two-stage AD model named AnoMem, which memorizes multi-scale normal class prototypes during training to compute an anomaly deviation score at several scales. Contrary to previous methods equipped with memory banks [DBLP:conf/iccv/GongLLSMVH19, DBLP:conf/cvpr/ParkNH20], our normal memory layers encompass the normal class at several scales and improve not only anomaly detection but also the quality of the learned representations. Moreover, by using modern Hopfield layers for memorization, our method is much more efficient than nearest neighbor anomaly detectors [DBLP:journals/corr/abs-2002-10445, DBLP:journals/corr/abs-2112-02597]. These detectors require keeping the whole normal set, while ours learns the most representative samples within a fixed size. Through extensive experiments, we show that our method greatly outperforms all previous anomaly detectors using memory.

Figure 1: Comparison of our model with the best state-of-the-art methods on CIFAR-100 for different anomaly ratios.

Our main contributions in this paper are the following:

  • We propose to integrate memory modules into contrastive learning in order to remember normal class prototypes during training. In a first stage, we simultaneously learn representations using contrastive learning and memory modules (long-tail memorization). Then in a second stage, we learn to detect anomalies using the aforementioned prototypes. When additional anomalous samples are available, we train an anomaly distance detector on the spatial deviation maps between prototypes and observations. To the best of our knowledge, our algorithm is the first working well in both one-class and semi-supervised settings with a few anomalies (unifying framework).

  • AnoMem is further improved using multi-scale normal prototypes in both the representation learning and AD stages. We introduce a novel way to efficiently memorize 2D feature maps spatially. This enables our model to accurately detect both low-scale, texture-oriented anomalies and higher-scale, object-oriented anomalies (multi-scale anomaly detection).

  • We validate the efficiency of our method and compare it with SoTA methods on one-vs-all and anti-spoofing problems. Our model improves anomaly detection with up to 35% error relative improvement on object anomalies and 14% on face anti-spoofing.

2 Related work

Figure 2: Overview of AnoMem’s training: (a) the first representation learning stage, (b) the second anomaly detection stage. Learnable parts of the model are in dark gray.

2.1 Memory modules

A memory module should achieve two main operations: (i) writing inside a memory from a given set of samples (remembering) and (ii) recovering from a partial or corrupted input the most similar sample in its memory with minimal error (recalling). Most of the time, memory modules will differ on the amount of images they can memorize given a model size and the average reconstruction error.

The simplest memory module is a nearest neighbor queue. Given a maximum size, it remembers the most recent samples, up to that size, by enqueuing their representations. To recall an incomplete input, it retrieves that input's nearest neighbor from the queue.
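The queue described above can be sketched in a few lines. This is only an illustration: the class name `NearestNeighborQueue`, its method names and the use of the Euclidean distance are assumptions, not details taken from the paper.

```python
from collections import deque

import numpy as np


class NearestNeighborQueue:
    """FIFO memory that recalls the stored vector closest to a query."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest entry once full
        self.items = deque(maxlen=max_size)

    def remember(self, representation):
        self.items.append(np.asarray(representation, dtype=float))

    def recall(self, query):
        query = np.asarray(query, dtype=float)
        # Euclidean distance is an assumption; another metric could be used
        distances = [np.linalg.norm(item - query) for item in self.items]
        return self.items[int(np.argmin(distances))]


memory = NearestNeighborQueue(max_size=2)
for vector in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    memory.remember(vector)  # the third write evicts [0, 0]
recalled = memory.recall([0.9, 1.2])
```

The cost of a recall grows with the queue size, which is one reason a fixed-size learned memory can be attractive.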

A more effective memory module is the modern Hopfield layer [DBLP:conf/iclr/RamsauerSLSWGHA21]. It represents the memory as a learnable matrix of weights $X$, whose columns are the stored patterns, and retrieves samples by recursively applying the following formula until convergence:

$$\xi^{(t+1)} = X \, \mathrm{softmax}\left(\beta X^\top \xi^{(t)}\right)$$

where $\xi$ is the query vector and the scalar $\beta$ is the inverse temperature. Its form is similar to the attention mechanism in transformers, except that it reapplies the attention step until convergence. This layer has been shown to have a very high memory capacity and to remember samples with very low redundancy [DBLP:conf/iclr/RamsauerSLSWGHA21]. In the following sections we simply call this layer a Hopfield layer of a given size.
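As a minimal sketch of this retrieval, the update of [DBLP:conf/iclr/RamsauerSLSWGHA21], in which the query is replaced by `X @ softmax(beta * X.T @ query)` until it stops moving, can be written as follows. The function name, the stopping tolerance and the toy patterns are assumptions made for the example, and the memory matrix is fixed here rather than learned.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / e.sum()


def hopfield_retrieve(X, xi, beta=8.0, max_iter=50, tol=1e-6):
    """Iterate xi <- X softmax(beta X^T xi); X has shape (d, N)."""
    for _ in range(max_iter):
        new_xi = X @ softmax(beta * (X.T @ xi))
        if np.linalg.norm(new_xi - xi) < tol:
            return new_xi
        xi = new_xi
    return xi


# Three stored patterns as columns (toy values)
X = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
query = np.array([0.8, 0.3])  # corrupted version of the first pattern
retrieved = hopfield_retrieve(X, query)
```

With a large enough inverse temperature the softmax is sharply peaked, so the iteration settles close to a single stored pattern; a small value would instead return a blend of several patterns.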

2.2 Anomaly detection

Anomaly detection can be considered a binary classification problem where the normal class is usually well-defined and well-understood as being sampled from a distribution, whereas the anomaly class is implicitly defined as anything not normal. The anomaly class is significantly broader and more complex than the normal class, creating a natural imbalance between the required amounts of normal and anomalous data. We call unbalanced supervised AD, or semi-supervised AD (SSAD), the setting where a small additional set of anomalies is available. In one-class AD (OC-AD), the anomaly estimator is trained only on a sample of the normal class.

There exist several families of approaches for OC-AD. Pretext task methods learn to solve an auxiliary task on the normal data [DBLP:conf/avss/JezequelVBH21, DBLP:conf/nips/HendrycksMKS19, DBLP:conf/icpr/Jezequel22LPT, DBLP:conf/iclr/BergmanH20], different from the AD task. The inferred anomaly score describes how well the auxiliary task is performed on the input. Similarly, two-stage methods consist of a representation learning step and an anomaly score estimation step. After learning an encoder on the normal data using self-supervised learning [DBLP:conf/ijcai/ChoSL21, DBLP:conf/nips/TackMJS20, DBLP:conf/ijcai/ChenXLQZTZM21, DBLP:conf/icml/ZbontarJMLD21] or using an encoder pre-trained on generic datasets [DBLP:journals/corr/abs-2105-09270], the anomaly score is computed with a simple OC classifier fitted on the latent space [DBLP:conf/cvpr/LiSYP21, DBLP:journals/corr/abs-2106-03844, DBLP:conf/iclr/SohnLYJP21]. The methods [DBLP:journals/corr/abs-2002-10445, DBLP:journals/corr/abs-2112-02597] use a nearest neighbor queue to fetch the closest prototypical normal samples inside the latent space. The mean distance to these samples is then used as the anomaly score. Density estimation methods tackle the estimation of the distribution of the normal class using deep high-dimensional density estimators such as normalizing flows [DBLP:journals/corr/abs-2106-12894], likelihood ratio methods [DBLP:conf/nips/RenLFSPDDL19], variational models [DBLP:journals/corr/abs-1911-04971] or, more recently, diffusion models [DBLP:conf/cvpr/WyattLSW22, DBLP:journals/corr/abs-2205-14297]. Reconstruction methods measure the reconstruction error of a bottleneck encoder-decoder, trained using denoising autoencoders [DBLP:conf/cvpr/PereraNX19, DBLP:conf/cvpr/SchneiderASS22] or two-way GANs [DBLP:conf/accv/AkcayAB18, DBLP:conf/acpr/TuluptcevaBFK19, DBLP:conf/ipmi/SchleglSWSL17, DBLP:conf/icip/LiuLZHW21]. Some works use memory in the latent space of an auto-encoder [DBLP:conf/iccv/GongLLSMVH19, DBLP:conf/cvpr/ParkNH20] for AD. During training, the latent memory weights are learned to achieve optimal reconstruction on the normal class. Then, the reconstruction is performed from the closest latent vector fetched from the memory. More recently, knowledge distillation methods have been adapted to AD by using the representation discrepancy of anomalies in the teacher-student model [DBLP:conf/cvpr/DengL22, DBLP:conf/cvpr/CohenA22].

SSAD mainly revolves around two-stage methods and anomaly distance methods. We note, however, that some recent work has tried to generalize pretext tasks to SSAD [DBLP:conf/icpr/Jezequel22LPT]. In SSAD two-stage methods, a supervised classifier is trained with the anomalous samples in the second stage instead of the aforementioned one-class estimator [DBLP:conf/bmvc/0001S0PC21]. Distance methods directly use a distance to a centroid as the anomaly score and train the model to maximize the anomaly distance on anomalous samples and minimize it on normal samples [DBLP:conf/iclr/RuffVGBMMK20, DBLP:conf/icpr/Jezequel22CR]. To the best of our knowledge, no SSAD method in the literature uses any kind of memory mechanism for the anomaly score computation.

In this paper, we present a unified two-stage distance model which can be used in both settings.

2.3 Contrastive learning

Contrastive learning is a self-supervised representation learning method. It operates on the basis of two principles: (1) two different images should yield dissimilar representations, and (2) two different views of the same image should yield similar representations. The different views of an image are characterized by a set of invariance transformations. Many methods have been introduced to efficiently enforce these two principles: SimCLR [DBLP:conf/icml/ChenK0H20], Barlow-Twins [DBLP:conf/icml/ZbontarJMLD21] and VICReg [DBLP:conf/iclr/BardesPL22] with a siamese network, MoCo [DBLP:conf/cvpr/He0WXG20] with a negative sample bank, BYOL [DBLP:conf/nips/GrillSATRBDPGAP20] and SimSiam [DBLP:conf/cvpr/ChenH21] with a teacher-student network, or SwAV [DBLP:conf/nips/CaronMMGBJ20] with contrastive clusters. While some contrastive methods such as SimCLR and MoCo require negative samples, others such as BYOL, SimSiam and SwAV do not.

In the simplest formulation, the pairs of representations to contrast are taken only from two views of a batch, each augmented by transformations drawn from the invariance set. SimCLR minimizes, over the final features of those two augmented batches, the Normalized Temperature-scaled Cross Entropy loss (NT-Xent). In practice, minimizing this loss yields representations of the dataset with maximal angular spread, while retaining angular invariance with respect to the set of invariance transformations.
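To make the NT-Xent objective concrete, here is a small NumPy sketch of its usual form from [DBLP:conf/icml/ChenK0H20], averaged over the two views of a batch. The function name, the default temperature and the toy data are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np


def nt_xent(z1, z2, tau=0.5):
    """Mean NT-Xent loss over the 2B views of a batch; z1 and z2 have shape (B, d)."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors give cosine similarity
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)  # a view is never contrasted with itself
    B = z1.shape[0]
    # The positive of view i is the other view of the same image
    positives = np.concatenate([np.arange(B, 2 * B), np.arange(0, B)])
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * B), positives]))


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = nt_xent(anchors, anchors + 0.01 * rng.normal(size=(4, 8)))
shuffled = nt_xent(anchors, anchors[::-1].copy())  # views paired with the wrong image
```

The loss is lower when the two views of each image agree (`aligned`) than when views are paired with the wrong image (`shuffled`), which is the behavior the two contrastive principles call for.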


Memory mechanisms can also be used in contrastive learning, as proposed in [DBLP:conf/iccv/DwibediATSZ21, DBLP:conf/iccv/KoohpayeganiTP21]. During training, the positive and negative pairs are augmented with the samples' nearest neighbors from a memory queue. This allows the method to reach better performance with smaller batch sizes.

3 Proposed method

Sec. 3.1 first details a novel training procedure to simultaneously learn encoder representations and a set of multi-scale normal class prototypes. Sec. 3.2 then presents how to use the encoder and the normal prototypes to train a one-class or unbalanced anomaly detector in a unified framework. Our model training is fully summarized in Fig. 2.

Notation. The training dataset is made of normal samples and, in the case of unbalanced classification, of labeled anomalies. We use a backbone network composed of several stages, each producing a feature map at its own scale, with its own height, width and number of channels.

3.1 Memorizing normal class prototypes

In this section, we first introduce how memory modules can be used in the contrastive learning scheme to provide robust and representative normal class prototypes. Then we generalize our idea to several scales throughout the encoder.

Foremost, we choose a contrastive learning method rather than other unsupervised learning schemes, as it has been shown to produce better representations with very few labeled data [DBLP:journals/corr/abs-2105-05837]. We also favor self-supervised learning on the normal data over an encoder pre-trained on generic datasets [DBLP:journals/corr/abs-2105-09270], which often performs poorly on data with a significant distribution shift. In order to learn unsupervised representations and a set of normal prototypes, we could sequentially apply contrastive learning, then perform k-means clustering and use the cluster centroids as the normal prototypes. However, this approach has two main flaws. First, the representation learning step and the construction of prototypes are completely separated, even though several contrastive learning methods [DBLP:conf/iccv/DwibediATSZ21, DBLP:conf/nips/CaronMMGBJ20] have shown that including a few representative samples in the negative examples can significantly improve the representation quality and alleviate the need for large batches. Second, the resulting k-means prototypes often fail to cover atypical samples. Harder normal samples are then not well encompassed by the normal prototypes, resulting in a high false rejection rate during AD. We compare our approach with k-means centroids in Sec. 4.4.

In light of the flaws of this classical pipeline, we introduce a novel approach based on memory modules to simultaneously learn an encoder and normal prototypes. Consider the encoder features of the upper and lower contrastive branches. Instead of directly contrasting the representations of the two views, we first apply a Hopfield memory layer to the first branch in the case of normal samples. It is important to note that we only apply the memory layer when the sample is normal, since we assume that anomalous data is significantly more variable and less well defined than normal data. As such, the first-branch representation entering the loss is the memory layer's output for a normal sample and the unchanged encoder feature for an anomalous one.

We choose SimCLR as the contrastive loss baseline; however, our method can be integrated into any other two-branched contrastive learning framework. For instance, it can be used with Barlow-Twins by using the cross-correlation in place of the corresponding SimCLR term. For self-supervised training, we then use a contrasted memory loss in which each representation is contrasted against every other representation inside the multi-view batch, using the cosine similarity scaled by a temperature hyper-parameter. We note that, in contrast with existing two-stage AD methods, our model explicitly introduces the anomalous and normal labels from the very first step of representation learning.
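One plausible way to write this loss down is given below as a sketch under assumed notation. Every symbol is an assumption introduced for illustration: $z_i$ and $z'_i$ for the features of the two branches, $\hat{z}_i$ for the first-branch representation after the optional memory step, $\mathrm{HF}$ for the Hopfield layer, $B$ for the batch size, $\tau$ for the temperature and $\mathrm{sim}$ for the cosine similarity. The sum in the denominator runs over every other representation $w$ in the multi-view batch.

```latex
\hat{z}_i =
\begin{cases}
\mathrm{HF}(z_i) & \text{if } x_i \text{ is normal} \\
z_i & \text{otherwise}
\end{cases}
\qquad
\ell(u, v) = -\log \frac{\exp\left(\mathrm{sim}(u, v)/\tau\right)}{\sum_{w \neq u} \exp\left(\mathrm{sim}(u, w)/\tau\right)}

\mathcal{L}_{\mathrm{mem}} = \frac{1}{2B} \sum_{i=1}^{B} \left[ \ell(\hat{z}_i, z'_i) + \ell(z'_i, \hat{z}_i) \right]
```

Under this reading, the only change with respect to a plain SimCLR objective is that normal samples are represented on the first branch by their recalled prototype rather than by their raw feature.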

Variance loss as regularization.

Our procedure can be prone to representation collapse during the first epochs. Indeed, we observed that the interplay between contrastive learning and the randomly initialized memory layer can occasionally lead all prototypes to collapse to a single point. To prevent this, we introduce an additional regularization loss which ensures that the variance of the retrieved memory samples does not reach zero.
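The exact form of this regularizer is not reproduced here. As an illustration only, the sketch below uses a hinge on the per-dimension standard deviation in the style of VICReg [DBLP:conf/iclr/BardesPL22], which is a swapped-in technique; the function name, `target_std` and `eps` are assumptions.

```python
import numpy as np


def variance_penalty(retrieved, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation of a (B, d) batch."""
    std = np.sqrt(retrieved.var(axis=0) + eps)
    # Only dimensions whose spread falls below the target are penalized
    return float(np.mean(np.maximum(0.0, target_std - std)))


collapsed = np.ones((16, 4))  # every retrieval is the same point
spread = np.random.default_rng(0).normal(scale=2.0, size=(16, 4))
penalty_collapsed = variance_penalty(collapsed)
penalty_spread = variance_penalty(spread)
```

A collapsed batch receives close to the maximal penalty, while a well spread batch is barely penalized, so the gradient pushes retrieved prototypes apart early in training.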

Multi-scale contrasted memory.

To make our model use information from several scales, we apply our contrasted memory loss not only to the flattened 1D output of our encoder but also to the 3D feature maps of intermediate layers.

We add after each scale representation a memory layer to effectively capture normal prototypes on several scales. However, memorizing the full 3D map as a single flattened vector would not be ideal. Indeed, at lower scales we are mostly interested in memorizing local patterns regardless of their position. Moreover, the memory would span across a space of very high dimensions. Therefore, we view the 3D intermediate maps as a collection of 1D feature vectors rather than a single flattened 1D vector. This is equivalent to splitting the image into patches and remembering each of them.

Since earlier feature maps have a high resolution, the computational cost and memory usage of such an approach can quickly explode. Thus, at each scale we only apply our contrasted memory loss on a random sample of the available vectors, drawn with a scale-specific ratio. If this ratio is high enough, our model will see most of the local patterns during training.
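The view of a feature map as a collection of per-location vectors, followed by random subsampling, can be sketched as follows. The function name, the channel-first `(C, H, W)` layout and the toy sizes are assumptions for the example.

```python
import numpy as np


def sample_patch_vectors(feature_map, ratio, rng):
    """View a (C, H, W) map as H*W vectors of size C and keep a random fraction."""
    channels, height, width = feature_map.shape
    vectors = feature_map.reshape(channels, height * width).T  # shape (H*W, C)
    n_keep = max(1, int(round(ratio * height * width)))
    kept = rng.choice(height * width, size=n_keep, replace=False)  # no repeats
    return vectors[kept]


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(64, 8, 8))
sampled = sample_patch_vectors(feature_map, ratio=0.25, rng=rng)
```

Here 16 of the 64 spatial locations are kept, each as a 64-dimensional vector, so the cost of the loss at this scale drops by roughly the sampling ratio.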

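Our multi-scale contrasted memory loss then combines, over the scales, the contrasted memory loss and the variance loss computed on the sampled vectors. One factor controls the impact of the variance loss, a per-scale confidence factor controls the importance of each scale, and the vectors of each scale are sampled at random without replacement. Since earlier scales are less complex and less semantically meaningful than later scales, we choose to put more confidence in the later stages, meaning that the confidence factors increase with depth.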

We minimize this loss jointly over the weights of all encoder stages and memory layers. An overview of this first stage is given in Fig. 2(a), and its algorithm is presented in Alg. 1. Compared to previous anomaly detectors equipped with memory banks [DBLP:conf/iccv/GongLLSMVH19, DBLP:conf/cvpr/ParkNH20, DBLP:journals/corr/abs-2002-10445, DBLP:journals/corr/abs-2112-02597], our model is the first to memorize the normal class at several scales, allowing it to be more robust to anomaly sizes. Moreover, the use of normal memory improves not only anomaly detection but also the quality of the learned representations, as will be discussed in Sec. 4.4.

Algorithm 1: AnoMem first learning stage
Input: batch size, set of invariance transformations
Initialization: encoder stages, memory layers
while the maximum epoch is not reached do
    Sample an image minibatch with its labels
    Sample augmentations from the set of invariance transformations
    Get the two augmented views of the minibatch
    for each scale do
        Compute the feature maps of both views at this scale
        Sample vectors from these feature maps
        Retrieve the memory prototype of each sampled vector using Eq. 3 with this scale's memory layer
    Compute the multi-scale contrasted memory loss from Sec. 3.1
    Take a gradient descent step on this loss to update the encoder stages and the memory layers
Output: encoder network, and the multi-scale memory prototypes from the memory layers

3.2 Multi-scale normal prototype deviation

In this second training stage, our goal is to compute an anomaly score for an input, given the pre-trained encoder and the multi-scale normal memory layers.

For each scale, we consider the difference between the encoder feature map and its recollection from the memory layer of that scale. The recollection process consists of spatially applying the memory layer to each depth-wise 1D vector of the feature map, that is, to the feature vector at each spatial location.

One-class AD.

In this case, we use the norm of the difference map as the anomaly score for each scale, and no further training is required.
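A minimal sketch of the recollection and of the resulting one-class score is given below. Several choices are assumptions made for illustration: a single softmax update stands in for the full Hopfield retrieval, the norm is taken to be the L2 norm, the memory is stored row-wise as `(N, C)`, and all names and toy values are hypothetical.

```python
import numpy as np


def recall(memory, vectors, beta=8.0):
    """One softmax-attention update per row; memory is (N, C), vectors is (L, C)."""
    logits = beta * vectors @ memory.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory


def one_class_score(feature_map, memory):
    """L2 norm of the difference between a (C, H, W) map and its recollection."""
    channels, height, width = feature_map.shape
    vectors = feature_map.reshape(channels, -1).T  # one vector per spatial location
    recollection = recall(memory, vectors).T.reshape(channels, height, width)
    return float(np.linalg.norm(feature_map - recollection))


memory = np.array([[1.0, 0.0], [0.0, 1.0]])  # two stored prototypes
normal_map = np.tile(np.array([1.0, 0.0])[:, None, None], (1, 2, 2))
odd_map = np.tile(np.array([-1.0, -1.0])[:, None, None], (1, 2, 2))
normal_score = one_class_score(normal_map, memory)
odd_score = one_class_score(odd_map, memory)
```

A map made of stored prototypes is recalled almost exactly and scores near zero, whereas a map far from every prototype is recalled poorly and scores high.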

Unbalanced supervised AD.

We use the additional labeled data to train scale-specific classifiers operating on the difference maps. Each classifier is composed of an average pooling layer followed by an MLP made of two dense layers with a single scalar output. The pooling layer is required to reduce the spatial resolution of the difference map, which would otherwise require very large dense layers on earlier scales. The output of each classifier directly corresponds to the anomaly distance of its scale.


Each scale-specific classifier is trained on the intermediate features of the same normal and anomalous samples used during the first stage, along with their labels. The training procedure is similar to other distance-based anomaly detectors [DBLP:conf/iclr/RuffVGBMMK20, DBLP:conf/icpr/Jezequel22CR], where the objective is to obtain small distances for normal samples while keeping high distances on anomalies. We note that our model is the first to introduce memory prototypes learned during representation learning into the anomaly distance learning. The distance constraint is enforced via a double-hinge distance loss, which takes the anomaly distance of a given sample and a parameter controlling the size of the margin around the unit ball frontier. One advantage of this loss is that normal samples and anomalies are correctly separated without encouraging anomalous features to be pushed toward infinity. Our second-stage supervised loss is built from this double-hinge term.
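One form of double-hinge loss that matches this description is sketched below. It is an assumption rather than the authors' exact expression: normal samples are pulled inside the ball of radius `1 - margin`, anomalies are pushed outside radius `1 + margin`, and the default margin value is hypothetical.

```python
def double_hinge(distance, is_anomaly, margin=0.5):
    """Zero loss once a sample sits on its side of the unit ball, beyond the margin."""
    if is_anomaly:
        # Anomalies are only pushed until they clear radius 1 + margin
        return max(0.0, (1.0 + margin) - distance)
    # Normal samples are only pulled until they are inside radius 1 - margin
    return max(0.0, distance - (1.0 - margin))


loss_far_anomaly = double_hinge(3.0, is_anomaly=True)       # already far enough
loss_close_normal = double_hinge(0.2, is_anomaly=False)     # already close enough
loss_border_anomaly = double_hinge(1.0, is_anomaly=True)    # on the unit sphere
loss_border_normal = double_hinge(1.0, is_anomaly=False)    # on the unit sphere
```

Because the anomaly term is clipped at zero, an anomaly that is already beyond the margin receives no gradient, which is what keeps anomalous features from being pushed toward infinity.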


Finally, all scale anomaly scores are merged into a single anomaly score using a sum weighted by the scale confidence parameters.


The resulting anomaly score combines the expertise of the anomaly scores at each scale, making it much more robust to different sizes of anomalies than other detectors. As mentioned in Sec. 3.1, the confidence parameters put more weight on the later scales, which is desirable since we want our detector to be more sensitive to broad object anomalies.
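The merging step amounts to a weighted sum, as the short sketch below shows. The function name and the numeric values are hypothetical and only illustrate a later scale being weighted more heavily than an earlier one.

```python
def merge_scale_scores(scale_scores, confidences):
    """Weighted sum of per-scale anomaly scores."""
    if len(scale_scores) != len(confidences):
        raise ValueError("one confidence weight is needed per scale")
    return sum(w * s for w, s in zip(confidences, scale_scores))


# Hypothetical values: two scales, the later one weighted more heavily
final_score = merge_scale_scores([0.4, 1.2], confidences=[0.25, 0.75])
```

With these weights, a deviation seen only at the later scale moves the final score three times as much as the same deviation at the earlier scale.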

This second stage is summarized in Fig. 2(b). As we can see, only the second stage has to be swapped between one-class learning and semi-supervised learning, resulting in a unified, easily switchable framework for AD.

4 Experiments

We assess the performance of our anomaly detector AnoMem on many datasets with the one-vs-all protocol and the face attack detection intra-dataset cross-type protocol.


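| Model | CUB-200, 0 | CUB-200, 0.01 | CUB-200, 0.05 | CUB-200, 0.10 | CIFAR-10, 0 | CIFAR-10, 0.01 | CIFAR-10, 0.05 | CIFAR-10, 0.10 | CIFAR-100, 0 | CIFAR-100, 0.01 | CIFAR-100, 0.05 | CIFAR-100, 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| One-class methods | | | | | | | | | | | | |
| MemAE [DBLP:conf/iccv/GongLLSMVH19] | 59.6 | | | | 60.9 | | | | 57.4 | | | |
| OC-SVM [DBLP:conf/nips/ScholkopfWSSP99] | 76.3 | | | | 64.7 | | | | 62.6 | | | |
| IF [DBLP:conf/icdm/LiuTZ08] | 74.2 | | | | 60.0 | | | | 58.5 | | | |
| PIAD [DBLP:conf/acpr/TuluptcevaBFK19] | 63.5 | | | | 79.9 | | | | 78.8 | | | |
| Reverse Distillation [DBLP:conf/cvpr/DengL22] | - | | | | 86.5 | | | | - | | | |
| ARNet [fei2020attribute] | - | | | | 86.6 | | | | 78.8 | | | |
| GOAD [DBLP:conf/iclr/BergmanH20] | 66.6 | | | | 88.2 | | | | 74.5 | | | |
| MHRot [DBLP:conf/nips/HendrycksMKS19] | 77.6 | | | | 89.5 | | | | 83.6 | | | |
| CSI [DBLP:conf/nips/TackMJS20] | 52.4 | | | | 94.3 | | | | 85.8 | | | |
| Semi-supervised methods | | | | | | | | | | | | |
| Supervised | | 53.1 | 58.6 | 62.4 | | 55.6 | 63.5 | 67.7 | | 53.8 | 58.4 | 62.5 |
| SS-DGM [DBLP:conf/nips/KingmaMRW14] | | - | - | - | | 49.7 | 50.8 | 52.0 | | - | - | - |
| Elsa [DBLP:conf/bmvc/0001S0PC21] | | 77.8 | 81.3 | 82.9 | | 80.0 | 85.7 | 87.1 | | 81.3 | 84.6 | 86.0 |
| SadCLR [DBLP:conf/icpr/Jezequel22CR] | | 78.3 | 80.0 | 81.8 | | 88.4 | 96.1 | 96.7 | | 89.7 | 92.9 | 94.6 |
| Methods usable in both settings | | | | | | | | | | | | |
| SSAD [DBLP:journals/jair/GoernitzKRB13] | - | - | - | - | 62.0 | 73.0 | 71.5 | 70.1 | 57.4 | 65.0 | 67.3 | 68.1 |
| DeepSAD [DBLP:conf/iclr/RuffVGBMMK20] | 53.9 | 62.7 | 63.4 | 65.1 | 60.9 | 72.6 | 77.9 | 79.8 | 56.3 | 67.7 | 71.6 | 73.2 |
| DP VAE [DBLP:journals/corr/abs-1911-04971] | 61.7 | 65.4 | 67.2 | 69.6 | 52.7 | 74.5 | 79.1 | 81.1 | 56.7 | 68.5 | 73.4 | 75.8 |
| AnoMem (ours) | 81.4 | 84.1 | 85.3 | 86.0 | 91.5 | 92.5 | 97.1 | 97.6 | 86.1 | 90.9 | 92.3 | 94.7 |

Table 1: Comparison with the SOTA methods over several datasets using the AUC in the one-vs-all semi-supervised protocol. Each column gives the dataset and the anomaly ratio in the training set. The three blocks respectively contain one-class methods, semi-supervised methods and methods usable in both settings. Underline indicates the overall best result, bold indicates the best semi-supervised method (we re-evaluated Elsa, DP-VAE, SSAD, GOAD and ARNet on CIFAR-100 and CUB-200).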

4.1 Evaluation protocol

In the first, one-vs-all protocol, one class is considered normal and the others anomalous. The final reported result is the mean over all runs obtained for each possible normal class. We consider various ratios of anomalous data in the training dataset and, for each, average the metrics over 10 random samples. The one-class AD setting is the special case of the semi-supervised AD setting where this ratio is zero.

The second protocol, intra-dataset cross-type, is centered around face presentation attack detection (FPAD), whose goal is to discriminate real faces from fake representations of someone's face. Training and test data are sampled from the same dataset, but one tested attack type is unseen during training. By evaluating the model on an unseen anomaly type, we can assess its generalization power and robustness. We consider the following unseen attack types: Paper Print (PP), Screen Recording (SR), Paper Mask (PM) and Flexible Mask (FM).

To evaluate the representation learning process, we use the linear evaluation protocol, which reports the accuracy of a linear classifier trained on the frozen encoder representations.

Specifically, the considered datasets for the one-vs-all protocol are chosen to cover object and style anomalies:

  • CIFAR-10 [Krizhevsky2009LearningML]: an object recognition dataset composed of 10 wide classes with 6000 images per class.

  • CIFAR-100 [Krizhevsky2009LearningML]: more challenging version of CIFAR-10 with 100 classes each containing 600 images.

  • CUB-200: a challenging fine-grained dataset consisting of 200 classes of birds each containing around 50 images.

For the FPAD, we use the WMCA dataset [DBLP:journals/tifs/GeorgeMGNAM20] containing more than 1900 RGB videos of real faces and presentation attacks. There are several types of attacks which cover object anomalies, style anomalies and local anomalies.

In all AD evaluations, the metric used is the area under the ROC curve (AUROC) or alternatively the error 1-AUROC, averaged over all possible normal classes in the case of one-vs-all datasets.
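For reference, the AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below uses made-up scores and assumes label 1 for anomalies and label 0 for normal samples.

```python
def auroc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random normal (label 0)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUROC")
    # Count correctly ordered pairs; ties contribute one half
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


scores = [0.1, 0.4, 0.35, 0.8]  # made-up anomaly scores
labels = [0, 0, 1, 1]
area = auroc(scores, labels)    # 3 of the 4 anomaly/normal pairs are ordered correctly
error = 1.0 - area
```

The pairwise form is quadratic in the number of samples, so a rank-based implementation is preferable on large test sets; it is used here only because it mirrors the definition.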

4.2 Implementation details


Training is performed with the SGD optimizer with Nesterov momentum [DBLP:conf/icml/SutskeverMDH13], using a fixed batch size and a cosine annealing learning rate scheduler [DBLP:conf/iclr/LoshchilovH17] for both stages.
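As a reminder of what the scheduler does, the cosine annealing rule of [DBLP:conf/iclr/LoshchilovH17] without restarts can be sketched as below. The learning rate bounds and the step counts are hypothetical values chosen for the example.

```python
import math


def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate decayed from lr_max to lr_min along half a cosine period."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 100 steps starting from a learning rate of 0.1
lr_start = cosine_annealing(0, 100, lr_max=0.1)
lr_middle = cosine_annealing(50, 100, lr_max=0.1)
lr_end = cosine_annealing(100, 100, lr_max=0.1)
```

The rate starts at its maximum, passes through the midpoint halfway through training and reaches the minimum at the final step.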

Data augmentation.

For the contrastive invariance transformations, we use random crop with rescale, horizontal symmetry, brightness jittering, contrast jittering, saturation jittering with gaussian blur and noise as in [DBLP:conf/icml/ChenK0H20].

Model design.

Regarding network architecture, we use a ResNet-50 [DBLP:conf/cvpr/HeZRS16] for the backbone. We consider two different memory scales: one after the third stage and another after the last stage. Each associated memory layer has its own size and pattern sampling ratio, and both use the same inverse temperature. The choices of memory size and sampling ratios are respectively discussed in Sec. 4.5 and Sec. 4.6. The scale confidence factors are set to increase exponentially with depth, and the variance loss factor is fixed after optimization on CIFAR-10. The contrastive temperature is set as suggested in [DBLP:conf/icml/ChenK0H20]. For the anomaly distance loss, we use a fixed margin size.

4.3 Comparison to the state-of-the-art

4.3.1 One-vs-all

In this section, we compare our model with SoTA AD methods on the one-vs-all protocol, in the one-class setting and on the semi-supervised setting when possible.

Considered one-class methods are hybrid models [DBLP:conf/nips/ScholkopfWSSP99, DBLP:conf/icdm/LiuTZ08], reconstruction error generative model [DBLP:conf/acpr/TuluptcevaBFK19, DBLP:conf/cvpr/ParkNH20], the knowledge distillation method [DBLP:conf/cvpr/DengL22], pretext tasks methods [fei2020attribute, DBLP:conf/iclr/BergmanH20, DBLP:conf/nips/HendrycksMKS19], and the two-stage method [DBLP:conf/nips/TackMJS20]. We also consider semi-supervised methods such as density estimation methods [DBLP:conf/nips/KingmaMRW14], two-stage AD [DBLP:conf/bmvc/0001S0PC21], and anomaly distance model [DBLP:conf/icpr/Jezequel22CR]. To further show the disadvantages of classical binary classification, we also include a classical deep classifier trained with batch balancing between normal samples and anomalies. Lastly, unified methods usable in both one-class and semi-supervised learning are included with the reconstruction error model [DBLP:journals/corr/abs-1911-04971], and direct anomaly distance models [DBLP:journals/jair/GoernitzKRB13, DBLP:conf/iclr/RuffVGBMMK20]. For a fair comparison in the same conditions, we take the existing implementations or re-implement and evaluate ourselves all one-class methods, except [fei2020attribute, DBLP:conf/iclr/BergmanH20, DBLP:conf/nips/TackMJS20]. The results are presented in Tab. 1.

First of all, we can notice the classical supervised approach falls far behind anomaly detectors on all datasets. This highlights the importance of specialized AD models, as classical models are likely to over-fit on anomalies.

Furthermore, our method AnoMem overall performs significantly better than all other evaluated detectors on various datasets, with up to 35% relative error improvement on CIFAR-10 at an anomaly ratio of 0.01. Although performance greatly increases with more anomalous data, it remains highly competitive with only normal samples. In the one-class setting, AnoMem outperforms all methods specialized for one-class including pretext task methods, reconstruction error methods and, by far, hybrid methods. It even surpasses the much more costly multi-pretext task method PuzzleGeom on CIFAR-10 and CIFAR-100. We also show that the usage of memory in our method is much more effective than the memory for reconstruction used in MemAE. Indeed, while we learn the memory through contrastive learning, MemAE and others [DBLP:conf/iccv/GongLLSMVH19, DBLP:conf/cvpr/ParkNH20] learn it via a pixel-wise reconstruction loss. Their normal prototypes are much more constrained and therefore less semantically rich and generalizable. As for semi-supervised AD, AnoMem reduces the state-of-the-art error on nearly all anomalous data ratios. Its multi-scale anomaly detectors allow capturing more fine-grained anomalies, as we can see in the CUB-200 results.
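The relative error improvement can be checked by hand from Table 1. One reading that is consistent with the 35% figure compares AnoMem with SadCLR on CIFAR-10 at an anomaly ratio of 0.01; the choice of baseline is an assumption, and the helper name is hypothetical.

```python
def relative_error_improvement(auroc_baseline, auroc_new):
    """Relative drop of the error 100 - AUROC, with both AUROC values in percent."""
    error_baseline = 100.0 - auroc_baseline
    error_new = 100.0 - auroc_new
    return (error_baseline - error_new) / error_baseline


# Table 1, CIFAR-10, anomaly ratio 0.01: SadCLR 88.4 vs AnoMem 92.5
gain = relative_error_improvement(88.4, 92.5)
```

The error drops from 11.6 to 7.5 points, a reduction of 4.1 points, or about 35% of the baseline error.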

Finally, our model performs very well in both one-class and semi-supervised AD while other unified methods generally fail in the one-class setting. In this regard, our model is to the best of our knowledge the first efficient unified anomaly detector. We also note that the change from one-class AD to semi-supervised AD on our model was done with minimal hyper-parameter tuning. This is due to the first training step being shared between the one-class and the semi-supervised settings.

4.3.2 Face Presentation Attack Detection

We now compare our model on the FPAD intra-dataset cross-type protocol with state-of-the-art methods presented in Sec. 4.3.1. The results are displayed in Tab. 2.

Without any further tuning for face data, our method improves face anti-spoofing performance on WMCA, with a relative error improvement of up to 14% on paper prints. It outperforms existing anomaly detectors on all unseen attack types, including in the one-class setting. We can also notice that it reduces the error gap between coarse attacks (PM, FM) and harder fine-grained attacks (PP, SR) thanks to its multi-scale AD.


Models          All    PP     SR     PM     FM
PIAD            76.4   -      -      -      -
ARNet           84.5   -      -      -      -
GOAD            86.1   -      -      -      -
MHRot           81.3   -      -      -      -
PuzzleGeom      85.6   -      -      -      -
Supervised      -      78.3   77.1   80.7   81.9
Elsa            -      86.1   84.3   89.2   89.1
SadCLR          -      89.8   88.5   92.7   91.9
DP VAE          53.9   -      -      -      -
DeepSAD         71.2   79.9   80.3   81.8   83.4
AnoMem (ours)   86.9   91.3   89.8   93.0   92.7


Table 2: Comparison with the SOTA methods over face anti-spoofing datasets in the cross-type protocol. The columns indicate the type of presentation attack unseen during training. Bold indicates the best.

4.4 Ablation study

In this section we study the impact of the multi-scale memory layers in the two training stages and show they are essential to our model performance.

First, we use linear evaluation on CIFAR-10 and CIFAR-100 to assess how the memory affects the contrastive learning of the encoder representations. As we can see in Tab. 3(a), the inclusion of the memory layers on the first branch drastically improves the quality of the encoder representations. We hypothesize that, as shown in [DBLP:conf/iccv/DwibediATSZ21], the inclusion of prototypical samples in one of the branches makes it possible to contrast positive images against representative negatives. This alleviates the need for a large batch size, and greatly reduces the memory usage of multi-scale contrastive learning.

To support the importance of memory during AD as well, we compare the performance of the anomaly detector with k-means cluster centroids or with the normal prototypes learned during the first stage. In the first case, we train the anomaly detectors using the same procedure but instead of fetching the Hopfield layer output we use the closest k-means centroid. The results on CIFAR-10 and CIFAR-100 are presented in Tab. 3(b).
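The two retrieval modes compared in this ablation can be contrasted with a small sketch. It assumes that the Hopfield layer output takes the usual modern Hopfield form, a softmax-weighted combination of the memory slots, which may differ in detail from the layer used in AnoMem. The function names and the inverse temperature `beta` are illustrative and not taken from the paper.

```python
import math


def nearest_centroid(query, centroids):
    """Hard retrieval: return the centroid closest to the query (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sq_dist(query, c))


def hopfield_retrieve(query, memory, beta=8.0):
    """Soft retrieval (assumed form): softmax(beta * <query, slot>) weighted sum of slots."""
    scores = [beta * sum(q * m for q, m in zip(query, slot)) for slot in memory]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * slot[d] for w, slot in zip(weights, memory)) / total
        for d in range(len(query))
    ]


# Toy memory with two slots and one feature vector (hypothetical values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]
hard = nearest_centroid(feature, prototypes)
soft = hopfield_retrieve(feature, prototypes)
```

With a large `beta` the soft retrieval approaches the hard nearest-slot lookup, while a small `beta` blends all slots. In this sketch the only difference between the two ablation variants is which prototype is handed to the anomaly detector.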

Lastly, we measure in Tab. 3(a) and 3(b) the impact of multi-scale AD by comparing the single-scale model, which only uses the last feature map, with our two-scale model. While we sacrifice some memory and training time for the additional scale, our AD performance improves significantly, with up to 10% relative error improvement on CIFAR-10.


Memory   Multi-scale   CIFAR-10        CIFAR-100
(a) Representation Learning (Linear Evaluation)
-        -             88.2            80.5
✓        -             91.4 (+3.2)     84.1 (+3.6)
✓        ✓             91.9 (+0.5)     84.8 (+0.7)
(b) Anomaly Detection (One-vs-all)
-        -             88.1            82.7
✓        -             90.5 (+2.4)     85.3 (+2.6)
✓        ✓             91.5 (+1.0)     86.1 (+0.8)


Table 3: Ablation study of our anomaly detector during the two stages. (a) We perform a linear evaluation of representation learning, (b) we evaluate the one-vs-all anomaly detection AUC. The baseline corresponds to a model using only the last feature map, with k-means cluster centroids as its normal prototypes.
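As a quick check of the "up to 10%" figure quoted for the multi-scale ablation, the relative error improvement can be recomputed from the CIFAR-10 values in Tab. 3(b). The sketch below assumes that the error is defined as 100 minus the AUC in percent, which the text does not state explicitly; the function name is illustrative.

```python
def relative_error_improvement(baseline_auc, new_auc):
    """Relative reduction, in percent, of the error defined as (100 - AUC)."""
    baseline_error = 100.0 - baseline_auc
    new_error = 100.0 - new_auc
    return 100.0 * (baseline_error - new_error) / baseline_error


# Tab. 3(b), CIFAR-10: 90.5 AUC without the extra scale, 91.5 AUC with it.
gain = relative_error_improvement(90.5, 91.5)
print(f"relative error improvement: {gain:.1f}%")
```

Under this definition the error drops from 9.5 to 8.5 points, a relative reduction of slightly more than 10%, which is consistent with the claim in the text.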

4.5 Memory size

The memory size must be carefully chosen at every scale to reach a good balance between the normal class prototype coverage and the memory usage during training and inference. We present in this section some rules of thumb regarding the sizing of memory layers depending on the scale through two experiments.

We start by plotting the relation between the last scale memory size, the representation separability, and the final AD performance in Fig. 2(a). As one could expect, a larger memory produces better-quality representations and a more accurate anomaly detector. However, we can notice that increasing the memory size above 256 has significantly less impact on the anomaly detector performance. Therefore, a good trade-off between memory usage and performance seems to be at 256 on CIFAR-10.

Furthermore, we study the impact of feature map scale on the required memory size. In Fig. 2(b) we fix the last scale memory size to 256 and look at the AD accuracy for different ratios of memory size between each scale. An interesting observation is that lower scales seem to benefit more from a larger memory than higher scales. We can hypothesize that the feature vectors of more local, texture-oriented features are richer and more variable than those of global, object-oriented features. Memory layers at these scales thus need a higher capacity to capture this complexity.

[Figure panels, both plotted against memory size: (a) Linear evaluation (acc), (b) One-vs-all (AUC)]
Figure 3: Memory experiments on CIFAR-10.

4.6 Spatial sampling ratios

Sampling ratios are introduced during the first step in order to reduce the number of patterns considered in the contrastive loss, and consequently the size of the similarity matrix. At low scales, we can expect nearby samples to be quite similar. Therefore, skipping some of the available patterns is not as detrimental to the training.
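To make the effect on the similarity matrix concrete, here is a minimal sketch. It assumes that patterns are sampled uniformly at random among the spatial positions of a feature map and that the similarity matrix is pairwise over all sampled patterns in a batch. Both points, along with the function names and the toy sizes, are assumptions made for illustration rather than details from the paper.

```python
import random


def sample_positions(height, width, ratio, seed=0):
    """Keep about `ratio` of the height*width spatial positions, without replacement."""
    positions = [(i, j) for i in range(height) for j in range(width)]
    keep = max(1, round(ratio * len(positions)))
    return random.Random(seed).sample(positions, keep)


def similarity_matrix_entries(batch_size, height, width, ratio):
    """Entries of a pairwise similarity matrix over all sampled patterns in a batch."""
    patterns = batch_size * max(1, round(ratio * height * width))
    return patterns * patterns


# Hypothetical 16x16 feature map with a batch of 8 images.
full = similarity_matrix_entries(8, 16, 16, 1.0)
quarter = similarity_matrix_entries(8, 16, 16, 0.25)
kept = sample_positions(16, 16, 0.25)
```

Because the matrix is pairwise, its size grows quadratically with the number of sampled patterns: in this toy setting, keeping a quarter of the positions shrinks the matrix by a factor of 16.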

To guide our choice of sampling ratio, we measure our anomaly detection AUC with various sampling ratios and anomalous data ratios. Since the last scale feature maps are spatially very small, we only vary the first scale ratio and keep the last scale ratio fixed. The batch size is fixed throughout the experiments. The results are displayed in Fig. 4. We can see that low sampling ratios significantly decrease the AD performance. However, the gain in performance for higher ratios is generally not worth the additional computational cost: by more than doubling the number of sampled patterns, we only increase the relative AUC by 2%.

[Figure axis: % of best AUC]
Figure 4: Sampling ratio experiments on CIFAR-10.

5 Conclusion and Future Work

In this paper, we present a new two-stage anomaly detection model which memorizes normal class prototypes in order to compute an anomaly deviation score. By introducing normal memory layers in a contrastive learning setting, we can first jointly learn the encoder representations and a set of normal prototypes. This improves the quality of the learned representations, and allows the long tail of normal samples to be memorized. The normal prototypes are then used to train a simple detector in a unified framework for one-class or semi-supervised AD. Furthermore, we extend these prototypes to several scales, making our model more robust to different anomaly sizes. Finally, we assess its performance on a wide array of datasets containing object, style and local anomalies. AnoMem greatly outperforms state-of-the-art methods on all datasets and across different anomalous data regimes.

For future work, we could explore the use of the multi-scale anomaly score for anomaly localization. Indeed, in the one-class setting of our model, we could merge the anomaly maps from the several scales into a single heatmap.
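One possible way to merge per-scale anomaly maps is sketched below. The paper leaves this as future work, so every choice here is an assumption made for illustration: the nearest-neighbour upsampling, the plain averaging across scales, and the function names.

```python
def upsample_nearest(amap, out_h, out_w):
    """Nearest-neighbour resize of a 2D anomaly map given as a list of rows."""
    in_h, in_w = len(amap), len(amap[0])
    return [
        [amap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def merge_anomaly_maps(maps, out_h, out_w):
    """Resize every per-scale map to a common size, then average them pixel-wise."""
    resized = [upsample_nearest(m, out_h, out_w) for m in maps]
    return [
        [sum(r[i][j] for r in resized) / len(resized) for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy maps: a coarse 2x2 map and a fine 4x4 map that agree on the top-right corner.
coarse = [[0.0, 1.0], [0.0, 0.0]]
fine = [[0.0] * 4 for _ in range(4)]
fine[0][3] = 1.0
heatmap = merge_anomaly_maps([coarse, fine], 4, 4)
```

In this toy case the merged heatmap peaks where both scales agree and takes an intermediate value where only the coarse map fires. A weighted average or a pixel-wise maximum would be equally plausible merging rules.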