1 Introduction
Anomalies are image regions that do not conform to the rest of the image. Detecting them is a challenging image analysis problem, as there seems to be no straightforward definition of what is (ab)normal for a given image.
Anomalies in images can be high-level or low-level outliers. High-level anomalies are related to the semantic information present in the scene. For example, human observers immediately detect a person inappropriately dressed for a given social event. In this work, we focus on the problem of detecting anomalies due to rare low- or mid-level local patterns present in images. This is an important problem in many industrial, medical, and biological applications.
We introduce in this paper an unsupervised method for detecting anomalies in an arbitrary image. The method does not rely on a training dataset of normal or abnormal images, nor on any other prior knowledge about the image statistics. It directly detects anomalies with respect to residual images estimated solely from the image itself. We only use a generic, qualitative background image model: we assume that anything that repeats in an image is not an anomaly. In a nutshell, our method removes from the image its self-similar content (considered as being normal). The residual is modeled as colored Gaussian noise, but still contains the anomalies, which by definition do not repeat. Detecting anomalies in noise is far easier and can be made rigorous and unsupervised by the a contrario theory [1], a probabilistic formalization of the non-accidentalness principle [2]. The a contrario
framework has produced impressive results in many different detection or estimation tasks in computer vision, such as line segment detection [3], spot detection [4], vanishing point detection [5], and mirror-symmetry detection [6], among others. The fundamental property of the a contrario theory is that it provides a way of automatically computing detection thresholds that control the number of false alarms (NFA). It favorably replaces the usual p-value when multiple testing is involved. It follows that not only can one detect anomalies in arbitrary images without complex modeling, but in addition each anomaly is associated with an NFA, which is often very small and therefore offers a strong guarantee of the validity of the detection. We shall show detections performed directly on the image residual, or alternatively on residuals extracted from dense low- and mid-level features of the VGG neural net [7]. The paper is organized as follows. Section 2 discusses previous work, while Section 3 explains the proposed method and its implementation. Section 4 presents results of the proposed method on real and synthetic data, and a comparison with other state-of-the-art anomaly detectors. We conclude in Section 5.
2 Related Work
The 2009 review [8], examining about 400 papers on anomaly detection, aims to cover all existing techniques and application fields. It is well complemented by the more recent review [9]. These reviews agree that classification techniques like SVM can be discarded, because anomalies are generally not observed in sufficient number and lack statistical coherence. There are exceptions, like the recent method [10], which defines anomalies as rare events that cannot be learned; yet, after estimating a background density model, the right detection thresholds are nevertheless learned from anomalies. A broad related literature exists on saliency measures, for which learning from average human fixation maps is possible [11].
Saliency detectors try to mimic human visual perception and in general introduce semantic prior knowledge (e.g., face detectors). This approach works particularly well with neural networks trained on a base of detect/non-detect examples, with ground truth obtained by, for example, gaze trackers [12]. Anomaly detection has generally been handled as a “one class” classification problem. In [13], the authors concluded that most research on anomaly detection was driven by modeling background data distributions, in order to estimate the probability that test data do not belong to such distributions [4, 14, 15, 16]. Autoencoder neural networks can also be used to model the background [17, 18]; the general idea is to compute the norm between the input and a reconstruction of the input. Another successful background-based method is the detection of anomalies in periodic textile patterns [19, 20]. In [21, 22], center-surround detectors based on color, orientation and intensity filters are combined to produce a final saliency map. Detection in image and video is also done in [23] with center-surround saliency detectors, which stem from [24] and adopt similar image features. In [14], the main idea is to estimate the probability of a region conditioned on its surroundings. A more recent nonparametric trend is to learn a sparse dictionary representing the background (i.e., normality) and to characterize outliers by their non-sparsity [25, 26, 27, 28, 29]. The self-similarity principle has been successfully used in many different applications [30, 31]. The basic assumption of this generic background model is that in normal data, features are densely clustered, whereas anomalies occur far from their closest neighbors. This idea is implemented by clustering (anomalies being detected as far away from the centroid of their own cluster), or by simple rarity measurements based on nearest-neighbor (NN) search [32, 33, 34].
Background probabilistic modeling is powerful when images belong to a restricted class of homogeneous objects, like textiles. But, regrettably, this approach is nearly impossible to apply to generic images. Similarly, background reconstruction models based on CNNs are restrictive and do not rely on provable detection thresholds. Center-surround contrast methods are successful for saliency enhancement, but lack a formal detection mechanism. Being universal, the sparsity and self-similarity models are tempting and thriving. But again, they lack a rigorous detection mechanism, because they work on a feature space that is not easily modeled.
We propose to benefit from the above methods while avoiding their mentioned limitations. To this aim, we do construct a probabilistic background model, but it is applied to a new feature image that we call the residual. This residual is obtained by computing the difference between a self-similar version of the target image and the target itself. Containing no self-similar structure, this residual is akin to a colored noise. Hence a hypothesis test can be applied, and more precisely multiple hypothesis testing (also called the a contrario method), as proposed in [4]. In that way, we obtain a simple and universal method that detects anomalies by a rigorous threshold. It does not require learning, and it is easily made multiscale.
3 Method
Our method is built on two main blocks: the removal of the self-similar image component, and a simple statistical detection test on the residual based on the a contrario framework.
3.1 Construction of the residual image
The proposed self-similarity based background subtraction is inspired by patch-based non-local denoising algorithms, where the estimate is computed from a set of similar patches [31]. This search is generally performed locally around each patch [35, 31] to keep the computational cost low and to avoid noise overfitting. The main difference with non-local denoisers is that we forbid local comparisons: the nearest-neighbor search is performed outside a square region surrounding each query patch, defined as the union of all the patches intersecting the query patch. Otherwise, any anomaly with some internal structure might be considered a valid structure. What matters is that the event represented by the anomaly is unique, and this is checked away from it.
For each patch $P$ in the image, the most similar patches, denoted by $\mathcal{N}(P)$, are searched and averaged to give a self-similar estimate
$$\hat{P} = \frac{1}{C} \sum_{Q \in \mathcal{N}(P)} e^{-\|P-Q\|_2^2 / h^2}\, Q, \qquad (1)$$
where $C = \sum_{Q \in \mathcal{N}(P)} e^{-\|P-Q\|_2^2 / h^2}$ is a normalizing constant, and $h$ is a parameter.
Since each pixel belongs to several different patches, it receives several distinct estimates, which are averaged. Algorithm 1 gives a generic pseudocode for this process, which ends with the generation of a residual image $r = u - \hat{u}$, allegedly containing only noise and the anomalies (see Figure 1). The intuition is that it is much easier to detect anomalies in $r$ than in $u$.
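As an illustration, the residual construction can be sketched in a few lines of NumPy. This is a brute-force sketch under our own choices of parameter names and defaults; the paper's Algorithm 1 and its exact parameter values are not reproduced here.

```python
import numpy as np

def residual_image(u, patch=4, n_neighbors=16, h=10.0):
    """Self-similarity based background subtraction (brute-force sketch).

    For every patch, the most similar patches are searched OUTSIDE the
    square region of patches intersecting the query, averaged with
    exponential weights as in eq. (1), and the per-patch estimates are
    aggregated into a self-similar image u_hat. The returned residual
    r = u - u_hat should contain only noise and the anomalies.
    """
    H, W = u.shape
    coords = [(y, x) for y in range(H - patch + 1) for x in range(W - patch + 1)]
    patches = np.stack([u[y:y + patch, x:x + patch].ravel() for y, x in coords])
    yy = np.array([c[0] for c in coords])
    xx = np.array([c[1] for c in coords])
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for idx, (y, x) in enumerate(coords):
        # exclusion zone: forbid any patch intersecting the query patch
        far = (np.abs(yy - y) >= patch) | (np.abs(xx - x) >= patch)
        d2 = np.sum((patches[far] - patches[idx]) ** 2, axis=1)
        order = np.argsort(d2)[:n_neighbors]
        # subtracting the smallest distance stabilizes the exponentials;
        # the common factor cancels in the normalization C
        w = np.exp(-(d2[order] - d2[order].min()) / h ** 2)
        est = (w[:, None] * patches[far][order]).sum(axis=0) / w.sum()
        acc[y:y + patch, x:x + patch] += est.reshape(patch, patch)
        cnt[y:y + patch, x:x + patch] += 1.0
    return u - acc / cnt
```

On a perfectly self-similar image the residual vanishes, which is the sanity check one expects from the background model; a practical implementation would replace the exhaustive search with an approximate nearest-neighbor structure.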
3.2 Statistical detection by the a contrario approach
Our goal is to detect structure in the residual image $r$. We are in a much better situation modeling $r$ than $u$. Indeed, contrarily to $u$, $r$ is by construction unstructured and akin to a colored noise (as illustrated in Fig. 1). In what follows we assume that $r$ is a spatially stationary random process and follow [4], who proposed automatic detection thresholds in any colored Gaussian noise.
Given a set of random variables $(X_i)_{i \in \mathcal{I}}$, a function $\mathrm{NFA}(i, x)$ is called an NFA if it guarantees a bound on the expectation of its number of false alarms under the null hypothesis $\mathcal{H}_0$, namely,
$$\mathbb{E}\big[\#\{i : \mathrm{NFA}(i, X_i) \le \varepsilon\}\big] \le \varepsilon.$$
In other words, thresholding all the $\mathrm{NFA}(i, X_i)$ by $\varepsilon$ should give up to $\varepsilon$ false alarms when $(X_i)$ verifies the null hypothesis. In our case, we consider
$$\mathrm{NFA}(i, x) = N \cdot \mathbb{P}\big(|X_i| \ge |x|\big), \qquad (2)$$
where $i$ indexes the pixel among the $N$ executed tests (detailed below), $X_i$ is a random variable distributed as the residual at position $i$, and $x$ is the actual measured value (pixel or feature value) at position $i$. The null hypothesis is that the residual, represented by the $X_i$, verifies that each $X_i$ follows a standard normal distribution. Independence is not required.
Residual distribution. In practice, the distribution of the residual is not necessarily Gaussian. A careful study of the residual distribution led us to consider that it follows a generalized Gaussian distribution (GGD). We approximately estimate the GGD parameters, and then apply a nonlinear mapping to make the residual normally distributed.
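A possible implementation of this Gaussianization step, assuming SciPy's `gennorm` maximum-likelihood fit as the parameter estimator (the paper's exact estimation procedure may differ): each sample is pushed through the fitted GGD CDF and then through the inverse normal CDF.

```python
import numpy as np
from scipy.stats import gennorm, norm

def gaussianize(residual):
    """Map residual samples to an approximately standard normal law (sketch).

    Fit a generalized Gaussian (shape, location, scale) by maximum
    likelihood, then apply CDF followed by inverse normal CDF. The
    clipping, our addition, guards norm.ppf against probabilities of
    exactly 0 or 1 at extreme samples.
    """
    beta, loc, scale = gennorm.fit(np.ravel(residual))
    p = gennorm.cdf(residual, beta, loc=loc, scale=scale)
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return norm.ppf(p)
```

After this mapping, the standard-normal null hypothesis of the NFA computation is approximately satisfied channel by channel.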
Choice of NFA. The choice of the NFA given in (2) enables detecting anomalies in both tails of the Gaussian distribution (i.e., very bright or very dark spots). To detect anomalies of all sizes, the detection is carried out independently at several dyadic scales computed from the residual at the original resolution (by Gaussian subsampling of factor two). Let us denote by $\mathcal{P}_s$ the set of pixels in the residual image at scale $s$, having $C_s$ features. When working with colored noise, Grosjean and Moisan [4] propose to convolve the noise with a measure kernel to detect spots of a certain size. This corresponds to the generation of new image features $r * \mathbb{1}_D$, where $D$ is a disk of a given radius. This idea is used in our framework, where the residual is convolved with kernels of small sizes. Since we apply the detection at all dyadic scales, the tested radii are limited to a small set of values (1, 2 and 3) at each scale. Because the residual is assumed to be a stationary Gaussian field, the result after filtering is also Gaussian. The variance is estimated and the filtered residuals are normalized to have unit variance; this is the input to the NFA (2) computation. Thus, the inputs to the detection phase are multi-channel images at different scales, where each pixel channel, representing a given feature, follows a standard normal distribution. The number of tests $N$ is then the total number of tested pixel-feature pairs over all scales, $N = \sum_s |\mathcal{P}_s|\, C_s$.
3.3 Choice of the image features
Anomaly detectors work either directly on image pixels or on some feature space, but detection in the residual, which is akin to unstructured noise, is fairly independent of the choice of features. We used with equal success the raw image color pixels, or intermediate feature representations extracted from the VGG convolutional neural network [7]. To compress the dynamic range of the feature space, we apply a square root function to the network features. In order to reduce the feature space dimension, we compute the principal components (PCA) and keep only the first five. This is done independently for each input image.
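The per-image feature compression described above (square-root mapping followed by a PCA keeping the first five components) can be sketched directly with an SVD; the layout of the feature tensor is our assumption.

```python
import numpy as np

def compress_features(feat, n_components=5):
    """Per-image feature reduction (sketch).

    feat: array of shape (C, H, W), one channel per feature. The dynamic
    range is compressed with a square root, then a PCA computed on this
    image alone keeps only the first n_components channels, implemented
    via an SVD of the centered pixel-by-feature matrix.
    """
    C, H, W = feat.shape
    x = np.sqrt(np.maximum(feat, 0.0)).reshape(C, -1).T  # (H*W, C) samples
    x = x - x.mean(axis=0)
    # principal directions = right singular vectors of the centered data
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:n_components].T                       # (H*W, n_components)
    return proj.T.reshape(n_components, H, W)
```

The projected channels are mutually uncorrelated with decreasing variance, so the retained five channels carry the bulk of the per-image feature variability.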
Parameters.
The main parameter of the method is the number of allowed false alarms in the statistical test. In all presented experiments, we set the NFA detection threshold to a small value $\varepsilon$. Hence, an anomaly is detected at a pixel in a channel iff the NFA function is below $\varepsilon$. This implies a (theoretical) expectation of less than $\varepsilon$ “casual” detections per image under the null hypothesis that the residual image is noise.
Obviously, the lower the NFA, the better; most anomalies have a much lower NFA.
For the basic method working on image pixels, we use two disks of radii one and two, while for the neural network features we add a third disk of radius three.
The number of scales, the patch size in Alg. 1 (which differs between the pixels variant and the neural-network variant), and the number of nearest patches $n$ together with the parameter $h$ are kept fixed in all tests. The results presented herein use the outputs of the VGG-19 layers conv1_1, conv2_1 and conv3_1.
4 Experiments
In the absence of a valid test image database for anomalies, we used the most common images proposed in the literature (see Fig. 2) and adopted the following comparison methodology, applied to our method and to four other state-of-the-art methods for comparison:
a) Sanity check: verify that, for the toy examples proposed in the literature, the sole detection is the anomaly;
b) Theoretical sanity check: verify the a contrario principle: “no detection in white noise”;
c) Classic challenging images: verify the detection power on classic challenging images from the literature: side-scan sonar, textile, mammography and natural images. In the case of the mammography, where one paper computed an NFA, we crucially verify that by computing the NFA on the residual instead of on the image we gain a huge factor, the NFA being divided by eleven orders of magnitude.
We tested our proposed anomaly detector on two different input image representations: the basic one, pixels, directly applies the anomaly detection procedure to the residuals obtained from the color channels, while three further variants use as input features extracted at different levels of the VGG network [7], namely, very low level (conv1_1), low level (conv2_1), and medium level (conv3_1) features. As we shall check, the four detections are similar and can be fused by a mere pixel union of all detections. Existing anomaly detectors are often tuned for specific applications, which probably explains the poor code availability. We compared with Mishne and Cohen [36], a state-of-the-art anomaly detector with available code, with the salient object detector DRFI [37] (which is state-of-the-art according to [40]), and with the state-of-the-art human gaze predictor SALICON [12]. We also compared with the salient object detector of Itti et al. [21], which works reasonably well for anomaly detection. All methods produce saliency maps where anomalies have the highest score. Anomalies for Mishne and Cohen are red-colored, while the other methods do not provide a threshold for anomalies. More results are available in the supplementary materials.
Synthetic images. The proposed method performs well on synthetic examples, as shown in Figure 2. Some weak false detections are found when using as input features extracted at different layers of the VGG net. All the other compared methods miss some detections. SALICON successfully detects the anomalous density in the fourth example, but misses several anomalies in the others or introduces numerous wrong detections. The method of Itti et al. successfully detects the anomalous color structure in the first example, but fails to detect the other ones. The methods of Mishne and Cohen and DRFI do not perform well on any of the five synthetic examples.
Real images. The comparison on real images is more intricate and requires looking in detail to find out whether the detections make sense (Figure 2). In the garage door (fourth row), two detections stand out (lens flare and red sign); other, less visible ones can be found (door scratches or holes in the brick wall). For our method, the main detections are present in all the variants. There are also specific anomalies that can be detected only at a given layer of the neural network. For example, conv1_1 detects the holes in the brick wall and the gap between the garage door and the wall, in addition to the ones detected with the pixels input. The variants conv2_1 and conv3_1 detect a missing part of a brick in the wall. Saliency methods detect the red sign but not the lens flare. The method of Mishne and Cohen only detects the garage door gap. The second real example is a man walking in front of some trees. Our method detects the man with pixels and conv1_1. DRFI and SALICON detect the man, while Mishne and Cohen and Itti et al. do not. The third real example is a radar image showing a mine, while the last example is a defect in a periodic textile. All methods detect the anomalies, with more or less precision. Note that the detection in the top right corner for both pixels and conv1_1 (and only these) corresponds to a defect inside the periodic pattern.
Comparison with the a contrario method of Grosjean and Moisan [4]. This a contrario method is designed to detect spots in colored noise textures, and was applied to the detection of tumors in mammographies. This detection algorithm is the only other one computing NFAs, so we can directly compare them with ours. The detection results on a real mammography (containing a tumor) are shown in Figure 3. With our method the tumor is detected with a much more significant (i.e., lower) NFA than with [4]. Our self-similar anomaly detection method also shows fewer false detections, which actually correspond to rare events like crossings of arteries.
5 Conclusion
We have shown that anomalies are more easily detected on the residual image, computed by removing the self-similar component, by then performing hypothesis testing. It is reassuring to see that our method finds all the anomalies proposed in the literature with very low NFA. In addition, we have experimentally shown that the method verifies the non-accidentalness principle: no anomalies are detected in white noise. We plan to build a database of test images with anomalies to run extensive validation and comparison. We also plan to extend the method to videos, by analyzing anomalies in the motion field.
References
 [1] A. Desolneux, L. Moisan, and J.M. Morel, From gestalt theory to image analysis: a probabilistic approach, vol. 34, Springer Science & Business Media, 2007.
 [2] D. Lowe, Perceptual organization and visual recognition, Kluwer Academic Publishers, 1985.
 [3] R. Grompone von Gioi, J. Jakubowicz, J.M. Morel, and G. Randall, “LSD: A fast line segment detector with a false detection control,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 722–732, 2010.
 [4] B. Grosjean and L. Moisan, “Acontrario detectability of spots in textured backgrounds,” J. Math. Imaging Vis., vol. 33, no. 3, pp. 313–337, 2009.
 [5] J. Lezama, R. Grompone von Gioi, G. Randall, and J.M. Morel, “Finding vanishing points via point alignments in image primal and dual domains,” in CVPR, 2014.
 [6] V. Patraucean, R. Grompone von Gioi, and M. Ovsjanikov, “Detection of mirrorsymmetric image patches,” in CVPR, 2013.
 [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in ICLR, 2015.
 [8] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15, 2009.
 [9] M. Pimentel, D. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215–249, 2014.
 [10] X. Ding, Y. Li, A. Belatreche, and L. Maguire, “An experimental evaluation of novelty detection methods,” Neurocomputing, vol. 135, pp. 313–327, 2014.
 [11] H. Tavakoli, E. Rahtu, and J. Heikkilä, “Fast and efficient saliency detection using sparse sampling and kernel density estimation,” in Scandinavian Conf. on Ima. Anal., 2011.
 [12] X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in ICCV, 2015.
 [13] M. Markou and S. Singh, “Novelty detection: a review –part 1: statistical approaches,” Signal processing, vol. 83, no. 12, pp. 2481–2497, 2003.
 [14] T. Honda and S. Nayar, “Finding “anomalies” in an arbitrary image,” in ICCV, 2001.
 [15] A. Goldman and I. Cohen, “Anomaly detection based on an iterative local statistics approach,” Signal Processing, vol. 84, no. 7, pp. 1225–1229, 2004.
 [16] D. Aiger and H. Talbot, “The phase only transform for unsupervised surface defect detection,” in CVPR, 2010.
 [17] J. An, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” CoRR, 2016.
 [18] T. Schlegl, P. Seeböck, S. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in IPMI, 2017.
 [19] D. Tsai and T. Huang, “Automated surface inspection for statistical textures,” Image Vis. Comput., vol. 21, no. 4, pp. 307–323, 2003.
 [20] D. Perng, S. Chen, and Y. Chang, “A novel internal thread defect autoinspection system,” Int. J. Adv. Manuf. Tech., vol. 47, no. 58, pp. 731–743, 2010.
 [21] L. Itti, C. Koch, and E. Niebur, “A model of saliencybased visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
 [22] N. Murray, M. Vanrell, X. Otazu, and C. Parraga, “Saliency estimation using a nonparametric lowlevel vision model,” in CVPR, 2011.
 [23] D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant centersurround hypothesis for bottomup saliency,” in NIPS, 2008.
 [24] L. Itti and C. Koch, “A saliencybased search mechanism for overt and covert shifts of visual attention,” Vision research, vol. 40, no. 10, pp. 1489–1506, 2000.
 [25] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?,” in CVPR, 2013.
 [26] G. Boracchi, D. Carrera, and B. Wohlberg, “Novelty detection in images by sparse representations,” in IES, 2014.
 [27] E. Elhamifar, G. Sapiro, and R. Vidal, “See all by looking at a few: Sparse modeling for finding representative objects,” in CVPR, 2012.
 [28] A. Adler, M. Elad, Y. Hel-Or, and E. Rivlin, “Sparse coding with anomaly detection,” J. Signal Process. Syst., vol. 79, no. 2, pp. 179–188, 2015.
 [29] D. Carrera, G. Boracchi, A. Foi, and B. Wohlberg, “Detecting anomalous structures by convolutional sparse models,” in IJCNN, 2015.
 [30] A. Efros and T. Leung, “Texture synthesis by nonparametric sampling,” in ICCV, 1999.
 [31] A. Buades, B. Coll, and J.M. Morel, “A nonlocal algorithm for image denoising,” in CVPR, 2005.
 [32] O. Boiman and M. Irani, “Detecting irregularities in images and in video,” IJCV, vol. 74, no. 1, pp. 17–31, 2007.
 [33] H. Seo and P. Milanfar, “Static and spacetime visual saliency detection by selfresemblance,” Journal of vision, vol. 9, no. 12, pp. 15–15, 2009.
 [34] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 10, pp. 1915–1926, 2012.
 [35] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3d transformdomain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, 2007.
 [36] G. Mishne and I. Cohen, “Multiscale anomaly detection using diffusion maps,” IEEE J. Sel. Topics Signal Process, vol. 7, no. 1, pp. 111–123, 2013.
 [37] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in CVPR, 2013.
 [38] N. Bruce and J. Tsotsos, “Saliency based on information maximization,” in NIPS, 2006.
 [39] D.M. Tsai and C.Y. Hsieh, “Automated surface inspection for directional textures,” Image Vis. Comput., vol. 18, no. 1, pp. 49–62, 1999.
 [40] A. Borji, M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, 2015.