Abstract
Visual anomaly detection is common in several applications, including medical screening and production quality checks. Although an anomaly is by definition an unknown pattern in the data, in many cases some hints or samples of the anomaly class can be given in advance. Conventional methods cannot use such available anomaly data and are also not robust to noise. In this paper, we propose a novel spatially-weighted reconstruction-loss-based anomaly detection combined with a likelihood value from a regression model trained on all known data. The spatial weights are calculated from a region of interest generated by visualizing the regression model. We introduce several strategies for the combination to propose a state-of-the-art method. Comparing with other methods on three different datasets, we empirically verify that the proposed method performs better than the others.
I Introduction
Anomaly detection [1] has been widely used in many fields. For example, it is now common to see its use for medical diagnosis [2, 3]. Depending on the real-world anomaly detection problem, some of the anomaly patterns might already be known. For example, some typical patterns that differ from those of healthy older adults have been reported for a dementia screening test called the Yamaguchi Fox-Pigeon Imitation Test (YFPIT) [4, 5]. A detection method should be able to leverage this information to improve accuracy. Since the manual check of anomaly conditions is mostly done by visual inspection, and recent vision research has led to significant breakthroughs [6, 7], using image information is a natural choice for developing an automatic system [8]. In this paper, we focus on visual anomaly detection for problems where some anomalies are known.
A straightforward method when normal and anomaly patterns are given is to train a regression function over these classes with a convolutional neural network (CNN). However, this method suffers from the data imbalance problem, and the regression value for unknown patterns is intrinsically unpredictable. Hence, a structure for detecting anomalous patterns is required.
Visual anomaly detection methods [9, 10] mostly use a reconstruction loss computed by a generative model that is trained to minimize the loss between normal inputs and the reconstructed images. However, noise in the image is averaged out or eliminated by the generative model; thus, these methods misclassify noise as an anomaly. The "Raw loss" image in Fig. 1 shows this issue, where the "Unexpected" loss is the problem. Furthermore, these methods cannot use the known anomaly patterns.
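To make this failure mode concrete, here is a tiny numpy sketch (ours, not the authors' code; all values are illustrative). A well-trained generative model smooths sensor noise away, so even a noisy-but-normal image yields a large per-pixel reconstruction loss everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

clean = np.zeros((64, 64), dtype=np.float32)
clean[20:44, 28:36] = 1.0                                 # a "normal" pattern
noisy_input = clean + rng.normal(0.0, 0.2, clean.shape)   # per-pixel sensor noise

# A well-trained autoencoder reconstructs roughly the clean pattern;
# we fake that here by reusing `clean` as the reconstruction.
reconstruction = clean

loss_image = (noisy_input - reconstruction) ** 2          # per-pixel squared error
print("mean loss on background:", loss_image[clean == 0].mean())
print("mean loss on pattern:  ", loss_image[clean == 1].mean())
# Both means are about 0.04 (= 0.2**2): the noise dominates the loss
# everywhere, so a normal-but-noisy image can score as high as a true anomaly.
```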
A region of interest (ROI) can reduce this effect. However, defining the ROI manually is tricky, error-prone, and suboptimal. Recently, Grad-CAM [11] was proposed as a method that computes the ROI from the gradients of a CNN; it does not require any region information.
We propose a hybrid method of spatially-weighted reconstruction-loss-based anomaly detection and a likelihood value from a regression model trained on the known data. The weights are computed by Grad-CAM [11] to decrease noise effects, and the combination with the regression improves accuracy using the known information. Moreover, we introduce various combination strategies. We name the SPatially-weighted Anomaly DEtection "SPADE" [12], and the method combining SPADE and Regression "SPADER". Fig. 1 shows the flow of SPADER. We compare the method with several baselines on three datasets. The major contributions are: (1) we propose a method for the condition where some anomaly patterns are known, (2) we propose a spatially-weighted method for noise reduction, and (3) we propose a hybrid strategy that obtains state-of-the-art results.
II Proposed method
As the problem statement, there are three classes: a normal class, a known anomaly class, and an unknown anomaly class. Given a set of training patterns from the normal and known anomaly classes, the method must detect anomalies among test patterns drawn from all three classes.

II-A Training
The proposed method has two networks to train: a variational autoencoder (VAE) [13] for reconstructing normal patterns, and a CNN for normalness regression.
The VAE network is trained with the following objective,

$\min_{\theta, \phi} \sum_{x \in \mathcal{X}_{\mathrm{normal}}} \mathcal{L}_{\mathrm{VAE}}(x; \theta, \phi),$   (1)

During this minimization, the network optimizes the per-sample loss,

$\mathcal{L}_{\mathrm{VAE}}(x; \theta, \phi) = D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\right) - \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right],$   (2)

where $\theta$ is the generative parameters, $\phi$ is the variational parameters, and $z$ is a random latent variable. In this paper, we use a normal distribution for the variable space; thus, the generative loss is a mean squared error. In the remainder of the paper, we write $f_{\mathrm{VAE}}$ for the network comprising both $q_{\phi}$ and $p_{\theta}$.

The CNN network is trained with the following objective,

$\min_{w} \sum_{x \in \mathcal{X}_{\mathrm{train}}} \left( f_{\mathrm{CNN}}(x; w) - y_x \right)^2,$   (3)

where $y_x$ is the label value for $x$. When $x$ is normal, $y_x = 1$; when $x$ is an anomaly, $y_x = 0$.
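The following is a minimal PyTorch sketch of the two training objectives, assuming a 64×64 single-channel input; the class names (SmallVAE, RegCNN), kernel sizes, and channel counts are our assumptions, since the paper specifies only the number of conv layers and latent units. The VAE is trained on normal patterns only; the regression CNN on normal plus known-anomaly patterns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallVAE(nn.Module):
    """Convolutional VAE; trained only on the normal class."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten())
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 64 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        xhat = self.dec(self.fc_dec(z).view(-1, 64, 16, 16))
        return xhat, mu, logvar

def vae_loss(x, xhat, mu, logvar):
    # Gaussian decoder -> MSE reconstruction term, plus the KL term of Eq. (2).
    recon = F.mse_loss(xhat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

class RegCNN(nn.Module):
    """Normalness regression: output near 1 = normal, near 0 = anomaly."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(64 * 8 * 8, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))

def reg_loss(model, x, y):
    # Eq. (3): squared error against binary normalness labels y in {0, 1}.
    return F.mse_loss(model(x).squeeze(1), y)
```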
II-B Detection
The proposed method has the following three steps to calculate a detection score: obtaining the reconstruction-loss image, calculating an ROI image, and combining the spatially-weighted loss with a likelihood value from the regression model. Algorithm 1 explains the details.
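Since Algorithm 1 itself is not reproduced here, the following PyTorch sketch gives our reading of the three steps; the names (spader_score, roi_fn, n_trials) are ours, and the product combination in the last line is one possible strategy consistent with Eq. (11) below, not necessarily the authors' exact one. A matching roi_fn is sketched after the ROI discussion below.

```python
import torch

def spader_score(x, vae, reg_cnn, roi_fn, n_trials=10):
    """Anomaly score for a batch x; higher means more anomalous.

    roi_fn(x, xhat) must return a non-negative ROI image R;
    n_trials averages over the VAE's stochastic reconstructions.
    """
    loss_img = 0.0
    for _ in range(n_trials):
        xhat, _, _ = vae(x)
        loss_img = loss_img + (x - xhat) ** 2
    loss_img = loss_img / n_trials                        # step 1: loss image

    roi = roi_fn(x, xhat)                                 # step 2: ROI image
    weighted = (roi * loss_img).flatten(1).sum(dim=1)
    spade = weighted / roi.flatten(1).sum(dim=1).clamp_min(1e-8)  # normalized

    r = reg_cnn(x).squeeze(1)                             # normalness in [0, 1]
    return spade * (1.0 - r)    # step 3: one possible combination strategy
```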
The reason for adding an ROI for the VAE output (lines 6-8 in Algorithm 1) is that, if we use only the Grad-CAM of the input and the input is an anomaly, the area related to the normal class will not be focused on; however, the loss might appear in this region. Adding the ROI of the reconstructed image enables the method to focus on areas relevant to the normal class as well.
The reason for using an absolute function (line 5) is that the input may come not only from the normal class but also from the anomaly class. If we used a ReLU function, the same as the original Grad-CAM [11], the ROI would focus only on the normal class. However, since the image reconstructed by the VAE appears similar to normal patterns, the function for the VAE output (line 7) must be a ReLU function.
The reason for adding normalization (line 9) is that weighting without it is affected by the size and strength of the ROI. When the ROI is wide or has high values, the weighted loss easily becomes large. Therefore, we include the normalization in this equation.
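A hedged sketch of this ROI construction, following the description above: Grad-CAM on the regression CNN, an absolute value for the input's map (line 5), a ReLU for the reconstruction's map (line 7), and their sum upsampled to the image size. It assumes the RegCNN structure from the earlier sketch and must run with gradients enabled (no torch.no_grad()).

```python
import torch
import torch.nn.functional as F

def grad_cam_map(reg_cnn, x):
    """Raw (un-rectified) Grad-CAM map from the last conv layer of RegCNN."""
    feats = []
    def hook(module, inputs, output):
        feats.append(output)
    h = reg_cnn.features[-2].register_forward_hook(hook)   # last Conv2d
    score = reg_cnn(x).sum()                               # regression output
    grads = torch.autograd.grad(score, feats[0])[0]
    h.remove()
    weights = grads.mean(dim=(2, 3), keepdim=True)         # GAP of gradients
    return (weights * feats[0]).sum(dim=1, keepdim=True)   # no ReLU here

def roi_fn(reg_cnn, x, xhat):
    cam_in = grad_cam_map(reg_cnn, x).abs()                # input: |.| (line 5)
    cam_out = F.relu(grad_cam_map(reg_cnn, xhat))          # VAE output: ReLU (line 7)
    roi = cam_in + cam_out                                 # combined ROI (lines 6-8)
    return F.interpolate(roi, size=x.shape[-2:], mode="bilinear",
                         align_corners=False)
```

To plug this into the earlier spader_score sketch, bind the CNN first, e.g. `lambda a, b: roi_fn(reg_cnn, a, b)`; the normalization of line 9 is already applied inside spader_score.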
III Experiments
In this paper, we prepared the following three datasets, each of which includes noise: handwritten digit images [14] with added noise, a public hand gesture dataset [15], and images of human gestures as described in [5]. The first and second datasets are for quantitative evaluation, and the third is for assessing effectiveness.
III-A Methods
We compared the following methods on each dataset (Python sketches of these scores are given after the list):

VAE [10]: VAE reconstruction-loss anomaly detection [10],
    $a_{\mathrm{VAE}} = \sum_{i} L_i$   (5)

Naïve VAE + Grad-CAM: naïve weighted loss by the ROI,
    $a_{\mathrm{naive}} = \sum_{i} R^{\mathrm{in}}_i L_i$   (6)

SPADE w/o norm.: SPADE without normalization,
    $a_{\mathrm{w/o\,norm}} = \sum_{i} R_i L_i$   (7)

SPADE [12]: spatially-weighted anomaly detection,
    $a_{\mathrm{SPADE}} = \frac{\sum_{i} R_i L_i}{\sum_{i} R_i}$   (8)

CNN-Reg: anomaly detection by the regression model,
    $a_{\mathrm{reg}} = 1 - r$   (9)

VAE + CNN-Reg: VAE-based detection with regression,
    $a = a_{\mathrm{VAE}} \cdot (1 - r)$   (10)

SPADER: SPADE with regression,
    $a = a_{\mathrm{SPADE}} \cdot (1 - r)$   (11)

where the meanings of the variables follow Algorithm 1: $L$ is the reconstruction-loss image averaged over $n$ trials of the VAE reconstruction, $R^{\mathrm{in}}$ and $R^{\mathrm{out}}$ are the ROI images of the input and the reconstructed image, $R = |R^{\mathrm{in}}| + \mathrm{ReLU}(R^{\mathrm{out}})$, and $r$ is the regression output.
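Because the equation bodies above had to be reconstructed from the surrounding text, the following Python sketch spells out each score as we read it; `loss_img` is the n-trial average loss image, `roi_in`/`roi` are the ROI images, `r` is the regression output, and the multiplicative combinations in (10) and (11) are our assumption about the combination strategy.

```python
import torch

def score_vae(loss_img):                      # Eq. (5): raw reconstruction loss
    return loss_img.flatten(1).sum(dim=1)

def score_naive(loss_img, roi_in):            # Eq. (6): input ROI only
    return (roi_in * loss_img).flatten(1).sum(dim=1)

def score_spade_wo_norm(loss_img, roi):       # Eq. (7): combined ROI, no norm.
    return (roi * loss_img).flatten(1).sum(dim=1)

def score_spade(loss_img, roi):               # Eq. (8): normalized weighting
    num = (roi * loss_img).flatten(1).sum(dim=1)
    return num / roi.flatten(1).sum(dim=1).clamp_min(1e-8)

def score_reg(r):                             # Eq. (9): 1 - normalness
    return 1.0 - r

def score_vae_reg(loss_img, r):               # Eq. (10): raw loss x regression
    return score_vae(loss_img) * score_reg(r)

def score_spader(loss_img, roi, r):           # Eq. (11): SPADE x regression
    return score_spade(loss_img, roi) * score_reg(r)
```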
III-B MNIST with noise
Since MNIST [14] does not contain noise, the reconstruction loss does not suffer from this problem; An et al. have already reported the performance [10]. In this experiment, we added Gaussian noise whose standard deviation is drawn independently for each image, because such a condition is common in real cases; for example, each image has different noise depending on the person or background in Fig. 3 and Fig. 4. We also changed the image size, and each digit was placed with a random size and position. Fig. 3 shows examples of the generated images. Here, '0' is the normal class, the odd numbers are candidates for the known anomaly class, and the remaining digits are unknown anomalies. The encoder and decoder of the VAE have four convolutional (conv) layers, and the latent space has 128 units. The CNN used for the regression value and Grad-CAM has three conv layers. Note that we did not change the networks among the methods; we only changed the scoring function.
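A sketch of this noisy-MNIST construction follows. The exact noise range, canvas size, and scale range were lost in the extraction, so sigma_range, the 64×64 canvas, and the 0.7-1.5 scale factor here are our assumptions.

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)

def make_noisy_sample(digit_img, canvas=64, sigma_range=(0.1, 0.3)):
    """Place a 28x28 MNIST digit (uint8) at a random size/position, add noise."""
    scale = rng.uniform(0.7, 1.5)                        # random digit size
    side = max(8, int(28 * scale))
    digit = np.asarray(Image.fromarray(digit_img).resize((side, side)),
                       dtype=np.float32) / 255.0
    out = np.zeros((canvas, canvas), dtype=np.float32)
    top = rng.integers(0, canvas - side + 1)             # random position
    left = rng.integers(0, canvas - side + 1)
    out[top:top + side, left:left + side] = digit
    sigma = rng.uniform(*sigma_range)                    # per-image noise level
    return np.clip(out + rng.normal(0.0, sigma, out.shape), 0.0, 1.0)
```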
Table I: AUC of the ROC curve (mean ± std over 5 trials). Left: noisy MNIST (original dataset: [14]); right: Pigeon (the problem is from [5]).

Method                  |  Noisy MNIST: known anomaly digit                       |  Pigeon: known anomaly pose
                        |  1        3        5        7        9        Average   |  c        d         e        Average
VAE [10]                |  .63±.01  .63±.01  .63±.01  .63±.01  .63±.01  .632±.01  |  .95±.01  .95±.01   .95±.01  .948±.01
Naïve VAE + Grad-CAM    |  .67±.04  .59±.03  .65±.01  .76±.02  .53±.02  .640±.02  |  .80±.02  .71±.09   .78±.10  .762±.07
SPADE w/o norm.         |  .85±.01  .83±.02  .88±.02  .87±.02  .85±.02  .857±.02  |  .96±.02  .96±.03   .82±.09  .911±.04
SPADE [12]              |  .85±.04  .87±.01  .92±.02  .86±.04  .90±.03  .880±.03  |  .98±.01  .97±.01   .96±.03  .970±.02
CNN-Reg (Pigeon: [6])   |  .73±.04  .88±.02  .96±.02  .88±.02  .96±.01  .881±.02  |  .86±.03  .97±.02   .90±.03  .908±.03
VAE + CNN-Reg           |  .73±.02  .81±.03  .87±.03  .85±.02  .89±.03  .831±.03  |  .97±.00  .99±.00   .98±.01  .980±.00
Ours: SPADER            |  .94±.02  .92±.01  .97±.00  .95±.01  .96±.01  .947±.01  |  .99±.00  1.00±.00  .98±.01  .988±.01
The left side of Table I shows the average area under the ROC curve (AUC) over 5 different trials for each condition. For example, the second column shows the result for the condition where 0 is the normal class, 1 is the known anomaly class, and the others (2-9) are unknown anomaly classes. The proposed method (SPADER) has the best performance.
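For reference, such AUC values can be computed generically with scikit-learn; this is our sketch with toy values, not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.1, 0.9, 0.4, 0.8, 0.2])   # anomaly scores (toy values)
labels = np.array([0, 1, 0, 1, 1])             # 1 = anomaly (known or unknown)
print("AUC:", roc_auc_score(labels, scores))   # averaged over trials in Table I
```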
III-C Hand gesture
We used a public hand gesture dataset because we plan to apply gesture detection for the YFPIT [5] in the next section. We used the depth images in this dataset; the images are taken from several people with various backgrounds. Fig. 3 explains the class definitions and shows examples. We set the '1' gesture as the normal class, and the others as anomaly classes. The encoder and decoder again have four conv layers, and the latent space has 256 units. The CNN again consists of three conv layers.
Table II: AUC of the ROC curve on the hand gesture dataset [15] (mean ± std).

Method                 |  Known anomaly gesture
                       |  2        3        4        5        6        7        8        9        10       Average
VAE [10]               |  .82±.17  .82±.17  .82±.17  .82±.17  .82±.17  .82±.17  .82±.17  .82±.17  .82±.17  .822±.17
Naïve VAE + Grad-CAM   |  .82±.17  .80±.20  .77±.30  .78±.17  .76±.23  .80±.16  .80±.23  .81±.16  .77±.17  .790±.20
SPADE w/o norm.        |  .83±.17  .81±.16  .87±.09  .83±.16  .80±.22  .83±.15  .84±.18  .82±.17  .82±.18  .828±.17
SPADE [12]             |  .84±.18  .82±.16  .86±.13  .85±.15  .85±.16  .85±.15  .85±.16  .84±.18  .85±.16  .845±.16
CNN-Reg                |  .77±.25  .94±.03  .92±.03  .96±.01  .95±.02  .87±.03  .94±.01  .85±.20  .97±.01  .908±.06
VAE + CNN-Reg          |  .87±.20  .95±.04  .95±.04  .97±.04  .96±.03  .92±.05  .95±.04  .88±.20  .97±.02  .934±.07
Ours: SPADER           |  .87±.20  .95±.03  .95±.04  .96±.04  .96±.03  .92±.05  .95±.04  .88±.20  .97±.02  .937±.07
Table II shows the AUC of the ROC curve for the detection. For some gestures the values of the proposed method are lower than or similar to the 'VAE + CNN-Reg' result; however, SPADER shows the best results overall.
III-D Pigeon gesture
We took images for the "pigeon"-pose test of the YFPIT [5]. We used Kinect depth images collected from several people. During the shooting, we varied the position and angle of the hands, open/closed fingers, right-/left-handedness, sitting position, and stooping angle. In total, we took around 189,000 images covering 7 poses. Fig. 4 shows the definition of the poses, samples of the captured images, and some difficult examples. We set the 'b' pose as the normal class, 'c, d, e' as candidates for the known anomaly class (because Yamaguchi et al. reported these as typical patterns for the subjects), and the others as unknown anomaly classes. The encoder and decoder again have four conv layers, and the latent space again has 256 units. The CNN has the ResNet [6] structure, with a different final layer.
IV Conclusion
We proposed a novel hybrid method of spatially-weighted anomaly detection and a regression model. We conducted experiments on three different datasets, and the proposed method achieved the best performance compared to previous methods. We hope this hybrid architecture will contribute to various applications, including reinforcement learning work [18].

References
[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys, 2009.
[2] M. Prastawa et al., "A brain tumor segmentation framework based on outlier detection," Medical Image Analysis, 2018.
[3] T. Schlegl et al., "Unsupervised anomaly detection with generative adversarial networks to guide marker discovery," in Information Processing in Medical Imaging, 2017.
[4] H. Yamaguchi et al., "Yamaguchi fox-pigeon imitation test for dementia in clinical practice," Psychogeriatrics, 2011.
[5] H. Yamaguchi, Y. Maki, and T. Yamagami, "Yamaguchi fox-pigeon imitation test: a rapid test for dementia," Dementia and Geriatric Cognitive Disorders, 2010.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[7] D. Kimura, K. Pichai, A. Kawewong, and O. Hasegawa, "Ultra-fast and online incremental transfer learning," in Symposium on Sensing via Image Information, 2011.
[8] A. Taboada-Crispi et al., "Anomaly detection in medical image analysis," Advanced Techniques in Diagnostic Imaging and Biomedical Applications, 2009.
[9] M. Sakurada and T. Yairi, "Anomaly detection using autoencoders with nonlinear dimensionality reduction," in ACM MLSDA, 2014.
[10] J. An and S. Cho, "Variational autoencoder based anomaly detection using reconstruction probability," SNU Data Mining Center, Tech. Rep., 2015.
[11] R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in ICCV, 2017.
[12] M. Narita, D. Kimura, and R. Tachibana, "Spatially-weighted anomaly detection," arXiv:1810.02607, 2018.
[13] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.
[14] Y. LeCun et al., "MNIST handwritten digit database," 2010.
[15] G. Marin et al., "Hand gesture recognition with Leap Motion and Kinect devices," in ICIP, 2014.
[16] University of Padova, "Hand gesture datasets," http://lttm.dei.unipd.it/downloads/gesture/, 2014.
[17] A. Chattopadhyay et al., "Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks," in WACV, 2018.
[18] D. Kimura, "DAQN: Deep auto-encoder and Q-network," arXiv:1806.00630, 2018.