Novelty detection is an important and challenging task in computer vision. It addresses the problem of quantifying the probability that a test sample belongs to the distribution defined by a training dataset. Unlike most other machine learning tasks, novelty detection observes only a single class of samples during training. The scarcity, variability and unpredictability of outlier samples make this decision difficult, as shown in Figure 1.
Because novelty detection has many applications in anomaly detection (Schlegl et al., 2017; Zenati et al., 2018; An and Cho, 2015; Akçay et al., 2019), intruder detection (Oza and Patel, 2019; Perera and Patel, 2019), and biomedical data processing (Roberts, 1999), it has attracted extensive attention from researchers. The key to novelty detection is for the model to learn a good representation of inlier samples and to show a degree of surprisal (Tribus, 1961) toward outlier samples after learning only from inlier samples. Generally speaking, both abilities should be strengthened during training. One assumption is that outlier samples differ from inlier samples not only in the high-dimensional data space but also in the low-dimensional latent space (Akçay et al., 2019). The crux of the problem is therefore how to obtain a better reconstruction in both the high-dimensional data space and the low-dimensional latent space. To this end, Akçay et al. (2019) use a conventional generative adversarial network to jointly learn the generation of the high-dimensional image space and the low-dimensional latent space for anomaly detection. Their anomaly detection model comprises a generator and a discriminator. The generator adopts a UNet-style (Ronneberger et al., 2015) network for better image reconstruction, and the discriminator is used to discriminate images. The two learn from each other through adversarial training. However, this method does not pay attention to the data distribution of the latent space.
Partly inspired by the work of Akçay et al. (2019), we propose an improved model and apply it to novelty detection. The above method regards multiple classes as normal in its problem setting, whereas novelty detection has only one class of normal samples. To focus on the representation of a single class and improve detection performance, we first introduce a channel attention mechanism into the network structure so that the network focuses on the important and interesting features. Second, from the perspective of information theory, we decrease the information entropy of the latent layer to constrain the diversity of its representation. Experimental results on three public datasets demonstrate the validity of the proposed model. The main contributions of this paper are as follows:
- Feature selection: A channel attention mechanism is applied to the network structure, which makes the network focus on the important and interesting features.
- Constraining the latent space: Information entropy is introduced into the latent layer to make it sparse and to reduce the diversity of its representation.
2 Related Work
Novelty detection and anomaly detection are highly correlated. Due to the scarcity and unpredictability of anomalous samples, unsupervised learning is generally used to solve the problem. Traditional solutions include the one-class SVM (Chen et al., 2001), which confines inlier samples to a subspace, and LOF (Breunig et al., 2000), which is based on clustering and density estimation. However, these traditional methods depend on both normal and abnormal samples during training. Nowadays, with the great achievements of deep learning in image processing, the main strategies are reconstruction-based methods and probability-distribution-based methods, both of which use deep neural networks to solve this problem.
In the reconstruction-based strategy, the reconstruction error is used to measure the novelty of samples. Classical methods of this kind include principal component analysis (PCA), and there are also many methods based on deep learning. AnoGAN (Schlegl et al., 2017) is the first to apply a GAN to image anomaly detection. The best latent representation z is found by backpropagation, and anomalies are judged by comparing the reconstructed image with the original image. The auto-encoder is also exploited to model and measure the reconstruction error. In ALOCC (Sabokrou et al., 2018), the auto-encoder is used as a generator and plays two roles: it enhances normal samples and it distorts abnormal samples. Furthermore, Liu et al. (2018) propose, for the first time, comparing a predicted future frame with the current frame for anomaly detection, which adds a temporal constraint to video anomaly detection. The generator of their model adopts a UNet (Ronneberger et al., 2015) structure to predict the next frame.
In the probability-distribution-based strategy, more attention is paid to the distribution of the low-dimensional latent vectors. For instance, Abati et al. (2019) learn the probability distribution of the latent representation through an autoregressive process. In OCGAN (Perera et al., 2019), a de-noising auto-encoder network is used as the generator. Through adversarial training, the inlier samples and the low-dimensional latent space achieve a one-to-one mapping, so that the latent space exclusively represents the given inliers. Recently, there has also been related work in self-supervised learning. The main idea is to take advantage of auxiliary tasks to better describe the data distribution of normal samples. In (Golan and El-Yaniv, 2018), dozens of geometric transformations are applied to the samples to train a multi-class model, and this auxiliary task is used to obtain a feature detector that identifies anomalies. In (Hendrycks et al., 2019), the normal samples are rotated by four angles (0°, 90°, 180° and 270°), each rotation type is assigned one label, and an auxiliary self-supervised rotation classifier is trained with the corresponding loss. In this paper, we build an improved method on the reconstruction-based strategy for one-class novelty detection.
3 Proposed Method
First, we define the problem. The given dataset is divided into a training set and a test set. The training set is composed only of normal samples. The test set consists of both normal samples and abnormal samples.
In Skip-GANomaly (Akçay et al., 2019), the goal is to acquire both high-dimensional and low-dimensional image features. The model is trained by unsupervised adversarial learning with the following three loss functions. An adversarial loss constrains the generator and the discriminator so that they learn from each other; it is evaluated over the data distribution of the training data.
Then, to make the generated image similar to the original image, context information is captured by the reconstruction error between the two, which serves as a context loss.
Similarly, the low-dimensional features of the generated image and the original image are expected to be as close as possible, which is enforced by a feature matching loss. Note that the feature here refers to the penultimate output of the discriminator, which is marked with P in Figure 2.
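The context loss and the feature matching loss described above can be illustrated with a minimal sketch. It assumes an L1 distance for the reconstruction error and an L2 distance for the feature difference; these norms, the function names and the toy vectors are assumptions made for illustration and are not specified in the text.

```python
import math

def context_loss(x, x_hat):
    """Mean absolute (L1) error between an image and its reconstruction.

    Both inputs are flattened lists of pixel values. The L1 norm is an
    assumption of this sketch.
    """
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def feature_matching_loss(f_x, f_x_hat):
    """Euclidean (L2) distance between two discriminator feature vectors.

    f_x and f_x_hat stand for the penultimate discriminator outputs of the
    original and the generated image. The L2 norm is an assumption.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_x, f_x_hat)))

# Toy values, purely illustrative.
x = [0.0, 0.5, -0.5, 1.0]
x_hat = [0.0, 0.25, -0.5, 0.5]
print(context_loss(x, x_hat))                         # 0.1875
print(feature_matching_loss([1.0, 2.0], [4.0, 6.0]))  # 5.0
```

Both quantities are small when the generator reproduces the input well, which is the behaviour expected for inlier samples.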
During testing, we observe that the feature extraction ability of the network is very important. However, this method does not pay attention to the representation of the latent space. In the novelty detection setting, the model should focus only on the representation of one class of samples; the desired outcome is that samples from other classes are reconstructed poorly by the generator. It is therefore particularly important for the model to focus only on the representation of normal samples. Considering both the selection of important features and the constraint on the latent space, we introduce a channel attention mechanism and information entropy minimization into a network built on a de-noising auto-encoder, which places additional constraints on the auto-encoder.
The proposed network is mainly composed of two parts, the generator G and the discriminator D, as shown in Figure 2. Generator G is a de-noising auto-encoder responsible for generating images. Discriminator D is a plain identification network responsible for identifying images. The generator and the discriminator are trained adversarially so that they learn from each other. The generator is a symmetrical de-noising auto-encoder consisting of an encoder and a decoder. When an image with added random noise is fed into the network, the encoder produces a latent layer h, and the information entropy of h is calculated. The decoder then produces the generated image from h. The encoder is composed of stacked convolution blocks, each containing a convolution, batch normalization and an activation function, and a channel attention mechanism is added after each block. The decoder is composed of blocks that each contain a deconvolution, batch normalization and an activation function, and channel attention is added in the same way. The structure of the discriminator is similar to the encoder of DCGAN (Radford et al., 2015). In addition, the latent layer h of the generator is constrained by information entropy minimization. The two newly added parts are explained below.
Information entropy: Information entropy is generally used to measure the amount of information and to describe the uncertainty of an information source. In novelty detection, there is only one class of training samples. When a sample passes through the encoder of the generator, the compressed information is stored in the latent layer h, and the generated image is then obtained through the decoder. Because only normal samples are observed during training, the latent layer h is generally considered to retain the feature information of the training class. We want the latent layer to have low entropy over its possible representations, so that it extracts features that are easy to predict and that recur in inliers. To constrain the latent space and preserve the information unique to the class, we optimize this layer by minimizing its information entropy. Before the entropy is calculated, the latent layer h is passed through the Softmax activation function so that its elements sum to 1. The information entropy loss is then defined as the information entropy of this normalized vector, computed over all elements of h.
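As an illustration of this loss, the following sketch normalizes a latent vector with Softmax and computes its entropy. The natural logarithm, the small constant `eps` and the function names are assumptions of the sketch rather than details given in the text.

```python
import math

def softmax(h):
    """Normalize a latent vector so that its elements are positive and sum to 1."""
    m = max(h)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_loss(h, eps=1e-12):
    """Entropy of the Softmax-normalized latent vector h.

    eps guards against log(0); the natural logarithm is an assumption.
    """
    p = softmax(h)
    return -sum(p_i * math.log(p_i + eps) for p_i in p)

# A flat latent vector has maximal entropy; a peaked one has low entropy.
uniform = entropy_loss([0.0, 0.0, 0.0, 0.0])
peaked = entropy_loss([10.0, 0.0, 0.0, 0.0])
print(uniform > peaked)  # True
```

Minimizing this quantity pushes the normalized latent vector toward a few dominant elements, which matches the stated aim of a sparse latent layer.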
Channel attention: In addition to the improvement of the loss function described above, we also modify the network structure. To increase the representation power of the network and direct it to the important and interesting features, we introduce an attention mechanism. As a complement to the convolution operation, the attention mechanism has shown great ability in natural language and image processing. In recent years, many works have shown that the attention mechanism plays an important role in refining feature selection. Wang et al. (2017) improve the classification ability of a network by introducing an attention mechanism to obtain a refined feature map. SAGAN (Zhang et al., 2018) makes generated images more realistic by assigning weights to each element of an intermediate feature map to capture global dependencies. CBAM (Woo et al., 2018) uses two forms of attention: channel attention, which refines the feature map by assigning a weight to each channel, and spatial attention, which pays more attention to the spatial relationship between features. Novelty detection is concerned with content information rather than with the location of the target or specific pixel information. We are interested in global semantic information, which can be obtained by the global max-pooling and average-pooling operators. We therefore introduce the channel attention mechanism (Woo et al., 2018) into the network so that it focuses on the important and interesting features.
As shown in Figure 2, we add the channel attention mechanism to the generator G to select important feature information. Specifically, given a feature map F, two vectors are obtained through the max-pooling and average-pooling operations respectively. Each vector then passes through a shared three-layer fully connected network, denoted MLP. The output layer has the same dimension as the input layer, and the number of neurons in the hidden layer is controlled by the ratio R. Finally, the two outputs are added element-wise, and the weight of each channel is obtained through the Sigmoid activation function. The given feature map F is multiplied by the corresponding channel weights to obtain the refined feature map.
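The channel attention step described above can be sketched in plain Python as follows. The feature map is represented as a list of channels, each a 2-D list; the ReLU in the hidden layer, the omission of bias terms, and the weight matrices `W1` and `W2` are assumptions made for this illustration.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def shared_mlp(v, W1, W2):
    """Shared MLP: C inputs -> C/R hidden units (ReLU, assumed) -> C outputs."""
    hidden = [max(0.0, a) for a in matvec(W1, v)]
    return matvec(W2, hidden)

def channel_attention(F, W1, W2):
    """Return per-channel weights and the refined feature map."""
    flat = [[p for row in channel for p in row] for channel in F]
    v_max = [max(ch) for ch in flat]            # global max-pooling
    v_avg = [sum(ch) / len(ch) for ch in flat]  # global average-pooling
    summed = [a + b for a, b in zip(shared_mlp(v_max, W1, W2),
                                    shared_mlp(v_avg, W1, W2))]
    weights = [1.0 / (1.0 + math.exp(-s)) for s in summed]  # Sigmoid
    refined = [[[w * p for p in row] for row in channel]
               for w, channel in zip(weights, F)]
    return weights, refined

# Toy example: C = 2 channels, ratio R = 2, so one hidden unit.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
W1 = [[1.0, -1.0]]     # hypothetical weights, shape (C/R) x C
W2 = [[1.0], [-1.0]]   # hypothetical weights, shape C x (C/R)
weights, refined = channel_attention(F, W1, W2)
print(weights[0] > weights[1])  # True: the first channel is emphasized
```

In a trained network the two weight matrices are learned together with the rest of the generator; the fixed values here only show the data flow.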
Finally, the model is optimized by minimizing the integrated loss function L, which is a weighted sum of the individual losses. The four weighting parameters adjust the impact of each individual loss on the overall objective function.
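A weighted objective of this kind can be sketched as below. The loss names (`adv`, `con`, `lat`, `ent`), the loss values and the weights are hypothetical placeholders chosen for the example; the weights actually used are not recoverable from the text.

```python
def integrated_loss(losses, weights):
    """Weighted sum of individual losses; both dicts must share the same keys."""
    if losses.keys() != weights.keys():
        raise ValueError("each loss needs exactly one weight")
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical loss values and weights, for illustration only.
losses = {"adv": 0.5, "con": 0.25, "lat": 0.125, "ent": 1.0}
weights = {"adv": 1.0, "con": 4.0, "lat": 2.0, "ent": 0.5}
print(integrated_loss(losses, weights))  # 2.25
```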
4 Experimental Results
4.1 Datasets and Experimental Results
In the inference stage, we adopt the novelty score proposed in (Schlegl et al., 2017) and also employed in (Zenati et al., 2018). The novelty score A(x) of a test sample x is computed from the high-dimensional context loss and the low-dimensional feature loss.
The novelty scores are then normalized to [0, 1], which gives the final score of each test image. The closer the score is to 1, the more likely the image is an outlier. Finally, the novelty scores of the whole test set are compared with the ground-truth labels to obtain the result.
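The scoring step can be illustrated with the following sketch. It assumes that the two errors are combined by a convex weighting with a hypothetical parameter `lam` and that the normalization is a min-max rescaling over the test set; neither detail is stated explicitly in the text.

```python
def novelty_scores(context_errors, feature_errors, lam=0.5):
    """Combine per-sample context and feature errors into raw scores.

    lam is a hypothetical weighting parameter, not a value from the paper.
    """
    return [lam * r + (1.0 - lam) * l
            for r, l in zip(context_errors, feature_errors)]

def min_max_normalize(scores):
    """Rescale scores to [0, 1] over the whole test set (assumed min-max)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

After this rescaling, the sample with the largest raw error receives a score of 1 and is the one most likely to be an outlier.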
To validate our idea, we conduct extensive experiments on three datasets, namely COIL-100, MNIST and CIFAR10. Figure 3 shows some typical images from each dataset. In all experiments, the pixel values of all images are scaled to [-1, 1]. Since the problem settings of Skip-GANomaly (Akçay et al., 2019) for anomaly detection and of novelty detection are different, the two tasks are not directly comparable. We therefore compare against some traditional methods, such as OCSVM (Chen et al., 2001), KDE (Bishop, 2006) and DAE (Hadsell et al., 2006), and some deep learning methods, such as VAE (Kingma and Welling, 2013), PixCNN (Van den Oord et al., 2016) and AnoGAN (Schlegl et al., 2017). The AUC (Area Under Curve) criterion is used to evaluate performance.
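For reference, the AUC of a set of novelty scores can be computed directly from its rank interpretation: the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier. The sketch below assumes label 1 for outliers and 0 for inliers; the function name and toy data are illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # outliers
    neg = [s for y, s in zip(labels, scores) if y == 0]  # inliers
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic-time version is meant only to make the definition concrete; sorting-based implementations are preferable for large test sets.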
COIL100: The COIL100 dataset is a collection of color pictures of 100 objects taken from different views. The difference between images within a class is small, while the gap between classes is relatively large. For consistency, 80% of the normal samples are used for training and the remaining 20% are used for testing. The abnormal test samples are randomly selected so that they make up half of the test set. Because the dataset is relatively simple, our method achieves a high AUC value, as shown in Table 1. The ALOCC results (Sabokrou et al., 2018) reflect the importance of the reconstruction step. The AUC of our method is 0.961, which is higher than that of DCAE (Sakurada and Yairi, 2014).
Table 1: AUC results on COIL100.

| Method | AUC |
|---|---|
| ALOCC DR (Sabokrou et al., 2018) | 0.809 |
| ALOCC D (Sabokrou et al., 2018) | 0.686 |
| DCAE (Sakurada and Yairi, 2014) | 0.949 |

Table 2: AUC results on MNIST. Each column gives the result when the corresponding digit is the inlier class; the last column is the average.

| Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OCSVM (Chen et al., 2001) | 0.988 | 0.999 | 0.902 | 0.950 | 0.955 | 0.968 | 0.978 | 0.965 | 0.853 | 0.955 | 0.9513 |
| DAE (Hadsell et al., 2006) | 0.894 | 0.999 | 0.792 | 0.851 | 0.888 | 0.819 | 0.944 | 0.922 | 0.740 | 0.917 | 0.8766 |
| VAE (Kingma and Welling, 2013) | 0.997 | 0.999 | 0.936 | 0.959 | 0.973 | 0.964 | 0.993 | 0.976 | 0.923 | 0.976 | 0.9696 |
| PixCNN (Van den Oord et al., 2016) | 0.531 | 0.995 | 0.476 | 0.517 | 0.739 | 0.542 | 0.592 | 0.789 | 0.340 | 0.662 | 0.6183 |
| GAN (Schlegl et al., 2017) | 0.926 | 0.995 | 0.805 | 0.818 | 0.823 | 0.803 | 0.890 | 0.898 | 0.817 | 0.887 | 0.8662 |
| AND (Abati et al., 2018) | 0.984 | 0.995 | 0.947 | 0.952 | 0.960 | 0.971 | 0.991 | 0.970 | 0.922 | 0.979 | 0.9671 |
| AnoGAN (Schlegl et al., 2017) | 0.966 | 0.992 | 0.850 | 0.887 | 0.894 | 0.883 | 0.947 | 0.935 | 0.849 | 0.924 | 0.9127 |
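
Table 3: AUC results on CIFAR10. Each column gives the result when the corresponding class is the inlier class; the last column is the average.

| Method | Plane | Car | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OCSVM (Chen et al., 2001) | 0.630 | 0.440 | 0.649 | 0.487 | 0.735 | 0.500 | 0.725 | 0.533 | 0.649 | 0.508 | 0.5856 |
| DAE (Hadsell et al., 2006) | 0.411 | 0.478 | 0.616 | 0.562 | 0.728 | 0.513 | 0.688 | 0.497 | 0.487 | 0.378 | 0.5358 |
| VAE (Kingma and Welling, 2013) | 0.700 | 0.386 | 0.679 | 0.535 | 0.748 | 0.523 | 0.687 | 0.493 | 0.696 | 0.386 | 0.5833 |
| PixCNN (Van den Oord et al., 2016) | 0.788 | 0.428 | 0.617 | 0.574 | 0.511 | 0.571 | 0.422 | 0.454 | 0.715 | 0.426 | 0.5506 |
| GAN (Schlegl et al., 2017) | 0.708 | 0.458 | 0.664 | 0.510 | 0.722 | 0.505 | 0.707 | 0.471 | 0.713 | 0.458 | 0.5916 |
| AND (Abati et al., 2018) | 0.717 | 0.494 | 0.662 | 0.527 | 0.736 | 0.504 | 0.726 | 0.560 | 0.680 | 0.566 | 0.6172 |
| AnoGAN (Schlegl et al., 2017) | 0.671 | 0.547 | 0.529 | 0.545 | 0.651 | 0.603 | 0.585 | 0.625 | 0.758 | 0.665 | 0.6179 |
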
MNIST: MNIST is a classical dataset that is often used for novelty detection. It consists of ten classes of handwritten digits. Each image has a resolution of 28×28 pixels. To match the input of the model, each sample is resized to 32×32 pixels by bilinear interpolation. In training, one class is considered as inliers, while the remaining classes are considered as outliers. To be consistent with the compared methods, we use the given training and testing splits of the dataset: the normal samples of the training split form the training set, and all samples of the testing split form the test set. The results are shown in Table 2. We consider some traditional methods, such as OCSVM (Chen et al., 2001) and KDE (Bishop, 2006), and some methods based on deep learning, such as GAN (Schlegl et al., 2017). Each row gives the results of one method, expressed as AUC values, and the last column is the average AUC of that method. Compared with the methods in Table 2, our result is close to those of VAE and AND. Our analysis is that channel attention and information entropy have only a small effect on grayscale images with a simple background. When the background becomes complex and diverse, the newly added modules can play a more effective role, which can be verified on the CIFAR10 dataset. In general, our method achieves satisfactory results.
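The one-class protocol described above can be sketched as follows, assuming each split is a list of (sample, class label) pairs. The function name, the toy data and the convention that label 1 marks an outlier are assumptions made for illustration.

```python
def one_class_split(train_split, test_split, inlier_class):
    """Build the training and test sets for one inlier class.

    Training uses only the inlier samples of the training split. Testing
    uses every sample of the testing split, relabeled as 0 (inlier) or
    1 (outlier).
    """
    train_set = [x for x, y in train_split if y == inlier_class]
    test_set = [(x, 0 if y == inlier_class else 1) for x, y in test_split]
    return train_set, test_set

# Toy data: strings stand in for images, integers for digit classes.
train_split = [("img_a", 3), ("img_b", 7), ("img_c", 3)]
test_split = [("img_d", 3), ("img_e", 1), ("img_f", 7)]
train_set, test_set = one_class_split(train_split, test_split, inlier_class=3)
print(train_set)  # ['img_a', 'img_c']
print(test_set)   # [('img_d', 0), ('img_e', 1), ('img_f', 1)]
```

Repeating this split once per class and averaging the resulting AUC values gives one row of a per-class results table.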
CIFAR10: The dataset contains 60,000 natural images, all of which are taken from the real world. The images have a resolution of 32×32 pixels and are divided into 10 categories with 6,000 images in each category. In the experiment, the training and testing sets are partitioned in the same way as for MNIST. One class is considered as inliers, while the other nine classes are considered as outliers. Table 3 shows the results when each class in turn is treated as the inlier class. Among the methods based on deep learning, our method shows clear advantages. Because AnoGAN needs to find the best representation through backpropagation, it is quite slow, whereas our method needs only a single forward pass to make its decision. It is worth noting that when the cat or dog class is regarded as outliers, almost all methods show mediocre results. Similarly, the considered models do not discriminate well between the similar classes car and truck. Relative to AnoGAN and AND, our results show a marked improvement on this issue (improved by 4). AND uses autoregressive density estimation, whereas the method we put forward is relatively simple and achieves better results. Through the newly added parts, the model can adaptively select important features and retain unique information.
4.2 Ablation Study
To verify the validity of the improved model, a series of ablation studies is carried out on the natural image dataset CIFAR10, testing the two new components in turn. First, we consider a structure that consists only of the generator and the discriminator, where the generator is a de-noising auto-encoder; its AUC is 0.619, as shown in Table 4. Second, the information entropy of the latent layer h is added to the loss function of the generator, which improves the result by about 2%, to 0.642. Third, on top of this, the channel attention mechanism is added to the generator, which improves performance by about a further 1.3%, to 0.655. This component-by-component analysis confirms the effectiveness of each addition.
Table 4: Ablation results (AUC) on CIFAR10.

| Configuration | AUC |
|---|---|
| Generator and discriminator | 0.619 |
| With latent entropy loss | 0.642 |
| With channel attention | 0.655 |
5 Conclusion
In this paper, we note that the quality of feature extraction within one class directly affects the familiarity of the model with inlier samples and its sensitivity to outlier samples. We introduce a channel attention mechanism into the generator to better extract inlier features. At the same time, we improve the loss function from the perspective of information entropy to constrain the representation of the latent layer and remove redundancy from the encoded information. In addition, unsupervised adversarial learning is used to optimize both the high-dimensional data space and the low-dimensional latent space. The proposed method is validated on three public datasets. In the future, we will pay more attention to how the model represents single-class samples.
This work was supported in part by the Natural Science Foundation of Zhejiang Province (LQ18F030013, LQ18F030014).
- Abati et al. (2018) Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. And: Autoregressive novelty detectors. arXiv preprint arXiv:1807.01653, 2018.
- Abati et al. (2019) Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Latent space autoregression for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 481–490, 2019.
- Akçay et al. (2019) Samet Akçay, Amir Atapour-Abarghouei, and Toby P Breckon. Skip-ganomaly: Skip connected and adversarially trained encoder-decoder anomaly detection. arXiv preprint arXiv:1901.08954, 2019.
- An and Cho (2015) Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1), 2015.
- Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
- Breunig et al. (2000) Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, pages 93–104. ACM, 2000.
- Chen et al. (2001) Yunqiang Chen, Xiang Sean Zhou, and Thomas S Huang. One-class svm for learning in image retrieval. In ICIP (1), pages 34–37. Citeseer, 2001.
- Golan and El-Yaniv (2018) Izhak Golan and Ran El-Yaniv. Deep anomaly detection using geometric transformations. 2018.
- Goodfellow et al. (2014) Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Advances in Neural Information Processing Systems, 3:2672–2680, 2014.
- Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
- Hendrycks et al. (2019) Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. 2019.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Liu et al. (2018) Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6536–6545, 2018.
- Oza and Patel (2019) Poojan Oza and Vishal M Patel. Active authentication using an autoencoder regularized cnn-based one-class classifier. arXiv preprint arXiv:1903.01031, 2019.
- Perera and Patel (2018) Pramuditha Perera and Vishal M Patel. Dual-minimax probability machines for one-class mobile active authentication. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8. IEEE, 2018.
- Perera and Patel (2019) Pramuditha Perera and Vishal M Patel. Learning deep features for one-class classification. IEEE Transactions on Image Processing, 2019.
- Perera et al. (2019) Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. Ocgan: One-class novelty detection using gans with constrained latent representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2898–2906, 2019.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Computer Science, 2015.
- Roberts (1999) Stephen J Roberts. Novelty detection using extreme value statistics. IEE Proceedings-Vision, Image and Signal Processing, 146(3):124–129, 1999.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. 2015.
- Sabokrou et al. (2018) Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3379–3388, 2018.
- Sakurada and Yairi (2014) Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.
- Schlegl et al. (2017) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
- Tribus (1961) M Tribus. Thermostatics and thermodynamics: An introduction to energy. In Information and Sates of Matter with Engineering Applications. D. van Nostrand Princeton, 1961.
- Van den Oord et al. (2016) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
- Wang et al. (2017) Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
- Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
- Zenati et al. (2018) Houssam Zenati, Chuan Sheng Foo, Bruno Lecouat, Gaurav Manek, and Vijay Ramaseshan Chandrasekhar. Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222, 2018.
- Zhang et al. (2018) Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. 2018.