## 1 Introduction

Generative networks can learn to generate high-dimensional data from lower-dimensional embeddings. Most applications require generative models to generalize from a limited amount of training data. As a consequence, even signals that are far from the training data distribution can be generated fairly well. Controlling this generalization property of generative networks can increase their efficiency in domains where one kind of data must be separated from another. Applications of a system with deliberately limited generative ability include noise reduction and anomaly detection.

In this paper, we present a method to control the generative capabilities of a system in such a way that it can only reconstruct a limited range of the input signal space. The technique can be used with different network structures and training algorithms. We explain the proposed method by focusing on anomaly detection in high-dimensional spaces (e.g. images) using a class of generative neural networks called autoencoders. With the proposed technique, generative models can be trained to learn a latent representation that can only encode the input distribution of non-anomalous data. After decoding the latent space back to the signal space, the reconstruction similarity can be used to judge whether the input signal contains an anomaly.

Anomaly detection is a key, and usually the first, requirement in many signal-processing pipelines [3, 10]. Generative models have previously been applied to anomaly detection [11, 1] and noise reduction [14]. In anomaly detection the task is to decide whether the input is normal or anomalous. It is a one-class classification problem in which the training data consists mostly of the non-anomalous class. We argue that, due to the generalization property, classic training methods are not ideal for anomaly detection with generative models (as shown in Section 5).

The main contribution of this paper is a new approach to limit the reconstruction capability of generative networks by learning conflicting objectives for the normal and anomalous data. The technique can exploit limited real or synthetic anomalous data through a negative learning phase in training. For example, in the case of anomaly detection on the road [1], any non-road object (e.g. vehicles, bushes) can be treated as anomalous data. Some anomalous data is available in most anomaly detection applications; it might be gathered over time automatically or by human intervention. For instance, in the case of a misclassification by a radar-based target detection system, the human operator can label the sample correctly for future use. Instead of ignoring this anomalous data, the proposed method uses it to improve future detections.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. We formally define the problem in Section 3. The specifics of our approach are detailed in Section 4. A quantitative analysis of the technique is presented in Section 5. Finally, we conclude the paper with some directions for future work in Section 6.

## 2 Related Work

There is a large body of literature on noise reduction and anomaly detection using generative models. M. N. Schmidt et al. [13] use non-negative sparse coding to reduce wind noise in speech data. They rely on a system that has a source model for the wind noise, but not for the speech, to reduce the noise. The work on denoising autoencoders by P. Vincent et al. [15] is also very important in this area, and L. Gondara [4] presents an application of such a denoising system to remove noise from medical images. These techniques can reduce the noise in the input data, but they do not limit the generative capabilities of the network. Due to the generalization property of such networks, they can also generate data that is very different from the data shown during training.

Similar to the method proposed in this paper, for anomaly detection the machine learns a model to represent normality and then uses the model to detect anomalous data. B. Saleh et al. [12] proposed a method to model the normality of a particular class of object using visual attributes. The attributes [2] are handcrafted and mainly based on the appearance of the input data, i.e. shape, texture and color. A generative model is then trained and used to reason about normal and anomalous data. A recent trend is to replace these handcrafted attributes with a deep feature representation. W. Lawson et al. [9] use deep visual features obtained from AlexNet [8] to represent objects and associate them with a scene to define the types of objects that can be found in a certain environment. D. Xu et al. [17] used stacked denoising autoencoders to learn deep features in an unsupervised fashion and use them to represent both the appearance and the motion of the scene. Anomalous data is in turn detected by multiple one-class SVM classifiers. These approaches are likely to suffer from the imbalance between normal and anomalous data that is a common characteristic of anomaly detection problems. The proposed method tries to solve this problem by using the anomalous data effectively.

Our proposed method takes an approach similar to that of C. Creusot and A. Munawar [1]. They use an extremely compressive Restricted Boltzmann Machine (RBM) to form a deep feature representation, but rather than training a classifier in the feature space, anomaly detection is performed by reconstructing the data back to the original image space and using a conventional image difference as the metric. The extreme compression in such autoencoders can severely affect the reconstruction of the input appearance when the non-anomalous data has a non-trivial appearance.

## 3 Problem Statement

In this section, we formally describe the problem of limiting a generative network to learning a single type of input distribution. Consider two random variables $X$ and $Y$ representing instances of two input distributions in the same signal space (e.g. image space). Assume we have $N_X$ and $N_Y$ samples from the two distributions, respectively. $X$ is the input distribution we want the network to reconstruct as well as possible; let its reconstruction be called $\hat{X}$. On the other hand, $Y$ is the distribution that we do not want the network to reconstruct; let its reconstructed space be represented by $\hat{Y}$. In order to achieve this objective, we need to maximize

$$P(\hat{X} = X)\,\bigl(1 - P(\hat{Y} = Y)\bigr). \tag{1}$$

By maximizing the probability of reconstruction for $X$ and minimizing it for $Y$, the generative properties of the model can be controlled in the desired way. It is important to note that data from the distribution $X$ is usually available in plenty, while data from $Y$ is scarce ($N_Y \ll N_X$).

## 4 Proposed Method

In this section we discuss the proposed approach to maximizing Equation 1. Generative models can be used in a variety of settings and configurations. In this paper we deal with generative models that encode the input distribution into a latent feature space $L$ and then reconstruct it back in the original signal space. Such generative systems are known as autoencoders. We use the word “autoencoder” for any kind of generative neural network structure, including but not limited to RBMs, variational autoencoders and Convolutional Neural Network based autoencoders.

The problem is to learn a latent representation $L$ that can encode and decode $X$ fairly well but fails to do the same for the $Y$ distribution. To formally define autoencoder-like generative models, consider a network with an input vector of size $n$ and a latent space, or hidden layer, of size $m$. As the network learns to reconstruct the input, the output of the network is also of size $n$. Given the training data $X$, a function $F$ transforms the input signal to the hidden layer, while a function $G$ reconstructs the signal from the latent space. The network parameters of the encoder and decoder are represented by $\theta_F$ and $\theta_G$ respectively. We want to find the optimal parameters that minimize the reconstruction error. When presented with an input vector $x$, the network produces a hidden vector $h = F(x; \theta_F)$ and an output vector $\hat{x} = G(h; \theta_G)$. The goal of learning is to minimize an error or energy function

$$E(\theta_F, \theta_G) = \sum_{x \in X} d\bigl(x,\, G(F(x; \theta_F); \theta_G)\bigr), \tag{2}$$

where $d(\cdot,\cdot)$ is a distance or dissimilarity measure. Any dissimilarity measure can be used; in this paper we use the mean squared error $d(x, \hat{x}) = \lVert x - \hat{x} \rVert^2$. The optimum set of parameters can be found by

$$(\theta_F^{*}, \theta_G^{*}) = \operatorname*{arg\,min}_{\theta_F, \theta_G} \; \sum_{x \in X} d\bigl(x,\, G(F(x; \theta_F); \theta_G)\bigr). \tag{3}$$

In order to create an interesting representation of the data, the size $m$ of the hidden layer is usually kept smaller than the input size $n$. However, $m \geq n$ can also be used with additional sparsity constraints to obtain very interesting behaviors for some applications.
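As a concrete illustration of these definitions, the sketch below builds a tiny autoencoder in NumPy. The sizes, random weights and sigmoid activation are hypothetical placeholders, not the paper's actual network; the sketch only shows how $F$, $G$ and the dissimilarity $d$ fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input n = 8, hidden m = 3 (keeping m < n as in the paper).
n, m = 8, 3
W_F = rng.normal(0.0, 0.1, size=(m, n))   # encoder parameters (theta_F)
W_G = rng.normal(0.0, 0.1, size=(n, m))   # decoder parameters (theta_G)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(x):
    """Encoder: map a signal-space vector to the latent space."""
    return sigmoid(W_F @ x)

def G(h):
    """Decoder: map a latent vector back to the signal space."""
    return sigmoid(W_G @ h)

def d(x, x_hat):
    """Mean squared error, the dissimilarity measure used in the paper."""
    return float(np.mean((x - x_hat) ** 2))

x = rng.random(n)
err = d(x, G(F(x)))   # per-sample reconstruction error
```

Summing `err` over a dataset gives the energy of Equation 2, whose minimization over the weights is Equation 3.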

In this paper, we propose using any real or synthetic anomalous data $Y$ to limit the reconstruction capability of the autoencoder. This is done by incorporating a negative learning phase in the training. The system parameters learned during conventional training allow a wide variety of input patterns to be reconstructed; negative training adjusts the parameters so that the anomalous patterns cannot be reconstructed well. In neural network terms, the connections used to reconstruct anomalies are weakened during negative learning. Negative learning can formally be defined as

$$(\theta_F^{*}, \theta_G^{*}) = \operatorname*{arg\,max}_{\theta_F, \theta_G} \; \sum_{y \in Y} d\bigl(y,\, G(F(y; \theta_F); \theta_G)\bigr). \tag{4}$$

Using Equation 3 and Equation 4, the model can be controlled to reconstruct non-anomalous data better than anomalies. It is important to note that both equations optimize the same set of parameters with conflicting objectives. Moving along the gradient for non-anomalous data and against it for anomalous data drives the system toward a minimum where it can reconstruct only non-anomalous data.

Algorithm 1 introduces the negative learning phase. The strategy is to use all the non-anomalous data to finish one epoch of positive learning, in which the system learns to reconstruct the non-anomalous training data $X$. Then, in the negative learning step, the available anomalous data $Y$ is used to make the system unlearn its ability to reconstruct anomalies. Positive and negative learning steps are repeated until the termination criterion is met. This enables the system to learn only the reconstruction of non-anomalous signals. We show that the benefits of the negative learning approach are significant even when the number of anomalous training samples is much smaller than the amount of non-anomalous signal data.

It is important to keep a balance between negative and positive learning. If the amount of anomalous data is very small compared to the non-anomalous data, a single positive learning iteration should be followed by multiple iterations of negative learning. An adaptive approach can be used to compute the optimal number of negative learning iterations; such an adaptive algorithm is out of the scope of this paper.
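A minimal sketch of this alternating scheme, using a toy linear autoencoder trained with per-sample gradient steps. All sizes, learning rates and the gradient clipping in the negative phase are our own illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: "normal" data X lies on an m-dimensional subspace of the
# n-dimensional signal space; "anomalous" data Y is scarce and unstructured.
n, m = 6, 2
B = rng.normal(size=(n, m))
X = rng.normal(size=(40, m)) @ B.T        # plentiful non-anomalous data
Y = rng.normal(size=(5, n)) * 2.0         # scarce anomalous data
W_F = rng.normal(0.0, 0.1, size=(m, n))   # encoder weights
W_G = rng.normal(0.0, 0.1, size=(n, m))   # decoder weights

def grads(x):
    """Gradients of the squared reconstruction error for one sample."""
    h = W_F @ x
    e = W_G @ h - x                       # reconstruction residual
    return 2.0 * np.outer(W_G.T @ e, x), 2.0 * np.outer(e, h)

for epoch in range(60):
    for x in X:                           # positive learning: descend the gradient
        gF, gG = grads(x)
        W_F -= 0.02 * gF
        W_G -= 0.02 * gG
    for y in Y:                           # negative learning: ascend the gradient
        gF, gG = grads(y)
        W_F += 0.002 * np.clip(gF, -0.5, 0.5)   # clipped here for stability
        W_G += 0.002 * np.clip(gG, -0.5, 0.5)

def err(x):
    return float(np.mean((x - W_G @ (W_F @ x)) ** 2))

err_X = float(np.mean([err(x) for x in X]))
err_Y = float(np.mean([err(y) for y in Y]))
```

After training, the mean reconstruction error on the non-anomalous set `X` sits well below the error on the anomalous set `Y`, which is exactly the separation Algorithm 1 aims for.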

## 5 Experimental Results

To explain how the algorithm works, the initial experiments are conducted on the MNIST digits dataset. The later part of this section uses actual highway data to show the validity of the approach on real-world problems.

### 5.1 Evaluation using MNIST

For this experiment we used a single-layer RBM-based autoencoder. The gray-scale images of the MNIST digit dataset were used to train a fully connected autoencoder of size 784-500-784, with sigmoid as the activation function. The termination criterion was a maximum number of epochs, and training used mini-batches. The network was trained using single-step contrastive divergence (CD-1) [5]:

$$\Delta w_{ij} = \alpha \bigl( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \bigr), \tag{5}$$

where $v$ represents the visible layer, $h$ the hidden layer, $\langle\cdot\rangle$ the expected value, and the sign of the learning rate $\alpha$ the learning direction. For the positive learning stage the weights are updated by Equation 5 with $\alpha > 0$, and for the negative stage (going against the gradient) the same equation is used with $\alpha < 0$.
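The signed CD-1 update of Equation 5 can be sketched as follows. This is a toy RBM with made-up sizes; bias terms and the full sampling details of [5] are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical small RBM; the paper uses a 784-500-784 autoencoder.
n_vis, n_hid = 6, 4
W = rng.normal(0.0, 0.1, size=(n_vis, n_hid))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_delta(v0, alpha):
    """One CD-1 weight update (Equation 5); the sign of alpha selects
    positive (alpha > 0) or negative (alpha < 0) learning."""
    h0 = sigmoid(v0 @ W)                           # hidden probabilities from data
    h0_s = (rng.random(n_hid) < h0).astype(float)  # sampled hidden states
    v1 = sigmoid(W @ h0_s)                         # one-step reconstruction
    h1 = sigmoid(v1 @ W)
    # alpha * ( <v h>_data  -  <v h>_recon )
    return alpha * (np.outer(v0, h0) - np.outer(v1, h1))

v = (rng.random(n_vis) > 0.5).astype(float)
W = W + cd1_delta(v, alpha=0.1)    # positive learning step
W = W + cd1_delta(v, alpha=-0.1)   # negative learning step (against the gradient)
```

The same update routine serves both phases; only the sign of `alpha` changes, mirroring how the paper reuses Equation 5.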

Figure 1(a) shows images from the MNIST test dataset. Two digit classes are treated as anomalies; hence, an autoencoder trained with the proposed method should not be able to reconstruct these digits well.

Figure 1(b) shows the reconstruction results of a conventional autoencoder trained with CD-1. The training data for the conventional training method contains all digits except the images of the two anomalous digit classes. It can be clearly seen that even though the system knows nothing about the anomalous digits, it is able to reconstruct them fairly accurately. This property of autoencoders is not desirable for anomaly detection.

Figure 1(c) shows the reconstruction results of the proposed approach. The anomalous digits are no longer reconstructed properly; rather, they are converted to the closest point in the non-anomalous signal space. From a shape point of view, each anomalous digit can be seen as part of one of the non-anomalous digits; yet the system trained with the proposed approach reconstructed the non-anomalous digits while failing to reconstruct the anomalous ones.

Figure 2 shows the frequency distribution of the dissimilarity measure for normal and anomalous data. For the conventional autoencoder, there is a large overlap between the two curves, making it difficult to select a suitable threshold for deciding anomalies. The proposed autoencoder shifts and spreads the anomaly curve horizontally, while the curve for the non-anomalous data remains largely unaffected.
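Once the two error distributions are separated, the detection rule itself is a simple threshold on the dissimilarity. The sketch below uses made-up error values purely to illustrate the decision:

```python
import numpy as np

# Hypothetical reconstruction errors mimicking the two distributions in Figure 2.
normal_err = np.array([0.010, 0.020, 0.015, 0.030])
anomaly_err = np.array([0.20, 0.45, 0.31])

def is_anomaly(errors, tau):
    """Flag a sample as anomalous when its dissimilarity d(x, x_hat) exceeds tau."""
    return np.asarray(errors) > tau

tau = 0.1   # a threshold placed in the gap the proposed training opens up
flags_normal = is_anomaly(normal_err, tau)
flags_anomaly = is_anomaly(anomaly_err, tau)
```

The wider the gap between the two curves, the less sensitive the detector is to the exact choice of `tau`.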

To simulate the case where the anomalous data is much smaller than the non-anomalous data, another experiment was conducted using only a small number of anomalous images (the first images of the two anomalous digit classes) together with a much larger set of non-anomalous images for training. To balance the positive and negative learning phases, five iterations of negative learning were performed after each positive learning phase (the number of negative learning iterations can be computed adaptively; this, however, is out of the scope of the current paper). As shown in Figure 2(c), even in this experiment, where the anomalous data is many times smaller than the regular data, there is still a major improvement over the results of the conventional autoencoder given in Figure 2(a). Similar results were achieved using other digits as anomalies.

Other very interesting results are shown in Figure 3. Figure 3(b) shows that a conventional autoencoder trained solely on the digit images of the MNIST dataset can also reconstruct random shapes. However, as visible in Figure 3(c), even though only the two anomalous digit classes were used as anomalous images during training, the system failed to reconstruct anomalies that it had never seen before. This shows that knowing the appearance of all possible anomalies is not necessary.

### 5.2 Evaluation on obstacle detection

In the second experiment, we used the 4K video of Japanese highways from [16]. It consists of a 1h40m sequence of Japanese highway recorded from the car dashboard with a Panasonic GH4 camera at 4K resolution (3840 x 2160). We considered the video between frame 105360 and frame 114360.

These frames were selected because they offer a good view of the road without any vehicle occluding it. The images were converted to gray-scale and then downscaled to a fraction of the original size. We then selected a fixed mask over the center road area. Gray-scale road patches of size 32x32 pixels (matching the network's 1024-unit input layer) were extracted with random strides from the rescaled video as non-anomalous data. During dataset creation, all featureless road patches were ignored. In this experiment we used just a small number of gray-scale CIFAR-10 images [7] as anomalous data (as shown in Figure 4). A randomly selected portion of the data was used for training while the remaining data was used for testing. The mean and standard deviation were computed over all images in the training data; the training and test data were then normalized by subtracting this mean from each image and dividing each image by the computed standard deviation. The network was of size 1024-512-1024. Adam [6] was used as the optimizer to verify the validity of the proposed technique for different learning methods, with the termination criterion set to a fixed maximum number of epochs. The area under the receiver operating characteristic curve (AUROC) was used as the quantitative measure.

Figure 5 shows that an autoencoder conventionally trained using only the road data can still reconstruct the CIFAR images fairly well: as the system learns to reconstruct the road better, it becomes equally good at reconstructing the CIFAR data, so the AUROC for the conventional autoencoder stays low. For the proposed technique the AUROC tends towards 1 as the number of epochs increases (for the proposed method, one epoch means one iteration of positive and negative learning). For this experiment the amount of negative learning was computed adaptively by observing the gain in AUROC when increasing or decreasing the size of the data and the number of iterations for negative learning; details of the adaptive control of negative learning are out of the scope of this paper. Figure 4 shows the reconstruction quality for the road and CIFAR images with the conventional and the proposed approach. It is clear that after training, the conventional method reconstructs the anomalous images much better than the proposed approach, which lowers the AUROC of the conventional approach.

In Figure 6 we compare the ROC of our system with classical two-class classifiers. In this case a mask was used to locate the road in the video, and anything outside the mask was treated as an anomaly. We captured a video on Japanese highways in conditions similar to [16]; however, while [16] uses a relatively wide-angle 12-35mm lens, we used a 70-150mm lens. Road patches of size 16x16 pixels (matching the network's 256-unit input layer) were extracted as non-anomalous data, while a much smaller number of patches was extracted as anomalous data. Part of the data was used for training and the remainder for testing, and the data was normalized in the same manner as in the previous experiment. The AUROC of the proposed anomaly detection technique surpasses that of the SVM and LDA classifiers within a small number of epochs. The network was of size 256-200-256, and a vanilla stochastic gradient descent algorithm was used with a small learning rate.
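For reference, the AUROC used above can be computed without an explicit ROC sweep, via its rank-statistic (Mann-Whitney) formulation. This is a generic sketch, not the evaluation code used in the paper:

```python
import numpy as np

def auroc(normal_scores, anomaly_scores):
    """AUROC as the probability that a randomly chosen anomalous sample scores
    higher than a randomly chosen normal one, counting ties as one half."""
    sn = np.asarray(normal_scores, dtype=float)[:, None]
    sa = np.asarray(anomaly_scores, dtype=float)[None, :]
    # Compare every (normal, anomaly) pair via broadcasting.
    wins = (sa > sn).sum() + 0.5 * (sa == sn).sum()
    return float(wins / (sn.size * sa.size))
```

`auroc` returns 1.0 when every anomalous sample has a higher reconstruction error than every normal sample, and about 0.5 when the two score distributions fully overlap.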

A significant improvement in AUROC clearly shows the benefit of the proposed approach.

### 5.3 Limitations

This technique works in cases where the non-anomalous data, compared to the anomalous data, is confined to a limited region of the input space. In the above experiment, treating CIFAR as normal and the road as the anomaly would not produce the expected results. This assumption generally holds for anomaly detection applications where normal operation is more or less predictable and uniform.

## 6 Conclusions

We proposed a novel method to train generative models, namely autoencoders, for anomaly detection. An anomaly is detected by considering the similarity between the input and the reconstructed signal. Conventional training methods allow the reconstruction of a signal space far beyond the training data. The proposed method ensures that the autoencoder only learns to reconstruct signals that are similar to the training distribution, which makes it easier to separate a normal signal from an anomalous one. The core idea of this research is the introduction of a negative learning phase, in which the system unlearns the reconstruction of anomalous signals. Balancing the positive and negative learning phases helps move the frequency distributions of the dissimilarity measure for regular and anomalous data away from each other.

As a future direction, we are currently working on adding the notion of time to this approach, so that the system predicts only the road features that should appear next. By matching the prediction with the actual observation, we can reveal anomalies.

## References

- [1] C. Creusot and A. Munawar. Real-time small obstacle detection on highways using compressive rbm road reconstruction. In IEEE Intelligent Vehicles Symposium, Seoul, Korea, 2015.
- [2] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785, June 2009.
- [3] D. Forslund and J. Bjärkefur. Night vision animal detection. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 737–742, June 2014.
- [4] L. Gondara. Medical image denoising using convolutional denoising autoencoders. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 241–246, Dec 2016.
- [5] G. E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines, pages 599–619. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
- [6] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [7] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- [9] W. Lawson, L. Hiatt, and K. Sullivan. Detecting anomalous objects on mobile platforms. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
- [10] A. Munawar, P. Vinayavekhin, and G. D. Magistris. Spatio-temporal anomaly detection for industrial robots through prediction in unsupervised feature space. In IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, USA, 2017.
- [11] M. Sakurada and T. Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2Nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA’14, pages 4:4–4:11, New York, NY, USA, 2014. ACM.
- [12] B. Saleh, A. Farhadi, and A. Elgammal. Object-centric anomaly detection by attribute-based reasoning. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, pages 787–794, Washington, DC, USA, 2013. IEEE Computer Society.
- [13] M. N. Schmidt, J. Larsen, and F. T. Hsiao. Wind noise reduction using non-negative sparse coding. In 2007 IEEE Workshop on Machine Learning for Signal Processing, pages 431–436, Aug 2007.
- [14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages 1096–1103. ACM, 2008.
- [15] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.
- [16] Wataken777. Tokyo express way GH4 4K. YouTube, https://www.youtube.com/watch?v=UQgj3zkh8zk (3840 x 2160), 2014.
- [17] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe. Learning deep representations of appearance and motion for anomalous event detection. In X. Xie, M. W. Jones, and G. K. L. Tam, editors, Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 8.1–8.12. BMVA Press, 2015.
