Generative Probabilistic Novelty Detection with Adversarial Autoencoders
Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. We assume that training data is available to describe only the inlier distribution. Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used either to compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely it is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible by linearizing the parameterized manifold that captures the underlying structure of the inlier distribution, and we show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improve the training of the autoencoder network. An extensive set of results shows that the approach achieves state-of-the-art results on several benchmark datasets.
Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. From a statistical point of view, this process usually occurs when prior knowledge of the distribution of inliers is the only information available. This is also the most difficult and relevant scenario, because outliers are often very rare, or even dangerous to experience (e.g., in industry process fault detection ge13process ), so there is a need to rely only on inlier training data. Novelty detection has received significant attention in application areas such as medical diagnoses Schlegl17ipmi , drug discovery Kadurin17drug ; cong2011sparse ; Vasconcelos14tpami , videos Sabokrou17tpami , and outlier detection xia2015learning ; you2017provable . We refer to PIMENTEL2014215 for a general review on novelty detection. The most recent approaches are based on learning deep network architectures ravanbakhsh2017abnormal ; sabokrou2018adversarially , and they tend either to learn a one-class classifier khan_madden_2014 ; sabokrou2018adversarially , or to use the reconstruction error of their encoder-decoder architecture as the novelty score sabokrou2016video ; xia2015learning .
In this work, we also introduce a new encoder-decoder architecture, which is based on adversarial autoencoders makhzani2015adversarial . However, we do not train a one-class classifier. Instead, we learn the probability distribution of the inliers. The novelty test therefore reduces to evaluating the probability of a test sample, and rare samples (outliers) fall below a given threshold. We show that this approach allows us to effectively use the decoder network to learn the parameterized manifold shaping the inlier distribution, in conjunction with the probability distribution of the (parameterizing) latent space. The approach is made computationally feasible because, for a given test sample, we linearize the manifold and show that, with respect to the local manifold coordinates, the data model distribution factorizes into a component dependent on the manifold (decoder network plus latent distribution) and another one dependent on the noise, which can also be learned offline.
We named the approach generative probabilistic novelty detection (GPND) because we compute the probability distribution of the full model, which includes the signal plus noise portion, and because it relies on being able to also generate data samples. We are mostly concerned with novelty detection using images, and with controlling the distribution of the latent space to ensure good generative reproduction of the inlier distribution. This is essential not so much to ensure good image generation, but for the correct computation of the novelty score. This aspect has been overlooked by the deep learning literature so far, since the focus has been only on leveraging the reconstruction error. We leverage the reconstruction error as well, but we show that in our framework it affects only the noise portion of the model. In order to control the latent distribution and image generation, we learn an adversarial autoencoder network with two discriminators that address these two issues.
Section 2 reviews the related work. Section 3 introduces the GPND framework, and Section 4 describes the training and architecture of the adversarial autoencoder network. Section 6 presents an extensive set of experiments showing that GPND is effective and produces state-of-the-art results on several benchmarks.
Novelty detection is the task of recognizing abnormality in data. The literature in this area is sizable. Novelty detection methods can be statistical and probabilistic based kim2012robust ; eskin2000anomaly , distance based hautamaki2004outlier , and also based on self-representation you2017provable . Recently, deep learning approaches xia2015learning ; sabokrou2018adversarially have also been used, greatly improving the performance of novelty detection.
Statistical methods barnett1974outliers ; yamanishi2004line ; kim2012robust ; eskin2000anomaly usually focus on modeling the distribution of inliers by learning the parameters defining the probability, and outliers are identified as those having low probability under the learned model. Distance based outlier detection methods knorr2000distance ; hautamaki2004outlier ; eskin2002geometric identify outliers by their distance to neighboring examples. They assume that inliers are close to each other while the abnormal samples are far from their nearest neighbors. A known work in this category is LOF breunig2000lof , which is based on $k$-nearest neighbors and density based estimation. More recently, bodesheim2013kernel introduced the Kernel Null Foley-Sammon Transform (KNFST) for multi-class novelty detection, where training samples of each known category are projected onto a single point in the null space, and distances between the projection of a test sample and the class representatives are then used to obtain a novelty measure. liu2017incremental improves on previous approaches by proposing an incremental procedure called Incremental Kernel Null Space Based Discriminant Analysis (IKNDA).
Similarly, deep learning based approaches have used neural networks and leveraged the reconstruction error of encoder-decoder architectures. hasan2016learning ; xu2015learning used deep learning based autoencoders to learn the model of normal behaviors and employed a reconstruction loss to detect outliers. wang2018generative used a GAN goodfellow2014generative based method that generates new samples similar to the training data, demonstrated its ability to describe the training data, and then transformed the implicit data description of normal data into a novelty score. ravanbakhsh2017abnormal trained GANs using optical flow images to learn a representation of scenes in videos. xia2015learning minimized the reconstruction error of an autoencoder to remove outliers from noisy data, and by utilizing the gradient magnitude of the autoencoder they made the reconstruction error more discriminative for positive samples. sabokrou2018adversarially proposed a framework for one-class classification and novelty detection. It consists of two main modules learned in an adversarial fashion. The first is a decoder-encoder convolutional neural network trained to reconstruct inliers accurately, while the second is a one-class classifier made with another network that produces the novelty score.
The proposed approach relates to the statistical methods because it aims at computing the probability distribution of test samples as the novelty score, but it does so by learning the manifold structure of the distribution with an encoder-decoder network. Moreover, the method differs from those that learn a one-class classifier, or that rely on the reconstruction error alone to compute the novelty score, because in our framework the reconstruction error accounts for only one component of the score computation, which allows us to achieve improved performance.
State-of-the-art works on density estimation for image compression include Pixel Recurrent Neural Networks oord2016pixel and derivatives van2016conditional ; salimans2017pixelcnn++ . These pixel-based methods predict the pixels of an image sequentially along the two spatial dimensions. Because they model the joint distribution of the raw pixels along with their sequential correlation, it is possible to use them for image compression. Although they could also model the probability distribution of known samples, they work at a local scale in a patch-based fashion, which makes non-local pixels loosely correlated. Our approach, instead, does not model the probability density of individual pixels but works with the whole image. It is not suitable for image compression, and while its generative nature allows it in principle to produce novel images, in this work we focus only on novelty detection by evaluating the inlier probability distribution on test samples.
A recent line of work has focused on detecting out-of-distribution samples by analyzing the output entropy of a prediction made by a pre-trained deep neural network kendalG17nips ; hendrycksG17iclar ; devries2018learning ; liang2018enhancing . This is done either by simply thresholding the maximum softmax score hendrycksG17iclar , or by first applying perturbations to the input, scaled proportionally to the gradients with respect to the input, and then combining the softmax score with temperature scaling, as is done in Out-of-distribution Image Detection in Neural Networks (ODIN) liang2018enhancing . While these approaches require labels for the in-distribution data to train the classifier network, our method does not use label information. Therefore, it can be applied when the in-distribution data is represented by one class or when label information is not available.
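We assume that training data points $x_1, \dots, x_N$, where $x_i \in \mathbb{R}^m$, are sampled, possibly with noise $\xi_i$, from the model

$$x_i = f(z_i) + \xi_i, \qquad i = 1, \dots, N, \tag{1}$$

where $z_i \in \Omega \subset \mathbb{R}^n$. The mapping $f: \Omega \to \mathbb{R}^m$ defines $\mathcal{M} \equiv f(\Omega)$, which is a parameterized manifold of dimension $n$, with $n < m$. We also assume that the Jacobi matrix of $f$ is full rank at every point of the manifold. In addition, we assume that there is another mapping $g: \mathbb{R}^m \to \mathbb{R}^n$, such that for every $x \in \mathcal{M}$, it follows that $f(g(x)) = x$, which means that $g$ acts as the inverse of $f$ on such points.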
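Given a new data point $\bar x \in \mathbb{R}^m$, we design a novelty test to assert whether $\bar x$ was sampled from model (1). We begin by observing that $\bar x$ can be non-linearly projected onto $\bar x^{\parallel} \in \mathcal{M}$ via $\bar x^{\parallel} = f(\bar z)$, where $\bar z = g(\bar x)$. Assuming $f$ to be smooth enough, we perform a linearization based on its first-order Taylor expansion

$$f(z) = f(\bar z) + J_f(\bar z)(z - \bar z) + O(\|z - \bar z\|^2), \tag{2}$$

where $J_f(\bar z)$ is the Jacobi matrix computed at $\bar z$, and $\|\cdot\|$ is the $L_2$ norm. We note that $\mathcal{T} = \mathrm{span}(J_f(\bar z))$ represents the tangent space of $f$ at $\bar x^{\parallel}$ that is spanned by the $n$ independent column vectors of $J_f(\bar z)$, see Figure 2. Also, we have $\mathcal{T} = \mathrm{span}(U^{\parallel})$, where $J_f(\bar z) = U^{\parallel} S V^{\top}$ is the singular value decomposition (SVD) of the Jacobi matrix. The matrix $U^{\parallel}$ has rank $n$, and if we define $U^{\perp}$ such that $U = [U^{\parallel}\; U^{\perp}]$ is a unitary matrix, we can represent the data point $\bar x$ with respect to the local coordinates that define the tangent space $\mathcal{T}$ and its orthogonal complement $\mathcal{T}^{\perp}$. This is done by computing

$$\bar w = U^{\top} \bar x = \begin{bmatrix} U^{\parallel\top} \bar x \\ U^{\perp\top} \bar x \end{bmatrix} = \begin{bmatrix} \bar w^{\parallel} \\ \bar w^{\perp} \end{bmatrix}, \tag{3}$$

where the rotated coordinates $\bar w$ are decomposed into $\bar w^{\parallel}$, which are parallel to $\mathcal{T}$, and $\bar w^{\perp}$, which are orthogonal to $\mathcal{T}$.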
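The linearization and change of coordinates can be illustrated with a small numerical sketch. Everything named here is hypothetical: `decoder` is a toy stand-in for a trained decoder $f$ with $n = 2$ and $m = 4$, the latent code `z_bar` is assumed to be given rather than produced by an encoder $g$, and the Jacobi matrix is approximated with central differences.

```python
import numpy as np

def decoder(z):
    """Hypothetical smooth decoder f: R^2 -> R^4 (stand-in for a trained network)."""
    z1, z2 = z
    return np.array([z1, z2, np.sin(z1) + z2 ** 2, z1 * z2])

def numerical_jacobian(f, z, eps=1e-5):
    """Central-difference approximation of the Jacobi matrix of f at z, shape (m, n)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for k in range(z.size):
        step = np.zeros_like(z)
        step[k] = eps
        cols.append((f(z + step) - f(z - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def local_coordinates(f, z_bar, x_bar):
    """Rotate x_bar into coordinates parallel and orthogonal to the tangent space at f(z_bar)."""
    J = numerical_jacobian(f, z_bar)
    n = J.shape[1]
    # full_matrices=True returns the complete m x m orthogonal matrix U = [U_par, U_perp].
    U, S, Vt = np.linalg.svd(J, full_matrices=True)
    w = U.T @ x_bar
    return w[:n], w[n:], S

z_bar = np.array([0.3, -0.7])
# A test point slightly off the manifold (the offset is an arbitrary illustration).
x_bar = decoder(z_bar) + np.array([0.0, 0.0, 0.05, -0.02])
w_par, w_perp, S = local_coordinates(decoder, z_bar, x_bar)
```

Because $U$ is orthogonal, the rotation preserves the norm of $\bar x$; the split into `w_par` and `w_perp` only reassigns its coordinates to the tangent space and its complement.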
We now indicate with $p_X(x)$ the probability density function of the random variable $X$, from which training data points have been drawn. Also, $p_W(w)$ is the probability density function of the random variable $W$ representing $X$ after the change of coordinates. The two distributions are identical. However, we make the assumption that the coordinates $W^{\parallel}$, which are parallel to $\mathcal{T}$, and the coordinates $W^{\perp}$, which are orthogonal to $\mathcal{T}$, are statistically independent. This means that the following holds

$$p_X(x) = p_W(w) = p_W(w^{\parallel}, w^{\perp}) = p_{W^{\parallel}}(w^{\parallel})\, p_{W^{\perp}}(w^{\perp}). \tag{4}$$

This is motivated by the fact that in (1) the noise $\xi$ is assumed to predominantly deviate the point $x$ away from the manifold $\mathcal{M}$ in a direction orthogonal to $\mathcal{T}$. This means that $W^{\perp}$ is primarily responsible for the noise effects, and since noise and drawing from the manifold are statistically independent, so are $W^{\parallel}$ and $W^{\perp}$.
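From (4), given a new data point $\bar x$, we propose to perform novelty detection by executing the following test

$$p_X(\bar x) = p_{W^{\parallel}}(\bar w^{\parallel})\, p_{W^{\perp}}(\bar w^{\perp}) \; \begin{cases} \ge \gamma & \Longrightarrow \text{inlier} \\ < \gamma & \Longrightarrow \text{outlier} \end{cases} \tag{5}$$

where $\gamma$ is a suitable threshold.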
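The novelty detector (5) requires the computation of $p_{W^{\parallel}}(w^{\parallel})$ and $p_{W^{\perp}}(w^{\perp})$. Given a test data point $\bar x \in \mathbb{R}^m$, its non-linear projection onto $\mathcal{M}$ is $\bar x^{\parallel} = f(g(\bar x))$. Therefore, $\bar w^{\parallel}$ can be written as $\bar w^{\parallel} = U^{\parallel\top} \bar x = U^{\parallel\top}(\bar x - \bar x^{\parallel}) + U^{\parallel\top} \bar x^{\parallel} \approx U^{\parallel\top} \bar x^{\parallel}$, where we have made the approximation that $U^{\parallel\top}(\bar x - \bar x^{\parallel}) \approx 0$. Since $\bar x^{\parallel} \in \mathcal{M}$, in its neighborhood it can be parameterized as in (2), which means that $w^{\parallel}(z) = U^{\parallel\top} f(\bar z) + S V^{\top}(z - \bar z) + O(\|z - \bar z\|^2)$. Therefore, if $Z$ represents the random variable from which samples are drawn from the parameterized manifold, and $p_Z(z)$ is its probability density function, then it follows that

$$p_{W^{\parallel}}(w^{\parallel}) = |\det S^{-1}|\, p_Z(z), \tag{6}$$

since $V$ is a unitary matrix, so the change of variables from $z$ to $w^{\parallel}$ scales volumes by $|\det S|$.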
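In order to compute $p_{W^{\perp}}(w^{\perp})$, we approximate it with its average over the hypersphere $\mathcal{S}^{m-n-1}$ of radius $\|w^{\perp}\|$, giving rise to

$$p_{W^{\perp}}(w^{\perp}) \approx \frac{\Gamma\!\left(\frac{m-n}{2}\right)}{2 \pi^{\frac{m-n}{2}} \|w^{\perp}\|^{m-n-1}}\, p_{\|W^{\perp}\|}(\|w^{\perp}\|), \tag{7}$$

where $\Gamma(\cdot)$ represents the gamma function and the denominator is the surface area of that hypersphere. This is motivated by the fact that noise of a given intensity will be equally present in every direction. Moreover, its computation depends on $p_{\|W^{\perp}\|}(\|w^{\perp}\|)$, which is the distribution of the norms of $w^{\perp}$, and which can easily be learned offline by histogramming the norms of $\bar w^{\perp} = U^{\perp\top} \bar x$.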
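In practice the two factors span many orders of magnitude, so it is natural to evaluate the test in the log domain. The sketch below is one possible way to do that, not the authors' implementation. It assumes that the latent log-density, the singular values of the Jacobi matrix, and the log of the radial density of the residual norm are already available. The perpendicular term divides the radial density by the surface area of the sphere of radius $r$ in $\mathbb{R}^{m-n}$, which is the averaging argument described above. All function names are hypothetical.

```python
import math

def log_p_parallel(log_p_z, singular_values):
    """log(|det S^-1| * p_Z(z)) = log p_Z(z) - sum_i log s_i."""
    return log_p_z - sum(math.log(s) for s in singular_values)

def log_p_perpendicular(r, log_p_norm, m, n):
    """Log of the radial density p_norm(r) spread uniformly over the sphere of radius r in R^(m-n)."""
    k = m - n
    # log surface area of that sphere: log(2 * pi^(k/2) * r^(k-1) / Gamma(k/2))
    log_area = (math.log(2.0) + 0.5 * k * math.log(math.pi)
                + (k - 1) * math.log(r) - math.lgamma(0.5 * k))
    return log_p_norm - log_area

def log_novelty_score(log_p_z, singular_values, r, log_p_norm, m, n):
    """Sum of the parallel and perpendicular log-densities."""
    return (log_p_parallel(log_p_z, singular_values)
            + log_p_perpendicular(r, log_p_norm, m, n))

def is_inlier(log_score, log_gamma):
    """Threshold test: inlier when the score reaches the (log) threshold."""
    return log_score >= log_gamma
```

Working with `math.lgamma` avoids overflow of the gamma function when $m - n$ is in the hundreds or thousands, as it is for images.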
In this section we describe the network architecture and the training procedure for learning the mapping $f$ that defines the parameterized manifold $\mathcal{M}$, and also the mapping $g$. The mappings $g$ and $f$ are modeled by an encoder network and a decoder network, respectively. Similarly to previous work on novelty detection japkowicz1995novelty ; manevitz2007one ; sakurada2014anomaly ; xia2015learning ; sabokrou2018adversarially ; sabokrou2016video , such networks are based on autoencoders bourlard1988auto ; rumelhart1986learning .
The autoencoder network and training should be such that they reproduce the manifold $\mathcal{M}$ as closely as possible. For instance, if $\mathcal{M}$ represents the distribution of images depicting a certain object category, we would want the estimated encoder and decoder to be able to generate images as if they were drawn from the real distribution. Differently from previous work, we require the latent space, represented by $z$, to be close to a known distribution, preferably a normal distribution, and we also want each of the components of $z$ to be maximally informative, which is why we require them to be independent random variables. Doing so facilitates learning a distribution $p_Z(z)$ from training data mapped onto the latent space $\Omega$. This means that the autoencoder has generative properties, because by sampling from $p_Z(z)$ we would generate data points $x \in \mathcal{M}$. Note that, differently from GANs goodfellow2014generative , we also require an encoder function $g$.
Variational Auto-Encoders (VAEs) kingma2013auto are known to work well in the presence of continuous latent variables and they can generate data from a randomly sampled latent space. VAEs utilize stochastic variational inference and minimize the Kullback-Leibler (KL) divergence penalty to impose a prior distribution on the latent space that encourages the encoder to learn the modes of the prior distribution. Adversarial Autoencoders (AAEs) makhzani2015adversarial , in contrast to VAEs, use an adversarial training paradigm to match the posterior distribution of the latent space with the given distribution. One of the advantages of AAEs over VAEs is that the adversarial training procedure encourages the encoder to match the whole distribution of the prior.
Unfortunately, since we are concerned with working with images, both AAEs and VAEs tend to produce examples that are often far from the real data manifold. This is because the decoder part of the network is updated only from a reconstruction loss that is typically a pixel-wise cross-entropy between input and output image. Such a loss often causes the generated images to be blurry, which has a negative effect on the proposed approach. Similarly to AAEs, PixelGAN autoencoders makhzani2017pixelgan introduce the adversarial component to impose a prior distribution on the latent code, but the architecture is significantly different, since it is conditioned on the latent code.
Similarly to boesen2015autoencoding ; sabokrou2018adversarially we add an adversarial training criterion to match the output of the decoder with the distribution of real data. This reduces blurriness and adds more local details to the generated images. Moreover, we also combine the adversarial training criterion with AAEs, which results in having two adversarial losses: one to impose a prior on the latent space distribution, and the second one to impose a prior on the output distribution.
Our full objective consists of three terms. First, we use an adversarial loss for matching the distribution of the latent space with the prior distribution, which is a normal with 0 mean and standard deviation 1, $\mathcal{N}(0, 1)$. Second, we use an adversarial loss for matching the distribution of the decoded images from $z$ and the known training data distribution. Third, we use an autoencoder loss between the decoded images and the encoded input image. Figure 3 shows the architecture configuration.
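For the discriminator $D_z$, we use the following adversarial loss:

$$\mathcal{L}_{adv\text{-}d_z}(x, g, D_z) = E[\log(D_z(\mathcal{N}(0,1)))] + E[\log(1 - D_z(g(x)))], \tag{8}$$

where the encoder $g$ tries to encode $x$ to a $z$ with distribution close to $\mathcal{N}(0,1)$. $D_z$ aims to distinguish between the encoding produced by $g$ and the prior normal distribution. Hence, $g$ tries to minimize this objective against an adversary $D_z$ that tries to maximize it.
Similarly, we add the adversarial loss for the discriminator $D_x$:

$$\mathcal{L}_{adv\text{-}d_x}(x, D_x, f) = E[\log(D_x(x))] + E[\log(1 - D_x(f(\mathcal{N}(0,1))))], \tag{9}$$

where the decoder $f$ tries to generate $x$ from a normal distribution $\mathcal{N}(0,1)$, in a way that $x$ is as if it was sampled from the real distribution. $D_x$ aims to distinguish between the decoding generated by $f$ and the real data points $x$. Hence, $f$ tries to minimize this objective against an adversary $D_x$ that tries to maximize it.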
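We also jointly optimize the encoder $g$ and the decoder $f$ so that we minimize the reconstruction error for the input $x$ that belongs to the known data distribution:

$$\mathcal{L}_{error}(x, g, f) = -E_z[\log(p(f(g(x)) \mid x))], \tag{10}$$

where $\mathcal{L}_{error}$ is minus the expected log-likelihood, i.e., the reconstruction error. This loss does not have an adversarial component, but it is essential to train an autoencoder. By minimizing this loss we encourage $f$ and $g$ to better approximate the real manifold.
The combination of all the previous losses gives

$$\mathcal{L}(x, g, D_z, D_x, f) = \mathcal{L}_{adv\text{-}d_z} + \mathcal{L}_{adv\text{-}d_x} + \lambda \mathcal{L}_{error}, \tag{11}$$

where $\lambda$ is a parameter that strikes a balance between the reconstruction and the other losses. The autoencoder network is obtained by minimizing (11), giving:

$$\hat g, \hat f = \arg\min_{g, f} \max_{D_x, D_z} \mathcal{L}(x, g, D_z, D_x, f). \tag{12}$$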
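The model is trained using stochastic gradient descent by performing alternating updates of each component as follows:
Maximize $\mathcal{L}_{adv\text{-}d_x}$ by updating weights of $D_x$;
Minimize $\mathcal{L}_{adv\text{-}d_x}$ by updating weights of $f$;
Maximize $\mathcal{L}_{adv\text{-}d_z}$ by updating weights of $D_z$;
Minimize $\mathcal{L}_{error}$ and $\mathcal{L}_{adv\text{-}d_z}$ by updating weights of $g$ and $f$.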
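To make the three loss terms concrete, the following sketch evaluates them on plain lists of numbers. It is only an illustration under stated assumptions: the adversarial terms use the standard GAN form (mean of $\log D(\text{real}) + \log(1 - D(\text{fake}))$), the reconstruction term is taken to be a pixel-wise binary cross-entropy on values in $[0, 1]$, and the names `adversarial_loss`, `reconstruction_loss`, `full_objective`, and `lam` are hypothetical. A real training loop would compute these quantities with a deep learning framework and alternate the updates as listed above.

```python
import math

def adversarial_loss(d_real, d_fake):
    """Mean of log D(real) plus mean of log(1 - D(fake)); the discriminator maximizes it."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def reconstruction_loss(x, x_rec, eps=1e-12):
    """Mean pixel-wise binary cross-entropy between inputs and reconstructions in [0, 1]."""
    total = 0.0
    for xi, ri in zip(x, x_rec):
        r = min(max(ri, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total -= xi * math.log(r) + (1.0 - xi) * math.log(1.0 - r)
    return total / len(x)

def full_objective(loss_adv_dz, loss_adv_dx, loss_error, lam=1.0):
    """Sum of the two adversarial terms and the weighted reconstruction term."""
    return loss_adv_dz + loss_adv_dx + lam * loss_error
```

For a discriminator that cannot tell real from fake (all outputs 0.5), `adversarial_loss` evaluates to $2 \log 0.5$, which is the value the generator side drives the game toward.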
After learning the encoder and decoder networks, by mapping the training set onto the latent space through $g$, we fit to the data a generalized Gaussian distribution and estimate $p_Z(z)$. In addition, by histogramming the norms $\|w_i^{\perp}\|$ of the training samples we estimate $p_{\|W^{\perp}\|}(\|w^{\perp}\|)$. The entire training procedure takes about one hour on a high-end PC with one NVIDIA TITAN X.
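The histogram step can be sketched as follows. This is a minimal illustration, assuming the residual norms of the training samples have already been collected in a list. The bin count, the density floor for empty or out-of-range bins, and the function names are assumptions made for the example.

```python
import bisect

def fit_norm_histogram(norms, num_bins=10):
    """Return (edges, densities) of a normalized, equal-width histogram of the norms."""
    lo, hi = min(norms), max(norms)
    if hi == lo:
        raise ValueError("need at least two distinct norm values")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for r in norms:
        idx = min(int((r - lo) / width), num_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    densities = [c / (len(norms) * width) for c in counts]
    return edges, densities

def norm_density(r, edges, densities, floor=1e-12):
    """Look up the estimated density at r; empty or out-of-range bins return a small floor."""
    if r < edges[0] or r > edges[-1]:
        return floor
    idx = min(bisect.bisect_right(edges, r) - 1, len(densities) - 1)
    return max(densities[idx], floor)

edges, densities = fit_norm_histogram([1.0, 1.1, 1.2, 2.0, 3.0], num_bins=4)
```

The floor keeps the logarithm of the density finite for test samples whose residual norm falls outside the range seen in training, which is exactly where outliers are expected.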
When a sample is tested, the procedure mainly entails computing a derivative, i.e., the Jacobi matrix $J_f$, with a subsequent SVD. $J_f$ is computed numerically around the test sample representation $\bar z$ and takes approximately 20.4 ms for an individual sample and 0.55 ms if computed as part of a batch of size 512, while the SVD takes approximately 4.0 ms.
We evaluate our novelty detection approach, which we call Generative Probabilistic Novelty Detection (GPND), against several state-of-the-art approaches and with several performance measures. We use the $F_1$ measure, the area under the ROC curve (AUROC), the FPR at 95% TPR (i.e., the probability of an outlier being misclassified as an inlier), the Detection Error (i.e., the misclassification probability when TPR is 95%), and the area under the precision-recall curve (AUPR) when inliers (AUPR-In) or outliers (AUPR-Out) are specified as positives. All reported results are from our publicly available implementation (https://github.com/podgorskiy/GPND) paszke2017automatic . An overview of the architecture is provided in Figure 3.
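As a rough illustration of how the threshold-free measures relate to each other, the sketch below computes AUROC, the FPR at 95% TPR, and the detection error from two lists of scores. It assumes that inliers are the positive class and receive higher scores, and that the detection error weights inliers and outliers equally, which is the convention used by ODIN; the function names are hypothetical.

```python
def roc_points(scores_in, scores_out):
    """(FPR, TPR) pairs over all observed thresholds; inliers are positives and score high."""
    thresholds = sorted(set(scores_in) | set(scores_out), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_in) / len(scores_in)
        fpr = sum(s >= t for s in scores_out) / len(scores_out)
        points.append((fpr, tpr))
    return points

def auroc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        area += (f1 - f0) * (t0 + t1) / 2.0
    return area

def fpr_at_tpr(points, target=0.95):
    """Smallest FPR among operating points whose TPR reaches the target."""
    return min(f for f, t in points if t >= target)

def detection_error(points, target=0.95):
    """0.5 * (1 - TPR) + 0.5 * FPR at the lowest-FPR point reaching the target TPR."""
    f, t = min((f, t) for f, t in points if t >= target)
    return 0.5 * (1.0 - t) + 0.5 * f

points = roc_points([0.9, 0.8, 0.7, 0.6], [0.65, 0.3, 0.2, 0.1])
```

With the toy scores above, a single outlier (0.65) outranks a single inlier (0.6), so 15 of the 16 inlier-outlier pairs are ordered correctly.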
We evaluate GPND on the following datasets.
MNIST lecun1998mnist contains handwritten digits from 0 to 9. Each of the ten categories is used in turn as the inlier class, and the remaining categories are used as outliers.
The Coil-100 dataset nene1996columbia contains 7,200 images of 100 different objects. Each object has 72 images taken at pose intervals of 5 degrees. We downscale the images to a fixed size. We randomly take $n$ categories, where $n \in \{1, 4, 7\}$, as inliers and randomly sample the rest of the categories for outliers. We repeat this procedure 30 times.
Fashion-MNIST xiao2017fashion is a new dataset comprising $28 \times 28$ grayscale images of fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format and structure of training and testing splits with the original MNIST.
Others. We compare GPND with ODIN liang2018enhancing using their protocol. Inliers are samples from CIFAR-10 (CIFAR-100) krizhevsky2009learning , which is a publicly available dataset of small images of size $32 \times 32$, each labeled with one of 10 (100) classes. Each class is represented by 6,000 (600) images, for a total of 60,000 samples. Outliers are samples from TinyImageNet deng2009imagenet , LSUN song2015construction , and iSUN xu2015turkergaze . For more details please refer to liang2018enhancing . We reuse the prepared datasets of outliers provided by the ODIN GitHub project page.
MNIST dataset. We follow the protocol described in sabokrou2018adversarially ; xia2015learning with some differences discussed below. Results are averages from a 5-fold cross-validation. Each fold takes 20% of each class. 60% of each class is used for training, 20% for validation, and 20% for testing. Once $p_X(\bar x)$ is computed for each validation sample, we search for the $\gamma$ that gives the highest $F_1$ measure. For each class of digit, we train the proposed model and simulate outliers as randomly sampled images from other categories with proportion from 10% to 50%. Results for $\mathcal{D}(\mathcal{R}(X))$ and $\mathcal{D}(X)$ reported in sabokrou2018adversarially correspond to the protocol for which data is not split into separate training, validation and testing sets, meaning that the same inliers used for training were also used for testing. We diverge from this protocol and do not reuse the same inliers for training and testing. We follow the 60%/20%/20% splits for training, validation and testing. This makes our testing harder, but more realistic, while we still compare our numbers against those obtained by others with easier settings. Results on the MNIST dataset are shown in Table 1 and Figure 5, where we compare with sabokrou2018adversarially ; breunig2000lof ; xia2015learning .
|% of outliers||sabokrou2018adversarially||sabokrou2018adversarially||LOF breunig2000lof||DRAE xia2015learning||GPND (Ours)|
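The threshold search on the validation set can be sketched as a simple sweep over the observed scores. The example assumes that inliers are the positive class and that a sample is predicted to be an inlier when its score is at least the threshold; both function names are hypothetical.

```python
def f1_score(scores, labels, threshold):
    """F1 with inliers as positives; a sample is predicted inlier when score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try every observed score as the threshold and keep the one with the highest F1."""
    return max(sorted(set(scores)), key=lambda t: f1_score(scores, labels, t))

# Toy validation set of log-scores; True marks an inlier.
val_scores = [-1.0, -2.0, -3.0, -8.0, -9.0]
val_labels = [True, True, True, False, False]
gamma = best_threshold(val_scores, val_labels)
```

Only the observed scores need to be tried, because the F1 value can change only when the threshold crosses one of them.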
Coil-100 dataset. We follow the protocol described in you2017provable with some differences discussed below. Results are averages from 5-fold cross-validation. Each fold takes 20% of each class. Because the count of samples per category is very small, we use 80% of each class for training and 20% for testing. We find the optimal threshold on the training set. Results reported in you2017provable correspond to not splitting data into separate training, validation and testing sets, which is not essential there, since they leverage a VGG simonyan2014very network pretrained on ImageNet ILSVRC15 . We diverge from that protocol: we do not reuse inliers, and we follow the 80%/20% splits for training and testing.
Results on Coil-100 are shown in Table 2. We do not outperform R-graph you2017provable ; however, as mentioned before, R-graph uses a pretrained VGG network, while we train an autoencoder from scratch on a very limited number of samples, which is on average only 70 per category.
|OutRank moonesignhe2006outlier ; moonesinghe2008outrank||CoP rahmani2016coherence||REAPER lerman2015robust||OutlierPursuit xu2010robust||LRR liu2010robust||DPCP tsakiris2015dual||thresholding soltanolkotabi2012geometric||R-graph you2017provable||Ours|
|Inliers: one category of images , Outliers:|
|Inliers: four category of images , Outliers:|
|Inliers: seven category of images , Outliers:|
Fashion-MNIST dataset. We repeat the same experiment with the same protocol that we have used for MNIST, but on Fashion-MNIST. Results are provided in Table 3.
|% of outliers|
CIFAR-10 (CIFAR-100) dataset. We follow the protocol described in liang2018enhancing , where inliers and outliers come from different datasets. ODIN relies on a pretrained classifier and thus requires label information provided with the training samples, while our approach does not use label information. The results are reported in Table 4. Despite the fact that ODIN relies upon powerful classifier networks such as Dense-BC and WRN with more than 100 layers, the much smaller network of GPND competes well with ODIN. Note that for CIFAR-100, GPND significantly outperforms both ODIN architectures. We think this might be due to the fact that ODIN relies on the perturbation of the network classifier output, which becomes less accurate as the number of classes grows from 10 to 100. On the other hand, GPND does not use class label information and copes much better with the additional complexity induced by the increased number of classes.
|Outlier dataset||FPR(95%TPR)||Detection||AUROC||AUPR in||AUPR out|
|CIFAR-10||ODIN-WRN-28-10 / ODIN-Dense-BC / GPND|
|Detection error||AUPR in||AUPR out|
Table 5 compares GPND with some baselines to better appreciate the improvement provided by the architectural choices. The baselines are: i) vanilla AE with thresholding of the reconstruction error and same pipeline (AE); ii) proposed approach where the AAE is replaced by a VAE (P-VAE); iii) proposed approach where the AAE is without the additional adversarial component induced by the discriminator applied to the decoded image (P-AAE).
To motivate the importance of each component of $p_X(\bar x)$ in (5), we repeat the experiment with MNIST under the following conditions: a) GPND Complete is the unmodified approach, where $p_X(\bar x)$ is computed as in (5); b) Parallel component only drops $p_{W^{\perp}}$ and assumes $p_X(\bar x) = p_{W^{\parallel}}(\bar w^{\parallel})$; c) Perpendicular component only drops $p_{W^{\parallel}}$ and assumes $p_X(\bar x) = p_{W^{\perp}}(\bar w^{\perp})$; d) $p_Z(z)$ only also drops $|\det S^{-1}|$ and assumes $p_X(\bar x) = p_Z(\bar z)$. The results are shown in Figure 5. It can be noticed that the scaling factor $|\det S^{-1}|$ plays an essential role in the Parallel component only, and that the Parallel component only and the Perpendicular component only both play an essential role in composing the GPND Complete model.
Additional implementation details include the choice of hyperparameters. For MNIST and COIL-100 the latent space size was chosen to maximize $F_1$ on the validation set. It is 16, and we varied it from 16 to 64 without significant performance change. For CIFAR-10 and CIFAR-100, the latent space size was set to 256. The hyperparameters of all losses are one, except for and when optimizing for , which are equal to . For CIFAR-10 and CIFAR-100, the hyperparameter of is 10.0. We use the Adam optimizer with learning rate of
, batch size of 128, and 80 epochs.
We introduced GPND, an approach and a network architecture for novelty detection that is based on learning mappings $f$ and $g$ that define the parameterized manifold $\mathcal{M}$, which captures the underlying structure of the inlier distribution. Unlike prior deep learning based methods, GPND detects that a given sample is an outlier by evaluating its probability under the inlier distribution. We have shown that each architectural and model component is essential to novelty detection. In addition, with a relatively simple architecture we have shown that GPND provides state-of-the-art performance using different measures, different datasets, and different protocols, and that it also compares favorably with the out-of-distribution literature.
This material is based upon work supported by the National Science Foundation under Grant No. IIS-1761792.
Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3449–3456. IEEE, 2011.
The Knowledge Engineering Review, 29(3):345–374, 2014.
Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
Tools with Artificial Intelligence, 2006. ICTAI’06. 18th IEEE International Conference on, pages 532–539. IEEE, 2006.
Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2016.