Deep Semi-Supervised Anomaly Detection

06/06/2019 ∙ by Lukas Ruff, et al. ∙ Technische Universität Berlin ∙ Singapore University of Technology and Design ∙ Technische Universität Kaiserslautern

Deep approaches to anomaly detection have recently shown promising results over shallow approaches on high-dimensional data. Typically, anomaly detection is treated as an unsupervised learning problem. In practice, however, one may have---in addition to a large set of unlabeled samples---access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection make use of such labeled data to improve detection performance. Few deep semi-supervised approaches to anomaly detection have been proposed so far, and those that exist are domain-specific. In this work, we present Deep SAD, an end-to-end methodology for deep semi-supervised anomaly detection. Using an information-theoretic perspective on anomaly detection, we derive a loss motivated by the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution. We demonstrate in extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, along with other anomaly detection benchmark datasets, that our approach is on par with or outperforms shallow, hybrid, and deep competitors, even when provided with only little labeled training data.


1 Introduction

Anomaly detection (AD) (Chandola et al., 2009; Pimentel et al., 2014) is the task of identifying unusual samples in data. Because this task lacks a supervised learning objective, AD methods typically formulate an unsupervised problem to find a “compact” description of the “normal” class. In one-class classification (Moya et al., 1993; Schölkopf et al., 2001; Tax and Duin, 2004; Ruff et al., 2018), for example, the aim is to find a set of small measure which contains most of the data, and samples that deviate from this description are deemed anomalous. Shallow anomaly detectors such as the One-Class SVM (OC-SVM) (Schölkopf et al., 2001), Support Vector Data Description (SVDD) (Tax and Duin, 2004), Isolation Forest (IF) (Liu et al., 2008), or Kernel Density Estimator (KDE) (Parzen, 1962; Kim and Scott, 2012; Vandermeulen and Scott, 2013) often require manual feature engineering to be effective on high-dimensional data and are limited in their scalability to large datasets. These limitations have sparked great interest in developing novel unsupervised deep learning approaches to AD, a line of research that has already shown promising results (Sakurada and Yairi, 2014; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Ruff et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019).

Unlike the standard AD setting, in many real-world applications one may also have access to some verified (i.e., labeled) normal or anomalous examples in addition to the unlabeled data. Such samples could be hand labeled by a domain expert, for instance. Unsupervised approaches to AD ignore this valuable information whereas supervised approaches can overfit the training data and fail to generalize to out-of-distribution anomalies. Figure 1 illustrates this situation with a toy example.

Semi-supervised AD (Wang et al., 2005; Liu and Zheng, 2006; Blanchard et al., 2010; Muñoz-Marí et al., 2010; Görnitz et al., 2013) aims to bridge the gap between unsupervised AD and supervised learning. These approaches do not assume a common pattern among the “anomaly class” and thus do not impose the typical cluster assumption that semi-supervised classifiers build upon (Zhu, 2008; Chapelle et al., 2009). Instead, semi-supervised approaches to AD aim to find a “compact description” of the data while also correctly classifying the labeled instances (Blanchard et al., 2010; Görnitz et al., 2013). Because of this, semi-supervised AD methods do not overfit to the labeled anomalies and generalize well to novel anomalies (Görnitz et al., 2013). Existing work on deep semi-supervised learning has almost exclusively focused on classification (Kingma et al., 2014; Rasmus et al., 2015; Odena, 2016; Dai et al., 2017; Oliver et al., 2018); only a few deep semi-supervised approaches have been proposed for AD, and those tend to be domain- or data-type-specific (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018).

In this work, we present Deep SAD (Deep Semi-Supervised Anomaly Detection), an end-to-end deep method for semi-supervised AD. Deep SAD is a generalization of our recently introduced Deep SVDD (Ruff et al., 2018) that includes labeled data. We show that our approach can be understood in information-theoretic terms as learning a latent distribution of low entropy for the normal data, while the anomalies follow a heavier-tailed, higher-entropy latent distribution. To do this, we formulate an information-theoretic perspective on deep learning for AD.

Figure 1: The need for semi-supervised AD methods. Panels: (a) training data, (b) unsupervised model, (c) supervised model, (d) semi-supervised model (ours). We consider a setting with only one known anomaly class (orange) at training time (illustrated in (a)) and two new, unknown anomaly classes appearing at testing time (bottom left and bottom right in (b), (c), and (d)). The purely unsupervised method (shown in (b)) ignores the known anomalies, which are deemed normal. The purely supervised approach (shown in (c)) overfits to the previously seen anomalies and fails to generalize to the novel anomalies. Our semi-supervised approach (shown in (d)) strikes a balance.

2 An Information-theoretic Perspective on Deep Anomaly Detection

The study of the theoretical foundations of deep learning is an active and ongoing research effort (Montavon et al., 2011; Tishby and Zaslavsky, 2015; Cohen et al., 2016; Eldan and Shamir, 2016; Neyshabur et al., 2017; Raghu et al., 2017; Zhang et al., 2017; Achille and Soatto, 2018; Arora et al., 2018; Belkin et al., 2018; Wiatowski and Bölcskei, 2018; Lapuschkin et al., 2019). One strong line of research that has emerged is rooted in information theory (Shannon, 1948).

In the supervised setting where one has an input variable X, a latent variable Z (e.g., the final layer of a deep network), and an output variable Y (i.e., the label), the well-known Information Bottleneck principle (Tishby et al., 1999; Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Alemi et al., 2017; Saxe et al., 2018) is an explanation of representation learning as a trade-off between finding a minimal compression Z of the input X while retaining the informativeness of Z for predicting the label Y. Put formally, supervised deep learning seeks to minimize the mutual information I(X; Z) between the input X and the latent representation Z while maximizing the mutual information I(Z; Y) between Z and the task Y, i.e.

$$\min_{p(z|x)} \; I(X;Z) - \beta\, I(Z;Y), \tag{1}$$

where p(z|x) is modeled by a deep network and the hyperparameter β controls the trade-off between compression (i.e., complexity) and classification accuracy.

For unsupervised deep learning, due to the absence of labels and thus the lack of an obvious task, other information-theoretic learning principles have been formulated. Of these, the Infomax principle (Linsker, 1988; Bell and Sejnowski, 1995; Hjelm et al., 2019) is one of the most prevalent and widely used. In contrast to (1), the objective of Infomax is to maximize the mutual information I(X; Z) between the data X and its representation Z,

$$\max_{p(z|x)} \; I(X;Z) + \alpha\, R(Z). \tag{2}$$

This is typically done using some additional constraint or regularization R(Z) on the representation Z, with hyperparameter α, to obtain statistical properties desired for some specific downstream task. Applications of the Infomax principle have a long history and include unsupervised tasks such as independent component analysis (Bell and Sejnowski, 1995), clustering (Slonim et al., 2005; Ji et al., 2018), generative modeling (Chen et al., 2016; Hoffman and Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018), and unsupervised representation learning in general (Hjelm et al., 2019).

We observe that the Infomax principle has also been implicitly applied in previous deep representations for AD. For example, autoencoding models (Rumelhart et al., 1986; Hinton and Salakhutdinov, 2006), which make up the predominant class of approaches to deep AD (Hawkins et al., 2002; Sakurada and Yairi, 2014; Andrews et al., 2016; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Chalapathy and Chawla, 2019), can be understood as implicitly maximizing the mutual information I(X; Z) via the reconstruction objective under some regularization of the latent code Z. Choices for regularization include sparsity (Makhzani and Frey, 2014), the distance to some prior latent distribution, e.g. measured via the KL divergence (Kingma and Welling, 2013; Rezende et al., 2014), an adversarial loss (Makhzani et al., 2015), or simply a bottleneck in dimensionality. Such restrictions for AD share the idea that the latent representation of the normal data should be in some sense “compact”.

As illustrated in Figure 1, a supervised approach to AD only learns to recognize anomalies similar to those seen in training. However, anything not normal is by definition an anomaly and there is no explicit distribution of the “anomaly class”. This makes supervised learning principles such as (1) ill-defined for AD. We instead build upon principle (2) to derive a deep method for semi-supervised AD, where we include the label information through a novel representation learning regularizer that is based on entropy.

3 Deep Semi-supervised Anomaly Detection

In the following, we introduce Deep SAD, a deep method for semi-supervised AD. To formulate our objective, we first briefly review the unsupervised Deep SVDD method (Ruff et al., 2018) and show its connection to entropy minimization. We then generalize the method to the semi-supervised AD setting.

3.1 Unsupervised Deep SVDD

For an input space X ⊆ R^D and an output space Z ⊆ R^d, let φ(·; W): X → Z be a neural network with L hidden layers and corresponding set of weights W = {W^1, ..., W^L}. The objective of Deep SVDD is to train the network to learn a transformation that minimizes the volume of a data-enclosing hypersphere in output space Z centered on a predetermined point c ∈ Z. Given n (unlabeled) training samples x_1, ..., x_n ∈ X, the One-Class Deep SVDD objective is defined as

$$\min_{W} \;\; \frac{1}{n}\sum_{i=1}^{n} \big\lVert \phi(x_i; W) - c \big\rVert^2 \;+\; \frac{\lambda}{2}\sum_{\ell=1}^{L} \big\lVert W^{\ell} \big\rVert_F^2 . \tag{3}$$

The first term penalizes the mean squared distance of the mapped data points to the center c of the hypersphere. This forces the network to extract those common factors of variation which are most stable within a dataset. As a consequence, normal data points tend to get mapped near the hypersphere center, whereas anomalies are mapped further away (Ruff et al., 2018). The second term is a weight decay regularizer on the network weights W with hyperparameter λ > 0, where ‖·‖_F denotes the Frobenius norm.

The unsupervised Deep SVDD objective can be optimized via SGD using backpropagation. For initialization, the authors first pre-train an autoencoder and then initialize the network φ with the converged weights of the encoder. After initializing the network weights W, the hypersphere center c is fixed as the mean of the network representations obtained from an initial forward pass on the training data (Ruff et al., 2018).

The anomaly score of a test point x is finally given by its distance to the center of the hypersphere,

$$s(x) = \big\lVert \phi(x; W^*) - c \big\rVert^2 , \tag{4}$$

where W^* denotes the network weights of a trained model.
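To make the objective and the score concrete, the following is a minimal PyTorch-style sketch, not the authors' reference implementation; `phi` stands for the network φ(·; W), `c` for the fixed center, and the weight decay term is written out explicitly although it would typically be handled by the optimizer:

```python
import torch

def deep_svdd_loss(phi, x, c, weight_decay=1e-6):
    """One-Class Deep SVDD objective (3): mean squared distance of the mapped
    points to the fixed center c, plus a weight decay term on the weights."""
    z = phi(x)                                  # map inputs to the latent space
    dist = torch.sum((z - c) ** 2, dim=1)       # squared distance to the center
    reg = sum(torch.sum(w ** 2) for w in phi.parameters())
    return dist.mean() + 0.5 * weight_decay * reg

@torch.no_grad()
def anomaly_score(phi, x, c):
    """Anomaly score (4): distance of the latent representation to the center."""
    return torch.sum((phi(x) - c) ** 2, dim=1)
```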

3.2 Deep SVDD and Entropy Minimization

We now show that Deep SVDD may be understood not only in terms of minimum volume estimation (Scott and Nowak, 2006), but also in terms of entropy minimization over the latent distribution. For a (continuous) latent random variable Z with pdf p(z) and support 𝒵, its (differential) entropy is given by

$$H(Z) = -\int_{\mathcal{Z}} p(z) \log p(z)\, dz . \tag{5}$$

Assuming Z has finite covariance Σ, it follows that

$$H(Z) \;\le\; \frac{1}{2}\log\!\big((2\pi e)^d \det \Sigma\big), \tag{6}$$

with equality if and only if Z is jointly Gaussian (Cover and Thomas, 2012). Thus, if Z follows an isotropic Gaussian, Z ∼ N(μ, σ²I), with σ > 0, then

$$H(Z) = \frac{d}{2}\log\!\big(2\pi e\, \sigma^2\big) \;\propto\; \log \sigma^2 , \tag{7}$$

i.e. for a fixed dimensionality d, the entropy of Z is proportional to its log-variance.
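The step from the bound (6) to the identity (7) is immediate once the isotropic covariance is plugged in; spelled out:

```latex
H(Z) \;=\; \tfrac{1}{2}\log\!\big((2\pi e)^d \det(\sigma^2 I_d)\big)
     \;=\; \tfrac{1}{2}\log\!\big((2\pi e)^d \sigma^{2d}\big)
     \;=\; \tfrac{d}{2}\log\!\big(2\pi e\,\sigma^2\big).
```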

Now observe that the unsupervised Deep SVDD objective (3) (disregarding the weight decay regularization) is equivalent to minimizing the empirical variance of the mapped samples around the fixed center c, and thus minimizes an approximate upper bound on the entropy of the latent distribution.

Since the Deep SVDD network is pre-trained on an autoencoding objective (Ruff et al., 2018) that implicitly maximizes the mutual information I(X; Z), we can interpret Deep SVDD as following the Infomax principle (2) with the additional objective that the latent distribution should have low entropy.

3.3 Deep SAD

We now introduce our Deep SAD method. Assume that, in addition to the n unlabeled samples x_1, ..., x_n ∈ X, we have access to m labeled samples (x̃_1, ỹ_1), ..., (x̃_m, ỹ_m) ∈ X × Y with Y = {−1, +1}. We denote ỹ = +1 for known normal examples and ỹ = −1 for known anomalies.

Following the insights above, we formulate our deep semi-supervised AD objective under the idea that the latent distribution of the normal data, Z⁺, should have low entropy, whereas the latent distribution of anomalies, Z⁻, should have high entropy. By this, we do not impose any additional assumption on the anomaly-generating distribution, such as a manifold or cluster assumption that supervised or semi-supervised classification approaches commonly make (Zhu, 2008; Chapelle et al., 2009). We argue that such a model better captures the nature of anomalies, which can be thought of as being generated from an infinite mixture of all distributions that are different from the normal data distribution, indubitably a distribution that has high entropy. We can express this idea in terms of principle (2) with a respective entropy regularization of the latent distribution:

$$\max_{p(z|x)} \; I(X;Z) + \alpha\big(H(Z^-) - H(Z^+)\big). \tag{8}$$

Based on the connection between Deep SVDD and entropy minimization we have shown in Section 3.2, we define our Deep SAD objective as

$$\min_{W}\;\frac{1}{n+m}\sum_{i=1}^{n}\big\lVert \phi(x_i;W)-c\big\rVert^2 \;+\; \frac{\eta}{n+m}\sum_{j=1}^{m}\Big(\big\lVert \phi(\tilde{x}_j;W)-c\big\rVert^2\Big)^{\tilde{y}_j} \;+\; \frac{\lambda}{2}\sum_{\ell=1}^{L}\big\lVert W^{\ell}\big\rVert_F^2 \tag{9}$$

with hyperparameters η > 0 and λ > 0. We again impose a quadratic loss on the distances of the mapped points to the fixed center c, for both the unlabeled and the labeled normal examples (ỹ = +1), thus intending to learn a latent distribution with low entropy for the normal data. This also incorporates the assumption, common in AD, that most of the unlabeled data is normal. In contrast, for the labeled anomalies (ỹ = −1) we penalize the inverse of the distances, such that anomalies must be mapped further away from the center.¹ That is, we penalize low variance, and thus the network must attempt to map known anomalies to a heavy-tailed distribution that has high entropy. To maximize the mutual information I(X; Z) in (8), we also rely on autoencoder pre-training.

¹ To ensure numerical stability, we add a small machine epsilon (eps > 0) to the denominator of the inverse.

The hyperparameter η controls the balance between the labeled and unlabeled terms, where η < 1 emphasizes the unlabeled and η > 1 the labeled objective. For η = 1, the two terms are weighted equally. The last term is a weight decay regularizer. Note that we recover the unsupervised Deep SVDD formulation (3) as the special case where only unlabeled data is available (m = 0). As an anomaly score, we again take the distance of the latent representation to the center c as in (4). We optimize the generally non-convex Deep SAD objective (9) via SGD using backpropagation. Appendix A in the supplementary material provides further details.
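For illustration, a minimal PyTorch-style sketch of the Deep SAD objective (9) follows; it is not the authors' reference implementation, the weight decay term is omitted (it is typically passed to the optimizer), and `phi`, `c`, `eta`, and `eps` correspond to φ(·; W), c, η, and the stabilizing machine epsilon:

```python
import torch

def deep_sad_loss(phi, x_unlabeled, x_labeled, y_labeled, c, eta=1.0, eps=1e-6):
    """Deep SAD objective (9) without the weight decay term.
    y_labeled is +1 for known normal samples and -1 for known anomalies."""
    dist_u = torch.sum((phi(x_unlabeled) - c) ** 2, dim=1)
    dist_l = torch.sum((phi(x_labeled) - c) ** 2, dim=1)
    # quadratic loss for unlabeled and labeled normal points (y = +1),
    # inverse distance for labeled anomalies (y = -1)
    loss_l = torch.where(y_labeled == 1, dist_l, 1.0 / (dist_l + eps))
    n_total = x_unlabeled.size(0) + x_labeled.size(0)
    return (dist_u.sum() + eta * loss_l.sum()) / n_total
```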

4 Experiments

We evaluate Deep SAD on MNIST, Fashion-MNIST, and CIFAR-10 as well as on classic anomaly detection benchmark datasets. We compare against shallow, hybrid, and deep unsupervised, semi-supervised, and supervised competitors. We refer to other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) for further comprehensive comparisons solely between unsupervised deep AD methods.²

² Our code is available at: https://github.com/lukasruff/Deep-SAD-PyTorch

4.1 Competing Methods

We consider the OC-SVM (Schölkopf et al., 2001) and SVDD (Tax and Duin, 2004) with Gaussian kernel (which are equivalent in this case), Isolation Forest (Liu et al., 2008), and KDE (Parzen, 1962) as shallow unsupervised baselines. As unsupervised deep competitors, we consider the well-established autoencoder and the state-of-the-art unsupervised Deep SVDD method (Ruff et al., 2018). For semi-supervised approaches, we consider the shallow state-of-the-art semi-supervised AD method SSAD (Görnitz et al., 2013) with Gaussian kernel. As mentioned previously, there are no deep methods for semi-supervised AD that are applicable to the general multivariate data setting. However, we add the well-known Semi-Supervised Deep Generative Model (SS-DGM) (Kingma et al., 2014) to enable a comparison with a deep semi-supervised classifier. To complete the full learning spectrum, we also compare to a fully supervised deep classifier trained on the binary cross-entropy loss. Finally, in addition to training the shallow detectors on the raw input features, we also consider their hybrid variants, which apply the shallow detectors to the bottleneck representations given by the autoencoder (Erfani et al., 2016; Nicolau et al., 2016).

In our experiments, we deliberately grant the shallow and hybrid methods an unfair advantage by selecting their hyperparameters to maximize AUC on a subset (10%) of the test set, in order to establish strong baselines. To control for architectural effects between the competing deep methods, we always employ the same (LeNet-type) deep networks. Full details on network architectures and hyperparameter selection can be found in Appendices B and C of the supplementary material. Due to space constraints, in the main text we only report results for methods that showed competitive performance and defer results for the under-performing methods to Appendix D.

4.2 Experimental Scenarios on MNIST, Fashion-MNIST, and CIFAR-10

Figure 2: Results of experimental scenario (i), where we increase the ratio of labeled anomalies γ_l in the training set. We report the avg. AUC with st. dev. computed over 90 experiments at various ratios γ_l. Statistically significant differences between the best and second-best method (Wilcoxon signed-rank test) are marked.
Semi-supervised anomaly detection setup

The MNIST, Fashion-MNIST, and CIFAR-10 datasets all have ten classes, from which we derive ten AD setups on each dataset. In every setup, we set one of the ten classes to be the normal class and let the remaining nine classes represent anomalies. We use the original training data of the respective normal class as the unlabeled part of our training set. Thus we start with a clean anomaly detection setting that fulfills the assumption that most (in this case all) unlabeled samples are normal. The training data of the respective nine anomaly classes then forms the data pool from which we draw anomalies for training to create different scenarios. For a quantitative comparison, we compute the AUC metric on the original respective test sets using the ground truth labels, i.e. y = +1 for the normal class and y = −1 for the respective nine anomaly classes. We rescale pixels to [0, 1] via min-max feature scaling as the only data pre-processing step.
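As an illustration of this setup, the following sketch builds one such AD training set from class-labeled data; the function and the way the labeled ratio is translated into a sample count are our own assumptions, not code from the paper:

```python
import numpy as np

def make_ad_setup(X, y, normal_class, known_anomaly_class,
                  ratio_labeled=0.01, seed=0):
    """One AD setup: the normal class forms the unlabeled training set and a
    small number of labeled anomalies is drawn from one known anomaly class."""
    rng = np.random.default_rng(seed)
    X_unlabeled = X[y == normal_class] / 255.0        # min-max scale to [0, 1]
    pool = X[y == known_anomaly_class] / 255.0
    # choose m so that labeled samples make up ratio_labeled of the training set
    m = int(ratio_labeled * len(X_unlabeled) / (1.0 - ratio_labeled))
    X_labeled = pool[rng.choice(len(pool), size=m, replace=False)]
    y_labeled = -np.ones(m)                           # -1 marks known anomalies
    return X_unlabeled, X_labeled, y_labeled
```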

Experimental scenarios

We examine three scenarios in which we vary the following three experimental parameters: (i) the ratio of labeled training data γ_l, (ii) the pollution ratio γ_p of the unlabeled training data with (unknown) anomalies, and (iii) the number of anomaly classes included in the labeled training data.

(i) Adding labeled anomalies

In this scenario, we investigate the effect that including labeled anomalies during training has on detection performance, and the potential advantage of using a semi-supervised AD method over other paradigms. To do this, we increase the ratio of labeled training data γ_l, adding more and more known anomalies (ỹ = −1) to the training set. We draw the labeled anomalies from only one of the nine remaining anomaly classes. For testing, we then consider all nine remaining classes as anomalies, i.e. there are eight novel anomaly classes at testing time. We do this to simulate the unpredictable nature of anomalies. For the unlabeled part of the training set, we keep the training data of the respective normal class, which we leave unpolluted for now (γ_p = 0). We iterate this training set generation process per AD setup over all nine respective anomaly classes and report the average results over the ten AD setups × nine anomaly classes, i.e. over 90 experiments per labeled ratio γ_l.

(ii) Polluted training data

Here we investigate the robustness of the different methods to an increasing pollution ratio γ_p of the training set with unknown anomalies. To do so, we pollute the unlabeled part of the training set with anomalies drawn from all nine respective anomaly classes in each AD setup. We fix the ratio of labeled training samples γ_l, where we again draw labeled samples only from one anomaly class in this scenario. We repeat this training set generation process per AD setup over all nine respective anomaly classes and report the average results over the resulting 90 experiments per pollution ratio γ_p. We hypothesize that the semi-supervised approach alleviates the negative impact pollution has on detection performance, since labeled anomalies should help to “filter out” similar unknown anomalies.

(iii) Number of known anomaly classes

In the last scenario, we compare the detection performance at various numbers of known anomaly classes. In scenarios (i) and (ii), we always sampled labeled anomalies from only one of the nine anomaly classes. In this scenario, we now increase the number of anomaly classes included in the labeled part of the training set. Since we have a limited number of anomaly classes (nine) in each AD setup, we expect the supervised classifier to catch up at some point. We fix the overall ratio of labeled training examples γ_l and the pollution ratio γ_p of the unlabeled training data in this scenario. We repeat this training set generation process for ten seeds in each of the ten AD setups and report the average results over the resulting 100 experiments per number of known anomaly classes. For every seed, the known classes are drawn uniformly at random from the nine respective anomaly classes.

Figure 3: Results of experimental scenario (ii), where we pollute the unlabeled part of the training set with (unknown) anomalies. We report the avg. AUC with st. dev. computed over 90 experiments at various pollution ratios γ_p. Statistically significant differences between the best and second-best method (Wilcoxon signed-rank test) are marked.
Figure 4: Results of experimental scenario (iii), where we increase the number of anomaly classes included in the labeled training data. We report the avg. AUC with st. dev. computed over 100 experiments at various numbers of known anomaly classes. Statistically significant differences between the best and second-best method (Wilcoxon signed-rank test) are marked.
Results

The results of scenarios (i)–(iii) are shown in Figures 2–4. In addition to reporting the average AUC with standard deviation, we always conduct Wilcoxon signed-rank tests (Wilcoxon, 1945) between the best and second-best performing method and indicate statistically significant differences. Figure 2 demonstrates the advantage of a semi-supervised approach to AD, especially on the most complex CIFAR-10 dataset, where Deep SAD performs best. Moreover, Figure 2 confirms that a supervised approach is vulnerable to novel anomalies at testing time when only little labeled training data is available. In comparison, our Deep SAD generalizes to novel anomalies while also taking advantage of the labeled examples. Note that the hybrid SSAD, which has not previously been considered in the literature, also proves to be a sound baseline. Figure 3 shows that the detection performance of all methods decreases with increasing data pollution. Deep SAD proves to be the most robust, again especially on the most complex CIFAR-10 data. Finally, Figure 4 shows that the more diverse the labeled anomalies in the training set are, the better the detection performance becomes. We also see that the supervised method is very sensitive to the number of known anomaly classes but catches up at some point, as suspected. Overall, we observe that Deep SAD is particularly advantageous on complex data.
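The significance test used throughout can be reproduced with SciPy; a minimal sketch on placeholder AUC values (the arrays stand in for the 90 paired results per configuration):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
auc_second = rng.uniform(0.80, 0.95, size=90)        # placeholder AUCs, 2nd-best method
auc_best = auc_second + rng.normal(0.01, 0.01, 90)   # placeholder AUCs, best method

stat, p_value = wilcoxon(auc_best, auc_second)       # paired signed-rank test
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```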

Hyperparameter sensitivity analysis

We run Deep SAD experiments on the ten AD setups from above on each dataset over a range of values of η to analyze the sensitivity of Deep SAD with respect to this hyperparameter. In this analysis, we fix the experimental parameters γ_l, γ_p, and the number of known anomaly classes, and again iterate over all nine anomaly classes in every AD setup. The results shown in Figure 5 suggest that Deep SAD is fairly robust against changes of the hyperparameter η.

Figure 5: Deep SAD sensitivity analysis w.r.t. η. We report the avg. AUC with st. dev. computed over 90 experiments for various values of η.

Table 1: Anomaly detection benchmark datasets.

Dataset      # samples   # dims   # outliers (%)
arrhythmia   452         274      66 (14.6%)
cardio       1,831       21       176 (9.6%)
satellite    6,435       36       2,036 (31.6%)
satimage-2   5,803       36       71 (1.2%)
shuttle      49,097      9        3,511 (7.2%)
thyroid      3,772       6        93 (2.5%)
Dataset      OC-SVM Raw   OC-SVM Hybrid   Deep SVDD   SSAD Raw    SSAD Hybrid   Supervised Classifier   Deep SAD
arrhythmia   84.5±3.9     76.7±6.2        74.6±9.0    86.7±4.0    78.3±5.1      39.2±9.5                75.9±8.7
cardio       98.5±0.3     82.8±9.3        84.8±3.6    98.8±0.3    86.3±5.8      83.2±9.6                95.0±1.6
satellite    95.1±0.2     68.6±4.8        79.8±4.1    96.2±0.3    86.9±2.8      87.2±2.1                91.5±1.1
satimage-2   99.4±0.8     96.7±2.1        98.3±1.4    99.9±0.1    96.8±2.1      99.9±0.1                99.9±0.1
shuttle      99.4±0.9     94.1±9.5        86.3±7.5    99.6±0.5    97.7±1.0      95.1±8.0                98.4±0.9
thyroid      98.3±0.9     91.2±4.0        72.0±9.7    97.9±1.9    95.3±3.1      97.8±2.6                98.6±0.9
Table 2: Results on classic AD benchmark datasets in the setting with no pollution (γ_p = 0) and a ratio of labeled anomalies of γ_l = 0.01 in the training set. We report the avg. AUC with st. dev. computed over 10 seeds. Statistically significant differences between the best and second-best method (Wilcoxon signed-rank test) are marked.

4.3 Classic Anomaly Detection Benchmark Datasets

In the last experiment, we examine the detection performance on some well-established AD benchmark datasets (Rayana, 2016) listed in Table 1. We do this to also evaluate the deep against the shallow approaches on non-image, tabular datasets, which are rarely considered in the deep AD literature. For the evaluation, we consider random train-to-test set splits of 60:40 while maintaining the original proportion of anomalies in each set. We then run experiments for 10 seeds with γ_l = 0.01 and γ_p = 0, i.e. 1% of the training set are labeled anomalies and the unlabeled training data is unpolluted. Since these datasets have no distinct anomaly classes, the labeled anomalies are drawn from the single anomaly set of each dataset. We standardize features to have zero mean and unit variance as the only pre-processing step.
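A sketch of this evaluation protocol (stratified 60:40 split and standardization; fitting the scaler on the training portion only is our assumption) could look as follows:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_benchmark_split(X, y, seed):
    """60:40 train/test split preserving the anomaly proportion, followed by
    standardization to zero mean and unit variance."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```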

Table 2 shows the results. We observe that the shallow kernel methods seem to perform slightly better on the rather small, low-dimensional benchmarks. Deep SAD proves competitive though, and the small differences might be explained by the strong advantage we deliberately grant the shallow methods in the selection of their hyperparameters. The results in Section 4.2 and other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) demonstrate that deep methods are especially superior on complex data with hierarchical structure. Unlike other deep approaches (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018), however, our Deep SAD method is not domain- or data-type-specific. Due to its strong performance with both convolutional and standard feed-forward networks, we expect Deep SAD to extend well to other data types.

5 Conclusion

We have introduced Deep SAD, a deep method for semi-supervised anomaly detection. To derive our method, we formulated an information-theoretic perspective on deep anomaly detection. Our experiments demonstrate that Deep SAD improves detection performance, especially on more complex datasets, even with only small amounts of labeled data. Our results suggest that semi-supervised approaches to anomaly detection should be preferred in applications whenever some labeled information is available.

Acknowledgments

LR acknowledges support from the German Ministry of Education and Research (BMBF) in the project ALICE III (FKZ: 01IS18049B). MK and RV acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while MK was a sabbatical visitor of the DASH Center at the University of Southern California. AB is grateful for support by the Singapore Ministry of Education grant MOE2016-T2-2-154. This work was supported by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center (01IS14013A) and Berlin Center for Machine Learning (01IS18037I). Partial funding by DFG is acknowledged (EXC 2046/1, project-ID: 390685689). This work was also supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451, No. 2017-0-01779).

References

  • Achille and Soatto [2018] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1):1947–1980, 2018.
  • Alemi et al. [2017] A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
  • Alemi et al. [2018] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, volume 80, pages 159–168, 2018.
  • Andrews et al. [2016] J. T. A. Andrews, E. J. Morton, and L. D. Griffin. Detecting Anomalous Data Using Auto-Encoders. International Journal of Machine Learning and Computing, 6(1):21, 2016.
  • Arora et al. [2018] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, volume 80, pages 244–253, 2018.
  • Belkin et al. [2018] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 540–548, 2018.
  • Bell and Sejnowski [1995] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
  • Blanchard et al. [2010] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
  • Chalapathy and Chawla [2019] R. Chalapathy and S. Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
  • Chandola et al. [2009] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):1–58, 2009.
  • Chapelle et al. [2009] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
  • Chen et al. [2017] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 90–98, 2017.
  • Chen et al. [2016] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
  • Cohen et al. [2016] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In International Conference on Algorithmic Learning Theory, volume 49, pages 698–728, 2016.
  • Cover and Thomas [2012] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  • Dai et al. [2017] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
  • Deecke et al. [2018] L. Deecke, R. A. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft. Image anomaly detection with generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3–17, 2018.
  • Eldan and Shamir [2016] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In International Conference on Algorithmic Learning Theory, volume 49, pages 907–940, 2016.
  • Erfani et al. [2016] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134, 2016.
  • Ergen et al. [2017] T. Ergen, A. H. Mirza, and S. S. Kozat. Unsupervised and semi-supervised anomaly detection with LSTM neural networks. arXiv:1710.09207, 2017.
  • Glorot and Bengio [2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
  • Golan and El-Yaniv [2018] I. Golan and R. El-Yaniv. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pages 9758–9769, 2018.
  • Görnitz et al. [2013] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
  • Hawkins et al. [2002] S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier Detection Using Replicator Neural Networks. In International Conference on Data Warehousing and Knowledge Discovery, volume 2454, pages 170–180, 2002.
  • Hendrycks et al. [2019] D. Hendrycks, M. Mazeika, and T. G. Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
  • Hinton and Salakhutdinov [2006] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
  • Hjelm et al. [2019] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
  • Hoffman and Johnson [2016] M. D. Hoffman and M. J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop in Advances in Approximate Bayesian Inference, 2016.
  • Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • Ji et al. [2018] X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
  • Kim and Scott [2012] J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
  • Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kingma et al. [2014] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
  • Kiran et al. [2018] B. Kiran, D. Thomas, and R. Parakkal. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging, 4(2):36, 2018.
  • Lapuschkin et al. [2019] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking clever hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
  • Linsker [1988] R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105–117, 1988.
  • Liu et al. [2008] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest. In International Conference on Data Mining, pages 413–422, 2008.
  • Liu and Zheng [2006] Y. Liu and Y. F. Zheng. Minimum enclosing and maximum excluding machine for pattern description and discrimination. In International Conference on Pattern Recognition, volume 3, pages 129–132, 2006.
  • Makhzani and Frey [2014] A. Makhzani and B. Frey. K-sparse autoencoders. In International Conference on Learning Representations, 2014.
  • Makhzani et al. [2015] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In International Conference on Learning Representations, 2015.
  • Min et al. [2018] E. Min, J. Long, Q. Liu, J. Cui, Z. Cai, and J. Ma. SU-IDS: A semi-supervised and unsupervised framework for network intrusion detection. In International Conference on Cloud Computing and Security, pages 322–334, 2018.
  • Montavon et al. [2011] G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.
  • Moya et al. [1993] M. M. Moya, M. W. Koch, and L. D. Hostetler. One-class classifier networks for target recognition applications. In Proceedings World Congress on Neural Networks, pages 797–801, 1993.
  • Muñoz-Marí et al. [2010] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camp-Valls. Semi-Supervised One-Class Support Vector Machines for Classification of Remote Sensing Data. IEEE Transactions on Geoscience and Remote Sensing, 48(8):3188–3197, 2010.
  • Neyshabur et al. [2017] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
  • Nicolau et al. [2016] M. Nicolau, J. McDermott, et al. A hybrid autoencoder and density estimation model for anomaly detection. In International Conference on Parallel Problem Solving from Nature, pages 717–726, 2016.
  • Odena [2016] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv:1606.01583, 2016.
  • Oliver et al. [2018] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.
  • Parzen [1962] E. Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
  • Pimentel et al. [2014] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
  • Raghu et al. [2017] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, volume 70, pages 2847–2854, 2017.
  • Rasmus et al. [2015] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
  • Rayana [2016] S. Rayana. ODDS library, 2016. URL http://odds.cs.stonybrook.edu.
  • Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, volume 32, pages 1278–1286, 2014.
  • Ruff et al. [2018] L. Ruff, R. A. Vandermeulen, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In International Conference on Machine Learning, volume 80, pages 4390–4399, 2018.
  • Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing – Explorations in the Microstructure of Cognition, chapter 8, pages 318–362. MIT Press, 1986.
  • Sakurada and Yairi [2014] M. Sakurada and T. Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the 2nd MLSDA Workshop, page 4, 2014.
  • Saxe et al. [2018] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
  • Schölkopf et al. [2001] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443–1471, 2001.
  • Scott and Nowak [2006] C. D. Scott and R. D. Nowak. Learning minimum volume sets. Journal of Machine Learning Research, 7(Apr):665–704, 2006.
  • Shannon [1948] C. E. Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
  • Shwartz-Ziv and Tishby [2017] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Slonim et al. [2005] N. Slonim, G. S. Atwal, G. Tkačik, and W. Bialek. Information-based clustering. Proceedings of the National Academy of Sciences, 102(51):18297–18302, 2005.
  • Tax and Duin [2004] D. M. J. Tax and R. P. W. Duin. Support Vector Data Description. Machine Learning, 54(1):45–66, 2004.
  • Tishby and Zaslavsky [2015] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, pages 1–5, 2015.
  • Tishby et al. [1999] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
  • Vandermeulen and Scott [2013] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. In Conference on Learning Theory, pages 568–591, 2013.
  • Wang et al. [2005] J. Wang, P. Neskovic, and L. N. Cooper. Pattern classification via single spheres. In International Conference on Discovery Science, pages 241–252. Springer, 2005.
  • Wiatowski and Bölcskei [2018] T. Wiatowski and H. Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3):1845–1866, 2018.
  • Wilcoxon [1945] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
  • Zhai et al. [2016] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, volume 48, pages 1100–1109, 2016.
  • Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
  • Zhao et al. [2017] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
  • Zhu [2008] X. Zhu. Semi-supervised learning literature survey. Computer Sciences TR 1530, University of Wisconsin Madison, 2008.

Appendix A Optimization of Deep SAD

The Deep SAD objective is generally non-convex in the network weights W, as is usually the case in deep learning. For a computationally efficient optimization, we rely on (mini-batch) SGD to optimize the network weights using the backpropagation algorithm. For improved generalization, we add weight decay regularization with hyperparameter λ > 0 to the objective. Algorithm 1 summarizes the Deep SAD optimization routine.

Input:
      Unlabeled data: x_1, ..., x_n
      Labeled data: (x̃_1, ỹ_1), ..., (x̃_m, ỹ_m)
      Hyperparameters: η, λ
      SGD learning rate
Output:
      Trained model: φ(·; W*)

Initialize:
      Neural network weights: W (via autoencoder pre-training, see below)
      Hypersphere center: c (mean of an initial forward pass, see below)
for each epoch do
     for each mini-batch do
         Draw mini-batch of unlabeled and labeled samples
         Update W via an SGD step on the Deep SAD objective (9)
     end for
end for
Algorithm 1: Optimization of Deep SAD
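A minimal PyTorch-style sketch of this optimization loop is given below; it uses Adam as described in Appendix C, the data loader is assumed to yield mixed mini-batches `(x_u, x_l, y_l)` of unlabeled and labeled samples, and the epoch count and learning rate are illustrative:

```python
import torch

def train_deep_sad(phi, c, loader, n_epochs=50, lr=1e-4,
                   eta=1.0, weight_decay=1e-6, eps=1e-6):
    """Sketch of Algorithm 1: mini-batch optimization of objective (9).
    The weight decay term is delegated to the optimizer."""
    opt = torch.optim.Adam(phi.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(n_epochs):
        for x_u, x_l, y_l in loader:
            dist_u = torch.sum((phi(x_u) - c) ** 2, dim=1)
            dist_l = torch.sum((phi(x_l) - c) ** 2, dim=1)
            # quadratic loss for y = +1, inverse distance for y = -1 (Eq. 9)
            loss_l = torch.where(y_l == 1, dist_l, 1.0 / (dist_l + eps))
            loss = (dist_u.sum() + eta * loss_l.sum()) / (len(x_u) + len(x_l))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return phi
```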

Using SGD allows Deep SAD to scale to large datasets, as the computational complexity scales linearly in the number of training batches and the computations in each batch can be parallelized (e.g., by training on GPUs). Moreover, Deep SAD has low memory complexity, as a trained model is fully characterized by the final network parameters and no data must be saved or referenced for prediction. Instead, prediction only requires a forward pass through the network, which is usually just a composition of simple functions. This enables fast predictions for Deep SAD.

Initialization of the network weights

We establish an autoencoder pre-training routine for initialization. That is, we first train an autoencoder, whose encoder has the same architecture as the network φ, on the reconstruction loss (mean squared error or cross-entropy loss). After training, we then initialize φ with the converged parameters of the encoder. Note that this is in line with the Infomax principle (2) for unsupervised representation learning.

Initialization of the center

After initializing the network weights W, we fix the hypersphere center c as the mean of the network representations that we obtain from an initial forward pass on the data (excluding labeled anomalies). We found SGD convergence to be smoother and faster when fixing the center c in the neighborhood of the initial data representations, as we already observed in Ruff et al. (2018). If some labeled normal examples are available, using only those examples for the mean initialization would be another strategy to minimize possible distortions from polluted unlabeled training data. Adding the center c to the optimization variables would allow a trivial “hypersphere collapse” solution for unsupervised Deep SVDD.
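A sketch of this initialization step (assuming a loader that also yields the labels of the labeled samples, with -1 marking known anomalies):

```python
import torch

@torch.no_grad()
def init_center(phi, loader):
    """Fix c as the mean latent representation of an initial forward pass over
    the training data, excluding the labeled anomalies."""
    z_sum, n = None, 0
    for x, y in loader:
        z = phi(x[y != -1])                       # drop labeled anomalies
        z_sum = z.sum(dim=0) if z_sum is None else z_sum + z.sum(dim=0)
        n += z.size(0)
    return z_sum / n
```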

Preventing a hypersphere collapse

A “hypersphere collapse” describes the trivial solution in which the neural network φ converges to the constant function φ ≡ c, i.e. the hypersphere collapses to a single point. In Ruff et al. (2018), we demonstrate theoretical network properties that prevent such a collapse, which we adopt for Deep SAD. Most importantly, the network φ must have no bias terms and no bounded activation functions. We refer to Ruff et al. (2018) for further details. If sufficiently many labeled anomalies are available for training, however, a hypersphere collapse is not a problem for Deep SAD due to the opposing labeled and unlabeled objectives.

Appendix B Network Architectures

We employ LeNet-type convolutional neural networks (CNNs) on MNIST, Fashion-MNIST, and CIFAR-10, where each convolutional module consists of a convolutional layer followed by leaky ReLU activations and (2×2)-max-pooling. On MNIST, we employ a CNN with two of these modules followed by a final dense layer. On Fashion-MNIST, we employ a CNN also with two modules, followed by two dense layers. On CIFAR-10, we employ a CNN with three modules followed by a final dense layer.
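As an illustration, a bias-free LeNet-type encoder of the kind described above might look as follows; the filter and unit counts and the leakiness are illustrative assumptions (the exact values are omitted above), and the absence of bias terms follows the collapse-prevention argument of Appendix A:

```python
import torch.nn as nn

class MNISTEncoder(nn.Module):
    """LeNet-type encoder sketch for 28x28 MNIST inputs with two convolutional
    modules and a final dense layer; all layers are bias-free."""
    def __init__(self, rep_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 5, padding=2, bias=False),   # module 1
            nn.LeakyReLU(0.1), nn.MaxPool2d(2),
            nn.Conv2d(8, 4, 5, padding=2, bias=False),   # module 2
            nn.LeakyReLU(0.1), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(4 * 7 * 7, rep_dim, bias=False),   # final dense layer
        )

    def forward(self, x):
        return self.net(x)
```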

On the classic AD benchmark datasets, we employ standard 3-layer MLP feed-forward architectures, with the hidden layer sizes adapted to the respective input dimensionality of arrhythmia, cardio, satellite, satimage-2, shuttle, and thyroid.

Appendix C Details on Competing Methods

OC-SVM/SVDD

The OC-SVM and SVDD are equivalent for the Gaussian/RBF kernel we employ. As mentioned in the main paper, we deliberately grant the OC-SVM/SVDD an unfair advantage by selecting its hyperparameters to maximize AUC on a subset (10%) of the test set, in order to establish a strong baseline. To do this, we consider a grid of values for the RBF scale parameter γ and select the best performing one. Moreover, we always repeat this over a grid of values for the ν-parameter and then report the best final result.

Isolation Forest (IF)

We set the number of trees to 100 and the sub-sampling size to 256, as recommended in the original work (Liu et al., 2008).

Kernel Density Estimator (KDE)

We select the bandwidth of the Gaussian kernel from a grid of values via 5-fold cross-validation using the log-likelihood score, following Ruff et al. (2018).

SSAD

We also deliberately grant the state-of-the-art semi-supervised AD kernel method SSAD the unfair advantage of selecting its hyperparameters optimally to maximize AUC on a subset (10%) of the test set. To do this, we again select the scale parameter γ of the RBF kernel from a grid of values and pick the best performing one. Otherwise, we set the hyperparameters as recommended by the original authors (Görnitz et al., 2013).

(Convolutional) Autoencoder ((C)AE)

To create the (convolutional) autoencoders, we construct the decoders symmetrically w.r.t. the architectures reported in Appendix B, which make up the encoder parts of the autoencoders. Here, we replace max-pooling with simple upsampling and convolutions with deconvolutions. We train the autoencoders on the MSE reconstruction loss, which also serves as the anomaly score.

Hybrid Variants

To establish hybrid methods, we apply the OC-SVM, IF, KDE, and SSAD as outlined above to the resulting bottleneck representations given by the converged autoencoder.
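A sketch of one such hybrid variant (a shallow OC-SVM on the autoencoder bottleneck codes; `encode` is a hypothetical function returning the latent representations of a converged autoencoder, and the ν value is illustrative):

```python
from sklearn.svm import OneClassSVM

def hybrid_ocsvm_scores(encode, X_train, X_test, gamma="scale", nu=0.1):
    """Hybrid variant: fit an OC-SVM on the bottleneck representations of a
    trained autoencoder and score test points on those representations."""
    Z_train, Z_test = encode(X_train), encode(X_test)
    ocsvm = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(Z_train)
    return -ocsvm.decision_function(Z_test)      # higher score = more anomalous
```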

Unsupervised Deep SVDD

We consider both variants, Soft-Boundary Deep SVDD and One-Class Deep SVDD, as unsupervised baselines and always report the better performance as the unsupervised result. For Soft-Boundary Deep SVDD, we optimally solve for the radius R on every mini-batch and run experiments over a grid of values for the hyperparameter ν. We again use weight decay regularization with hyperparameter λ. For Deep SVDD, we remove all bias terms from the network to prevent a hypersphere collapse, as recommended in the original work (Ruff et al., 2018).

Deep SAD

If not reported otherwise, we equally weight the unlabeled and labeled examples by setting η = 1, and use weight decay regularization with hyperparameter λ.

SS-DGM

We consider both the M2 and M1+M2 model and always report the better performing result. Otherwise we follow the settings as recommended in the original work (Kingma et al., 2014).

Supervised Deep Binary Classifier

To interpret AD as a binary classification problem, we rely on the typical assumption that most of the unlabeled training data is normal and assign the “normal” label to all unlabeled examples. Already labeled normal examples and labeled anomalies retain their respective “normal” and “anomalous” labels. We train the supervised classifier on the binary cross-entropy loss. Note that in scenario (i), in particular, the supervised classifier has perfect, unpolluted label information but still fails to generalize, as there are novel anomaly classes at testing time.

SGD Optimization Details for Deep Methods

We use the Adam optimizer with its recommended default hyperparameters (Kingma and Ba, 2014) and apply Batch Normalization (Ioffe and Szegedy, 2015) during SGD optimization. For all deep approaches and on all datasets, we employ a two-phase (“searching” and “fine-tuning”) learning rate schedule: in the searching phase, we first train with a larger learning rate; in the fine-tuning phase, we continue training with a smaller learning rate for additional epochs. We always use a batch size of 200. For the autoencoder, SS-DGM, and the supervised classifier, we initialize the network with uniform Glorot weights (Glorot and Bengio, 2010). For Deep SVDD and Deep SAD, we establish an unsupervised pre-training routine via autoencoder as explained in Appendix A: we set the network φ to be the encoder of the autoencoder that we train beforehand.
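A sketch of the two-phase schedule described above; the concrete learning rates and epoch counts are assumptions, since the exact values are omitted in the text:

```python
import torch

def two_phase_training(model, loss_fn, loader,
                       phases=((1e-4, 50), (1e-5, 100))):
    """Two-phase ("searching", then "fine-tuning") schedule with Adam.
    Each phase is a (learning rate, number of epochs) pair."""
    for lr, n_epochs in phases:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(n_epochs):
            for batch in loader:
                loss = loss_fn(model, batch)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model
```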

Appendix D Complete Experimental Results

In addition to the tables listing the complete experimental results of all methods, we provide AUC scatterplots of the best (1st) vs. second-best (2nd) performing methods in the experimental scenarios (i)–(iii) on the most complex CIFAR-10 dataset. If many points fall above the identity line, this is a strong indication that the best method indeed significantly outperforms the second best, which is often the case for Deep SAD.

Figure 6: AUC scatterplots of the best (1st) vs. second-best (2nd) performing method in experimental scenario (i) on CIFAR-10, where we increase the ratio of labeled anomalies γ_l in the training set (one panel per ratio).
Figure 7: AUC scatterplots of the best (1st) vs. second-best (2nd) performing method in experimental scenario (ii) on CIFAR-10, where we pollute the unlabeled part of the training set with (unknown) anomalies at various pollution ratios γ_p (one panel per ratio).
Figure 8: AUC scatterplots of the best (1st) vs. second-best (2nd) performing method in experimental scenario (iii) on CIFAR-10, where we increase the number of anomaly classes included in the labeled training data (one panel per number of classes).