1 Introduction
Anomaly detection (AD) (Chandola et al., 2009; Pimentel et al., 2014) is the task of identifying unusual samples in data. Because this task lacks a supervised learning objective, AD methods typically formulate an unsupervised problem to find a “compact” description of the “normal” class. In one-class classification (Moya et al., 1993; Schölkopf et al., 2001; Tax and Duin, 2004; Ruff et al., 2018), for example, the aim is to find a set of small measure which contains most of the data, and samples that deviate from this description are deemed anomalous. Shallow anomaly detectors such as the One-Class SVM (OC-SVM) (Schölkopf et al., 2001), Support Vector Data Description (SVDD) (Tax and Duin, 2004), Isolation Forest (IF) (Liu et al., 2008), or Kernel Density Estimator (KDE) (Parzen, 1962; Kim and Scott, 2012; Vandermeulen and Scott, 2013) often require manual feature engineering to be effective on high-dimensional data and are limited in their scalability to large datasets. These limitations have sparked great interest in developing novel unsupervised deep learning approaches to AD, a line of research which has already shown promising results (Sakurada and Yairi, 2014; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Ruff et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019).
Unlike the standard AD setting, in many real-world applications one may also have access to some verified (i.e., labeled) normal or anomalous examples in addition to the unlabeled data. Such samples could be hand-labeled by a domain expert, for instance. Unsupervised approaches to AD ignore this valuable information, whereas supervised approaches can overfit the training data and fail to generalize to out-of-distribution anomalies. Figure 1 illustrates this situation with a toy example.
Semi-supervised AD (Wang et al., 2005; Liu and Zheng, 2006; Blanchard et al., 2010; Muñoz-Marí et al., 2010; Görnitz et al., 2013) aims to bridge the gap between unsupervised AD and supervised learning. These approaches do not assume a common pattern among the “anomaly class” and thus do not impose the typical cluster assumption semi-supervised classifiers build upon (Zhu, 2008; Chapelle et al., 2009). Instead, semi-supervised approaches to AD aim to find a “compact description” of the data while also correctly classifying the labeled instances (Blanchard et al., 2010; Görnitz et al., 2013). Because of this, semi-supervised AD methods do not overfit to the labeled anomalies and generalize well to novel anomalies (Görnitz et al., 2013). Existing work on deep semi-supervised learning has almost exclusively focused on classification (Kingma et al., 2014; Rasmus et al., 2015; Odena, 2016; Dai et al., 2017; Oliver et al., 2018); only a few deep semi-supervised approaches have been proposed for AD and those tend to be domain or data-type specific (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018).
In this work, we present Deep SAD (Deep Semi-Supervised Anomaly Detection), an end-to-end deep method for semi-supervised AD. Deep SAD is a generalization of our recently introduced Deep SVDD (Ruff et al., 2018) to include labeled data. We show that our approach can be understood in information-theoretic terms as learning a latent distribution of low entropy for the normal data, with the anomalous distribution having a heavier-tailed, higher entropy distribution. To do this we formulate an information-theoretic perspective on deep learning for AD.
2 An Information-theoretic Perspective on Deep Anomaly Detection
The study of the theoretical foundations of deep learning is an active and ongoing research effort (Montavon et al., 2011; Tishby and Zaslavsky, 2015; Cohen et al., 2016; Eldan and Shamir, 2016; Neyshabur et al., 2017; Raghu et al., 2017; Zhang et al., 2017; Achille and Soatto, 2018; Arora et al., 2018; Belkin et al., 2018; Wiatowski and Bölcskei, 2018; Lapuschkin et al., 2019). One strong line of research that has emerged is rooted in information theory (Shannon, 1948).
In the supervised setting where one has input variable $X$, latent variable $Z$ (e.g., the final layer of a deep network), and output variable $Y$ (i.e., the label), the well-known Information Bottleneck principle (Tishby et al., 1999; Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Alemi et al., 2017; Saxe et al., 2018) is an explanation of representation learning as a tradeoff between finding a minimal compression $Z$ of the input $X$ while retaining the informativeness of $Z$ for predicting the label $Y$. Put formally: supervised deep learning seeks to minimize the mutual information $I(X;Z)$ between the input $X$ and the latent representation $Z$ while maximizing the mutual information $I(Z;Y)$ between $Z$ and the task $Y$, i.e.
$$\min_{p(z|x)} \; I(X;Z) - \alpha\, I(Z;Y), \tag{1}$$
where $p(z|x)$ is modeled by a deep network and the hyperparameter $\alpha > 0$ controls the tradeoff between compression (i.e., complexity) and classification accuracy.
For unsupervised deep learning, due to the absence of labels and thus the lack of an obvious task, other information-theoretic learning principles have been formulated. Of these, the Infomax principle (Linsker, 1988; Bell and Sejnowski, 1995; Hjelm et al., 2019) is one of the most prevalent and widely used principles. In contrast to (1), the objective of Infomax is to maximize the mutual information $I(X;Z)$ between the data $X$ and its representation $Z$:
$$\max_{p(z|x)} \; I(X;Z) + \beta\, R(Z). \tag{2}$$
This is typically done using some additional constraint or regularization $R(Z)$ on the representation $Z$ with hyperparameter $\beta > 0$ to obtain statistical properties desired for some specific downstream task. Examples in which the Infomax principle has been applied have a long history and include unsupervised tasks such as independent component analysis (Bell and Sejnowski, 1995), clustering (Slonim et al., 2005; Ji et al., 2018), generative modeling (Chen et al., 2016; Hoffman and Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018), and unsupervised representation learning in general (Hjelm et al., 2019).
We observe that the Infomax principle has also been implicitly applied in previous deep representations for AD. For example, autoencoding models (Rumelhart et al., 1986; Hinton and Salakhutdinov, 2006), which make up the predominant class of approaches to deep AD (Hawkins et al., 2002; Sakurada and Yairi, 2014; Andrews et al., 2016; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Chalapathy and Chawla, 2019), can be understood as implicitly maximizing the mutual information $I(X;Z)$ via the reconstruction objective under some regularization of the latent code $Z$. Choices for regularization include sparsity (Makhzani and Frey, 2014), the distance to some prior latent distribution, e.g. measured via the KL divergence (Kingma and Welling, 2013; Rezende et al., 2014), an adversarial loss (Makhzani et al., 2015), or simply a bottleneck in dimensionality. Such restrictions for AD share the idea that the latent representation of the normal data should be in some sense “compact”.
As illustrated in Figure 1, a supervised approach to AD only learns to recognize anomalies similar to those seen in training. However, anything not normal is by definition an anomaly and there is no explicit distribution of the “anomaly class”. This makes supervised learning principles such as (1) ill-defined for AD. We instead build upon principle (2) to derive a deep method for semi-supervised AD, where we include the label information through a novel representation learning regularizer that is based on entropy.
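To make the mutual information terms in principles (1) and (2) concrete, the following minimal Python sketch computes the mutual information of two discrete variables from a joint probability table. The function name `mutual_information` and the toy tables are illustrative assumptions and not part of the original work; deep methods typically estimate or bound this quantity rather than compute it exactly.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint table joint[x][z]."""
    p_x = [sum(row) for row in joint]        # marginal over rows
    p_z = [sum(col) for col in zip(*joint)]  # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (p_x[i] * p_z[j]))
    return mi


# Toy case 1: the representation is a copy of a fair binary input.
copy_joint = [[0.5, 0.0], [0.0, 0.5]]
# Toy case 2: the representation is independent of the input.
indep_joint = [[0.25, 0.25], [0.25, 0.25]]

mi_copy = mutual_information(copy_joint)
mi_indep = mutual_information(indep_joint)
```

In this toy setting a representation that copies the input retains log 2 nats of information, whereas an independent representation retains none.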
3 Deep Semi-supervised Anomaly Detection
In the following, we introduce Deep SAD, a deep method for semi-supervised AD. To formulate our objective, we first briefly review the unsupervised Deep SVDD method (Ruff et al., 2018) and show its connection to entropy minimization. We then generalize the method to the semi-supervised AD setting.
3.1 Unsupervised Deep SVDD
For input space $\mathcal{X} \subseteq \mathbb{R}^D$ and output space $\mathcal{Z} \subseteq \mathbb{R}^d$, let $\phi(\cdot\,; \mathcal{W}) : \mathcal{X} \to \mathcal{Z}$ be a neural network with $L$ hidden layers and corresponding set of weights $\mathcal{W} = \{W^1, \ldots, W^L\}$. The objective of Deep SVDD is to train the neural network $\phi$ to learn a transformation that minimizes the volume of a data-enclosing hypersphere in output space $\mathcal{Z}$ centered on a predetermined point $c$. Given $n$ (unlabeled) training samples $x_1, \ldots, x_n \in \mathcal{X}$, the One-Class Deep SVDD objective is defined as:
$$\min_{\mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} \|\phi(x_i; \mathcal{W}) - c\|^2 + \frac{\lambda}{2} \sum_{\ell=1}^{L} \|W^{\ell}\|_F^2. \tag{3}$$
Deep SVDD penalizes the mean squared distance of the mapped data points to the center $c$ of the sphere. This forces the network to extract those common factors of variation which are most stable within a dataset. As a consequence, normal data points tend to get mapped near the hypersphere center, whereas anomalies are mapped further away (Ruff et al., 2018). The second term is a weight decay regularizer on the network weights $\mathcal{W}$ with $\lambda > 0$, where $\|\cdot\|_F$ denotes the Frobenius norm.
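The objective described above can be sketched in plain Python. This is a hedged illustration, not the authors' implementation: the network outputs are passed in as precomputed vectors, the names `deep_svdd_objective`, `outputs`, `center`, and `lam` are hypothetical, and the weight decay term is assumed to be `lam / 2` times the sum of squared Frobenius norms of the layer matrices.

```python
def deep_svdd_objective(outputs, center, weights, lam):
    """Mean squared distance to the center plus a weight decay penalty.

    outputs: list of network output vectors, one per training sample
    center:  fixed hypersphere center
    weights: list of layer matrices (each a list of rows)
    lam:     weight decay coefficient (assumed to enter as lam / 2)
    """
    n = len(outputs)
    distance_term = sum(
        sum((z_k - c_k) ** 2 for z_k, c_k in zip(z, center)) for z in outputs
    ) / n
    # Squared Frobenius norm of a matrix = sum of its squared entries.
    decay_term = 0.5 * lam * sum(
        w ** 2 for matrix in weights for row in matrix for w in row
    )
    return distance_term + decay_term


# Toy data: three mapped points, center at the origin, one 2x2 layer.
toy_outputs = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
toy_center = [0.0, 0.0]
toy_weights = [[[1.0, 0.0], [0.0, 1.0]]]
value = deep_svdd_objective(toy_outputs, toy_center, toy_weights, lam=0.1)
```

Here the squared distances are 1, 1, and 25 (mean 9) and the decay term is 0.5 * 0.1 * 2 = 0.1, so the point far from the center dominates the objective.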
The unsupervised Deep SVDD can be optimized via SGD using backpropagation. For initialization, the authors first pretrain an autoencoder and then initialize the network $\phi$ with the converged weights of the encoder. After initializing the network weights $\mathcal{W}$, the hypersphere center $c$ is fixed as the mean of the network representations obtained from an initial forward pass on the training data (Ruff et al., 2018).
The anomaly score of a test point $x$ is finally given by its distance to the center of the hypersphere:
$$s(x) = \|\phi(x; \mathcal{W}) - c\|, \tag{4}$$
where $\mathcal{W}$ are the network weights of a trained model.
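The center initialization and the anomaly score (4) can be sketched as follows. The helper names `init_center` and `anomaly_score` and the toy vectors are assumptions for illustration; in practice the vectors would be the outputs of the pretrained encoder.

```python
import math


def init_center(initial_outputs):
    """Center = coordinate-wise mean of the outputs of an initial forward pass."""
    n = len(initial_outputs)
    dim = len(initial_outputs[0])
    return [sum(z[k] for z in initial_outputs) / n for k in range(dim)]


def anomaly_score(z, center):
    """Euclidean distance of a mapped test point to the hypersphere center."""
    return math.sqrt(sum((z_k - c_k) ** 2 for z_k, c_k in zip(z, center)))


# Toy outputs of an initial forward pass on the training data.
initial_outputs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
c = init_center(initial_outputs)

score_near = anomaly_score([1.0, 1.5], c)  # mapped close to the center
score_far = anomaly_score([4.0, 5.0], c)   # mapped far from the center
```

A larger score indicates a more anomalous test point; any decision threshold on the score is application-specific and not prescribed here.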
3.2 Deep SVDD and Entropy Minimization
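We now show that Deep SVDD may not only be understood in terms of minimum volume estimation (Scott and Nowak, 2006), but also in terms of entropy minimization over the latent distribution. For a (continuous) latent random variable $Z$ with pdf $p(z)$ and support $\mathcal{Z} \subseteq \mathbb{R}^d$, its (differential) entropy is given by
$$H(Z) = \mathbb{E}[-\log p(Z)] = -\int_{\mathcal{Z}} p(z) \log p(z)\, dz. \tag{5}$$
Assuming $Z$ has finite covariance $\Sigma$, it follows that
$$H(Z) \leq \frac{1}{2} \log\big((2\pi e)^d \det \Sigma\big), \tag{6}$$
with equality if and only if $Z$ is jointly Gaussian (Cover and Thomas, 2012). Thus, if $Z$ follows an isotropic Gaussian, $Z \sim \mathcal{N}(\mu, \sigma^2 I)$, with $\sigma > 0$, then
$$H(Z) = \frac{1}{2} \log\big((2\pi e)^d \det(\sigma^2 I)\big) = \frac{1}{2} \log\big((2\pi e \sigma^2)^d\big) = \frac{d}{2}\big(1 + \log(2\pi\sigma^2)\big), \tag{7}$$
i.e. for a fixed dimensionality $d$, the entropy of $Z$ is, up to an additive constant, proportional to its log-variance.
Now observe that the unsupervised Deep SVDD objective (3) (disregarding weight decay regularization) is equivalent to minimizing the empirical variance, thus minimizing an approximate upper bound for the entropy of the latent distribution.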
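The relation between variance and entropy used in this argument can be checked numerically. The sketch below assumes the standard Gaussian maximum-entropy bound, 0.5 * log((2*pi*e)^d * det(Sigma)), restricted to diagonal covariances, and its isotropic closed form; the function names are illustrative and not taken from the original work.

```python
import math


def gaussian_entropy_bound(variances):
    """Gaussian entropy bound (in nats) for a diagonal covariance matrix."""
    d = len(variances)
    log_det = sum(math.log(v) for v in variances)  # log-determinant
    return 0.5 * (d * math.log(2.0 * math.pi * math.e) + log_det)


def isotropic_gaussian_entropy(d, sigma2):
    """Closed form for covariance sigma2 * I: (d / 2) * (1 + log(2*pi*sigma2))."""
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi * sigma2))


dim = 3
entropies = []
for sigma2 in (0.25, 1.0, 4.0):
    bound = gaussian_entropy_bound([sigma2] * dim)
    closed_form = isotropic_gaussian_entropy(dim, sigma2)
    entropies.append((sigma2, bound, closed_form))
```

Both expressions agree for isotropic covariances, and the entropy decreases monotonically as the variance shrinks, which is the sense in which pulling the mapped points towards the center lowers the entropy bound.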
3.3 Deep SAD
We are happy to now introduce our Deep SAD method. Assume that, in addition to the $n$ unlabeled samples $x_1, \ldots, x_n \in \mathcal{X}$ with $\mathcal{X} \subseteq \mathbb{R}^D$, we have access to $m$ labeled samples $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_m, \tilde{y}_m) \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1, +1\}$. We denote $\tilde{y} = +1$ for known normal examples and $\tilde{y} = -1$ for known anomalies.
Following the insights above, we formulate our deep semi-supervised AD objective under the idea that the latent distribution of the normal data, $Z^{+} = Z \mid \{Y = +1\}$, should have low entropy, whereas the latent distribution of anomalies, $Z^{-} = Z \mid \{Y = -1\}$, should have high entropy. By this, we do not impose any additional assumption on the anomaly-generating distribution $X \mid \{Y = -1\}$, such as a manifold or cluster assumption that supervised or semi-supervised classification approaches commonly make (Zhu, 2008; Chapelle et al., 2009). We argue that such a model better captures the nature of anomalies, which can be thought of as being generated from an infinite mixture of all distributions that are different from the normal data distribution, indubitably a distribution that has high entropy. We can express this idea in terms of principle (2) with respective entropy regularization of the latent distribution:
$$\max_{p(z|x)} \; I(X;Z) + \beta \big(H(Z^{-}) - H(Z^{+})\big). \tag{8}$$
Based on the connection between Deep SVDD and entropy minimization shown in Section 3.2, we define our Deep SAD objective as
$$\min_{\mathcal{W}} \; \frac{1}{n+m} \sum_{i=1}^{n} \|\phi(x_i; \mathcal{W}) - c\|^2 + \frac{\eta}{n+m} \sum_{j=1}^{m} \big(\|\phi(\tilde{x}_j; \mathcal{W}) - c\|^2\big)^{\tilde{y}_j} + \frac{\lambda}{2} \sum_{\ell=1}^{L} \|W^{\ell}\|_F^2, \tag{9}$$
with hyperparameters $\eta > 0$ and $\lambda > 0$. We again impose a quadratic loss on the distances of the mapped points to the fixed center $c$, for both the unlabeled as well as the labeled normal examples, thus intending to learn a latent distribution with low entropy for the normal data. This also incorporates the assumption common in AD that most of the unlabeled data is normal. In contrast, for the labeled anomalies we penalize the inverse of the distances such that anomalies must be mapped further away from the center. [Footnote 1: To ensure numerical stability, we add a machine epsilon (eps) to the denominator of the inverse.] That is, we penalize low variance and thus the network must attempt to map known anomalies to a heavy-tailed distribution that has high entropy. To maximize the mutual information $I(X;Z)$ in (8), we also rely on autoencoder pretraining.
The hyperparameter $\eta > 0$ controls the balance between the labeled and unlabeled terms, where $\eta < 1$ emphasizes the unlabeled and $\eta > 1$ the labeled objective. For $\eta = 1$, the two terms are weighted equally. The last term is a weight decay regularizer. Note that we recover the unsupervised Deep SVDD (3) formulation as the special case where only unlabeled data is available ($m = 0$). As an anomaly score, we again take the distance of the latent representation to the center $c$ as in (4). We optimize the generally non-convex Deep SAD objective (9) via SGD using backpropagation. Appendix A in the supplementary material provides further details.
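The data-dependent part of the Deep SAD objective can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: labels are assumed to be +1 for known normal samples and -1 for known anomalies, both sums are assumed to be normalized by n + m, the weight decay term is omitted, and the stabilizing constant `eps=1e-6` is a hypothetical value.

```python
def squared_distance(z, center):
    return sum((z_k - c_k) ** 2 for z_k, c_k in zip(z, center))


def deep_sad_loss(unlabeled, labeled, labels, center, eta, eps=1e-6):
    """Quadratic loss for unlabeled and labeled normal points, inverse
    squared distance for labeled anomalies (weight decay omitted)."""
    n, m = len(unlabeled), len(labeled)
    unlabeled_term = sum(squared_distance(z, center) for z in unlabeled)
    labeled_term = 0.0
    for z, y in zip(labeled, labels):
        d2 = squared_distance(z, center)
        if y == 1:
            labeled_term += d2                # pull known normals inward
        else:
            labeled_term += 1.0 / (d2 + eps)  # push known anomalies outward
    return (unlabeled_term + eta * labeled_term) / (n + m)


center = [0.0, 0.0]
unlabeled = [[1.0, 0.0], [0.0, 1.0]]

# A known anomaly mapped near the center is penalized more than one far away.
loss_near = deep_sad_loss(unlabeled, [[0.0, 0.5]], [-1], center, eta=1.0)
loss_far = deep_sad_loss(unlabeled, [[0.0, 4.0]], [-1], center, eta=1.0)

# Without labeled data the loss reduces to the mean squared distance.
loss_unsupervised = deep_sad_loss(unlabeled, [], [], center, eta=1.0)
```

Increasing `eta` scales only the labeled term, so values above one make the labeled samples weigh more heavily relative to the unlabeled ones.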
4 Experiments
We evaluate Deep SAD on MNIST, Fashion-MNIST, and CIFAR-10 as well as classic anomaly detection benchmark datasets. We compare to shallow, hybrid, as well as deep unsupervised, semi-supervised and supervised competitors. We refer to other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) for further comprehensive comparisons solely between unsupervised deep AD methods. [Footnote 2: Our code is available at: https://github.com/lukasruff/DeepSADPyTorch]
4.1 Competing Methods
We consider the OC-SVM (Schölkopf et al., 2001) and SVDD (Tax and Duin, 2004) with Gaussian kernel (which are in this case equivalent), Isolation Forest (Liu et al., 2008), and KDE (Parzen, 1962) as shallow unsupervised baselines. For unsupervised deep competitors, we consider the well-established autoencoder and the state-of-the-art unsupervised Deep SVDD method (Ruff et al., 2018). For semi-supervised approaches, we consider the shallow state-of-the-art semi-supervised AD method SSAD (Görnitz et al., 2013) with Gaussian kernel. As mentioned previously, there are no deep methods for semi-supervised AD that are applicable to the general multivariate data setting. However, we add the well-known Semi-Supervised Deep Generative Model (SS-DGM) (Kingma et al., 2014) to make a comparison with a deep semi-supervised classifier. To complete the full learning spectrum, we also compare to a fully supervised deep classifier trained on the binary cross-entropy loss. Finally, in addition to training the shallow detectors on the raw input features, we also consider their hybrid variants, which apply them to the bottleneck representation given by the autoencoder (Erfani et al., 2016; Nicolau et al., 2016).
In our experiments, we deliberately grant the shallow and hybrid methods an unfair advantage by selecting their hyperparameters to maximize AUC on a subset (10%) of the test set to establish strong baselines. To control for architectural effects between the competing deep methods, we always employ the same (LeNet-type) deep networks. Full details on network architectures and hyperparameter selection can be found in Appendices B and C of the supplementary material. Due to space constraints, in the main text we only report results for methods which showed competitive performance and defer results for the underperforming methods to Appendix D.
4.2 Experimental Scenarios on MNIST, Fashion-MNIST, and CIFAR-10
Semi-supervised anomaly detection setup
The MNIST, Fashion-MNIST, and CIFAR-10 datasets all have ten classes from which we derive ten AD setups on each dataset. In every setup, we set one of the ten classes to be the normal class and let the remaining nine classes represent anomalies. We use the original training data of the respective normal class as the unlabeled part of our training set. Thus we start with a clean anomaly detection setting that fulfills the assumption that most (in this case all) unlabeled samples are normal. The training data of the respective nine anomaly classes then forms the data pool from which we draw anomalies for training to create different scenarios. We compute the AUC metric on the original respective test sets using ground truth labels to make a quantitative comparison, i.e. $\tilde{y} = +1$ for the normal class and $\tilde{y} = -1$ for the respective nine anomaly classes. We rescale pixels to $[0, 1]$ via min-max feature scaling as the only data preprocessing step.
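The one-vs-rest setup, the min-max scaling, and the AUC evaluation described above can be sketched with the standard library alone. The helper names and toy numbers are assumptions for illustration; the label convention (+1 normal, -1 anomalous) and the target range [0, 1] are likewise assumed, and the AUC is computed via its pairwise-ranking definition rather than with any particular library.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_vs_rest_labels(class_ids, normal_class):
    """+1 for the chosen normal class, -1 for every other class."""
    return [1 if c == normal_class else -1 for c in class_ids]


def auc(scores, labels):
    """Probability that a random anomaly receives a higher anomaly score
    than a random normal sample (ties count one half)."""
    anomalies = [s for s, y in zip(scores, labels) if y == -1]
    normals = [s for s, y in zip(scores, labels) if y == 1]
    wins = 0.0
    for a in anomalies:
        for b in normals:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(anomalies) * len(normals))


pixels = min_max_scale([0, 128, 255])
labels = one_vs_rest_labels([3, 3, 7, 1], normal_class=3)
test_auc = auc([0.1, 0.4, 0.9, 0.3], labels)
```

In the toy example three of the four anomaly-normal pairs are ranked correctly, giving an AUC of 0.75.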
Experimental scenarios
We examine three scenarios in which we vary the following three experimental parameters: (i) the ratio of labeled training data $\gamma_l$, (ii) the ratio of pollution $\gamma_p$ in the unlabeled training data with (unknown) anomalies, and (iii) the number of anomaly classes $k_l$ included in the labeled training data.
(i) Adding labeled anomalies
In this scenario, we investigate the effect that including labeled anomalies in training has on detection performance and the potential advantage of using a semi-supervised AD method over other paradigms. To do this, we increase the ratio of labeled training data $\gamma_l = m/(n+m)$ by adding more and more known anomalies $\tilde{x}_1, \ldots, \tilde{x}_m$ with $\tilde{y}_j = -1$ to the training set. We add the labeled anomalies from $k_l = 1$ anomaly class (out of the nine remaining ones). For testing, we then consider all nine remaining classes as anomalies, i.e. there are eight novel classes at testing time. We do this to simulate the unpredictable nature of anomalies. For the unlabeled part of the training set, we keep the training data of the respective normal class, which we leave unpolluted for now, i.e. $\gamma_p = 0$. We iterate this training set generation process per AD setup over all nine respective anomaly classes and report the average results over the ten AD setups $\times$ nine anomaly classes, i.e. over 90 experiments per labeled ratio $\gamma_l$.
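A small sketch of the bookkeeping in scenario (i): how many labeled anomalies are needed for a target labeled ratio, and how the 90 experiments arise. It assumes the labeled ratio is defined as m / (n + m) for n unlabeled and m labeled samples; the function name, the training set size of 6000, and the ratio 0.05 are hypothetical values chosen for illustration.

```python
def num_labeled(n_unlabeled, labeled_ratio):
    """Solve labeled_ratio = m / (n + m) for m and round to an integer."""
    if labeled_ratio >= 1.0 or labeled_ratio != abs(labeled_ratio):
        raise ValueError("labeled_ratio must lie in [0, 1)")
    return round(labeled_ratio * n_unlabeled / (1.0 - labeled_ratio))


# Every (normal class, known anomaly class) pair defines one experiment.
experiments = [
    (normal, known)
    for normal in range(10)
    for known in range(10)
    if known != normal
]

m = num_labeled(6000, 0.05)  # hypothetical: 6000 unlabeled normal samples
achieved_ratio = m / (6000 + m)
```

Each pair leaves eight anomaly classes unseen during training, which is what makes the test-time anomalies largely novel.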
(ii) Polluted training data
Here we investigate the robustness of the different methods to an increasing pollution ratio $\gamma_p$ of the training set with unknown anomalies. To do so, we pollute the unlabeled part of the training set with anomalies drawn from all nine respective anomaly classes in each AD setup. We fix the ratio of labeled training samples $\gamma_l$, where we again draw samples only from $k_l = 1$ anomaly class in this scenario. We repeat this training set generation process per AD setup over all nine respective anomaly classes and report the average results over the resulting 90 experiments per pollution ratio $\gamma_p$. We hypothesize that the semi-supervised approach alleviates the negative impact pollution has on detection performance, since labeled anomalies should help to “filter out” similar unknown anomalies.
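The pollution step can be sketched as follows, assuming the pollution ratio is the fraction of the unlabeled training set that consists of (unknown) anomalies. The function name `pollute`, the tuple-based toy samples, the seed, and the ratio 0.1 are illustrative assumptions.

```python
import random


def pollute(normal_samples, anomaly_pool, pollution_ratio, seed=0):
    """Mix anomalies into the unlabeled set so that they make up
    (approximately) pollution_ratio of the resulting set."""
    n_normal = len(normal_samples)
    n_anomalies = round(pollution_ratio * n_normal / (1.0 - pollution_ratio))
    rng = random.Random(seed)
    drawn = rng.sample(anomaly_pool, n_anomalies)  # without replacement
    mixed = list(normal_samples) + drawn
    rng.shuffle(mixed)  # hide which samples are anomalous
    return mixed


normal = [("normal", i) for i in range(90)]
pool = [("anomaly", i) for i in range(50)]
unlabeled = pollute(normal, pool, pollution_ratio=0.1)
n_polluted = sum(1 for kind, _ in unlabeled if kind == "anomaly")
```

The anomalies stay unlabeled after mixing, so a method only benefits from them indirectly, for example when they resemble the labeled anomalies.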
(iii) Number of known anomaly classes
In the last scenario, we compare the detection performance at various numbers of known anomaly classes. In scenarios (i) and (ii), we have always sampled labeled anomalies only from $k_l = 1$ out of the nine anomaly classes. In this scenario, we now increase the number of anomaly classes $k_l$ included in the labeled part of the training set. Since we have a limited number of anomaly classes (nine) in each AD setup, we expect the supervised classifier to catch up at some point. We again fix the overall ratio of labeled training examples $\gamma_l$ and the pollution ratio $\gamma_p$ of the unlabeled training data in this scenario. We repeat this training set generation process for ten seeds in each of the ten AD setups and report the average results over the resulting 100 experiments per number $k_l$. For every seed, the $k_l$ classes are drawn uniformly at random out of the nine respective anomaly classes.
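The per-seed sampling of known anomaly classes can be sketched as below. The function name and the choice of three known classes are hypothetical; only the structure (ten setups, ten seeds, uniform sampling without replacement from the nine anomaly classes) follows the description above.

```python
import random


def sample_known_anomaly_classes(normal_class, k, seed, num_classes=10):
    """Draw k distinct anomaly classes uniformly at random."""
    candidates = [c for c in range(num_classes) if c != normal_class]
    return sorted(random.Random(seed).sample(candidates, k))


# Ten AD setups times ten seeds for one fixed number of known classes.
runs = [
    (normal_class, seed, sample_known_anomaly_classes(normal_class, 3, seed))
    for normal_class in range(10)
    for seed in range(10)
]
```

Seeding a separate `random.Random` instance per run keeps each draw reproducible without touching the global random state.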
Results
The results of the scenarios (i)–(iii) are shown in Figures 2–4. In addition to reporting the average AUC with standard deviation, we always conduct Wilcoxon signed-rank tests (Wilcoxon, 1945) between the best and second-best performing methods and indicate statistically significant differences. Figure 2 demonstrates the advantage of a semi-supervised approach to AD, especially on the most complex CIFAR-10 dataset, where Deep SAD performs best. Moreover, Figure 2 confirms that a supervised approach is vulnerable to novel anomalies at testing when only little labeled training data is available. In comparison, our Deep SAD generalizes to novel anomalies while also taking advantage of the labeled examples. Note that the hybrid SSAD, which has not yet been considered in the literature, also proves to be a sound baseline. Figure 3 shows that the detection performance of all methods decreases with increasing data pollution. Deep SAD again proves to be most robust, especially on the most complex CIFAR-10. Finally, Figure 4 shows that the more diverse the labeled anomalies in the training set are, the better the detection performance becomes. We also see that the supervised method is very sensitive to the number of anomaly classes but catches up at some point, as expected. Overall, we observe that Deep SAD is particularly advantageous on complex data.
Hyperparameter sensitivity analysis
We run Deep SAD experiments on the ten AD setups from above on each dataset for varying values of $\eta$ to analyze the sensitivity of Deep SAD with respect to the hyperparameter $\eta$. In this analysis, we fix the experimental parameters $\gamma_l$, $\gamma_p$, and $k_l$ and again iterate over all nine anomaly classes in every AD setup. The results shown in Figure 5 suggest that Deep SAD is fairly robust against changes of the hyperparameter $\eta$.
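| Dataset | OC-SVM Raw | OC-SVM Hybrid | Deep SVDD | SSAD Raw | SSAD Hybrid | Supervised Classifier | Deep SAD |
|---|---|---|---|---|---|---|---|
| arrhythmia | 84.5 ± 3.9 | 76.7 ± 6.2 | 74.6 ± 9.0 | 86.7 ± 4.0 | 78.3 ± 5.1 | 39.2 ± 9.5 | 75.9 ± 8.7 |
| cardio | 98.5 ± 0.3 | 82.8 ± 9.3 | 84.8 ± 3.6 | 98.8 ± 0.3 | 86.3 ± 5.8 | 83.2 ± 9.6 | 95.0 ± 1.6 |
| satellite | 95.1 ± 0.2 | 68.6 ± 4.8 | 79.8 ± 4.1 | 96.2 ± 0.3 | 86.9 ± 2.8 | 87.2 ± 2.1 | 91.5 ± 1.1 |
| satimage-2 | 99.4 ± 0.8 | 96.7 ± 2.1 | 98.3 ± 1.4 | 99.9 ± 0.1 | 96.8 ± 2.1 | 99.9 ± 0.1 | 99.9 ± 0.1 |
| shuttle | 99.4 ± 0.9 | 94.1 ± 9.5 | 86.3 ± 7.5 | 99.6 ± 0.5 | 97.7 ± 1.0 | 95.1 ± 8.0 | 98.4 ± 0.9 |
| thyroid | 98.3 ± 0.9 | 91.2 ± 4.0 | 72.0 ± 9.7 | 97.9 ± 1.9 | 95.3 ± 3.1 | 97.8 ± 2.6 | 98.6 ± 0.9 |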
4.3 Classic Anomaly Detection Benchmark Datasets
In the last experiment, we examine the detection performance on some well-established AD benchmark datasets (Rayana, 2016) listed in Table 1. We do this to evaluate the deep against the shallow approaches also on non-image, tabular datasets that are rarely considered in the deep AD literature. For the evaluation, we consider random train-to-test set splits of 60:40 while maintaining the original proportion of anomalies in each set. We then run experiments for 10 seeds with $\gamma_l = 0.01$ and $\gamma_p = 0$, i.e. 1% of the training set are labeled anomalies and the unlabeled training data is unpolluted. Since there are no specific different anomaly classes, we set $k_l = 1$. We standardize features to have zero mean and unit variance as the only preprocessing step.
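The split and preprocessing for the tabular benchmarks can be sketched with the standard library. The function names are illustrative, and two details are assumptions not stated above: the split is stratified per class to preserve the anomaly proportion, and the standardization statistics are computed on the column that is passed in.

```python
import random
import statistics


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_indices, test_indices), splitting each class separately
    so that both parts keep the original proportion of anomalies."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


def standardize(column):
    """Shift and scale one feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


toy_labels = [1] * 90 + [-1] * 10  # 10% anomalies
train_idx, test_idx = stratified_split(toy_labels)
train_anomalies = sum(1 for i in train_idx if toy_labels[i] == -1)
z = standardize([1.0, 2.0, 3.0])
```

With 10% anomalies overall, the 60:40 split yields 6 anomalies among 60 training samples and 4 among 40 test samples.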
Table 2 shows the results. We observe that the shallow kernel methods seem to perform slightly better on the rather small, low-dimensional benchmarks. Deep SAD proves competitive, though, and the small differences might be explained by the strong advantage we deliberately grant the shallow methods in the selection of their hyperparameters. The results in Section 4.2 and other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) demonstrate that deep methods are especially superior on complex data with hierarchical structure. Unlike other deep approaches (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018), however, our Deep SAD method is not domain or data-type specific. Due to its strong performance using both deep and shallow networks, we expect Deep SAD to extend well to other data types.
5 Conclusion
We have introduced Deep SAD, a deep method for semi-supervised anomaly detection. To derive our method, we formulated an information-theoretic perspective on deep anomaly detection. Our experiments demonstrate that Deep SAD improves detection performance, especially on more complex datasets, even with only small amounts of labeled data. Our results suggest that semi-supervised approaches to anomaly detection should always be preferred in applications whenever some labeled information is available.
Acknowledgments
LR acknowledges support from the German Ministry of Education and Research (BMBF) in the project ALICE III (FKZ: 01IS18049B). MK and RV acknowledge support by the German Research Foundation (DFG) award KL 2698/21 and by the German Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while MK was a sabbatical visitor of the DASH Center at the University of Southern California. AB is grateful for support by the Singapore Ministry of Education grant MOE2016T22154. This work was supported by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center (01IS14013A) and Berlin Center for Machine Learning (01IS18037I). Partial funding by DFG is acknowledged (EXC 2046/1, projectID: 390685689). This work was also supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017000451, No. 2017001779).
References
 Achille and Soatto [2018] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1):1947–1980, 2018.
 Alemi et al. [2017] A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
 Alemi et al. [2018] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, volume 80, pages 159–168, 2018.
 Andrews et al. [2016] J. T. A. Andrews, E. J. Morton, and L. D. Griffin. Detecting Anomalous Data Using Auto-Encoders. International Journal of Machine Learning and Computing, 6(1):21, 2016.
 Arora et al. [2018] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, volume 80, pages 244–253, 2018.
 Belkin et al. [2018] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 540–548, 2018.
 Bell and Sejnowski [1995] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
 Blanchard et al. [2010] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
 Chalapathy and Chawla [2019] R. Chalapathy and S. Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
 Chandola et al. [2009] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):1–58, 2009.
 Chapelle et al. [2009] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
 Chen et al. [2017] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 90–98, 2017.
 Chen et al. [2016] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 Cohen et al. [2016] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In International Conference on Algorithmic Learning Theory, volume 49, pages 698–728, 2016.
 Cover and Thomas [2012] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
 Dai et al. [2017] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
 Deecke et al. [2018] L. Deecke, R. A. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft. Image anomaly detection with generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3–17, 2018.
 Eldan and Shamir [2016] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In International Conference on Algorithmic Learning Theory, volume 49, pages 907–940, 2016.
 Erfani et al. [2016] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134, 2016.
 Ergen et al. [2017] T. Ergen, A. H. Mirza, and S. S. Kozat. Unsupervised and semi-supervised anomaly detection with LSTM neural networks. arXiv:1710.09207, 2017.
 Glorot and Bengio [2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
 Golan and El-Yaniv [2018] I. Golan and R. El-Yaniv. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pages 9758–9769, 2018.
 Görnitz et al. [2013] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
 Hawkins et al. [2002] S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier Detection Using Replicator Neural Networks. In International Conference on Data Warehousing and Knowledge Discovery, volume 2454, pages 170–180, 2002.
 Hendrycks et al. [2019] D. Hendrycks, M. Mazeika, and T. G. Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
 Hinton and Salakhutdinov [2006] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
 Hjelm et al. [2019] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
 Hoffman and Johnson [2016] M. D. Hoffman and M. J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop in Advances in Approximate Bayesian Inference, 2016.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Ji et al. [2018] X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.

Kim and Scott [2012]
J. Kim and C. D. Scott.
Robust kernel density estimation.
Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.  Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. [2014] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 Kiran et al. [2018] B. Kiran, D. Thomas, and R. Parakkal. An overview of deep learning based methods for unsupervised and semisupervised anomaly detection in videos. Journal of Imaging, 4(2):36, 2018.
 Lapuschkin et al. [2019] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.R. Müller. Unmasking clever hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
 Linsker [1988] R. Linsker. Selforganization in a perceptual network. IEEE Computer, 21(3):105–117, 1988.
 Liu et al. [2008] F. T. Liu, K. M. Ting, and Z.H. Zhou. Isolation Forest. In International Conference on Data Mining, pages 413–422, 2008.
 Liu and Zheng [2006] Y. Liu and Y. F. Zheng. Minimum enclosing and maximum excluding machine for pattern description and discrimination. In International Conference on Pattern Recognition, volume 3, pages 129–132, 2006.
 Makhzani and Frey [2014] A. Makhzani and B. Frey. Ksparse autoencoders. In International Conference on Learning Representations, 2014.
 Makhzani et al. [2015] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In International Conference on Learning Representations, 2015.
 Min et al. [2018] E. Min, J. Long, Q. Liu, J. Cui, Z. Cai, and J. Ma. SU-IDS: A semi-supervised and unsupervised framework for network intrusion detection. In International Conference on Cloud Computing and Security, pages 322–334, 2018.
 Montavon et al. [2011] G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.
 Moya et al. [1993] M. M. Moya, M. W. Koch, and L. D. Hostetler. One-class classifier networks for target recognition applications. In Proceedings World Congress on Neural Networks, pages 797–801, 1993.

Muñoz-Marí et al. [2010] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls. Semi-Supervised One-Class Support Vector Machines for Classification of Remote Sensing Data. IEEE Transactions on Geoscience and Remote Sensing, 48(8):3188–3197, 2010.
 Neyshabur et al. [2017] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
 Nicolau et al. [2016] M. Nicolau, J. McDermott, et al. A hybrid autoencoder and density estimation model for anomaly detection. In International Conference on Parallel Problem Solving from Nature, pages 717–726, 2016.
 Odena [2016] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv:1606.01583, 2016.
 Oliver et al. [2018] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.

Parzen [1962] E. Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
 Pimentel et al. [2014] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
 Raghu et al. [2017] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, volume 70, pages 2847–2854, 2017.
 Rasmus et al. [2015] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
 Rayana [2016] S. Rayana. ODDS library, 2016. URL http://odds.cs.stonybrook.edu.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, volume 32, pages 1278–1286, 2014.
 Ruff et al. [2018] L. Ruff, R. A. Vandermeulen, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In International Conference on Machine Learning, volume 80, pages 4390–4399, 2018.
 Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing – Explorations in the Microstructure of Cognition, chapter 8, pages 318–362. MIT Press, 1986.
 Sakurada and Yairi [2014] M. Sakurada and T. Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the 2nd MLSDA Workshop, page 4, 2014.
 Saxe et al. [2018] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
 Schölkopf et al. [2001] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443–1471, 2001.
 Scott and Nowak [2006] C. D. Scott and R. D. Nowak. Learning minimum volume sets. Journal of Machine Learning Research, 7(Apr):665–704, 2006.
 Shannon [1948] C. E. Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
 Shwartz-Ziv and Tishby [2017] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
 Slonim et al. [2005] N. Slonim, G. S. Atwal, G. Tkačik, and W. Bialek. Information-based clustering. Proceedings of the National Academy of Sciences, 102(51):18297–18302, 2005.
 Tax and Duin [2004] D. M. J. Tax and R. P. W. Duin. Support Vector Data Description. Machine Learning, 54(1):45–66, 2004.
 Tishby and Zaslavsky [2015] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, pages 1–5, 2015.
 Tishby et al. [1999] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
 Vandermeulen and Scott [2013] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. In Conference on Learning Theory, pages 568–591, 2013.
 Wang et al. [2005] J. Wang, P. Neskovic, and L. N. Cooper. Pattern classification via single spheres. In International Conference on Discovery Science, pages 241–252. Springer, 2005.

Wiatowski and Bölcskei [2018] T. Wiatowski and H. Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3):1845–1866, 2018.
 Wilcoxon [1945] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

Zhai et al. [2016] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, volume 48, pages 1100–1109, 2016.
 Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
 Zhao et al. [2017] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
 Zhu [2008] X. Zhu. Semi-supervised learning literature survey. Computer Sciences TR 1530, University of Wisconsin-Madison, 2008.
Appendix A Optimization of Deep SAD
The Deep SAD objective is generally non-convex in the network weights, as is usually the case in deep learning. For computationally efficient optimization, we rely on (mini-batch) SGD to optimize the network weights using the backpropagation algorithm. For improved generalization, we add weight decay regularization, controlled by a hyperparameter, to the objective. Algorithm 1 summarizes the Deep SAD optimization routine.
Using SGD allows Deep SAD to scale to large datasets, as the computational complexity scales linearly in the number of training batches and the computations in each batch can be parallelized (e.g., by training on GPUs). Moreover, Deep SAD has low memory complexity, as a trained model is fully characterized by the final network parameters and no data must be saved or referenced for prediction. Instead, prediction only requires a forward pass through the network, which usually is just a composition of simple functions. This enables fast predictions for Deep SAD.
Initialization of the network weights
We establish an autoencoder pretraining routine for initialization. That is, we first train an autoencoder whose encoder has the same architecture as the Deep SAD network, using a reconstruction loss (mean squared error or cross-entropy loss). After training, we initialize the network weights with the converged parameters of the encoder. Note that this is in line with the Infomax principle (2) for unsupervised representation learning.
Initialization of the center
After initializing the network weights, we fix the hypersphere center as the mean of the network representations that we obtain from an initial forward pass on the data (excluding labeled anomalies). We found SGD convergence to be smoother and faster when the center is fixed in the neighborhood of the initial data representations, as we already observed in Ruff et al. (2018). If some labeled normal examples are available, using only those examples for the mean initialization would be another strategy to minimize possible distortions from polluted unlabeled training data. Adding the center to the optimization variables would allow a trivial “hypersphere collapse” solution for unsupervised Deep SVDD.
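The center initialization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `init_center`, the toy encoder, and the label convention (0 for unlabeled, +1 for labeled normal, -1 for labeled anomaly) are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def init_center(
    encode: Callable[[Sequence[float]], Sequence[float]],
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    anomaly_label: int = -1,
) -> List[float]:
    """Return the mean encoded representation, skipping labeled anomalies."""
    kept = [encode(x) for x, y in zip(samples, labels) if y != anomaly_label]
    if not kept:
        raise ValueError("no samples left after excluding labeled anomalies")
    dim = len(kept[0])
    return [sum(z[k] for z in kept) / len(kept) for k in range(dim)]


def toy_encoder(x: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for the pretrained encoder: a fixed linear map.
    return [x[0] + x[1], x[0] - x[1]]


samples = [[0.0, 0.0], [2.0, 2.0], [100.0, 100.0]]
labels = [0, 1, -1]  # assumed convention: 0 unlabeled, +1 normal, -1 anomaly
center = init_center(toy_encoder, samples, labels)
print(center)
```

Restricting the mean to labeled normal examples, the alternative strategy mentioned above, would only change the filter condition to `y == 1` under the same assumed label convention.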
Preventing a hypersphere collapse
A “hypersphere collapse” describes the trivial solution in which the neural network converges to a constant function, i.e., the hypersphere collapses to a single point. In Ruff et al. (2018), we demonstrate theoretical network properties that prevent such a collapse, which we adopt for Deep SAD. Most importantly, the network must have no bias terms and no bounded activation functions. We refer to Ruff et al. (2018) for further details. If sufficiently many labeled anomalies are available for training, however, hypersphere collapse is not a problem for Deep SAD due to the opposing labeled and unlabeled objectives.
Appendix B Network Architectures
We employ LeNet-type convolutional neural networks (CNNs) on MNIST, Fashion-MNIST, and CIFAR-10, where each convolutional module consists of a convolutional layer followed by leaky ReLU activations and max-pooling. On MNIST, we employ a CNN with two modules followed by a final dense layer. On Fashion-MNIST, we also employ a CNN with two modules, followed by two dense layers. On CIFAR-10, we employ a CNN with three modules followed by a final dense layer.
On the classic AD benchmark datasets, we employ standard MLP feed-forward architectures: a 3-layer MLP on arrhythmia; a 3-layer MLP on cardio, satellite, satimage-2, and shuttle; and a 3-layer MLP on thyroid.
Appendix C Details on Competing Methods
OC-SVM/SVDD
The OC-SVM and SVDD are equivalent for the Gaussian/RBF kernel we employ. As mentioned in the main paper, we deliberately grant the OC-SVM/SVDD an unfair advantage by selecting its hyperparameters to maximize AUC on a subset (10%) of the test set, which establishes a strong baseline. To do this, we search over candidate values of the RBF scale parameter and select the best-performing one. Moreover, we always repeat this search over a second hyperparameter and then report the best final result.
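The selection protocol above amounts to a grid search scored by AUC on a held-out subset. The sketch below illustrates that idea only: the candidate grid, the toy data, and `toy_score` (a negative RBF kernel sum standing in for a fitted OC-SVM) are all hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC: chance that an anomaly (1) outscores a normal sample (0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_by_auc(
    grid: Sequence[float],
    score_fn: Callable[[float, float], float],
    val_x: Sequence[float],
    val_y: Sequence[int],
) -> Tuple[float, float]:
    """Return (best_value, best_auc) over the candidate grid."""
    results = [(auc([score_fn(g, x) for x in val_x], val_y), g) for g in grid]
    best_auc, best = max(results, key=lambda r: r[0])
    return best, best_auc


TRAIN = [0.0, 0.1, -0.1, 0.2]  # toy "normal" training points


def toy_score(gamma: float, x: float) -> float:
    # Hypothetical scorer: negative RBF kernel sum, so far-away points score high.
    return -sum(math.exp(-gamma * (x - t) ** 2) for t in TRAIN)


grid = [0.01, 1.0, 100.0]          # hypothetical candidate scale values
val_x = [0.05, -0.05, 0.15, 3.0, -2.5]
val_y = [0, 0, 0, 1, 1]            # 1 marks an anomaly
best_gamma, best_auc = select_by_auc(grid, toy_score, val_x, val_y)
```

Because the selection uses test labels, the resulting score is an optimistic upper bound for the baseline, which is the "unfair advantage" the text refers to.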
Isolation Forest (IF)
We set the number of trees and the subsampling size as recommended in the original work (Liu et al., 2008).
Kernel Density Estimator (KDE)
We select the bandwidth of the Gaussian kernel from a grid of candidate values via 5-fold cross-validation using the log-likelihood score, following Ruff et al. (2018).
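As an illustration of this selection rule, the following sketch picks a bandwidth for a one-dimensional Gaussian KDE by maximizing the mean held-out log-likelihood over 5 folds. The data, the candidate grid, and the interleaved fold assignment are assumptions chosen for the example; the experiments themselves operate on higher-dimensional data.

```python
import math
from typing import List, Sequence


def kde_log_density(x: float, train: Sequence[float], h: float) -> float:
    """Log density at x of a 1-D Gaussian KDE with bandwidth h."""
    norm = len(train) * h * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in train)
    # Guard against log(0) when every kernel underflows.
    return math.log(max(total, 1e-300)) - math.log(norm)


def cv_log_likelihood(data: Sequence[float], h: float, n_folds: int = 5) -> float:
    """Mean held-out log-likelihood; fold k holds out indices i with i % n_folds == k."""
    total = 0.0
    for k in range(n_folds):
        held_out = [x for i, x in enumerate(data) if i % n_folds == k]
        train = [x for i, x in enumerate(data) if i % n_folds != k]
        total += sum(kde_log_density(x, train, h) for x in held_out)
    return total / len(data)


def select_bandwidth(data: Sequence[float], grid: Sequence[float], n_folds: int = 5) -> float:
    """Return the candidate bandwidth with the highest cross-validated score."""
    return max(grid, key=lambda h: cv_log_likelihood(data, h, n_folds))


data: List[float] = [i / 10 for i in range(-20, 21)]  # toy data on [-2, 2]
grid = [0.01, 0.3, 10.0]                              # hypothetical candidates
best_h = select_bandwidth(data, grid)
```

On this toy data the very small bandwidth is penalized because held-out points fall between kernels, and the very large one because the density is spread too thin, so the middle candidate is selected.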
SSAD
We also deliberately grant the state-of-the-art semi-supervised AD kernel method SSAD the unfair advantage of selecting its hyperparameters optimally to maximize AUC on a subset (10%) of the test set. To do this, we again search over candidate values of the scale parameter of the RBF kernel and select the best-performing one. Otherwise, we set the hyperparameters as recommended by the original authors (Görnitz et al., 2013).
(Convolutional) Autoencoder ((C)AE)
To create the (convolutional) autoencoders, we construct the decoders symmetrically to the architectures reported in Appendix B, which make up the encoder parts of the autoencoders. Here, we replace max-pooling with simple upsampling and convolutions with deconvolutions. We train the autoencoders on the MSE reconstruction loss, which also serves as the anomaly score.
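To make the scoring rule concrete, the sketch below computes the mean squared reconstruction error of a sample and uses it as the anomaly score. The `toy_reconstruct` function is a hypothetical stand-in for a trained autoencoder (it replaces every coordinate by the sample mean) and is not the architecture used in the experiments.

```python
from typing import Callable, List, Sequence


def mse_anomaly_score(
    x: Sequence[float],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],
) -> float:
    """Mean squared reconstruction error of x; larger means more anomalous."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def toy_reconstruct(x: Sequence[float]) -> List[float]:
    # Hypothetical autoencoder with a one-number bottleneck: the coordinate mean.
    mean = sum(x) / len(x)
    return [mean] * len(x)


normal_score = mse_anomaly_score([1.0, 1.0, 1.0, 1.0], toy_reconstruct)
anomaly_score = mse_anomaly_score([0.0, 0.0, 0.0, 4.0], toy_reconstruct)
```

A sample that the bottleneck can represent exactly is reconstructed without error, while a sample that deviates from that structure receives a higher score.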
Hybrid Variants
To establish hybrid methods, we apply the OC-SVM, IF, KDE, and SSAD as outlined above to the resulting bottleneck representations given by the converged autoencoder.
Unsupervised Deep SVDD
We consider both variants, Soft-Boundary Deep SVDD and One-Class Deep SVDD, as unsupervised baselines and always report the better performance as the unsupervised result. For Soft-Boundary Deep SVDD, we optimally solve for the radius on every mini-batch and run experiments for different settings of its hyperparameter. We set the weight decay hyperparameter to a fixed value. For Deep SVDD, we remove all bias terms from the network to prevent a hypersphere collapse, as we recommended in the original work (Ruff et al., 2018).
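One way to read "optimally solve for the radius" is sketched below, under the assumption that the soft-boundary objective for a fixed network has the form R² + (1/(νn)) Σ max(0, dᵢ − R²), where dᵢ are the squared distances to the center in the mini-batch. Under that assumption a minimizer is an order statistic of the distances. The function names and the value of ν used here are hypothetical.

```python
import math
from typing import Sequence


def soft_boundary_objective(r2: float, sq_dists: Sequence[float], nu: float) -> float:
    """Assumed objective: R^2 plus the nu-scaled mean hinge penalty on points outside."""
    return r2 + sum(max(0.0, d - r2) for d in sq_dists) / (nu * len(sq_dists))


def optimal_radius_sq(sq_dists: Sequence[float], nu: float) -> float:
    """Return a minimizer of the assumed objective over R^2 for one mini-batch."""
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")
    ordered = sorted(sq_dists)
    n = len(ordered)
    # At most floor(nu * n) points may remain outside the sphere at the optimum.
    n_outside = math.floor(nu * n)
    return ordered[n - n_outside - 1]


batch_sq_dists = [0.1, 0.2, 0.3, 0.4, 9.0]  # toy squared distances to the center
nu = 0.25                                    # hypothetical value
r2 = optimal_radius_sq(batch_sq_dists, nu)
```

The objective is piecewise linear and convex in R², so checking the breakpoints (the observed distances) is enough to confirm the minimizer on a given batch.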
Deep SAD
We fix the weight decay hyperparameter and, unless reported otherwise, weight the unlabeled and labeled examples equally.
SS-DGM
We consider both the M2 and M1+M2 models and always report the better-performing result. Otherwise, we follow the settings recommended in the original work (Kingma et al., 2014).
Supervised Deep Binary Classifier
To interpret AD as a binary classification problem, we rely on the typical assumption that most of the unlabeled training data is normal and assign the normal label to all unlabeled examples. Already labeled normal examples and labeled anomalies retain their assigned labels. We train the supervised classifier on the binary cross-entropy loss. Note that in scenario (i), in particular, the supervised classifier has perfect, unpolluted label information but still fails to generalize, as there are novel anomaly classes at test time.
SGD Optimization Details for Deep Methods
We use the Adam optimizer with recommended default hyperparameters (Kingma and Ba, 2014) and apply Batch Normalization (Ioffe and Szegedy, 2015) in SGD optimization. For all deep approaches and on all datasets, we employ a two-phase (“searching” and “fine-tuning”) learning rate schedule. In the searching phase, we first train with an initial learning rate for a fixed number of epochs. In the fine-tuning phase, we then train with a second learning rate for a further fixed number of epochs. We always use a batch size of 200. For the autoencoder, SS-DGM, and the supervised classifier, we initialize the network with uniform Glorot weights (Glorot and Bengio, 2010). For Deep SVDD and Deep SAD, we establish an unsupervised pretraining routine via an autoencoder, as explained in Appendix A: we set the network to be the encoder of the autoencoder that we train beforehand.
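The two-phase schedule can be expressed as a simple piecewise-constant function of the epoch index. The sketch below is illustrative only: the function name and the learning rates and epoch counts in the example call are hypothetical placeholders, not the values used in the experiments.

```python
from typing import List


def two_phase_lr(
    epoch: int,
    search_lr: float,
    search_epochs: int,
    finetune_lr: float,
    finetune_epochs: int,
) -> float:
    """Learning rate for a 0-based epoch under a searching-then-fine-tuning schedule."""
    if epoch < 0 or epoch >= search_epochs + finetune_epochs:
        raise ValueError("epoch lies outside the schedule")
    return search_lr if epoch < search_epochs else finetune_lr


# Hypothetical settings: 3 searching epochs followed by 2 fine-tuning epochs.
schedule: List[float] = [two_phase_lr(e, 1e-3, 3, 1e-4, 2) for e in range(5)]
```

In a training loop, the returned value would be written into the optimizer's learning rate at the start of each epoch.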
Appendix D Complete Experimental Results
Besides the tables that list the complete experimental results of all methods, we provide AUC scatterplots of the best (1st) vs. second-best (2nd) performing methods in the experimental scenarios (i)–(iii) on the most complex dataset, CIFAR-10. If many points fall above the identity line, this is a strong indication that the best method indeed significantly outperforms the second best, which is often the case for Deep SAD.