A PyTorch implementation of Deep SAD, a deep Semi-supervised Anomaly Detection method.
Deep approaches to anomaly detection have recently shown promising results over shallow approaches on high-dimensional data. Typically anomaly detection is treated as an unsupervised learning problem. In practice, however, one may have, in addition to a large set of unlabeled samples, access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection make use of such labeled data to improve detection performance. Few deep semi-supervised approaches to anomaly detection have been proposed so far and those that exist are domain-specific. In this work, we present Deep SAD, an end-to-end methodology for deep semi-supervised anomaly detection. Using an information-theoretic perspective on anomaly detection, we derive a loss motivated by the idea that the entropy for the latent distribution of normal data should be lower than the entropy of the anomalous distribution. We demonstrate in extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 along with other anomaly detection benchmark datasets that our approach is on par with or outperforms shallow, hybrid, and deep competitors, even when provided with only little labeled training data.
Anomaly detection (AD) is the task of identifying unusual samples in data. Because this task lacks a supervised learning objective, AD methods typically formulate an unsupervised problem to find a “compact” description of the “normal” class. In one-class classification (Moya et al., 1993; Schölkopf et al., 2001; Tax and Duin, 2004; Ruff et al., 2018), for example, the aim is to find a set of small measure which contains most of the data, and samples that deviate from this description are deemed anomalous. Shallow anomaly detectors such as the One-Class SVM (OC-SVM) (Schölkopf et al., 2001), Support Vector Data Description (SVDD) (Tax and Duin, 2004), Isolation Forest (IF) (Liu et al., 2008), or Kernel Density Estimator (KDE) (Parzen, 1962; Kim and Scott, 2012; Vandermeulen and Scott, 2013) often require manual feature engineering to be effective on high-dimensional data and are limited in their scalability to large datasets. These limitations have sparked great interest in developing novel unsupervised deep learning approaches to AD, a line of research which has already shown promising results (Sakurada and Yairi, 2014; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Ruff et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019).
Unlike the standard AD setting, in many real-world applications one may also have access to some verified (i.e., labeled) normal or anomalous examples in addition to the unlabeled data. Such samples could be hand labeled by a domain expert, for instance. Unsupervised approaches to AD ignore this valuable information whereas supervised approaches can overfit the training data and fail to generalize to out-of-distribution anomalies. Figure 1 illustrates this situation with a toy example.
Semi-supervised AD aims to bridge the gap between unsupervised AD and supervised learning. These approaches do not assume a common pattern among the “anomaly class” and thus do not impose the typical cluster assumption semi-supervised classifiers build upon (Zhu, 2008; Chapelle et al., 2009). Instead, semi-supervised approaches to AD aim to find a “compact description” of the data while also correctly classifying the labeled instances (Blanchard et al., 2010; Görnitz et al., 2013). Because of this, semi-supervised AD methods do not overfit to the labeled anomalies and generalize well to novel anomalies (Görnitz et al., 2013). Existing work on deep semi-supervised learning has almost exclusively focused on classification (Kingma et al., 2014; Rasmus et al., 2015; Odena, 2016; Dai et al., 2017; Oliver et al., 2018); only a few deep semi-supervised approaches have been proposed for AD and those tend to be domain or data-type specific (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018).
In this work, we present Deep SAD (Deep Semi-Supervised Anomaly Detection), an end-to-end deep method for semi-supervised AD. Deep SAD is a generalization of our recently introduced Deep SVDD (Ruff et al., 2018) to include labeled data. We show that our approach can be understood in information-theoretic terms as learning a latent distribution of low entropy for the normal data, with the anomalous distribution having a heavier tailed, higher entropy distribution. To do this we formulate an information-theoretic perspective on deep learning for AD.
The study of the theoretical foundations of deep learning is an active and ongoing research effort (Montavon et al., 2011; Tishby and Zaslavsky, 2015; Cohen et al., 2016; Eldan and Shamir, 2016; Neyshabur et al., 2017; Raghu et al., 2017; Zhang et al., 2017; Achille and Soatto, 2018; Arora et al., 2018; Belkin et al., 2018; Wiatowski and Bölcskei, 2018; Lapuschkin et al., 2019). One strong line of research that has emerged is rooted in information theory (Shannon, 1948).
In the supervised setting where one has input variable $X$, latent variable $Z$ (e.g., the final layer of a deep network), and output variable $Y$ (i.e., the label), the well-known Information Bottleneck principle (Tishby et al., 1999; Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Alemi et al., 2017; Saxe et al., 2018) is an explanation of representation learning as a trade-off between finding a minimal compression $Z$ of the input $X$ while retaining the informativeness of $Z$ for predicting the label $Y$. Put formally: supervised deep learning seeks to minimize the mutual information $I(X;Z)$ between the input $X$ and the latent representation $Z$ while maximizing the mutual information $I(Z;Y)$ between $Z$ and the task $Y$, i.e.
$$\min_{p(z|x)} \; I(X;Z) - \alpha\, I(Z;Y), \tag{1}$$
where $p(z|x)$ is modeled by a deep network and the hyperparameter $\alpha > 0$ controls the trade-off between compression (i.e., complexity) and classification accuracy.
For unsupervised deep learning, due to the absence of labels and thus the lack of an obvious task, other information-theoretic learning principles have been formulated. Of these, the Infomax principle (Linsker, 1988; Bell and Sejnowski, 1995; Hjelm et al., 2019) is one of the most prevalent and widely used principles. In contrast to (1), the objective of Infomax is to maximize the mutual information $I(X;Z)$ between the data $X$ and its representation $Z$:
$$\max_{p(z|x)} \; I(X;Z) + \beta\, R(Z). \tag{2}$$
This is typically done using some additional constraint or regularization $R(Z)$ on the representation $Z$ with hyperparameter $\beta > 0$ to obtain statistical properties desired for some specific downstream task. The Infomax principle has a long history of applications, including unsupervised tasks such as independent component analysis (Bell and Sejnowski, 1995), clustering (Slonim et al., 2005; Ji et al., 2018), generative modeling (Chen et al., 2016; Hoffman and Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018), and unsupervised representation learning in general (Hjelm et al., 2019).
We observe that the Infomax principle has also been implicitly applied in previous deep representations for AD. For example, autoencoding models (Rumelhart et al., 1986; Hinton and Salakhutdinov, 2006), which make up the predominant class of approaches to deep AD (Hawkins et al., 2002; Sakurada and Yairi, 2014; Andrews et al., 2016; Erfani et al., 2016; Zhai et al., 2016; Chen et al., 2017; Chalapathy and Chawla, 2019), can be understood as implicitly maximizing the mutual information $I(X;Z)$ via the reconstruction objective under some regularization of the latent code $Z$. Choices for regularization include sparsity (Makhzani and Frey, 2014), the distance to some prior latent distribution, e.g. measured via the KL divergence (Kingma and Welling, 2013; Rezende et al., 2014), an adversarial loss (Makhzani et al., 2015), or simply a bottleneck in dimensionality. Such restrictions for AD share the idea that the latent representation of the normal data should be in some sense “compact”.
As illustrated in Figure 1, a supervised approach to AD only learns to recognize anomalies similar to those seen in training. However, anything not normal is by definition an anomaly and there is no explicit distribution of the “anomaly class”. This makes supervised learning principles such as (1) ill-defined for AD. We instead build upon principle (2) to derive a deep method for semi-supervised AD, where we include the label information through a novel representation learning regularizer that is based on entropy.
In the following, we introduce Deep SAD, a deep method for semi-supervised AD. To formulate our objective, we first briefly review the unsupervised Deep SVDD method (Ruff et al., 2018) and show its connection to entropy minimization. We then generalize the method to the semi-supervised AD setting.
For input space $\mathcal{X} \subseteq \mathbb{R}^D$ and output space $\mathcal{Z} \subseteq \mathbb{R}^d$, let $\phi(\cdot\,; W): \mathcal{X} \to \mathcal{Z}$ be a neural network with $L$ hidden layers and corresponding set of weights $W = \{W^1, \ldots, W^L\}$. The objective of Deep SVDD is to train the neural network $\phi$ to learn a transformation that minimizes the volume of a data-enclosing hypersphere in output space $\mathcal{Z}$ centered on a predetermined point $c$. Given $n$ (unlabeled) training samples $x_1, \ldots, x_n \in \mathcal{X}$, the One-Class Deep SVDD objective is defined as:
$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \|\phi(x_i; W) - c\|^2 + \frac{\lambda}{2} \sum_{\ell=1}^{L} \|W^{\ell}\|_F^2. \tag{3}$$
Deep SVDD penalizes the mean squared distance of the mapped data points to the center $c$ of the sphere. This forces the network to extract those common factors of variation which are most stable within a dataset. As a consequence, normal data points tend to get mapped near the hypersphere center, whereas anomalies are mapped further away (Ruff et al., 2018). The second term is a weight decay regularizer on the network weights $W$ with $\lambda > 0$, where $\|\cdot\|_F$ denotes the Frobenius norm.
The unsupervised Deep SVDD can be optimized via SGD using backpropagation. For initialization, the authors first pre-train an autoencoder and then initialize the network $\phi$ with the converged weights of the encoder. After initializing the network weights $W$, the hypersphere center $c$ is fixed as the mean of the network representations obtained from an initial forward pass on the training data (Ruff et al., 2018).
The anomaly score of a test point $x$ is finally given by its distance to the center of the hypersphere:
$$s(x) = \|\phi(x; W) - c\|, \tag{4}$$
where $W$ are the network weights of a trained model.
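The center initialization and the distance-based score described above can be sketched in a few lines of PyTorch. This is only an illustration under assumptions: `init_center`, `anomaly_score`, and the tiny stand-in network `phi` are hypothetical names and not taken from the authors' code.

```python
import torch
from torch import nn


@torch.no_grad()
def init_center(net: nn.Module, data: torch.Tensor) -> torch.Tensor:
    """Fix the center c as the mean latent representation of one forward pass."""
    net.eval()
    return net(data).mean(dim=0)


@torch.no_grad()
def anomaly_score(net: nn.Module, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Euclidean distance of each mapped point to the center c."""
    net.eval()
    return torch.linalg.norm(net(x) - c, dim=1)


# Hypothetical stand-in for the encoder network (sizes chosen arbitrarily).
torch.manual_seed(0)
phi = nn.Sequential(
    nn.Linear(8, 16, bias=False), nn.LeakyReLU(), nn.Linear(16, 4, bias=False)
)
train_x = torch.randn(64, 8)
c = init_center(phi, train_x)
scores = anomaly_score(phi, torch.randn(5, 8), c)
```

Larger values in `scores` correspond to points that are mapped further from the center and are therefore considered more anomalous.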
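We now show that Deep SVDD may not only be understood in terms of minimum volume estimation (Scott and Nowak, 2006), but also in terms of entropy minimization over the latent distribution. For a (continuous) latent random variable $Z$ with pdf $p(z)$ and support $\mathcal{Z} \subseteq \mathbb{R}^d$, its (differential) entropy is given by
$$H(Z) = -\int_{\mathcal{Z}} p(z) \log p(z)\, dz. \tag{5}$$
Assuming $Z$ has finite covariance $\Sigma$, it follows that
$$H(Z) \le \frac{1}{2} \log\big((2\pi e)^d \det \Sigma\big), \tag{6}$$
with equality if and only if $Z$ is jointly Gaussian (Cover and Thomas, 2012). Thus, if $Z$ follows an isotropic Gaussian, $Z \sim \mathcal{N}(\mu, \sigma^2 I)$, with $\sigma > 0$, then
$$H(Z) = \frac{1}{2} \log\big((2\pi e)^d \det(\sigma^2 I)\big) = \frac{d}{2}\big(1 + \log(2\pi\sigma^2)\big) \propto \log \sigma^2, \tag{7}$$
i.e. for a fixed dimensionality $d$, the entropy of $Z$ is, up to an additive constant, proportional to its log-variance.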
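As a quick numerical check of the log-variance relationship, the closed-form entropy of an isotropic Gaussian can be evaluated for two variances. The function name and the chosen values of `d` and `sigma` are illustrative assumptions.

```python
import math


def isotropic_gaussian_entropy(d: int, sigma: float) -> float:
    """Differential entropy (in nats) of N(mu, sigma^2 I) in d dimensions."""
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi * sigma ** 2))


d = 32
h_small = isotropic_gaussian_entropy(d, sigma=0.5)
h_large = isotropic_gaussian_entropy(d, sigma=1.0)

# Doubling sigma multiplies the variance by 4, which adds (d / 2) * log(4) nats.
gap = h_large - h_small
```

The entropy gap depends only on the ratio of the variances, which is why shrinking the variance of the latent representation lowers this entropy bound.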
Now observe that the unsupervised Deep SVDD objective (3) (disregarding weight decay regularization) is equivalent to minimizing the empirical variance of the latent representations, and thus minimizes an approximate upper bound on the entropy of the latent distribution.
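We now introduce our Deep SAD method. Assume that, in addition to the $n$ unlabeled samples $x_1, \ldots, x_n \in \mathcal{X}$ with $\mathcal{X} \subseteq \mathbb{R}^D$, we have access to $m$ labeled samples $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_m, \tilde{y}_m) \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1, +1\}$. We denote $\tilde{y} = +1$ for known normal examples and $\tilde{y} = -1$ for known anomalies.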
Following the insights above, we formulate our deep semi-supervised AD objective under the idea that the latent distribution of the normal data, $Z^+ = Z \mid \{Y = +1\}$, should have low entropy, whereas the latent distribution of anomalies, $Z^- = Z \mid \{Y = -1\}$, should have high entropy. By this, we do not impose any additional assumption on the anomaly-generating distribution $X \mid \{Y = -1\}$, such as a manifold or cluster assumption that supervised or semi-supervised classification approaches commonly make (Zhu, 2008; Chapelle et al., 2009). We argue that such a model better captures the nature of anomalies, which can be thought of as being generated from an infinite mixture of all distributions that are different from the normal data distribution, indubitably a distribution that has high entropy. We can express this idea in terms of principle (2) with respective entropy regularization of the latent distribution, writing $H(\cdot)$ for the (differential) entropy:
$$\max_{p(z|x)} \; I(X;Z) + \beta\,\big(H(Z^-) - H(Z^+)\big). \tag{8}$$
Based on the connection between Deep SVDD and entropy minimization we have shown in Section 3.2, we define our Deep SAD objective as
$$\min_{W} \; \frac{1}{n+m} \sum_{i=1}^{n} \|\phi(x_i; W) - c\|^2 + \frac{\eta}{n+m} \sum_{j=1}^{m} \Big(\|\phi(\tilde{x}_j; W) - c\|^2\Big)^{\tilde{y}_j} + \frac{\lambda}{2} \sum_{\ell=1}^{L} \|W^{\ell}\|_F^2, \tag{9}$$
with hyperparameters $\eta > 0$ and $\lambda > 0$. We again impose a quadratic loss on the distances of the mapped points to the fixed center $c$, for both the unlabeled as well as the labeled normal examples, thus intending to learn a latent distribution with low entropy for the normal data. This also incorporates the assumption common in AD that most of the unlabeled data is normal. In contrast, for the labeled anomalies we penalize the inverse of the distances such that anomalies must be mapped further away from the center. (To ensure numerical stability, we add a machine epsilon, eps, to the denominator of the inverse.) That is, we penalize low variance and thus the network must attempt to map known anomalies to a heavy-tailed distribution that has high entropy. To maximize the mutual information $I(X;Z)$ in (8), we also rely on autoencoder pre-training.
The hyperparameter $\eta > 0$ controls the balance between the labeled and unlabeled terms, where $\eta < 1$ emphasizes the unlabeled and $\eta > 1$ the labeled objective. For $\eta = 1$, the two terms are weighted equally. The last term is a weight decay regularizer. Note that we recover the unsupervised Deep SVDD (3) formulation as the special case where only unlabeled data is available ($m = 0$). As an anomaly score, we again take the distance of the latent representation to the center $c$ as in (4). We optimize the generally non-convex Deep SAD objective (9) via SGD using backpropagation. Appendix A in the supplementary material provides further details.
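A mini-batch version of the Deep SAD loss described above could look as follows. This is a sketch under assumptions, not a definitive implementation: the encoding of unlabeled samples as `0` in `semi_targets`, the value `eps=1e-6`, and the function name are illustrative, and the weight decay term is left to the optimizer.

```python
import torch


def deep_sad_loss(z, c, semi_targets, eta=1.0, eps=1e-6):
    """Deep SAD loss on one mini-batch, without the weight decay term.

    z: (batch, d) network outputs; c: (d,) fixed hypersphere center.
    semi_targets: 0 = unlabeled, +1 = labeled normal, -1 = labeled anomaly
    (this integer encoding is an assumption of the sketch).
    """
    dist = torch.sum((z - c) ** 2, dim=1)                 # squared distance to c
    labeled = eta * (dist + eps) ** semi_targets.float()  # dist^(+1) or dist^(-1)
    losses = torch.where(semi_targets == 0, dist, labeled)
    return losses.mean()                                  # average over all samples


z = torch.tensor([[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]])
c = torch.zeros(2)
y = torch.tensor([0, -1, 1])
loss = deep_sad_loss(z, c, y, eta=1.0)
unsupervised = deep_sad_loss(z, c, torch.zeros(3, dtype=torch.long))
```

With all targets set to `0` the function reduces to the mean squared distance to the center, mirroring the special case where only unlabeled data is available.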
We evaluate Deep SAD on MNIST, Fashion-MNIST, and CIFAR-10 as well as classic anomaly detection benchmark datasets. We compare to shallow, hybrid, as well as deep unsupervised, semi-supervised and supervised competitors. We refer to other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) for further comprehensive comparisons solely between unsupervised deep AD methods. Our code is available at: https://github.com/lukasruff/Deep-SAD-PyTorch
We consider the OC-SVM (Schölkopf et al., 2001) and SVDD (Tax and Duin, 2004) with Gaussian kernel (which are in this case equivalent), Isolation Forest (Liu et al., 2008), and KDE (Parzen, 1962) as shallow unsupervised baselines. For unsupervised deep competitors, we consider the well-established autoencoder and the state-of-the-art unsupervised Deep SVDD method (Ruff et al., 2018). For semi-supervised approaches, we consider the shallow state-of-the-art semi-supervised AD method of SSAD (Görnitz et al., 2013) with Gaussian kernel. As mentioned previously, there are no deep methods for semi-supervised AD that are applicable to the general multivariate data setting. However, we add the well-known Semi-Supervised Deep Generative Model (SS-DGM) (Kingma et al., 2014) to make a comparison with a deep semi-supervised classifier. To complete the full learning spectrum, we also compare to a fully supervised deep classifier trained on the binary cross-entropy loss. Finally, in addition to training the shallow detectors on the raw input features, we also consider all their hybrid variants of applying them to the bottleneck representation given by the autoencoder (Erfani et al., 2016; Nicolau et al., 2016).
In our experiments we deliberately grant the shallow and hybrid methods an unfair advantage by selecting their hyperparameters to maximize AUC on a subset (10%) of the test set to establish strong baselines. To control for architectural effects between the competing deep methods, we always employ the same (LeNet-type) deep networks. Full details on network architectures and hyperparameter selection can be found in Appendices B and C of the supplementary material. Due to space constraints, in the main text we only report results for methods which showed competitive performance and defer results for the under-performing methods in Appendix D.
The MNIST, Fashion-MNIST, and CIFAR-10 datasets all have ten classes from which we derive ten AD setups on each dataset. In every setup, we set one of the ten classes to be the normal class and let the remaining nine classes represent anomalies. We use the original training data of the respective normal class as the unlabeled part of our training set. Thus we start with a clean anomaly detection setting that fulfills the assumption that most (in this case all) unlabeled samples are normal. The training data of the respective nine anomaly classes then forms the data pool from which we draw anomalies for training to create different scenarios. We compute the AUC metric on the original respective test sets using ground truth labels to make a quantitative comparison, i.e. $\tilde{y} = +1$ for the normal class and $\tilde{y} = -1$ for the respective nine anomaly classes. We rescale pixels to $[0, 1]$ via min-max feature scaling as the only data pre-processing step.
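The one-vs-rest labeling and the AUC evaluation in this setup can be illustrated with a small NumPy sketch. The helper names and the toy scores below are assumptions made for illustration; the AUC is computed directly as the probability that a random anomaly receives a higher score than a random normal sample.

```python
import numpy as np


def one_vs_rest_labels(class_ids: np.ndarray, normal_class: int) -> np.ndarray:
    """Ground truth: +1 for the normal class, -1 for every other class."""
    return np.where(class_ids == normal_class, 1, -1)


def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count half."""
    anom, norm = scores[labels == -1], scores[labels == 1]
    diff = anom[:, None] - norm[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


class_ids = np.array([3, 3, 7, 1])          # toy class labels
labels = one_vs_rest_labels(class_ids, normal_class=3)
scores = np.array([0.1, 0.4, 0.9, 0.3])     # toy anomaly scores
result = auc(scores, labels)
```

Here three of the four anomaly/normal pairs are ordered correctly, so `result` is 0.75.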
We examine three scenarios in which we vary the following three experimental parameters: (i) the ratio of labeled training data $\gamma_l$, (ii) the ratio of pollution $\gamma_p$ in the unlabeled training data with (unknown) anomalies, and (iii) the number of anomaly classes $k_l$ included in the labeled training data.
In this scenario, we investigate the effect that including labeled anomalies in training has on detection performance and the potential advantage of using a semi-supervised AD method over other paradigms. To do this we increase the ratio of labeled training data $\gamma_l = m/(n+m)$ by adding more and more known anomalies $\tilde{x}_1, \ldots, \tilde{x}_m$ with $\tilde{y}_j = -1$ to the training set. We add the labeled anomalies from one anomaly class (out of the nine remaining ones). For testing, we then consider all nine remaining classes as anomalies, i.e. there are eight novel classes at testing time. We do this to simulate the unpredictable nature of anomalies. For the unlabeled part of the training set, we keep the training data of the respective normal class, which we leave unpolluted for now, i.e. the pollution ratio is $\gamma_p = 0$. We iterate this training set generation process per AD setup always over all the nine respective anomaly classes and report the average results over the ten AD setups $\times$ nine anomaly classes, i.e. over 90 experiments per labeled ratio $\gamma_l$.
Here we investigate the robustness of the different methods to an increasing pollution ratio $\gamma_p$ of the training set with unknown anomalies. To do so we pollute the unlabeled part of the training set with anomalies drawn from all nine respective anomaly classes in each AD setup. We keep the ratio of labeled training samples $\gamma_l$ fixed and again draw labeled samples only from one anomaly class in this scenario. We repeat this training set generation process per AD setup over all the nine respective anomaly classes and report the average results over the resulting 90 experiments per pollution ratio $\gamma_p$. We hypothesize that the semi-supervised approach alleviates the negative impact pollution has on detection performance, since labeled anomalies should help to “filter out” similar unknown anomalies.
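To make the two ratios concrete, the following sketch computes how many polluting and labeled anomalies are needed for a given number of normal training samples. It assumes that the labeled ratio is defined as m / (n + m), with n unlabeled and m labeled samples, and that the pollution ratio is the fraction of anomalies among the n unlabeled samples; the function name and the example values are hypothetical.

```python
def training_set_sizes(n_normal: int, ratio_labeled: float, ratio_pollution: float):
    """Return (n_polluting, m_labeled) for a given number of unlabeled normal samples."""
    # Pollution ratio = n_polluting / (n_normal + n_polluting), solved for n_polluting.
    n_polluting = round(ratio_pollution * n_normal / (1.0 - ratio_pollution))
    n_unlabeled = n_normal + n_polluting
    # Labeled ratio = m / (n_unlabeled + m), solved for m.
    m_labeled = round(ratio_labeled * n_unlabeled / (1.0 - ratio_labeled))
    return n_polluting, m_labeled


# Example values only; they are not claimed to be the settings used in the experiments.
n_polluting, m_labeled = training_set_sizes(6000, ratio_labeled=0.05, ratio_pollution=0.1)
achieved_labeled_ratio = m_labeled / (6000 + n_polluting + m_labeled)
```

Because the counts are rounded to integers, the achieved ratios match the requested ones only approximately.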
In the last scenario, we compare the detection performance at various numbers of known anomaly classes. In scenarios (i) and (ii), we have always sampled labeled anomalies only from one out of the nine anomaly classes. In this scenario, we now increase the number of anomaly classes $k_l$ included in the labeled part of the training set. Since we have a limited number of anomaly classes (nine) in each AD setup, we expect the supervised classifier to catch up at some point. We again keep the overall ratio of labeled training examples $\gamma_l$ fixed and consider a fixed, non-zero pollution ratio $\gamma_p$ for the unlabeled training data in this scenario. We repeat this training set generation process for ten seeds in each of the ten AD setups and report the average results over the resulting 100 experiments per number $k_l$. For every seed, the $k_l$ classes are drawn uniformly at random out of the nine respective anomaly classes.
In addition to reporting the average AUC with standard deviation, we always conduct Wilcoxon signed-rank tests (Wilcoxon, 1945) between the best and second best performing method and indicate statistically significant differences. Figure 2 demonstrates the advantage of a semi-supervised approach to AD especially on the most complex CIFAR-10 dataset, where Deep SAD performs best. Moreover, Figure 2 confirms that a supervised approach is vulnerable to novel anomalies at testing when only little labeled training data is available. In comparison, our Deep SAD generalizes to novel anomalies while also taking advantage of the labeled examples. Note that the hybrid SSAD, which has not yet been considered in the literature, also proves to be a sound baseline. Figure 3 shows that the detection performance of all methods decreases with increasing data pollution. Deep SAD again proves to be the most robust, especially on the most complex CIFAR-10 dataset. Finally, Figure 4 shows that the more diverse the labeled anomalies in the training set are, the better the detection performance becomes. We also see that the supervised method is very sensitive to the number of anomaly classes but catches up at some point, as suspected. Overall, we observe that Deep SAD is particularly advantageous on complex data.
We run Deep SAD experiments on the ten AD setups from above on each dataset for a range of values of $\eta$ to analyze the sensitivity of Deep SAD with respect to this hyperparameter. In this analysis, we keep the three experimental parameters $\gamma_l$, $\gamma_p$, and $k_l$ fixed and again iterate over all nine anomaly classes in every AD setup. The results shown in Figure 5 suggest that Deep SAD is fairly robust against changes of the hyperparameter $\eta$.
In the last experiment, we examine the detection performance on some well-established AD benchmark datasets (Rayana, 2016) listed in Table 1. We do this to evaluate the deep against the shallow approaches also on non-image, tabular datasets that are rarely considered in the deep AD literature. For the evaluation, we consider random train-to-test set splits of 60:40 while maintaining the original proportion of anomalies in each set. We then run experiments for 10 seeds with $\gamma_l = 0.01$ and $\gamma_p = 0$, i.e. 1% of the training set are labeled anomalies and the unlabeled training data is unpolluted. Since there are no specific different anomaly classes, we set $k_l = 1$. We standardize features to have zero mean and unit variance as the only pre-processing step.
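A 60:40 split that preserves the proportion of anomalies, followed by standardization, might be sketched as below. The helper names are hypothetical, the data is synthetic, and computing the mean and standard deviation on the training part only is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np


def stratified_split(y: np.ndarray, train_frac: float = 0.6, seed: int = 0):
    """Split indices so that each part keeps the original proportion of anomalies."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)


def standardize(train: np.ndarray, test: np.ndarray):
    """Zero mean and unit variance per feature, using training statistics."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return (train - mean) / std, (test - mean) / std


rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))  # synthetic tabular data
y = np.array([1] * 90 + [-1] * 10)                 # 10% anomalies
train_idx, test_idx = stratified_split(y)
X_train, X_test = standardize(X[train_idx], X[test_idx])
```

Both parts of the split contain 10% anomalies, matching the proportion in the full synthetic set.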
Table 2 shows the results. We observe that the shallow kernel methods seem to perform slightly better on the rather small, low-dimensional benchmarks. Deep SAD proves competitive though and the small differences might be explained by the strong advantage we deliberately grant the shallow methods in the selection of their hyperparameters. The results in section 4.2 and other recent works (Ruff et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019) demonstrate that deep methods are especially superior on complex data with hierarchical structure. Unlike other deep approaches (Ergen et al., 2017; Kiran et al., 2018; Min et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018), however, our Deep SAD method is not domain or data-type specific. Due to its strong performance using both deep and shallow networks we expect Deep SAD to extend well to other data types.
We have introduced Deep SAD, a deep method for semi-supervised anomaly detection. To derive our method, we formulated an information-theoretic perspective on deep anomaly detection. Our experiments demonstrate that Deep SAD improves detection performance especially on more complex datasets already with only small amounts of labeled data. Our results suggest that semi-supervised approaches to anomaly detection should always be preferred in applications whenever some labeled information is available.
LR acknowledges support from the German Ministry of Education and Research (BMBF) in the project ALICE III (FKZ: 01IS18049B). MK and RV acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while MK was a sabbatical visitor of the DASH Center at the University of Southern California. AB is grateful for support by the Singapore Ministry of Education grant MOE2016-T2-2-154. This work was supported by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center (01IS14013A) and Berlin Center for Machine Learning (01IS18037I). Partial funding by DFG is acknowledged (EXC 2046/1, project-ID: 390685689). This work was also supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451, No. 2017-0-01779).
Blanchard et al. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
Cohen et al. On the expressive power of deep learning: A tensor analysis. In International Conference on Algorithmic Learning Theory, volume 49, pages 698–728, 2016.
Glorot and Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Hoffman and Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop in Advances in Approximate Bayesian Inference, 2016.
Kim and Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.
Muñoz-Marí et al. Semi-Supervised One-Class Support Vector Machines for Classification of Remote Sensing Data. IEEE Transactions on Geoscience and Remote Sensing, 48(8):3188–3197, 2010.
Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
Zhai et al. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, volume 48, pages 1100–1109, 2016.
The Deep SAD objective is generally non-convex in the network weights $W$, as is usually the case in deep learning. For a computationally efficient optimization, we rely on (mini-batch) SGD to optimize the network weights using the backpropagation algorithm. For improved generalization, we add weight decay regularization with hyperparameter $\lambda > 0$ to the objective. Algorithm 1 summarizes the Deep SAD optimization routine.
Using SGD allows Deep SAD to scale with large datasets as the computational complexity scales linearly in the number of training batches and computations in each batch can be parallelized (e.g., by training on GPUs). Moreover, Deep SAD has low memory complexity as a trained model is fully characterized by the final network parameters and no data must be saved or referenced for prediction. Instead, the prediction only requires a forward pass on the network which usually is just a concatenation of simple functions. This enables fast predictions for Deep SAD.
We establish an autoencoder pre-training routine for initialization. That is, we first train an autoencoder that has an encoder with the same architecture as network $\phi$ on the reconstruction loss (mean squared error or cross-entropy loss). After training, we then initialize the weights $W$ of $\phi$ with the converged parameters of the encoder. Note that this is in line with the Infomax principle (2) for unsupervised representation learning.
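The weight transfer from a pre-trained autoencoder to the encoder network can be sketched in PyTorch as follows. The `Encoder` and `Autoencoder` classes, their layer sizes, and the very short training loop are hypothetical and only serve to show the mechanics.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4, bias=False)

    def forward(self, x):
        return self.fc(x)


class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = nn.Linear(4, 8, bias=False)

    def forward(self, x):
        return self.decoder(self.encoder(x))


torch.manual_seed(0)
ae, x = Autoencoder(), torch.rand(32, 8)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(5):  # a few steps for illustration; not trained to convergence
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(x), x)  # MSE reconstruction loss
    loss.backward()
    opt.step()

phi = Encoder()
phi.load_state_dict(ae.encoder.state_dict())  # copy the trained encoder weights
```

Because the encoder is a submodule with the same architecture as `phi`, its `state_dict` can be loaded directly without renaming keys.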
After initializing the network weights $W$, we fix the hypersphere center $c$ as the mean of the network representations that we obtain from an initial forward pass on the data (excluding labeled anomalies). We found SGD convergence to be smoother and faster by fixing center $c$ in the neighborhood of the initial data representations, as we already observed in Ruff et al. (2018). If some labeled normal examples are available, using only those examples for a mean initialization would be another strategy to minimize possible distortions from polluted unlabeled training data. Adding center $c$ to the optimization variables would allow a trivial “hypersphere collapse” solution for unsupervised Deep SVDD.
A “hypersphere collapse” describes the trivial solution in which the neural network $\phi$ converges to the constant function $\phi \equiv c$, i.e. the hypersphere collapses to a single point. In Ruff et al. (2018), we demonstrate theoretical network properties that prevent such a collapse, which we adopt for Deep SAD. Most importantly, network $\phi$ must have no bias terms and no bounded activation functions. We refer to Ruff et al. (2018) for further details. If there are sufficiently many labeled anomalies available for training, however, hypersphere collapse is not a problem for Deep SAD due to the opposing labeled and unlabeled objectives.
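In PyTorch, the two architectural constraints (no bias terms, no bounded activations) can be expressed roughly as below. The filter counts, kernel size, output dimension, and the use of batch normalization without affine parameters are assumptions of this sketch, not the architecture reported in the paper.

```python
import torch
from torch import nn

phi = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2, bias=False),
    nn.BatchNorm2d(8, affine=False),  # no learnable shift, so no bias enters here
    nn.LeakyReLU(),                   # unbounded activation (default slope)
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 32, bias=False),
)

# Check that no parameter named "bias" exists anywhere in the network.
has_bias = any(name.endswith("bias") for name, _ in phi.named_parameters())
out = phi(torch.zeros(4, 1, 28, 28))
```

A check like `has_bias` is a cheap safeguard when reusing layers whose constructors enable bias terms by default.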
We employ LeNet-type convolutional neural networks (CNNs) on MNIST, Fashion-MNIST, and CIFAR-10, where each convolutional module consists of a convolutional layer followed by leaky ReLU activations with leakinessand
-max-pooling. On MNIST, we employ a CNN with two modules,-filters followed by -filters, and a final dense layer of units. On Fashion-MNIST, we employ a CNN also with two modules, -filters and -filters, followed by two dense layers of and units respectively. On CIFAR-10, we employ a CNN with three modules, -filters, -filters, and -filters, followed by a final dense layer of units.
On the classic AD benchmark datasets, we employ standard MLP feed-forward architectures. On arrhythmia, a 3-layer MLP with -- units. On cardio, satellite, satimage-2, and shuttle a 3-layer MLP with -- units. On thyroid a 3-layer MLP with -- units.
The OC-SVM and SVDD are equivalent for the Gaussian/RBF kernel we employ. As mentioned in the main paper, we deliberately grant the OC-SVM/SVDD an unfair advantage by selecting its hyperparameters to maximize AUC on a subset (10%) of the test set to establish a strong baseline. To do this, we consider the RBF scale parameter and select the best performing one. Moreover, we always repeat this over -parameter and then report the best final result.
We set the number of trees to and the sub-sampling size to , as recommended in the original work (Liu et al., 2008).
We select the bandwidth of the Gaussian kernel from via 5-fold cross-validation using the log-likelihood score following (Ruff et al., 2018).
We also deliberately grant the state-of-the-art semi-supervised AD kernel method SSAD the unfair advantage of selecting its hyperparameters optimally to maximize AUC on a subset (10%) of the test set. To do this, we again select the scale parameter of the RBF kernel we use from and select the best performing one. Otherwise we set the hyperparameters as recommend by the original authors to , , , and (Görnitz et al., 2013).
To create the (convolutional) autoencoders, we symmetrically construct the decoders w.r.t. the architectures reported in Appendix B, which make up the encoder parts of the autoencoders. Here, we replace max-pooling with simple upsampling and convolutions with deconvolutions. We train the autoencoders on the MSE reconstruction loss that also serves as the anomaly score.
To establish hybrid methods, we apply the OC-SVM, IF, KDE, and SSAD as outlined above to the resulting bottleneck representations given by the converged autoencoder.
We consider both variants, Soft-Boundary Deep SVDD and One-Class Deep SVDD as unsupervised baselines and always report the better performance as the unsupervised result. For Soft-Boundary Deep SVDD, we optimally solve for the radius on every mini-batch and run experiments for . We set the weight decay hyperparameter to . For Deep SVDD, we remove all bias terms from the network to prevent a hypersphere collapse as we recommended in the original work (Ruff et al., 2018).
We set and equally weight the unlabeled and labeled examples by setting if not reported otherwise.
We consider both the M2 and M1+M2 model and always report the better performing result. Otherwise we follow the settings as recommended in the original work (Kingma et al., 2014).
To interpret AD as a binary classification problem, we rely on the typical assumption that most of the unlabeled training data is normal by assigning $\tilde{y} = +1$ to all unlabeled examples. Already labeled normal examples and labeled anomalies retain their assigned labels of $\tilde{y} = +1$ and $\tilde{y} = -1$ respectively. We train the supervised classifier on the binary cross-entropy loss. Note that in scenario (i), in particular, the supervised classifier has perfect, unpolluted label information but still fails to generalize as there are novel anomaly classes at testing.
We use the Adam optimizer with recommended default hyperparameters (Kingma and Ba, 2014) and apply Batch Normalization (Ioffe and Szegedy, 2015) in SGD optimization. For all deep approaches and on all datasets, we employ a two-phase (“searching” and “fine-tuning”) learning rate schedule. In the searching phase we first train with a learning rate for epochs. In the fine-tuning phase we train with for another epochs. We always use a batch size of 200. For the autoencoder, SS-DGM, and the supervised classifier, we initialize the network with uniform Glorot weights (Glorot and Bengio, 2010). For Deep SVDD and Deep SAD, we establish an unsupervised pre-training routine via autoencoder as explained in Appendix A. We set the network to be the encoder of the autoencoder that we train beforehand.
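A two-phase schedule of this kind can be implemented with a standard milestone scheduler. The learning rates, epoch counts, and drop factor below are hypothetical placeholders, since the concrete values are not reproduced here, and the stand-in model is only there to give the optimizer some parameters.

```python
import torch
from torch import nn

model = nn.Linear(4, 2, bias=False)                        # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # hypothetical searching rate
search_epochs, finetune_epochs = 3, 2                      # hypothetical epoch counts
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[search_epochs], gamma=0.1       # hypothetical drop factor
)

lrs = []
for epoch in range(search_epochs + finetune_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one pass over the mini-batches would go here ...
    optimizer.step()
    scheduler.step()
```

The recorded `lrs` stay at the initial rate during the searching phase and drop by the factor `gamma` once the milestone epoch is reached.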
Besides the tables that list the complete experimental results of all the methods, we provide AUC scatterplots of the best (1st) vs. second best (2nd) performing methods in the experimental scenarios (i)–(iii) on the most complex CIFAR-10 dataset. If many points fall above the identity line, this is a strong indication that the best method indeed significantly outperforms the second best, which is often the case for Deep SAD.