Monitoring of complex systems and processes often goes hand in hand with anomaly detection. An anomaly here means a representation of abnormal system behavior. Information on normal system behavior is often available in abundance, compared to samples of abnormal behavior. In some cases the anomalies are rare, or the distribution of anomalies is highly skewed. Given the high variability of anomalies, some types of anomalies may therefore be missing from the training data set. In other cases, when anomalous examples are obtained by means other than sampling the target system, or when the distribution of anomalies evolves over time, some types of anomalies might even be unknown in principle. A good realistic data set with anomalous behavior is provided by the KDD-99 Cup (KDD, 1999), with certain families of cyber-attacks present only in the test sample.
Conventional approaches for anomaly detection often involve one-class classification methods (Chalapathy et al., 2018; Tax and Duin, 2001; Ruff et al., 2018; Liu et al., 2008; Schölkopf and Smola, 2002), which yield a soft boundary between the normal class region and the rest of the feature space. Usually such methods are referred to as unsupervised, since they do not take into account labels of the available data. As this piece of information might be important, one-class methods can suffer degraded performance in cases with significant overlap between normal and abnormal samples in the feature space.
There is a rich profusion of two-class supervised classification methods that account for both class labels, leading to better results in the presence of labeled abnormal samples. However, those methods lack any guarantees for predictions outside of the regions of the feature space represented in the training data. This becomes especially problematic for incomplete anomalous samples, as a classifier might consistently make false-positive predictions for unseen anomalies.
Contribution. In this study, we develop a method aimed at combining the best of the one-class and two-class approaches, which we refer to as (1 + ε)-class classification (’one plus epsilon’, or OPE for short). To achieve that, we derive two one-class objectives and combine them with the binary cross-entropy loss. We compare these objectives with respect to computational effectiveness, and demonstrate performance on several data sets that are either collected for anomaly detection tasks (KDD, 1999), or artificially under-sampled to emulate these conditions (Baldi et al., 2014; LeCun et al., 1998; Krizhevsky and Hinton, 2009; Lake et al., 2015).
Notation. We assume that an N-dimensional feature space $\mathcal{X}$ contains samples of two classes: normal (positive) $C^+$ and abnormal (negative) $C^-$. We are interested in identifying instances of the single class $C^+$. There are two principal approaches: one-class (unitary classification) and two-class (binary classification). The former might rely on estimation of the likelihood of the positive class $P(x \mid C^+)$, to which one can then apply a threshold to make a final decision. We will refer to any solution of the form $h(P(x \mid C^+))$, where $h$ is a monotonic function, as a unitary classification solution. The latter relies on estimation of the posterior conditional distribution $P(C^+ \mid x)$, which is usually approximated through minimization of the cross-entropy loss function $L_2$:

$$L_2(f) = -P(C^+)\, \mathbb{E}\left[\log f(x) \mid C^+\right] - P(C^-)\, \mathbb{E}\left[\log\left(1 - f(x)\right) \mid C^-\right], \tag{1}$$

where $\mathbb{E}[\,\cdot \mid C]$ denotes conditional expectation given class $C$, and $f$ is the classifier’s decision function.

The optimal binary decision function $f_2^*$ that minimizes $L_2$ can be expressed with the help of Bayes’ rule as

$$f_2^*(x) = P(C^+ \mid x) = \frac{P(x \mid C^+)\, P(C^+)}{P(x \mid C^+)\, P(C^+) + P(x \mid C^-)\, P(C^-)}, \tag{2}$$

where $P(C)$ is the class prior probability, $P(C \mid x)$ is the posterior conditional distribution, and $P(x \mid C)$ is the likelihood for the given class $C$.
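As a concrete illustration of the Bayes-optimal binary decision function, the sketch below evaluates the posterior of the positive class for two one-dimensional Gaussian classes. The class densities, the prior, and all function names are illustrative assumptions and are not taken from the experiments.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_posterior(x, prior_pos=0.5):
    """Posterior probability of the positive class via Bayes' rule.

    Assumed toy setting: positive class ~ N(0, 1), negative class ~ N(3, 1).
    """
    joint_pos = gaussian_pdf(x, 0.0, 1.0) * prior_pos
    joint_neg = gaussian_pdf(x, 3.0, 1.0) * (1.0 - prior_pos)
    return joint_pos / (joint_pos + joint_neg)

for x in (-6.0, 0.0, 1.5, 3.0):
    print(x, bayes_posterior(x))
```

Note that at x = -6, where neither class has appreciable density, the posterior is still close to 1. This is the behaviour that makes a purely two-class solution unreliable away from the training data.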
2 One plus epsilon method
Let’s consider a simple case: the negative class $C^-$ is a uniform distribution $U$, with the support covering that of $C^+$. If we put this into Equation 1 (assuming equal class priors), the minimizer becomes

$$f_1^*(x) = \frac{P(x \mid C^+)}{P(x \mid C^+) + c},$$

where $c$ is the probability density of distribution $U$. Note that $f_1^*$ is a strictly increasing function of $P(x \mid C^+)$ and hence a unitary classification solution. In other words, classifying a given class against a uniformly distributed one yields a unitary classification solution.
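The monotonic relation between this solution and the positive-class density can be checked numerically. The sketch below assumes a standard normal positive class and a uniform negative class on [-5, 5] with equal priors; these choices and the function names are illustrative only.

```python
import math

def positive_density(x):
    """Assumed toy positive class: standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def one_class_solution(x, low=-5.0, high=5.0):
    """Optimal classifier against a uniform negative class on [low, high]
    with equal priors: p(x) / (p(x) + c), where c = 1 / (high - low)."""
    c = 1.0 / (high - low)
    p = positive_density(x)
    return p / (p + c)

points = [-4.0, -2.0, -0.5, 0.0, 1.0, 3.0]
by_density = sorted(points, key=positive_density)
by_classifier = sorted(points, key=one_class_solution)

# Both orderings coincide, so thresholding the classifier output is
# equivalent to thresholding the density itself.
print(by_density == by_classifier)
```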
2.1 Adding known negative samples
Let’s take into account known anomalous samples. We propose the following loss function — linear combination of one-class classification loss and cross-entropy loss :
where compensates for the difference in class prior probabilities. Ideally, it should be set to , so that the first two terms match the cross-entropy loss. is a hyper-parameter that allows one to choose the trade-off between unitary and binary classification solutions. We call the loss the OPE loss. It leads to the following solution:
An important observation can be made — for large capacity models, even close to , leads to a significantly different solution in comparison to the two-class classification solution (Equation 2). This effect can be seen on Figure 1.
One might consider the term as a regularization term that biases the solution towards 0 everywhere, but this effect is especially pronounced in points with . One distinguishing feature of the regularization term is that it acts directly on predictions rather than on parameters, which makes it applicable to any classifier model. (Technically, regularization in the one-class SVM objective (Schölkopf et al., 2000) and similar methods can also be considered to act directly on predictions, since these are linear models.)
Estimating term in Equation 3 for low-dimensional feature-space is straightforward — if can be bounded by a simple set (e.g. a box), then can be estimated by directly sampling from . We refer to this class of OPE algorithms as brute-force OPE.
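A brute-force estimate of such a loss can be sketched as follows. The exact form of Equation 3 is not reproduced in this text, so the code assumes the three-term structure described in the prose: a cross-entropy term on positive samples, a weighted cross-entropy term on known negative samples, and a third term, weighted by a small constant, over points sampled uniformly from a bounding box. The names `gamma`, `epsilon`, `toy_classifier` and `brute_force_ope_loss` are hypothetical.

```python
import math
import random

def toy_classifier(x):
    """Hypothetical decision function with high output near the origin."""
    r2 = sum(v * v for v in x)
    return 1.0 / (1.0 + math.exp(r2 - 2.0))

def brute_force_ope_loss(f, positives, negatives, box,
                         gamma=1.0, epsilon=0.1, n_uniform=2000, seed=0):
    """Assumed three-term loss; the uniform term is sampled from a box."""
    rng = random.Random(seed)
    tiny = 1e-12  # guards against log(0)

    def loss_pos(x):
        return -math.log(max(f(x), tiny))

    def loss_neg(x):
        return -math.log(max(1.0 - f(x), tiny))

    # Uniform samples from the bounding box of the feature space.
    uniform = [[rng.uniform(lo, hi) for lo, hi in box] for _ in range(n_uniform)]
    pos_term = sum(loss_pos(x) for x in positives) / len(positives)
    neg_term = (sum(loss_neg(x) for x in negatives) / len(negatives)
                if negatives else 0.0)
    uniform_term = sum(loss_neg(x) for x in uniform) / len(uniform)
    return pos_term + gamma * neg_term + epsilon * uniform_term

positives = [[0.1, -0.2], [0.3, 0.0], [-0.4, 0.2]]
negatives = [[2.0, 2.0]]
box = [(-3.0, 3.0), (-3.0, 3.0)]
print(brute_force_ope_loss(toy_classifier, positives, negatives, box))
```

With `epsilon=0` and an empty list of negatives, the function reduces to the positive-class term alone, which makes the role of each term easy to inspect in isolation.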
2.2 Energy-based regularization
For high-dimensional feature space, however, sampling directly from
might be problematic, due to a potentially high variance of the gradients produced by the regularization term. One possible strategy for reducing the variance of gradient estimates is to sample from another distribution :
Jin et al. (2017) employ this method and use the distribution induced by the model
at the previous training epoch:
where: — normalization term, — indicator function. Hence, can be written as:
Sampling from is computationally expensive, and various methods can be used, e.g. Hamiltonian Monte-Carlo (Duane et al., 1987). However, this transformation merely transfers the computationally heavy integration part from uniform sampling to estimation of the normalization term . In order to avoid recomputing on each epoch, a two-stage training procedure is proposed by Jin et al. (2017) and Tu (2007):
freeze sampling distribution , estimate ;
using this frozen distribution perform a number of stochastic gradient descent steps.
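The variance-reduction idea behind sampling from another distribution can be illustrated in one dimension. The sketch below estimates the expectation of -log(1 - f(x)) under a uniform distribution in two ways: by direct uniform sampling, and by importance sampling from a proposal concentrated where the integrand is large. For simplicity the proposal is a fixed standard normal rather than a distribution induced by the model, and the toy model f(x) = sigmoid(2 - x^2) is an assumption made only for this example.

```python
import math
import random

LOW, HIGH = -5.0, 5.0
U_DENSITY = 1.0 / (HIGH - LOW)

def integrand(x):
    """-log(1 - f(x)) for the assumed toy model f(x) = sigmoid(2 - x^2)."""
    return math.log1p(math.exp(2.0 - x * x))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mean_and_variance(values):
    m = sum(values) / len(values)
    v = sum((t - m) ** 2 for t in values) / (len(values) - 1)
    return m, v

rng = random.Random(0)
n = 20000

# Direct estimate: average the integrand over uniform samples.
uniform_terms = [integrand(rng.uniform(LOW, HIGH)) for _ in range(n)]

# Importance sampling: draw from the proposal, reweight by u(x) / q(x).
importance_terms = []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    weight = U_DENSITY / normal_pdf(x) if LOW <= x <= HIGH else 0.0
    importance_terms.append(integrand(x) * weight)

uniform_estimate, uniform_var = mean_and_variance(uniform_terms)
importance_estimate, importance_var = mean_and_variance(importance_terms)

# Reference value by the midpoint rule on a fine grid.
steps = 100000
width = (HIGH - LOW) / steps
reference = sum(integrand(LOW + (i + 0.5) * width) for i in range(steps)) * width * U_DENSITY

print(reference, uniform_estimate, importance_estimate)
print(uniform_var, importance_var)
```

Both estimators target the same value, but the per-sample variance of the reweighted terms is much smaller here because the proposal places most samples where the integrand is non-negligible.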
Note, that as long as a regularization term shifts the decision function towards outside of , and has a small impact within, it suffices for the purposes of anomaly detection. With that idea in mind, we propose the following approximation of regularization term to avoid uniform sampling and integration.
Let’s introduce , where
In case of , using Jensen’s inequality, we can approximate an upper bound of as follows:
which leads to the following one-class loss function:
then the corresponding energy OPE (EOPE) loss function is
Gradients of can be easily estimated (see e.g. Bengio et al. (2009)):
Note that Equation 9 essentially describes the negative phase of the contrastive divergence algorithm for energy-based models. Similar relations between the cross-entropy loss and contrastive divergence have also been mentioned by Kim and Bengio (2016).
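The identity usually associated with the negative phase states that the gradient of the log-normalization term equals the expected gradient of the unnormalized log-density under the model distribution. The sketch below checks this numerically on a one-dimensional grid. The toy log-density -theta * x^2 and all names are assumptions chosen so that the result can be compared with the closed form -1 / (2 * theta); this is not the model used in the paper.

```python
import math

N_CELLS = 2000
WIDTH = 10.0 / N_CELLS
GRID = [-5.0 + (i + 0.5) * WIDTH for i in range(N_CELLS)]

def log_density(x, theta):
    """Assumed toy unnormalized log-density: -theta * x^2."""
    return -theta * x * x

def log_density_grad_theta(x, theta):
    """Derivative of the toy log-density with respect to theta."""
    return -x * x

def log_partition(theta):
    """log of the normalization term, by the midpoint rule on the grid."""
    return math.log(sum(math.exp(log_density(x, theta)) for x in GRID) * WIDTH)

def negative_phase_gradient(theta):
    """Expectation of d(log-density)/d(theta) under the normalized model."""
    weights = [math.exp(log_density(x, theta)) for x in GRID]
    total = sum(weights)
    return sum(w * log_density_grad_theta(x, theta)
               for w, x in zip(weights, GRID)) / total

theta = 0.7
h = 1e-5
finite_difference = (log_partition(theta + h) - log_partition(theta - h)) / (2.0 * h)
expectation = negative_phase_gradient(theta)
print(finite_difference, expectation, -1.0 / (2.0 * theta))
```

In practice the expectation cannot be computed on a grid and is replaced by an average over Monte-Carlo samples from the model distribution.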
As discussed above, the main goal of the regularization term is to enforce one-class properties, namely, to make the solution a monotonic transformation of . The following theorem shows that, despite being just an approximation of regularization, loss always leads to a one-class solution.
Theorem 1. Let be a Banach space, and let be a continuous probability density function such that is an open set in . If a continuous function minimizes (defined by Equation 7) with , then there exists a strictly increasing function , such that . Moreover, (if ).
Intuitively, it is clear, that if the dependency between and is violated in some regions, energy can be exchanged between these regions with a total reduction in the loss. A similar argument can be made for the property: — energy of low-density regions can be transferred to a high-density region, leading to an improved solution. A more formal proof can be found in Appendix B.
3 Implementation details
While OPE and EOPE losses are independent of any particular choice of model , we consider only neural networks. We optimize all neural networks with a stochastic gradient method, namely the Adam algorithm (Kingma and Ba, 2014). Algorithms 1 and 2 outline the proposed methods.
Estimation of is tightly linked to the negative phase of energy-based generative models. A traditional approach for sampling from is to employ Monte-Carlo (MC) methods; in this work we use Hamiltonian Monte-Carlo (HMC). Additionally, in our experiments we use persistent MC chains following Tieleman (2008). Nevertheless, the use of MC significantly slows down the training procedure, since in general multiple passes through the network are required for generating negative samples.
Note that for values of close to 1, both and have a significant impact only in the regions with low probability density . This suggests that solutions of Equations 3 and 8 are relatively robust to improper sampling procedures, and one might achieve faster training without sacrificing much quality by employing fast approximate MC procedures. In our experiments we observed that the following highly degenerate instance of HMC performs well:
where denotes the Hadamard product, is distributed normally with zero mean and unit covariance matrix, and controls the impact of the random noise. We refer to the methods utilising such sampling as RMSProp-EOPE, since the procedure resembles the RMSProp optimization algorithm (Tieleman and Hinton, 2012).
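The update rule itself is not reproduced in this text, so the following is only one plausible reading of an RMSProp-like sampler and should not be taken as the authors' exact procedure. It performs gradient ascent on a log-density, normalizes each coordinate of the gradient by a running average of its squared values, and adds Gaussian noise. The toy log-density, the step size, the noise scale, and all names are assumptions.

```python
import math
import random

def log_density_grad(x):
    """Gradient of an assumed toy log-density -||x||^2."""
    return [-2.0 * v for v in x]

def rmsprop_like_chain(x0, n_steps=200, step=0.1, noise=0.05,
                       decay=0.9, delta=1e-6, seed=0):
    """Noisy gradient ascent with RMSProp-style gradient normalization."""
    rng = random.Random(seed)
    x = list(x0)
    second_moment = [0.0] * len(x)
    for _ in range(n_steps):
        grad = log_density_grad(x)
        for i, g in enumerate(grad):
            # Running average of squared gradients, as in RMSProp.
            second_moment[i] = decay * second_moment[i] + (1.0 - decay) * g * g
            drift = step * g / (math.sqrt(second_moment[i]) + delta)
            # Normalized gradient step plus scaled standard normal noise.
            x[i] += drift + noise * rng.gauss(0.0, 1.0)
    return x

start = [3.0, -3.0]
end = rmsprop_like_chain(start)
print(end)
```

The chain drifts from the starting point towards the high-density region around the origin and then fluctuates there. No accept/reject step is used, so the samples are only approximate.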
A completely different approach to negative phase sampling is described by Kim and Bengio (2016). The authors suggest using a separate network (generator) to produce samples from the target distribution. We also implement this sampling procedure and refer to the methods employing it as Deep EOPE.
In our experiments, we observe that methods based on the EOPE loss quickly lead to steep functions that heavily interfere with the sampling procedures. Following Tieleman and Hinton (2009), we add a small regularization term for predictions in pseudo-negative points:
where is a small constant ( in our experiments).
4 Relation to other methods
The idea of performing one-class classification (and the generative task) as ’one against everything’ appears in many studies. Tax and Duin (2001) propose constructing a hyper-sphere around positive samples, effectively separating them from the rest of the space; Ruff et al. (2018) and Chalapathy et al. (2018) extend this idea to deep neural networks. Ruff et al. (2018) rely on weight regularization, which acts in a similar manner to EOPE by limiting the area with high model output. OPE and EOPE methods depend only on the model’s output, which allows them to avoid limiting the number of layers (as in, for example, Chalapathy et al., 2018) and does not restrict the choice of network architecture (as in Ruff et al., 2018).
Tu (2007) and Jin et al. (2017) developed a method similar in nature to OPE. In fact, it is easy to see that the term as it appears in Equation 4 corresponds to the loss function from Jin et al. (2017). In this work we demonstrate that this loss is equivalent to the cross-entropy loss between a given class and a uniform distribution covering its support. The EOPE loss alleviates the computational expense associated with the estimation of the normalization term , and the RMSProp-like sampling procedure further accelerates training by reducing the computational cost of sampling.
5 Experiments
We evaluate the proposed methods on the following data sets: MNIST (LeCun et al., 1998), CIFAR (Krizhevsky and Hinton, 2009), KDD-99 (KDD, 1999), Omniglot (Lake et al., 2015), SUSY and HIGGS (Baldi et al., 2014). In order to reflect the assumptions behind our approach, we derive multiple tasks from each data set by varying the size of the anomalous subset.
As the proposed methods target problems intermediate between one-class and two-class problems, we compare our approaches against the following algorithms:
conventional two-class classification with the cross-entropy loss;
a semi-supervised method: dimensionality reduction by a deep AutoEncoder followed by a classifier with the cross-entropy loss;
Since not all of the evaluated algorithms allow for a probabilistic interpretation, the ROC AUC metric is reported. As the performance of certain algorithms (especially two-class classification) varies significantly depending on the choice of negative class, we run each experiment multiple times, and report the average and standard deviation of the metrics. The results are reported in Tables 4, 5, 6, 7, 8 and 9. A detailed description of the experimental setup can be found in Appendix A.
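The reporting scheme described above (ROC AUC per run, then mean and standard deviation across runs) can be sketched as follows. The labels and scores are made-up toy values for illustration and are not experimental results; the function name `roc_auc` is hypothetical.

```python
import statistics

def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive sample is scored above a
    negative one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Each run stands for one repetition with a different choice of negative class.
runs = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.6, 0.6, 0.7, 0.2]),
]
aucs = [roc_auc(labels, scores) for labels, scores in runs]
print(aucs)
print(statistics.mean(aucs), statistics.stdev(aucs))
```

Because ROC AUC depends only on the ranking of scores, it can be computed for methods whose outputs are not calibrated probabilities.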
In these tables, columns represent tasks with varying numbers of negative samples present in the training set: numbers in the header indicate either the number of classes that form the negative class (in the case of the MNIST, CIFAR, Omniglot and KDD data sets), or the number of negative samples used (HIGGS and SUSY); ‘one-class’ denotes the absence of known anomalous samples. As one-class algorithms do not take negative samples into account, their results are repeated for the tasks with known anomalies.
In our experiments, we make several observations. Firstly, the proposed methods generally outperform baseline methods, especially on problems with a significant overlap between classes (SUSY, HIGGS and, possibly, CIFAR), and consistently show comparable performance on test problems. Secondly, we observe performance increasing as more negative samples are included in the training set, while remaining consistently above or similar to that of conventional two-class classification. Lastly, to our surprise, brute-force OPE performs relatively well even on high-dimensional problems, which might indicate that gradients produced by its regularization term have variance sufficiently low for proper convergence.
The main drawback of the OPE and EOPE methods is slow training, which is largely due to the use of Monte-Carlo methods. It is partially alleviated by a fast approximation of Hamiltonian Monte-Carlo and the use of a generator (Kim and Bengio, 2016), and can potentially be improved further by advanced Monte-Carlo techniques (for example, Levy et al., 2017).
We present a new family of anomaly detection algorithms which can be efficiently applied to problems intermediate between one-class and two-class settings. Solutions produced by these methods combine the best features of one-class and two-class approaches. In contrast to conventional one-class approaches, the proposed methods can effectively utilise any number of known anomalous examples and, unlike conventional two-class classification, do not require a representative sample of anomalous data. Our experiments show better or comparable performance compared to conventional two-class and one-class algorithms. Our approach is especially beneficial for anomaly detection problems in which anomalous data is non-representative, or might evolve over time.
The research leading to these results has received funding from the Russian Science Foundation under grant agreement No. 17-72-20127.
References
- KDD (1999) KDD cup 1999 dataset: Intrusion detection system, 1999. URL https://archive.ics.uci.edu/ml/datasets/kdd+cup+1999+data.
- Baldi et al. (2014) Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
- Bengio et al. (2009) Yoshua Bengio et al. Learning deep architectures for AI. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
- Chalapathy et al. (2018) Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
- Duane et al. (1987) Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
- Jin et al. (2017) Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pages 823–833, 2017.
- Kim and Bengio (2016) Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation, 2016.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
- Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
- Ruff et al. (2018) Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International Conference on Machine Learning, pages 4390–4399, 2018.
- Schölkopf et al. (2000) Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C Platt. Support vector method for novelty detection. In Advances in neural information processing systems, pages 582–588, 2000.
- Schölkopf and Smola (2002) B. Schölkopf and A.J. Smola. Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Tax and Duin (2001) David MJ Tax and Robert PW Duin. Uniform object generation for optimizing one-class classifiers. Journal of machine learning research, 2(Dec):155–173, 2001.
- Tieleman (2008) Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
- Tieleman and Hinton (2009) Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
- Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Tu (2007) Zhuowen Tu. Learning generative models via discriminative approaches. In
- Zhou and Paffenroth (2017) Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.
Appendix A
This section provides a detailed description of the experimental setup.
In order to make a clear comparison between methods, network architectures are made as close as possible. For image data (MNIST, CIFAR, Omniglot) VGG-like networks (Simonyan and Zisserman, 2014) are used; for tabular data, 5-layer dense networks are used (the implementation can be found at https://gitlab.com/mborisyak/ope).
We evaluate the following proposed methods:
All OPE and EOPE models are trained with . All MCMC chains are persistent (by analogy with Tieleman and Hinton, 2009) and 4 MCMC steps are performed for each gradient step. All networks are optimized by the Adam algorithm (Kingma and Ba, 2014) with learning rate , , .
In order to reflect the assumptions behind the proposed methods, we derive several tasks from each original data set considered. For the SUSY, HIGGS and KDD-99 data sets the positive class is fixed according to the data sets’ descriptions; for the MNIST and CIFAR-10 data sets each class is considered as positive; for the Omniglot data set the ‘Braille’, ‘Futurama’ and ‘Greek’ alphabets are chosen as positive classes.
In order to fully demonstrate advantages of OPE and EOPE methods we vary sample sizes for negative class: for SUSY and HIGGS data sets only a small number of negative examples is randomly selected (, , , and ); for multi-class data sets several classes are randomly selected (without replacement) and subsampled, for MNIST, CIFAR and Omniglot data sets , , and classes are selected with examples from each, for KDD-99 maximum number of samples per class is limited by .
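For multi-class data sets, the task derivation described above can be sketched as follows: fix a positive class, pick a few of the remaining classes at random without replacement, and keep a limited number of examples from each. The function name, its parameters, and the toy label array are hypothetical; the actual class counts and sample sizes are those stated in the text, not the values used here.

```python
import random

def derive_task(labels, positive_class, n_negative_classes, per_class, seed=0):
    """Return indices of positive and sub-sampled negative training examples."""
    rng = random.Random(seed)
    candidates = sorted(set(labels) - {positive_class})
    # Negative classes are drawn without replacement.
    chosen = rng.sample(candidates, n_negative_classes)
    positives = [i for i, y in enumerate(labels) if y == positive_class]
    negatives = []
    for cls in chosen:
        members = [i for i, y in enumerate(labels) if y == cls]
        # Keep at most `per_class` examples of each selected negative class.
        negatives.extend(rng.sample(members, min(per_class, len(members))))
    return positives, negatives, chosen

# Toy label array: 10 classes with 20 examples each.
labels = [i % 10 for i in range(200)]
positives, negatives, chosen = derive_task(labels, positive_class=3,
                                           n_negative_classes=2, per_class=5)
print(len(positives), len(negatives), chosen)
```

Only the training indices are sub-sampled in this sketch, which matches the requirement that test sets stay unmodified.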
Original train-test splits are respected when possible (for the SUSY and HIGGS data sets, splits are random and fixed for all derived tasks); test sets are not modified in any way.
Appendix B
Here we provide a formal proof of Theorem 1 from Section 2.2. For the sake of simplicity we split the proof into two lemmas.
Lemma 1. Let be a Banach space, and let be a continuous probability density function such that is an open set in . If a continuous function minimizes (defined by Equation 7) with , then there exists a strictly increasing function , such that .
Proof. Consider a continuous function . We show that if cannot be represented as , then does not minimize . This is demonstrated by constructing another continuous function that achieves a lower loss than .
If cannot be represented as then a pair of points and can be found such that and .
Due to continuity of and , it is possible to find such neighborhoods of and , that the difference in probability densities remains large, while differences in values of become insignificant or negative. More formally, for every there exists such that open balls and , satisfy the following properties:
We define function as , where , ; the exact form of is not important, nevertheless, for clarity, let
We restrict our attention to such values of and , that has the same normalization constant as :
Equation 12 implies that and do not intersect and, since for , consists of two non-zero terms:
For every there exists a unique such that is a solution for Equation 15. Notice also that is a continuous, strictly increasing function and .
Notice, that for small values of and
Similarly to , can be split into two parts:
Note, that for a positive
where: , , , .
Now, our aim is to prove that has a solution in form :
Note, that for each , and for each , . In combination with Equation 16, this implies that Inequality 21 is satisfied for some and , therefore, and are simultaneously satisfied for some and . This implies, that function has the same normalization constant as the original one, and reduces value of , hence, does not minimize , which concludes this proof.
Lemma 2. For every function that satisfies Lemma 1:
Proof. Suppose that .
For every sufficiently small , we can pick points , radius and two open balls , such that
Now we can introduce the same definitions and constructs as in Lemma 1, applied for , and . Consider , such that (such values always exist due to Equation 16). Note that since , for every , (defined by Equation 19) is bounded from below