1 Introduction
Monitoring of complex systems and processes often goes hand in hand with anomaly detection, where an anomaly is a manifestation of abnormal system behavior. Information on normal system behavior is usually available in abundance, while samples of abnormal behavior are scarce: anomalies may be rare, or the distribution of anomalies may be highly skewed. Combined with the high variability of anomalies, this means that some types of anomalies are likely to be missing from the training data set. In other cases, when anomalous examples are obtained by means other than sampling the target system, or when the distribution of anomalies evolves over time, some types of anomalies might even be unknown in principle. A good realistic data set with anomalous behavior is provided by the KDD99 Cup
(KDD, 1999), with certain families of cyberattacks present only in the test sample.

Conventional approaches to anomaly detection often involve one-class classification methods (Chalapathy et al., 2018; Tax and Duin, 2001; Ruff et al., 2018; Liu et al., 2008; Schölkopf and Smola, 2002), which yield a soft boundary between the normal-class region and the rest of the feature space. Such methods are usually referred to as unsupervised, since they do not take the labels of the available data into account. As this piece of information might be important, these one-class methods potentially lead to performance degradation in cases with significant overlap between normal and abnormal samples in the feature space.
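For intuition, a one-class detector of this kind can be sketched with a simple kernel density score fitted on normal samples only; the bandwidth, data, and score comparison below are purely illustrative and are not one of the methods evaluated in this paper:

```python
import numpy as np

def kde_scores(train, query, bandwidth=0.5):
    """Average Gaussian kernel density of each query point w.r.t. the
    training (normal) samples; higher score = more 'normal-looking'."""
    # pairwise squared distances, shape (n_query, n_train)
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # abundant normal data
inlier = np.array([[0.0, 0.0]])
outlier = np.array([[6.0, 6.0]])               # an anomaly never seen in training

s_in = kde_scores(normal, inlier)[0]
s_out = kde_scores(normal, outlier)[0]
assert s_in > s_out
```

Thresholding such a score yields the soft boundary mentioned above; note that nothing in the fit uses labels, which is exactly why overlap between classes degrades such detectors.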
There is a rich profusion of two-class supervised classification methods that account for both class labels, leading to better results in the presence of labeled abnormal samples. However, these methods lack any guarantees for predictions outside of the regions of the feature space represented in the training data. This becomes especially problematic for incomplete anomalous samples, as a classifier might consistently make false-positive predictions for unseen anomalies.
Contribution. In this study, we develop a method aimed at combining the best of the two approaches, one-class and two-class, which we refer to as (1+ε)-class classification ('one plus epsilon', or OPE for short). In order to achieve that, we derive two one-class objectives and combine them with the binary cross-entropy loss. We compare these objectives with respect to computational efficiency, and demonstrate performance on several data sets that are either collected for anomaly detection tasks (KDD, 1999) or artificially undersampled to emulate these conditions (Baldi et al., 2014; LeCun et al., 1998; Krizhevsky and Hinton, 2009; Lake et al., 2015).
Notation. We assume that an N-dimensional feature space $X \subseteq \mathbb{R}^N$ contains samples of two classes: normal (positive, $y = 1$) and abnormal (negative, $y = 0$). We are interested in identifying instances of the single class $y = 1$. There are two principal approaches: one-class (unitary classification) and two-class (binary classification). The former might rely on estimation of the likelihood of the positive class $P(x \mid y = 1)$, so that one can then apply a threshold to make a final decision. We will refer to any solution of the form $g(P(x \mid y = 1))$, where $g$ is a monotonous function, as a unitary classification solution. The latter relies on estimation of the posterior conditional distribution $P(y \mid x)$, which is usually approximated through minimization of the cross-entropy loss function:
(1)  $\mathcal{L}(f) = -\mathbb{E}_{x, y}\left[ y \log f(x) + (1 - y) \log(1 - f(x)) \right],$
where $\mathbb{E}_{x, y}$ denotes expectation over samples and labels, and $f$ denotes the classifier's decision function.
The optimal binary decision function $f^*$ that minimizes $\mathcal{L}$ can be expressed with the help of Bayes' rule as

(2)  $f^*(x) = P(y = 1 \mid x) = \frac{\rho_1 P_1(x)}{\rho_1 P_1(x) + \rho_0 P_0(x)},$
where $\rho_y = P(y)$ is the class prior probability, $P(y \mid x)$ is the posterior conditional distribution, and $P_y(x) = P(x \mid y)$ is the likelihood for the given class $y$.

2 One plus epsilon method
Let's consider a simple case: the negative class is a uniform distribution $U(\Omega)$, with the support $\Omega$ covering that of the positive class. If we put this into Equation 1 (assuming equal class priors), we get

$\mathcal{L}_1(f) = -\mathbb{E}_{x \sim P_1} \log f(x) - \mathbb{E}_{x \sim U} \log(1 - f(x)),$

which is minimized by $f^*(x) = p(x) / (p(x) + u)$, where $p$ is the probability density of the distribution $P_1$ and $u$ is the (constant) density of $U(\Omega)$. Note that $f^*$ is a unitary classification solution; therefore, solving a classification problem between a given class and a uniformly distributed one yields a unitary classification solution.
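This observation can be illustrated numerically: for a 1-D Gaussian positive class against a uniform negative class on a bounding interval, the cross-entropy minimizer $p(x)/(p(x)+u)$ can be evaluated on a grid and checked to be a monotonous function of the density (a sketch under the equal-priors assumption above):

```python
import numpy as np

# positive class: standard normal density p(x); negative: uniform on [-5, 5]
x = np.linspace(-4.9, 4.9, 201)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
u = 1.0 / 10.0                       # uniform density on the interval
f_star = p / (p + u)                 # minimizer of the class-vs-uniform cross-entropy

# f_star is a monotonous function of p: sorting by p sorts f_star identically
order = np.argsort(p)
assert np.all(np.diff(f_star[order]) >= 0)
```

Thresholding `f_star` is therefore equivalent to thresholding the density itself, i.e. a unitary classification solution.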
2.1 Adding known negative samples
Let's take known anomalous samples into account. We propose the following loss function, a linear combination of the one-class classification loss $\mathcal{L}_1$ and the cross-entropy loss $\mathcal{L}$:

(3)  $\mathcal{L}_{\mathrm{OPE}}(f) = -\mathbb{E}_{x \sim P_1} \log f(x) - \gamma\, \mathbb{E}_{x \sim P_0} \log(1 - f(x)) - \varepsilon\, \mathbb{E}_{x \sim U} \log(1 - f(x)),$

where $\gamma$ compensates for the difference in class prior probabilities. Ideally, it should be set to $\rho_0 / \rho_1$, so that the first two terms match the cross-entropy loss. $\varepsilon > 0$ is a hyperparameter that allows one to choose the trade-off between the unitary and binary classification solutions. We call this loss the OPE loss. It leads to the following solution:

$f^*(x) = \frac{p_1(x)}{p_1(x) + \gamma\, p_0(x) + \varepsilon\, u},$

where $p_1$ and $p_0$ denote the densities of $P_1$ and $P_0$.
An important observation can be made: for large-capacity models, even $\varepsilon$ close to 0 leads to a significantly different solution in comparison to the two-class classification solution (Equation 2). This effect can be seen in Figure 1.
One might consider the term $\varepsilon\, \mathbb{E}_{x \sim U}[-\log(1 - f(x))]$ as a regularization term that biases the solution towards 0 everywhere, although this effect is especially pronounced in points with low positive-class density. One distinguishing feature of this regularization term is that it acts directly on predictions rather than on parameters^{1}, which makes it applicable to any classifier model. ^{1}Technically, regularization in the one-class SVM objective (Schölkopf et al., 2000) and similar methods can also be considered to act directly on predictions, since these are linear models.
Estimating the term $\mathbb{E}_{x \sim U}[-\log(1 - f(x))]$ in Equation 3 for a low-dimensional feature space is straightforward: if the support of $P_1$ can be bounded by a simple set $\Omega$ (e.g. a box), then the term can be estimated by directly sampling from $U(\Omega)$. We refer to this class of OPE algorithms as brute-force OPE.
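A brute-force OPE objective for a toy logistic model might be sketched as follows; the loss structure follows Equation 3, while the model, box bounds, and constants ($\gamma = 1$, $\varepsilon = 0.1$, 256 uniform samples) are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ope_loss(w, b, x_pos, x_neg, box_lo, box_hi, gamma=1.0, eps=0.1, n_uniform=256):
    """Brute-force OPE loss: cross-entropy on labeled samples plus a
    uniform-negative term estimated by sampling from a bounding box."""
    f = lambda x: sigmoid(x @ w + b)
    x_uni = rng.uniform(box_lo, box_hi, size=(n_uniform, x_pos.shape[1]))
    pos_term = -np.mean(np.log(f(x_pos) + 1e-12))      # positive-class term
    neg_term = -np.mean(np.log(1 - f(x_neg) + 1e-12))  # known anomalies
    uni_term = -np.mean(np.log(1 - f(x_uni) + 1e-12))  # uniform regularizer
    return pos_term + gamma * neg_term + eps * uni_term

x_pos = rng.normal(0.0, 1.0, size=(200, 2))
x_neg = rng.normal(4.0, 1.0, size=(50, 2))
val = ope_loss(np.zeros(2), 0.0, x_pos, x_neg, box_lo=-8.0, box_hi=8.0)
```

For the constant classifier $f \equiv 1/2$ (zero weights) the loss equals $(1 + \gamma + \varepsilon)\log 2$, which makes the sketch easy to sanity-check.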
2.2 Energy-based regularization
For a high-dimensional feature space, however, sampling directly from $U(\Omega)$ might be problematic due to a potentially high variance of the gradients produced by the regularization term. One possible strategy for reducing the variance of the gradient estimates is to sample from another distribution $Q$ with density $q$:

(4)  $R(f) = \mathbb{E}_{x \sim U}\left[-\log(1 - f(x))\right] = \mathbb{E}_{x \sim Q}\left[\frac{u(x)}{q(x)}\left(-\log(1 - f(x))\right)\right].$
Jin et al. (2017) employ this method and use the distribution induced by the model $f$ at the previous training epoch:

$q(x) = \frac{1}{Z} f(x)\, \mathbb{I}[x \in \Omega],$

where $Z = \int_\Omega f(x)\, dx$ is the normalization term and $\mathbb{I}$ is the indicator function. Hence, $R$ can be written as:

$R(f) = Z \cdot \mathbb{E}_{x \sim Q}\left[\frac{u(x)}{f(x)}\left(-\log(1 - f(x))\right)\right].$
Sampling from $Q$ is computationally expensive, and various methods can be used, e.g. Hamiltonian Monte-Carlo (Duane et al., 1987). However, this transformation merely transfers the computationally heavy integration part from uniform sampling to estimation of the normalization term $Z$. In order to avoid recomputing $Z$ on each epoch, a two-stage training procedure is proposed by Jin et al. (2017) and Tu (2007):

freeze the sampling distribution $Q$ and estimate $Z$;

using this frozen distribution, perform a number of stochastic gradient descent steps.
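For reference, one HMC step of the kind used for sampling from $Q$ can be sketched as follows; this is a generic textbook leapfrog update with a Metropolis correction, not the exact implementation used in the experiments:

```python
import numpy as np

def hmc_step(x, neg_energy_grad, neg_energy, step=0.1, n_leapfrog=10, rng=None):
    """One Hamiltonian Monte-Carlo step targeting a density ∝ exp(-E(x)).
    neg_energy(x) returns -E(x); neg_energy_grad(x) returns its gradient."""
    if rng is None:
        rng = np.random.default_rng()
    p0 = rng.normal(size=x.shape)          # resample momentum
    x_new, p = x.copy(), p0.copy()
    # leapfrog integration of the Hamiltonian dynamics
    p += 0.5 * step * neg_energy_grad(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += step * p
        p += step * neg_energy_grad(x_new)
    x_new += step * p
    p += 0.5 * step * neg_energy_grad(x_new)
    # Metropolis accept/reject keeps the target distribution invariant
    log_accept = (neg_energy(x_new) - 0.5 * p @ p) - (neg_energy(x) - 0.5 * p0 @ p0)
    return x_new if np.log(rng.uniform()) < log_accept else x
```

Here `neg_energy` would be built from the current model, e.g. as the logit $\log(f(x)/(1-f(x)))$ discussed below.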
Note that, as long as a regularization term shifts the decision function towards 0 outside of the support of $P_1$ and has a small impact within, it suffices for the purposes of anomaly detection. With that idea in mind, we propose the following approximation of the regularization term, which avoids uniform sampling and integration:

(5)  $R'(f) = \log \int_\Omega \frac{f(x)}{1 - f(x)}\, dx.$
Note that the ratio $\frac{f(x)}{1 - f(x)}$ in Equation 5 matches the definition of the (negative) energy used in energy-based generative models (Bengio et al., 2009): $\frac{f(x)}{1 - f(x)} = \exp(-E(x))$, i.e. $f(x) = \sigma(-E(x))$.
In the case of $|\Omega| = 1$, using Jensen's inequality, we can approximate the upper bound of $R$ as follows:

(6)  $R(f) = \mathbb{E}_{x \sim U} \log\left(1 + e^{-E(x)}\right) \le \log\left(1 + \int_\Omega e^{-E(x)}\, dx\right) \approx \log \int_\Omega e^{-E(x)}\, dx = R'(f),$
which leads to the following one-class loss function:

(7)  $\mathcal{L}_E(f) = -\mathbb{E}_{x \sim P_1} \log f(x) + \varepsilon\, R'(f);$
then the corresponding energy OPE (EOPE) loss function is

(8)  $\mathcal{L}_{\mathrm{EOPE}}(f) = -\mathbb{E}_{x \sim P_1} \log f(x) - \gamma\, \mathbb{E}_{x \sim P_0} \log(1 - f(x)) + \varepsilon\, R'(f).$
The gradients of $R'$ can be easily estimated (see e.g. Bengio et al. (2009)):

(9)  $\nabla_\theta R'(f) = -\mathbb{E}_{x \sim Q_E} \nabla_\theta E(x; \theta),$ where the density of $Q_E$ is $q_E(x) \propto \exp(-E(x))\, \mathbb{I}[x \in \Omega]$.
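The identity behind Equation 9 (the gradient of a log-partition term equals an expectation of the negative energy gradient under the induced density) can be checked numerically on a discretized 1-D example; the quadratic parametric energy here is purely illustrative:

```python
import numpy as np

# parametric energy E(x; θ) = θ·x² on a bounded Ω = [-3, 3], discretized
x = np.linspace(-3.0, 3.0, 2001)
dx = x[1] - x[0]
theta = 0.7

def log_z(t):
    # log ∫_Ω exp(-E(x; t)) dx, rectangle rule
    return np.log(np.sum(np.exp(-t * x**2)) * dx)

# left-hand side: numerical derivative of log Z with respect to θ
lhs = (log_z(theta + 1e-5) - log_z(theta - 1e-5)) / 2e-5
# right-hand side: E_{x~q}[-∂E/∂θ] with q(x) ∝ exp(-E(x; θ))
q = np.exp(-theta * x**2)
q = q / q.sum()
rhs = np.sum(q * (-x**2))
assert abs(lhs - rhs) < 1e-6
```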
Note that Equation 9 essentially describes the negative phase of the contrastive divergence algorithm for energy-based models. Similar relations between the cross-entropy loss and contrastive divergence have also been mentioned by Kim and Bengio (2016).

As discussed above, the main goal of the regularization term is to enforce one-class properties, namely, to make the solution a monotonous transformation of the density $p$. The following theorem shows that, despite $R'$ being just an approximation of the regularization term $R$, the loss $\mathcal{L}_E$ always leads to a one-class solution.
Theorem 1. Let $X$ be a Banach space and $p : X \to \mathbb{R}$ a continuous probability density function such that $\Omega_p = \{x : p(x) > 0\}$ is an open set in $X$. If a continuous function $f$ minimizes $\mathcal{L}_E$ (defined by Equation 7) with $\varepsilon > 0$, then there exists a strictly increasing function $g$ such that $f(x) = g(p(x))$. Moreover, $g(0) = 0$ (if $p$ vanishes anywhere).

Intuitively, it is clear that if the monotonous dependency between $f(x)$ and $p(x)$ is violated in some regions, energy can be exchanged between these regions with a total reduction in the loss. A similar argument can be made for the property $g(0) = 0$: the energy of low-density regions can be transferred to a high-density region, leading to an improved solution. A more formal proof can be found in Appendix B.
3 Implementation details
While the OPE and EOPE losses are independent of any particular choice of the model $f$, we consider only neural networks. We optimize all neural networks with a stochastic gradient method (namely, the Adam algorithm of Kingma and Ba (2014)). Algorithms 1 and 2 outline the proposed methods.
Estimation of $\nabla_\theta R'$ is tightly linked to the negative phase of energy-based generative models. A traditional approach to sampling from $Q_E$ is to employ Monte-Carlo (MC) methods; in this work we use Hamiltonian Monte-Carlo (HMC). Additionally, in our experiments we use persistent MC chains, following Tieleman (2008). Nevertheless, the usage of MC leads to a significant slowdown of the training procedure, as, in general, multiple passes through the network are required for generating negative samples.
Note that both $R$ and $R'$ have a significant impact only in the regions where $f$ is close to 1 while the probability density $p(x)$ is low. This suggests that the solutions of Equations 3 and 8 are relatively robust to improper sampling procedures, and one might achieve faster training, without sacrificing much quality, by employing fast approximate MC procedures. In our experiments we observed that the following highly degenerate instance of HMC performs well:
(10)  $v_{t+1} = \rho\, v_t + (1 - \rho)\, g_t \odot g_t, \quad g_t = \nabla_x E(x_t);$

(11)  $x_{t+1} = x_t - \eta\, g_t \oslash \sqrt{v_{t+1}} + \sigma\, \epsilon_t,$

where $\odot$ denotes the Hadamard product, $\oslash$ denotes element-wise division, $\epsilon_t$ is distributed normally with zero mean and unit covariance matrix, and $\sigma$ controls the impact of the random noise. We refer to the methods utilising such sampling as RMSProp-EOPE, since the procedure resembles the RMSProp optimization algorithm (Tieleman and Hinton, 2012).

A completely different approach to negative-phase sampling is described by Kim and Bengio (2016). The authors suggest using a separate network (a generator) to produce samples from the target distribution. We also implement this sampling procedure and refer to the methods employing it as Deep EOPE.
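A single step of such an RMSProp-like sampler can be sketched as follows; the update mirrors Equations 10 and 11 in spirit, but the constants and the exact form of the running average are illustrative assumptions, and the update is an approximate procedure rather than an exact MCMC transition:

```python
import numpy as np

def rmsprop_sampler_step(x, grad_neg_energy, v, lr=0.05, rho=0.9, sigma=0.1, rng=None):
    """One RMSProp-like sampling step: gradient ascent on -E(x) with
    per-coordinate scaling plus Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    g = grad_neg_energy(x)
    v = rho * v + (1 - rho) * g * g       # running mean of squared gradients (Hadamard product)
    x = x + lr * g / (np.sqrt(v) + 1e-8) + sigma * rng.normal(size=x.shape)
    return x, v

rng = np.random.default_rng(2)
x, v = np.array([5.0]), np.zeros(1)
for _ in range(300):
    x, v = rmsprop_sampler_step(x, lambda z: -z, v, rng=rng)
# the chain drifts towards the low-energy (high-density) region around 0
```

The per-coordinate scaling keeps the step size roughly constant regardless of the gradient magnitude, which is what makes this update cheap compared to a full HMC trajectory.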
In our experiments, we observe that methods based on the EOPE loss quickly lead to steep functions $f$, which heavily interferes with the sampling procedures. Following Tieleman and Hinton (2009), we add a small regularization term for predictions in pseudo-negative points:

$\delta \cdot \mathbb{E}_{x \sim Q_E}\, E^2(x),$

where $\delta$ is a small constant.
4 Relation to other methods
The idea of performing one-class classification (and the generative task) as 'one against everything' appears in many studies. Tax and Duin (2001) propose constructing a hypersphere around positive samples, effectively separating them from the rest of the space; Ruff et al. (2018) and Chalapathy et al. (2018) extend this idea to deep neural networks. Ruff et al. (2018) rely on weight regularization, which acts in a similar manner to EOPE by limiting the area with high model output. The OPE and EOPE methods depend only on the model's output, which allows them to avoid limiting the number of layers (as in, for example, Chalapathy et al., 2018) and does not restrict the choice of network architecture (Ruff et al., 2018).
Tu (2007) and Jin et al. (2017) developed a method similar in nature to OPE; in fact, it is easy to see that the term $R$ as it appears in Equation 4 corresponds to the loss function from Jin et al. (2017). In this work we demonstrate that this loss is equivalent to the cross-entropy loss between a given class and a uniform distribution covering its support. The EOPE loss alleviates the computational expenses associated with estimation of the normalization term $Z$, and the RMSProp-like sampling procedure further accelerates training by reducing the computational cost of sampling.
5 Experiments
We evaluate the proposed methods on the following data sets: MNIST (LeCun et al., 1998), CIFAR (Krizhevsky and Hinton, 2009), KDD99 (KDD, 1999), Omniglot (Lake et al., 2015), SUSY and HIGGS (Baldi et al., 2014). In order to reflect the assumptions behind our approach, we derive multiple tasks from each data set by varying the size of the anomalous subset.
As the proposed methods target problems intermediate between the one-class and two-class settings, we compare our approaches against the following algorithms:

conventional two-class classification with the cross-entropy loss;

a semi-supervised method: dimensionality reduction by a deep AutoEncoder followed by a classifier with the cross-entropy loss;

one-class methods: Robust AutoEncoder (Zhou and Paffenroth, 2017) and Deep SVDD (Ruff et al., 2018).
Since not all of the evaluated algorithms allow for a probabilistic interpretation, the ROC AUC metric is reported. As the performance of certain algorithms (especially two-class classification) varies significantly depending on the choice of the negative class, we run each experiment multiple times and report the average and standard deviation of the metrics. The results are reported in Tables 4, 5, 6, 7, 8 and 9; a detailed description of the experimental setup can be found in Appendix A.

In these tables, columns represent tasks with varying numbers of negative samples present in the training set: the numbers in the header indicate either the number of classes that form the negative class (for the MNIST, CIFAR, Omniglot and KDD data sets) or the number of negative samples used (HIGGS and SUSY); 'one-class' denotes the absence of known anomalous samples. As one-class algorithms do not take negative samples into account, their results are repeated for the tasks with known anomalies.
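The reported metric can be computed without any probabilistic interpretation of the scores; a rank-based ROC AUC together with the repeated-runs protocol might be sketched as follows (the score distributions below are synthetic placeholders, not model outputs):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """ROC AUC via the Mann-Whitney U statistic; ties get midranks."""
    s = np.concatenate([scores_pos, scores_neg])
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty_like(s)
    ranks[order] = np.arange(1, len(s) + 1)
    for val in np.unique(s):            # average ranks of equal scores
        m = s == val
        ranks[m] = ranks[m].mean()
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    u = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

rng = np.random.default_rng(0)
# repeated draws of the negative class, as in the evaluation protocol
aucs = [roc_auc(rng.normal(1, 1, 300), rng.normal(0, 1, 300)) for _ in range(5)]
mean_auc, std_auc = np.mean(aucs), np.std(aucs)
```

A rank-based AUC only uses the ordering of scores, which is why it is applicable to methods without a probabilistic output.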
[Tables 4–9: ROC AUC (mean and standard deviation) of Robust AE, Deep SVDD, cross-entropy, semi-supervised, brute-force OPE, HMC EOPE, RMSProp EOPE and Deep EOPE on each data set; columns range from the 'one-class' setting to increasing numbers of negative classes or samples. The numeric entries did not survive extraction.]
In our experiments, we make several observations. Firstly, the proposed methods generally outperform the baseline methods, especially on problems with a significant overlap between classes (SUSY, HIGGS and, possibly, CIFAR), and consistently show comparable performance on the test problems. Secondly, we observe increasing performance as more negative samples are included in the training set, while staying consistently above or close to the performance of conventional two-class classification. Lastly, to our surprise, brute-force OPE performs relatively well even on high-dimensional problems, which might indicate that the gradients produced by its regularization term have a variance sufficiently low for proper convergence.
The main drawback of the OPE and EOPE methods is slow training, which is largely due to the usage of Monte-Carlo methods. This is partially alleviated by the fast approximation of Hamiltonian Monte-Carlo and the usage of a generator (Kim and Bengio, 2016), and can potentially be improved further by advanced Monte-Carlo techniques (for example, Levy et al., 2017).
6 Conclusion
We present a new family of anomaly detection algorithms that can be efficiently applied to problems intermediate between the one-class and two-class settings. Solutions produced by these methods combine the best features of the one-class and two-class approaches. In contrast to conventional one-class approaches, the proposed methods can effectively utilise any number of known anomalous examples, and, unlike conventional two-class classification, they do not require a representative sample of anomalous data. Our experiments show better or comparable performance relative to conventional two-class and one-class algorithms. Our approach is especially beneficial for anomaly detection problems in which anomalous data is non-representative or might evolve over time.
The research leading to these results has received funding from the Russian Science Foundation under grant agreement no. 17-72-20127.
References
KDD (1999) KDD Cup 1999 dataset: intrusion detection system, 1999. URL https://archive.ics.uci.edu/ml/datasets/kdd+cup+1999+data.
Baldi et al. (2014) Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
Bengio et al. (2009) Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Chalapathy et al. (2018) Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
Duane et al. (1987) Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
Jin et al. (2017) Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pages 823–833, 2017.
Kim and Bengio (2016) Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation, 2016.
Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Lake et al. (2015) Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Levy et al. (2017) Daniel Levy, Matthew D. Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
Ruff et al. (2018) Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International Conference on Machine Learning, pages 4390–4399, 2018.
Schölkopf et al. (2000) Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems, pages 582–588, 2000.
Schölkopf and Smola (2002) B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Tax and Duin (2001) David M. J. Tax and Robert P. W. Duin. Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research, 2(Dec):155–173, 2001.
Tieleman (2008) Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, 2008.
Tieleman and Hinton (2009) Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
Tu (2007) Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
Zhou and Paffenroth (2017) Chong Zhou and Randy C. Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.
Appendix A.
This section provides a detailed description of the experimental setup.
In order to make a clear comparison between methods, network architectures are made as close as possible. For image data (MNIST, CIFAR, Omniglot), VGG-like networks (Simonyan and Zisserman, 2014) are used; for tabular data, 5-layer dense networks are used^{2}. ^{2}The implementation can be found at https://gitlab.com/mborisyak/ope.
We evaluate the following proposed methods: brute-force OPE, HMC EOPE, RMSProp EOPE and Deep EOPE.
All OPE and EOPE models are trained with the same fixed value of $\varepsilon$. All MCMC chains are persistent (by analogy with Tieleman and Hinton, 2009), and 4 MCMC steps are performed for each gradient step. All networks are optimized by the Adam algorithm (Kingma and Ba, 2014).
In order to reflect the assumptions behind the proposed methods, we derive several tasks from each original data set considered. For the SUSY, HIGGS and KDD99 data sets the positive class is fixed according to the data sets' descriptions; for the MNIST and CIFAR-10 data sets each class in turn is considered as positive; for the Omniglot data set the 'Braille', 'Futurama' and 'Greek' alphabets are chosen as positive classes.
In order to fully demonstrate the advantages of the OPE and EOPE methods, we vary the sample sizes for the negative class: for the SUSY and HIGGS data sets only a small number of negative examples is randomly selected; for the multi-class data sets several classes are randomly selected (without replacement) and subsampled, with a fixed number of examples taken from each selected class for MNIST, CIFAR and Omniglot, while for KDD99 the maximum number of samples per class is capped.
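The task-derivation step can be sketched as follows; the function and parameter names are illustrative and are not taken from the released implementation:

```python
import numpy as np

def make_task(labels, positive_class, n_negative_classes, per_class_cap, rng):
    """Select indices for a derived task: all positive samples plus a
    capped subsample of a few randomly chosen negative classes."""
    classes = np.unique(labels)
    neg_classes = rng.choice(classes[classes != positive_class],
                             size=n_negative_classes, replace=False)
    pos_idx = np.flatnonzero(labels == positive_class)
    neg_idx = np.concatenate([
        rng.permutation(np.flatnonzero(labels == c))[:per_class_cap]
        for c in neg_classes
    ])
    return pos_idx, neg_idx
```

Varying `n_negative_classes` and `per_class_cap` produces the spectrum of tasks between the pure one-class setting and conventional two-class classification.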
The original train/test splits are respected when possible (for the SUSY and HIGGS data sets the splits are random and fixed for all derived tasks); test sets are not modified in any way.
Appendix B.
Here we provide a formal proof of Theorem 1 from Section 2.2. For the sake of simplicity, we split the proof into two lemmas.
Lemma 1. Let $X$ be a Banach space and $p : X \to \mathbb{R}$ a continuous probability density function such that $\Omega_p = \{x : p(x) > 0\}$ is an open set in $X$. If a continuous function $f$ minimizes $\mathcal{L}_E$ (defined by Equation 7) with $\varepsilon > 0$, then there exists a strictly increasing function $g$ such that $f(x) = g(p(x))$.
Proof. Consider a continuous function $f$. We show that if $f$ cannot be represented as $g(p(x))$ with a strictly increasing $g$, then $f$ does not minimize $\mathcal{L}_E$. This is demonstrated by constructing another continuous function that achieves a lower loss than $f$.

If $f$ cannot be represented in such a form, then a pair of points $x_1$ and $x_2$ can be found such that $p(x_1) > p(x_2)$ while $f(x_1) \le f(x_2)$.
Due to the continuity of $p$ and $f$, it is possible to find such neighborhoods of $x_1$ and $x_2$ that the difference in probability densities remains large, while the differences in the values of $f$ become insignificant or negative. More formally, for every $\delta > 0$ there exists $r > 0$ such that the open balls $B_1 = B(x_1, r)$ and $B_2 = B(x_2, r)$ satisfy the following properties:

(12)  $\inf_{x \in B_1} p(x) > \sup_{x \in B_2} p(x);$

(13)  $\sup_{x \in B_1} f(x) - \inf_{x \in B_2} f(x) < \delta.$
We define the perturbed function as $f'(x) = f(x) + h(x)$, where $h$ is continuous, positive on $B_1$, negative on $B_2$ and zero elsewhere; the exact form of $h$ is not important, nevertheless, for clarity, let

(14)  $h(x) = a \cdot \max\left(0,\, 1 - \|x - x_1\| / r\right) - b \cdot \max\left(0,\, 1 - \|x - x_2\| / r\right), \quad a, b > 0.$
We restrict our attention to such values of $a$ and $b$ that $f'$ has the same normalization constant as $f$:

(15)  $\int_\Omega \frac{f'(x)}{1 - f'(x)}\, dx = \int_\Omega \frac{f(x)}{1 - f(x)}\, dx.$
Equation 12 implies that $B_1$ and $B_2$ do not intersect and, since $h(x) = 0$ for $x \notin B_1 \cup B_2$, the difference between the two sides of Equation 15 consists of two non-zero terms: one over $B_1$, increasing in $a$, and one over $B_2$, decreasing in $b$. For every $a$ there exists a unique $b$ such that the pair $(a, b)$ is a solution of Equation 15. Notice also that $b(a)$ is a continuous, strictly increasing function and $b(0) = 0$.
Notice that, for small values of $a$ and $b$,
therefore,
(16) 
Similarly to Equation 15, the change in the loss $\mathcal{L}_E$ can be split into two parts:
(17)  
where .
Note, that for a positive
therefore,
where:
(18)  
(19) 
hence,
Note, that
(20) 
where: , , , .
Now, our aim is to prove that the following inequality has a solution of the form $b = b(a)$:
(21) 
where:
In combination with Equation 16, the bounds above imply that Inequality 21 is satisfied for some $a$ and $b$; therefore, Equation 15 and Inequality 21 are simultaneously satisfied. This implies that the function $f'$ has the same normalization constant as the original one and reduces the value of $\mathcal{L}_E$; hence, $f$ does not minimize $\mathcal{L}_E$, which concludes the proof.
Lemma 2. For every function $f$ that satisfies Lemma 1, $g(0) = 0$.

Proof. Suppose, for the sake of contradiction, that $g(0) > 0$.
For every sufficiently small $\delta > 0$, we can pick points $x_1$ and $x_2$, a radius $r$ and two open balls $B_1$ and $B_2$ such that