Anomaly (novelty/outlier) detection refers to the identification of abnormal or novel patterns embedded in a large amount of (nominal) data Miljković (2010). The goal of anomaly detection is to identify unusual system behaviors that are inconsistent with the system's typical state. Anomaly detection algorithms are applied in fraud detection Phua et al. (2010), failure discovery in industrial systems Lavin and Ahmad (2015), detection of adversarial examples Roth et al. (2019), and other areas Garcia-Teodoro et al. (2009); Shone et al. (2018); Goh et al. (2017).
In contrast to typical binary classification problems, where every class follows some probability distribution, an anomaly is a pattern that does not conform to the expected behavior. In other words, a completely novel type of outlier, which is not similar to any known anomalies, can occur at test time. Moreover, in most cases, we do not have access to any anomalies at training time. Consequently, novelty detection is usually solved using unsupervised approaches, such as one-class classifiers, which focus on describing the behavior of nominal data (inliers) Schölkopf et al. (2001); Abati et al. (2019); Li et al. (2018); Wang et al. (2019b). Any observation that deviates from this behavior is labeled as an outlier.
Following the above motivation, we propose OneFlow, a deep one-class classifier based on flow models. In contrast to typical (generative) flow-based density models, which focus on density estimation, OneFlow does not depend strongly on the structure of outliers, since it finds a bounding region with a minimal volume for a fixed portion of data; see Figure 1 for a comparison. This is realized by finding a hypersphere with a minimal radius in the output space of a neural network that contains a fixed percentage of data, see Figure 2. While minimum volume sets were considered previously in Scott and Nowak (2006); Zhao and Saligrama (2009) in the context of anomaly detection, these works were mainly devoted to theoretical aspects, and the algorithms proposed there do not scale well to large datasets. On the other hand, in the case of kernel methods, such as Support Vector Data Description (SVDD) Tax and Duin (2004), the minimum volume problem has been reformulated to obtain a convex objective, which solves a slightly different problem. In particular, instead of enclosing a fixed fraction of data within a hypersphere, SVDD adds a penalty for data points outside the hypersphere. Making use of neural networks, we do not need to stick to convex optimization and, therefore, OneFlow solves the original problem directly. Consequently, the gradient of the cost function is propagated only through the points located near the decision boundary (a behavior similar to that of support vectors in SVM), see Remark 1. To the best of the authors' knowledge, this is the first model that applies this paradigm in deep learning without any simplifications.
OneFlow uses two important ingredients. The first one is the application of flow-based models Dinh et al. (2014); Kingma and Dhariwal (2018), which give an explicit formula for the inverse mapping and allow us to calculate the Jacobian of a neural network at every point. Consequently, minimizing the volume of the hypersphere in the feature space leads to the minimization of the volume of the corresponding bounding region in the input space. Moreover, making use of the inverse mapping, we automatically get a parametric form of the corresponding bounding region in the input space, which is useful, for example, in describing shapes from 3D point clouds Yang et al. (2019); Spurek et al. (2020), see Figure 3 for details. The second ingredient is the Bernstein polynomial estimator Cheng (1995) of the upper quantile, which is used in the OneFlow loss to estimate a hypersphere for a fixed fraction of data.
Experiments performed on typical benchmark datasets show that OneFlow gives comparable or even better performance than state-of-the-art models for anomaly detection. In particular, OneFlow outperforms typical flow-based density models, which use a log-likelihood objective, as well as the deep SVDD method.
Our contribution is summarized as follows:
We formulate a one-class classification problem as finding a bounding region for a fixed amount of data with a minimal volume.
We show that the combination of flow models with the Bernstein quantile estimator allows us to estimate the volume of the bounding region in closed form.
We experimentally analyze the behavior of the proposed approach and compare it with state-of-the-art methods.
One of the most successful approaches to anomaly detection is based on one-class learning. One-class SVM (OCSVM) Schölkopf et al. (2001) and SVDD Tax and Duin (2004) are two well-known kernel methods for one-class classification. While OCSVM directly uses SVM to separate the data from the origin (considered as the only negative sample), SVDD aims to enclose most of the data points inside a hypersphere with minimal volume and requires a dedicated implementation. To guarantee a unique minimum, SVDD relaxes this problem to a convex objective by penalizing data points outside the hypersphere. In a similar spirit, Chen et al. (2013) apply Ranking SVM based on rankings created from pairwise comparisons of nominal data. In contrast to SVDD, which reformulates the minimum volume problem, Scott and Nowak (2006); Zhao and Saligrama (2009) derived algorithms that, under some assumptions on the data density, are provably optimal. However, despite obtaining important theoretical results, these methods do not scale well to large datasets. In contrast to these works, we do not use any simplifications in the cost function, but solve the minimum volume set problem efficiently.
Recent research on anomaly detection is dominated by methods based on deep learning. Attempts to adapt SVDD to neural networks are presented in Ruff et al. (2018); Kim et al. (2015); Ruff et al. (2019); Chong et al. (2020). However, the direct minimization of the SVDD loss may lead to the collapse of the hypersphere to a single point. To avoid this behavior, it is recommended that the center be something other than the all-zero-weights solution, and that the network use only unbounded activations and omit bias terms Ruff et al. (2018). While the first two conditions can be accepted, omitting bias terms in a network may lead to a sub-optimal feature representation due to the role of bias in shifting activation values. To eliminate these restrictions, a recent work Chong et al. (2020) proposes two regularizers, which prevent hypersphere collapse, and uses an adaptive weighting scheme to control the amount of penalization between the SVDD loss and the respective regularizer.
The vast majority of deep learning methods use the representation learning capability of neural networks to generate a latent representation that preserves the details of the given class, based on auto-encoder reconstruction error Abati et al. (2019); Dasgupta et al. (2018); Li et al. (2018). This line of research includes strictly unsupervised techniques Wang et al. (2019b) as well as supervised and semi-supervised methods Shu et al. (2018). Nevertheless, there is no theoretical justification that reconstruction error captures enough information to separate nominal data from outliers. GAN-based approaches were considered in Sabokrou et al. (2018); Perera et al. (2019). Another direction of related work is the discovery of out-of-distribution instances (which are basically anomalies). Various forms of thresholding are used on the classification output to detect anomalies Hendrycks and Gimpel (2016); Liang et al. (2017); DeVries and Taylor (2018). Wang et al. (2019a) defined a general approach for anomaly detection, which is based on thresholding a multivariate quantile function. Analogously to our approach, they use flow-based models, but in the context of density estimation. Another use of density-based flow models is presented in Schmidt and Simic (2019).
Our goal is to find a bounding region with a minimal volume that contains a fixed amount of data. This refers to one of the typical ideas used in one-class classification Scott and Nowak (2006); Zhao and Saligrama (2009); Tax and Duin (2004), where we describe the behavior of nominal data. Ignoring a small portion of the data allows us to deal with anomalies in training as well as to focus only on the most important features. In contrast to density-based approaches, which estimate the density of the whole data, we solve an easier task by separating nominal data from abnormal examples.
Let $p$ be a density in $\mathbb{R}^D$, and let $X$ be a sample generated by $p$. We assume that we are given a value $\alpha \in (0, 1)$, which determines the percentage of possible outliers (if not stated otherwise, we use a fixed default value of $\alpha$, which is motivated by a typical approach used in hypothesis testing). We say that $B \subset \mathbb{R}^D$ is an $\alpha$-bounding region of $p$ if a data point generated from the density $p$ belongs to $B$ with probability $1 - \alpha$, i.e. $\int_B p(x)\,dx = 1 - \alpha$. Intuitively, an $\alpha$-bounding region covers approximately a $1 - \alpha$ fraction of the data, which allows us to deal with outliers or noise in training data.
Our problem is formally stated below:
Find an $\alpha$-bounding region $B$ with a minimal volume for a density $p$ generating data, i.e.
$$\min_{B \subset \mathbb{R}^D} \mathrm{Vol}(B) \quad \text{subject to} \quad \int_B p(x)\,dx = 1 - \alpha.$$
To allow for sufficient flexibility in defining the form of the bounding region $B$, we use deep neural networks. Given a neural network $f: \mathbb{R}^D \to \mathbb{R}^D$, we aim at finding a radius $r > 0$ such that $B = f^{-1}(B(0, r))$, where $B(0, r)$ denotes a ball centered at $0$ with radius $r$. In other words, the $\alpha$-bounding region is the inverse image of a ball with radius $r$ in the feature space. While the computation of $f^{-1}$ can be difficult for arbitrary neural networks, we restrict our attention to flow-based models, which give an explicit form of the inverse mapping $f^{-1}$.
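The construction above reduces membership in the bounding region to a norm check in the feature space. The following is a minimal sketch of that check, assuming a ball centered at the origin; `toy_flow` is a hypothetical hand-written invertible map standing in for a trained flow, not the model used in the paper.

```python
import math


def toy_flow(x):
    """Hypothetical invertible map standing in for a trained flow (assumption)."""
    # Additive coupling in 2D: the second coordinate is shifted by a function
    # of the first one, so the map is trivially invertible.
    x1, x2 = x
    return (x1, x2 + math.tanh(x1))


def toy_flow_inverse(z):
    """Explicit inverse of toy_flow: undo the shift of the second coordinate."""
    z1, z2 = z
    return (z1, z2 - math.tanh(z1))


def in_bounding_region(x, radius, flow=toy_flow):
    """A point lies in the preimage of the ball exactly when its image has norm <= radius."""
    z1, z2 = flow(x)
    return math.hypot(z1, z2) <= radius
```

Because the inverse is explicit, points on the boundary of the region can be generated by mapping points of the sphere of the chosen radius through `toy_flow_inverse`, which is what gives the region a parametric form.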
First, we demonstrate that the volume of the bounding region can be calculated efficiently for flow-based models. Next, we show that the application of the Bernstein estimator allows us to find a hypersphere containing a $1 - \alpha$ fraction of data, which is the solution of our optimization problem.
Volume calculation using flow-based models.
Let us recall that a neural network $f$ is a flow-based model if the inverse mapping $f^{-1}$ is given explicitly and the Jacobian determinant $\det df(x)$ can be easily calculated. Flow-based models have usually been used as generative models, because the explicit form of $f^{-1}$ allows one to generate samples from the prior distribution, while the condition on the Jacobian makes the optimization of the log-likelihood function possible. Their direct application in the context of anomaly detection can be compared to the use of GMMs. Given a distribution of data, we discard an $\alpha$ percentage of data or a region of data space with probability $\alpha$. Since we want to realize a different objective, we need to redefine the loss for flow-based models.
As mentioned, flow-based models are designed to calculate the Jacobian determinant of $f$ efficiently, which allows us to optimize the log-likelihood function directly in the case of neural networks. From this perspective, flow-based models can be divided into two natural classes. The first class, referred to as const-det flows, contains models where $\det df(x)$ is constant (does not depend on $x$), e.g. NICE Dinh et al. (2014). The models from the second class, called here general flows, can change the derivative at different points, e.g. Real NVP Dinh et al. (2016).
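We show that for const-det flows we can obtain the exact formula for $\mathrm{Vol}(f^{-1}(B(0, r)))$, while for general flows its approximation can be derived. For this purpose, we introduce the notation:
$$J_f(x) = |\det df(x)|.$$
Note that for const-det flows, $J_f$ is a constant function.
Let $f$ be a const-det flow model, i.e. $J_f$ is constant. The volume of $f^{-1}(B(0, r))$ is given by:
$$\mathrm{Vol}\big(f^{-1}(B(0, r))\big) = \frac{V_D\, r^D}{J_f}, \qquad (1)$$
where $V_D$ denotes the volume of the unit ball in $\mathbb{R}^D$.
Proof. By the change of variables formula, $\mathrm{Vol}\big(f^{-1}(B(0, r))\big) = \int_{B(0, r)} J_f\big(f^{-1}(z)\big)^{-1}\,dz$. Since $J_f$ is a constant function, we get $\mathrm{Vol}\big(f^{-1}(B(0, r))\big) = \mathrm{Vol}(B(0, r)) / J_f = V_D\, r^D / J_f$. ∎
If $J_f$ depends on $x$, then the situation is more complex, but we can still obtain an approximation of the volume for general flows as
$$\mathrm{Vol}\big(f^{-1}(B(0, r))\big) \approx \frac{V_D\, r^D}{k} \sum_{i=1}^{k} \frac{1}{J_f\big(f^{-1}(z_i)\big)},$$
where $z_1, \ldots, z_k$ are points randomly chosen with respect to the uniform distribution on $B(0, r)$. This is a type of Monte Carlo sampling, and it is generally difficult to control the accuracy of this estimation.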
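To make the two volume computations concrete, here is a minimal sketch in plain Python. It assumes the change-of-variables form of the volume (unit-ball volume times the radius to the power of the dimension, divided by the absolute Jacobian determinant); the function names and the callback `abs_det_at_latent` are hypothetical and only serve the illustration.

```python
import math
import random


def unit_ball_volume(dim):
    """Volume of the unit ball in R^dim: pi^(dim/2) / Gamma(dim/2 + 1)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)


def volume_const_det(radius, dim, abs_det_jacobian):
    """Exact volume of the preimage of a ball when |det df| is the same everywhere."""
    return unit_ball_volume(dim) * radius ** dim / abs_det_jacobian


def sample_unit_ball(dim, rng):
    """One point drawn uniformly from the unit ball (Gaussian direction, radius U^(1/dim))."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    scale = rng.random() ** (1.0 / dim) / norm
    return [v * scale for v in g]


def volume_monte_carlo(radius, dim, abs_det_at_latent, n_samples=2000, seed=0):
    """Monte Carlo estimate of the volume for a flow with a varying determinant.

    abs_det_at_latent(z) is a hypothetical callback returning |det df| evaluated
    at the preimage of the latent point z; a real flow would compute it from
    its layers.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [radius * v for v in sample_unit_ball(dim, rng)]
        total += 1.0 / abs_det_at_latent(z)
    return unit_ball_volume(dim) * radius ** dim * total / n_samples
```

When the callback returns a constant, the Monte Carlo estimate coincides with the exact formula, which is a convenient sanity check before plugging in a model whose determinant varies.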
We have shown how to compute the volume of the bounding region using flow-based models. Now, we apply this fact to construct the optimization procedure for computing the $\alpha$-bounding region.
Let $X = (x_i)_{i=1}^{n}$ be a mini-batch, $\theta$ be the weights of the flow model $f_\theta$, and $\alpha$ be a given value. To apply formula (1) for a const-det flow, we need to find the radius $r$ of the ball that contains a $1 - \alpha$ fraction of data. We estimate this radius by first computing
$$r_i = \|f_\theta(x_i)\|, \quad i = 1, \ldots, n,$$
and next applying the estimator of the upper $\alpha$-quantile.
As a quantile estimator, we use the Bernstein polynomial estimator Cheng (1995); Leblanc (2012); Zielinski (2004). Let us recall that given a sample $x_1, \ldots, x_n$ drawn from the same distribution, the Bernstein estimate of the $q$-quantile, where $q \in (0, 1)$, is constructed in the following way. First, we reorder $x_i$ so that $x_{(1)} \le \ldots \le x_{(n)}$. Then the Bernstein estimator of the $q$-quantile is defined by:
$$\hat{Q}(q) = \sum_{j=1}^{n} \binom{n-1}{j-1} q^{j-1} (1-q)^{n-j}\, x_{(j)}.$$
Bernstein polynomials are known to yield very smooth estimates, even from small samples, that typically have acceptable behavior at the boundaries.
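The estimator can be written in a few lines. The sketch below assumes the standard binomial-weight form of the Bernstein polynomial quantile estimator, in which every order statistic contributes with a Binomial(n - 1, q) probability; the function name is illustrative.

```python
import math


def bernstein_quantile(sample, q):
    """Bernstein polynomial estimate of the q-quantile, for 0 <= q <= 1.

    Each order statistic is weighted by a Binomial(n - 1, q) probability, so
    the result is a smooth function of q rather than a single selected element.
    """
    ordered = sorted(sample)
    n = len(ordered)
    return sum(
        math.comb(n - 1, j) * q ** j * (1 - q) ** (n - 1 - j) * ordered[j]
        for j in range(n)
    )
```

For example, `bernstein_quantile([3.0, 1.0, 2.0], 0.5)` mixes the three order statistics with weights 0.25, 0.5 and 0.25. For very large samples the binomial coefficients become huge, so a log-space evaluation of the weights is the safer choice.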
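Applying the above construction to our case, we proceed as follows:
the sequence $r_{(1)} \ge \ldots \ge r_{(n)}$ is obtained by sorting the radii $r_i = \|f_\theta(x_i)\|$ in descending order,
the Bernstein polynomial estimator of the upper $\alpha$-quantile is given by
$$\hat{r}_\alpha = \sum_{j=1}^{n} \binom{n-1}{j-1} \alpha^{j-1} (1-\alpha)^{n-j}\, r_{(j)}.$$
Consequently, the volume of the bounding region for const-det flows is given by
$$\frac{V_D\, \hat{r}_\alpha^{D}}{|\det df_\theta|},$$
where $V_D$ is the volume of the unit ball in $\mathbb{R}^D$.
To avoid potential numerical problems in the cost function, one can minimize the logarithm of the volume (instead of the volume itself):
$$\log V_D + D \log \hat{r}_\alpha - \log |\det df_\theta|.$$
For general flows, we use the Monte Carlo approximation of the volume in the above calculations. Thus the estimate of the volume of the $\alpha$-bounding region is given by
$$\frac{V_D\, \hat{r}_\alpha^{D}}{k} \sum_{i=1}^{k} \frac{1}{\big|\det df_\theta\big(f_\theta^{-1}(\hat{r}_\alpha z_i)\big)\big|},$$
where $(z_i)_{i=1}^{k}$ is a sequence of randomly chosen points from the uniform distribution on the unit ball $B(0, 1)$. The final cost function in the logarithmic form equals:
$$\log V_D + D \log \hat{r}_\alpha + \log\left(\frac{1}{k} \sum_{i=1}^{k} \frac{1}{\big|\det df_\theta\big(f_\theta^{-1}(\hat{r}_\alpha z_i)\big)\big|}\right).$$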
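As an illustration of the const-det case, the log-volume objective can be evaluated on a mini-batch of latent codes as sketched below. The names `alpha` and `log_abs_det` are assumptions introduced for this example, the additive constant coming from the unit-ball volume is dropped because it does not affect the minimizer, and a real implementation would use an automatic differentiation framework instead of plain Python.

```python
import math


def upper_quantile_bernstein(values, alpha):
    """Bernstein estimate of the upper alpha-quantile (values sorted in descending order)."""
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    return sum(
        math.comb(n - 1, j) * alpha ** j * (1 - alpha) ** (n - 1 - j) * ordered[j]
        for j in range(n)
    )


def log_volume_loss(latent_batch, alpha, log_abs_det):
    """Log-volume of the bounding region for a const-det flow, up to an additive constant.

    latent_batch: list of latent vectors (the flow outputs for one mini-batch).
    log_abs_det:  logarithm of the absolute Jacobian determinant, a single
                  number because the determinant is assumed constant.
    """
    dim = len(latent_batch[0])
    radii = [math.sqrt(sum(v * v for v in z)) for z in latent_batch]
    radius = upper_quantile_bernstein(radii, alpha)
    return dim * math.log(radius) - log_abs_det
```

Sorting in descending order and using `alpha` directly is equivalent to taking the ordinary quantile of level one minus `alpha` on the ascending sample, so a smaller `alpha` yields a larger radius.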
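We now explain that, in contrast to flow-based density models, the gradient of our loss function is propagated only over a small number of points, which are located close to the decision boundary. For large $n$, the binomial weights $\binom{n-1}{j-1} \alpha^{j-1} (1-\alpha)^{n-j}$ of the Bernstein estimator are well approximated by a normal density with mean $\mu \approx n\alpha$ and standard deviation $\sigma \approx \sqrt{n\alpha(1-\alpha)}$. Thus, by the $3\sigma$ law, the numerically essential weights are only those for examples where $j \in [\mu - 3\sigma, \mu + 3\sigma]$. Consequently, only the following percentage of samples from the batch obtain a nonzero gradient:
$$\frac{l}{n} = 6 \sqrt{\frac{\alpha(1-\alpha)}{n}},$$
where $l = 6\sigma$ is the length of the above interval.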
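The concentration of the weights can be checked numerically. The sketch below assumes binomial Bernstein weights and uses a batch of 1000 points (the batch size listed in the Appendix) together with an illustrative outlier level of 0.05, which is an assumption made only for this example.

```python
import math


def bernstein_weights(n, alpha):
    """Binomial(n - 1, alpha) weights of the Bernstein estimator, computed in log space."""
    log_a, log_b = math.log(alpha), math.log(1.0 - alpha)
    weights = []
    for j in range(n):
        # Logarithm of the binomial coefficient C(n - 1, j) via log-gamma.
        log_comb = math.lgamma(n) - math.lgamma(j + 1) - math.lgamma(n - j)
        weights.append(math.exp(log_comb + j * log_a + (n - 1 - j) * log_b))
    return weights


def essential_share(n, alpha, n_sigma=3.0):
    """Share of batch positions within n_sigma standard deviations of the mean index,
    together with the total weight carried by those positions."""
    weights = bernstein_weights(n, alpha)
    mean = (n - 1) * alpha
    std = math.sqrt((n - 1) * alpha * (1.0 - alpha))
    inside = [j for j in range(n) if abs(j - mean) <= n_sigma * std]
    share = len(inside) / n
    mass = sum(weights[j] for j in inside)
    return share, mass


share, mass = essential_share(1000, 0.05)
print(f"share of batch: {share:.3f}, weight carried: {mass:.4f}")
```

Under these assumptions only a few percent of the sorted batch carries almost all of the weight, so only those points receive a non-negligible gradient.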
In this section, we experimentally examine OneFlow and compare it with several state-of-the-art approaches. OneFlow is implemented using the architecture of the NICE flow model (we verified that more complex flow models such as RealNVP Dinh et al. (2016) and Glow Kingma and Dhariwal (2018) give worse results, which could be caused by the fact that such models are too flexible for outlier detection, which is a purely unsupervised task) and the default value of $\alpha$ (see Appendix for the experimental setting). An analysis of the parameter $\alpha$ is presented at the end of this section. If not stated otherwise, we consider the const-det variant of the flow (the Jacobian determinant is constant). The Appendix contains additional experiments.
Box plots for rankings calculated on MNIST (left) and Fashion-MNIST (right) using AUC score. The median ranking is marked by a line, while the average ranking is marked with a number.
Benchmark data for anomaly detection.
First, we provide a quantitative assessment and take into account the Thyroid (http://odds.cs.stonybrook.edu/thyroid-disease-dataset/) and KDDCUP (http://kdd.ics.uci.edu/databases/kddcup99/kddcup.testdata.unlabeled_10_percent.gz) datasets, which are real-world benchmark datasets for anomaly detection. We use the standard training and test splits and follow exactly the same evaluation protocol as in Wang et al. (2019a). The performance was measured using the F1 score, because this metric was reported for all methods considered.
We use two variants of OneFlow. The first one (OneFlow) uses a constant Jacobian determinant, while the second (OneFlow-Gen) allows the Jacobian determinant to change at every point. Our models are compared with the following algorithms (see Appendix for a more detailed description): (1) One-class SVM (OC-SVM) Schölkopf et al. (2001), (2) Deep structured energy-based models (DSEBM) Zhai et al. (2016), (3) Deep autoencoding Gaussian mixture model (DAGMM) Zong et al. (2018), (4) variants of MQT, the multivariate quantile map (NLL and three TQM variants) Wang et al. (2019a), (5) Deep Support Vector Data Description (DSVDD) Ruff et al. (2018), (6) two variants of the log-likelihood flow model (LL-Flow and LL-Flow-Gen). The bounding region of LL-Flow is constructed by taking the smallest hypersphere in the latent space that covers a $1 - \alpha$ fraction of data (the hypersphere is determined by the prior Gaussian distribution).
The results presented in Table 1 show that both variants of our model perform almost equally on KDDCUP and that they are better than all competing methods. In the case of Thyroid, OneFlow presents the best performance, while OneFlow-Gen gives the third best score. The most similar method, DSVDD, was not able to obtain similar performance on these datasets, which shows that the direct minimization of the bounding region implemented by our method is more beneficial. While the results of Triangular Quantile Maps (TQM and NLL) depend heavily on the assumed norm and cost function, there are no objective criteria for selecting these parameters. Other methods produce worse results.
Image datasets.
To provide further experimental verification, we use two image datasets: MNIST and Fashion-MNIST. In contrast to the previous comparison, these two datasets are usually used for multiclass classification and thus need to be adapted to the problem of anomaly detection. For this purpose, each of the ten classes is in turn treated as the nominal class, while the remaining nine classes are treated as the anomaly class, which results in 10 scenarios for each dataset. To be consistent with Wang et al. (2019a), we report AUC (area under the ROC curve).
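The one-versus-rest protocol and the AUC metric can be sketched as follows. The helper names are hypothetical, and the AUC is computed directly as the probability that an anomaly receives a higher anomaly score than a nominal point, which is equivalent to the area under the ROC curve; this is an illustration, not the evaluation code used in the experiments.

```python
def auc_score(scores_nominal, scores_anomalous):
    """AUC as the probability that an anomaly scores higher than a nominal point.

    Ties count as one half. Quadratic in the number of points, which is fine
    for a sketch but too slow for large test sets.
    """
    wins = 0.0
    for a in scores_anomalous:
        for s in scores_nominal:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(scores_nominal) * len(scores_anomalous))


def one_vs_rest_scenarios(labels, num_classes=10):
    """Yield (nominal_class, nominal_indices, anomalous_indices) for every scenario."""
    for c in range(num_classes):
        nominal = [i for i, y in enumerate(labels) if y == c]
        anomalous = [i for i, y in enumerate(labels) if y != c]
        yield c, nominal, anomalous
```

In each scenario the model would be trained on the nominal class only, and the anomaly score of a test point could be, for instance, the distance of its latent code from the center of the hypersphere.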
In addition to the methods listed previously, the comparison includes, among others, Denoising autoencoder (DAE) Vincent et al. (2008), Generative probabilistic novelty detection (GPND) Pidhorskyi et al. (2018), and Latent space autoregression (LSA) Abati et al. (2019). In contrast to the previous experiment, we use only TQM and NLL as implementations of MTQ, because they yield the highest AUC values Wang et al. (2019a).
To present the results, we compute the ranking on each of the 10 scenarios and summarize it using a box plot, see Figure 4 (detailed results and analysis are included in the Appendix). It is evident that OneFlow and OneFlow-Gen outperform the related LL-Flow and LL-Flow-Gen, which confirms that the proposed loss function is better suited to one-class classification problems than the typical log-likelihood function. Our methods also give better scores than DSVDD, which implements a similar loss function. The overall ranking of OneFlow and OneFlow-Gen is comparable to the best performing methods on both datasets. It is difficult to clearly determine which method performs best, because of the high variation in the results. While GPND seems to outperform other methods on MNIST, its result on Fashion-MNIST is similar to that of OneFlow. We emphasize, however, that neither MNIST nor Fashion-MNIST represents a typical anomaly detection dataset.
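The ranking summary behind such a box plot can be computed as sketched below. The tie-handling rule (tied methods share the average rank) is an assumption, since the text does not specify it, and any method names or AUC values passed to these functions are placeholders rather than reported results.

```python
from statistics import mean, median


def rank_methods(auc_by_method):
    """Rank methods within one scenario: rank 1 is the highest AUC, ties share the average rank."""
    ordered = sorted(auc_by_method.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def summarize_rankings(scenarios):
    """Return {method: (median rank, mean rank)} over a list of per-scenario AUC dictionaries."""
    per_method = {}
    for auc_by_method in scenarios:
        for name, r in rank_methods(auc_by_method).items():
            per_method.setdefault(name, []).append(r)
    return {name: (median(rs), mean(rs)) for name, rs in per_method.items()}
```

The median and the mean returned for each method correspond to the line and the number marked in the box plot.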
Next, we analyze which samples from the nominal class are located closest to or furthest from the center of the bounding hypersphere. It is evident from Figure 5 that OneFlow maps images with a regular structure, which are easy to recognize, close to the hypersphere center. On the other hand, examples located far from the center (outside the bounding region) do not look visually plausible, and one cannot be sure about their class. This means that OneFlow gives results consistent with our intuition.
Test on Out-Of-Distribution dataset.
We now focus on comparing OneFlow with LL-Flow, which represents its natural baseline, and follow the experiment recently suggested in Nalisnick et al. (2018). In this setting, each model is trained on the Fashion-MNIST training set. Next, we test these models on data coming from the test sets of both MNIST and Fashion-MNIST. We expect that the models will be able to classify MNIST examples as anomalies.
Figure 6 illustrates the distance of latent representations from the center of the bounding hypersphere. As expected, in both cases, Fashion-MNIST (nominal) data are located closer to the center than MNIST (outlier) data, which is the correct behavior. However, OneFlow maps out-of-distribution data (MNIST) much further from the center than LL-Flow (see the range on the x-axis). We verified that the percentage of correctly classified anomalies is higher for OneFlow than for LL-Flow, which means that the discriminative power of OneFlow is higher than that of LL-Flow. We also performed a second experiment in which MNIST was considered as nominal data and Fashion-MNIST represented outliers; both models obtained almost perfect performance in this situation.
To find key differences between OneFlow and LL-Flow, we consider 2-dimensional examples, which are easy to visualize and represent a typical benchmark for comparing anomaly detection algorithms (additional illustrative examples on 3D point clouds are presented in Figure 3).
At first glance, both flow models give similar results in most cases, see Figure 7. However, a closer inspection reveals that the decision boundaries created by OneFlow are smoother and shorter than those resulting from LL-Flow. While minimizing the area of the bounding region should lead to a more accurate description of the nominal class, minimizing the length of the decision boundary reduces the model complexity. By analogy with typical supervised models, smooth, short, and simple decision boundaries usually increase the generalization performance of the model on unseen examples. To confirm this observation, we calculate the volume of the bounding region and the length of the corresponding decision boundary, see Table 2. It is evident that these quantities are smaller in the case of OneFlow.
A notable difference between the two models can also be seen in the example shown in Figure 1, where a dataset consists of two distinct blobs: one with 98% of the data and the other with the remaining 2%. While LL-Flow focuses on the whole data and considers a few examples from the smaller blob as nominal data, OneFlow directly solves a one-class problem and deems the whole smaller blob anomalous. This shows that OneFlow is not very sensitive to the structure and the distribution of anomalies, because they are automatically ignored in the training phase. On the other hand, LL-Flow fits a prior density to the whole data and, consequently, the distribution of anomalies influences the final results. Analogous behavior of both models was observed when we changed the proportions of the clusters. Indeed, OneFlow always deemed the smaller cluster anomalous if it contained at most 5% of the whole data (larger clusters cannot be considered as anomalies in practice). In contrast, LL-Flow could not separate the small clusters from nominal data in this case.
Analysis of parameter $\alpha$.
Previous experiments were performed for OneFlow with the default value of $\alpha$. A natural question is: what is the influence of $\alpha$ on the behavior of OneFlow?
To partially answer this question, we corrupt the nominal training data with anomalies. We consider 5 noise levels: 0%, 0.1%, 1%, 5% and 10% of anomalies in the training set. In each case, we run OneFlow with several values of $\alpha$. Evaluation on the test set remains exactly the same as before. We take into account the MNIST and Fashion-MNIST datasets.
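The corruption step can be sketched as below. The helper `contaminate` is hypothetical, and it assumes that the noise level is measured as the share of anomalies in the final training set, which is one possible reading of the setup described above.

```python
import random


def contaminate(nominal, anomalies, anomaly_fraction, seed=0):
    """Return a shuffled training set in which `anomaly_fraction` of the points are anomalies."""
    if not 0.0 <= anomaly_fraction < 1.0:
        raise ValueError("anomaly_fraction must lie in [0, 1)")
    rng = random.Random(seed)
    # Solve k / (len(nominal) + k) = anomaly_fraction for the number of anomalies k.
    k = round(anomaly_fraction * len(nominal) / (1.0 - anomaly_fraction))
    k = min(k, len(anomalies))
    mixed = list(nominal) + rng.sample(list(anomalies), k)
    rng.shuffle(mixed)
    return mixed


# The five noise levels considered in the text.
noise_levels = [0.0, 0.001, 0.01, 0.05, 0.10]
```

Each contaminated training set would then be used to fit the model, with the test-time evaluation left unchanged.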
It is clear from Figure 8 that the performance of OneFlow slightly deteriorates as the number of anomalies in the training set increases, regardless of the value of $\alpha$. Moreover, the model with a high value of $\alpha$ is able to deal with a large number of anomalies in training better than the model with a small $\alpha$. Indeed, OneFlow with a given $\alpha$ leaves approximately an $\alpha$ fraction of data outside the bounding region and thus can still provide a good description of nominal data as long as the fraction of anomalies does not exceed $\alpha$.
Another observation is that OneFlow with a small $\alpha$ works better for MNIST than for Fashion-MNIST when the number of anomalies is low. It may be explained by the fact that MNIST is a relatively simple dataset and almost all examples from each class are similar. Consequently, OneFlow with a smaller $\alpha$ performs better than with a larger one for a negligible amount of anomalies in training. On the other hand, the variation in each class of Fashion-MNIST is greater (in the noiseless case, AUC for Fashion-MNIST is 4 percentage points lower than for MNIST), so the bounding region created for a small $\alpha$ is too loose. In the test phase, such a bounding region may contain too many anomalies.
The above analysis suggests that $\alpha$ should be large if the underlying anomaly detection task is hard or we have many anomalies in training. Otherwise, we should keep $\alpha$ low.
The paper introduced OneFlow, which realizes a well-known one-class paradigm using deep learning tools. Making use of a flow-based model and Bernstein quantile estimator, we find a minimal volume bounding region for a given percentage of data. On the one hand, the constructed bounding region does not depend on the structure of outliers as in the density-based models, while, on the other hand, the bounding region is given by an explicit parametric form. Experimental results demonstrate that OneFlow presents state-of-the-art performance.
- Latent space autoregression for novelty detection. pp. 481–490.
- UCI machine learning repository.
- LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104.
- A new one-class SVM for anomaly detection. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3567–3571.
- The Bernstein polynomial estimator of a smooth quantile function. Statistics & Probability Letters 24 (4), pp. 321–330.
- Simple and effective prevention of mode collapse in deep one-class classification. arXiv preprint arXiv:2001.08873.
- A neural data structure for novelty detection. Proceedings of the National Academy of Sciences 115 (51), pp. 13093–13098.
- Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
- NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
- Anomaly-based network intrusion detection: techniques, systems and challenges. Computers & Security 28 (1-2), pp. 18–28.
- Anomaly detection in cyber physical systems using recurrent neural networks. In 2017 IEEE 18th International Symposium on High Assurance Systems Engineering (HASE), pp. 140–145.
- Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pp. 9758–9769.
- PIDForest: anomaly detection via partial identification. In Advances in Neural Information Processing Systems, pp. 15809–15819.
- Robust random cut forest based anomaly detection on streams. In International Conference on Machine Learning, pp. 2712–2721.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- Deep learning with support vector data description. Neurocomputing 165, pp. 111–117.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- Evaluating real-time anomaly detection algorithms–the Numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 38–44.
- On estimating distribution functions using Bernstein polynomials. Annals of the Institute of Statistical Mathematics 64 (5), pp. 919–943.
- Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758.
- Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
- Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 1–39.
- Review of novelty detection methods. In The 33rd International Convention MIPRO, pp. 593–598.
- Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136.
- OCGAN: one-class novelty detection using GANs with constrained latent representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2898–2906.
- A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119.
- Generative probabilistic novelty detection with adversarial autoencoders. In Advances in Neural Information Processing Systems, pp. 6822–6833.
- The odds are odd: a statistical test for detecting adversarial examples. arXiv preprint arXiv:1902.04818.
- Deep semi-supervised anomaly detection. arXiv preprint arXiv:1906.02694.
- Deep one-class classification. In International Conference on Machine Learning, pp. 4393–4402.
- Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3379–3388.
- Normalizing flows for novelty detection in industrial time series data. arXiv preprint arXiv:1906.06904.
- Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471.
- Learning minimum volume sets. Journal of Machine Learning Research 7 (Apr), pp. 665–704.
- A deep learning approach to network intrusion detection. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (1), pp. 41–50.
- Unseen class discovery in open-world classification. arXiv preprint arXiv:1801.05609.
- Hypernetwork approach to generating point clouds. arXiv preprint arXiv:2003.00802.
- Support vector data description. Machine Learning 54 (1), pp. 45–66.
- OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60.
- Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
- Multivariate triangular quantile maps for novelty detection. In Advances in Neural Information Processing Systems, pp. 5061–5072.
- Effective end-to-end unsupervised outlier detection via inlier priority of discriminative network. In Advances in Neural Information Processing Systems, pp. 5960–5973.
- Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1-3), pp. 37–52.
- PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541–4550.
- Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717.
- Anomaly detection with score functions based on nearest neighbor graphs. In Advances in Neural Information Processing Systems, pp. 2250–2258.
- Optimal quantile estimators small sample approach. Polish Academy of Sciences. Institute of Mathematics.
- Deep autoencoding Gaussian mixture model for unsupervised anomaly detection.
Appendix A Experimental setting
In all our experiments, OneFlow and LL-Flow are implemented using the architecture of the NICE flow model Dinh et al. (2014), with the following hyperparameters:
Number of flow layers: 4
Number of coupling layers: 4
Hidden dimension: 16
Number of epochs: 1000
Batch size: 1000
Learning rate: 0.001
Anomaly detection and image datasets from the main text
Number of flow layers: 4
Number of coupling layers: 4
Hidden dimension: 256
Number of epochs: 1000
Batch size: 1000
Learning rate: 0.001
Anomaly detection and image datasets from PIDForest benchmark
Number of flow layers: 2
Number of coupling layers: 6
Hidden dimension: 64
Number of epochs: 2000
Batch size: 1000
Learning rate: 0.001
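To make the building block of the architecture concrete, the following is a minimal NumPy sketch of a NICE-style additive coupling layer and a stack of such layers. It is an illustration under assumptions, not our training code: the class and function names, the one-hidden-layer ReLU network, the weight initialization, and the way the "coupling layers" and "hidden dimension" settings map onto this sketch are all hypothetical.

```python
import numpy as np


class AdditiveCoupling:
    """NICE-style additive coupling: one half is kept, the other is shifted."""

    def __init__(self, dim, hidden_dim, flip, rng):
        half = dim // 2
        idx = np.arange(dim)
        # Coordinates passed through unchanged and coordinates that get shifted.
        self.keep = idx[half:] if flip else idx[:half]
        self.move = idx[:half] if flip else idx[half:]
        # Hypothetical one-hidden-layer ReLU network computing the shift.
        self.W1 = rng.normal(scale=0.1, size=(len(self.keep), hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(scale=0.1, size=(hidden_dim, len(self.move)))
        self.b2 = np.zeros(len(self.move))

    def _shift(self, x_keep):
        return np.maximum(x_keep @ self.W1 + self.b1, 0.0) @ self.W2 + self.b2

    def forward(self, x):
        y = x.copy()
        y[:, self.move] = x[:, self.move] + self._shift(x[:, self.keep])
        return y

    def inverse(self, y):
        x = y.copy()
        x[:, self.move] = y[:, self.move] - self._shift(y[:, self.keep])
        return x


def build_flow(dim, n_coupling=4, hidden_dim=16, seed=0):
    """Stack coupling layers, alternating which half is transformed."""
    rng = np.random.default_rng(seed)
    return [AdditiveCoupling(dim, hidden_dim, flip=bool(i % 2), rng=rng)
            for i in range(n_coupling)]


def flow_forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x


def flow_inverse(layers, z):
    for layer in reversed(layers):
        z = layer.inverse(z)
    return z
```

Each additive coupling layer only shifts one half of the coordinates by a function of the other half, so it is invertible in closed form and its Jacobian determinant equals one.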
Appendix B Description of comparative algorithms
Below, we give a brief description of algorithms used in the experimental section:
One-class SVM (OC-SVM) Schölkopf et al. (2001). It is a traditional kernel-based one-class classifier (we use the RBF kernel).
Deep structured energy-based models (DSEBM) Zhai et al. (2016). This model employs a deterministic deep neural network to output the energy function, such as negative log-likelihood, which is used to form the density of nominal data.
Deep autoencoding Gaussian mixture model (DAGMM) Zong et al. (2018). It combines a deep autoencoder with a Gaussian mixture estimation network to output the joint density of the latent representations and some reconstruction features from the autoencoder.
Four variants of MQT, the multivariate quantile map (NLL, TQM_1, TQM_2, TQM_∞) Wang et al. (2019a). MQT is a general model, which thresholds a given score function to describe nominal data. As a score function, we use the negative log-likelihood (NLL) as well as the 1-norm, 2-norm, and infinity norm of the quantile (TQM_1, TQM_2, and TQM_∞, respectively).
Deep Support Vector Data Description (DSVDD) Ruff et al. (2018). It is an implementation of SVDD using deep neural networks, which penalizes data points that lie outside the hypersphere.
Two variants of the log-likelihood flow model (LL-Flow and LL-Flow-Gen). These are generative flow models trained with the log-likelihood function; they are the counterparts of OneFlow and OneFlow-Gen, respectively.
Geometric transformation (GT) Golan and El-Yaniv (2018). It uses a multi-class model to discriminate between dozens of geometric transformations applied to examples from the nominal class. The scoring function is the conditional probability of the softmax responses of the classifier given the geometric transformations.
Variational autoencoder (VAE) Kingma and Welling (2013). The evidence lower bound is used as the scoring function.
Denoising autoencoder (DAE) Vincent et al. (2008). The reconstruction error is used as the scoring function.
Generative probabilistic novelty detection (GPND) Pidhorskyi et al. (2018). GPND, based on adversarial autoencoders, uses data density as the scoring function. Density is approximated by linearizing the manifold that nominal data resides on.
IsolationForest (iForest) Liu et al. (2012). A random-forest-based algorithm. iForest isolates observations by randomly selecting a feature and a split value. The number of splits required to isolate a sample is the scoring function.
Robust Random Cut Forest (RRCF) Guha et al. (2016). An outlier detection algorithm that is based on a binary search tree. The scoring function is measured by its collusive displacement (CoDisp): if including a new point significantly changes the model complexity (i.e. bit depth), then that point is more likely to be an outlier.
Local Outlier Factor (LOF) Breunig et al. (2000). The scoring function is based on measuring the local deviation of a given data point with respect to its k-nearest neighbours.
k-Nearest Neighbour (kNN). The distance to the k nearest neighbours is used as the scoring function.
Principal Component Analysis (PCA) Wold et al. (1987). The scoring function is calculated as the distance from the axes in feature space.
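As an illustration of how the simplest of these baselines produces a score, below is a minimal sketch of a kNN scoring function. The description above does not fix how the k distances are aggregated, so averaging them is an assumption made here; the function name `knn_score` and the default `k=5` are likewise hypothetical.

```python
import numpy as np


def knn_score(train, query, k=5):
    """Mean Euclidean distance from each query point to its k nearest training points.

    Assumption: the k distances are averaged; using only the k-th distance
    is an equally common convention.
    """
    diffs = query[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # shape (n_query, n_train)
    nearest = np.sort(dists, axis=1)[:, :k]     # k smallest distances per query
    return nearest.mean(axis=1)
```

A larger score means that the point lies farther from the training data and is therefore more likely to be an outlier.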
Appendix C PIDForest benchmark
We run additional benchmarks following the experimental setting of Gopalan et al. (2019). More specifically, we test OneFlow and OneFlow-Gen on the following eight datasets from the UCI repository (Asuncion and Newman, 2007), the OpenML repository (Vanschoren et al., 2014) and KDD Cup 1999: Thyroid, Mammography, Seismic, Satimage-2, Vowels, Musk, http, smtp. Every method was trained on the whole dataset (outliers included) and AUC was reported on the same set (Table 3 contains detailed scores, while Figure 9 presents a box plot of ranks, which summarizes the experiment). While the performance of OneFlow is slightly worse than that of the best algorithms (PIDForest and iForest), OneFlow-Gen performs comparably to these methods.
Let us recall that AUC does not take a decision boundary into account, but only uses the relative ordering of outliers and nominal data. In a production environment, we need a classification rule and must verify which algorithm classifies outliers and nominal data correctly. For this reason, we also test all algorithms in a classification setting, in which the 5% of examples that are farthest (according to a given loss function) are deemed anomalies. The results are evaluated using the F1 score. We report detailed scores in Table 4 and present the rank plot in Figure 9. OneFlow-Gen significantly outperforms the comparative methods. While the performance of OneFlow is slightly worse, it is still better than that of the other algorithms. This experiment confirms that the proposed method is better at finding outliers, which is not the same as ranking elements according to the loss function, but is crucial in practice.
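The difference between the two evaluation protocols can be sketched as follows. This is an illustrative NumPy sketch rather than our evaluation code: the function names are hypothetical, and we assume that a higher score means a more anomalous example.

```python
import numpy as np


def auc_from_scores(scores, is_outlier):
    """AUC as the probability that a random outlier scores above a random inlier.

    Ties between an outlier and an inlier count as one half.
    """
    pos = scores[is_outlier]
    neg = scores[~is_outlier]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def f1_at_fraction(scores, is_outlier, fraction=0.05):
    """Flag the given fraction of highest-scoring points as anomalies and compute F1."""
    n_flag = max(1, int(round(fraction * len(scores))))
    flagged = np.zeros(len(scores), dtype=bool)
    flagged[np.argsort(scores)[-n_flag:]] = True
    tp = np.sum(flagged & is_outlier)
    if tp == 0:
        return 0.0
    precision = tp / flagged.sum()
    recall = tp / is_outlier.sum()
    return 2 * precision * recall / (precision + recall)
```

For example, if 10 of 100 examples are true outliers and all of them receive the highest scores, the AUC is perfect, yet flagging only 5% of the data recovers half of the outliers, so the F1 score is 2/3. The ranking and the classification rule are thus assessed separately.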
Appendix D Detailed results for MNIST and Fashion-MNIST
To give better insight, we calculate AUC for every pair of classes, see Figures 10 and 11. More precisely, every entry of the heat map shows the AUC obtained for a given nominal class listed in a column and a given anomaly class listed in a row. For example, it turns out that OneFlow and OneFlow-Gen have the biggest problems with detecting anomalies represented by class "1" when trained on class "8". Generally, it is evident that the class "1" is the hardest for our model to describe. It may be explained by the fact that the handwritten digit "1" can be written in various styles, which makes it similar to other classes.
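The layout of such a heat map can be sketched as below, with columns indexed by the nominal class and rows by the anomaly class. This sketch makes several assumptions: `centroid_scorer` (distance to the mean of the nominal class) is only a stand-in for a trained model, all names are hypothetical, and for brevity the nominal class is scored on the same data it was fitted on rather than on a held-out split.

```python
import numpy as np


def auc(scores_anomaly, scores_nominal):
    """Probability that an anomaly scores above a nominal example (ties count 1/2)."""
    greater = (scores_anomaly[:, None] > scores_nominal[None, :]).sum()
    ties = (scores_anomaly[:, None] == scores_nominal[None, :]).sum()
    return (greater + 0.5 * ties) / (len(scores_anomaly) * len(scores_nominal))


def centroid_scorer(train):
    """Stand-in for a trained model: score = distance to the nominal class mean."""
    center = train.mean(axis=0)
    return lambda x: np.linalg.norm(x - center, axis=1)


def pairwise_auc(X, y, score_fn):
    """heat[row, col] = AUC for anomaly class `row` when the model is fitted on class `col`."""
    classes = np.unique(y)
    heat = np.full((len(classes), len(classes)), np.nan)
    for col, nominal in enumerate(classes):
        scorer = score_fn(X[y == nominal])  # fit on the nominal class only
        s_nominal = scorer(X[y == nominal])
        for row, anomaly in enumerate(classes):
            if anomaly == nominal:
                continue  # the diagonal is undefined and stays NaN
            heat[row, col] = auc(scorer(X[y == anomaly]), s_nominal)
    return classes, heat
```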
The sample images are arranged in four columns:
examples from the nominal class that lie closest to the center of the bounding hypersphere (1st column),
examples from the nominal class that lie farthest from the center of the bounding hypersphere (2nd column),
anomalies that lie closest to the center of the bounding hypersphere (3rd column),
anomalies that lie farthest from the center of the bounding hypersphere (4th column).
Interestingly, the examples from the class "1" are frequently located close to the center of the bounding hypersphere. This partially explains the behavior observed in the previous heat maps. Another observation is that the examples closest to the center are very regular (first column), while the examples farthest from the center look less regular, and even humans could make mistakes in classifying these images.
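Selecting the examples closest to and farthest from the center can be sketched as follows. This is a minimal illustration: `z` is assumed to hold the representations of the examples in the output space of the flow, `center` the center of the bounding hypersphere, and the function name and the default `k` are hypothetical.

```python
import numpy as np


def closest_and_farthest(z, center, k=3):
    """Return indices of the k points nearest to and the k points farthest from `center`."""
    dist = np.linalg.norm(z - center, axis=1)
    order = np.argsort(dist)                # ascending distance
    return order[:k], order[-k:][::-1]      # nearest first, farthest first
```

Applying the function separately to nominal examples and to anomalies gives the four groups of examples described above.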