1 Introduction
The problem of ranking multivariate data by degree of abnormality, referred to as anomaly ranking, is of central importance for a wide variety of applications (e.g. fraud detection, fleet monitoring, predictive maintenance). In the standard setup, the ’normal’ behavior of the system under study (in the sense of ’not abnormal’, without any link to the Gaussian distribution) is described by the (unknown) distribution $F(dx)$ of a generic r.v. $X$, valued in $\mathbb{R}^d$. The goal pursued is to build a scoring function $s: \mathbb{R}^d \to \mathbb{R}$ that ranks any observations nearly in the same order as any increasing transform of the density $f$ would do. Ideally, the smaller the score $s(x)$ of an observation $x$ in $\mathbb{R}^d$, the more abnormal it should be considered. In Clémençon and Thomas (2018), a functional criterion, namely a Probability-Measure plot referred to as the Mass-Volume curve (the MV curve in abbreviated form), has been proposed to evaluate the anomaly ranking performance of any scoring rule $s$. This performance measure can be viewed as the unsupervised version of the Receiver Operating Characteristic (ROC) curve, the gold standard measure to evaluate the accuracy of scoring functions in the bipartite ranking context, see e.g. Clémençon and Vayatis (2009). Beyond this approach, let us highlight that the problem of anomaly detection has also been studied
via various other modelings. For instance, the works of Bergman and Hoshen (2020) and Steinwart et al. (2005) are based on classification methods, while Liu et al. (2008) build on peeling, Breunig et al. (2000) on local averaging criteria, Frery et al. (2017) on ranking and Schölkopf et al. (2001) on plug-in techniques.
In this paper, we propose a novel two-stage method for detecting and ranking abnormal instances, by means of scalar criteria summarizing the MV curve and extending the area under its curve, when $F$ has compact support. Briefly, starting from a sample of observations $X_1, \ldots, X_n$, we artificially generate an independent second sample $U_1, \ldots, U_m$ that is used as a proxy for outliers. For theoretical reasons explained in the paper, the agnostic choice consists in sampling the $U_j$’s i.i.d. from the uniform law on a subset of $\mathbb{R}^d$ in which the support of $F$ is assumed to be included. We then learn to discriminate the $X_i$’s from the $U_j$’s thanks to a scoring function that maximizes two-sample empirical counterparts of the aforementioned criteria, which are in particular robust to imbalanced datasets. The resulting scoring function makes it possible to rank observations by degree of abnormality. This novel class of criteria is based on theoretical guarantees provided by Clémençon et al. (2021) on general classes of two-sample linear rank processes, which incidentally circumvent the difficulty of optimizing the functional MV criterion. Beyond the classical results of statistical learning theory for these processes, Clémençon et al. (2021) obtain theoretical generalization guarantees for their empirical optimizers. The numerical results presented at the end of the paper also provide strong empirical evidence of the relevance of the approach promoted here.
The article is structured as follows. In section 2, the formulation of the (unsupervised) anomaly ranking problem is recalled at length, together with the concept of MV curve. In section 3, the anomaly ranking performance criteria proposed are introduced and their statistical estimation is discussed. Optimization of the statistical counterparts of the criteria introduced to build accurate anomaly scoring functions is also put forward therein. Finally, the relevance of this approach is illustrated by numerical results in section 4.
2 Background and Preliminaries
We start off with recalling the formulation of the (unsupervised) anomaly ranking problem and introducing notations that shall be used here and throughout. By $\lambda$ is meant the Lebesgue measure on $\mathbb{R}^d$, by $\mathbb{I}\{\mathcal{E}\}$ the indicator function of any event $\mathcal{E}$, while the generalized inverse of any cumulative distribution function $K(t)$ on $\mathbb{R}$ is denoted by $K^{-1}(u) = \inf\{t \in \mathbb{R} : K(t) \geq u\}$. We consider a r.v. $X$ valued in $\mathbb{R}^d$, $d \geq 1$, with distribution $F(dx) = f(x)dx$, modeling the ’normal’ behavior of the system under study. The observations at disposal $X_1, \ldots, X_n$, with $n \geq 1$, are independent copies of $X$. Based on the $X_i$’s, our goal is to learn a ranking rule for deciding, among two observations $x$ and $x'$ in $\mathbb{R}^d$, which one is more ’abnormal’. The simplest way of defining a preorder (footnote 1: a preorder $\preccurlyeq$ on a set $\mathcal{Z}$ is a reflexive and transitive binary relation on $\mathcal{Z}$; it is said to be total when either $z \preccurlyeq z'$ or else $z' \preccurlyeq z$ holds true, for all $(z, z') \in \mathcal{Z}^2$) on $\mathbb{R}^d$ consists in transporting the natural order on $\mathbb{R}_+$ onto it through a scoring function, i.e. a Borel measurable mapping $s: \mathbb{R}^d \to \mathbb{R}_+$: given two observations $x$ and $x'$ in $\mathbb{R}^d$, $x$ is said to be more abnormal according to $s$ than $x'$ when $s(x) \leq s(x')$. The set of all anomaly scoring functions that are integrable with respect to Lebesgue measure is denoted by $\mathcal{S}$. The integrability condition is not restrictive since the preorder induced by any scoring function is invariant under strictly increasing transformation (i.e. the scoring function $s$ and its transform $T \circ s$ define the same preorder on $\mathbb{R}^d$ provided that the Borel measurable transform $T$ is strictly increasing on the image of the r.v. $s(X)$, denoted by $\mathrm{Im}(s(X))$). One wishes to build, from the ’normal’ observations only, a scoring function $s$ such that, ideally, the smaller $s(x)$, the more abnormal the observation $x$. The set of optimal scoring rules in $\mathcal{S}$ should thus be composed of strictly increasing transforms of the density function $f$ that are integrable w.r.t. $\lambda$, namely:
$$\mathcal{S}^* = \Big\{ T \circ f \; : \; T: \mathrm{Im}(f(X)) \to \mathbb{R}_+ \text{ strictly increasing}, \; \int_{\mathbb{R}^d} T \circ f(x)\, dx < +\infty \Big\}. \qquad (1)$$
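The technical assumptions listed below are required to define a criterion whose optimal elements coincide with $\mathcal{S}^*$.

The r.v. $f(X)$ is continuous, i.e. $\forall c \in \mathbb{R}_+$, $\mathbb{P}\{f(X) = c\} = 0$.

The density function $f$ is bounded: $\|f\|_\infty < +\infty$.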
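Measuring anomaly scoring accuracy - The MV curve.
Consider an arbitrary scoring function $s \in \mathcal{S}$ and denote by $\Omega_{s,t} = \{x \in \mathbb{R}^d : s(x) \geq t\}$, $t \geq 0$, its level sets. As $s$ is integrable, the measure $\lambda(\Omega_{s,t})$ is finite for any $t > 0$. Introduced in Clémençon and Thomas (2018), a natural measure of the anomaly ranking performance of any scoring function candidate $s$ is the Probability-Measure plot, referred to as the Mass-Volume (MV) curve:
$$t > 0 \mapsto \big( \mathbb{P}\{s(X) \geq t\}, \; \lambda(\{x \in \mathbb{R}^d : s(x) \geq t\}) \big). \qquad (2)$$
Connecting points corresponding to possible jumps, this parametric curve can be viewed as the plot of the continuous mapping $MV_s: \alpha \in (0,1) \mapsto MV_s(\alpha)$, starting at $(0,0)$ and reaching $(1, \lambda(\mathrm{supp}(F)))$ in the case where the support of the distribution $F$ is compact, or having the vertical line ’$\alpha = 1$’ as an asymptote otherwise. A typical MV curve is depicted in Fig. 1.
Let $\alpha \in (0,1)$. Denoting by $F_s$ the cumulative distribution function of the r.v. $s(X)$, we have:
$$MV_s(\alpha) = \lambda\big(\{x \in \mathbb{R}^d : s(x) \geq F_s^{-1}(1-\alpha)\}\big), \qquad (3)$$
when $F_s \circ F_s^{-1}(\alpha) = \alpha$. This functional criterion is invariant by increasing transform and induces a partial order over the set $\mathcal{S}$. Let $(s_1, s_2) \in \mathcal{S}^2$, the ordering defined by $s_1$ is said to be more accurate than the one induced by $s_2$ when:
$$\forall \alpha \in (0,1), \quad MV_{s_1}(\alpha) \leq MV_{s_2}(\alpha).$$
As summarized by the result stated below, the MV curve criterion is adequate to measure the accuracy of scoring functions with respect to anomaly ranking.
It reveals in particular that optimal scoring functions are those whose MV curve is minimum everywhere.
(Clémençon and Thomas (2018)) Let the assumptions be fulfilled. The elements of the class $\mathcal{S}^*$ have the same (convex) MV curve and provide the best possible preorder on $\mathbb{R}^d$ w.r.t. the MV curve criterion:
$$\forall (s, \alpha) \in \mathcal{S} \times (0,1), \quad MV^*(\alpha) \leq MV_s(\alpha), \qquad (4)$$
where $MV^*(\alpha) = MV_f(\alpha)$ for all $\alpha \in (0,1)$.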
Equation (4) reveals that the lower the MV curve (everywhere) of a scoring function $s$, the closer the preorder defined by $s$ is to that induced by $f$. Favorable situations are those where the MV curve increases slowly and rises more rapidly when coming closer to the ’one’ value: this corresponds to the case where $F$ is much concentrated around its modes, $s$ takes its highest values near the latter and its lowest values are located in the tail region of the distribution $F$. Incidentally, observe that the optimal curve $MV^*$ somehow measures the spread of the distribution $F$, in particular for large values of $\alpha$ regarding extremal observations (e.g. a light tail behavior corresponds to the situation where $MV^*(\alpha)$ increases rapidly when approaching $1$), whereas it should be examined for small values of $\alpha$ when modes of the underlying distribution are investigated (a flat curve near $0$ indicates a high degree of concentration of $F$ near its modes).
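As a worked illustration of the optimal curve, consider a case that is not taken from the paper: a centered Gaussian distribution on the real line with standard deviation `sigma` (a hypothetical parameter). Its density level set of mass $\alpha$ is the interval $[-t, t]$ with $2\Phi(t/\sigma) - 1 = \alpha$, so that the optimal curve has the closed form $2\sigma\Phi^{-1}((1+\alpha)/2)$. The sketch below evaluates it with the standard library; the function name is ours.

```python
from statistics import NormalDist


def mv_star_gaussian_1d(alpha: float, sigma: float = 1.0) -> float:
    """Optimal MV curve of a centered 1-D Gaussian with standard deviation sigma.

    The density level set of mass alpha is the interval [-t, t], where
    2 * Phi(t / sigma) - 1 = alpha; its Lebesgue measure is 2 * t.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    t = sigma * NormalDist().inv_cdf((1.0 + alpha) / 2.0)
    return 2.0 * t


# Evaluate the curve on a regular grid of mass levels 0.0, 0.1, ..., 0.9.
alphas = [k / 10 for k in range(10)]
curve = [mv_star_gaussian_1d(a) for a in alphas]
```

On this grid the successive increments of `curve` grow with the mass level, which is the convexity stated in the result above. The curve is nearly flat close to 0 and rises more steeply as the mass level approaches 1.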
Statistical estimation. In practice, the MV curve of a scoring function $s$ is generally unknown, just like the distribution $F$, and it must be estimated. A natural empirical counterpart can be obtained by plotting the stepwise graph of the mapping:
$$\alpha \in (0,1) \mapsto \widehat{MV}_s(\alpha) = \lambda\big(\{x \in \mathbb{R}^d : s(x) \geq \widehat{F}_s^{-1}(1-\alpha)\}\big), \qquad (5)$$
where $\widehat{F}_s(t) = (1/n)\sum_{i=1}^n \mathbb{I}\{s(X_i) \leq t\}$ denotes the empirical c.d.f. of the r.v. $s(X)$ and $\widehat{F}_s^{-1}$ its generalized inverse. In Clémençon and Thomas (2018), for a fixed $s$, consistency and asymptotic Gaussianity (in sup-norm) of the estimator (5) have been established, together with the asymptotic validity of a smoothed bootstrap procedure to build confidence regions in the MV space. However, depending on the geometry of the superlevel sets of $s$, it can be far from simple to compute the volumes. In the case where $F$ has compact support, included in $[0,1]^d$ say for simplicity (and from now on it is assumed that this is the case), they can be estimated by means of Monte-Carlo simulation. Indeed, if one generates a synthetic i.i.d. sample $U_1, \ldots, U_m$, independent from the $X_i$’s and drawn from the uniform distribution on $[0,1]^d$, which we denote by $\mathcal{U}_d$, a natural estimator of the volume $\widehat{MV}_s(\alpha)$ is:
$$\widetilde{MV}_s(\alpha) = \frac{1}{m}\sum_{j=1}^m \mathbb{I}\{s(U_j) \geq \widehat{F}_s^{-1}(1-\alpha)\}. \qquad (6)$$
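The Monte-Carlo estimation described above can be sketched as follows, under assumptions of our own: the data live in the unit square, the ’normal’ points are drawn coordinate-wise from a Beta(5, 5) law, and the scoring function is minus the distance to the center. The empirical quantile is taken as the k-th largest score with k = ceil(alpha * n), which matches the generalized inverse up to integer edge cases. None of these choices come from the paper.

```python
import math
import random


def empirical_mv(score, normal_sample, uniform_sample, alpha):
    """Monte-Carlo estimate of the MV curve at mass level alpha on [0, 1]^d.

    The threshold is the k-th largest score of the normal sample, with
    k = ceil(alpha * n). Since [0, 1]^d has volume 1, the volume of the
    level set {s >= threshold} is estimated by the fraction of uniform
    points that fall inside it.
    """
    scores = sorted((score(x) for x in normal_sample), reverse=True)
    k = max(1, math.ceil(alpha * len(scores)))
    threshold = scores[k - 1]
    inside = sum(1 for u in uniform_sample if score(u) >= threshold)
    return inside / len(uniform_sample)


# Illustrative data (assumption): 'normal' points concentrated near the center.
rng = random.Random(0)
dim = 2
normal_sample = [[rng.betavariate(5, 5) for _ in range(dim)] for _ in range(2000)]
uniform_sample = [[rng.random() for _ in range(dim)] for _ in range(2000)]


def score(x):
    # Hypothetical scoring function: high near the center, low near the borders.
    return -math.dist(x, [0.5] * dim)


mv_values = [empirical_mv(score, normal_sample, uniform_sample, a)
             for a in (0.25, 0.5, 0.9)]
```

By construction the estimate is nondecreasing in the mass level, since a larger level lowers the threshold and enlarges the level set.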
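Minimization of the empirical area under the MV curve.
Thanks to the MV curve criterion, it is possible to develop a statistical theory for the anomaly scoring problem. From a statistical learning angle, the goal is to build from training data a scoring function $s$ with MV curve as close as possible to $MV^*$. Whereas the closeness between (continuous) curves can be measured in many ways, the $L_1$-distance offers crucial advantages. Indeed, since $MV^* \leq MV_s$ everywhere by Eq. (4), we have:
$$d_1(s, f) = \int_0^1 |MV_s(\alpha) - MV^*(\alpha)|\, d\alpha = \int_0^1 MV_s(\alpha)\, d\alpha - \int_0^1 MV^*(\alpha)\, d\alpha.$$
Notice that $d_1(s, f)$, $s \in \mathcal{S}$, is not a distance between the scoring functions $s$ and $f$ but measures the dissimilarity between the preorders they define, and that minimizing $d_1(s, f)$ boils down to minimizing the scalar quantity $\int_0^1 MV_s(\alpha)\, d\alpha$, the area under the MV curve. From a practical perspective, one may then learn an anomaly scoring rule by minimizing the empirical quantity obtained from the estimator (6):
$$\int_0^1 \widetilde{MV}_s(\alpha)\, d\alpha.$$
This boils down to maximizing the rank-sum (or Wilcoxon Mann-Whitney) statistic (see Wilcoxon (1945)) given by:
$$\widehat{W}_{n,m}(s) = \sum_{i=1}^n \mathrm{Rank}(s(X_i)), \qquad (7)$$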
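where $\mathrm{Rank}(s(X_i))$ is the rank of $s(X_i)$ among the pooled sample $\{s(X_1), \ldots, s(X_n)\} \cup \{s(U_1), \ldots, s(U_m)\}$: $\mathrm{Rank}(s(X_i)) = \sum_{l=1}^n \mathbb{I}\{s(X_l) \leq s(X_i)\} + \sum_{j=1}^m \mathbb{I}\{s(U_j) \leq s(X_i)\}$. Indeed, just like the empirical area under the ROC curve can be related to the rank-sum statistic, we have, in the absence of ties among the scores:
$$\int_0^1 \widetilde{MV}_s(\alpha)\, d\alpha = 1 + \frac{n+1}{2m} - \frac{1}{nm}\, \widehat{W}_{n,m}(s). \qquad (8)$$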
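The link between the rank-sum statistic and the empirical area under the MV curve can be checked numerically. The sketch below uses hand-picked scores (an assumption for illustration) and verifies an identity that we derived under the assumption of no ties, namely area = 1 + (n + 1) / (2m) - W / (nm); its normalization may differ from the one used in Eq. (8). Function names are ours.

```python
def rank_sum(normal_scores, uniform_scores):
    """Sum of the ranks of the normal scores within the pooled sample."""
    pooled = list(normal_scores) + list(uniform_scores)
    return sum(sum(1 for z in pooled if z <= s) for s in normal_scores)


def empirical_area_under_mv(normal_scores, uniform_scores):
    """Fraction of (normal, uniform) pairs where the uniform point scores at least as high."""
    n, m = len(normal_scores), len(uniform_scores)
    pairs = sum(1 for s in normal_scores for u in uniform_scores if u >= s)
    return pairs / (n * m)


# Hand-picked scores without ties (illustrative only).
normal_scores = [0.9, 0.7, 0.4]
uniform_scores = [0.8, 0.3, 0.2, 0.1]
n, m = len(normal_scores), len(uniform_scores)

w = rank_sum(normal_scores, uniform_scores)
area = empirical_area_under_mv(normal_scores, uniform_scores)
identity_rhs = 1 + (n + 1) / (2 * m) - w / (n * m)
```

Here the ranks of the three normal scores are 7, 5 and 4, so `w` equals 16 and both sides of the identity equal 2/12: a larger rank-sum means a smaller empirical area.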
In the next section, we introduce more general empirical summaries of the MV curve that take the form of two-sample rank statistics, just like (7), and propose to solve the anomaly ranking problem through the maximization of the latter.
3 Measuring and Optimizing Anomaly Ranking Performance
In this section, a class of anomaly ranking performance criteria is introduced, which can be estimated by two-sample rank statistics. We also emphasize that a natural approach to anomaly ranking consists in maximizing such empirical scalar criteria.
3.1 Scalar Criteria of Performance and Two-Sample Rank Statistics
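Here we develop the statistical learning framework we propose for anomaly ranking. Let $p \in (0,1)$, we assume that $N \geq 1$ observations are available: $n = \lfloor pN \rfloor$ ’normal’ i.i.d. observations $X_1, \ldots, X_n$, taking their values in $[0,1]^d$ for simplicity and drawn from $F$, and $m = N - n$ i.i.d. realizations $U_1, \ldots, U_m$ of the uniform distribution $\mathcal{U}_d$, independent from the $X_i$’s. Hence, $p$ represents the ’theoretical’ proportion of ’normal’ observations among the pooled sample. Let $\mathcal{S}_0 \subset \mathcal{S}$ be a class of scoring functions. For all $s \in \mathcal{S}_0$, we consider the mixture distribution $F_{s,p} = pF_s + (1-p)G_s$, where $G_s$ denotes the c.d.f. of $s(U)$ with $U \sim \mathcal{U}_d$, and its empirical counterpart $\widehat{F}_{s,N}(t) = (1/N)\big(\sum_{i=1}^n \mathbb{I}\{s(X_i) \leq t\} + \sum_{j=1}^m \mathbb{I}\{s(U_j) \leq t\}\big)$. Notice that since $n/N \to p$ as $N$ tends to infinity, the quantity above is a natural estimator of the c.d.f. $F_{s,p}$. We refer to $\{s(X_1), \ldots, s(X_n)\}$ and $\{s(U_1), \ldots, s(U_m)\}$ as the scored random samples. Therefore, motivated by Eq. (8), Definition 3.1 below provides the class of performance criteria we consider in the subsequent procedure. Let $\phi: [0,1] \to \mathbb{R}$ be a nondecreasing function. The ’ranking performance criterion’ with ’score-generating function’ $\phi$ based on the mixture c.d.f. $F_{s,p}$ is given by:
$$W_\phi(s) = \mathbb{E}\big[(\phi \circ F_{s,p})(s(X))\big]. \qquad (9)$$
One can naturally relate this generalized form to the MV curve, justifying this choice of scalar performance criteria as summaries of the MV curve, through the equality:
$$W_\phi(s) = \int_0^1 \phi\big(p(1-\alpha) + (1-p)(1 - MV_s(\alpha))\big)\, d\alpha. \qquad (10)$$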
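Equipped with the two random samples, the following Definition 3.1 provides an empirical counterpart that generalizes the empirical summaries of the MV curve via collections of two-sample linear rank statistics. Precisely, for a given mapping $s$, we allow the sequence of ’normal ranks’, i.e. the ranks of the scored ’normal’ instances among the pooled sample, to be weighted by means of a score-generating function.
(Two-sample linear rank statistics) Let $\phi: [0,1] \to \mathbb{R}$ be a nondecreasing function. The two-sample linear rank statistic with ’score-generating function’ $\phi$ based on the random samples $\{X_1, \ldots, X_n\}$ and $\{U_1, \ldots, U_m\}$ is given by:
$$\widehat{W}^{\phi}_{n,m}(s) = \sum_{i=1}^n \phi\Big(\frac{\mathrm{Rank}(s(X_i))}{N+1}\Big), \qquad (11)$$
where $\mathrm{Rank}(t) = \sum_{i=1}^n \mathbb{I}\{s(X_i) \leq t\} + \sum_{j=1}^m \mathbb{I}\{s(U_j) \leq t\}$.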
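A two-sample linear rank statistic of this kind can be computed directly from the two scored samples. The sketch below assumes the usual normalization of ranks by N + 1 (with N the pooled sample size) and evaluates two score-generating functions chosen by us for illustration: the identity, and a truncated weighting that only counts normalized ranks above a hypothetical threshold of 0.6.

```python
def linear_rank_statistic(normal_scores, uniform_scores, phi):
    """Sum over the normal sample of phi(rank / (N + 1)), ranks taken in the pooled sample."""
    pooled = list(normal_scores) + list(uniform_scores)
    n_total = len(pooled)
    total = 0.0
    for s in normal_scores:
        rank = sum(1 for z in pooled if z <= s)  # rank of s among all pooled scores
        total += phi(rank / (n_total + 1))
    return total


normal_scores = [0.9, 0.7, 0.4]
uniform_scores = [0.8, 0.3, 0.2, 0.1]

# Identity weighting: recovers the rank-sum statistic up to the factor 1 / (N + 1).
w_identity = linear_rank_statistic(normal_scores, uniform_scores, lambda u: u)

# Truncated weighting (illustrative): only the highest normalized ranks contribute.
w_top = linear_rank_statistic(
    normal_scores, uniform_scores, lambda u: u if u >= 0.6 else 0.0
)
```

With these scores the normalized ranks are 7/8, 5/8 and 4/8, so `w_identity` is 2.0 and `w_top` is 1.5; the truncated function ignores the normal point that is ranked lowest.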
Optimality.
Briefly, we refer to the comprehensive analysis of the general class of criteria in Clémençon et al. (2021), which establishes the theoretical guarantees for the consistency of the two-stage procedure we detail in the following subsection. Importantly, the set of maximizers of the criteria coincides with the nondecreasing transforms of the likelihood ratio, just like for the MV curves, as shown through Eq. (10).
The optimal set derived in Eq. (1) underlines the implicit characterization of an outlier: the lower the scalar score, the more likely the observation is to be anomalous. Also, the notion of distance induced by the rank-based criteria is in fact directly related to the distribution of the ’normal’ sample compared to the Uniform one.
Choosing $\phi$.
As foreshadowed above, the choice of the score-generating function is an asset of this class of criteria, as it provides flexibility in the weighting of the area under the MV curve. Indeed, minimizing the latter directly implies maximizing the criterion (see Eq. (10)), since $\phi$ is nondecreasing. Therefore, one can hope to recover the MV curve at best by the right choice of $\phi$, especially when the initial sample is noisy. Additionally, when going back to the problem of learning to rank the (possibly abnormal) instances, it is an advantage to weight the ranks accordingly.
First, we recall the simplest uniform weighting of each ’normal’ rank, with $\phi(u) = u$. It incidentally yields Eq. (8), whose continuous version explicitly involves the area under the MV curve. Other functions were introduced in the literature related to classic univariate two-sample rank statistics. Figure 2 gathers classical nondecreasing score-generating functions broadly used for two-sample statistical tests (refer to Hájek (1962)).
3.2 The Two-Stage Procedure
In this paragraph, we detail the two-stage procedure, where we assume that both the framework and assumptions detailed in the previous subsection are adopted. We define the test sample as a set of i.i.d. random variables, a priori drawn from $F$. The goal pursued is to distinguish, among the test sample, the instances most likely to be anomalous. In particular, we propose a first step that outputs an optimal ranking rule, in the sense of the maximization of the rank statistics of Eq. (3.1). Then, in the second step and equipped with this rule, the instances of the test sample are ranked by increasing order of similarity w.r.t. the $X_i$’s. We also choose to watch a number $k$ of worst-ranked instances, i.e. those of lowest empirical score. The procedure is detailed in the following Fig. 3. The theoretical guarantees proved in Clémençon et al. (2021) yield the asymptotic consistency of step 1, as well as its nonasymptotic consistency with high probability, under some technical assumptions.
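The two steps can be sketched on a toy one-dimensional problem. This is only an illustration under strong simplifying assumptions: step 1 picks the best scoring function within a small finite list of hypothetical candidates (instead of optimizing over a rich class), and the criterion is the rank statistic with ranks normalized by N + 1. All names and data are ours.

```python
def w_phi(score, normal, uniform, phi):
    """Two-sample linear rank statistic of the scored samples."""
    normal_scores = [score(x) for x in normal]
    pooled = normal_scores + [score(u) for u in uniform]
    return sum(
        phi(sum(1 for z in pooled if z <= s) / (len(pooled) + 1))
        for s in normal_scores
    )


def two_stage(candidates, normal, uniform, test, k, phi=lambda u: u):
    # Step 1: keep the candidate scoring function maximizing the empirical criterion.
    best = max(candidates, key=lambda s: w_phi(s, normal, uniform, phi))
    # Step 2: rank the test points by increasing score and flag the k lowest ones.
    ranked = sorted(test, key=best)
    return best, ranked[:k]


# Toy data (assumption): 'normal' points concentrate around 0.5 in [0, 1].
normal = [0.45, 0.5, 0.55, 0.48, 0.52]
uniform = [0.05, 0.25, 0.5, 0.75, 0.95]
test = [0.5, 0.02, 0.47, 0.99, 0.53]

candidates = [
    lambda x: -abs(x - 0.5),  # high near the center
    lambda x: x,              # increasing
    lambda x: -x,             # decreasing
]

best, flagged = two_stage(candidates, normal, uniform, test, k=2)
```

In this example the first candidate attains the largest criterion, and the two flagged test points are 0.99 and 0.02, the ones farthest from the bulk of the ’normal’ sample.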
4 Numerical Experiments
In this section, we illustrate the procedure promoted throughout the paper with numerical experiments on imbalanced synthetic data. As these experiments are mainly here to support our methodology, we propose, for step 1, to learn the empirical maximizer by means of a regularized classification algorithm. At a technical level, we would ideally like to replace a usual loss criterion such as the BCE (Binary Cross-Entropy) loss by our tailored rank-based objective. Unfortunately, the latter is not smooth and is made of highly correlated terms, which results in many challenges regarding its optimization. In order to incorporate it while still keeping good performance, we (i) use a regularized proxy of it and (ii) incorporate the regularized criterion in a penalization term. The second point allows us to drive the learning with a usual BCE loss, which asymptotically amounts to estimating the conditional probability of the label given the observation, while still taking the rank-based criterion into account.
Data generating process.
We generated the ’positive’ sample by i.i.d. Gaussian variables , , in dimension , centered and with covariance matrix (where is the identity matrix). We chose the Gaussian law for its attractive structure and in particular for its symmetry; it can be a reasonable choice in many situations where the data at hand are indeed well structured. We then sampled the ’negative’ sequence of i.i.d. r.v. , , from a radial law, expressed in terms of its density in polar coordinates and depending on two tunable parameters. In other words, the direction is uniformly sampled on the unit sphere and the radius has a Beta law governed by these two parameters. Notice that corresponds to the Uniform law and that, when , the law puts more mass around as increases. In our experiment, we choose and . Denoting by , we finally obtained ’synthetic outliers’ defined by , with . To simplify the notations, we denote by the concatenation of the ’positive’ and ’negative’ samples. We also denote by the labels, where we choose to assign the label (resp. ) to the ’positive’ (resp. ’negative’) sample. Figure 4 illustrates both data generating processes. For the test set, we similarly generated a sequence of i.i.d. Gaussian r.v. from the same Gaussian law as the ’positive’ sample, and an i.i.d. random sequence , , drawn from the law , with and , dilated by a factor .
Figure 4: (a) Train data. (b) Test data.
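Sampling from a radial law of this type can be sketched as follows: the direction is obtained by normalizing a standard Gaussian vector, which is uniform on the unit sphere, and the radius is drawn from a Beta law. The parameter names `a`, `b` and `scale`, and the values used in the example, are hypothetical; the values used in the experiments are not recoverable from the text above.

```python
import math
import random


def sample_radial(n, dim, a, b, scale=1.0, rng=random):
    """Draw n points r * theta, with theta uniform on the unit sphere and r ~ Beta(a, b).

    `scale` is an optional dilation factor applied to every point.
    """
    points = []
    for _ in range(n):
        gaussian = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in gaussian))
        radius = rng.betavariate(a, b)
        points.append([scale * radius * v / norm for v in gaussian])
    return points


# Example with illustrative parameters (not those of the paper).
outliers = sample_radial(500, dim=3, a=2.0, b=1.0, scale=2.0, rng=random.Random(0))
```

Every sampled point lies in the ball of radius `scale`, and the Beta parameters control how much mass sits close to the boundary of that ball.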
Metrics.
Once the algorithm that learns a (renormalized) optimal scoring function has been trained (step 1), we score the test data with it and compute the proportion of true outliers among the $k$ points having the lowest scores (step 2). We let $k$ vary over a grid of values. Formally, denoting by $z_{(1)}, \ldots, z_{(k)}$ the $k$ test points of lowest score, sorted by increasing score, we compute the following accuracy:
$$\frac{1}{k}\sum_{i=1}^{k} \mathbb{I}\{z_{(i)} \text{ is a true outlier}\}. \qquad (12)$$
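This accuracy, the proportion of true outliers among the k lowest-scored test points, can be computed as in the following minimal sketch. The function name and the toy scores and labels are ours.

```python
def outlier_proportion_at_k(scores, is_outlier, k):
    """Proportion of true outliers among the k test points with the lowest scores."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must lie between 1 and the number of test points")
    # Indices of the test points sorted by increasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sum(1 for i in order[:k] if is_outlier[i]) / k


# Toy test set (illustrative): two true outliers, which receive the two lowest scores.
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7]
is_outlier = [False, True, False, False, True, False]

acc_at_2 = outlier_proportion_at_k(scores, is_outlier, k=2)
acc_at_3 = outlier_proportion_at_k(scores, is_outlier, k=3)
```

Here `acc_at_2` is 1.0 and `acc_at_3` is 2/3: once k exceeds the number of true outliers, the proportion necessarily decreases.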
Neural Network.
We trained a neural network mlp composed of one hidden layer of size , a ReLU activation function and whose last layer is a Sigmoid function, computing the desired score. For each of the epochs, we use the following training scheme:
Each sample of the training set is individually passed through the network, the BCE loss is computed (footnote 2: recall that, for a label $y \in \{0, 1\}$ and a predicted probability $\hat{y}$, it is given by $-y\log(\hat{y}) - (1-y)\log(1-\hat{y})$) and a backpropagation step is performed,

At the end of each epoch, the whole batch of the training dataset is passed through the network and we compute the Binary Cross-Entropy loss, denoted by , and the following proxy of :
In our experiments, we choose and with , as defined in section 3.1. We then compute the regularized loss , where is a hyperparameter in .
The training procedure of the neural network is summarized in Algorithm 4.
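Algorithm 4: Training procedure of the neural network.
Data: labeled training dataset.
Input: Network mlp, number of epochs, penalization strength.
Result: Trained network.
for each epoch do
    for each training sample: compute the BCE loss, backpropagate and zero_grad;
    compute the BCE loss on the whole training batch and the proxy of the rank statistic;
    compute the regularized loss, backpropagate and zero_grad;
end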
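The exact proxy and regularized loss are not recoverable from the text above, so the sketch below only illustrates one common way to smooth a rank statistic, which is our assumption and not necessarily the authors' choice: each indicator is replaced by a sigmoid with a hypothetical `temperature`, and the smoothed criterion is subtracted from the BCE loss with a hypothetical weight `strength`. It is written in plain Python for readability; in practice the same expressions would be written with a differentiable tensor library.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    ) / len(labels)


def soft_rank_statistic(normal_scores, uniform_scores, phi, temperature=0.05):
    """Smoothed rank statistic: indicators 1{z <= s} are replaced by sigmoids.

    Assumption: the self-comparison contributes sigmoid(0) = 0.5 instead of 1.
    """
    pooled = list(normal_scores) + list(uniform_scores)
    total = 0.0
    for s in normal_scores:
        soft_rank = sum(sigmoid((s - z) / temperature) for z in pooled)
        total += phi(soft_rank / (len(pooled) + 1))
    return total


def regularized_loss(bce_value, proxy_value, strength):
    # The criterion is to be maximized, hence the minus sign (assumption).
    return bce_value - strength * proxy_value


proxy = soft_rank_statistic(
    [0.9, 0.7, 0.4], [0.8, 0.3, 0.2, 0.1], lambda u: u, temperature=0.001
)
loss = regularized_loss(bce([1, 0], [0.5, 0.5]), proxy, strength=0.1)
```

With a very small temperature the smoothed ranks approach the exact ranks minus one half, while a larger temperature gives smoother gradients at the price of a coarser approximation of the ranks.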
Repetitions.
We repeat times the procedure, each time computing the accuracy metric defined above.
Visualization and results.
In this section, we only display the results obtained with since they are very similar to those obtained with . This is probably due to the very simple framework adopted for the data generating process and further investigations would be of interest.
For the first learning loop, we saved the evolution of the BCE losses, for all values of , computed at each epoch together with the proxy and the accuracy metric for . As displayed in Figure 5, one can see that incorporating the empirical criterion in the penalization term improves performance for a well-chosen parameter . For instance, outputs the best results in this setting.
At the end of the training, we select the network having the highest empirical score, which here corresponds to choosing . We then score the initial observations and display in Figure 6 the points with an intensity varying from red to blue as the score increases from to . The fact that the red points are on the sides of the dataset empirically validates our methodology. We represent in Fig. 7 the averaged mass-volume curve together with the standard deviation computed for over repetitions. Table 1 gathers the results averaged over repetitions. Notice that these results support the soundness of our approach. Indeed, the area under the MV curve is minimized and the proportion of detected outliers is high even when increases.
Table 1 (column headers: 25, 50, 75, 100; entries not recovered).

5 Conclusion
In this paper, we promoted a binary classification approach to the problem of learning to rank anomalies. We established a clear theoretical link between these two machine learning tasks through the study of the mass-volume curve. In particular, our procedure is robust with respect to imbalanced datasets through the choice of the parameter that is set initially in practice. Previous results (see Clémençon et al. (2021)) support the effectiveness of our methodology. Moreover, we illustrated our method with numerical experiments on synthetic data.
We thank Yannick Guyonvarch for his insightful comments. Moreover, we are greatly indebted to the chair DSAIDIS of Telecom Paris and to the Région Île-de-France for their support.
References
 Bergman and Hoshen (2020) L. Bergman and Y. Hoshen. Classification-Based Anomaly Detection for General Data. arXiv:2005.02359, 2020.
 Breunig et al. (2000) M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104, 2000.
 Clémençon and Vayatis (2009) S. Clémençon and N. Vayatis. Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9):4316–4336, 2009.
 Clémençon et al. (2021) S. Clémençon, M. Limnios, and N. Vayatis. Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. arXiv:2104.02943, 2021.
 Clémençon and Thomas (2018) S. Clémençon and A. Thomas. Mass volume curves and anomaly ranking. Electronic Journal of Statistics, 12(2):2806–2872, 2018.
 Frery et al. (2017) J. Frery, A. Habrard, M. Sebban, O. Caelen, and L. He-Guelton. Efficient top rank optimization with gradient boosting for supervised anomaly detection. In European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD’17), 2017.
 Hájek (1962) J. Hájek. Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics, 33(3):1124–1147, 1962.
 Liu et al. (2008) F.T. Liu, K.M. Ting, and Z.-H. Zhou. Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on Data Mining, pages 413–422, 2008.
 Schölkopf et al. (2001) B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 2001.
 Steinwart et al. (2005) I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(8):211–232, 2005.
 Wilcoxon (1945) F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.