In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to non-destructive quality control in industry, through fraud detection, network intrusion surveillance or system management in data centers (see for instance (Viswanathan et al., 2012)), anomaly detection
is of crucial importance. In most common situations, anomalies correspond to ‘rare’ observations and must be automatically detected based on an unlabeled dataset. In practice, the very purpose of anomaly detection techniques is to rank observations by degree of abnormality/novelty, rather than simply assigning them a binary label, ‘abnormal’ vs ‘normal’. In the case of univariate observations, abnormal values are generally those which are extreme, i.e. ‘too large’ or ‘too small’ with regard to central quantities such as the mean or the median, and anomaly detection may then derive from standard tail distribution analysis: the farther in the tail the observation lies, the more ‘abnormal’ it is considered. In contrast, it is far from easy to formulate the issue in a multivariate context. In the present paper, motivated by applications such as those aforementioned, we place ourselves in the situation where (unlabeled) observations take their values in a possibly very high-dimensional space, making approaches based on nonparametric density estimation or multivariate (heavy-)tail modeling hardly feasible, if not unfeasible, due to the curse of dimensionality. In this framework, a variety of statistical techniques, relying on the concept of minimum volume set investigated in the seminal contributions of Einmahl and Mason (1992) and Polonik (1997) (see Section 2), have been developed in order to split the feature space into two halves and decide whether observations should be considered as ‘normal’ (namely, when lying in the minimum volume set estimated on the basis of the dataset available) or not (when lying in the complementary set), at a given confidence level. One may also refer to (Scott and Nowak, 2006) and to (Koltchinskii, 1997) for closely related notions. The problem considered here is of a different nature: the goal pursued is not to assign to all possible observations a label ‘normal’ vs ‘abnormal’, but to rank them according to their level of ‘abnormality’. The most natural way to define a preorder on the feature space is to transport the natural order on the real line through some (measurable) scoring function: the smaller the score of an observation, the more likely it is viewed as ‘abnormal’. This problem shall be referred to here as anomaly scoring. It can be somehow related to the literature dedicated to statistical depth functions in nonparametric statistics and operations research, see (Zuo and Serfling, 2000) and the references therein. Such depth functions are generally proposed ad hoc in order to define a ‘center’ for the probability distribution of interest and a notion of distance to the latter, e.g. the concept of ‘center-outward ordering’ induced by halfspace depth in (Tukey, 1975) or (Donoho and Gasko, 1992).
The angle embraced in this paper is quite different, its objective being twofold: 1) propose a performance criterion for the anomaly scoring problem, so as to formulate it in terms of M-estimation, and 2) investigate the accuracy of scoring rules which optimize empirical estimates of the criterion thus tailored.
Due to the global nature of the ranking problem, the criterion we promote is functional, just like the Receiver Operating Characteristic (ROC) and Precision-Recall curves in the supervised ranking setup (i.e. when a binary label, e.g. ‘normal’ vs ‘abnormal’, is assigned to the sampling data), and shall be referred to as the Mass Volume curve (MV curve in abbreviated form). The latter induces a partial preorder on the set of scoring functions: the collection of optimal elements is defined as the set of scoring functions whose MV curve is minimum everywhere. Such optimal scoring functions are proved to coincide with strictly increasing transforms of the underlying probability density almost everywhere on the support of this density (see Section 3 for the exact definition). In the unsupervised setting, MV curve analysis is shown to play a role quite comparable to that of ROC curve analysis for supervised anomaly detection. The issue of estimating the MV curve of a given scoring function based on sampling data is then tackled and a smooth bootstrap method for constructing confidence regions is analyzed. A statistical methodology to build a nearly optimal scoring function is next described, which works as follows: first, a piecewise constant approximant of the optimal MV curve is estimated by solving a few minimum volume set estimation problems, where confidence levels are chosen adaptively from the data so as to adjust to the variations of the optimal curve; second, a piecewise constant scoring function is built based on the sequence of estimated minimum volume sets. The MV curve of the scoring rule thus produced can be related to a stepwise approximant of the (unknown) optimal MV curve, which permits establishing the generalization ability of the algorithm through rate bounds in an appropriate functional norm.
The rest of the article is structured as follows. Section 2
describes the mathematical framework, sets out the main notations and recalls the crucial notions related to anomaly/novelty detection on which the analysis carried out in the paper relies. Section 3 first provides an informal description of the anomaly scoring problem and then introduces the MV curve criterion, dedicated to evaluating the performance of any scoring function. The set of optimal elements is described, and statistical results related to the estimation of the MV curve of a given scoring function are stated in Section 4. Statistical learning of an anomaly scoring function is then formulated as a functional M-estimation problem in Section 5, while Section 6 is devoted to the study of a specific algorithm for the design of nearly optimal anomaly scoring functions. Technical proofs are deferred to the Appendix.
We finally point out that a very preliminary version of this work was presented at the conference AISTATS 2013 (see (Clémençon and Jakubowicz, 2013)). The present article gives a better characterization of the optimal scoring functions and investigates in much greater depth the statistical assessment of the performance of a given scoring function. In particular, it shows how to construct confidence regions for the MV curve of a scoring function in a computationally feasible fashion, using the (smooth) bootstrap methodology, for which we state a consistency result. This consistency result gives a rate of convergence which promotes the use of a smooth bootstrap approach rather than a naive bootstrap technique. Additionally, in Section 6, the confidence levels of the minimum volume sets to be estimated are chosen in a data-driven way (instead of considering a regular subdivision of the interval fixed in advance), giving the statistical estimation and learning procedures proposed in this paper crucial adaptivity properties with respect to the (unknown) shape of the optimal MV curve. Eventually, we give an example showing that the nature of the problem tackled here is very different from that of density estimation, and we also give a simpler formula for the derivative of the optimal MV curve (that of the underlying density) compared to the one originally given in (Clémençon and Jakubowicz, 2013).
2 Background and Preliminaries
As a first go, we start off by describing the mathematical setup and recalling key concepts in anomaly detection involved in the subsequent analysis.
2.1 Framework and Notations
Here and throughout, we suppose that we observe n independent and identically distributed realizations X_1, ..., X_n of an unknown continuous probability distribution, copies of a generic random variable X taking its values in a (possibly very high-dimensional) feature space, a subset of R^d with d ≥ 1. The density of the random variable X with respect to Leb, the Lebesgue measure on R^d, is denoted by f, its support by supp(f) and the indicator function of any event E by 1_E. For any set C, its complement is denoted by C^c. The sup norm of any bounded real-valued function g is denoted by ||g||∞ and the Dirac mass at any point x by δ_x. The notation O_P(1)
is used to mean boundedness in probability. The quantile function F^{-1} of any real-valued random variable with cumulative distribution function F is defined by F^{-1}(u) = inf{t ∈ R : F(t) ≥ u} for all u ∈ (0, 1). For any real-valued random variable Z, the generalized inverse of the decreasing function t ↦ P(Z ≥ t) is defined in the same fashion. For any scoring function s, F_s denotes the cumulative distribution function of the random variable s(X). In addition, for any α ∈ (0, 1), Q(s, α) denotes the quantile at level α of the distribution of s(X). Finally, constants in inequalities are denoted by c or C and may vary at each occurrence.
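To make the generalized-inverse convention concrete, here is a minimal sketch (the helper name and sample are illustrative, not from the paper) computing inf{t : F_n(t) ≥ u} for the empirical cdf F_n of a finite sample:

```python
import numpy as np

def empirical_quantile(sample, u):
    """Generalized inverse of the empirical cdf: inf{t : F_n(t) >= u}."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    k = max(int(np.ceil(n * u)), 1)  # smallest k with k/n >= u
    return xs[k - 1]

print(empirical_quantile([3.0, 1.0, 2.0, 4.0], 0.5))   # 2.0
```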
A natural way of defining preorders on the feature space is to map its elements onto R+ and use the natural order on the real half-line.
Definition 1. (Scoring function) A scoring function is any measurable function s : R^d → R+ that is integrable with respect to the Lebesgue measure.
The set of all scoring functions is denoted by S, and we denote the level sets of any scoring function s ∈ S by: {x ∈ R^d : s(x) ≥ t}, for t ≥ 0.
Observe that the family of level sets is decreasing (for the inclusion, as t increases from 0 to +∞):
{x : s(x) ≥ t'} ⊂ {x : s(x) ≥ t} for all 0 ≤ t ≤ t', and that lim_{t→0+} {x : s(x) ≥ t} = {x : s(x) > 0} and lim_{t→+∞} {x : s(x) ≥ t} = ∅. (Recall that a collection of subsets of a set E, indexed by t, is said to converge if and only if its lim inf and lim sup coincide. In such a case, one defines its limit as this common set.)
The following quantities shall also be used in the sequel. For any scoring function s ∈ S and threshold level t ≥ 0, define:

α_s(t) = P(s(X) ≥ t)  and  λ_s(t) = Leb({x ∈ R^d : s(x) ≥ t}).

The quantity α_s(t) is referred to as the mass of the level set {x : s(x) ≥ t}, while λ_s(t) is generally termed its volume (with respect to the Lebesgue measure).
The integrability of a scoring function s (see Definition 1) implies that the volumes are finite for all positive threshold levels: for any t > 0, λ_s(t) ≤ (1/t) ∫ s(x) dx < +∞.
Incidentally, we point out that, in the specific case s = f, the set {x : f(x) ≥ t} is the density contour cluster of the density function f at level t, see (Polonik, 1995) for instance. Such a set is also referred to as a density level set. Observe in addition that, using the terminology introduced in (Liu et al., 1999), {x : s(x) ≥ t} is the region enclosed by the contour of depth t when s is a depth function.
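Both quantities admit simple Monte-Carlo approximations. The sketch below (the scoring function, sample sizes and bounding box are illustrative assumptions) estimates the mass P(s(X) ≥ t) from a sample of X, and the volume of the level set from uniform draws over a box containing it; for the chosen score, the level set at t = 1/2 is a disc of known area, which provides a sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # hypothetical scoring function on R^2: a Gaussian-shaped bump
    return np.exp(-np.sum(x**2, axis=-1))

# mass: P(s(X) >= t), estimated from a sample of X
X = rng.normal(size=(100_000, 2))
def mass(t):
    return np.mean(score(X) >= t)

# volume: Leb({x : s(x) >= t}), Monte Carlo over the box [-5, 5]^2
R = 5.0
U = rng.uniform(-R, R, size=(200_000, 2))
def volume(t):
    return (2 * R) ** 2 * np.mean(score(U) >= t)

# for this score, {s >= 1/2} is the disc of radius sqrt(log 2), of area pi*log(2)
t = 0.5
print(mass(t), volume(t), np.pi * np.log(2))
```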
2.2 Minimum Volume Sets
The notion of minimum volume sets has been introduced and investigated in the seminal contributions of Einmahl and Mason (1992) and Polonik (1997) in order to describe regions where a multivariate random variable takes its values with highest/smallest probability. Let α ∈ (0, 1); a minimum volume set Ω*_α of mass at least α is any solution of the constrained minimization problem:

min Leb(Ω) subject to P(X ∈ Ω) ≥ α,
where the minimum is taken over all measurable subsets Ω of R^d. Applications of this concept include in particular novelty/anomaly detection: for large values of α, abnormal observations (outliers) are those which belong to the complementary set of Ω*_α. In the continuous setting, it can be shown that there exists a threshold value t*_α ≥ 0 such that the density level set {x : f(x) ≥ t*_α} is a solution of the constrained optimization problem above. The (generalized) quantile function is then defined by:

λ*(α) = Leb(Ω*_α), for α ∈ (0, 1).
The definition of λ* can be extended to the interval [0, 1] by setting λ*(0) = 0 and λ*(1) = Leb(supp(f)). The following assumptions shall be used in the subsequent analysis.
The density f is bounded: ||f||∞ < +∞.
The density f has no flat parts, i.e. for any t > 0, Leb({x ∈ R^d : f(x) = t}) = 0.
Under the assumptions above, for any α ∈ (0, 1), there exists a unique minimum volume set Ω*_α (equal to the density level set {x : f(x) ≥ t*_α} up to subsets of null Leb-measure), whose mass is exactly equal to α. Additionally, the mapping λ* is continuous on (0, 1) and uniformly continuous on [0, 1 − ε] for all ε ∈ (0, 1) (when the support of f is compact, uniform continuity holds on the whole interval [0, 1]).
From a statistical perspective, estimates of minimum volume sets are built by replacing the unknown probability distribution by its empirical version and restricting optimization to a collection A of Borel subsets of R^d, supposed rich enough to include all density level sets (or reasonable approximations of the latter). In (Polonik, 1997), functional limit results are derived for the generalized empirical quantile process under certain assumptions on the class A (stipulating in particular that A is a Glivenko-Cantelli class for the distribution of X). In (Scott and Nowak, 2006), it is proposed to replace the level α by α − φ_n, where φ_n plays the role of a tolerance parameter. The latter should be chosen, roughly, of the same order as the supremum of the deviations between empirical and true masses over the class A, the complexity of the class being controlled by its VC dimension or by means of the concept of Rademacher averages, so as to establish rate bounds for fixed α.
Alternatively, so-termed plug-in techniques, consisting in first computing an estimate of the density and next considering level sets of the resulting estimator, have been investigated in several papers, among which (Cavalier, 1997; Tsybakov, 1997; Rigollet and Vert, 2009; Cadre, 2006; Cadre et al., 2013). Such an approach however raises significant computational issues, even for moderate values of the dimension, inherent to the curse of dimensionality phenomenon.
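As an illustration of the plug-in approach (a sketch only; the synthetic data, the kernel estimator and scipy's default bandwidth are assumptions for the example), one can estimate the density with a kernel density estimator and threshold it at the empirical quantile of the estimated density values, which yields an approximate minimum volume set of mass α:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 1000))   # gaussian_kde expects shape (d, n)

kde = gaussian_kde(X)            # plug-in density estimate
densities = kde(X)               # estimated density at each sample point

# plug-in minimum volume set of mass alpha: keep the points whose estimated
# density exceeds the empirical (1 - alpha)-quantile of the estimated densities
alpha = 0.95
threshold = np.quantile(densities, 1 - alpha)
inliers = densities >= threshold
print(inliers.mean())            # close to alpha
```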
3 Ranking Anomalies
In this section, the issue of scoring observations depending on their level of novelty/abnormality is first described in an informal manner and next formulated quantitatively, as a functional optimization problem, by means of a novel concept, termed the Mass Volume curve.
3.1 Overall Objective
The problem considered in this article is to learn a scoring function s ∈ S, based on training data X_1, ..., X_n, so as to describe the extremal behavior of the (high-dimensional) random vector X by that of the univariate variable s(X), which can be summarized by its tail behavior near 0: hopefully, the smaller the score s(x), the more abnormal/rare the observation x should be considered. Hence, an optimal scoring function should naturally permit one to rank observations by increasing order of magnitude of their density (see Section 3.2 for the precise definition of the set of optimal scoring functions). The preorder on the feature space induced by such an optimal scoring function could then be used to rank observations by their degree of abnormality: for any optimal scoring function s*, an observation x is more abnormal than an observation x' if and only if s*(x) < s*(x').
3.2 A Functional Criterion: the Mass Volume Curve
We now introduce the concept of Mass Volume curve and show that it is a natural criterion to evaluate the accuracy of decision rules with regard to anomaly scoring.
Definition 2. (True Mass Volume curve) Let s ∈ S. Its Mass Volume curve (MV curve in abbreviated form) with respect to the probability distribution of the random variable X is the plot of the function

α ∈ (0, 1) ↦ MV_s(α) = λ_s(α_s^{-1}(α)),

where α_s^{-1} denotes the generalized inverse of α_s. If the scoring function s is upper bounded, MV_s exists and is defined on (0, 1).
(Parametric Mass Volume Curve) If α_s is invertible, and therefore α_s^{-1} is the ordinary inverse of α_s, the Mass Volume curve of the scoring function s can also be defined as the parametric curve: t ↦ (α_s(t), λ_s(t)).
Figure 1: Scoring functions and their associated Mass Volume curves for a truncated Gaussian distribution.
(Connections to ROC/concentration analysis)
We point out that the MV curve resembles a receiver operating characteristic (ROC) curve of the test/diagnostic function s (see (Egan, 1975)), except that the distribution under the alternative hypothesis is not a probability measure but the image of the Lebesgue measure by the function s, while the distribution under the null hypothesis is the probability distribution of the random variable s(X). Hence, the curvature of the graph somehow measures the extent to which the spread of the distribution of X differs from that of a uniform distribution. Observe also that, in the case where the support of X coincides with the unit square [0, 1]^2, the MV curve of any scoring function s corresponds to the concentration function introduced in (Polonik, 1999), taking the uniform probability distribution on [0, 1]^2 as reference measure, to compare the concentration of the distribution of X with that of the uniform distribution with respect to the class of sublevel sets of the function s. Notice finally that, when s is a depth function, the MV curve is a scale curve, as defined in (Liu et al., 1999).
This functional criterion induces a partial order over the set of all scoring functions. Let s_1 and s_2 be two scoring functions on R^d; the ordering provided by s_1 is better than that induced by s_2 when MV_{s_1}(α) ≤ MV_{s_2}(α) for all α ∈ (0, 1).
Typical MV curves are illustrated in Fig. 1. A desirable MV curve increases slowly and rises only near 1, just like the lowest curve of Fig. 1. This corresponds to the situation where the distribution of the random variable X is highly concentrated around its modes and the highest values (respectively, the lowest values) taken by the scoring function are located near the modes of f (respectively, in the tail region of f). The MV curve of the scoring function is then close to the lower right corner of the Mass Volume space. We point out that, in certain situations, only some parts of the MV curve may be of interest: large values of α when the focus is on extremal observations (the tail region of the random variable X), and small values of α when the modes of the underlying distribution are investigated.
We define the set S* of optimal scoring functions as follows. We recall here that for any set C, its complement is denoted by C^c.
The set S* of optimal scoring functions is the set of scoring functions s ∈ S such that there exist c ≥ 0 and a function T : (0, ||f||∞] → (c, +∞) such that:

(i) For all x ∈ supp(f)^c, s(x) ≤ c,

(ii) T is strictly increasing,

(iii) For Leb-almost all x ∈ supp(f), s(x) = T(f(x)).
One can note that, as expected, the density f and strictly increasing transforms of it belong to S*. More generally, and roughly speaking, an optimal scoring function is a strictly increasing transform of the density, Leb-almost everywhere on the support of the density. Note also that, on the one hand, if x lies in supp(f), the last two conditions give s(x) = T(f(x)) > c for almost all such x (because f(x) > 0 up to subsets of null measure, as shown in Lemma 1); therefore s > c Leb-almost everywhere on supp(f). On the other hand, the first condition gives s ≤ c on supp(f)^c, and therefore s ≤ c Leb-almost everywhere on supp(f)^c. Thus the values of s on supp(f)^c are Leb-almost everywhere lower than the values of s on supp(f), where s is almost everywhere a strictly increasing transform of the density f.
The result below shows that optimal scoring functions are those whose MV curves are minimum everywhere.
From now on, we will thus use MV* to denote the MV curve of elements of S*. The proof of Proposition 1 can be found in Appendix A. Incidentally, notice that, equipped with the notations introduced in Section 2.2, we have MV*(α) = λ*(α) for all α ∈ (0, 1). We also point out that bound (3) reveals that the pointwise difference between the optimal curve MV* and that of a candidate scoring function s is controlled by the error made in recovering the specific minimum volume set Ω*_α through the corresponding level set of s.
In the framework we develop, anomaly scoring boils down to recovering the decreasing collection of all level sets of the density function f, namely {x : f(x) ≥ t} for t ∈ (0, ||f||∞), without necessarily knowing the corresponding levels. Indeed, one may check that any scoring function of the form

s(x) = ∫ 1{f(x) ≥ t} μ(dt),    (4)

where μ is an arbitrary finite positive measure dominating the Lebesgue measure on (0, ||f||∞), belongs to S*. Observe that s = f corresponds to the case where μ is chosen to be the Lebesgue measure on (0, ||f||∞) in Eq. (4). The anomaly scoring problem can thus be cast as an overlaid collection of minimum volume set estimation problems. This observation shall turn out to be very useful when designing practical statistical learning strategies in Section 6.
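To illustrate how an overlaid collection of (estimated) minimum volume sets yields a scoring function, here is a toy sketch: the nested sets are hypothetical intervals, and the piecewise constant score of a point is simply the number of sets containing it, mimicking a discrete version of Eq. (4):

```python
# Hypothetical nested level sets of a univariate density: intervals [-r, r]
radii = [0.5, 1.0, 2.0, 3.0]   # increasing radii, hence nested sets

def scoring(x, radii=radii):
    """Piecewise constant score: the number of nested sets containing x.
    The deeper x sits inside the collection, the larger (more 'normal') its score."""
    return sum(abs(x) <= r for r in radii)

print(scoring(0.2), scoring(1.5), scoring(5.0))   # 4 2 0
```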
We point out that the optimal curve MV* provides a measure of the mass concentration of the random variable X: the lower the curve MV*, the more concentrated the distribution of X. Indeed, the excess mass functional introduced in (Muller and Sawitzki, 1991) can be expressed as E(t) = sup_{α ∈ (0,1)} {α − t·MV*(α)} for t > 0. Hence, we have MV*(α) ≥ (α − E(t))/t for all t > 0 and all α ∈ (0, 1). One may refer to (Polonik, 1995) for results related to the statistical estimation of density contour clusters and applications to multimodality testing using the excess mass functional.
(Univariate Gaussian distribution) Let X be a random variable with Gaussian distribution N(μ, σ²); we have MV*(α) = 2σΦ^{-1}((1 + α)/2) for all α ∈ (0, 1), where Φ is the cumulative distribution function of the standard Gaussian distribution N(0, 1).
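A quick numerical check of this closed form (a sketch assuming MV*(α) = 2σΦ⁻¹((1 + α)/2), which follows from the symmetry of the Gaussian density around its mode): the centered interval of that length should indeed carry mass α:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 2.0

def mv_star(alpha):
    # optimal MV curve of N(mu, sigma^2): the volume 2*sigma*Phi^{-1}((1+alpha)/2)
    # of the centered interval holding mass alpha (assumed closed form)
    return 2 * sigma * norm.ppf((1 + alpha) / 2)

# Monte-Carlo sanity check: the interval of that length indeed has mass alpha
rng = np.random.default_rng(0)
X = rng.normal(mu, sigma, size=200_000)
alpha = 0.9
half_width = mv_star(alpha) / 2
print(np.mean(np.abs(X - mu) <= half_width))   # close to 0.9
```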
(Multivariate Gaussian distribution) Let X be a multivariate random variable with Gaussian distribution N(μ, Σ) on R^d. We assume that Σ is a diagonal matrix and we denote by σ_1², ..., σ_d² its diagonal coefficients. The density level set of mass α is exactly the set {x : Σ_{i≤d} (x_i − μ_i)²/σ_i² ≤ r_α} for some r_α ≥ 0. It is well known that Σ_{i≤d} (X_i − μ_i)²/σ_i² follows a χ²_d distribution, the χ² distribution with d degrees of freedom. Thus r_α = q_α, denoting by q_α the quantile of order α of the χ²_d distribution. The set is exactly the ellipsoid with semi-principal axes of length σ_i √q_α, 1 ≤ i ≤ d. The formula for the volume of an ellipsoid gives MV*(α) = π^{d/2} q_α^{d/2} (Π_{i≤d} σ_i) / Γ(d/2 + 1), where Γ is the gamma function.
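The ellipsoid-volume formula can be checked numerically as follows (a sketch with illustrative choices: d = 2, a diagonal covariance with standard deviations 1 and 2, and a Monte-Carlo check of the ellipsoid's mass):

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gamma as gamma_fn

def mv_star_gaussian(alpha, sigmas):
    """Volume of the ellipsoid {sum_i x_i^2/sigma_i^2 <= q_alpha}, with q_alpha
    the chi-squared quantile of order alpha with d = len(sigmas) degrees of freedom."""
    d = len(sigmas)
    q = chi2.ppf(alpha, df=d)
    return np.pi ** (d / 2) / gamma_fn(d / 2 + 1) * np.prod(sigmas) * q ** (d / 2)

# Monte-Carlo sanity check of the mass of that ellipsoid (centered, diagonal case)
rng = np.random.default_rng(0)
sigmas = np.array([1.0, 2.0])
X = rng.normal(size=(200_000, 2)) * sigmas
alpha = 0.9
q = chi2.ppf(alpha, df=2)
inside = np.sum((X / sigmas) ** 2, axis=1) <= q
print(inside.mean())   # close to 0.9
```

For d = 2 the formula reduces to the area of an ellipse, π σ_1 σ_2 q_α, which the function reproduces exactly.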
The following result reveals that the optimal curve MV* is convex and provides a closed analytical form for its derivative. The proof is given in Appendix A.
The convexity of MV* might be explained as follows: when considering density level sets with increasing masses α, their volumes increase as well, but they increase more and more quickly, since the mass of the distribution becomes less and less concentrated.
Elementary properties of MV curves are summarized in the following proposition.
(Properties of MV curves) For any s ∈ S, the following assertions hold true.
Invariance. For any strictly increasing function T, we have MV_{T∘s} = MV_s.
Monotonicity. The mapping α ↦ MV_s(α) is increasing.
The invariance assertion may be proved using the following property of the quantile function (see e.g. Property 2.3 in (Embrechts and Hofert, 2013)): for any cumulative distribution function F, any x ∈ R and any u ∈ (0, 1), F^{-1}(u) ≤ x if and only if u ≤ F(x). The monotonicity assertion derives from the fact that the quantile function is increasing. Details are left to the reader.
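The quantile duality invoked here (F⁻¹(u) ≤ x if and only if u ≤ F(x)) can be checked mechanically on an empirical cdf; the helper functions below are illustrative:

```python
import numpy as np

def F(x, sample):
    """Empirical cumulative distribution function."""
    return np.mean(np.asarray(sample) <= x)

def F_inv(u, sample):
    """Generalized inverse: inf{t : F(t) >= u}."""
    xs = np.sort(np.asarray(sample, dtype=float))
    k = max(int(np.ceil(len(xs) * u)), 1)
    return xs[k - 1]

sample = [0.5, 1.5, 2.5, 3.5]
# duality: F_inv(u) <= x  if and only if  u <= F(x)
for u in (0.1, 0.3, 0.6, 0.9):
    for x in sample:
        assert (F_inv(u, sample) <= x) == (u <= F(x, sample))
print("quantile duality verified")
```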
4 Statistical Estimation
In practice, MV curves are unknown, just like the probability distribution of the random variable X, and must be estimated based on the observed sample X_1, ..., X_n. Replacing the mass α_s(t) of each level set by its statistical counterpart in Definition 2 leads to the notion of empirical MV curve. We set, for all t ≥ 0,

α̂_s(t) = (1/n) Σ_{i=1}^n 1{s(X_i) ≥ t},    (5)

i.e. the mass of the level set under the empirical distribution of the X_i's. Notice that α̂_s takes its values in the set {0, 1/n, 2/n, ..., 1}.
Definition 4. (Empirical MV curve) Let s ∈ S. By definition, the empirical MV curve of s is the graph of the (piecewise constant) function α ↦ λ_s(α̂_s^{-1}(α)).
(On volume estimation). Except for very specific choices of the scoring function (e.g. when is piecewise constant and the volumes of the subsets of on which is constant can be explicitly computed), no closed analytic form for the volume is available in general and Monte-Carlo procedures should be used to estimate it (which may be practically challenging in a high-dimensional setting). Computation of volumes in high dimensions is an active topic of research (Lovász and Vempala, 2006) and for simplicity, this is not taken into account in the subsequent analysis.
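Putting the two ingredients together, here is a sketch of the empirical MV curve with a Monte-Carlo volume estimate (the scoring function, bounding box and sample sizes are illustrative; in high dimension, the uniform-sampling step is precisely the hard part the remark above points to):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # hypothetical scoring function on R^2 (proportional to a Gaussian density)
    return np.exp(-0.5 * np.sum(x**2, axis=-1))

X = rng.normal(size=(50_000, 2))               # observed sample
U = rng.uniform(-6.0, 6.0, size=(200_000, 2))  # uniform draws for volume estimation
box_volume = 12.0 ** 2

s_X, s_U = score(X), score(U)

def empirical_mv(alpha):
    """Empirical MV curve at mass alpha: Monte-Carlo volume of {s >= t},
    with t the empirical quantile of order 1 - alpha of the scores."""
    t = np.quantile(s_X, 1 - alpha)
    return box_volume * np.mean(s_U >= t)

# for this score, the exact value at alpha = 0.5 is 2*pi*log(2), about 4.36
print(empirical_mv(0.5))
```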
In order to obtain a smoothed version of the empirical MV curve, a typical strategy consists in replacing the empirical distribution estimate involved in (5) by the continuous distribution with density x ↦ (1/(nh)) Σ_{i=1}^n K((x − s(X_i))/h), where K is a regularizing Parzen-Rosenblatt kernel (i.e. a bounded, square-integrable function such that ∫ K(u) du = 1) and h > 0 is the smoothing bandwidth (see for instance (Wand and Jones, 1994)).
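For instance, with a Gaussian kernel, the corresponding smoothed estimate of the distribution function of the scores takes the following form (the toy data and the bandwidth choice are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
scores = rng.normal(size=2000)   # toy scores s(X_i)

h = len(scores) ** (-1 / 5)      # illustrative bandwidth choice

def smoothed_cdf(t):
    """Kernel-smoothed estimate of P(s(X) <= t), Gaussian kernel: the integral
    of the Parzen-Rosenblatt density estimate up to t."""
    return np.mean(norm.cdf((t - scores) / h))

print(smoothed_cdf(0.0))   # close to 0.5 for this symmetric toy sample
```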
4.1 Consistency and Asymptotic Normality
The theorem below reveals that, under mild assumptions, the empirical MV curve is a consistent and asymptotically Gaussian estimate of the true MV curve, uniformly over any compact subinterval of (0, 1). It involves the assumptions listed below.
The scoring function s is bounded: ||s||∞ < +∞.
The random variable s(X) has a continuous cumulative distribution function F_s. Let a = inf{t : F_s(t) > 0} and b = sup{t : F_s(t) < 1}. The distribution function F_s is twice differentiable on (a, b) and its derivative f_s is strictly positive there.
There exists γ > 0 such that sup_{a<t<b} F_s(t)(1 − F_s(t)) |f_s'(t)| / f_s(t)² ≤ γ.
The density f_s has a limit when t tends towards b from the left: lim_{t→b−} f_s(t) exists.
The mapping is of class .
Assumption 4.1 implies that exists and . Note also that if is not bounded, one can always consider . Indeed takes its values in and as is strictly increasing thanks to Proposition 3. Assumptions 4.1 and 4.1 are common assumptions for the strong approximation of the quantile process (see for instance (Csörgő and Révész, 1978)). Condition (6) and assumption 4.1 are respectively equivalent to for all and
Eventually, assumption 4.1 is equivalent to: has a limit when tends towards . The density can therefore be extended by continuity to the interval . As , (6) can then be replaced by for all which also gives for all . One can also note that is thus invertible on with inverse .
In order to state part of the following theorem, we assume that the probability space is rich enough in the sense that an infinite sequence of Brownian bridges can be defined on it.
(Consistency) With probability one, we have uniformly over :
(Strong approximation) There exists a sequence of Brownian bridges such that we almost surely have, uniformly over the compact interval : as ,
The technical proof is given in Appendix A. One can note from the proof of assertion that this assertion does not require assumption 4.1 to be satisfied. The proof of assertion relies on standard strong approximation results for the quantile process (Csörgő and Révész, 1978, 1981; Csörgő, 1983). Assertion means that the fluctuation process converges in the space of càd-làg functions on equipped with the sup norm, to the law of a Gaussian stochastic process .
Assumption 4.1 is required in order to state the results of assertions and on the interval . However this assumption is restrictive as it implies that is discontinuous at since whereas for all . Instead of assumption 4.1, if one assumes that is decreasing on an interval to the left of , then the result of assertion can be obtained on the interval with the same rate of convergence if and with the same rate of convergence up to and factors if . In this case, assertion also holds on (see for instance (Csörgő and Révész, 1978)).
4.2 Confidence Regions in the Mass Volume Space
The true MV curve of a given scoring function is unknown in practice and its performance must be statistically assessed based on a data sample. Beyond consistency of the empirical MV curve in sup norm and the asymptotic normality of the fluctuation process, we now tackle the question of constructing confidence bands in the Mass Volume space.
Based on a sample X_1, ..., X_n, a (random) confidence region for the MV curve of a given scoring function at confidence level 1 − η is any Borel set of the Mass Volume space (possibly depending on the sample) that covers the MV curve with probability larger than 1 − η:
In practice, confidence regions shall be of the form of balls in the Skorohod space of càd-làg functions, with respect to the sup norm and with the estimate introduced in Definition 4 as center, for some fixed radius:
Constructing confidence regions based on the approximation stated in Theorem 1 would require knowing the density f_s. Hence a bootstrap approach (Efron, 1979) should be preferred. Following in the footsteps of Silverman and Young (1987), it is recommended to implement a smoothed bootstrap procedure. The asymptotic validity of such a resampling method derives from a strong approximation result similar to that of Theorem 1. Let the fluctuation process be defined for all α by
The bootstrap approach suggests considering, as an estimate of the law of the fluctuation process, the conditional law given the original sample of the naive bootstrapped fluctuation process
where, given the original sample, the bootstrapped curve is the empirical MV curve of the scoring function based on a sample of n i.i.d. random variables drawn from the empirical distribution. The difficulty is twofold. First, the target is a distribution on a path space, namely a subspace of the Skorohod space equipped with the sup norm. Second, the fluctuation process is a functional of the quantile process. The naive bootstrap, which consists in resampling from the raw empirical distribution, generally provides poor approximations of the distribution of empirical quantiles: the rate of convergence for a given quantile is indeed of order n^{-1/4} (Falk and Reiss, 1989), whereas the rate of the Gaussian approximation is n^{-1/2} (see (25) in Appendix A). The same phenomenon may naturally be observed for MV curves. In a similar manner to what is usually recommended for empirical quantiles, a smoothed version of the bootstrap algorithm shall be implemented in order to improve the approximation rate, namely, resampling the data from a smoothed version of the empirical distribution. We thus consider the smooth bootstrapped fluctuation process
where, given the original sample, the bootstrapped curve is the empirical MV curve of the scoring function based on a sample of n i.i.d. random variables drawn from the smoothed empirical distribution, and where the center is the smooth version of the empirical MV curve, built from the generalized inverse of the smoothed empirical distribution function. The algorithm for building a confidence band at level 1 − η in the Mass Volume space from sampling data is described in Algorithm 1.
Before turning to the theoretical analysis of this algorithm, its description calls for a few comments. From a computational perspective, the smoothed bootstrap distribution must be approximated in its turn, by means of a Monte-Carlo approximation scheme. Based on the B bootstrap fluctuation processes obtained, the radius then coincides with the empirical quantile at level 1 − η of the statistical population of their sup norms. Concerning the number B of bootstrap replications, picking B of the order of n does not modify the rate of convergence. However, choosing B so that (1 − η)(B + 1) is an integer may be more appropriate: the (1 − η)-quantile of the approximate bootstrap distribution is then uniquely defined, and this does not impact the rate of convergence either (Hall, 1986).
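The Monte-Carlo loop just described can be sketched as follows in a toy one-dimensional setting. We bootstrap the empirical quantile curve of the scores (the ingredient driving the MV-curve fluctuations), and every choice below, grid, bandwidth, kernel and number of replications B, is an illustrative assumption rather than the paper's Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)   # scores s(X_i) of the observed sample (toy data)
n = len(scores)
h = n ** (-1 / 5)                # illustrative smoothing bandwidth

alphas = np.linspace(0.1, 0.9, 50)   # grid over the mass axis

def quantile_curve(sample):
    # empirical quantile curve of the scores: the ingredient that drives
    # the fluctuations of the empirical MV curve
    return np.quantile(sample, alphas)

center = quantile_curve(scores)

B = 200                          # number of bootstrap replications
sup_devs = np.empty(B)
for b in range(B):
    # smoothed bootstrap: resample with replacement, then add Gaussian kernel noise
    boot = rng.choice(scores, size=n, replace=True) + h * rng.standard_normal(n)
    sup_devs[b] = np.max(np.abs(quantile_curve(boot) - center))

eta = 0.05
radius = np.quantile(sup_devs, 1 - eta)   # radius of the sup-norm confidence band
band_lower, band_upper = center - radius, center + radius
print(radius)
```

Adding kernel noise to the resampled points is exactly what distinguishes the smoothed bootstrap from the naive one: it amounts to drawing from the kernel-smoothed empirical distribution.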
4.2.1 Bootstrap consistency
The next theorem establishes the asymptotic validity of the bootstrap estimate proposed above, where we assume that the smoothed version of the distribution function is computed at step 1 of Algorithm 1 using a kernel K. It requires the following assumptions.
The density is bounded and of class .
The bandwidth decreases towards as in a way that , and .
The kernel has finite support and satisfies the following conditions: , and .
The kernel is such that and is of the form , being a polynomial and a bounded real function of bounded variation.
The kernel is differentiable with derivative such that and of the form , being a polynomial and a bounded real function of bounded variation.
The bandwidth is such that tends to 0 as .
Assumptions 4.2.1 and 4.2.1 are needed to control the biases and . Assumption 4.2.1 on the bandwidth and assumption 4.2.1 (respectively assumption 4.2.1) are needed to control (respectively ) thanks to the result of Giné and Guillou (2002). Assumption 4.2.1 ensures that tends to almost surely as . Assumptions 4.2.1-4.2.1 are fulfilled, for instance, by the biweight kernel defined for all u ∈ R by: K(u) = (15/16)(1 − u²)² 1{|u| ≤ 1}.
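A quick numerical check that the biweight kernel satisfies the basic moment conditions listed above (unit integral, zero first moment); the Riemann-sum discretization is illustrative:

```python
import numpy as np

def biweight(u):
    """Biweight (quartic) kernel K(u) = (15/16) * (1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

# numerical check of the moment conditions: unit integral, zero first moment
u = np.linspace(-1, 1, 200_001)
du = u[1] - u[0]
print(np.sum(biweight(u)) * du)       # close to 1
print(np.sum(u * biweight(u)) * du)   # close to 0
```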