A Rank-SVM Approach to Anomaly Detection

05/02/2014 ∙ by Jing Qian, et al. ∙ Boston University

We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on rank-SVM. Data points are first ranked based on scores derived from nearest neighbor graphs on n-point nominal data. We then train a rank-SVM using this ranked data. A test-point is declared as an anomaly at alpha-false alarm level if the predicted score is in the alpha-percentile. The resulting anomaly detector is shown to be asymptotically optimal and adaptive in that for any false alarm rate alpha, its decision region converges to the alpha-percentile level set of the unknown underlying density. In addition we illustrate through a number of synthetic and real-data experiments both the statistical performance and computational efficiency of our anomaly detector.


I Introduction

Anomaly detection is the problem of identifying statistically significant deviations of data from expected normal behavior. It has found wide application in areas such as credit card fraud detection, intrusion detection for cyber security, sensor networks, and video surveillance [1][2]. In this paper we focus on the setting of point-wise anomaly detection. At training time only a set of nominal examples drawn i.i.d. from an unknown "nominal" distribution is given. Note that in this paper nominal density or distribution refers to the distribution of nominal samples, not the Gaussian distribution. Our objective is to learn an anomaly detector that maximizes the detection power subject to a false-alarm rate constraint.

Existing approaches to point-wise anomaly detection can be divided into two categories, namely parametric and non-parametric methods. Classical parametric methods [3] assume a family of functions that characterize the nominal density; parameters are then estimated from training data by minimizing some loss function. While these methods provide a statistically justifiable solution when the assumptions hold true, they are likely to suffer from model mismatch and lead to poor performance.

Popular non-parametric approaches include one-class support vector machines (SVM) [4] and various density-based algorithms [5, 6, 7, 8, 9]. The kernel one-class SVM algorithm attempts to find a decision boundary by mapping the nominal data into a high-dimensional kernel space and separating the image from the origin with maximum margin. While one-class SVM is computationally efficient, it does not directly incorporate density information and can exhibit poor control over false alarm rates. Density-based methods such as minimum volume set estimation [5, 10] and geometric entropy minimization (GEM) [6] involve explicitly approximating high-dimensional quantities such as the multivariate density function or the minimum volume set boundaries, which can be computationally prohibitive for high dimensional problems. Nearest neighbor-based approaches [7, 9, 8] estimate the p-value function through some statistic based on k-nearest neighbor (k-NN) distances within the graph constructed from nominal points. While allowing flexible control over false alarm rates and often providing better performance than one-class SVM, these approaches usually require expensive computations at the test stage, such as calculating the k-NN distances of the test point, which makes them inapplicable for tasks requiring real-time processing.

In this paper we propose a novel rank-SVM based anomaly detection algorithm that combines the computational efficiency of the simple one-class SVM approach with the statistical performance of nearest neighbor-based methods. Our approach learns a ranker over nominal samples through a "supervised" learning-to-rank step, for which we adopt the pair-wise rank-SVM algorithm. Nominal density information based on k-NN distances is directly incorporated as input pairs in this pair-wise learning-to-rank step. Each input pair (x_i, x_j) is assigned a binary value, with zero denoting that point x_i is more of an outlier relative to point x_j and one representing the opposite scenario. The learning step trains a ranker which predicts how anomalous a point is. At the test stage our detector labels a test point as an anomaly if the predicted rank score is in the α-percentile among the training data. We then show asymptotic consistency and present a finite-sample generalization result for our ranking-based anomaly detection framework. We conduct experiments on both synthetic and real data sets to demonstrate superior performance in comparison to other methods.

We summarize the advantages of our proposed anomaly detection approach below.

  • Computational Efficiency: During the test stage, our approach only needs to evaluate an SVM-type function at the test point, similar to the simple one-class SVM approach. In contrast, nearest neighbor-based methods generally require computing distances between the test point and all training points to determine whether or not the test point is anomalous.

  • Statistical Performance: Our discriminative learning-to-rank step leads to a ranking-based anomaly detection framework that is asymptotically consistent. We can also guarantee a finite sample bound on the empirical false alarm rate of our decision rule.

  • Adaptivity: Our method adapts to any false alarm level because it can asymptotically approximate different level sets of the underlying density function. While the threshold parameter of one-class SVM can be modified to obtain detectors for different false alarm levels, this often does not result in optimal performance.

The rest of the paper is organized as follows. In Section 2 we introduce the problem setting and the motivation. Detailed algorithms are described in Section 3. The asymptotic and finite-sample analyses are provided in Section 4. Synthetic and real experiments are reported in Section 5. Section 6 concludes the paper.

II Problem Setting and Motivation

We closely follow the setting of [4]. Let x_1, ..., x_n ∈ R^d be a given set of nominal points sampled i.i.d. from an unknown density f with compact support in R^d. Let P be the corresponding probability measure. We are interested in estimating the α-percentile minimum volume set with reference measure μ:

    U_α = argmin_C { μ(C) : P(C) ≥ 1 − α, C measurable }.    (1)

The most common choice of μ is the Lebesgue measure [4], in which case U_α represents the minimum volume set that captures at least a 1 − α fraction of the probability mass. Its meaning in outlier/anomaly detection is that, for a given test point η, the following decision rule naturally maximizes the detection power¹ while controlling the false alarm rate at level α:

    declare η anomalous if and only if η ∉ U_α.    (2)

¹ Such a rule maximizes the detection rate with respect to the reference measure μ. Usually no anomalous samples are available during training; the common choice of the Lebesgue measure as μ corresponds to assuming a uniform anomaly distribution. It is shown that the decision rule (2) is the uniformly most powerful (Neyman-Pearson) test [7].

The following lemma restates and simplifies the above optimal decision rule in terms of the p-value function, which will be the main focus of this paper.

Lemma 1.

Assume the density f has no "flat region" on its support, i.e., P({x : f(x) = c}) = 0 for every c > 0. Then the decision rule Eq.(2) is equivalent to:

    declare η anomalous if and only if p(η) ≤ α,    (3)

where p(η) is the p-value function defined as:

    p(η) = P({x : f(x) ≤ f(η)}) = ∫_{x : f(x) ≤ f(η)} f(x) dx.    (4)

The proof is straightforward. [5, 11] have shown that, given the non-flat assumption, the minimum volume set coincides with the α-quantile density level set:

    U_α = {x : f(x) ≥ γ_α},  where γ_α satisfies P({x : f(x) ≤ γ_α}) = α,    (5)

thus η ∉ U_α is equivalent to f(η) < γ_α, which in turn yields p(η) ≤ α and leads to Eq.(3).

The main concern now is to estimate p(η). It is worth mentioning that instead of estimating one particular minimum volume set as in one-class SVM [4], we aim to estimate the p-value function, which yields different decision regions for different values of α. This point will be illustrated later. We are motivated to learn a ranker on nominal samples based on the following lemma:

Lemma 2.

Assume G(·) is any function that gives the same order relationship as the density: for any x_i, x_j, G(x_i) > G(x_j) if and only if f(x_i) > f(x_j). Then as n → ∞, the rank of η,

    r_n(η) = (1/n) Σ_{i=1}^n 1{G(x_i) ≤ G(η)},    (6)

converges to p(η).

By assumption we have {x : G(x) ≤ G(η)} = {x : f(x) ≤ f(η)}; the proof is just an application of the law of large numbers.

The lemma shows that if some statistic G preserves the ordering of the density on the data set, the rank r_n asymptotically converges to the p-value function p. There are a number of choices for the statistic G. One could of course compute density estimates and plug them into Eq.(6) as G. On the other hand, nearest neighbor-based approaches [7, 9, 8] propose simple statistics based on the k-nearest neighbor (k-NN) distance or the ε-neighborhood size among the nominal set as surrogates, which are shown to be asymptotically consistent in the sense of Eq.(6). Other techniques, such as the k-NN geodesic distance of Isomap [12] or the k-NN distance after Local Linear Embedding (LLE) [13], can adapt to intrinsic dimension instead of ambient dimension. Recently, k-NN approaches that are customized to each data point have also been shown to achieve this [14]. It should be noted that although choosing G is important, it is not our main focus; in fact our method works in conjunction with these techniques. For simplicity of exposition we restrict ourselves to the k-NN statistic in this paper.
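To make Lemma 2 concrete, the following small numerical check (our own illustration, not part of the paper) applies Eq.(6) to a hypothetical one-dimensional standard normal sample, using the order-preserving statistic G(t) = −|t|; the empirical rank should approach the true p-value.

```python
import math
import numpy as np

# Illustration of Lemma 2 on a hypothetical 1-D example (not from the paper):
# any statistic G that preserves the density ordering gives ranks that
# converge to the p-value function.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(0.0, 1.0, size=n)      # nominal samples from N(0, 1)

G = lambda t: -np.abs(t)              # orders points exactly like the N(0,1) density

eta = 1.5                             # test point
rank = np.mean(G(x) <= G(eta))        # empirical rank, Eq.(6)

# True p-value for N(0,1): P(f(X) <= f(eta)) = P(|X| >= |eta|)
p_true = math.erfc(abs(eta) / math.sqrt(2.0))
print(rank, p_true)                   # the two values should be close
```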

The main issue is that we would like to avoid computing the k-NN distance statistic (or other, more complicated choices of G) for the test point at test stage, because the complexity grows with the number of nominal points n, which can be prohibitive for real-time applications. To this end, we learn a simple scoring function g(·) that best respects the observed pair-wise preference relationships given by G on the nominal samples. This is achieved via a supervised pair-wise learning-to-rank framework. The inputs are preference pairs encoding the ground-truth order relationship information. In our setting, we generate preference pairs based on the average k-NN distance statistic, which has been shown [8] to be robust and to have good convergence rates. In this way, nominal density information is incorporated in the input pairs.

Now that we have the preference pairs, the next step is to learn a ranker by minimizing the pair-wise disagreement loss function. In our work we adopt the rank-SVM method to obtain a ranker g(·). Intuitively, g(x) scores how "nominal" the point x is, and is simple to evaluate relative to density-based counterparts. This leads to reduced complexity during the test stage. g is then adopted in place of G to compute the rank according to Eq.(6), which only requires a bisection search among the sorted values g(x_i) of the nominal samples.

III Rank-Based Anomaly Detection Algorithm

III-A Anomaly Detection Algorithm

We describe the main steps of the algorithm:

III-A1 Rank Computation

For each nominal training sample x_i, let D_(j)(x_i) denote the distance from x_i to its j-th nearest neighbor among the other nominal samples. We use the negated average k-NN distance as the statistic:

    G(x_i) = − (1/k) Σ_{j=1}^k D_(j)(x_i),    (7)

for some suitable k, so that larger G corresponds to denser regions. We plug G into Eq.(6) and compute the ranks r_n(x_i) of the nominal points.
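A minimal sketch of this rank computation step is given below, assuming the negated average k-NN distance as the statistic of Eq.(7); the scikit-learn based implementation and the function names are our own illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def avg_knn_statistic(X, k):
    """Negated average k-NN distance for each nominal sample (Eq. (7) sketch):
    larger values correspond to denser regions."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1: each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    return -dist[:, 1:].mean(axis=1)

def empirical_ranks(G_vals):
    """Rank of each nominal sample among the nominal set, Eq. (6)
    (ignoring ties, which almost surely do not occur for continuous data)."""
    n = len(G_vals)
    positions = np.argsort(np.argsort(G_vals))        # 0-based position in sorted order
    return (positions + 1) / n
```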

III-A2 Learning-to-Rank

From Step 1 our training set is now {(x_i, r_n(x_i))}_{i=1}^n. Next we want to learn a ranker g(·) that outputs an ordinal value g(x_i) for each x_i and maximally preserves the ordering of the ranks r_n(x_i). We adopt the pairwise learning-to-rank framework, where the input is a collection of preference pairs P; each input pair (x_i, x_j) ∈ P represents the relationship r_n(x_i) > r_n(x_j), i.e., the learned ranker should satisfy g(x_i) > g(x_j). The goal is to minimize some loss function, for example the weighted pairwise disagreement loss (WPDL),

    L(g) = Σ_{(i,j) ∈ P} w_{ij} 1{g(x_i) ≤ g(x_j)}.    (8)

We adopt the rank-SVM algorithm to train our ranker, with equal weight w_{ij} = 1 for all pairs, and solve the following optimization problem:

    min_{w, ξ}  (1/2)||w||² + C Σ_{(i,j) ∈ P} ξ_{ij}
    s.t.  ⟨w, Φ(x_i) − Φ(x_j)⟩ ≥ 1 − ξ_{ij},  ξ_{ij} ≥ 0,  ∀ (i,j) ∈ P,    (9)

where Φ is a mapping into a reproducing kernel Hilbert space with inner product ⟨·,·⟩, and the ranker is g(x) = ⟨w, Φ(x)⟩. Rank-SVM minimizes the WPDL with the indicator replaced by the hinge loss. Details about rank-SVM can be found in [15].
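The sketch below illustrates one way to solve a kernelized version of Eq.(9): the constrained problem is rewritten in its equivalent pairwise hinge-loss form and minimized by plain subgradient descent over an RBF kernel expansion. The paper adapts an existing rank-SVM solver (see [15], [19]); the optimizer, step size and function names here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def train_rank_svm_rbf(X, pairs, C=1.0, gamma=1.0, lr=1e-3, n_iter=500):
    """Sketch of a kernelized rank-SVM for Eq. (9).

    Represents w = sum_m beta_m Phi(x_m), so g(x) = sum_m beta_m k(x_m, x),
    and minimizes 0.5 * beta' K beta + C * sum_{(i,j)} max(0, 1 - (g(x_i) - g(x_j)))
    by subgradient descent.  Pairs (i, j) encode 'x_i should score higher than x_j'.
    """
    K = rbf_kernel(X, X, gamma=gamma)
    beta = np.zeros(X.shape[0])
    pairs = np.asarray(pairs)
    for _ in range(n_iter):
        g = K @ beta
        viol = (g[pairs[:, 0]] - g[pairs[:, 1]]) < 1.0   # pairs with active hinge loss
        grad = K @ beta                                  # gradient of the regularizer
        if viol.any():
            i_idx, j_idx = pairs[viol, 0], pairs[viol, 1]
            grad -= C * (K[:, i_idx].sum(axis=1) - K[:, j_idx].sum(axis=1))
        beta -= lr * grad
    score = lambda Z: rbf_kernel(Z, X, gamma=gamma) @ beta   # the ranker g(.)
    return beta, score
```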

Remark 1

Given the ranks of the nominal samples, we find in practice that generating all preference pairs for the rank-SVM algorithm often leads to poor detection performance due to overfitting, not to mention the high training complexity. In our experiments the following scheme is adopted: we first quantize all the ranks into m levels. A preference pair (x_i, x_j) is then generated for every pair of samples whose quantized levels differ, indicating that the sample with the lower level is "less nominal", and thus "more anomalous", than the one with the higher level. This scheme with a relatively small m significantly reduces the number of input pairs to the rank-SVM algorithm and the training time, and it results in better empirical performance as well. While this raises the question of choosing m, we find m = 3 works fairly well in practice and we fix this value in all of our experiments in Sec. V. Other similar schemes can be used to select preference pairs, for example inputting only pairs with significant rank differences to the rank-SVM.
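A sketch of this pair-generation scheme is given below, under the assumption that a pair is emitted whenever two samples fall into different quantization levels; the exact pairing rule and tie handling used in the paper may differ.

```python
import numpy as np

def make_preference_pairs(ranks, m=3):
    """Quantize ranks into m levels and emit a pair (i, j) whenever sample i
    sits at a strictly higher (more nominal) level than sample j.
    Sketch of the scheme in Remark 1."""
    ranks = np.asarray(ranks)
    edges = np.quantile(ranks, np.linspace(0.0, 1.0, m + 1)[1:-1])
    levels = np.digitize(ranks, edges)          # 0 = most anomalous, m-1 = most nominal
    idx = np.arange(len(ranks))
    return [(i, j) for i in idx for j in idx if levels[i] > levels[j]]
```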

Remark 2

We adopt the RBF kernel for rank-SVM. The algorithm parameter C and the RBF kernel bandwidth can be selected through cross validation, since this rank-SVM step is a supervised learning procedure on the input pairs.

III-A3 Prediction

At test time, the ordinal value g(η) for the test point η is first computed. Then the rank r_n(η) is estimated using Eq.(6), with G replaced by g. If r_n(η) falls below the false alarm level α, an anomaly is declared.
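A sketch of this prediction step is shown below, reusing a trained scoring function such as the one returned by the rank-SVM sketch above; the helper names are our own.

```python
import numpy as np

def predict_anomaly(score_fn, sorted_train_scores, X_test, alpha=0.05):
    """Test stage sketch: declare anomalies at false alarm level alpha.

    score_fn            -- the trained ranker g(.), mapping an array of points to scores
    sorted_train_scores -- g(x_1), ..., g(x_n) on the nominal set, sorted ascending
    Returns a boolean array, True = declared anomalous."""
    n = len(sorted_train_scores)
    g_test = score_fn(X_test)
    # Eq. (6) with G replaced by g: fraction of nominal scores <= g(eta),
    # found by binary search in the sorted array (O(log n) per test point).
    ranks = np.searchsorted(sorted_train_scores, g_test, side='right') / n
    return ranks <= alpha
```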

Our algorithm is summarized as follows:

  Algorithm 1: Ranking Based Anomaly Detection (rankAD)

  1. Input: nominal training data {x_1, ..., x_n}, desired false alarm level α, and test point η.

  2. Training Stage:
  (a) Calculate G(x_i) and the rank r_n(x_i) for each nominal sample x_i, using Eq.(7) and Eq.(6).
  (b) Quantize the ranks into m levels. Generate a preference pair (x_i, x_j) whenever the quantized level of x_i is higher than that of x_j.
  (c) Train a ranker g(·) through rank-SVM.
  (d) Compute g(x_i) for i = 1, ..., n, and sort these values.

  3. Testing Stage:
  (a) Evaluate g(η) for the test point η.
  (b) Compute the rank r_n(η) according to Eq.(6), replacing G with g.
  (c) Declare η anomalous if r_n(η) ≤ α.

 

III-B Comparison With State-of-the-Art Algorithms

We compare our approach against one-class SVM and density-based algorithms in terms of false alarm control and test-stage complexity.

III-B1 False Alarm Control

One-class SVM does not have any natural control over the false alarm rate. Usually the parameter ν is varied to target a different false alarm level, which requires re-solving the optimization problem. This is because one-class SVM aims at approximating one level set at a time. While our method also involves an SVM learning step, our approach is substantially different from one-class SVM. The ranker from our rank-SVM step simultaneously approximates multiple level sets. The normalized score of Eq.(6) takes values in [0, 1] and converges to the p-value function, so we get a handle on the false alarm rate: the null hypothesis can be rejected at different levels α simply by thresholding r_n(η).

Fig. 1: Level curves of one-class SVM (a) and rank-SVM (b). 1000 i.i.d. samples are drawn from a 2-component Gaussian mixture density. (a) shows level curves obtained by varying the offset of the one-class SVM predictor. Only the outermost curve approximates the oracle density level set well, while the inner curves appear to be scaled versions of the outermost curve. (b) shows level curves obtained by varying the threshold on our rank-SVM predictor. Notice that the innermost curve approximates the peaks of the mixture density.

Toy Example: We present a simple example in Fig. 1 to demonstrate this point. The nominal density is a two-component Gaussian mixture. A one-class SVM with RBF kernel is trained on the 1000 nominal samples to yield a decision function f_s. The standard rule declares an anomaly when f_s(x) < 0, corresponding to the outermost orange curve in (a). We then plot different level curves by varying the offset of f_s, and they appear to be scaled versions of the orange curve. Intuitively, this is because one-class SVM with parameter ν aims to separate approximately a 1 − ν fraction of the nominal points from the origin in the RKHS following the maximum-margin principle, and it only focuses on points near the boundary. Asymptotically it is known to approximate one density level set well [4]; for a different false alarm level, one-class SVM needs re-training with a different ν. On the other hand, we also train a rank-SVM on the same samples and obtain the ranker g. We then vary the threshold on g to obtain the level curves shown in (b), all of which approximate the corresponding density level sets well. This is because the input preference pairs to rank-SVM incorporate density ordering information over the whole support of the nominal density. Asymptotically g preserves the ordering of the density, as will be shown in Sec. IV. This property of g allows flexible false alarm control and does not require any re-training.

III-B2 Time Complexity

For training, the rank computation step requires computing all pair-wise distances among the n nominal points and sorting the distances for each point; the total training time is this cost plus that of the pair-wise learning-to-rank algorithm. At the test stage, our algorithm only evaluates the SVM-type function g(η) at the test point and performs a binary search among the sorted values g(x_i). The complexity is O(s d + log n), where s is the number of support vectors, similar to that of one-class SVM, while nearest neighbor-based algorithms such as K-LPE, aK-LPE or BP-KNNG [7, 8, 9] require O(n d) to test one point. It is worth noting that s comes from the "support pairs" within the input preference pair set, and is usually larger than the number of support vectors of one-class SVM. Practically we observe that s is much smaller than n for most data sets in the experiment section, leading to significantly reduced test time compared to aK-LPE, as shown in Table II.

IV Analysis

In this section we present some theoretical analysis of our ranking-based anomaly detection approach. We first show that our approach is asymptotically consistent, in that r_n(η) converges to the p-value p(η) as the sample size approaches infinity. We then provide a finite-sample generalization bound on the false alarm rate of our approach.

IV-A Asymptotic Consistency

Our asymptotic analysis consists of three parts, corresponding to the three main steps of our algorithm described in Sec. III.

(1) Consistency of Rank Computation

The rank of nominal samples based on the average k-NN distance has been shown previously to converge to the p-value function [8]:

Theorem 3.

Suppose the rank r_n(η) is computed according to Eq.(6) based on the average k-NN distance statistic of Eq.(7) among x_1, ..., x_n. With k appropriately chosen, as n → ∞,

    r_n(η) → p(η) in probability.    (10)

This theorem establishes that asymptotically the preference pairs generated as input to the learning-to-rank step are reliable, in the sense that any generated pair (x_i, x_j) has the "correct" order f(x_i) > f(x_j), or equivalently p(x_i) > p(x_j).

(2) Consistency of Rank SVM

For simplicity we assume the preference pair set P contains all pairs over these samples. Let g_n be the optimal solution to the rank-SVM problem Eq.(9). If ℓ denotes the hinge loss, then this optimal solution satisfies

    g_n = argmin_{g ∈ H}  λ_n ||g||²_H + R_{ℓ,P_n}(g),    (11)

where the empirical ℓ-risk R_{ℓ,P_n}(g) is given by

    R_{ℓ,P_n}(g) = (1/|P|) Σ_{(i,j) ∈ P} ℓ(g(x_i) − g(x_j)).    (12)

Let B_R denote the ball of radius R in H, the RKHS with the RBF kernel. Given ε > 0, we let N(B_R, ε) be the covering number of B_R by disks of radius ε (see appendix). We first show that, with λ_n appropriately chosen, g_n is consistent in the following sense.

Theorem 4.

Let λ_n be appropriately chosen such that λ_n → 0 and n λ_n → ∞ as n → ∞. Then we have

    R_{ℓ,P}(g_n) → inf_{g ∈ H} R_{ℓ,P}(g)  in probability.    (13)

We then establish that, under mild conditions on the surrogate loss function, the solution minimizing the expected surrogate loss asymptotically recovers the correct preference relationships given by the density f.

Theorem 5.

Let ℓ be a non-negative, non-increasing convex surrogate loss function that is differentiable at zero and satisfies ℓ′(0) < 0. If g* minimizes the expected surrogate loss R_{ℓ,P}, then g* correctly ranks the samples according to their density, i.e. g*(x_i) > g*(x_j) whenever f(x_i) > f(x_j), for any x_i, x_j in the support of f.

The hinge loss satisfies the conditions of the above theorem. Combining Theorems 4 and 5, we establish that asymptotically the rank-SVM step yields a ranker that preserves the preference relationships on the nominal samples given by the nominal density f.

(3) Consistency of Test Stage Prediction

Corollary 6.

Assume the non-flat condition of Lemma 1 holds. For a test point η, let r_n(η) denote the rank computed by Eq.(6) using the optimal solution g_n of the rank-SVM step in place of G, as described in Algorithm 1. Then for a given false alarm level α, the decision region of Algorithm 1 asymptotically converges to the α-percentile minimum volume set decision region of Eq.(2).

Theorem 5 and Lemma 2 yield the asymptotic consistency of r_n(η); Lemma 1 finishes the proof.

IV-B Finite-Sample Generalization Result

Based on the nominal samples x_1, ..., x_n, our approach learns a ranker g and computes the values g(x_1), ..., g(x_n). Let g_(1) ≤ g_(2) ≤ ... ≤ g_(n) be the ordered permutation of these values. For a test point η, we evaluate g(η) and compute r_n(η) according to Eq.(6). For a prescribed false alarm level α, we define the decision region for claiming anomaly by

    R_α = {x : g(x) ≤ g_(⌈nα⌉)},

where ⌈nα⌉ denotes the smallest integer no less than nα.

We give a finite-sample bound on the probability that a newly drawn nominal point lies in R_α. In the following theorem, F denotes a real-valued function class of kernel-based linear functions (solutions to an SVM-type problem) equipped with the ℓ∞ norm over a finite sample x_1, ..., x_n:

    ||g||_{ℓ∞(x_1,...,x_n)} = max_i |g(x_i)|.

Moreover, N(F, γ, n) denotes the covering number of F with respect to this norm (see [4] for details).

Theorem 7.

Fix a distribution P on R^d and suppose x_1, ..., x_n are generated i.i.d. from P. For g ∈ F let g_(1) ≤ ... ≤ g_(n) be the ordered permutation of g(x_1), ..., g(x_n). Then for such an n-sample, with probability 1 − δ, for any g ∈ F and sufficiently small γ > 0,

    P{x : g(x) < g_(⌈nα⌉) − 2γ} ≤ α + ε(n, k, δ),

where ε(n, k, δ) = (2/n)(k + log(n/δ)) and k = ⌈log N(F, γ, 2n)⌉.

Remark

To interpret the theorem, notice that the LHS is precisely the probability that a test point drawn from the nominal distribution has a score below the ⌈nα⌉-th smallest training score (up to the margin 2γ). We see that this probability is bounded from above by α plus an error term that asymptotically approaches zero. The theorem holds irrespective of α, so we have shown that we can simultaneously approximate multiple level sets. This theorem is similar to Theorem 12 of [4] in the second term of the upper bound. However, the generalization result for our approach applies to different quantiles α, i.e., to different thresholds on g, and is thus a uniform upper bound on the empirical false alarm probability across levels α, while the result in [4] applies only to the single offset obtained during training, corresponding to one particular level set. This point is also illustrated in Fig. 1.

V Experiments

In this section, we carry out point-wise anomaly detection experiments on synthetic and real-world data sets. We compare our ranking-based approach against the density-based methods BP-KNNG [9] and aK-LPE [8], one-class SVM [4], and two other state-of-the-art methods based on random sub-sampling, isolation forest [16] (iForest) and massAD [17].

V-A Implementation Details

In our simulations, the Euclidean distance is used as the distance metric for all candidate methods. For one-class SVM the libSVM code [18] is used. The parameter ν and the RBF kernel parameter for one-class SVM are set using the same configuration as in [17]. For iForest and massAD, we use the code from the authors' websites, with the same configuration as in [17]. The statistic for our approach and aK-LPE is the average k-NN distance of Eq.(7) with k fixed. For BP-KNNG, the same k is used and other parameters are set according to [9].

For the rank-SVM step, we adapt the linear ranking-SVM routine from [19] to a kernelized version. To generate preference pairs, we quantize the ranks of nominal points into m = 3 levels and generate a pair whenever the quantized levels of two samples differ. We vary the rank-SVM parameter C of Eq.(9) and the RBF kernel bandwidth, the latter scaled relative to the average k-NN distance over the nominal samples. We choose the parameter configuration through 4-fold cross validation and train a ranker with these parameters on the whole nominal set. Since anomalous data is unavailable at training time, we favor rankers that violate fewer of the preference relationships on the nominal set. This ranker is then adopted for test-stage prediction. All AUC performances are averaged over 5 runs.
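A sketch of this selection procedure is shown below, reusing the helper sketches from Section III; the grids, the fold structure, and the held-out scoring rule are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def select_params(X, ranks, C_grid, gamma_grid, m=3, n_folds=4, seed=0):
    """Pick (C, gamma) that violates the fewest held-out preference pairs.
    Uses make_preference_pairs and train_rank_svm_rbf from the sketches above."""
    best, best_viol = None, np.inf
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for C, gamma in product(C_grid, gamma_grid):
        viol = 0
        for tr, va in kf.split(X):
            pairs_tr = make_preference_pairs(ranks[tr], m=m)
            _, score = train_rank_svm_rbf(X[tr], pairs_tr, C=C, gamma=gamma)
            g_va = score(X[va])
            # count disagreements with the preference pairs on the held-out fold
            viol += sum(g_va[i] <= g_va[j] for i, j in make_preference_pairs(ranks[va], m=m))
        if viol < best_viol:
            best, best_viol = (C, gamma), viol
    return best
```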

V-B Synthetic Data sets

We first apply our method to a Gaussian toy problem, where the nominal density is a two-component Gaussian mixture and the anomalous density is a uniform distribution over a bounded region.

The empirical ROC curves of our method and one-class SVM along with the optimal Bayesian detector are shown in Fig.2. We can see from (a) that our approach performs fairly close to the optimal Bayesian classifier and much better than one-class SVM. Fig.2 (b) shows the level curves for the estimated ranks on the test data. The empirical level curves of rankAD approximate the level sets of the underlying nominal density quite well.

Fig. 2: Performance on a synthetic data set: (a) ROC curves on the two-component Gaussian mixture data; (b) level curves of the estimated ranks of rankAD. 600 points are used for training; 500 nominal and 1000 anomalous points are used for testing.

V-C Real-world data sets

We first illustrate ROC curves on two real-world data sets: the banknote authentication data set and the magic gamma telescope data set from the UCI repository [20]. We compare our approach to two typical classes of methods: the density-based aK-LPE and one-class SVM. For the banknote data set, the class with label 2 is regarded as nominal and the other classes as anomalous. The testing times for the 872 test points are 0.078s for aK-LPE, 0.02s for one-class SVM, and 0.031s for our method (with 162 support vectors out of 500 training points). As shown in Fig. 3(a), our algorithm clearly outperforms one-class SVM. In fact, our method achieves a 100% true detection rate at a false positive rate of around 20%, while one-class SVM reaches a 100% true detection rate only at a false positive rate of 70%.

Fig. 3: ROC curves of aK-LPE, one-class SVM and the proposed method on different data sets. (a) Banknote authentication, class "2" (nominal) vs. others, 5-dim, 500 training points, 872 test points (262 nominal). (b) Magic Gamma Telescope, gamma particles (nominal) vs. background, 10-dim, 1500 training points, 4000 test points (1000 nominal).

The Magic Gamma Telescope data set is an image data set used to classify high energy gamma particles from cosmic rays in an atmospheric telescope. Ten attributes of the observed images are used as input features. Here we regard all gamma particles as nominal data and the background cosmic rays as anomalies. The testing times for the 4000 test points are 0.42s for aK-LPE, 0.01s for one-class SVM, and 0.01s for our method (with 41 support vectors out of 1500 training points). Fig. 3(b) shows that our method significantly outperforms one-class SVM and is comparable to aK-LPE, but with significantly smaller test time.

Data set       N        d    Anomaly class
Annthyroid     6832     6    classes 1,2 (7%)
Forest Cover   286048   10   class 4 (0.9%) vs. class 2
HTTP           567497   3    attack (0.4%)
Mammography    11183    6    class 1 (2%)
Mulcross       262144   4    2 clusters (10%)
Satellite      6435     36   3 smallest classes (32%)
Shuttle        49097    9    classes 2,3,5,6,7 (7%)
SMTP           95156    3    attack (0.03%)
TABLE I: Data characteristics of the data sets used in experiments. N is the total number of instances and d the dimension of the data. The percentage in brackets indicates the fraction of anomalies among all instances.

We then conduct experiments on several other real data sets used in [16] and [17], including 2 network intrusion data sets HTTP and SMTP from [21], Annthyroid, Forest Cover Type, Satellite, Shuttle from UCI repository [20], Mammography and Mulcross from [22]. Table I illustrates the characteristics of these data sets.

Data Sets rankAD oc-svm BP-KNNG aK-LPE iForest massAD
AUC Annthyroid 0.844 0.681 0.823 0.753 0.856 0.789
Forest Cover 0.932 0.869 0.859 0.876 0.853 0.895
HTTP 0.999 0.998 0.995 0.999 0.986 0.995
Mammography 0.909 0.863 0.868 0.879 0.891 0.701
Mulcross 0.998 0.970 0.994 0.998 0.971 0.998
Satellite 0.885 0.774 0.872 0.884 0.812 0.692
Shuttle 0.996 0.975 0.985 0.995 0.992 0.992
SMTP 0.934 0.751 0.892 0.900 0.869 0.859
test time Annthyroid 0.338 0.281 2.171 0.917 1.384 0.030
Forest Cover 1.748 1.638 2.185 13.41 7.239 0.483
HTTP 0.187 0.376 2.391 11.04 5.657 0.384
Mammography 0.237 0.223 0.281 1.443 1.721 0.044
Mulcross 2.732 2.272 3.772 13.75 7.864 0.559
Satellite 0.393 0.355 0.776 1.199 1.435 0.030
Shuttle 1.317 1.318 2.404 7.169 4.301 0.186
SMTP 1.116 1.105 1.912 11.76 5.924 0.411
TABLE II: Anomaly detection AUC performance and test stage time of various methods.

We randomly sample 2000 nominal points for training. The rest of the nominal data and all of the anomalous data are held out for testing. Due to memory constraints, at most 80000 nominal points are used at test time. The time for testing all test points and the AUC performance are reported in Table II.

We observe that while being faster than BP-KNNG, aK-LPE and iForest, and comparable to one-class SVM at the test stage, our approach also achieves very good performance on all data sets. The density-based aK-LPE performs reasonably well, but its test time degrades significantly with the training set size. The other density-based method, BP-KNNG, has a smaller test time than aK-LPE since it uses a subset of the training samples, but its performance is not comparable to rankAD. massAD is fast at the test stage but performs poorly on several data sets. Overall, our approach is competitive in both AUC performance and test time compared to other state-of-the-art algorithms.

VI Conclusions

In this paper, we propose a novel anomaly detection framework based on rank-SVM. We combine statistical density information with a discriminative ranking procedure. Our scheme learns a ranker over all nominal samples based on the k-NN distances within the graph constructed from these nominal points. This is achieved through a pair-wise learning-to-rank step, where the inputs are preference pairs (x_i, x_j). The preference relationship for (x_i, x_j) takes the value one if the nearest neighbor based score for x_i is larger than that for x_j. Asymptotically this preference models the situation that data point x_i is located in a higher-density region than x_j under the nominal distribution. We then show the asymptotic consistency of our approach, which allows for flexible false alarm control during the test stage. We also provide a finite-sample generalization bound on the empirical false alarm rate of our approach. Experiments on synthetic and real data sets demonstrate that our approach has superior performance as well as low test-time complexity.

[Supplementary: Proofs of Theorems]

We fix an RKHS H on the input space X with an RBF kernel. Let S = {x_1, ..., x_n} be a set of objects to be ranked in X with labels {r_n(x_1), ..., r_n(x_n)}; here r_n(x_i) denotes the label of x_i. We assume x to be a random variable distributed according to P, and the labels to be deterministic. Throughout, ℓ denotes the hinge loss.

The following notation will be useful in the proof of Theorem 4. Define the ℓ-risk of g as

    R_{ℓ,P}(g) = E_{x,x′ ∼ P} [ w(x, x′) ℓ( (g(x) − g(x′)) sign(f(x) − f(x′)) ) ],

where w is some positive weight function. The smallest possible ℓ-risk in H is denoted R*_{ℓ,P}. The regularized ℓ-risk is

    R^{reg}_{ℓ,P,λ}(g) = λ ||g||²_H + R_{ℓ,P}(g),   λ > 0.    (14)

If P_n denotes the empirical measure with respect to S, we write R_{ℓ,P_n} and R^{reg}_{ℓ,P_n,λ} for the associated empirical risks.

-A Proof of Theorem 4

Proof   Let us outline the argument. In [23], the author shows that there exists a minimizer of (14):

Lemma 1.

For all Borel probability measures P on X and all λ > 0, there is an f_{P,λ} ∈ H that attains the infimum of the regularized ℓ-risk, i.e. R^{reg}_{ℓ,P,λ}(f_{P,λ}) = inf_{g ∈ H} R^{reg}_{ℓ,P,λ}(g).

Next, a simple argument shows that the empirical regularized risk of g_n is no larger than that of any fixed element of H, since g_n minimizes it by definition. Finally, we need a concentration inequality to relate the ℓ-risk of g_n to its empirical ℓ-risk. Consistency then follows by chaining these estimates: for an appropriately chosen sequence ε_n → 0 and n large enough, the second and fourth inequalities in the chain hold by the concentration inequality below, and the last one holds since λ_n → 0.

We now prove the appropriate concentration inequality [24]. Recall that H is an RKHS with a smooth kernel; thus the inclusion of H into the space of continuous functions on X is compact when the latter carries the sup-norm topology. In particular the "hypothesis space" B_R, the ball of radius R in H, is compact in the sup norm. We denote by N(B_R, ε) the covering number of B_R by disks of radius ε. We prove the following inequality:

Lemma 2.

For any probability distribution P on X, the probability that sup_{g ∈ B_R} |R_{ℓ,P}(g) − R_{ℓ,P_n}(g)| exceeds ε is bounded by a multiple of the covering number N(B_R, ε) times a term that decays exponentially in n.

Proof   Since B_R is compact, it has a finite covering number. Now suppose C is any finite covering of B_R by disks; the supremum over B_R is then controlled by the maximum over the disk centers plus the deviation within a single disk, so we may restrict attention to a disk in B_R of appropriate radius r.

Suppose g and g′ lie in the same disk, so that their distance is at most r. We want to show that the difference between their risks is also small. Rewriting this difference term by term, and using that the hinge loss is Lipschitz, for r small enough the difference is controlled by the distance between g and g′ evaluated through the feature map Φ(x) = k(x, ·); combining this with the Cauchy-Schwarz inequality yields a bound proportional to r. From this inequality it follows that it suffices to control the deviations at the disk centers.

We thus choose C to cover B_R with disks of radius r centered at g_1, ..., g_N, where N = N(B_R, r) is the covering number for this particular radius. A union bound over the centers then reduces the claim to bounding, for each fixed center, the probability that its empirical ℓ-risk deviates from its ℓ-risk; these probabilities can be bounded using McDiarmid's inequality.

Define the random variable U = R_{ℓ,P_n}(g) for a fixed g; we need to verify that U has bounded differences. If we change one of the variables x_i in S to x_i′, then at most 2(n − 1) of the pairwise summands change, each by a bounded amount since functions in B_R, and the hinge loss applied to them, are bounded. McDiarmid's inequality thus gives an exponential tail bound on the deviation of the empirical ℓ-risk from its expectation.


We are now ready to prove Theorem 4: we apply this concentration result to B_R with R chosen according to the bound of Lemma 1. Since λ_n → 0 sufficiently slowly, the RHS converges to 0 as n → ∞. This completes the proof of Theorem 4.

-B Proof of Theorem 5

Proof   Our proof follows similar lines to Theorem 4 in [25]. Assume to the contrary that the claim fails, i.e. there exist points x_i, x_j with f(x_i) > f(x_j) but g*(x_i) ≤ g*(x_j). Define a function h that corrects the order of g* at these two points and agrees with g* everywhere else. Writing R_{ℓ,P}(h) − R_{ℓ,P}(g*) as a sum of six terms, and using the requirements on the weight function together with the assumption that ℓ is non-increasing and non-negative, we see that all six sums are negative. Thus R_{ℓ,P}(h) < R_{ℓ,P}(g*), contradicting the minimality of g*. Therefore g* ranks the samples in the same order as the density f.

Now consider the remaining case. Since the surrogate loss satisfies the stated conditions, we have