Learning rates for classification with Gaussian kernels

02/28/2017, by Shao-Bo Lin et al.

This paper aims at a refined error analysis for binary classification using the support vector machine (SVM) with a Gaussian kernel and a convex loss. Our first result shows that for some loss functions, such as the truncated quadratic loss and the quadratic loss, SVM with Gaussian kernel can reach an almost optimal learning rate, provided the regression function is smooth. Our second result shows that, for a large class of loss functions, under a Tsybakov noise assumption, if the regression function is infinitely smooth, then SVM with Gaussian kernel can achieve a learning rate of order m^{-1}, where m is the number of samples.


1 Introduction

The support vector machine (SVM) is, by definition, Tikhonov regularization with some loss function over a reproducing kernel Hilbert space (RKHS). Due to its clear statistical properties (Zhang, 2004; Blanchard et al., 2008) and fast learning rates (Steinwart and Scovel, 2007; Tong, 2016), SVM has triggered enormous research activity in the past twenty years. Theoretical assessments of the feasibility of SVM have been widely studied; see, to name just a few, (Chen et al., 2004; Wu and Zhou, 2005; Wang, 2005; Zhou and Jetter, 2006; Cucker and Zhou, 2007; Wu et al., 2007; Tong et al., 2008; Steinwart and Christmann, 2008).

As shown in (Steinwart, 2002), selecting a suitable kernel facilitates the use of SVM, both in theoretical analysis and in practical applications. The Gaussian kernel is one of the most important kernels in practice; its width reflects the frequency information of a specified learning problem (Keerthi and Lin, 2003). Structures as well as explicit representations of the inner products and norms of Gaussian RKHSs have been studied in (Steinwart et al., 2006; Minh, 2010). Furthermore, tight bounds on various covering numbers of Gaussian RKHSs were provided in (Zhou, 2002, 2003; Steinwart and Scovel, 2007; Kühn, 2011). Based on these bounds, fast learning rates for SVM with Gaussian kernel were derived in (Ying and Zhou, 2007; Steinwart and Scovel, 2007; Ye and Zhou, 2008; Xiang and Zhou, 2009; Xiang, 2011, 2012; Hu, 2011; Eberts and Steinwart, 2013; Lin et al., 2014, 2015). As a typical example, (Steinwart and Scovel, 2007) proved that there exist non-trivial distributions for which the learning rate of SVM classification with Gaussian kernel and hinge loss can reach an order of $m^{-1}$, where $m$ is the number of samples. Similar results were established for SVM with the quadratic loss in (Xiang and Zhou, 2009).

This paper aims at a refined analysis of SVM classification with convex loss and Gaussian kernel. Our first purpose is to derive almost optimal learning rates for SVM classification. Our result shows that if the regression function (see Section 2 for the definition) is $(r, c_0)$-smooth, then SVM with Gaussian kernel and certain loss functions, such as the quadratic loss and the truncated quadratic loss, can reach a learning rate that is optimal in the minimax sense up to an arbitrarily small positive number in the exponent; the corresponding minimax optimality for classification with these loss functions and smooth regression functions was proved in (Yang, 1999).

Since this rate is always slower than the order $m^{-1}$, our second purpose is to deduce fast learning rates for SVM with Gaussian kernel under additional assumptions on the regression function. In particular, we find that if the regression function is infinitely differentiable and the Tsybakov noise exponent (Tsybakov, 2004) tends to infinity, then SVM with Gaussian kernel and loss functions including the hinge loss, the quadratic loss, and the truncated quadratic loss can achieve an order of $m^{-1}$. This implies that there exist non-trivial distributions for which learning rates of SVM with Gaussian kernel can reach the order $m^{-1}$, which extends the results in (Steinwart and Scovel, 2007; Xiang and Zhou, 2009) for the hinge loss and the quadratic loss to a general case.

The rest of the paper is organized as follows. Section 2 presents some definitions and introduces the algorithm studied in this paper. Section 3 provides the main results. Section 4 compares our results with related work and gives some further discussion. Section 5 establishes two oracle inequalities for SVM with convex loss. Section 6 gives the proofs of the main results.

2 Classification with Gaussian Kernel and Convex Loss

In learning theory (Cucker and Zhou, 2007; Steinwart and Christmann, 2008), the samples $D = \{(x_i, y_i)\}_{i=1}^m$ with $x_i \in X$ and $y_i \in Y = \{-1, 1\}$ are drawn independently according to an unknown distribution $\rho$ on $Z = X \times Y$. Binary classification algorithms produce a classifier $\mathcal{C}: X \to Y$, whose generalization ability is measured by the misclassification error

$$\mathcal{R}(\mathcal{C}) = P[y \neq \mathcal{C}(x)] = \int_X P\big(y \neq \mathcal{C}(x) \mid x\big)\, d\rho_X,$$

where $\rho_X$ is the marginal distribution of $\rho$ on $X$ and $P(\cdot \mid x)$ is the conditional probability at $x \in X$. The Bayes rule

$$f_c(x) = \operatorname{sgn}\big(\eta(x) - 1/2\big), \qquad \eta(x) := P(y = 1 \mid x),$$

minimizes the misclassification error, where $f_c$ is called the Bayes decision function. Since $\mathcal{R}(f_c)$ is independent of the classifier $\mathcal{C}$, the performance of $\mathcal{C}$ can be measured by the excess misclassification error $\mathcal{R}(\mathcal{C}) - \mathcal{R}(f_c)$.
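
To make these quantities concrete, the following short simulation (an illustration of the standard definitions above, not code from the paper; the choice of $\eta$ and all function names are ours) estimates the misclassification error of a plug-in classifier and compares it with the Bayes error on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Conditional probability P(y = 1 | x) for a toy 1-D problem (our choice)."""
    return 1.0 / (1.0 + np.exp(-4.0 * x))

# Draw m i.i.d. samples (x_i, y_i) with x ~ Uniform[-1, 1] and P(y = 1 | x) = eta(x).
m = 5000
x = rng.uniform(-1.0, 1.0, size=m)
y = np.where(rng.uniform(size=m) < eta(x), 1, -1)

def bayes_rule(x):
    """Bayes decision function f_c(x) = sgn(eta(x) - 1/2)."""
    return np.where(eta(x) >= 0.5, 1, -1)

def misclassification_error(classifier, x, y):
    """Empirical counterpart of R(C) = P[y != C(x)]."""
    return np.mean(classifier(x) != y)

# A deliberately suboptimal classifier: threshold at 0.25 instead of 0.
naive = lambda x: np.where(x > 0.25, 1, -1)

print("empirical error of naive rule :", misclassification_error(naive, x, y))
print("empirical error of Bayes rule :", misclassification_error(bayes_rule, x, y))
```

The difference between the two printed errors is an empirical estimate of the excess misclassification error $\mathcal{R}(\mathcal{C}) - \mathcal{R}(f_c)$ studied below.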

Given a loss function $\phi: \mathbb{R} \to [0, \infty)$, denote by $\mathcal{E}^\phi(f) = \int_Z \phi(y f(x))\, d\rho$ the generalization error with respect to $\phi$ and by

$$f_{\phi,\rho} := \arg\min_{f} \mathcal{E}^\phi(f) \qquad\qquad (1)$$

the regression function minimizing $\mathcal{E}^\phi$, where the minimum is taken over all measurable functions. If $\phi$ is differentiable, it is easy to check that, for almost every $x \in X$,

$$\eta(x)\,\phi'\big(f_{\phi,\rho}(x)\big) = \big(1 - \eta(x)\big)\,\phi'\big(-f_{\phi,\rho}(x)\big). \qquad\qquad (2)$$

We are concerned with the hinge loss and with the following class of twice smooth classifying losses.

Definition 1

We say that $\phi$ is a classifying loss (function) if it is convex, differentiable at $0$ with $\phi'(0) < 0$, and the smallest zero of $\phi$ is $1$. We say that $\phi$ is a twice smooth classifying loss if, in addition, it is differentiable, its derivative $\phi'$ is continuous and satisfies

(3)

and its modulus of convexity satisfies

(4)

The classifying loss was defined in (Xiang and Zhou, 2009), and the modulus of convexity together with condition (4) was given in (Bartlett et al., 2006). It is easy to check that the quadratic loss and the truncated quadratic loss (or 2-norm hinge loss) are twice smooth classifying losses. It should be mentioned that the twice smooth classifying loss differs from the loss of quadratic type defined in (Koltchinskii and Yuan, 2010): the classifying loss, by requiring a zero point of $\phi$, excludes the well-known logistic loss, a typical loss of quadratic type, while the twice-differentiability required of a loss of quadratic type excludes the truncated quadratic loss. As concrete examples for our analysis, we are specifically interested in the loss functions presented in Table 1, all of which are frequently used in practical applications (Bartlett et al., 2006). The regression functions of other twice smooth classifying losses can be deduced from (2). Since the subgradient of the hinge loss at $1$ is not unique, the regression function for the hinge loss is not unique either. In Table 1 we set $f_{\phi,\rho}(x) = \operatorname{sgn}(2\eta(x) - 1)$ for the hinge loss, but allow $f_{\phi,\rho}(x) \ge 1$ when $\eta(x) = 1$, $f_{\phi,\rho}(x) \le -1$ when $\eta(x) = 0$, and $f_{\phi,\rho}(x) \in [-1, 1]$ when $\eta(x) = 1/2$.

Loss function   Mathematical representation   Regression function
Quadratic   $\phi(t) = (1 - t)^2$   $f_{\phi,\rho}(x) = 2\eta(x) - 1$
Truncated quadratic   $\phi(t) = \big(\max\{1 - t, 0\}\big)^2$   $f_{\phi,\rho}(x) = 2\eta(x) - 1$
Hinge   $\phi(t) = \max\{1 - t, 0\}$   $f_{\phi,\rho}(x) = \operatorname{sgn}\big(2\eta(x) - 1\big)$
Table 1: Loss functions and their regression functions
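
As a quick sanity check of Table 1 via relation (2) (a standard computation, not reproduced from the paper): for the quadratic loss $\phi(t) = (1 - t)^2$ one minimizes the pointwise risk directly,

$$\frac{d}{dt}\Big[\eta(x)(1 - t)^2 + \big(1 - \eta(x)\big)(1 + t)^2\Big] = -2\eta(x)(1 - t) + 2\big(1 - \eta(x)\big)(1 + t) = 0 \;\Longrightarrow\; t = 2\eta(x) - 1,$$

so $f_{\phi,\rho}(x) = 2\eta(x) - 1$, matching the table; the truncated quadratic loss gives the same minimizer because $2\eta(x) - 1 \in [-1, 1]$.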

Let

$$K_\sigma(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{\sigma^2}\right), \qquad x, x' \in X,$$

be the Gaussian kernel, where $\sigma > 0$ is the width of $K_\sigma$ and $\|\cdot\|$ denotes the Euclidean norm. Denote by $\mathcal{H}_\sigma$ the RKHS associated with $K_\sigma$, endowed with the inner product $\langle \cdot, \cdot \rangle_\sigma$ and norm $\|\cdot\|_\sigma$. We consider learning rates of the following algorithm:

$$f_{D,\sigma,\lambda} = \arg\min_{f \in \mathcal{H}_\sigma} \left\{ \frac{1}{m} \sum_{i=1}^m \phi\big(y_i f(x_i)\big) + \lambda \|f\|_\sigma^2 \right\}, \qquad\qquad (5)$$

where $\lambda > 0$ is a regularization parameter.
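
A minimal computational sketch of scheme (5) for the quadratic loss, where the representer theorem reduces the problem to a linear system (equivalent to kernel ridge regression on the $\pm 1$ labels). This is our own illustration under those assumptions; the function names, the closed form used, and the parameter values are not taken from the paper, and other losses (hinge, truncated quadratic) would require an iterative convex solver instead.

```python
import numpy as np

def gaussian_gram(X1, X2, sigma):
    """Gram matrix K_sigma(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def fit_quadratic_loss_svm(X, y, sigma, lam):
    """Solve (5) with phi(t) = (1 - t)^2.

    By the representer theorem, f = sum_i alpha_i K_sigma(x_i, .), and for the
    quadratic loss the minimizer satisfies (K + m * lam * I) alpha = y.
    """
    m = len(y)
    K = gaussian_gram(X, X, sigma)
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y.astype(float))
    return alpha

def predict_sign(alpha, X_train, X_test, sigma):
    """Classifier sgn(f_{D,sigma,lambda}(x)) induced by the learned function."""
    f = gaussian_gram(X_test, X_train, sigma) @ alpha
    return np.where(f >= 0, 1, -1)

# Toy usage on synthetic data (illustrative parameter values).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0, 1, -1)

alpha = fit_quadratic_loss_svm(X, y, sigma=1.0, lam=1e-2)
print("training error:", np.mean(predict_sign(alpha, X, X, sigma=1.0) != y))
```

In practice the pair $(\sigma, \lambda)$ is tuned by cross-validation, whereas the theorems below specify their theoretical choices as functions of the sample size and the smoothness/noise indices.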

3 Main Results

Our error analysis is built upon a smoothness assumption on the regression function, which requires the following definition.

Definition 2

Let $r = u + v$ for some $u \in \mathbb{N}_0$ and $0 < v \le 1$, and let $c_0 > 0$. A function $f: X \to \mathbb{R}$ is said to be $(r, c_0)$-smooth if, for every multi-index $(\alpha_1, \dots, \alpha_d)$ with $\alpha_1 + \dots + \alpha_d = u$, the partial derivatives $\frac{\partial^u f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$ exist and satisfy

$$\left| \frac{\partial^u f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) - \frac{\partial^u f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x')\right| \le c_0 \|x - x'\|^v, \qquad x, x' \in X.$$

Denote by $\operatorname{Lip}^{(r, c_0)}$ the set of all $(r, c_0)$-smooth functions.

To derive the learning rate, we need the following assumption.

Assumption 1

$f_{\phi,\rho} \in \operatorname{Lip}^{(r, c_0)}$ for some $r > 0$ and $c_0 > 0$.

Assumption 1 describes the smoothness and boundedness of the regression function. If $\phi$ is the quadratic or the truncated quadratic loss, then the smoothness of the regression function is equivalent to the smoothness of the Bayes decision function. If $\phi$ is the hinge loss, $f_{\phi,\rho}$ is not unique. In that case, Assumption 1 means that some $(r, c_0)$-smooth version of $f_{\phi,\rho}$ exists, and it implies that the two classes $\{x \in X : \eta(x) > 1/2\}$ and $\{x \in X : \eta(x) < 1/2\}$ have a strictly positive distance, which is a bit strict. Hence, for SVM with the hinge loss, a preferable assumption is the geometric noise assumption introduced in (Steinwart and Scovel, 2007, Definition 2.3) (see also (Steinwart and Christmann, 2008, Definition 8.15)). We present learning results for SVM with the hinge loss under Assumption 1 for the sake of completeness.

Based on Assumption 1, we present our first main result.

Theorem 1

Let $\phi$ be either the hinge loss or a twice smooth classifying loss. If Assumption 1 holds and $\sigma$ and $\lambda$ are chosen appropriately, then for arbitrary $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

(6)

where the constant is positive and independent of $m$ or $\delta$.

With the help of the above confidence-based error estimate, we can derive the following learning rate in expectation.

Corollary 1

Let $\phi$ be either the hinge loss or a twice smooth classifying loss. If Assumption 1 holds and $\sigma$ and $\lambda$ are chosen as in Theorem 1, then there holds

(7)

where the constant is as specified in Theorem 1.

Corollary 1 gives an upper bound for algorithm (5) with the hinge loss or a twice smooth classifying loss under Assumption 1. However, it is difficult to judge whether this bound is tight for all these loss functions. The following corollary shows that, at least for certain loss functions, the error estimate in (7) is almost optimal.

Corollary 2

Let $\phi$ be either the quadratic loss or the truncated quadratic loss. If , , then for arbitrary , there holds

(8)

where one constant is independent of $m$ and the other was specified in Theorem 1.

It should be mentioned that the supremum in (8) is taken over a class of distributions and is equivalent to maximizing over some set of functions. Although the learning rate derived in (8) is almost optimal, it is always slower than the order $m^{-1}$. We then aim at deriving fast learning rates for algorithm (5) by imposing additional conditions on the distribution $\rho$. For this purpose, we need the following Tsybakov noise condition (Tsybakov, 2004).

Definition 3

Let $q \in [0, \infty)$. We say that $\rho$ satisfies the Tsybakov noise condition with exponent $q$ if there exists a constant $c_q > 0$ such that

$$\rho_X\big(\{x \in X : |2\eta(x) - 1| \le t\}\big) \le c_q\, t^q, \qquad \forall\, t > 0. \qquad\qquad (9)$$
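
For concreteness, here is a small worked example of checking (9) (our own illustration with an $\eta$ chosen by us, not an example from the paper): take $X = [0, 1]$, $\rho_X$ the uniform distribution, and $\eta(x) = (1 + x)/2$. Then

$$|2\eta(x) - 1| = x \quad\Longrightarrow\quad \rho_X\big(\{x : |2\eta(x) - 1| \le t\}\big) = \min\{t, 1\} \le t,$$

so the Tsybakov noise condition (9) holds with exponent $q = 1$ and constant $c_q = 1$. A noise-free problem with $|2\eta(x) - 1| \ge c > 0$ everywhere satisfies (9) for every finite $q$ (with a constant depending on $c$ and $q$), which corresponds to the regime $q \to \infty$ discussed below.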

To derive the fast learning rate, we need the following assumption.

Assumption 2

$\rho$ satisfies the Tsybakov noise condition with exponent $q$.

As discussed in (Tsybakov, 2004), Assumption 2 measures the size of the set of points that are corrupted with high noise in the labeling process, and it always holds for $q = 0$ with $c_q = 1$. It has been adopted in (Steinwart and Scovel, 2007; Xiang and Zhou, 2009; Xiang, 2011; Tong, 2016) to deduce fast learning rates for SVM with various loss functions. Noting that Assumption 1 reflects the smoothness of the regression function while Assumption 2 measures the level of critical noise, these two assumptions are compatible in some sense: for instance, a uniform marginal distribution together with the quadratic loss and a suitably chosen $\eta$ satisfies Assumption 1 with some smoothness index, and, plugging this $\eta$ into (9), Assumption 2 holds with a corresponding exponent and constant. The following two theorems show the improved learning rates under Assumptions 1 and 2.

Theorem 2

Let $\phi$ be a twice smooth classifying loss. Under Assumptions 1 and 2, if $\sigma$ and $\lambda$ are chosen appropriately, then for arbitrary $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

(10)

where the constant is independent of $m$ or $\delta$.

It can be seen from Theorem 2 and Corollary 2 that the upper bound in (10) is essentially smaller than the lower bound in (8). This is mainly due to the use of Assumption 2 in Theorem 2.

Theorem 3

Let $\phi$ be the hinge loss. Under Assumptions 1 and 2, if $\sigma$ and $\lambda$ are chosen appropriately, then for arbitrary $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

(11)

where the constant is independent of $m$ or $\delta$.

When $q = 0$, Theorems 2 and 3 coincide with Theorem 1. If $r \to \infty$, which implies that the approximation error approaches $0$, then the learning rates derived in Theorems 2 and 3 are governed by the Tsybakov exponent alone. These rates coincide with the optimal learning rates for certain classifiers based on empirical risk minimization in (Tsybakov, 2004), up to an arbitrarily small positive number in the exponent, and are the same as those presented in (Steinwart and Christmann, 2008, Chapter 8) for the hinge loss. Based on Theorems 2 and 3, we can deduce the following corollary, showing that classification with Gaussian kernel and a large number of loss functions can reach a learning rate arbitrarily close to the order $m^{-1}$ for nontrivial distributions.

Corollary 3

Let $\phi$ be either the hinge loss or a twice smooth classifying loss. If Assumptions 1 and 2 hold with sufficiently large $r$ and $q$, and $\sigma$ and $\lambda$ are chosen appropriately, then for arbitrary $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

(12)

where the constant is independent of $m$ or $\delta$.

4 Related Work and Discussion

SVM with Gaussian kernel and convex loss is a state-of-the-art learning strategy for tackling regression and classification problems. For regression, almost optimal learning rates of SVM with Gaussian kernel and quadratic loss were derived in (Eberts and Steinwart, 2013). In passing from regression to classification, comparison inequalities play a crucial role. Given a classifier $\operatorname{sgn}(f)$ induced by a real-valued function $f$ and some convex loss function $\phi$, the comparison inequality in (Chen et al., 2004) shows that the excess misclassification error can be bounded by means of the excess generalization error:

(13)

Furthermore, for the hinge loss, (Zhang, 2004) showed that

(14)

From (13), results in (Eberts and Steinwart, 2013) can be used to derive learning rates for classification with Gaussian kernel and quadratic loss.
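
For orientation, the commonly cited forms of such comparison inequalities are the following (our paraphrase of standard results; the elided equations (13) and (14) may differ in constants or generality):

$$\mathcal{R}(\operatorname{sgn} f) - \mathcal{R}(f_c) \;\le\; c_\phi \sqrt{\mathcal{E}^\phi(f) - \mathcal{E}^\phi(f_{\phi,\rho})}$$

for losses of quadratic type such as the quadratic and truncated quadratic loss, and the sharper bound

$$\mathcal{R}(\operatorname{sgn} f) - \mathcal{R}(f_c) \;\le\; \mathcal{E}^\phi(f) - \mathcal{E}^\phi(f_{\phi,\rho})$$

for the hinge loss.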

For other loss functions, learning rates of classification with Gaussian kernel were deduced in (Steinwart and Scovel, 2007; Xiang and Zhou, 2009; Xiang, 2011, 2012). In particular, (Steinwart and Scovel, 2007) proved that there exist non-trivial distributions (satisfying geometric noise assumptions and Tsybakov noise conditions) for which the learning rates of SVM with Gaussian kernel and hinge loss can reach an order of $m^{-1}$. Using refined techniques from approximation theory, (Xiang and Zhou, 2009) also constructed distributions (satisfying smoothness assumptions on the regression function and Tsybakov noise conditions) for which the learning rates of SVM with Gaussian kernel and quadratic loss can reach an order of $m^{-1}$. Moreover, (Xiang and Zhou, 2009) deduced learning rates for SVM with Gaussian kernel and classifying losses, including the norm hinge losses and the exponential hinge loss, under a smoothness assumption similar to Assumption 1. When the loss function is twice differentiable, (Xiang, 2011) improved the results of (Xiang and Zhou, 2009) by deriving fast learning rates of SVM under additional Tsybakov noise conditions. The main tool is the comparison inequality under Assumption 2 (Bartlett et al., 2006; Xiang, 2011) (see also (Steinwart and Christmann, 2008, Theorem 8.29)), saying that for an arbitrary measurable function $f$, there holds

(15)

where the constant depends only on the loss $\phi$ and the noise condition in Assumption 2, not on $f$. Since the definition of the classifying loss in (Xiang and Zhou, 2009) excludes the logistic loss and the exponential loss, (Xiang, 2012) derived learning rates for SVM with loss functions that do not satisfy the smallest-zero restriction of the classifying loss. Under this circumstance, learning rates for SVM classification with Gaussian kernel and logistic loss were derived in (Xiang, 2012).

Under Assumption 1, we derive almost optimal learning rates for SVM with the quadratic loss and the truncated quadratic loss. The learning rate derived in (6) is better than the rates in (Xiang and Zhou, 2009, Theorem 1), and is the same as the rate derived in (Eberts and Steinwart, 2013) for the quadratic loss. Moreover, for the hinge loss, our result in (6) is better than that in (Xiang and Zhou, 2009, Theorem 4). Furthermore, Corollary 3 shows that for some non-trivial distributions (satisfying smoothness assumptions on the regression function and Tsybakov noise conditions), SVM with Gaussian kernel and the hinge loss or a twice smooth classifying loss can reach a learning rate arbitrarily close to the order $m^{-1}$. Our results extend those in (Steinwart and Scovel, 2007) (for the hinge loss) and (Xiang and Zhou, 2009) (for the quadratic loss) to a general case. For another widely used kernel, the polynomial kernel, learning rates for SVM with convex loss functions were deduced in (Zhou and Jetter, 2006; Tong et al., 2008). Detailed comparisons between our paper and (Xiang and Zhou, 2009) (XZ2009), (Eberts and Steinwart, 2013) (ES2013), and (Tong et al., 2008) (T2008) are summarized in Table 2 and Table 3.

  
XZ2009   
ES2013    No No
T2008   
This paper   
Table 2: Learning rates under Assumption 1
  
XZ2009    No No
ES2013    No No
T2008   
This paper   
Table 3: Learning rates under Assumptions 1 and 2

Besides the smoothness assumption on the regression function, (Steinwart and Scovel, 2007) proposed a geometric noise assumption with some exponent (Steinwart and Scovel, 2007, Definition 2.3) to describe the learning rates for SVM. Based on that assumption and Assumption 2 of this paper, a learning rate was derived for SVM with Gaussian kernel and hinge loss. Under the same conditions as (Steinwart and Scovel, 2007), (Tong, 2016) derived a learning rate for SVM with polynomial kernels and hinge loss. As mentioned in the previous section, Assumption 1 for the hinge loss implies a strictly positive distance between the two classes, which in turn implies the geometric noise assumption with arbitrarily large exponent. Thus the corresponding fast learning rate can be derived regardless of the smoothness index. Under this circumstance, the smoothness index fails to describe the a-priori knowledge of the classification problem, and we recommend using the geometric noise assumption in (Steinwart and Scovel, 2007, Definition 2.3) or (Steinwart and Christmann, 2008, Definition 8.15) to quantify the a-priori information. The reason for introducing Assumption 1 in the analysis of SVM with the hinge loss is the sake of completeness and uniformity of the analysis.

In this paper, we study the learning performance of SVM with Gaussian kernel and convex loss. The main tools are two oracle inequalities developed in the next section. These two oracle inequalities are different from the standard result in (Steinwart and Christmann, 2008, Theorem 7.23), which is based on a very general oracle inequality established in (Steinwart and Christmann, 2008, Theorem 7.20). In detail, (Steinwart and Christmann, 2008, Theorem 7.23) requires a polynomial decay assumption on the (weaker) covering number of the RKHS but does not need the compactness of the input space or the continuity of the kernel, while our analysis needs Assumption 3 in Section 5, the compactness of $X$, and the continuity of the kernel. It should be mentioned that Assumption 3 covers logarithmically decaying covering numbers, which requires some non-trivial additional work. We believe that, by using the established oracle inequalities and the approximation results in (Zhou and Jetter, 2006; Tong et al., 2008), a similar error analysis for the polynomial kernel can be derived. As far as the Gaussian kernel is concerned, our results might also be derived from the approximation error analysis in this paper and (Steinwart and Christmann, 2008, Theorem 7.23) with slight changes, by using the twice-smoothness property (4) of the loss functions to verify the conditions of (Steinwart and Christmann, 2008, Theorem 7.23). It would be interesting to derive learning rates for classification with online learning and Gaussian kernel (Hu, 2011), and for classification with Gaussian kernel and convex loss when $X$ is a lower-dimensional manifold (Ye and Zhou, 2008), by utilizing the approaches in this paper.

5 Oracle Inequalities for SVM with Convex Loss

In this section, we present two oracle inequalities for SVM with convex loss and Mercer kernels. Denote by $L^2$ the space of square-integrable functions on $X$, endowed with the norm $\|\cdot\|_{L^2}$. Let $\mathcal{H}_K$ be the RKHS associated with a Mercer kernel $K$, endowed with the norm $\|\cdot\|_K$. Define

$$f_{D,\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \phi\big(y_i f(x_i)\big) + \lambda \|f\|_K^2 \right\}, \qquad\qquad (16)$$

where $\lambda > 0$ is a regularization parameter. Our oracle inequalities are built upon the following Assumption 3.

Assumption 3
(17)

where $\psi$ is a decreasing and continuous function, $B_R$ denotes the ball of radius $R$ in $\mathcal{H}_K$ for some $R > 0$, and $\mathcal{N}(\cdot\,, \epsilon)$ denotes the covering number (Xiang and Zhou, 2009).

Assumption 3 depicts the capacity of the RKHS. It holds for the RKHS of the Gaussian kernel (Steinwart and Scovel, 2007) and for the RKHS of the polynomial kernel (Zhou and Jetter, 2006), with suitable choices of $\psi$. Under Assumption 3, we need the following two lemmas, derived in (Shi et al., 2011; Shi, 2013) and (Wu and Zhou, 2005), to present the oracle inequalities.
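
For readers less familiar with this capacity measure, we recall the standard definition of a covering number (a standard notion; the precise metric and normalization used in (17) are not reproduced here):

$$\mathcal{N}(\mathcal{F}, \epsilon) \;=\; \min\Big\{ \ell \in \mathbb{N} \;:\; \exists\, f_1, \dots, f_\ell \text{ such that } \mathcal{F} \subset \bigcup_{j=1}^{\ell} \big\{ f : \|f - f_j\|_\infty \le \epsilon \big\} \Big\},$$

i.e., the minimal number of $\epsilon$-balls (here taken in the uniform norm) needed to cover $\mathcal{F}$. Smaller covering numbers mean a smaller hypothesis space and hence sharper concentration estimates of the type used in Lemma 2 below.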

Lemma 1

Let $\xi$ be a random variable on a probability space $Z$ with variance $\sigma^2$, satisfying $|\xi - E\xi| \le M$ almost surely for some constant $M > 0$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
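
The elided bound is, in all likelihood, the one-sided Bernstein inequality that is standard in this literature; we state the common form for the reader's convenience (constants may differ from the paper's Lemma 1):

$$\frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - E\xi \;\le\; \frac{2 M \log(1/\delta)}{3m} + \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{m}},$$

for i.i.d. samples $z_1, \dots, z_m$ drawn from the underlying distribution on $Z$.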

Lemma 2

Let $\mathcal{F}$ be a set of functions on $Z$ such that, for every $f \in \mathcal{F}$, $|f| \le B$ almost everywhere and $E f^2 \le c\, E f$ for some constants $B, c > 0$. Then for any $\epsilon > 0$,

5.1 Oracle inequality for SVM with twice smooth classifying loss

We present the first oracle inequality, which describes the learning performance of SVM with a twice smooth classifying loss under Assumption 3.

Theorem 4

Let $\phi$ be a twice smooth classifying loss. Under Assumption 3, if there exist two constants such that

(18)

then for arbitrary bounded , there holds

(19)

where the constant is independent of $m$, $\lambda$, or $\delta$; its value is specified in the proof.

To prove Theorem 4, we first prove three propositions.

Proposition 1

Let $f_{D,\lambda}$ be defined by (16). Then for arbitrary , there holds

(20)

where

(21)
(22)
(23)

and .

Proof. Direct computation yields

Since $\phi$ is a classifying loss, there holds . Then it follows from (16) that

Therefore,

This finishes the proof of Proposition 1.
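
Proposition 1 is an error decomposition. Since its displayed terms (20)-(23) are not reproduced above, we sketch the generic decomposition that such results typically formalize (notation and splitting are ours and need not match (20)-(23) exactly): for any fixed $f_0 \in \mathcal{H}_K$,

$$\mathcal{E}^\phi(f_{D,\lambda}) - \mathcal{E}^\phi(f_{\phi,\rho}) + \lambda \|f_{D,\lambda}\|_K^2 \;\le\; \big\{\mathcal{E}^\phi(f_{D,\lambda}) - \mathcal{E}_D^\phi(f_{D,\lambda})\big\} + \big\{\mathcal{E}_D^\phi(f_0) - \mathcal{E}^\phi(f_0)\big\} + \big\{\mathcal{E}^\phi(f_0) - \mathcal{E}^\phi(f_{\phi,\rho}) + \lambda \|f_0\|_K^2\big\},$$

where $\mathcal{E}_D^\phi(f) = \frac{1}{m}\sum_{i=1}^m \phi(y_i f(x_i))$ is the empirical generalization error. The first two brackets are sample-error terms and the last is an approximation-error term; the inequality follows by adding and subtracting $\mathcal{E}_D^\phi(f_{D,\lambda})$ and $\mathcal{E}_D^\phi(f_0)$ and using the defining property $\mathcal{E}_D^\phi(f_{D,\lambda}) + \lambda\|f_{D,\lambda}\|_K^2 \le \mathcal{E}_D^\phi(f_0) + \lambda\|f_0\|_K^2$ of the minimizer in (16).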

Proposition 2

For any $0 < \delta < 1$, if $\phi$ is a twice smooth classifying loss, then with confidence $1 - \delta$, there holds

where .

Proof. Let . Since $\phi$ is continuous, we have . Hence . Moreover, , and the continuous differentiability of $\phi$ show that

which implies

(24)

Applying Lemma 1 to the random variable , we obtain that

holds with confidence $1 - \delta$. This finishes the proof of Proposition 2.

Proposition 3

Let . Under Assumption 3, if $\phi$ is a twice smooth classifying loss and (18) holds for some and , then with confidence $1 - \delta$, there holds

where is a constant depending on and .

Proof. Let . Set

For arbitrary there exists an such that . Therefore,

Since $\phi$ is a classifying loss, we have and . Furthermore, due to (4) and the continuous differentiability of $\phi$, it follows from page 150 (or Lemma 7) of (Bartlett et al., 2006) that

Applying Lemma 2 with , and to , we obtain

Since , it follows from the convexity and continuous differentiability of $\phi$ that for arbitrary , there exist and such that

Thus, for any , an -covering of provides an -covering of . Therefore

Due to (16), we have