1 Introduction
Classification is a fundamental machine learning task used in a wide range of applications. When designing classification algorithms, the 0-1 loss function is preferred, as it produces the Bayes optimal classifier, which has the minimum probability of classification error. However, the 0-1 loss is difficult to optimize because it is neither convex nor smooth (Ben-David et al. 2003, Feldman et al. 2012). Many computationally-friendly surrogate loss functions have therefore been proposed as approximations for the 0-1 loss function. However, natural questions arise of whether they are good approximations, and what the differences are between the surrogate loss functions and the 0-1 loss. To address the first question, the concept of Bayes-risk consistency has been introduced (Lugosi and Vayatis 2004, Bartlett et al. 2006). A surrogate loss function is said to be Bayes-risk consistent if its corresponding empirical minimizer converges to the Bayes optimal classifier when the predefined hypothesis class is universal. That is, with a sufficiently large sample, the minimizers of those surrogate risks are identical to the minimizer of the 0-1 risk in the sense that they achieve the same minimum probability of classification error. Existing results (e.g., Zhang 2004, Bartlett et al. 2006, Agarwal and Agarwal 2015, Neykov et al. 2016) show Bayes-risk consistency under different conditions and, reassuringly, most of the frequently used surrogate loss functions are Bayes-risk consistent.
Although Bayes-risk consistency describes the interchangeable relationship between surrogate loss functions and the 0-1 loss function, it is an asymptotic concept. The non-asymptotic link between a specific surrogate loss function and the 0-1 loss function has remained elusive. In this paper, we study the rates of convergence from surrogate risk minimizers to the Bayes optimal classifier. We derive upper bounds for the difference between the probabilities of classification error w.r.t. the surrogate risk minimizer and the Bayes optimal classifier. Specifically, we introduce the notions of consistency intensity $\beta$ and conductivity $C$, which are uniquely determined by the surrogate loss functions. We show that for any given surrogate loss function $\phi$, if the convergence rate of the excess surrogate risk $R_\phi(\hat{f}_n) - R_\phi^*$ is of order $O(n^{-p})$, where $\hat{f}_n$ is the empirical surrogate risk minimizer, $R_\phi(\hat{f}_n)$ is the expected surrogate risk, and $R_\phi^*$ is the minimal surrogate risk achievable, the corresponding convergence rate of the excess risk $R(\hat{f}_n) - R^*$ is of order $O\left(\left(n^{-p}/C\right)^{1/\beta}\right)$, where $R(\hat{f}_n)$ and $R^*$ are the expected 0-1 risk and the minimal 0-1 risk achievable, respectively. The result is able to (1) describe the non-asymptotic differences between different surrogate loss functions and (2) fairly compare the rates of convergence from different empirical surrogate risk minimizers to the Bayes optimal classifier.
We apply our theorems to popular surrogate loss functions such as the hinge loss function in support vector machines (SVM), the exponential loss function in AdaBoost, and the logistic loss function in logistic regression (Collins et al. 2002, Vapnik 2013). We conclude that SVM converges faster to the Bayes optimal classifier than AdaBoost and logistic regression, while AdaBoost and logistic regression have the same convergence rate. Furthermore, we provide a general rule for fairly comparing the convergence rates of different classification algorithms. We show that for a data-independent surrogate loss, both the consistency intensity and the conductivity are constants, and for different surrogate loss functions, $\beta$ and $C$ vary in $[1, +\infty)$ and $(0, +\infty)$, respectively. Since different minimizers converge to the Bayes optimal classifier at rate $O\left(\left(n^{-p}/C\right)^{1/\beta}\right)$ with constant $\beta$ and $C$, the constants themselves do not contribute to accelerating the convergence rate. However, if we modify the surrogate loss function according to the sample size $n$, $C$ can vary w.r.t. $n$. This finding enables us to devise a data-dependent loss modification method which can accelerate the convergence of a surrogate risk minimizer to the Bayes optimal classifier.
Organization. The remainder of the paper is organized as follows. In Section 2, we first introduce basic mathematical notations. Then we introduce the notions of Bayes optimality and Bayes-risk consistency, along with several lemmas and related works necessary to the proofs of our main theorems. We present our main theorem in Section 3, namely the rate of convergence from the expected risk to the Bayes risk for empirical surrogate risk minimization. In Section 4, we show several applications of our theorem. In particular, Section 4.1 details several specific examples of computing and comparing the rates of convergence to the Bayes risk and provides a general rule to compare the rates of convergence to the Bayes risk for any two algorithms with Bayes-risk consistent loss functions. Then, in Section 4.2, we apply our theorems to modifying the hinge loss function to accelerate its convergence to the Bayes risk. In Section 5, we conclude the paper and briefly discuss future work.
2 Preliminaries
We present basic notations in Section 2.1 and briefly introduce the concept of Bayes-risk consistency and necessary lemmas in Section 2.2. In Section 2.3, we discuss related work on the statistical properties of Bayes-risk consistent surrogate loss functions, which plays a critical role in the proofs of our theorems.
2.1 Notation
In this paper, we focus on binary classification, where the feature space $\mathcal{X}$ is a subset of a Hilbert space and the label space is denoted by $\mathcal{Y} = \{-1, +1\}$. We assume that a pair of random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ is generated according to an unknown distribution $D$, where $P(X, Y)$ is the corresponding joint probability. Binary classification aims to find a map $f: \mathcal{X} \to \mathbb{R}$ within some particular predefined hypothesis class $\mathcal{F}$ such that the sign of $f(x)$ can be used as a prediction for $y$. To evaluate the goodness of $f$, some performance measures are required. Intuitively, the 0-1 risk is employed, defined as:
$R(f) = \mathbb{E}_{(X,Y)\sim D}\left[\mathbb{1}\left(\operatorname{sign}(f(X)) \neq Y\right)\right]$, (2.1)
where $\mathbb{1}(\cdot)$ denotes the indicator function. From the definition, we can see that minimizing the 0-1 risk is equivalent to minimizing the probability of classification error. We hope to find the function such that the probability of classification error is minimized, which is called the Bayes optimal classifier, defined as follows:
$f^* = \arg\inf_{f} R(f)$, (2.2)
where the infimum is over all measurable functions. The corresponding expected risk is called the Bayes risk:
$R^* = R(f^*) = \inf_{f} R(f)$. (2.3)
In this paper, we assume that the predefined hypothesis class $\mathcal{F}$ is universal, which means the Bayes optimal classifier $f^*$ is always in $\mathcal{F}$.
Since the joint distribution $D$ is unknown, we cannot calculate $R(f)$ directly. Given a training sample $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $D$, the following empirical risk is widely exploited to approximate the expected risk $R(f)$:
$\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(\operatorname{sign}(f(x_i)) \neq y_i\right)$. (2.4)
Directly minimizing the above empirical risk is NP-hard due to the non-convexity of the 0-1 loss function, which forces us to adopt convex surrogate loss functions. Similarly, for any nonnegative surrogate loss function $\phi: \mathcal{Z} \to [0, +\infty)$, where $\mathcal{Z}$ is the space of the output of the classifier, we can define the $\phi$-risk, optimal $\phi$-risk, empirical $\phi$-risk, and the empirical surrogate risk minimizer as:
$R_\phi(f) = \mathbb{E}_{(X,Y)\sim D}\left[\phi\left(Y f(X)\right)\right]$, (2.5)
$R_\phi^* = \inf_{f} R_\phi(f)$, (2.6)
$\hat{R}_\phi(f) = \frac{1}{n}\sum_{i=1}^{n} \phi\left(y_i f(x_i)\right)$, (2.7)
$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_\phi(f)$. (2.8)
Moreover, we define the excess risk and the excess $\phi$-risk, respectively, as follows:
$R(f) - R^*$, (2.9)
$R_\phi(f) - R_\phi^*$. (2.10)
For classification tasks, the loss function is often margin-based, and we can rewrite $\phi(x, y, f)$ as $\phi\left(y f(x)\right)$, where the quantity $y f(x)$ is known as the margin, which can be interpreted as the confidence in prediction (Mohri et al. 2012). In this paper, we study margin-based loss functions.
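To make the definitions in this subsection concrete, here is a small numerical sketch (our own illustration, not part of the original development; the synthetic data distribution and the fixed linear scorer are arbitrary assumptions). It computes the empirical 0-1 risk (2.4) and the empirical $\phi$-risk (2.7) for the three margin-based surrogates studied later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data in R^2 with labels in {-1, +1} (illustrative assumption).
n = 1_000
X = rng.normal(size=(n, 2))
w_true = np.array([1.0, -0.5])
y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))

# A fixed (not optimized) linear scorer f(x) = <w, x>.
w = np.array([0.8, -0.4])
margins = y * (X @ w)                                # margins y_i * f(x_i)

# Empirical 0-1 risk (2.4) and empirical phi-risks (2.7) for three surrogates.
risk_01 = np.mean(margins <= 0)                      # sign(f(x)) != y  <=>  y f(x) <= 0
risk_hinge = np.mean(np.maximum(0.0, 1.0 - margins))
risk_exp = np.mean(np.exp(-margins))
risk_logistic = np.mean(np.log1p(np.exp(-margins)))

print(f"0-1: {risk_01:.3f}  hinge: {risk_hinge:.3f}  "
      f"exp: {risk_exp:.3f}  logistic: {risk_logistic:.3f}")
```

Any other margin-based surrogate can be swapped in by replacing the function applied to the margins.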
2.2 Optimality and Bayes-risk Consistency
We first introduce the concept of Bayes optimality. Let us define the conditional probability
$\eta(x) = P(Y = 1 \mid X = x)$. (2.11)
Lemma 1.
(Bousquet et al. 2004) Assume the random pair $(X, Y)$ follows a given distribution $D$. Then, any classifier $f$, which satisfies $\operatorname{sign}(f(x)) = \operatorname{sign}\left(2\eta(x) - 1\right)$ whenever $\eta(x) \neq 1/2$, is Bayes optimal under $D$.
Before we introduce the notion of Bayes-risk consistency for surrogate loss functions, we need to present several basic definitions (Bousquet et al. 2004). First, we define the conditional risk as:
$C(x, f) = \eta(x)\,\phi\left(f(x)\right) + \left(1 - \eta(x)\right)\phi\left(-f(x)\right)$. (2.12)
We then introduce the generic conditional risk by letting $\eta(x) = \eta$ and $f(x) = \alpha$:
$C_\eta(\alpha) = \eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)$. (2.13)
The definition of the optimal generic conditional risk $H(\eta)$ immediately follows:
$H(\eta) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha)$. (2.14)
We also define $H^-(\eta)$ as:
$H^-(\eta) = \inf_{\alpha:\, \alpha(2\eta - 1) \le 0} C_\eta(\alpha)$. (2.15)
This definition follows the optimal generic conditional risk $H(\eta)$, but with the constraint that the sign of the output $\alpha$ differs from that of $2\eta - 1$. Under these settings, we define Bayes-risk consistency.
Definition 1.
(Lugosi and Vayatis 2004) A surrogate loss function $\phi$ is Bayes-risk consistent if the minimizer of the generic conditional risk, $\alpha^*(\eta) = \arg\min_{\alpha} C_\eta(\alpha)$, has the same sign as the Bayes optimal classifier for any $\eta \neq 1/2$. Or simply, $\alpha^*(\eta)\left(2\eta - 1\right) > 0$ for any $\eta \neq 1/2$.
We here present a necessary and sufficient condition for the Bayes-risk consistency of surrogate loss functions.
Lemma 2.
(Bousquet et al. 2004) A surrogate loss $\phi$ is Bayes-risk consistent if and only if $H^-(\eta) > H(\eta)$ for any $\eta \neq 1/2$.
Lemma 2 has an intuitive explanation: any Bayes-risk consistent loss must always lead to a strictly larger conditional risk for any $\eta \neq 1/2$ when the sign of the output differs from that of the Bayes optimal classifier.
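As a numerical sanity check on Lemma 2 (our own sketch; the grid of scores is an arbitrary discretization), the code below approximates $H(\eta)$ and $H^-(\eta)$ and shows that the hinge loss has a strictly positive gap $H^-(\eta) - H(\eta)$ for $\eta \neq 1/2$, while a hypothetical "reversed" hinge $\phi(\alpha) = \max(0, 1 + \alpha)$, which rewards wrong-signed outputs, has no gap and is therefore not Bayes-risk consistent.

```python
import numpy as np

def H_and_H_minus(phi, eta, alphas):
    """Grid approximations of H(eta) and H^-(eta) from (2.14) and (2.15)."""
    c = eta * phi(alphas) + (1.0 - eta) * phi(-alphas)   # generic conditional risk
    wrong_sign = alphas * (2.0 * eta - 1.0) <= 0          # outputs disagreeing with Bayes
    return c.min(), c[wrong_sign].min()

hinge = lambda a: np.maximum(0.0, 1.0 - a)
reversed_hinge = lambda a: np.maximum(0.0, 1.0 + a)       # increasing, hence not consistent

alphas = np.linspace(-5.0, 5.0, 20_001)
for eta in (0.6, 0.8):
    for name, phi in (("hinge", hinge), ("reversed hinge", reversed_hinge)):
        H, H_minus = H_and_H_minus(phi, eta, alphas)
        print(f"eta={eta:.1f} {name:>14}: H={H:.4f}, H^-={H_minus:.4f}, gap={H_minus - H:.4f}")
```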
2.3 Asymptotic Consistency in Surrogate Risk Optimization
We briefly introduce some related works on Bayes-risk consistency and its statistical properties in classification. Lugosi and Vayatis 2004 proved that Bayes-risk consistency is satisfied for empirical surrogate loss minimization under the condition that the surrogate loss is strictly convex, differentiable, and monotonic. Bartlett et al. 2006 then offered a more general result, showing that Bayes-risk consistency holds if and only if the loss function is Bayes-risk consistent in the sense of Definition 1 (in their paper, the term "classification-calibrated" is used instead of "Bayes-risk consistent"). Other results on Bayes-risk consistency under different assumptions have been presented by Zhang 2004, Steinwart 2005, and Neykov et al. 2016.
Below, we give the main result proved in Bartlett et al. 2006, which shows that minimizing any surrogate risk with a Bayes-risk consistent loss function is asymptotically equivalent to minimizing the 0-1 risk and thus leads to the Bayes optimal classifier.
Theorem 1.
(Bartlett et al. 2006) Any convex loss function $\phi$ is Bayes-risk consistent if and only if it is differentiable at $0$ and $\phi'(0) < 0$. Then for such a convex and Bayes-risk consistent loss function $\phi$, any measurable function $f$, and any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, the following inequality holds,
$\psi\left(R(f) - R^*\right) \le R_\phi(f) - R_\phi^*$, (2.16)
where $\psi(\theta) = \phi(0) - H\left(\frac{1 + \theta}{2}\right)$ is nonnegative, convex, and invertible on $[0, 1]$ and has only one zero at $\theta = 0$.
The above theorem gives an upper bound on the excess risk in terms of the excess $\phi$-risk and shows that minimizing any convex Bayes-risk consistent surrogate loss is asymptotically equivalent to minimizing the 0-1 loss, because the function $\psi$ is invertible and has only a single zero at $\theta = 0$. We provide deeper insights into this theorem in the next section.
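Since $\psi(\theta) = \phi(0) - H\left(\frac{1+\theta}{2}\right)$ for convex Bayes-risk consistent losses, the transform can be evaluated numerically even when no closed form is handy. The sketch below (ours; the score grid bounds are arbitrary) does this for the hinge and exponential losses and checks the values against the closed forms derived later in Section 4.1.

```python
import numpy as np

def psi(phi, thetas, alphas=np.linspace(-20.0, 20.0, 40_001)):
    """psi(theta) = phi(0) - H((1 + theta)/2), with H approximated on a score grid."""
    out = []
    for t in thetas:
        eta = (1.0 + t) / 2.0
        H = (eta * phi(alphas) + (1.0 - eta) * phi(-alphas)).min()
        out.append(phi(0.0) - H)
    return np.array(out)

hinge = lambda a: np.maximum(0.0, 1.0 - a)
exp_loss = lambda a: np.exp(-a)

thetas = np.array([0.1, 0.3, 0.5, 0.7])
print("theta        :", thetas)
print("psi hinge    :", np.round(psi(hinge, thetas), 4))    # closed form: theta
print("psi exp      :", np.round(psi(exp_loss, thetas), 4)) # closed form: 1 - sqrt(1 - theta^2)
print("1-sqrt(1-t^2):", np.round(1.0 - np.sqrt(1.0 - thetas**2), 4))
```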
3 The Rates of Convergence from Empirical Surrogate Risk Minimizers to the Bayes Optimal Classifier
We know that optimizing any empirical surrogate risk with a Bayes-risk consistent loss function will lead to the Bayes optimal classifier when the training sample size is large enough. However, a natural question arises when we optimize different empirical surrogate risks with Bayes-risk consistent loss functions: "What are the differences between them? Do the minimizers have the same rate of convergence to the Bayes optimal classifier?" These questions are essential because when we choose classification algorithms for a real-world problem, we may expect a fast convergence to the Bayes optimal classifier, which also implies a small sample complexity.
To answer the above questions, we need to find a proper metric to measure the distance between $\hat{f}_n$ and $f^*$. Since we care most about the probability of classification error of the proposed learning algorithm, it is reasonable to measure the rate of convergence from the expected risk $R(\hat{f}_n)$ to the Bayes risk $R^*$ instead of measuring the rate of convergence from $\hat{f}_n$ to $f^*$ directly.
Before showing the results on convergence rates, we present the intuition behind our work.
3.1 Intuition of the Proposed Method
For any empirical surrogate risk minimizer $\hat{f}_n$, we can rewrite the inequality in Theorem 1 as follows:
$\psi\left(R(\hat{f}_n) - R^*\right) \le R_\phi(\hat{f}_n) - R_\phi^* = \left(R_\phi(\hat{f}_n) - \inf_{f \in \mathcal{F}} R_\phi(f)\right) + \left(\inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^*\right)$. (3.1)
To achieve our goal of measuring the rate of convergence from the empirical surrogate risk minimizer to the Bayes optimal classifier, we need to bound the term $\psi\left(R(\hat{f}_n) - R^*\right)$. As $\psi$ is invertible, we can bound the term on the right-hand side of the inequality. We call the first term on the right-hand side the estimation error, which depends on the learning algorithm and the training data. The second term is called the approximation error, which depends on the choice of the hypothesis class $\mathcal{F}$. Often, the hypothesis class is predefined and universal, e.g., for deep learning algorithms. Thus, in this paper, we simply assume that the Bayes optimal classifier lies in the hypothesis class $\mathcal{F}$. In other words, we have $\inf_{f \in \mathcal{F}} R_\phi(f) = R_\phi^*$, and (3.1) becomes
$\psi\left(R(\hat{f}_n) - R^*\right) \le R_\phi(\hat{f}_n) - \inf_{f \in \mathcal{F}} R_\phi(f)$. (3.2)
The right side of the above inequality can be further upper bounded by
$R_\phi(\hat{f}_n) - \inf_{f \in \mathcal{F}} R_\phi(f) \le 2\sup_{f \in \mathcal{F}}\left|R_\phi(f) - \hat{R}_\phi(f)\right|$, (3.3)
where the defect $\sup_{f \in \mathcal{F}}\left|R_\phi(f) - \hat{R}_\phi(f)\right|$ on the right-hand side is called the generalization error. Using the concentration of measure (Boucheron et al. 2013) and the uniform convergence argument, e.g., VC-dimension (Vapnik 2013), covering numbers (Zhang 2002), and Rademacher complexity (Bartlett and Mendelson 2002), the generalization error can be non-asymptotically upper bounded with a high probability. Often, the upper bounds can reach the order $O(n^{-1/2})$ (Mohri et al. 2012). We also note that the excess $\phi$-risk can achieve convergence rates faster than $O(n^{-1/2})$, for example by exploiting local Rademacher complexities, low-noise models, and strong convexity (Tsybakov 2004, Bartlett et al. 2005, Koltchinskii et al. 2006, Sridharan et al. 2009, Liu et al. 2017). Here we mainly consider the ordinary case where the convergence rate of the excess $\phi$-risk is of order $O(n^{-1/2})$, but our results directly generalize to the other cases.
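The following toy Monte Carlo sketch is entirely our own construction (the one-dimensional distribution, the scalar hypothesis class, and the grid search are all illustrative assumptions); it shows the estimation error of an empirical hinge-risk minimizer shrinking roughly like $O(n^{-1/2})$ as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # One-dimensional toy distribution (illustrative assumption):
    # X ~ N(0, 1) and P(Y = 1 | X = x) = 1 / (1 + exp(-2x)).
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-2.0 * x))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    return x, y

def hinge_risk(w, x, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (w * x)))

grid = np.linspace(0.0, 5.0, 251)            # hypothesis class: f_w(x) = w * x
x_test, y_test = sample(200_000)             # large sample approximating R_phi
R_F = min(hinge_risk(w, x_test, y_test) for w in grid)   # approx. inf over the class

for n in (100, 1_000, 10_000):
    excess = []
    for _ in range(20):                      # average over 20 training draws
        x_tr, y_tr = sample(n)
        w_hat = grid[np.argmin([hinge_risk(w, x_tr, y_tr) for w in grid])]
        excess.append(hinge_risk(w_hat, x_test, y_test) - R_F)
    print(f"n={n:>6}: mean excess hinge risk ~ {np.mean(excess):.4f}")
```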
Theorem 1 also implies that if $\phi$ is convex and Bayes-risk consistent, then for any sequence of measurable functions $\{f_n\}$ and any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$,
$R_\phi(f_n) \to R_\phi^* \ \text{implies} \ R(f_n) \to R^*, \quad n \to \infty$. (3.4)
This presents the dynamics of Bayes-risk consistency for any convex Bayes-risk consistent loss in an asymptotic way.
Observe that in inequality (2.16), the asymptotic consistency property is mainly due to the uniqueness of the zero of the function $\psi$, whose only zero is at $\theta = 0$. Thus, when $R_\phi(\hat{f}_n) - R_\phi^* \to 0$, we have $\psi\left(R(\hat{f}_n) - R^*\right) \to 0$, which leads to $R(\hat{f}_n) - R^* \to 0$. However, if we want to derive the rate of convergence from the expected risk $R(\hat{f}_n)$ to the Bayes risk $R^*$, we may need more detailed or higher-order properties of the function $\psi$ in the infinitesimal right neighborhood of $0$, rather than just the value of $\psi$ at $0$.
In the next subsection, we will exploit the upper bound on the excess $\phi$-risk and a higher-order property of the function $\psi$ to derive upper bounds for $R(\hat{f}_n) - R^*$.
3.2 Consistency Intensity for Bayes-risk Consistent Loss Functions
Knowing the convergence rate of the excess $\phi$-risk and the relation (2.16), it is straightforward to consider taking the inverse of the function $\psi$ on both sides, yielding the upper bound $R(\hat{f}_n) - R^* \le \psi^{-1}\left(R_\phi(\hat{f}_n) - R_\phi^*\right)$. However, in reality, for most Bayes-risk consistent loss functions, the corresponding $\psi$ is intractable to invert analytically. Furthermore, the term $\psi^{-1}\left(O(n^{-p})\right)$ may not reflect the order of $n$ explicitly. Thus, we must figure out the factor that determines the convergence rate of the excess risk, namely some higher-order property of the function $\psi$ within the infinitesimal right neighborhood of $0$. Before we move on to our main theorems, let us introduce some basic lemmas and propositions first.
Proposition 1.
For any two functions $g$ and $h$, which are differentiable at $x_0$ and satisfy $g(x_0) = h(x_0) = 0$, the following conditions are equivalent:
(1) $g(x) = O\left(h(x)\right)$ as $x \to x_0$;
(2) $\limsup_{x \to x_0} \left|g(x)/h(x)\right| < +\infty$;
(3) there exists a constant $M > 0$ such that $|g(x)| \le M\,|h(x)|$ when $x \to x_0$.
Proposition 1 can be proved directly from the definitions of the $O(\cdot)$ and $o(\cdot)$ notation. We now introduce the notions of consistency intensity and conductivity for convex Bayes-risk consistent loss functions.
Lemma 1.
For any given convex Bayes-risk consistent loss function $\phi$, let $\psi(\theta) = \phi(0) - H\left(\frac{1 + \theta}{2}\right)$. There exist two unique constants $C > 0$ and $\beta \ge 1$ such that
$\lim_{\theta \to 0^+} \frac{\psi(\theta)}{C\,\theta^\beta} = 1$. (3.5)
We call $\beta$ the consistency intensity of this Bayes-risk consistent loss function and $C$ the conductivity of the intensity.
Proof.
From Theorem 1, we know that $\psi$ is a convex function. It is also known that any convex function on a convex open subset of $\mathbb{R}$ is semi-differentiable. Thus, we can denote the right derivative of $\psi$ at $0$ by $\psi'_+(0)$. Using the Maclaurin expansion, we have,
$\psi(\theta) = \psi(0) + \psi'_+(0)\,\theta + o(\theta) = \psi'_+(0)\,\theta + o(\theta), \quad \theta \to 0^+$. (3.6)
Followed by Proposition 1, if $\psi'_+(0) \neq 0$, we have,
$\lim_{\theta \to 0^+} \frac{\psi(\theta)}{\psi'_+(0)\,\theta} = 1$, (3.7)
so that $\beta = 1$ and $C = \psi'_+(0)$. If $\psi'_+(0) = 0$, then $\psi(\theta) = o(\theta)$, which means that $\psi(\theta)$ is a higher-order infinitesimal of $\theta$ as $\theta \to 0^+$. Then, by definition, for any given $\beta' > 1$, we can compute $\lim_{\theta \to 0^+} \psi(\theta)/\theta^{\beta'}$. Because $\psi(\theta)$ is a higher-order infinitesimal of $\theta$ as $\theta \to 0^+$, there exist unique $C > 0$ and $\beta > 1$ such that,
$\lim_{\theta \to 0^+} \frac{\psi(\theta)}{C\,\theta^\beta} = 1$, (3.8)
which completes the proof. ∎
Lemma 2.
Given any convex Bayes-risk consistent loss function $\phi$, let the inverse of its corresponding $\psi$ transform be denoted by $\psi^{-1}$. We have
$\lim_{x \to 0^+} \frac{\psi^{-1}(x)}{\left(x/C\right)^{1/\beta}} = 1$. (3.9)
Lemma 2 shows that the equivalent infinitesimal of $\psi^{-1}(x)$ near $0$ is $\left(x/C\right)^{1/\beta}$. Then, we introduce an important property of the $\psi^{-1}$ transform.
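Lemmas 1 and 2 also suggest a practical recipe (our own illustration; the evaluation points are arbitrary and the closed-form transforms anticipate Section 4.1): near $0$, $\log\psi(\theta) \approx \log C + \beta\log\theta$, so the intensity $\beta$ can be estimated as a local log-log slope and the conductivity $C$ recovered afterwards.

```python
import numpy as np

def psi_hinge(theta):
    return theta                                   # closed form, see Example 1 below

def psi_exp(theta):
    return 1.0 - np.sqrt(1.0 - theta**2)           # closed form, see Example 2 below

def intensity_and_conductivity(psi, eps=1e-4):
    """Estimate (beta, C) in psi(theta) ~ C theta^beta as theta -> 0+ (Lemma 1)."""
    t1, t2 = eps, 2.0 * eps
    beta = np.log(psi(t2) / psi(t1)) / np.log(t2 / t1)   # local log-log slope
    C = psi(t1) / t1**beta
    return beta, C

for name, psi in (("hinge", psi_hinge), ("exponential", psi_exp)):
    beta, C = intensity_and_conductivity(psi)
    print(f"{name:>12}: beta ~ {beta:.3f}, C ~ {C:.3f}")
```

The printed values recover $(\beta, C) = (1, 1)$ for the hinge loss and $(2, 1/2)$ for the exponential loss.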
Lemma 3.
Given any convex Bayes-risk consistent loss function $\phi$, its corresponding function $\psi^{-1}$ is interchangeable with $O(\cdot)$ for any $n^{-p}$ with $p > 0$. That is,
$\psi^{-1}\left(O\left(n^{-p}\right)\right) = O\left(\psi^{-1}\left(n^{-p}\right)\right), \quad n \to \infty$. (3.12)
Proof.
From Proposition 1, we have that there exists $M > 0$ such that,
$O\left(n^{-p}\right) \le M\,n^{-p}$ for sufficiently large $n$, (3.13)
and, since $\psi^{-1}$ is nondecreasing near $0$,
$\psi^{-1}\left(O\left(n^{-p}\right)\right) \le \psi^{-1}\left(M\,n^{-p}\right)$. (3.14)
Substituting (3.13) and (3.14) into (3.12), it is equivalent to proving that for any $M > 0$, there exists $M' > 0$, such that,
$\psi^{-1}\left(M\,n^{-p}\right) \le M'\,\psi^{-1}\left(n^{-p}\right)$ for sufficiently large $n$. (3.15)
To prove (3.15), by Proposition 1, we only need to prove that, for any $M > 0$, there exists $M' > 0$, such that,
$\limsup_{n \to \infty} \frac{\psi^{-1}\left(M\,n^{-p}\right)}{\psi^{-1}\left(n^{-p}\right)} \le M' < +\infty$. (3.16)
Followed by Lemma 2 and Proposition 1, we have
$\psi^{-1}(x) = \left(\frac{x}{C}\right)^{1/\beta} + o\left(\left(\frac{x}{C}\right)^{1/\beta}\right), \quad x \to 0^+$. (3.17)
Substituting (3.17) into (3.16), we have,
$\lim_{n \to \infty} \frac{\psi^{-1}\left(M\,n^{-p}\right)}{\psi^{-1}\left(n^{-p}\right)} = \lim_{n \to \infty} \frac{\left(M\,n^{-p}/C\right)^{1/\beta}}{\left(n^{-p}/C\right)^{1/\beta}} = M^{1/\beta} < +\infty$. (3.18)
This means that for any $M > 0$, there exists $M' = M^{1/\beta}$ such that (3.16) holds true, which completes the proof. ∎
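The limit in (3.18) is easy to verify numerically. The sketch below (ours) uses the exponential loss, whose transform $\psi(\theta) = 1 - \sqrt{1 - \theta^2}$ (derived in Example 2 of Section 4.1) has the closed-form inverse $\psi^{-1}(x) = \sqrt{1 - (1 - x)^2}$ and intensity $\beta = 2$.

```python
import numpy as np

def psi_inv_exp(x):
    # Closed-form inverse of psi(theta) = 1 - sqrt(1 - theta^2) (exponential loss).
    return np.sqrt(1.0 - (1.0 - x)**2)

M, beta = 3.0, 2.0
for x in (1e-2, 1e-4, 1e-6):
    ratio = psi_inv_exp(M * x) / psi_inv_exp(x)
    print(f"x={x:.0e}: psi^-1(Mx)/psi^-1(x) = {ratio:.4f}, M^(1/beta) = {M**(1.0 / beta):.4f}")
```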
Following the definitions of $\beta$ and $C$ in Lemma 1, we have our first main theorem.
Theorem 1.
Suppose the excess $\phi$-risk satisfies $R_\phi(\hat{f}_n) - R_\phi^* \le O\left(n^{-p}\right)$ with a high probability. Then, with the same high probability, we have
$R(\hat{f}_n) - R^* \le O\left(\left(\frac{n^{-p}}{C}\right)^{1/\beta}\right)$. (3.19)
Proof.
From Theorem 1, we have that
$\psi\left(R(\hat{f}_n) - R^*\right) \le R_\phi(\hat{f}_n) - R_\phi^*$, (3.20)
where $\hat{f}_n$ is the minimizer of the empirical surrogate risk with sample size $n$. Under our assumption that the Bayes optimal classifier is within the hypothesis class $\mathcal{F}$, with a high probability, we have (Bousquet et al. 2004),
$R_\phi(\hat{f}_n) - R_\phi^* \le O\left(n^{-p}\right)$. (3.21)
Note that $p$ is often equal to $1/2$ in the worst cases. With (3.21) and Lemma 3, we have,
$R(\hat{f}_n) - R^* \le \psi^{-1}\left(O\left(n^{-p}\right)\right) = O\left(\psi^{-1}\left(n^{-p}\right)\right)$. (3.22)
Substituting (3.17) into (3.22), we have,
$R(\hat{f}_n) - R^* \le O\left(\left(\frac{n^{-p}}{C}\right)^{1/\beta}\right)$, (3.23)
which completes the proof. ∎
Theorem 1 shows that the consistency intensity $\beta$ and the conductivity $C$ have a direct influence on the convergence rate, which is of order $O\left(\left(n^{-p}/C\right)^{1/\beta}\right)$. When $\beta > 1$, the convergence rate from $R(\hat{f}_n)$ to $R^*$ will be slower than the convergence rate from $R_\phi(\hat{f}_n)$ to $R_\phi^*$; if $\beta = +\infty$, the algorithm will never reach the Bayes optimal classifier; if $\beta = 1$, the convergence rates will be the same. In the next theorem, we show that for any convex Bayes-risk consistent loss function, the range of $\beta$ is $[1, +\infty)$.
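The bookkeeping in Theorem 1 is mechanical, as the following one-line helper shows (our illustration; the $(\beta, C)$ pairs plugged in below anticipate the worked examples of Section 4.1).

```python
def zero_one_rate(n, p=0.5, beta=1.0, C=1.0):
    """Order of the Theorem 1 bound: (n^-p / C)^(1 / beta)."""
    return (n ** (-p) / C) ** (1.0 / beta)

# A beta=1, C=1 loss (hinge) vs. a beta=2, C=1/2 loss (exponential/logistic).
for n in (10**3, 10**4, 10**5):
    print(f"n={n:>6}: beta=1 -> {zero_one_rate(n, beta=1, C=1):.5f}, "
          f"beta=2 -> {zero_one_rate(n, beta=2, C=0.5):.5f}")
```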
Theorem 2.
For any convex Bayes-risk consistent loss function $\phi$, it always holds true that $\beta \in [1, +\infty)$.
Proof.
We have from Lemma 1 that $\beta$ is a finite constant, so $\beta < +\infty$ holds trivially. To prove $\beta \ge 1$, it is equivalent to proving that there exists $M > 0$ such that,
$\psi(\theta) \le M\,\theta, \quad \theta \to 0^+$, (3.24)
because $\beta \ge 1$ holds true if and only if $\psi(\theta) = O(\theta)$ as $\theta \to 0^+$; and $\beta < 1$ holds true if and only if $\theta = o\left(\psi(\theta)\right)$ as $\theta \to 0^+$. From Proposition 1, the condition (3.24) is equivalent to,
$\limsup_{\theta \to 0^+} \frac{\psi(\theta)}{\theta} < +\infty$. (3.25)
Since any convex function on a convex open subset in $\mathbb{R}$ is semi-differentiable, $\psi$ is at least right differentiable at $0$. We therefore have,
$\limsup_{\theta \to 0^+} \frac{\psi(\theta)}{\theta} = \psi'_+(0) < +\infty$, (3.26)
where $\psi'_+(0)$ denotes the right derivative of $\psi$ at $0$. The proof ends. ∎
Theorem 2 shows that for data-independent surrogate loss functions, because $\beta \ge 1$, the convergence rate from $R(\hat{f}_n)$ to $R^*$ will not be faster than the convergence rate from $R_\phi(\hat{f}_n)$ to $R_\phi^*$. As $\beta < +\infty$, it also means that optimizing any convex Bayes-risk consistent surrogate loss will finally make the excess risk converge to $0$, and thus the output is the Bayes optimal classifier as the sample size tends to infinity. This result matches our intuition: while we benefit from the computational efficiency of convex surrogate loss functions, we also suffer from a slower rate of convergence to the Bayes optimal classifier.
4 Applications
In this section, we present several applications of our results. In Section 4.1, we use the notion of consistency intensity to measure the rates of convergence from the empirical surrogate risk minimizers to the Bayes optimal classifier for different classification algorithms, such as support vector machines, boosting, and logistic regression. We also derive a general discriminant rule for comparing the convergence rates of different learning algorithms. In Section 4.2, we show that the notions of consistency intensity and conductivity can help to modify surrogate loss functions so as to achieve a faster convergence rate from $R(\hat{f}_n)$ to $R^*$.
4.1 Consistency Measurement
In this subsection, we first apply our results to some popular classification algorithms.
Example 1 (Hinge loss in SVM). Here we have $\phi(\alpha) = \max(0, 1 - \alpha)$, which is convex and satisfies $\phi'(0) = -1 < 0$. Note that we have defined $C_\eta(\alpha)$ in (2.13). Let
$C_\eta(\alpha) = \eta \max(0, 1 - \alpha) + (1 - \eta)\max(0, 1 + \alpha)$. (4.1)
It is easy to verify that
$H(\eta) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) = 2\min(\eta, 1 - \eta)$ (4.2)
and
$H\left(\frac{1 + \theta}{2}\right) = 1 - |\theta|$. (4.3)
So we have,
$\psi(\theta) = \phi(0) - H\left(\frac{1 + \theta}{2}\right) = \theta, \quad \theta \in [0, 1]$. (4.4)
By Lemma 1, we have $\beta = 1$ and $C = 1$. Thus for SVM, by Theorem 1 with $p = 1/2$, with a high probability, we have,
$R(\hat{f}_n) - R^* \le O\left(n^{-1/2}\right)$. (4.5)
Example 2 (Exponential loss in AdaBoost). We have $\phi(\alpha) = e^{-\alpha}$, which is convex and satisfies $\phi'(0) = -1 < 0$. Then, it is easy to derive that
$C_\eta(\alpha) = \eta\,e^{-\alpha} + (1 - \eta)\,e^{\alpha}$ (4.6)
and
$H(\eta) = 2\sqrt{\eta(1 - \eta)}$. (4.7)
So,
$\psi(\theta) = \phi(0) - H\left(\frac{1 + \theta}{2}\right) = 1 - \sqrt{1 - \theta^2}$. (4.8)
Using the Maclaurin expansion, we have,
$\psi(\theta) = \frac{\theta^2}{2} + o\left(\theta^2\right), \quad \theta \to 0^+$. (4.9)
Thus, $\beta = 2$ and $C = 1/2$. For AdaBoost, with a high probability, we have
$R(\hat{f}_n) - R^* \le O\left(\left(2\,n^{-1/2}\right)^{1/2}\right) = O\left(n^{-1/4}\right)$. (4.10)
Example 3 (Logistic loss in logistic regression). We have $\phi(\alpha) = \ln\left(1 + e^{-\alpha}\right)$, which is convex and satisfies $\phi'(0) = -1/2 < 0$. We can follow similar procedures as before,
$H(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$. (4.11)
So,
$\psi(\theta) = \frac{1 + \theta}{2}\ln(1 + \theta) + \frac{1 - \theta}{2}\ln(1 - \theta)$. (4.12)
Using the Maclaurin expansion, we have,
$\psi(\theta) = \frac{\theta^2}{2} + o\left(\theta^2\right), \quad \theta \to 0^+$. (4.13)
Therefore, $\beta = 2$ and $C = 1/2$. For logistic regression, with a high probability, we have
$R(\hat{f}_n) - R^* \le O\left(n^{-1/4}\right)$. (4.14)
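As a quick numerical check on Examples 2 and 3 (our own sketch), the following code confirms that both transforms behave like $\theta^2/2$ near zero, in agreement with $\beta = 2$ and $C = 1/2$.

```python
import numpy as np

def psi_exp(theta):
    return 1.0 - np.sqrt(1.0 - theta**2)

def psi_logistic(theta):
    return ((1.0 + theta) / 2.0) * np.log(1.0 + theta) \
         + ((1.0 - theta) / 2.0) * np.log(1.0 - theta)

for theta in (0.1, 0.01, 0.001):
    print(f"theta={theta:>6}: psi_exp/(theta^2/2) = {psi_exp(theta) / (theta**2 / 2):.4f}, "
          f"psi_logistic/(theta^2/2) = {psi_logistic(theta) / (theta**2 / 2):.4f}")
```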
From the above examples, we conclude that SVM has a faster convergence rate to the Bayes optimal classifier than AdaBoost and logistic regression. Note that the proposed methods also apply to many other surrogate loss functions. However, if we only need to compare the convergence rates of any two classification algorithms, we may not need to compute the consistency intensities of the two surrogate loss functions explicitly. The following theorem gives a general discriminant rule.
Theorem 1.
Given two convex Bayes-risk consistent loss functions $\phi_1$ and $\phi_2$, denote their corresponding transforms by $\psi_1$ and $\psi_2$, and assume that their excess $\phi$-risks have the same convergence rate of order $O\left(n^{-p}\right)$. Then, we define the intensity ratio as follows,
$r = \lim_{x \to 0^+} \frac{\psi_1^{-1}(x)}{\psi_2^{-1}(x)}$. (4.15)
We have the following statements:
(1) if $0 < r < +\infty$, then the minimizers w.r.t. $\phi_1$ and $\phi_2$ converge equally fast to the Bayes optimal classifier;
(2) if $r = 0$, then the minimizer w.r.t. $\phi_1$ converges faster to the Bayes optimal classifier;
(3) if $r = +\infty$, then the minimizer w.r.t. $\phi_2$ converges faster to the Bayes optimal classifier.
Proof.
Following Proposition 1 and Lemma 3, we have,
$\psi_i^{-1}(x) = \left(\frac{x}{C_i}\right)^{1/\beta_i} + o\left(\left(\frac{x}{C_i}\right)^{1/\beta_i}\right), \quad i = 1, 2$. (4.16)
Then, we get,
$r = \lim_{x \to 0^+} \frac{\left(x/C_1\right)^{1/\beta_1}}{\left(x/C_2\right)^{1/\beta_2}} = \frac{C_2^{1/\beta_2}}{C_1^{1/\beta_1}}\,\lim_{x \to 0^+} x^{\frac{1}{\beta_1} - \frac{1}{\beta_2}}$. (4.17)
Thus, we can conclude:
(1) for $0 < r < +\infty$, we have $\beta_1 = \beta_2$, so the two excess risks share the same order $O\left(\left(n^{-p}\right)^{1/\beta_1}\right)$. Therefore, the minimizers w.r.t. $\phi_1$ and $\phi_2$ converge equally fast to the Bayes optimal classifier;
(2) for $r = 0$, we have $\frac{1}{\beta_1} > \frac{1}{\beta_2}$ and thus $\beta_1 < \beta_2$. Thus the minimizer w.r.t. $\phi_1$ converges faster to the Bayes optimal classifier;
(3) for $r = +\infty$, we have $\beta_1 > \beta_2$, which means that the minimizer w.r.t. $\phi_2$ converges faster to the Bayes optimal classifier.
∎
Using the notion of the intensity ratio, we can compare the convergence rates to the Bayes optimal classifier for any two algorithms with Bayes-risk consistent loss functions without computing the consistency intensities.
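The intensity ratio can also be estimated numerically without deriving $\beta_1$ or $\beta_2$: invert each $\psi$ by table lookup and take the ratio at small arguments. The sketch below (ours; the inversion grid is an arbitrary discretization) compares the hinge and logistic losses, where $r \to 0$ confirms that the hinge-loss minimizer converges faster.

```python
import numpy as np

def psi_hinge(theta):
    return np.asarray(theta, dtype=float)

def psi_logistic(theta):
    return ((1 + theta) / 2) * np.log(1 + theta) + ((1 - theta) / 2) * np.log(1 - theta)

def make_inverse(psi, thetas=np.linspace(0.0, 0.99, 1_000_001)):
    values = psi(thetas)                      # psi is increasing on [0, 1)
    return lambda x: thetas[np.searchsorted(values, x)]

inv_hinge = make_inverse(psi_hinge)
inv_logistic = make_inverse(psi_logistic)

for x in (1e-2, 1e-3, 1e-4):
    r = inv_hinge(x) / inv_logistic(x)
    print(f"x={x:.0e}: intensity ratio r ~ {r:.4f} (compare sqrt(x/2) = {np.sqrt(x / 2):.4f})")
```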
We finish this section by introducing a scaling-invariance property of the consistency intensity $\beta$, which is useful for comparing convergence rates: e.g., when we scale the surrogate loss function $\phi(\alpha)$ to $a\,\phi(b\,\alpha)$, we get the same $\beta$ for $\phi(\alpha)$ and $a\,\phi(b\,\alpha)$.
Theorem 2.
For any constants $a, b > 0$, the loss $a\,\phi(b\,\alpha)$ has the same consistency intensity as that of $\phi(\alpha)$, which means that the intensity of a surrogate loss function is scaling invariant in terms of $a$ and $b$.
Proof.
Notice that $\psi\left(R(f) - R^*\right) \le R_\phi(f) - R_\phi^*$. If we scale $\phi$ as $a\,\phi$, both sides of the inequality will be multiplied by $a$, i.e., $\psi_{a\phi} = a\,\psi_\phi$, so the intensity is unchanged, which holds trivially.
We now consider $\phi(b\,\alpha)$. Observe that,
$H_{\phi(b\cdot)}(\eta) = \inf_{\alpha}\left(\eta\,\phi(b\,\alpha) + (1 - \eta)\,\phi(-b\,\alpha)\right) = \inf_{\alpha'}\left(\eta\,\phi(\alpha') + (1 - \eta)\,\phi(-\alpha')\right) = H_\phi(\eta)$, (4.18)
where $\alpha' = b\,\alpha$. Then,
$\psi_{\phi(b\cdot)}(\theta) = \phi(0) - H_{\phi(b\cdot)}\left(\frac{1 + \theta}{2}\right) = \psi_\phi(\theta)$, (4.19)
which leads to the same $\beta$. ∎
Theorem 2 has many applications. For example, the exponential loss in AdaBoost is $\phi(\alpha) = e^{-\alpha}$. Then $a\,e^{-b\,\alpha}$ for all constants $a, b > 0$ must have the same consistency intensity as that of $e^{-\alpha}$, which implies that the minimizers of the corresponding empirical surrogate risks converge equally fast to the Bayes optimal classifier.
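Theorem 2 is easy to confirm numerically. The sketch below (our own illustration, reusing the log-log slope idea from Section 3.2; grid bounds and evaluation points are arbitrary) computes $\psi$ from its definition for several rescalings $a\,e^{-b\,\alpha}$ of the exponential loss; the estimated intensity stays at $\beta = 2$ throughout, while only the conductivity changes.

```python
import numpy as np

def psi_numeric(phi, theta, alphas=np.linspace(-5.0, 5.0, 200_001)):
    """psi(theta) = phi(0) - H((1 + theta)/2), with H computed on a score grid."""
    eta = (1.0 + theta) / 2.0
    H = (eta * phi(alphas) + (1.0 - eta) * phi(-alphas)).min()
    return phi(0.0) - H

def estimated_beta(phi, t1=0.05, t2=0.1):
    # Local log-log slope of psi between two small thetas.
    return np.log(psi_numeric(phi, t2) / psi_numeric(phi, t1)) / np.log(t2 / t1)

for a, b in ((1.0, 1.0), (5.0, 1.0), (1.0, 3.0), (2.0, 0.5)):
    phi = lambda alpha, a=a, b=b: a * np.exp(-b * alpha)
    print(f"a={a}, b={b}: estimated intensity ~ {estimated_beta(phi):.2f}")
```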
4.2 An Example of the Data-dependent Loss Modification Method
In the previous sections, we have provided theorems that can measure the convergence rate from the expected risk to the Bayes risk for many learning algorithms using different surrogate loss functions. In fact, the notions of consistency intensity and conductivity can achieve something beyond that. In this subsection, we propose a data-dependent loss modification method for SVM that obtains a faster convergence rate for the bound of the excess risk and thus makes the learning algorithm achieve a faster convergence rate to the Bayes optimal classifier.
We are familiar with the standard SVM that uses the hinge loss as a surrogate. Now, we modify the hinge loss with a parameter $a_n > 0$ as follows:
$\phi_{a_n}(\alpha) = \max(0, 1 - \alpha) + a_n\,\alpha^2$. (4.20)
We have that $\phi_{a_n}$ is convex and $\phi'_{a_n}(0) = -1 < 0$, which means this modified hinge loss is also Bayes-risk consistent. Then, following a similar procedure as done for the above examples, we have, for $\alpha \in [-1, 1]$,
$C_\eta(\alpha) = \eta\,\phi_{a_n}(\alpha) + (1 - \eta)\,\phi_{a_n}(-\alpha) = 1 + (1 - 2\eta)\,\alpha + a_n\,\alpha^2$. (4.21)
It is easy to verify that the minimizer is
$\alpha^* = \frac{2\eta - 1}{2\,a_n}$ whenever $|2\eta - 1| \le 2\,a_n$, (4.22)
and so
$H(\eta) = 1 - \frac{(2\eta - 1)^2}{4\,a_n}$. (4.23)
Thus,
$\psi_{a_n}(\theta) = \phi_{a_n}(0) - H\left(\frac{1 + \theta}{2}\right) = \frac{\theta^2}{4\,a_n}, \quad 0 \le \theta \le 2\,a_n$. (4.24)
Using the Maclaurin expansion, we have,
$\psi_{a_n}(\theta) = \frac{1}{4\,a_n}\,\theta^2 + o\left(\theta^2\right), \quad \theta \to 0^+$. (4.25)
Therefore, we have that the intensity is $\beta = 2$ and the conductivity is $C_n = \frac{1}{4\,a_n}$. According to Theorem 1, we know that, with a high probability, we have,
$R(\hat{f}_n) - R^* \le O\left(\left(\frac{n^{-1/2}}{C_n}\right)^{1/2}\right) = O\left(\left(4\,a_n\,n^{-1/2}\right)^{1/2}\right)$. (4.26)
From (4.26), we find that a tighter bound can be obtained when $a_n$ converges to zero fast. Here, we introduce the notion of data-dependent loss modification. That is, the modification parameter $a_n$ is dependent on the sample size $n$. For example, if $p = 1/2$ and $a_n = n^{-1}$, with a high probability, we can obtain a bound of order $O\left(n^{-3/4}\right)$. Therefore, a tighter bound is obtained with our proposed method for SVM.
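The derivation above is mechanical enough to script. The sketch below (our own illustration of the parameterization in (4.20); the grid sizes and the tested schedules for $a_n$ are arbitrary choices) recomputes $\psi_{a_n}$ numerically from the definition and tabulates the order of the bound (4.26) under the assumed $O(n^{-1/2})$ excess $\phi$-risk rate.

```python
import numpy as np

def modified_hinge(alpha, a):
    # Hinge plus a small quadratic term, as in (4.20).
    return np.maximum(0.0, 1.0 - alpha) + a * alpha**2

def psi_numeric(a, theta, alphas=np.linspace(-3.0, 3.0, 600_001)):
    eta = (1.0 + theta) / 2.0
    H = (eta * modified_hinge(alphas, a) + (1.0 - eta) * modified_hinge(-alphas, a)).min()
    return modified_hinge(0.0, a) - H

# Check psi_{a_n}(theta) = theta^2 / (4 a_n) near 0, i.e. C_n = 1 / (4 a_n).
a = 0.05
for theta in (0.01, 0.02):
    print(f"theta={theta}: psi={psi_numeric(a, theta):.6f}, theta^2/(4a)={theta**2 / (4 * a):.6f}")

# Order of the bound (4.26) for several schedules a_n, assuming p = 1/2.
for n in (10**3, 10**4, 10**5):
    for label, a_n in (("a_n = 1", 1.0), ("a_n = n^-1/2", n**-0.5), ("a_n = n^-1", 1.0 / n)):
        print(f"n={n:>6}, {label:>12}: (4 a_n n^-1/2)^(1/2) = {(4 * a_n * n**-0.5)**0.5:.4f}")
```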
5 Conclusions and Future Work
In this paper, we defined the notions of consistency intensity and conductivity for convex Bayes-risk consistent surrogate loss functions and proposed a general framework that determines the relationship between the convergence rate of the excess $\phi$-risk and the convergence rate of the excess risk. Our methods were used to compare the convergence rates to the Bayes optimal classifier for the empirical minimizers of different classification algorithms. Moreover, we used the notions of consistency intensity and conductivity to guide the modification of surrogate loss functions so as to achieve a faster convergence rate.
In this work, we require the surrogate loss function to be convex and Bayes-risk consistent, which holds true for many different surrogate loss functions. However, sometimes we may encounter non-convex, but still Bayes-risk consistent, loss functions. It would be interesting to generalize the obtained results to the non-convex situation in the future. Besides, in the future, we will apply the modified surrogate loss function to some real-world problems. Moreover, finding other approaches that guide the modification of existing surrogate losses to achieve a faster convergence rate to the Bayes optimal classifier is also well worth exploring.
References
 Agarwal and Agarwal 2015 Agarwal, A. and Agarwal, S. (2015). On consistent surrogate risk minimization and property elicitation. In Grünwald, P., Hazan, E., and Kale, S., editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 4–22, Paris, France. PMLR.
 Bartlett et al. 2005 Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
 Bartlett et al. 2006 Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
 Bartlett and Mendelson 2002 Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
 Ben-David et al. 2003 Ben-David, S., Eiron, N., and Long, P. M. (2003). On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514.
 Boucheron et al. 2013 Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
 Bousquet et al. 2004 Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169–207. Springer.
 Collins et al. 2002 Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1–3):253–285.
 Feldman et al. 2012 Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590.
 Koltchinskii et al. 2006 Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656.
 Liu et al. 2017 Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. arXiv preprint arXiv:1702.08712.
 Lugosi and Vayatis 2004 Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, pages 30–55.
 Mohri et al. 2012 Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.
 Neykov et al. 2016 Neykov, M., Liu, J. S., and Cai, T. (2016). On the characterization of a class of Fisher-consistent loss functions and its application to boosting. Journal of Machine Learning Research, 17(70):1–32.
 Sridharan et al. 2009 Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1545–1552. Curran Associates, Inc.
 Steinwart 2005 Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142.
 Tsybakov 2004 Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166.
 Vapnik 2013 Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer Science & Business Media.
 Zhang 2002 Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550.
 Zhang 2004 Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85.