Classification is a fundamental machine learning task used in a wide range of applications. When designing classification algorithms, the 0-1 loss function is preferred, as it helps produce the Bayes optimal classifier, which has the minimum probability of classification error. However, the 0-1 loss is difficult to optimize because it is neither convex nor smooth (Ben-David et al. 2003, Feldman et al. 2012). Many different computationally-friendly surrogate loss functions have therefore been proposed as approximations for the 0-1 loss function.
Two natural questions then arise: are these surrogates good approximations, and how do the surrogate loss functions differ from the 0-1 loss? To address the first question, the concept of Bayes-risk consistency has been introduced (Lugosi and Vayatis 2004, Bartlett et al. 2006). A surrogate loss function is said to be Bayes-risk consistent if its corresponding empirical minimizer converges to the Bayes optimal classifier when the predefined hypothesis class is universal. That is, with a sufficiently large sample, the minimizers of those surrogate risks are equivalent to the minimizer of the 0-1 risk in the sense that they achieve the same minimum probability of classification error. Existing results (e.g., Zhang 2004, Bartlett et al. 2006, Agarwal and Agarwal 2015, Neykov et al. 2016) show Bayes-risk consistency under different conditions and, reassuringly, most of the frequently used surrogate loss functions are Bayes-risk consistent.
Although Bayes-risk consistency describes the interchangeable relationship between surrogate loss functions and the 0-1 loss function, it is an asymptotic concept. The non-asymptotic link between a specific surrogate loss function and the 0-1 loss function has remained elusive. In this paper, we study the rates of convergence from surrogate risk minimizers to the Bayes optimal classifier. We derive upper bounds for the difference between the probabilities of classification error w.r.t. the surrogate risk minimizer and the Bayes optimal classifier. Specifically, we introduce the notions of consistency intensity and conductivity , which are uniquely determined by the surrogate loss functions. We show that for any given surrogate loss function , if the convergence rate of the excess surrogate risk is of order , where is the empirical surrogate risk minimizer, is the expected surrogate risk, and is the minimal surrogate risk achievable, the corresponding convergence rate of the expected risk is of order , where and are the expected 0-1 risk and the minimal 0-1 risk achievable, respectively. The result is able to (1) describe the non-asymptotic differences between different surrogate loss functions and (2) fairly compare the rates of convergence from different empirical surrogate risk minimizers to the Bayes optimal classifier.
We apply our theorems to popular surrogate loss functions such as the hinge loss in support vector machines (SVM), the exponential loss in AdaBoost, and the logistic loss in logistic regression (Collins et al. 2002, Vapnik 2013). We conclude that SVM converges faster to the Bayes optimal classifier than AdaBoost and logistic regression, while AdaBoost and logistic regression have the same convergence rate. Furthermore, we provide a general rule for fairly comparing the convergence rates of different classification algorithms.
We show that for a data-independent surrogate loss, both the consistency intensity and conductivity are constants, and for different surrogate loss functions, and vary in and , respectively. Since different minimizers converge to the Bayes optimal classifier at rate , they do not contribute to accelerating the convergence rate. However, if we modify the surrogate loss function according to the sample size , can vary w.r.t. . This finding enables us to devise a data-dependent loss modification method which can accelerate the convergence of a surrogate risk minimizer to the Bayes optimal classifier.
Organization. The remainder of the paper is organized as follows. In Section 2, we first introduce basic mathematical notation. Then we introduce the notions of Bayes optimality and Bayes-risk consistency, along with several lemmas and related work necessary for the proofs of our main theorem. We present our main theorem in Section 3, namely the rate of convergence from the expected risk to the Bayes risk for empirical surrogate risk minimization. In Section 4, we show several applications of our theorem. In particular, Section 4.1 details several specific examples of computing and comparing the rates of convergence to the Bayes risk and provides a general rule to compare the rates of convergence to the Bayes risk for any two algorithms with Bayes-risk consistent loss functions. Then, in Section 4.2, we apply our theorems to modify the hinge loss function so as to accelerate its convergence to the Bayes risk. In Section 5, we conclude the paper and briefly discuss future work.
We present basic notation in Section 2.1 and briefly introduce the concept of Bayes-risk consistency and the necessary lemmas in Section 2.2. In Section 2.3, we discuss related work on the statistical properties of Bayes-risk consistent surrogate loss functions, which play critical roles in the proofs of our theorems.
In this paper, we focus on binary classification, where the feature space $\mathcal{X}$ is a subset of a Hilbert space $\mathcal{H}$ and the label space is denoted by $\mathcal{Y} = \{-1, +1\}$. We assume that a pair of random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ is generated according to an unknown distribution $D$, where $P$ is the corresponding joint probability. Binary classification aims to find a map $f: \mathcal{X} \to \mathbb{R}$ within some particular predefined hypothesis class $\mathcal{F}$ such that the sign of $f(X)$ can be used as a prediction for $Y$.
To evaluate the quality of $f$, a performance measure is required. The natural choice is the 0-1 risk, defined as:
$$R(f) = \mathbb{E}_{(X,Y) \sim D}\big[\mathbb{1}(\mathrm{sign}(f(X)) \neq Y)\big] = P(\mathrm{sign}(f(X)) \neq Y),$$
where $\mathbb{1}(\cdot)$ denotes the indicator function. From the definition, we can see that minimizing the 0-1 risk is equivalent to minimizing the probability of classification error. We hope to find the function $f^*$ such that the probability of classification error is minimized, which is called the Bayes optimal classifier, defined as follows:
$$f^* = \arg\inf_{f} R(f),$$
where the infimum is over all measurable functions. The corresponding expected risk is called the Bayes risk:
$$R^* = R(f^*).$$
In this paper, we assume that the predefined hypothesis class $\mathcal{F}$ is universal, which means the Bayes optimal classifier $f^*$ is always in $\mathcal{F}$.
Since the joint distribution $D$ is unknown, we cannot calculate $R(f)$ directly. Given a training sample $\{(x_1, y_1), \dots, (x_n, y_n)\}$, the following empirical risk is widely used to approximate the expected risk $R(f)$:
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\mathrm{sign}(f(x_i)) \neq y_i).$$
Directly minimizing the above empirical risk is NP-hard due to the non-convexity of the 0-1 loss function, which forces us to adopt convex surrogate loss functions. Similarly, for any non-negative surrogate loss function $\phi: \mathbb{R} \times \mathcal{Y} \to [0, \infty)$, where $\mathbb{R}$ is the space of the output of the classifier, we can define the $\phi$-risk, optimal $\phi$-risk, empirical $\phi$-risk, and the empirical surrogate risk minimizer as:
$$R_\phi(f) = \mathbb{E}_{(X,Y) \sim D}\big[\phi(f(X), Y)\big], \qquad R_\phi^* = \inf_{f} R_\phi(f),$$
$$\hat{R}_{\phi,n}(f) = \frac{1}{n} \sum_{i=1}^{n} \phi(f(x_i), y_i), \qquad f_n = \arg\min_{f \in \mathcal{F}} \hat{R}_{\phi,n}(f).$$
Moreover, we define the excess risk and the excess $\phi$-risk, respectively, as follows:
$$R(f) - R^*, \qquad R_\phi(f) - R_\phi^*.$$
For classification tasks, the loss function is often margin-based and we can rewrite $\phi(f(X), Y)$ as $\phi(Y f(X))$, where the quantity $Y f(X)$ is known as the margin, which can be interpreted as the confidence in prediction (Mohri et al. 2012). In this paper, we study margin-based loss functions.
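As a concrete illustration of the risks defined above, the following minimal sketch computes the empirical 0-1 risk and the empirical surrogate risk of a fixed set of classifier outputs for three margin-based losses. The function names, the toy scores and labels, and the convention of counting a zero margin as an error are assumptions made for this example only.

```python
import math

# Margin-based losses: each takes the margin m = y * f(x).
def zero_one(m):
    # Assumption: a margin of exactly zero is counted as an error.
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    return max(0.0, 1.0 - m)

def exponential(m):
    return math.exp(-m)

def logistic(m):
    # Natural-log version of the logistic loss.
    return math.log1p(math.exp(-m))

def empirical_risk(loss, scores, labels):
    """Average loss over the sample, evaluated at the margins y_i * f(x_i)."""
    margins = [y * s for s, y in zip(scores, labels)]
    return sum(loss(m) for m in margins) / len(margins)

# Hypothetical classifier outputs f(x_i) and labels y_i in {-1, +1}.
scores = [2.0, -0.5, 0.3, -1.5]
labels = [1, 1, -1, -1]

for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("exponential", exponential), ("logistic", logistic)]:
    risk = empirical_risk(loss, scores, labels)
    print(f"{name:12s} empirical risk = {risk:.4f}")
```

The hinge and exponential losses dominate the 0-1 loss pointwise, so their empirical risks are never smaller than the empirical 0-1 risk on any sample.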
2.2 Optimality and Bayes-risk Consistency
We first introduce the concept of Bayes optimality. Define
$$\eta(x) = P(Y = 1 \mid X = x).$$
Lemma 1 (Bousquet et al. 2004). Assume the random pair $(X, Y)$ follows a given distribution $D$. Then any classifier $f$ that satisfies $\mathrm{sign}(f(x)) = \mathrm{sign}(\eta(x) - 1/2)$ is Bayes optimal under $D$.
Before we introduce the notion of Bayes-risk consistency for surrogate loss functions, we need to present several basic definitions (Bousquet et al. 2004). First, we define the conditional $\phi$-risk as:
$$\mathbb{E}\big[\phi(Y f(X)) \mid X = x\big] = \eta(x)\,\phi(f(x)) + (1 - \eta(x))\,\phi(-f(x)).$$
We then introduce the generic conditional $\phi$-risk by letting $\eta = \eta(x)$ and $\alpha = f(x)$:
$$C_\eta(\alpha) = \eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha).$$
The definition of the optimal generic conditional $\phi$-risk $H(\eta)$ follows immediately:
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha).$$
We also define $H^-(\eta)$ as:
$$H^-(\eta) = \inf_{\alpha:\, \alpha(2\eta - 1) \le 0} C_\eta(\alpha).$$
This definition follows the optimal generic conditional $\phi$-risk $H(\eta)$, but with the constraint that the sign of the output differs from $\mathrm{sign}(2\eta - 1)$. Under these settings, we define Bayes-risk consistency.
Definition 1 (Lugosi and Vayatis 2004). A surrogate loss function $\phi$ is Bayes-risk consistent if the minimizer of the conditional $\phi$-risk, $f_\phi^*(x)$, has the same sign as the Bayes optimal classifier $f^*(x)$ for any $x \in \mathcal{X}$. Or simply, $\mathrm{sign}(f_\phi^*(x)) = \mathrm{sign}(f^*(x))$.
We now present a necessary and sufficient condition for the Bayes-risk consistency of surrogate loss functions.
Lemma 2 (Bousquet et al. 2004). A surrogate loss $\phi$ is Bayes-risk consistent if and only if $H^-(\eta) > H(\eta)$ for any $\eta \neq 1/2$.
Lemma 2 has an intuitive explanation: a Bayes-risk consistent loss must always incur a strictly larger conditional $\phi$-risk, for any $\eta \neq 1/2$, when the sign of the output differs from that of the Bayes optimal classifier.
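The condition in Lemma 2 can be checked numerically for a specific loss. The sketch below does so for the hinge loss by a simple grid search, writing the generic conditional risk as eta * phi(alpha) + (1 - eta) * phi(-alpha), its unconstrained infimum as `H`, and its infimum over outputs whose sign disagrees with 2 * eta - 1 as `H_minus`. The function names and the search grid are hypothetical choices for illustration, not part of the original analysis.

```python
def hinge(alpha):
    return max(0.0, 1.0 - alpha)

def conditional_risk(phi, eta, alpha):
    """Generic conditional risk: eta * phi(alpha) + (1 - eta) * phi(-alpha)."""
    return eta * phi(alpha) + (1.0 - eta) * phi(-alpha)

# Hypothetical search grid for alpha; other losses may need a wider or finer grid.
GRID = [i / 100.0 for i in range(-300, 301)]

def H(phi, eta):
    """Grid approximation of the optimal generic conditional risk."""
    return min(conditional_risk(phi, eta, a) for a in GRID)

def H_minus(phi, eta):
    """Same infimum, restricted to alpha with alpha * (2 * eta - 1) <= 0."""
    return min(conditional_risk(phi, eta, a) for a in GRID
               if a * (2.0 * eta - 1.0) <= 0.0)

for eta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"eta={eta:.1f}  H={H(hinge, eta):.3f}  H_minus={H_minus(hinge, eta):.3f}")
```

For the hinge loss the constrained infimum is attained at alpha = 0, while the unconstrained one is attained at alpha = 1 or alpha = -1, so the two values separate for every eta other than 1/2.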
2.3 Asymptotic Consistency in Surrogate Risk Optimization
We briefly introduce some related work on Bayes-risk consistency and its statistical properties in classification. Lugosi and Vayatis 2004 proved that Bayes-risk consistency is satisfied for empirical surrogate loss minimization under the condition that the surrogate loss $\phi$ is strictly convex, differentiable, and monotonic with $\phi(0) = 1$. Bartlett et al. 2006 then offered a more general result, showing that Bayes-risk consistency is possible if and only if the loss function is Bayes-risk consistent (in their paper, the term "classification-calibrated" is used instead of "Bayes-risk consistent"). Other results on Bayes-risk consistency under different assumptions have been presented by Zhang 2004, Steinwart 2005, and Neykov et al. 2016.
Below, we give the main result proved in Bartlett et al. 2006, which shows that minimizing any surrogate risk with a Bayes-risk consistent loss function is asymptotically equivalent to minimizing the 0-1 risk and thus leads to the Bayes optimal classifier.
Theorem 1 (Bartlett et al. 2006). A convex loss function $\phi$ is Bayes-risk consistent if and only if it is differentiable at $0$ and $\phi'(0) < 0$. For such a convex and Bayes-risk consistent loss function $\phi$, any measurable function $f$, and any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, the following inequality holds,
$$\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*,$$
where $\psi$ is nonnegative, convex and invertible on $[0, 1]$ and has only one zero at $\theta = 0$.
The above theorem gives an upper bound on the excess risk in terms of the excess $\phi$-risk and shows that minimizing any convex Bayes-risk consistent surrogate loss is asymptotically equivalent to minimizing the 0-1 loss, because the function $\psi$ is invertible and has only a single zero at $\theta = 0$. We provide deeper insight into this theorem in the next section.
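For reference, Bartlett et al. 2006 give an explicit form for the function appearing in this bound when the loss is convex and Bayes-risk consistent. The display below restates it, writing $H(\eta)$ for the optimal generic conditional risk; the symbol names are a notational assumption of this note, and the hinge-loss line is included only as a worked instance.

```latex
\[
  \psi(\theta) \;=\; \phi(0) \;-\; H\!\left(\frac{1+\theta}{2}\right),
  \qquad \theta \in [0,1],
  \qquad\text{where}\quad
  H(\eta) \;=\; \inf_{\alpha \in \mathbb{R}}
    \bigl\{ \eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha) \bigr\}.
\]
% Worked instance (hinge loss, \phi(\alpha) = \max(0, 1-\alpha)):
\[
  H(\eta) = 1 - |2\eta - 1|, \qquad \phi(0) = 1
  \quad\Longrightarrow\quad
  \psi(\theta) = 1 - (1 - \theta) = \theta .
\]
```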
3 The Rates of Convergence from Empirical Surrogate Risk Minimizers to the Bayes Optimal Classifier
We know that optimizing any empirical surrogate risk with a Bayes-risk consistent loss function will lead to the Bayes optimal classifier when the training sample size is large enough. However, natural questions arise when we optimize different empirical surrogate risks with Bayes-risk consistent loss functions: "What are the differences between them? Do the minimizers have the same rate of convergence to the Bayes optimal classifier?" These questions are essential because, when we choose a classification algorithm for a real-world problem, we would like fast convergence to the Bayes optimal classifier, which also implies a small sample complexity.
To answer the above questions, we need a proper metric to measure the distance between $f_n$ and $f^*$. Since we care most about the probability of classification error of the learning algorithm, it is reasonable to measure the rate of convergence from the expected risk $R(f_n)$ to the Bayes risk $R^*$ instead of measuring the rate of convergence from $f_n$ to $f^*$ directly.
Before showing the results of convergence rates, we present the intuition of our work.
3.1 Intuition of the Proposed Method
For any empirical surrogate risk minimizer $f_n$, we can rewrite the inequality in Theorem 1 as follows:
$$\psi(R(f_n) - R^*) \le R_\phi(f_n) - R_\phi^* = \Big(R_\phi(f_n) - \inf_{f \in \mathcal{F}} R_\phi(f)\Big) + \Big(\inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^*\Big). \tag{3.1}$$
To achieve our goal of measuring the rate of convergence from the empirical surrogate risk minimizer to the Bayes optimal classifier, we need to bound the term $R(f_n) - R^*$. As $\psi$ is invertible, we can instead bound the terms on the right-hand side of the inequality. We call the first term on the right-hand side the estimation error, which depends on the learning algorithm and the training data. The second term is called the approximation error, which depends on the choice of the hypothesis class $\mathcal{F}$. Often, the hypothesis class is predefined and universal, e.g., for deep learning algorithms. Thus, in this paper, we simply assume that the Bayes optimal classifier lies in the hypothesis class. In other words, we have $\inf_{f \in \mathcal{F}} R_\phi(f) = R_\phi^*$ and (3.1) becomes
$$\psi(R(f_n) - R^*) \le R_\phi(f_n) - \inf_{f \in \mathcal{F}} R_\phi(f).$$
The right side of the above inequality can be further upper bounded by
$$R_\phi(f_n) - \inf_{f \in \mathcal{F}} R_\phi(f) \le 2 \sup_{f \in \mathcal{F}} \big| R_\phi(f) - \hat{R}_{\phi,n}(f) \big|,$$
where $\hat{R}_{\phi,n}(f)$ is the empirical $\phi$-risk on a sample of size $n$ and the defect $\sup_{f \in \mathcal{F}} | R_\phi(f) - \hat{R}_{\phi,n}(f) |$ on the right-hand side is called the generalization error. Using the concentration of measure (Boucheron et al. 2013) and uniform convergence arguments, e.g., VC-dimension (Vapnik 2013), covering number (Zhang 2002), and Rademacher complexity (Bartlett and Mendelson 2002), the generalization error can be non-asymptotically upper bounded with a high probability. Often, the upper bounds can reach the order $\mathcal{O}(1/\sqrt{n})$ (Mohri et al. 2012). We also note that the excess $\phi$-risk can achieve convergence rates faster than $\mathcal{O}(1/\sqrt{n})$, for example by exploiting local Rademacher complexities, low noise models, and strong convexity (Tsybakov 2004, Bartlett et al. 2005, Koltchinskii et al. 2006, Sridharan et al. 2009, Liu et al. 2017). Here we mainly consider the ordinary case where the convergence rate of the excess $\phi$-risk is of order $\mathcal{O}(1/\sqrt{n})$, but our results can directly generalize to other cases.
Theorem 1 also implies that if $\phi$ is convex and Bayes-risk consistent, then for any sequence of measurable functions $f_i$ and any distribution over $\mathcal{X} \times \mathcal{Y}$,
$$R_\phi(f_i) \to R_\phi^* \quad \text{implies} \quad R(f_i) \to R^*.$$
This presents the dynamics of Bayes-risk consistency for any convex Bayes-risk consistent loss in an asymptotic way.
Observe that in equation (2.16), the asymptotic consistency property is mainly due to the uniqueness of the zero of the function $\psi$, whose only zero is at $\theta = 0$. Thus, when $R_\phi(f_i) - R_\phi^* \to 0$, we have $\psi(R(f_i) - R^*) \to 0$, which leads to $R(f_i) - R^* \to 0$. However, if we want to derive the rate of convergence from the expected risk $R(f_n)$ to the Bayes risk $R^*$, we may need more detailed or higher order properties of the function $\psi$ in the infinitesimal right neighborhood of $\theta = 0$, rather than just the value of $\psi$ at $\theta = 0$.
In the next subsection, we will exploit the upper bound of $R_\phi(f_n) - R_\phi^*$ and a higher order property of the function $\psi$ to derive upper bounds for $R(f_n) - R^*$.
3.2 Consistency Intensity for Bayes-risk Consistent Loss Functions
Knowing the convergence rate of the excess $\phi$-risk $R_\phi(f_n) - R_\phi^*$ and the relation $\psi(R(f_n) - R^*) \le R_\phi(f_n) - R_\phi^*$, it is natural to take the inverse of the function $\psi$ on both sides, yielding the upper bound $R(f_n) - R^* \le \psi^{-1}(R_\phi(f_n) - R_\phi^*)$. In practice, however, for most Bayes-risk consistent loss functions, the corresponding $\psi$ is difficult to invert analytically. Furthermore, the term $\psi^{-1}(R_\phi(f_n) - R_\phi^*)$ may not reflect the order of $n$ explicitly. Thus, we must identify the factor that determines the convergence rate of the excess risk $R(f_n) - R^*$, namely some higher order property of the function $\psi$ within the infinitesimal right neighborhood of $\theta = 0$. Before we move on to our main theorems, we first introduce some basic lemmas and propositions.
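To see why the local behaviour of this function near zero matters, consider the following worked inversion. It assumes, purely for illustration, that the function relating the two excess risks satisfies a power-type lower bound near zero and that the excess surrogate risk is of order $1/\sqrt{n}$; the constants $c$, $k$, $\theta_0$ and $C$ are hypothetical and are not the intensity and conductivity defined in this paper.

```latex
% Assumption: \psi(\theta) \ge c\,\theta^{k} for all \theta \in [0, \theta_0],
% with c > 0 and k \ge 1, and R_\phi(f_n) - R_\phi^* \le C / \sqrt{n}.
\begin{align*}
  c\,\bigl(R(f_n) - R^*\bigr)^{k}
    &\le \psi\bigl(R(f_n) - R^*\bigr)
     \le R_\phi(f_n) - R_\phi^*
     \le \frac{C}{\sqrt{n}}, \\[4pt]
  R(f_n) - R^*
    &\le \Bigl(\frac{C}{c}\Bigr)^{1/k} \, n^{-1/(2k)} .
\end{align*}
% k = 1 (linear behaviour near zero) gives the order n^{-1/2};
% k = 2 (quadratic behaviour near zero) gives the slower order n^{-1/4}.
```

The first inequality requires $R(f_n) - R^* \le \theta_0$, which holds once $C/\sqrt{n} < \psi(\theta_0)$ because a convex, nonnegative function with a single zero at the origin is increasing on $[0, 1]$.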
For any two functions and , which are differentiable at and satisfy , then, the following conditions are equivalent:
Proposition 1 can be proved directly by following the definition of the and notation. We now introduce the notions of consistency intensity and conductivity for convex Bayes-risk consistent loss functions.
For any given convex Bayes-risk consistent loss function , let . There exists two unique constant and such that
We call the consistency intensity of this Bayes-risk consistent loss function and the conductivity of the intensity.
From Theorem 1, we know that is a convex function. It’s also known that any convex function on a convex open subset of is semi-differentiable. Thus, we can denote the right derivative of at by . Using Maclaurin expansion, we have,
Followed by Proposition 1, if , we have,
If , then , which means that is the higher order infinitesimal of as . Then, by definition, for any given , we can compute . Because is the higher order infinitesimal of as , there exist unique and such that,
which completes the proof. ∎
Given any convex Bayes-risk consistent loss function , let the inverse of its corresponding -transform be denoted by . We have
Lemma 2 shows that the equivalent infinitesimal of near is . Then, we introduce an important property for the -transform.
Given any convex Bayes-risk consistent loss function , its corresponding function is interchangeable with for any . That is,
From Proposition 1, we have that there exists such that,
This means that for any , there exists such that (3.16) holds true, which completes the proof. ∎
Following the definitions of and in Lemma 1, we have our first main theorem.
Suppose the excess -risk satisfies with a high probability. Then, with the same high probability, we have
From Theorem 1, we have that
where is the minimizer of the empirical surrogate risk with sample size . Under our assumption that the Bayes optimal classifier is within the hypothesis class , then, with a high probability, we have (Bousquet et al. 2004),
which completes the proof. ∎
From Theorem 1, we show that the consistency intensity and conductivity have direct influence on the convergence rate, which is of order . When , the convergence rate from to will be slower than the convergence rate from to ; if , the algorithm will never reach the Bayes optimal classifier; if , the convergence rates will be the same. In the next theorem, we will show that for any convex Bayes-risk consistent loss function, the range of is .
For any convex Bayes-risk consistent loss function , it always holds true that .
We have that , so holds trivially. To prove , it is equivalent to proving that there exists such that,
Since any convex function on a convex open subset in is semi-differentiable, is at least right differentiable at . We therefore have,
where denotes the right derivative of at . The proof ends. ∎
Theorem 2 shows that for data-independent surrogate loss functions, because , the convergence rate from to will not be faster than the convergence rate from to . As , it also means that optimizing over any convex Bayes-risk consistent surrogate loss will finally make the excess risk converge to and thus the output is the Bayes optimal classifier as sample size tends to infinity. This result matches our common sense: while we benefit from the computational efficiency of convex surrogate loss functions, we also suffer from a slower rate of convergence to the Bayes optimal classifier.
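The link between a slower convergence rate and a larger sample complexity can be made concrete with a small calculation. The sketch below assumes, for illustration only, that the function relating the two excess risks behaves like c * theta**k near zero and that the excess surrogate risk is bounded by C / n**p; the function names and all constants are hypothetical and do not come from the paper.

```python
import math

def excess_risk_bound(n, k, c=1.0, C=1.0, p=0.5):
    """Bound on the excess risk obtained by inverting c * theta**k <= C / n**p."""
    return (C / (c * n ** p)) ** (1.0 / k)

def samples_needed(eps, k, c=1.0, C=1.0, p=0.5):
    """Smallest n (up to rounding) for which the bound above is at most eps."""
    return math.ceil((C / (c * eps ** k)) ** (1.0 / p))

eps = 0.1
for k in (1, 2):
    n = samples_needed(eps, k)
    print(f"k={k}: n >= {n}, bound at that n = {excess_risk_bound(n, k):.4f}")
```

Under these assumptions, a loss whose transform is quadratic near zero (k = 2) needs roughly the square of the sample size required by one whose transform is linear (k = 1) to guarantee the same excess risk.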
In this section, we present several applications of our results. In Section 4.1, we use the notion of consistency intensity to measure the rates of convergence from the empirical surrogate risk minimizers to the Bayes optimal classifier for different classification algorithms, such as support vector machines, boosting, and logistic regression. We also derive a general discriminant rule for comparing the convergence rates of different learning algorithms. In Section 4.2, we show that the notions of consistency intensity and conductivity can help to modify surrogate loss functions so as to achieve a faster convergence rate from $R(f_n)$ to $R^*$.
4.1 Consistency Measurement
In this subsection, we first apply our results to some popular classification algorithms.
Example 1 (Hinge loss in SVM). Here we have , which is convex and . Note that we have defined . Let
It is easy to verify that
So we have,
Followed by theorem 1, we have and . Thus for SVM, with a high probability, we have,
Example 2 (Exponential loss in Adaboost). We have , which is convex and . Then, it’s easy to derive that
Using Maclaurin expansion, we have,
Thus, and . For Adaboost, with a high probability, we have
Example 3 (Logistic loss in Logistic Regression). We have , which is convex and , we can follow similar procedures as before,
Using Maclaurin expansion, we have,
Therefore and . For Logistic Regression, with a high probability, we have
From the above examples, we conclude that SVM has a faster convergence rate to the Bayes optimal classifier than Adaboost and Logistic Regression. Note that the proposed methods also apply to many other surrogate loss functions. However, if we only need to compare the convergence rates of any two classification algorithms, we may not need to compute the consistency intensity of the two surrogate loss functions explicitly. The following theorem gives a general discriminant rule.
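Before turning to that rule, the three examples above can be checked numerically. The sketch below uses the closed-form transforms for the hinge, exponential, and (natural-log) logistic losses, which follow from $\psi(\theta) = \phi(0) - H((1+\theta)/2)$ as given in Bartlett et al. 2006, and estimates the local power of $\theta$ that each transform follows near zero. The function names and the step size are assumptions of this example; how this local exponent maps onto the consistency intensity is not asserted here.

```python
import math

def psi_hinge(t):
    return abs(t)

def psi_exponential(t):
    return 1.0 - math.sqrt(1.0 - t * t)

def psi_logistic(t):
    # Natural-log logistic loss: ln(2) minus the binary entropy at (1 + t) / 2.
    return 0.5 * ((1.0 + t) * math.log1p(t) + (1.0 - t) * math.log1p(-t))

def local_exponent(psi, t=1e-3, factor=2.0):
    """Slope of log(psi) against log(theta) between t and factor * t."""
    return math.log(psi(factor * t) / psi(t)) / math.log(factor)

def leading_constant(psi, k, t=1e-3):
    """Estimate of c in psi(theta) ~ c * theta**k near zero."""
    return psi(t) / t ** k

for name, psi in [("hinge", psi_hinge),
                  ("exponential", psi_exponential),
                  ("logistic", psi_logistic)]:
    k = local_exponent(psi)
    c = leading_constant(psi, round(k))
    print(f"{name:12s} local exponent ~ {k:.4f}, leading constant ~ {c:.4f}")
```

The hinge transform is linear near zero while the exponential and logistic transforms are both quadratic with the same leading constant, which agrees with the conclusion that SVM converges faster and that AdaBoost and logistic regression converge at the same rate.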
Given two convex Bayes-risk consistent loss functions and , denoting their corresponding -transform by and , we assume that their excess -risk have the same convergence rate of order . Then, we define the intensity ratio as follows,
We have the following statements:
if , then the minimizers w.r.t. and converge equally fast to the Bayes optimal classifier;
if , then the minimizer w.r.t. converges faster to the Bayes optimal classifier;
if , then the minimizer w.r.t. converges faster to the Bayes optimal classifier.
Following Proposition 1 and Lemma 3, we have,
Then, we get,
Thus, we can conclude:
for , then we have and . Therefore, the minimizers w.r.t. and converge equally fast to the Bayes optimal classifier;
for , we have and . Thus the minimizer w.r.t. converges faster to the Bayes optimal classifier;
for , then we have and , which means that the minimizer w.r.t. converges faster to the Bayes optimal classifier.
Using the notion of intensity ratio, we can compare the convergence rate to the Bayes optimal classifier for any two algorithms with Bayes-risk consistent loss functions without computing the consistency intensity.
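A ratio-based comparison of this kind can be explored numerically. The sketch below evaluates the plain ratio of two transforms at shrinking values of $\theta$; this plain ratio is an assumption of the example and may differ from the exact intensity ratio defined in the theorem. The closed forms are the standard ones for the hinge, exponential, and natural-log logistic losses, and the function names are hypothetical.

```python
import math

def psi_hinge(t):
    return abs(t)

def psi_exponential(t):
    return 1.0 - math.sqrt(1.0 - t * t)

def psi_logistic(t):
    return 0.5 * ((1.0 + t) * math.log1p(t) + (1.0 - t) * math.log1p(-t))

def psi_ratio(psi_a, psi_b, theta):
    """Ratio psi_a(theta) / psi_b(theta) at a small positive theta."""
    return psi_a(theta) / psi_b(theta)

for theta in (1e-1, 1e-2, 1e-3):
    r_he = psi_ratio(psi_hinge, psi_exponential, theta)
    r_el = psi_ratio(psi_exponential, psi_logistic, theta)
    print(f"theta={theta:g}  hinge/exp={r_he:.3f}  exp/logistic={r_el:.6f}")
```

Since $\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*$, a transform that is larger near zero yields a tighter bound on the excess risk for the same excess $\phi$-risk. The hinge-to-exponential ratio grows without bound as $\theta$ shrinks, while the exponential-to-logistic ratio approaches one.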
We finish this section by introducing a scaling invariant property of the consistency intensity , which is useful for comparing the convergence rate, e.g., when we scale the surrogate loss function by , we get the same for and .
For any constants , the loss have the same consistency intensity as that of , which means that the intensity of surrogate loss function is scaling invariant in terms of .
Notice that . If we scale as , the both sides of the inequality will be multiplied by , which holds trivially.
We now consider . Observe that,
where . Then,
which leads to the same . ∎
Theorem 2 has many applications. For example, the exponential loss in Adaboost is . Then for all constants must have the same consistency intensity as that of , which implies the minimizers of the corresponding empirical surrogate risks converge equally fast to the Bayes optimal classifier.
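The scaling argument can be checked directly from the definitions. The derivation below assumes that the statement concerns losses of the form $a\,\phi(b\alpha)$ with constants $a, b > 0$ (this reading of the constants is an assumption), and uses the form $\psi(\theta) = \phi(0) - H((1+\theta)/2)$ from Bartlett et al. 2006 for convex Bayes-risk consistent losses, with $H$ the optimal generic conditional risk.

```latex
% Assumed scaled loss: \tilde\phi(\alpha) = a\,\phi(b\alpha), with a, b > 0.
% It remains convex, and \tilde\phi'(0) = a\,b\,\phi'(0) < 0.
\begin{align*}
  \tilde H(\eta)
    &= \inf_{\alpha}\bigl\{\eta\,a\,\phi(b\alpha) + (1-\eta)\,a\,\phi(-b\alpha)\bigr\}
     = a \inf_{\alpha'}\bigl\{\eta\,\phi(\alpha') + (1-\eta)\,\phi(-\alpha')\bigr\}
     = a\,H(\eta), \\[4pt]
  \tilde\psi(\theta)
    &= \tilde\phi(0) - \tilde H\!\Bigl(\tfrac{1+\theta}{2}\Bigr)
     = a\,\phi(0) - a\,H\!\Bigl(\tfrac{1+\theta}{2}\Bigr)
     = a\,\psi(\theta).
\end{align*}
```

The substitution $\alpha' = b\alpha$ removes $b$ entirely, and the factor $a$ only rescales the transform, so the power of $\theta$ that it follows near zero is unchanged.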
4.2 An Example of the Data-dependent Loss Modification Method
In the previous sections, we have provided theorems that can measure the convergence rate from the expected risk to the Bayes risk for many learning algorithms using different surrogate loss functions. In fact, the notions of consistency intensity and conductivity can achieve something beyond that. In this subsection, we propose a data-dependent loss modification method for SVM that obtains a faster convergence rate for the bound of the excess risk and thus makes the learning algorithm achieve a faster convergence rate to the Bayes optimal classifier.
We are familiar with the standard SVM that uses the hinge loss as a surrogate. Now, we modify the hinge loss as follows:
We have , which means this modified hinge loss is also Bayes-risk consistent. Then, following the similar procedure as done for the above examples, we have,
It’s easy to verify that
Using Maclaurin expansion, we have,
Therefore, we have that intensity and conductivity . According to Theorem 1, we know that, with a high probability, we have,
From (4.26), we find that a tighter bound can be obtained when converges to zero fast. Here, we introduce the notion of data-dependent loss modification. That is, the modification parameter is dependent on the sample size . For example, if and , with a high probability, we can obtain a bound of order . Therefore, a tighter bound is obtained with our proposed method for SVM.
5 Conclusions and Future Work
In this paper, we defined the notions of consistency intensity and conductivity for convex Bayes-risk consistent surrogate loss functions and proposed a general framework that determines the relationship between the convergence rate of the excess risk and the convergence rate of the excess $\phi$-risk. Our methods were used to compare the convergence rates to the Bayes optimal classifier for empirical minimizers of different classification algorithms. Moreover, we used the notions of consistency intensity and conductivity to guide the modification of surrogate loss functions so as to achieve a faster convergence rate.
In this work, we require the surrogate loss function to be convex and Bayes-risk consistent, which holds true for many different surrogate loss functions. However, we may sometimes encounter loss functions that are non-convex but still Bayes-risk consistent. It would be interesting to generalize the obtained results to the non-convex setting. In addition, we plan to apply the modified surrogate loss function to real-world problems. Finding other approaches that guide the modification of existing surrogate losses to achieve a faster convergence rate to the Bayes optimal classifier is also worth exploring.
- Agarwal and Agarwal 2015 Agarwal, A. and Agarwal, S. (2015). On consistent surrogate risk minimization and property elicitation. In Grünwald, P., Hazan, E., and Kale, S., editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 4–22, Paris, France. PMLR.
- Bartlett et al. 2005 Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
- Bartlett et al. 2006 Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
- Bartlett and Mendelson 2002 Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
- Ben-David et al. 2003 Ben-David, S., Eiron, N., and Long, P. M. (2003). On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514.
- Boucheron et al. 2013 Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
- Bousquet et al. 2004 Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer.
- Collins et al. 2002 Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285.
- Feldman et al. 2012 Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590.
- Koltchinskii et al. 2006 Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656.
- Liu et al. 2017 Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. arXiv preprint arXiv:1702.08712.
- Lugosi and Vayatis 2004 Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, pages 30–55.
- Mohri et al. 2012 Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of machine learning. MIT Press.
- Neykov et al. 2016 Neykov, M., Liu, J. S., and Cai, T. (2016). On the characterization of a class of Fisher-consistent loss functions and its application to boosting. Journal of Machine Learning Research, 17(70):1–32.
- Sridharan et al. 2009 Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1545–1552. Curran Associates, Inc.
- Steinwart 2005 Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142.
- Tsybakov 2004 Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166.
- Vapnik 2013 Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & Business Media.
- Zhang 2002 Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550.
- Zhang 2004 Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85.