# Consistency and Finite Sample Behavior of Binary Class Probability Estimation

In this work we investigate to which extent one can recover class probabilities within the empirical risk minimization (ERM) paradigm. The main aim of our paper is to extend existing results and emphasize the tight relations between empirical risk minimization and class probability estimation. Based on existing literature on excess risk bounds and proper scoring rules, we derive a class probability estimator based on empirical risk minimization. We then derive fairly general conditions under which this estimator will converge, in the L1-norm and in probability, to the true class probabilities. Our main contribution is to present a way to derive finite sample L1-convergence rates of this estimator for different surrogate loss functions. We also study in detail which commonly used loss functions are suitable for this estimation problem and finally discuss the setting of model-misspecification as well as a possible extension to asymmetric loss functions.


## 1 Introduction

In binary classification problems we try to predict a label y ∈ {−1, 1} based on an input feature vector x ∈ X. Since optimizing the classification accuracy directly is often computationally too complex, one typically measures performance through a surrogate loss function. Such methods are designed to achieve good classification performance, but often we are also interested in the classifier's confidence or in a class probability estimate as such. We may, for instance, not only want to classify a tumor as benign or malignant, but also want an estimate of the probability that the predicted label is wrong. Various methods in active and semi-supervised learning also rely on such class probability estimates. In active learning they are, for instance, used in uncertainty-based sampling rules Lewis and Catlett (1994); Roy and McCallum (2001), while in semi-supervised learning they can be used for entropy regularization Grandvalet and Bengio (2005).

In this paper we derive necessary and sufficient conditions under which classifiers, obtained through the minimization of an empirical loss function, allow us to estimate the class probability in a consistent way. More precisely, we present a general way to derive finite sample bounds based on those conditions. While class probability estimates, as argued before, find a broad audience, the tools necessary to understand their behavior, especially the literature on proper scoring rules, are not that broadly known. So next to our contribution on finite sample behavior for class probability estimation we present a condensed introduction to this, in our opinion, underappreciated field.

A proper scoring rule is essentially a loss function that can measure the class probability point-wise. We investigate under which circumstances such loss functions make use of this potential and lift this point-wise property to the complete space. Next to proper scoring rules we use excess risk bounds to arrive at our results. Excess risk bounds are essentially inequalities that quantify how far the risk of an empirical risk minimizer is from the optimal risk.

Combining those two areas, our main contributions are the following. Based on the existing literature, we define in Section 4, Equation (8), a probability estimate derived from an empirical risk minimizer. Based on this we analyze in Section 5 to which extent commonly used loss functions are suitable for the task of class probability estimation. Following this analysis, we argue in Section 6.5 that the squared loss is, in view of this paper, not a particularly good choice. In Section 6 we derive conditions that ensure that the estimator converges in probability towards the true posterior. In the same section we present a general way to analyze the finite sample behavior of the convergence rate for different loss functions. The idea is to bound the L1-distance between the estimated and the true class probability by the excess risk, and then use bounds on the excess risk together with the properties of proper scoring rules to show convergence. In the same section we discuss the behavior of the estimator when the model is misspecified. In this case one cannot in general recover the true class probabilities, but instead finds the best approximation with respect to a Bregman divergence. In Section 7 we conclude and discuss our analysis. In particular we discuss how one can extend this work to asymmetric loss functions and analyze their convergence behavior per class label. The following two sections start with related work and some preliminaries.

## 2 Related Work

The starting point of our analysis follows closely the notation and concepts as described by Buja et al. (2005) and Reid and Williamson (2010, 2011). While Buja et al. (2005) and Reid and Williamson (2010) deal with the inherent structure of proper scoring rules, Reid and Williamson (2011) make connections between the expected loss in prediction problems and divergence measures of two distributions. In contrast to that we investigate under which circumstances proper scoring rules can make use of their full potential in order to estimate class probabilities.

Telgarsky et al. (2015) perform an analysis similar to ours, as they also investigate convergence properties of a class probability estimator; their starting point and aims are quite different, though. While we start from the theory of proper scoring rules, their paper directly starts with the class probability estimator found in Zhang (2004). The problem is that the estimator in Zhang (2004) only appears as a side remark, and it is unclear to which extent it is the best, the only, or even a correct choice. Their paper contributes to closing this gap: they show that the estimator converges to a unique class probability model. In relation to this, one can view our paper as an investigation of this unique class probability model, for which we give necessary and sufficient conditions that lead to convergence to the true class probabilities. Note also that their paper uses convex analysis, while our work draws from the theory of proper scoring rules.

Agarwal and Agarwal (2015) look at the problem in a more general fashion. They connect different surrogate loss functions to certain statistics of the class probability distribution, e.g. the mean, while we focus on the estimation of the full class probability distribution. This allows us to come to more specific results, such as finite sample behavior.

The probability estimator we use also appears in Agarwal (2014), where it is used to derive excess risk bounds, referred to there as surrogate risk bounds, for bipartite ranking. The methods used are very similar in the sense that they are also based on proper scoring rules. The difference is again the focus, and even more so the conditions used. They introduce the notion of strongly proper scoring rules, which directly allows one to bound the L2-norm, and thus the L1-norm, of the estimation error in terms of the excess risk. We show that convergence can be achieved already under milder conditions. We then use the concept of a modulus of continuity, of which strongly proper scoring rules are a particular case, to analyze the rate of convergence.

## 3 Preliminaries

We work in the classical statistical learning setup for binary classification. We assume that we observe a finite i.i.d. sample drawn from a distribution P on X × Y. Here X denotes a feature space and Y = {−1, 1} denotes a binary response variable. We then decide upon a hypothesis class F such that every f ∈ F is a map f : X → V for some space V. Given the space V we call any function l : Y × V → [0, ∞) a loss function. The interpretation of the loss function is that we incur the penalty l(y, v) when we predicted a value v while we actually observed the label y. Our goal is then to find a predictor fₙ based on the finite sample such that E[l(Y, fₙ(X))] is small, where (X, Y) is a random variable distributed according to P. In other words, we want to find an estimator that approximates the true risk minimizer f₀ well in terms of the expected loss, where

 f₀ := argmin_{f ∈ F} E[l(Y, f(X))]. (1)

The estimator is often chosen to be the empirical risk minimizer

 fₙ := argmin_{f ∈ F} ∑_{i=1}^{n} l(yᵢ, f(xᵢ)).

As we show in this paper, finding such an fₙ implicitly means, in many settings, finding a good estimate of the posterior. Since we regularly deal with the posterior and related quantities we introduce the following notation. To start with, we define η(x) := P(Y = 1 | X = x). Depending on the context we drop the feature x and think of η as a scalar. Accepting the small risk of overloading the notation we sometimes also think of η as a Bernoulli distribution with outcomes in {−1, 1} and parameter η, as in the following definition. We define the point-wise conditional risk as

 L(η, v) := E_{Y∼η}[l(Y, v)] = η·l(1, v) + (1−η)·l(−1, v), (2)

the optimal point-wise conditional risk as

 L*(η) := min_{v ∈ V} L(η, v), (3)

and we denote by v*(η) the set of values that optimize the point-wise conditional risk

 v*(η) := argmin_{v ∈ V} L(η, v). (4)

Finally we define the conditional excess risk as

 ΔL(η, v) := L(η, v) − L*(η). (5)

### 3.1 Proper Scoring Rules

If we choose V = [0, 1], we say that l is a CPE loss, where CPE stands for class probability estimation. The name stems from the fact that the prediction is then already normalized to a value that can be interpreted as a probability. If l is a CPE loss we call it a proper scoring rule or proper loss if η ∈ v*(η), and we call it a strictly proper scoring rule or strictly proper loss if v*(η) = {η}. In other words, l is a proper scoring rule if η is a minimizer of L(η, ·), and this is strict if η is the only minimizer. In case l is strictly proper we drop the set notation of v*(η), so that v*(η) = η.

As we will see later, strictly proper CPE losses are well suited for class probability estimation. In general, however, we cannot expect that V = [0, 1], but we may still want to use the corresponding loss function for class probability estimation. To do that we will use the concept of link functions Buja et al. (2005); Reid and Williamson (2010). A link function is a map ψ : [0, 1] → V, so a function that links values that can be interpreted as probabilities to the prediction space V. Combining such a link function with a loss l one can define a CPE loss as follows.

 lψ : {−1, 1} × [0, 1] → [0, ∞),   lψ(y, q) := l(y, ψ(q))

We call the combination of a loss l and a link function ψ a (strictly) proper composite loss if lψ is (strictly) proper as a CPE loss.

To distinguish between the losses l and lψ we subscript the quantities (2)-(5) with ψ when we talk about lψ instead of l. For example we define Lψ(η, q) for lψ, and in the same way we define L*ψ, v*ψ(η) and ΔLψ. Note that if lψ is a strictly proper composite loss, we know that the sets v*ψ(η) are single element sets, but the same does not need to hold for v*(η).

To ask a composite loss to be proper is not a strong requirement; one can check that choosing ψ as a constant function already fulfills this. This is because a composite loss is proper iff the true posterior η is a minimizer of the conditional risk Lψ(η, ·), i.e. η ∈ v*ψ(η). If ψ is constant, then so is the conditional risk, and then every value is a minimizer, so in particular η is a minimizer. We want to avoid this degenerate behavior for the task of probability estimation and will ask ψ to cover enough of V in the following sense. We call a composite loss non-degenerate if for all η ∈ [0, 1] we have that v*(η) ∩ ψ([0, 1]) ≠ ∅, where ψ([0, 1]) is the image of ψ on [0, 1]. This does not directly exclude constant link functions, for example, but consider the following. If ψ is constant and lψ is non-degenerate, then there is a single v ∈ V such that v ∈ v*(η) for all η. Thus v would always minimize the loss, and we would, irrespective of the input, always predict v. This is of course a property that no reasonable loss function should carry.
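As a concrete illustration of these definitions (our own sketch, not an example from the paper), the following checks numerically that the logistic margin loss l(y, v) = log(1 + e^{−yv}) together with the logit link ψ(q) = log(q/(1−q)) forms a proper composite loss: for each η, the conditional risk Lψ(η, ·) is minimized at q = η.

```python
import numpy as np

# Logistic margin loss l(y, v) = log(1 + exp(-y*v)) on V = R.
def l(y, v):
    return np.log1p(np.exp(-y * v))

def psi(q):                      # logit link: [0, 1] -> V = R
    return np.log(q / (1 - q))

def L_psi(eta, q):               # conditional risk of the composite loss
    v = psi(q)
    return eta * l(1, v) + (1 - eta) * l(-1, v)

qs = np.linspace(0.001, 0.999, 999)
for eta in [0.1, 0.3, 0.5, 0.9]:
    q_star = qs[np.argmin(L_psi(eta, qs))]
    # the minimizer of the conditional risk is the true posterior eta
    assert abs(q_star - eta) < 2e-3
```

The same check fails for the hinge loss: no choice of ψ makes its conditional risk minimizer depend injectively on η, which is exactly the defect discussed in Section 5.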

## 4 Behavior of Proper Composite Losses

For our convergence results we will need a loss function to be a strictly proper CPE loss. In this section we investigate how to characterize those loss functions.

We start by investigating proper CPE loss functions. Our first lemma states that the link function that turns a loss into a proper composite loss is already determined by the behavior of v*.

###### Lemma 1.

Let l be a loss function and ψ be a link function. The composite loss function lψ is then proper and non-degenerate if and only if ψ ∈ v*, meaning that ψ(η) ∈ v*(η) for all η ∈ [0, 1].

###### Proof.

First we show that if lψ is proper and non-degenerate, then ψ ∈ v*. Let lψ be a proper composite loss, so η ∈ v*ψ(η), i.e. η minimizes Lψ(η, ·). As lψ is non-degenerate there exists at least one q ∈ [0, 1] such that ψ(q) ∈ v*(η). If ψ(η) ∉ v*(η) we would find that η can not be a minimizer of Lψ(η, ·), as then Lψ(η, q) = L(η, ψ(q)) < L(η, ψ(η)) = Lψ(η, η).

Now we show that lψ is a proper non-degenerate composite loss if ψ ∈ v*. By definition, lψ is proper if η ∈ v*ψ(η). This is the case if and only if ψ(η) minimizes L(η, ·) over the image of ψ. But this is the case if ψ(η) ∈ v*(η), since v*(η) is defined as the set of minimizers of L(η, ·) over all of V. The non-degeneracy follows directly by definition. ∎

This lemma thus gives a necessary and sufficient condition on our link function to obtain a proper loss function. The result is very similar to Corollaries 12 and 14 found in Reid and Williamson (2010). Their corollaries state necessary and sufficient conditions on the link function under the assumption that the loss has differentiable partial losses, an assumption we do not require.

In Section 6.2 we show that strictly proper losses, together with some additional assumptions, lead to consistent class probability estimates. So it is useful to know how to characterize those functions. The following lemma shows that a link function that turns a loss into a strictly proper and non-degenerate CPE loss can again be characterized by the behavior of v*.

###### Lemma 2.

Let l be a loss function and ψ a link function. The composite loss function lψ is then strictly proper and non-degenerate if and only if ψ ∈ v* and v*(η₁) ∩ v*(η₂) ∩ ψ([0, 1]) = ∅ for all pairwise different η₁, η₂ ∈ [0, 1].

###### Proof.

By definition the composite loss lψ is strictly proper if and only if v*ψ(η) = {η} for all η ∈ [0, 1]. First we show that lψ is strictly proper and non-degenerate if ψ ∈ v* and v*(η₁) ∩ v*(η₂) ∩ ψ([0, 1]) = ∅ for all pairwise different η₁, η₂. From Lemma 1 we know already that η ∈ v*ψ(η); we only have to show that η is the only element in the set. For that assume that it is not the only element, so that there is a q ≠ η such that q ∈ v*ψ(η). As in the proof of Lemma 1 one can conclude that ψ(q) ∈ v*(η). But we also know, again from Lemma 1, that ψ(q) ∈ v*(q). That means that ψ(q) ∈ v*(η) ∩ v*(q) ∩ ψ([0, 1]), which is a contradiction to our assumption.

Now we show that ψ ∈ v* and the disjointness condition hold if lψ is strictly proper and non-degenerate. The relation ψ ∈ v* follows again from Lemma 1. We prove the second claim by contradiction and assume that there exist pairwise different η₁, η₂ ∈ [0, 1] and a q ∈ [0, 1] such that ψ(q) ∈ v*(η₁) ∩ v*(η₂). With this choice and using that lψ is strictly proper it follows that q = η₁ and q = η₂. That means that η₁ = η₂, which is a contradiction. ∎

So if lψ is a strictly proper composite loss it will fulfill some sort of injectivity condition on the sets v*(η). With this we will be able to define an inverse on those sets, and this will essentially be our class probability estimator. With Lemma 2 we can connect every v ∈ V to a unique η ∈ [0, 1] by the relation v ∈ v*(η), if we assume that v* disjointly covers V in the sense that

 ⋃_{η ∈ [0, 1]} v*(η) = V   and (6)
 v*(η₁) ∩ v*(η₂) = ∅   for all η₁, η₂ ∈ [0, 1], η₁ ≠ η₂. (7)

Note that we know from Lemma 2 that for strict properness it is sufficient that the disjointness property (7) only holds on ψ([0, 1]), the image of ψ. This is merely a technicality and we will assume from now on that every strictly proper composite loss satisfies (7). The covering property (6), on the other hand, can be violated. This happens for example if we use the squared loss together with V = ℝ: the union of the sets v*(η) is then a bounded interval, so it only covers part of the space.

If we assume, however, that the regularity properties (6) and (7) hold for a strictly proper non-degenerate composite loss, we can extend the domain of ψ⁻¹ from ψ([0, 1]) to the whole of V, see also Figure 1 and the examples in Table 2.

###### Definition 1.

Let lψ be a strictly proper, non-degenerate composite loss and assume that v* disjointly covers V. We define, by abuse of notation, the inverse link function ψ⁻¹ : V → [0, 1] by ψ⁻¹(v) := η, where η is the unique element of [0, 1] such that v ∈ v*(η).

The requirements from the previous definition are what we consider the archetype of a composite loss that is suitable for probability estimation, although not all of the requirements are necessary. This motivates the following definition.

###### Definition 2.

We call a composite loss lψ a natural CPE loss if lψ is non-degenerate, fulfills the disjoint cover properties (6) and (7), and is strictly proper.

We now have all the necessary work done to make the following observation.

###### Corollary 1.

If lψ is a natural CPE loss, then ψ⁻¹(v) = η for all v ∈ v*(η).

###### Proof.

Let η ∈ [0, 1] and v ∈ V such that v ∈ v*(η). As by the previous lemmas we know that the sets v*(η) are pairwise disjoint, we have by Definition 1 that ψ⁻¹(v) = η. ∎

The corollary tells us that we can optimize our loss function over V to obtain some v ∈ v*(η) and then map this back with the inverse link to restore the class probability η. For this we once more refer to Figure 1. Remember that the set v*(η) is the set of all v ∈ V that minimize the loss if the true posterior probability is η. If we use a natural CPE loss we know then that ψ⁻¹ maps all those points back to η.

Given a predictor f : X → V this motivates defining an estimator of η as

 η̂ = η̂(x) = ψ⁻¹(f(x)). (8)

In Section 6 we give conditions under which η̂ converges in probability towards η when using an empirical risk minimizer as a prediction rule. More formally: given any ϵ > 0, we show that under certain conditions η̂ₙ := ψ⁻¹(fₙ) satisfies

 P(|η̂ₙ(X) − η(X)| > ϵ) → 0 as n → ∞, (9)

where the probability is measured with respect to X. In the next section, however, we first want to investigate v*(η) and ψ⁻¹ for some commonly used loss functions.
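The pipeline "minimize a surrogate loss, then push the minimizer through ψ⁻¹" can be simulated directly. The following minimal sketch is our own illustration with assumed ingredients: a true posterior η(x) = σ(2x), a well-specified linear model, the logistic loss, and plain gradient descent as the ERM solver. The L1 error of the estimator (8) shrinks as the sample size grows.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
eta_true = lambda x: sigmoid(2.0 * x)          # assumed true posterior

rng = np.random.default_rng(0)

def l1_error(n, steps=3000, lr=0.5):
    # draw an i.i.d. sample from the assumed distribution
    x = rng.uniform(-2.0, 2.0, n)
    y = (rng.uniform(size=n) < eta_true(x)).astype(float)
    X = np.column_stack([x, np.ones(n)])
    w = np.zeros(2)
    for _ in range(steps):                     # ERM for the logistic loss
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= lr * grad
    grid = np.linspace(-2.0, 2.0, 201)
    eta_hat = sigmoid(w[0] * grid + w[1])      # estimator psi^{-1}(f_n)
    return np.mean(np.abs(eta_hat - eta_true(grid)))

err_small, err_large = l1_error(100), l1_error(10000)
assert err_large < err_small                   # L1 error shrinks with n
```

This is only a sanity check of the construction, not a verification of the rates derived in Section 6.4.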

## 5 Analysis of Loss Functions

We now give examples of some commonly used loss functions and analyze, with the aid of Lemma 2, whether they are strictly proper or not. In Table 1 we summarize the loss functions we consider and, where possible, a link function that turns the loss into a strictly proper composite loss. Table 2 shows the corresponding functions v*(η) and ψ⁻¹. That the link functions indeed fulfill the requirements can be checked with Lemma 2. The behavior of the squared and squared hinge loss seems to be very similar. In Section 6.5, however, we point out an important difference.

As already noted by Buja et al. (2005), Table 2 also shows that the hinge loss is not suitable for class probability estimation. We observe that the sets v*(η) for different η are not disjoint. By Lemma 2 we can conclude that there is no link ψ such that the composite hinge loss is strictly proper. One way to fix this, proposed by Duin and Tax (1998) and similarly by Platt (1999), is to fit a logistic regressor on top of the support vector machine.

Bartlett and Tewari (2004) investigate the behavior of the hinge loss in more depth by connecting the class probability estimation task to the sparseness of the predictor. The hinge loss is of course classification calibrated (essentially meaning that we find point-wise the correct label with it), so among the surrogate losses we consider it is the only one that directly solves the classification problem without implicitly estimating the class probability.

## 6 Convergence of the Estimator

We now prove that the estimator defined in Equation (8) converges in probability and in the L1-norm to the true class probability whenever we use, as a prediction rule, an empirical risk minimizer for which we have excess risk bounds.

### 6.1 Using the True Risk Minimizer for Estimation

Before we can investigate under which conditions an empirical risk minimizer can (asymptotically) retrieve η, we need to investigate under which conditions the true risk minimizer can retrieve it. In this subsection we formulate a theorem that gives necessary and sufficient conditions for that. Not surprisingly, we basically require that our hypothesis class is rich enough to already contain the class probability function. Bartlett et al. (2006) and similar works often avoid problems caused by restricted classes by assuming from the beginning that the hypothesis class consists of all measurable functions. This theorem relaxes that assumption for the purpose of class probability estimation.

In this setting we assume that we use a hypothesis class F whose elements are functions f : X → V. If we want to do class probability estimation we rescale those functions by composing them with the inverse link, so that we effectively use the hypothesis class ψ⁻¹ ∘ F := {ψ⁻¹ ∘ f | f ∈ F}. We then get the following theorem about the possibility of retrieving the posterior with risk minimization.

###### Theorem 1.

Assume that lψ is a natural CPE loss function. Let f₀ := argmin_{f ∈ F} E[l(Y, f(X))]. Then ψ⁻¹(f₀(X)) = η(X) almost surely if and only if η ∈ ψ⁻¹ ∘ F.

###### Proof.

If ψ⁻¹(f₀(X)) = η(X) almost surely, then η ∈ ψ⁻¹ ∘ F by the definition of that space.

For the other direction assume that η ∈ ψ⁻¹ ∘ F. For any f ∈ ψ⁻¹ ∘ F first observe that

 E_X[L(η(X), ψ(f(X)))] = E_{X,Y}[l(Y, ψ(f(X)))].

Since lψ is a natural CPE loss we know that η(x) is the unique minimizer of Lψ(η(x), ·) for every x. Since η ∈ ψ⁻¹ ∘ F and f₀ is a minimizer of the risk, it follows that ψ⁻¹(f₀(X)) minimizes Lψ(η(X), ·) almost surely. As lψ is regular, the inverse link is well-defined and thus ψ⁻¹(f₀(X)) = η(X) almost surely. ∎

Following Theorem 1 we need to assume that our hypothesis class is flexible enough for consistent class probability estimation. We formulate this assumption as follows.

#### Assumption A

Given a natural CPE loss lψ we assume that η ∈ ψ⁻¹ ∘ F. In Subsection 6.3 we will deal with the case of misspecification, i.e. when η ∉ ψ⁻¹ ∘ F.

### 6.2 Using the Empirical Risk Minimizer for Estimation

In the previous section we considered the possibility of retrieving class probability estimates with the true risk minimizer. To move on to empirical risk minimizers we need the notion of excess risk bounds.

###### Definition 3.

Let fₙ be any estimator of f₀. We call a function

 B_F : ℕ → [0, ∞) (10)

an excess risk bound for fₙ if for all n ∈ ℕ

 E_X[ΔL(η(X), fₙ(X))] = E_{X,Y}[l(Y, fₙ(X)) − l(Y, f₀(X))] ≤ B_F(n). (11)

Excess risk bounds are typically of the order O(C·n^{−α}), where α ∈ [1/2, 1] and C is a notion of model complexity. Common measures for the model complexity are the VC dimension Vapnik (1998), Rademacher complexity Bartlett et al. (2005) or the ϵ-cover Benedek and Itai (1991). The existence of excess risk bounds is tied to the finiteness of any of those complexity notions. A lot of effort in this line of research goes into finding relations between the exponent α and the statistical learning problem given by the hypothesis class F, the loss l and the underlying distribution P. Conditions that ensure α > 1/2 are often called easiness conditions, such as the Tsybakov condition Tsybakov (2004) or the Bernstein condition Audibert (2004). Intuitively those conditions often state that the variance of our estimator gets smaller the closer we are to the optimal solution. For an in-depth discussion and some recent results we refer to the work of Grünwald and Mehta (2016).

Excess risk bounds allow us to bound the expected value of ΔL(η(X), fₙ(X)) for a loss l, so in particular we can bound E_X[ΔLψ(η(X), η̂ₙ(X))] for a composite loss lψ. We will show L1-convergence by connecting the behavior of ΔLψ(η, η̂) to |η − η̂|. The following lemma introduces a condition that allows us to draw this connection.

###### Lemma 3.

Let lψ be a natural CPE loss. Assume that for all η ∈ [0, 1] the maps

 L⁰ψ(η, ·) := Lψ(η, ·)↾[0, η] : [0, η] → ℝ

and

 L¹ψ(η, ·) := Lψ(η, ·)↾[η, 1] : [η, 1] → ℝ

are strictly monotonic, where ↾[a, b] refers to the restriction of the mapping to the interval [a, b]. This is in particular the case if Lψ(η, ·) is strictly convex with η as its minimizer. Then there exists for every ϵ > 0 a δ(ϵ) > 0 such that for all η, η̂ ∈ [0, 1]

 |ΔLψ(η, η̂)| < δ(ϵ) ⇒ |η − η̂| < ϵ. (12)
###### Proof.

With the assumptions on L⁰ψ and L¹ψ those maps have well-defined inverse mappings with their image as the domain, and those inverse mappings are continuous Hoffmann (2015). That means in particular that for every η and for all ϵ > 0 there exists a δ > 0 such that

 |l̂ − l| < δ ⇒ |(L⁰ψ)⁻¹(η, l̂) − (L⁰ψ)⁻¹(η, l)| < ϵ (13)

and similarly for L¹ψ. W.l.o.g. assume now that η̂ ≤ η, so that η̂ ∈ [0, η]. Then we set l̂ = L⁰ψ(η, η̂) and l = L⁰ψ(η, η). Plugging this into (13) we get the following relation.

 |ΔLψ(η, η̂)| = |L⁰ψ(η, η̂) − L⁰ψ(η, η)| < δ ⇒ |η̂ − η| = |(L⁰ψ)⁻¹(η, l̂) − (L⁰ψ)⁻¹(η, l)| < ϵ ∎

The map L⁰ψ captures the behavior of the loss when η is the true class probability and we predict a class probability less than η. Similarly L¹ψ captures the behavior when we predict a class probability bigger than η, see also Figure 4. In Corollary 3, further below, we draw a connection between δ(ϵ) and the modulus of continuity of the inverse functions of L⁰ψ and L¹ψ. The function δ(ϵ) plays an important role in the convergence rate of the estimator, as described in the next theorem.

###### Theorem 2.

Let lψ be a natural CPE loss and assume Assumption A holds. Furthermore, let B_F(n) be an excess risk bound for fₙ and assume that Lψ(η, ·) is strictly convex with η as its minimizer. Then there exists a mapping δ : (0, ∞) → (0, ∞) such that for η̂ₙ = ψ⁻¹(fₙ) we have that

 P(|η(X) − η̂ₙ(X)| > ϵ) ≤ B_F(n) / δ(ϵ). (14)
###### Proof.

Using Lemma 3 for the first inequality, Markov's inequality for the second and the excess risk bound for the third inequality, it follows that

 P(|η(X) − η̂ₙ(X)| > ϵ) ≤ P(ΔLψ(η(X), η̂ₙ(X)) ≥ δ(ϵ)) = P(ΔL(η(X), fₙ(X)) ≥ δ(ϵ)) ≤ E[ΔL(η(X), fₙ(X))] / δ(ϵ) ≤ B_F(n) / δ(ϵ). ∎

This theorem directly gives us the asymptotic convergence result claimed earlier.

###### Corollary 2.

Under the assumptions of Theorem 2 we have that η̂ₙ converges in probability and in L1-norm to η.

We do not have to restrict ourselves to asymptotic results, though. Theorem 2 can also be used to derive rates of convergence, as we will see in Subsection 6.4. But before that we briefly want to address the case of misspecification, i.e. the case when Assumption A does not hold.

### 6.3 Misspecification

The case of misspecification can be dealt with once we assume that L*ψ has a gradient. If this holds then Reid and Williamson (2010) show the identity

 ΔLψ(η, η̂) = D_{−L*ψ}(η, η̂), (15)

where D_{−L*ψ}(η, η̂) is the Bregman divergence associated with −L*ψ between η and η̂. Excess risk bounds on fₙ then translate into bounds on the Bregman divergence between η and η̂ₙ, and asymptotically we approach the best class probability estimate in terms of this divergence.
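Identity (15) can be made concrete for the log loss: there −L*ψ is the negative Shannon entropy of a Bernoulli distribution, and the associated Bregman divergence is the binary Kullback–Leibler divergence. The following sketch (our own check, not from the paper) verifies the identity numerically on a grid.

```python
import numpy as np

# log loss as a CPE loss: L(eta, q) = -eta*log(q) - (1-eta)*log(1-q)
L = lambda eta, q: -eta * np.log(q) - (1 - eta) * np.log(1 - q)
L_star = lambda eta: L(eta, eta)                 # optimal conditional risk
phi = lambda eta: -L_star(eta)                   # -L* is convex (neg. entropy)
dphi = lambda eta: np.log(eta / (1 - eta))       # its derivative

def bregman(eta, eta_hat):
    # Bregman divergence of phi between eta and eta_hat
    return phi(eta) - phi(eta_hat) - dphi(eta_hat) * (eta - eta_hat)

etas = np.linspace(0.05, 0.95, 19)
for eta in etas:
    for eta_hat in etas:
        excess = L(eta, eta_hat) - L_star(eta)   # Delta L_psi(eta, eta_hat)
        assert abs(excess - bregman(eta, eta_hat)) < 1e-12
```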

### 6.4 Rate of Convergence

For the rate of convergence it is crucial to investigate the function δ(ϵ) from Inequality (14). One way to analyze this is to study the modulus of continuity of the inverse functions of L⁰ψ and L¹ψ:

###### Definition 4.

Let ω : [0, ∞) → [0, ∞) be a monotonically increasing function. Let I ⊆ ℝ be an interval. A function g : I → ℝ admits ω as a modulus of continuity at x ∈ I if and only if

 |g(x) − g(y)| ≤ ω(|x − y|)

for all y ∈ I.

For example, Hölder and Lipschitz continuity are particular moduli of continuity. This notion allows us to draw the following connection between ω and δ(ϵ).

###### Corollary 3.

Let lψ be a natural CPE loss and let ω be a monotonically increasing function. Assume that for all η ∈ [0, 1] the mappings (L⁰ψ)⁻¹(η, ·) and (L¹ψ)⁻¹(η, ·) admit ω as a modulus of continuity at L*ψ(η). Then δ(ϵ) := ω⁻¹(ϵ) is a mapping such that Implication (12) holds.

###### Proof.

W.l.o.g. assume that η̂ ≤ η. Let l = L⁰ψ(η, η) = L*ψ(η) and l̂ = L⁰ψ(η, η̂). By using that (L⁰ψ)⁻¹(η, ·) admits ω as a modulus of continuity we have

 |(L⁰ψ)⁻¹(η, l) − (L⁰ψ)⁻¹(η, l̂)| ≤ ω(|l − l̂|).

Plugging in the definitions of l and l̂ this means that |η − η̂| ≤ ω(ΔLψ(η, η̂)). Using the monotonicity of ω it follows that if ΔLψ(η, η̂) < ω⁻¹(ϵ), then |η − η̂| < ϵ.

This is exactly Implication (12). ∎

Note that it follows from the proof that finding a modulus of continuity for (L⁰ψ)⁻¹ and (L¹ψ)⁻¹ can be done by showing the bound |η − η̂| ≤ ω(ΔLψ(η, η̂)). We will use that in the following examples, where we analyze δ(ϵ) for the squared (hinge) loss and the logistic loss. We show that those loss functions lead to a modulus of continuity given by the square root times a constant. Agarwal (2014) calls loss functions that admit this modulus of continuity strongly proper loss functions. The following analysis can thus be found there in more detail and for a few more examples. For simplicity we use versions of the losses that do not need a link function and are already CPE losses; the results are summarized in Table 2.

#### Example: Squared Loss and Squared Hinge Loss

Let l be given by the partial loss functions l(1, η̂) = (1 − η̂)² and l(−1, η̂) = η̂². We can derive that ΔL(η, η̂) = (η − η̂)². With this we can directly bound

 |η̂ − η| ≤ √ΔL(η, η̂)

and thus choose δ as the inverse of the square-root function, so that δ(ϵ) = ϵ². The analysis for the squared hinge loss is the same, as this version of the squared loss is already a CPE loss.
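The identity ΔL(η, η̂) = (η − η̂)² used above can be checked numerically; the following short sketch (our own) does so on a grid.

```python
import numpy as np

# squared CPE loss with partial losses l(1,q) = (1-q)^2, l(-1,q) = q^2
L = lambda eta, q: eta * (1 - q) ** 2 + (1 - eta) * q ** 2

etas = np.linspace(0.0, 1.0, 21)
for eta in etas:
    for q in etas:
        excess = L(eta, q) - L(eta, eta)     # Delta L(eta, q)
        # hence |q - eta| = sqrt(Delta L), i.e. omega(x) = sqrt(x)
        assert abs(excess - (eta - q) ** 2) < 1e-12
```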

#### Example: Logistic Loss

Let l be given by the partial loss functions l(1, η̂) = −ln(η̂) and l(−1, η̂) = −ln(1 − η̂). One can derive that ΔL(η, η̂) = KL(η, η̂), the Kullback–Leibler divergence between the Bernoulli distributions with parameters η and η̂. One can show the bound |η̂ − η| ≤ √(ΔL(η, η̂)/2), so that we can choose δ(ϵ) = 2ϵ².
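The bound behind this choice of δ is of Pinsker type and is likewise easy to check numerically; the following sketch (our own) verifies |η̂ − η| ≤ √(ΔL(η, η̂)/2) on a grid.

```python
import numpy as np

# log loss: l(1, q) = -log(q), l(-1, q) = -log(1 - q)
L = lambda eta, q: -eta * np.log(q) - (1 - eta) * np.log(1 - q)

qs = np.linspace(0.01, 0.99, 99)
for eta in qs:
    for q in qs:
        excess = L(eta, q) - L(eta, eta)     # = KL(eta || q) for Bernoullis
        # Pinsker-type bound: |eta - q| <= sqrt(excess / 2)
        assert abs(eta - q) <= np.sqrt(excess / 2) + 1e-12
```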

### 6.5 Squared Loss vs Squared Hinge Loss

In this section we will subscript previously defined entities with s and sh for the squared and the squared hinge loss respectively. When using the squared loss vs. the squared hinge loss for class probability estimation, there is one big difference in the inverse of the link function, namely its domain. The inverse link function is a map ψ⁻¹ : V → [0, 1]. If we use the squared loss we implicitly choose V = [0, 1], since this is the range of v*s. The range of v*sh, on the other hand, is all of ℝ. That means that if we want to use the squared loss for class probability estimation we really have to parametrize our prediction functions as f : X → [0, 1]; a simple linear model, for example, would usually not fit this assumption, as the range of such models can lie outside of [0, 1]. For the squared hinge loss, on the other hand, we can allow for functions f : X → ℝ.

Sugiyama (2010) proposes to simply truncate the inverse link for the squared loss, i.e. to use the same inverse link as for the squared hinge loss. This is fine as long as our hypothesis class is flexible enough, but it leads to problems if that is not the case, as the following example shows.

Assume we are given three one-dimensional data points together with their true class probabilities. We want to learn this classification with linear models, which are two-dimensional after including a bias term. One can check that in the case of the squared hinge loss we can recover the true class probabilities with a suitable linear function. By Theorem 1 we then know that an optimal solution is also able to recover the true class probabilities.

The squared loss has, after truncating, the following problem. Although the corresponding linear function is part of F, after truncating it will not be found back as an optimal solution. One can instead check that for the given example the true risk minimizer is a different hypothesis, and this hypothesis does not recover the true class probabilities. This might appear to be a contradiction to Theorem 1, but the problem arises because we use a different link function than the one associated with the squared loss.
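Since the paper's concrete numbers are not reproduced here, the following sketch uses hypothetical points of our own to illustrate the same effect: with a restricted linear class, the squared hinge loss recovers posteriors that the squared loss with a truncated inverse link cannot.

```python
import numpy as np

# Hypothetical 1-D example (ours, not the paper's): three points whose
# posteriors a clipped linear model can match exactly.
xs  = np.array([-2.0, 0.0, 2.0])
eta = np.array([ 0.0, 0.9, 1.0])

def best_linear(cond_risk, inv_link):
    # grid-search the linear class f(x) = a*x + b for the population risk minimizer
    grid = np.linspace(-3.0, 3.0, 301)
    a, b = np.meshgrid(grid, grid)
    f = a[..., None] * xs + b[..., None]              # shape (301, 301, 3)
    risks = cond_risk(eta, f).mean(axis=-1)
    i, j = np.unravel_index(np.argmin(risks), risks.shape)
    return inv_link(a[i, j] * xs + b[i, j])

# squared CPE loss on f, with a Sugiyama-style truncated inverse link
sq = lambda e, q: e * (1 - q) ** 2 + (1 - e) * q ** 2
eta_sq = best_linear(sq, lambda v: np.clip(v, 0.0, 1.0))

# squared hinge loss on f in R, inverse link clip((v + 1) / 2)
sh = lambda e, v: e * np.maximum(0, 1 - v) ** 2 + (1 - e) * np.maximum(0, 1 + v) ** 2
eta_sh = best_linear(sh, lambda v: np.clip((v + 1) / 2, 0.0, 1.0))

print(np.max(np.abs(eta_sh - eta)))   # ~0: squared hinge recovers the posteriors
print(np.max(np.abs(eta_sq - eta)))   # stays bounded away from 0
```

The squared hinge loss can push predictions past its "saturation" points v = ±1 at no extra cost, which is exactly the slack the restricted linear class needs; the squared loss has no such slack, so its risk minimizer trades off errors across points instead.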

## 7 Discussion and Conclusion

The starting point of this paper is the question whether one can consistently retrieve a class probability estimate based on ERM. To answer this question we draw from earlier work on proper scoring rules and excess risk bounds. Lemmas 1 and 2, our first results, characterize strictly proper composite loss functions in terms of their link function. Based on those lemmas, we subsequently derive necessary and sufficient conditions for retrieving the true class probability with ERM, as formulated in Theorem 1. Next to some regularity conditions on the loss function, we show that to retrieve the true probabilities we essentially need that they are already part of our hypothesis class, which, in a way, is not surprising.

In Section 6 we use the results from the previous sections and theory about excess risk bounds to state our main consistency and finite sample size results. We show that consistency arises whenever we use strictly proper (composite) loss functions, our hypothesis class is flexible enough, and we have excess risk bounds. This is the case, for example, whenever one of the complexity notions mentioned in Section 6 is finite. We then discuss the relation between the finite sample size behavior of the excess risk bound and the probability estimate and examine this relation for two example loss functions.

In Lemma 3 we introduce fairly general conditions under which a composite loss function leads to a consistent class probability estimator. In particular, we impose a condition on the conditional risk . (See also Figure 4.) Based on that, we derive in Corollary 3 conditions which allow us to analyze the convergence rate for different loss functions. In the corollary we do not distinguish between and , which leads to the same convergence rate for predictions to the left and to the right of . But the moduli of continuity of those two functions can differ substantially, especially when using asymmetric proper scoring rules Winkler (1994). We believe that by analyzing and individually one can extend our work to study the convergence behavior of asymmetric scoring rules in more detail.
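Schematically, and with notation simplified relative to the paper's Lemma 3 and Corollary 3, the rate argument takes the following form: a modulus of continuity ω controlling the conditional risk translates an excess risk bound directly into an L1-rate for the probability estimate,

```latex
\[
  \bigl\| \psi^{-1}\!\circ \hat f \;-\; \eta \bigr\|_{L_1(P_X)}
  \;\le\; \omega\!\bigl( R_\ell(\hat f) - R_\ell^{*} \bigr),
\]
```

so that an excess risk bound of order \(O(n^{-\alpha})\) yields an L1-rate of order \(\omega(n^{-\alpha})\). For instance, for a \(\lambda\)-strongly proper loss one may take \(\omega(\epsilon) = \sqrt{2\epsilon/\lambda}\) (cf. Agarwal, 2014); asymmetric scoring rules would instead use two one-sided moduli.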

As stated at the outset, one of our main goals is to emphasize the tight relationship between empirical risk minimization and class probability estimation in a distilled and compact form. The concept of a link function and its relation to empirical risk minimization does not get the attention it deserves and is thus reinvented from time to time. Many of these concepts appear, for example, in the analysis of Zhang (2004) without any explicit reference to proper scoring rules.

## References

• Lewis and Catlett (1994) D. D. Lewis and J. Catlett, Heterogeneous uncertainty sampling for supervised learning, in Proceedings of the Eleventh International Conference on Machine Learning (Morgan Kaufmann, 1994) pp. 148–156.
• Roy and McCallum (2001) N. Roy and A. McCallum, Toward optimal active learning through sampling estimation of error reduction, in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001) pp. 441–448.
• Grandvalet and Bengio (2005) Y. Grandvalet and Y. Bengio, Semi-supervised learning by entropy minimization, in Advances in Neural Information Processing Systems 17, edited by L. K. Saul, Y. Weiss,  and L. Bottou (MIT Press, 2005) pp. 529–536.
• Buja et al. (2005) A. Buja, W. Stuetzle,  and Y. Shen, Loss functions for binary class probability estimation and classification: Structure and applications, manuscript, available at www-stat.wharton.upenn.edu/~buja (2005).
• Reid and Williamson (2010) M. D. Reid and R. C. Williamson, Composite binary losses, J. Mach. Learn. Res. 11, 2387 (2010).
• Reid and Williamson (2011) M. D. Reid and R. C. Williamson, Information, divergence and risk for binary experiments, J. Mach. Learn. Res. 12, 731 (2011).
• Telgarsky et al. (2015) M. Telgarsky, M. Dudík,  and R. Schapire, Convex risk minimization and conditional probability estimation, in Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015 (2015) pp. 1629–1682.
• Zhang (2004) T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, The Annals of Statistics 32, 56 (2004).
• Agarwal and Agarwal (2015) A. Agarwal and S. Agarwal, On consistent surrogate risk minimization and property elicitation, in Proceedings of The 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 40, edited by P. Grünwald, E. Hazan,  and S. Kale (PMLR, Paris, France, 2015) pp. 4–22.
• Agarwal (2014) S. Agarwal, Surrogate regret bounds for bipartite ranking via strongly proper losses, Journal of Machine Learning Research 15, 1653 (2014).
• Duin and Tax (1998) R. P. Duin and D. M. Tax, Classifier conditional posterior probabilities, in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (Springer, 1998) pp. 611–619.
• Platt (1999) J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in Advances in Large Margin Classifiers (MIT Press, 1999) pp. 61–74.
• Bartlett and Tewari (2004) P. L. Bartlett and A. Tewari, Sparseness versus estimating conditional probabilities: Some asymptotic results, in Learning Theory, edited by J. Shawe-Taylor and Y. Singer (Springer Berlin Heidelberg, Berlin, Heidelberg, 2004) pp. 564–578.
• Bartlett et al. (2006) P. L. Bartlett, M. I. Jordan,  and J. D. McAuliffe, Convexity, classification, and risk bounds, Journal of the American Statistical Association 101, 138 (2006), https://doi.org/10.1198/016214505000000907.
• Vapnik (1998) V. N. Vapnik, Statistical Learning Theory (Wiley-Interscience, 1998).
• Bartlett et al. (2005) P. L. Bartlett, O. Bousquet,  and S. Mendelson, Local Rademacher complexities, Ann. Statist. 33, 1497 (2005).
• Benedek and Itai (1991) G. M. Benedek and A. Itai, Learnability with respect to fixed distributions, Theor. Comput. Sci. 86, 377 (1991).
• Tsybakov (2004) A. B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32, 135 (2004).
• Audibert (2004) J.-Y. Audibert, Une approche PAC-bayésienne de la théorie statistique de l'apprentissage, Ph.D. thesis, Université Paris 6 (2004).
• Grünwald and Mehta (2016) P. D. Grünwald and N. A. Mehta, Fast rates with unbounded losses, CoRR abs/1605.00252 (2016), arXiv:1605.00252 .
• Hoffmann (2015) H. Hoffmann, On the continuity of the inverses of strictly monotonic functions, Bulletin of the Irish Mathematical Society 75, 45 (2015).
• Sugiyama (2010) M. Sugiyama, Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting, IEICE Transactions 93-D, 2690 (2010).
• Winkler (1994) R. L. Winkler, Evaluating probabilities: Asymmetric scoring rules, Management Science 40, 1395 (1994).