# Large-Margin Classification with Multiple Decision Rules

Binary classification is a common statistical learning problem in which a model is estimated on a set of covariates for some outcome indicating the membership of one of two classes. In the literature, there exists a distinction between hard and soft classification. In soft classification, the conditional class probability is modeled as a function of the covariates. In contrast, hard classification methods only target the optimal prediction boundary. While hard and soft classification methods have been studied extensively, not much work has been done to compare the actual tasks of hard and soft classification. In this paper we propose a spectrum of statistical learning problems which span the hard and soft classification tasks based on fitting multiple decision rules to the data. By doing so, we reveal a novel collection of learning tasks of increasing complexity. We study the problems using the framework of large-margin classifiers and a class of piecewise linear convex surrogates, for which we derive statistical properties and a corresponding sub-gradient descent algorithm. We conclude by applying our approach to simulation settings and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.

## 1 Introduction

Classification is one of the most widely applied and well-studied problems in supervised learning. As in the usual regression problem, given a training set of observed covariates and outcomes, the outcome is modeled as a function of the covariates. However, in contrast to standard regression with a continuous response variable, classification describes the setting where the outcome is a discrete class label. While generalizations to more than two classes exist, in this paper we focus on the standard binary problem where the label takes one of two possible values, typically denoted by $+1$ and $-1$.

Given such a dataset, the goal is commonly to build a model, either to predict the class of a new observation from the covariate space, or to estimate the probability of each class as a function of the covariates. These tasks correspond to hard and soft classification, respectively. Briefly, we refer to methods which only target the optimal prediction rule as hard classifiers, and those which produce estimates of class probability as soft classifiers. Examples of hard classifiers include the support vector machine (SVM) [1, 2] and $\psi$-learning [3, 4], and examples of soft classifiers include logistic regression and other likelihood-based approaches. Often, soft classifiers are also used to obtain hard classification rules by predicting the class with greater estimated probability. These rules are commonly referred to as plug-in classifiers. While hard classification rules do not directly provide conditional class probability estimates, several approaches have been proposed for estimating class probabilities based on hard classifiers, including those of [5] and [6]. As such, methods which may be traditionally viewed as soft and hard classifiers are often used for either task. Naturally, a question of interest is: how are hard and soft classifiers related, and how do they differ in practice?

Recently, [7] introduced the Large-margin Unified Machines (LUM) family of margin-based classifiers, shedding some light on the relationship between hard and soft classifiers. The LUM family connects several popular margin-based classification methods, including SVM, distance-weighted discrimination (DWD) [8], and a new hybrid logistic loss. Their approach was further extended to the multi-category case in [9]. Margin-based approaches to classification are popular in practice for their accuracy and computational efficiency in both low and high-dimensional settings. While it is a flexible family of margin-based classifiers, the LUM approach examines only a specific parameterized collection of classifiers along the gradient of soft to hard classification. In this paper, we similarly focus on connecting hard and soft margin-based methods. However, we consider a more natural approach based on connecting the tasks of hard and soft classification rather than specific hard and soft classifiers. Specifically, we propose a novel framework of binary learning problems which may be formulated as partial or full estimation of the conditional class probability based on fitting an arbitrary number of boundaries to the data. As an example, suppose we are interested in separating patients into four disease risk groups based on clinical measurements. One possible approach is to group patients according to whether their conditional probability of being positive for the disease is less than 25%, between 25% and 50%, between 50% and 75%, or greater than 75%. In this setting, the emphasis is not on the accuracy of class probability estimates, but instead on the correct stratification of individuals into risk groups. Therefore, only partial estimation of the conditional class probability is required, namely at the three boundaries, 25%, 50%, and 75%. While stratification of the patient classes is possible using a soft classifier, an approach directly targeting the three boundaries may provide improved stratification by requiring weaker assumptions on the entire form of the underlying conditional class probability.

In addition to hard and soft classification, the proposed framework also encompasses rejection-option classification [10, 11, 12, 13] and weighted classification [14, 15], two other well-studied binary learning problems. Briefly, the rejection-option problem expands on standard binary classification by introducing a third option to reject, where neither label is predicted. Notably, it can be shown that the decision to reject directly corresponds to a prediction that the probability of belonging to either class does not exceed a specified threshold. Since the task requires estimation of more than a single classification boundary, but less than the full class conditional probability, it may be viewed as an intermediate problem to hard and soft classification, as in the example given above. Applications of rejection option classification include certain medical settings where predictions should only be made when a level of certainty is obtained. Additionally, weighted classification extends the standard classification problem by accounting for differences or biases in class populations. We define these problems more formally, along with hard and soft classification, in Section 2.

The remainder of this paper is organized as follows. In the first part of Section 2, we provide a review of margin-based learning. Then, in the remainder of Section 2, we define our family of binary learning problems and introduce a corresponding theoretical loss, which generalizes the standard misclassification error to connect class prediction with probability estimation. In Section 3, we provide necessary and sufficient conditions for consistency of a surrogate loss function, and propose a class of consistent piecewise linear surrogates akin to the SVM hinge loss for binary classification. In Section 4, we present theoretical bounds on the empirical performance of classification rules obtained using surrogate loss functions. In Section 5, we provide a sub-gradient descent (SGD) algorithm for solving the corresponding optimization problem using the proposed piecewise linear surrogates. We then illustrate the behavior of our generalized family of classifiers using simulation in Section 6, and a real data example from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database in Section 7. We conclude in Section 8 with a discussion of the proposed framework.

## 2 Methodology

In this section, we first briefly introduce margin-based classifiers, and formally define the notion of classification consistency for loss functions. We then state the general form of our unified framework of problems and introduce a corresponding family of theoretical loss functions which encompasses the standard misclassification error as a special case.

### 2.1 Margin-Based Classifiers

Let denote a training set of covariate–label pairs drawn from according to some unknown distribution . For binary problems, is used to denote the label space, and often , with . Given a training set, margin-based classifiers minimize a penalized loss over a class, , of margin functions, . Typically, the corresponding optimization problem is written as:

$$\min_{f \in \mathcal{F}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} L(y_i f(x_i))}_{\text{loss}} + \underbrace{\lambda J(f)}_{\text{penalty}}, \tag{1}$$

where is a loss function defined with respect to the functional margin, , and is some roughness measure on with corresponding tuning parameter . Both hard and soft classification may be formulated as margin-based problems. In the case of hard classification, with a little abuse of notation, we use to denote a predicted class label, and to denote a prediction rule on . In margin-based classification, is combined with a margin function, , to obtain predictions in . Most commonly, in hard classification the sign rule, , is used, assuming almost surely (a.s.). Thus, given a new pair with , correct classification is obtained if and only if . Since the functional margin, , serves as an approximate measure for classification correctness, the loss function, , in (1) is often chosen to be a non-increasing function over . A natural choice of in hard classification is the misclassification error, or 01 loss, given by:

$$\ell_{0\text{-}1}(Y, \hat{Y}) = I\{\hat{Y} \neq Y\}, \tag{2}$$

where is used to denote the indicator function. Using the sign rule, the loss may be equivalently written over the class of margin functions as: . However, direct optimization of the non-convex and discontinuous loss, , is NP-hard and often infeasible in practice. Thus, continuous convex losses, called surrogates, are commonly used instead. Choices of the surrogate loss function corresponding to existing margin-based classifiers include the SVM hinge loss, , logistic loss, , and the DWD loss, . Finally, the penalty term, is used to prevent over-fitting and improve generalizability of the resulting classifier. The amount of penalization is commonly determined by cross-validation over a grid of values. Here, we note that while in the literature there exists a natural theoretical loss for hard classification, i.e. the 01 loss, there is no equivalent theoretical loss targeting consistent probability estimation for soft classification. In addition to providing a spectrum of theoretical loss functions covering soft and hard classifications at the two extremes, our proposed framework also naturally defines precisely such a theoretical loss for the soft classification problem (Figure 2C).
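To make the penalized objective (1) and the surrogate losses named above concrete, the following is a minimal Python sketch. The closed forms used for the hinge, logistic, and DWD losses are the standard textbook versions and are assumed here rather than taken from the text. The linear margin function, the squared-norm penalty, the toy data, and all function names are likewise illustrative choices.

```python
import math

def hinge_loss(z):
    """SVM hinge loss (assumed standard form): max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def logistic_loss(z):
    """Logistic loss (assumed standard form): log(1 + exp(-z))."""
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dwd_loss(z):
    """DWD loss (assumed standard form): 1 - z if z <= 1/2, else 1 / (4z)."""
    return 1.0 - z if z <= 0.5 else 1.0 / (4.0 * z)

def penalized_objective(w, b, xs, ys, loss, lam):
    """Objective (1) for a hypothetical linear margin function f(x) = w.x + b,
    with the squared Euclidean norm of w standing in for the penalty J(f)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(wj * xj for wj, xj in zip(w, x)) + b
        total += loss(y * f_x)  # loss of the functional margin y * f(x)
    return total / n + lam * sum(wj * wj for wj in w)

# Toy data, purely illustrative.
xs = [(1.0, 2.0), (-1.0, -0.5), (0.5, -1.5)]
ys = [+1, -1, -1]
obj = penalized_objective((0.5, 0.5), 0.0, xs, ys, hinge_loss, lam=0.1)
```

Swapping `hinge_loss` for `logistic_loss` or `dwd_loss` changes only the surrogate, which is the sense in which (1) covers several classifiers at once.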

In Section 1, we briefly discussed the learning tasks of rejection-option and weighted classification. As with hard and soft classification, these tasks may also be formulated as margin-based problems. We next describe how rejection-option classification may be formulated as a problem of the form (1). Borrowing the notation of [12], we use to denote the rejection option such that a prediction, , takes values in . Then, for some pre-specified rejection cost , they propose the following theoretical loss for rejection-option classification:

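$$\ell_{\mathrm{rej},\pi}(Y, \hat{Y}_{\mathrm{rej}}) = \begin{cases} 1 & \text{if } \hat{Y}_{\mathrm{rej}} \neq Y,\ \hat{Y}_{\mathrm{rej}} \neq 0 \\ \pi & \text{if } \hat{Y}_{\mathrm{rej}} = 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$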

To express the loss as a function over , [12] propose the prediction rule for some appropriately chosen . Then, may be written as the following generalized 01 loss on :

$$L_{\mathrm{rej},\pi}(Yf(X); \delta) = (1 - \pi)\, I\{Yf(X) \leq -\delta\} + \pi\, I\{Yf(X) < \delta\}.$$
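The equivalence between the rejection-option loss (3) and its margin-based form can be checked numerically. The sketch below assumes a prediction rule that returns the sign of the margin function when its absolute value is at least $\delta$ and rejects otherwise. This rule is inferred from the indicator terms of the margin-based loss above and is an assumption, as are the function names and the particular values of $\pi$ and $\delta$.

```python
def reject_rule(f_x, delta):
    """Assumed prediction rule: sign of f(x) if |f(x)| >= delta, else 0 (reject)."""
    if abs(f_x) < delta:
        return 0
    return 1 if f_x > 0 else -1

def loss_rej(y, y_hat, pi):
    """Theoretical loss (3): cost pi for rejecting, 1 for an error, 0 otherwise."""
    if y_hat == 0:
        return pi
    return 1.0 if y_hat != y else 0.0

def margin_loss_rej(margin, delta, pi):
    """Margin-based form: (1 - pi) * I{yf <= -delta} + pi * I{yf < delta}."""
    return (1.0 - pi) * (margin <= -delta) + pi * (margin < delta)

pi, delta = 0.25, 0.5
grid = [k / 4.0 for k in range(-8, 9)]  # includes the cut points -delta and +delta
agree = all(
    abs(loss_rej(y, reject_rule(f, delta), pi) - margin_loss_rej(y * f, delta, pi)) < 1e-12
    for f in grid
    for y in (-1, 1)
)
```

On this grid the two formulations agree at every point, including at the cut points themselves, where the direction of the inequalities matters.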

We finally consider the task of weighted classification. In contrast to the problems mentioned thus far, to fit the form of (1), weighted classification requires specifying separate theoretical loss functions for observations from the and classes, denoted by and . For simplicity, we use to denote the loss for both classes. Similar to hard classification, the task is to predict class labels in . The loss function depends on a weight parameter, , which accounts for imbalances between the two classes. Commonly, is constrained to the interval without loss of generality. Then, for fixed weight , the weighted loss is given by:

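$$\begin{aligned} \ell^{Y}_{w,\pi}(Y, \hat{Y}) &= I\{Y = +1\} \cdot \ell^{+}_{w,\pi}(\hat{Y}) + I\{Y = -1\} \cdot \ell^{-}_{w,\pi}(\hat{Y}), \\ \ell^{+}_{w,\pi}(\hat{Y}) &= (1 - \pi) \cdot I\{\hat{Y} \neq +1\}, \\ \ell^{-}_{w,\pi}(\hat{Y}) &= \pi \cdot I\{\hat{Y} \neq -1\}. \end{aligned} \tag{4}$$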

Note that the standard 0-1 loss corresponds to the special case of the weighted loss (4) when equal weight is assigned to the two classes with $\pi = 0.5$. Using the same prediction rule as for hard classification, $\mathrm{sign}(f(X))$, the loss over the functional margin may be written:

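$$\begin{aligned} L^{Y}_{w,\pi}(Yf(X)) &= I\{Y = +1\} \cdot L^{+}_{w,\pi}(Yf(X)) + I\{Y = -1\} \cdot L^{-}_{w,\pi}(Yf(X)), \\ L^{+}_{w,\pi}(Yf(X)) &= (1 - \pi) \cdot I\{Yf(X) < 0\}, \\ L^{-}_{w,\pi}(Yf(X)) &= \pi \cdot I\{Yf(X) < 0\}. \end{aligned}$$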

As with the usual 0-1 loss, optimization of $L_{\mathrm{rej},\pi}$ and $L^{Y}_{w,\pi}$ is NP-hard, and in practice should be approximated using a convex surrogate loss. In the next section, we introduce the notion of consistency, an important statistical property of surrogate loss functions.

### 2.2 Classification Consistency

Much work has been done to study the statistical properties of classifiers of the form given in (1) [16, 17, 18, 19]. Of these, consistency of loss functions is one of the most fundamental. In general, a loss function is called consistent for a margin-based learning problem if it recovers in expectation the optimal rule, often called the Bayes rule, to the theoretical loss function, e.g. , or . More formally, for a theoretical loss function, , and a surrogate loss, , let and denote the Bayes rule and -optimal margin function, respectively. Then, we call consistent for if , where is the appropriate prediction rule, e.g. the sign function. Equivalently, using the margin-based formulation of the theoretical loss, , and letting denote the -optimal margin function, consistency may be expressed as . For rejection-option classification, the Bayes optimal rule is given by:

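$$Y^{*}_{\mathrm{rej},\pi}(X) = \begin{cases} +1 & \text{if } p(X) \geq 1 - \pi \\ 0 & \text{if } p(X) \in (\pi, 1 - \pi) \\ -1 & \text{if } p(X) \leq \pi. \end{cases} \tag{5}$$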

The Bayes optimal rule for weighted classification is given by:

$$Y^{*}_{w,\pi}(X) = \begin{cases} +1 & \text{if } p(X) > \pi \\ -1 & \text{if } p(X) \leq \pi. \end{cases} \tag{6}$$
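As a sanity check on (5) and (6), the sketch below compares each rule against a brute-force minimization of the conditional expected loss over the available decisions, on a grid of values for the conditional class probability $p$. The function names, the grid, and the value $\pi = 0.3$ are illustrative assumptions.

```python
def bayes_rej(p, pi):
    """Bayes rule (5): +1 if p >= 1 - pi, -1 if p <= pi, otherwise 0 (reject)."""
    if p >= 1.0 - pi:
        return 1
    if p <= pi:
        return -1
    return 0

def bayes_weighted(p, pi):
    """Bayes rule (6): +1 if p > pi, otherwise -1."""
    return 1 if p > pi else -1

def risk_rej(p, y_hat, pi):
    """Conditional expectation of loss (3) when P(Y = +1 | X) = p."""
    if y_hat == 0:
        return pi
    return 1.0 - p if y_hat == 1 else p

def risk_weighted(p, y_hat, pi):
    """Conditional expectation of loss (4) when P(Y = +1 | X) = p."""
    # Predicting +1 costs pi on a true negative; predicting -1 costs
    # (1 - pi) on a true positive.
    return (1.0 - p) * pi if y_hat == 1 else p * (1.0 - pi)

pi = 0.3
ps = [k / 100.0 for k in range(101)]
rej_optimal = all(
    risk_rej(p, bayes_rej(p, pi), pi)
    <= min(risk_rej(p, a, pi) for a in (-1, 0, 1)) + 1e-12
    for p in ps
)
weighted_optimal = all(
    risk_weighted(p, bayes_weighted(p, pi), pi)
    <= min(risk_weighted(p, a, pi) for a in (-1, 1)) + 1e-12
    for p in ps
)
```

The comparison is on achieved risk rather than on the chosen decision, so ties at the boundaries $p = \pi$ and $p = 1 - \pi$ do not affect the check.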

For hard classification, the Bayes optimal rule corresponds to , and consistency is often referred to as Fisher consistency or classification calibrated [18]. While no theoretical loss has been proposed for soft classification, using to denote the conditional class probability at , commonly, is called consistent for soft classification if there exists some monotone mapping, such that . Naturally, may be viewed as an extension of the prediction rules and given for hard and rejection-option classification. Necessary and sufficient conditions for Fisher, rejection-option, and probability estimation consistency have been described in [20, 12, 21].

In this paper, we propose a novel framework for unifying hard, soft, rejection-option, and weighted classification through a generalized formulation of their corresponding theoretical losses, corresponding Bayes optimal rules, and necessary and sufficient conditions for consistency. Our generalized formulation not only provides a platform for comparing existing binary classification tasks, but also introduces an entire family of new tasks which fills the gap between these problems. We next formally introduce our unified framework of binary learning problems.

### 2.3 Unified Framework

First, we note that all of the classification tasks described in Section 2.1 may be formulated as learning problems which target partial or complete estimation of the conditional class probability, . We propose our framework of unified margin-based learning problems based on this insight. Let denote the ordered partition of the interval obtained by splitting at , where . Assume a.s. for all , such that observations belong to only a single region of interval. Letting and for ease of notation, we write:

$$\Omega_{\pi} = \{\omega_0, \dots, \omega_K\},$$

where , and , for . As our framework, we propose the class of problems which target a partition of the covariate space, , into the regions, . In Figure 1, we show a sample of observations drawn from the same underlying distribution, along with optimal solutions to three representative problems from our proposed framework. Note that the extreme cases of with (Figure 1A), and with dense on (Figure 1C) correspond to hard and soft classification, respectively. We discuss these connections in more detail later in this section. To illustrate the spectrum of problems in our framework, we also show a new intermediate problem in Figure 1B, with and .

Formally, we define our framework as the collection of minimization tasks of a theoretical loss which generalizes the 01 loss, over the collection of rules . Recall the weighted 01 loss, , for weighted classification described above. For positive and negative class weights and where , the weighted 01 loss has corresponding Bayes boundary at . Problems under our framework may be viewed as the task of simultaneously estimating such boundaries. Intuitively, we formulate our theoretical loss as the average of weighted 01 loss functions with corresponding weights . Throughout, we use and to denote the loss for positive and negative class observations, respectively. As with the weighted loss, we use to denote the loss for both classes:

$$\ell^{Y}_{\pi}(g(X)) = \frac{2}{K} \sum_{k=1}^{K} \ell^{Y}_{\pi_k}(g(X)), \tag{7}$$

where

$$\ell^{+}_{\pi_k}(g(X)) = (1 - \pi_k) \cdot I\{g(X) \leq \pi_k\}, \qquad \ell^{-}_{\pi_k}(g(X)) = \pi_k \cdot I\{g(X) > \pi_k\},$$

and the notion of inequalities is extended to elements of such that if and if . As we show in Supplementary Section S1, our theoretical loss encompasses the usual 01 loss, its weighted variant, and the rejection-option loss proposed by [12]. The multiplicative constant, 2, is included in (7) such that reduces precisely to the usual 01 loss when . Note that since is effectively the average of indicator functions scaled by 2, the function takes values in the interval . In Figure 2, we show as a function of , corresponding to the problems in Figure 1. Along the horizontal axis, the range is split into corresponding intervals. Note that the loss function is constant within each interval, giving the appearance of a step function, except in the extreme case when . As increases, the theoretical loss becomes smoother, with the limit at corresponding to the proposed theoretical loss for consistent soft classification described in Section 2.1. Additionally, note that while the loss functions, and , are symmetric in Panels A and C of Figure 2, the same is not true for the loss functions in Panel B. This is due to the fact that the boundaries of interest, , are symmetric between the two classes, i.e. , when and , but not when .
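The reductions of (7) to the 0-1 loss and to the rejection-option loss (3) can be illustrated with a short Python sketch. Here a prediction is represented by the index $j \in \{0, \dots, K\}$ of the predicted region, ordered from lowest to highest probability, so that the prediction lies at or below the boundary $\pi_k$ exactly when $j < k$. This indexing and the function name are assumptions made for illustration.

```python
def theoretical_loss(y, region, pis):
    """Theoretical loss (7) when the rule predicts the region with index `region`.

    Regions are indexed 0..K from lowest to highest probability, so the
    prediction is at or below boundary pi_k (k = 1..K) exactly when region < k.
    """
    K = len(pis)
    total = 0.0
    for k, pi_k in enumerate(pis, start=1):
        if y == 1:
            total += (1.0 - pi_k) * (region < k)   # indicator of g(X) <= pi_k
        else:
            total += pi_k * (region >= k)          # indicator of g(X) > pi_k
    return 2.0 * total / K

# K = 1 with boundary 0.5: regions 0 and 1 play the roles of labels -1 and +1.
zero_one = {(y, j): theoretical_loss(y, j, [0.5]) for y in (-1, 1) for j in (0, 1)}

# K = 2 with boundaries (pi, 1 - pi): regions 0, 1, 2 act as "-1", "reject", "+1".
pi = 0.25
rejection = {
    (y, j): theoretical_loss(y, j, [pi, 1.0 - pi])
    for y in (-1, 1)
    for j in (0, 1, 2)
}
```

In the first case the loss is 1 for a wrong label and 0 otherwise. In the second, the middle region incurs the rejection cost $\pi$ regardless of the true label, matching (3).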

The following result states that the class of problems defined by our theoretical loss indeed corresponds to the proposed framework of learning tasks. That is, the Bayes optimal rule given by , is precisely the partitioning task described above.

###### Theorem 1.

For fixed and defined as above, the Bayes optimal rule for the theoretical loss (7) is given by:

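$$W^{*}_{\pi}(X) = \operatorname*{argmin}_{g \in \mathcal{G}_{\pi}} E_{Y|X}\{\ell^{Y}_{\pi}(g(X))\} = \sum_{k=0}^{K} \omega_k \cdot I\{p(X) \in \omega_k\}.$$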
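The statement of Theorem 1 can be checked numerically for a particular choice of boundaries. The sketch below computes the conditional expected loss of predicting each region and confirms that the minimizer is the region containing $p(X)$. The boundaries $(0.25, 0.5, 0.75)$, the grid of probabilities, and the function names are illustrative assumptions; regions are indexed $0, \dots, K$ from lowest to highest probability.

```python
def conditional_risk(p, region, pis):
    """Conditional expectation of loss (7) when P(Y = +1 | X) = p and the
    region with index `region` is predicted."""
    K = len(pis)
    # Boundaries at or above the predicted region are charged to positives,
    # boundaries below it are charged to negatives.
    pos = sum(1.0 - pi_k for k, pi_k in enumerate(pis, start=1) if region < k)
    neg = sum(pi_k for k, pi_k in enumerate(pis, start=1) if region >= k)
    return 2.0 * (p * pos + (1.0 - p) * neg) / K

def bayes_region(p, pis):
    """Index of the region containing p: the number of boundaries below p."""
    return sum(1 for pi_k in pis if p > pi_k)

pis = [0.25, 0.5, 0.75]
# Probabilities chosen off the boundaries to avoid ties between regions.
ps = [(2 * i + 1) / 200.0 for i in range(100)]
matches = all(
    min(range(len(pis) + 1), key=lambda j: conditional_risk(p, j, pis))
    == bayes_region(p, pis)
    for p in ps
)
```

Moving the prediction up by one region, across boundary $\pi_j$, changes the conditional risk by $(2/K)(\pi_j - p)$, so it pays to cross exactly those boundaries that lie below $p$.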

In addition to the results of Theorem 1, the theoretical loss functions for hard (2), rejection-option (3), and weighted (4) classification can be derived as special cases of (7). This is shown by first noting the equivalence of to and based on the Bayes optimal rules, (5) and (6). From this equivalence, (3) and (4) can be obtained directly from (7). For soft classification, we derive a new theoretical loss from the limiting form of (7):

$$\ell^{Y}_{\pi}(g(X)) = \lim_{K \to \infty} \frac{2}{K} \sum_{k=1}^{K} \ell^{Y}_{\pi_k}(g(X)) = \left(I\{Y = +1\} - g(X)\right)^{2}.$$

The resulting theoretical loss is shown in Figure 2C. Since , the Bayes rule is simply the conditional class probability, , corresponding to soft classification. All proofs, and a more complete derivation of these results may be found in the Supplementary Materials.
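The limiting form can be illustrated numerically. The sketch below evaluates (7) at a fixed prediction value for an increasing number of boundaries and compares it with the squared loss. The evenly spaced boundaries $\pi_k = k/(K+1)$ are one hypothetical way of making the boundaries dense on the unit interval, and the function names are illustrative.

```python
def loss_at(y, g, K):
    """Loss (7) at a probability value g, using hypothetical evenly spaced
    boundaries pi_k = k / (K + 1) for k = 1..K."""
    total = 0.0
    for k in range(1, K + 1):
        pi_k = k / (K + 1.0)
        if y == 1:
            total += (1.0 - pi_k) * (g <= pi_k)
        else:
            total += pi_k * (g > pi_k)
    return 2.0 * total / K

def squared_loss(y, g):
    """Limiting loss: (I{Y = +1} - g)^2."""
    return ((1.0 if y == 1 else 0.0) - g) ** 2

g = 0.3
# Largest gap to the squared loss over the two classes, for growing K.
gaps = {
    K: max(abs(loss_at(y, g, K) - squared_loss(y, g)) for y in (-1, 1))
    for K in (10, 100, 1000)
}
```

For the positive class the sum is a Riemann approximation of $2\int_{g}^{1}(1-\pi)\,d\pi = (1-g)^2$, and for the negative class of $2\int_{0}^{g}\pi\,d\pi = g^2$, which is why the gap shrinks as $K$ grows.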

As with the problems described in Section 2.1, optimization of with respect to is NP-hard. Thus, we first reformulate as a function on to express the optimization over a collection of margin functions, . We then propose in Section 3 to solve the approximate problem using convex surrogate loss functions. Generalizing the approach of [12] for rejection-option classification, we frame the optimization task over the class of margin functions, , using a prediction rule of the form:

$$C(f(x); \delta) = \sum_{k=0}^{K} \omega_k \cdot I\{f(x) \in (\delta_{k-1}, \delta_k]\}, \tag{8}$$

for monotone increasing , and , . Intuitively, each corresponds to the -boundary along the range of the margin function, . As is common in margin-based learning, we write the theoretical loss as the following function over :

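$$L^{Y}_{\pi}(Yf(X); \delta) = \ell^{Y}_{\pi}(C(f(X); \delta)) = \begin{cases} \dfrac{2}{K} \sum_{k=1}^{K} (1 - \pi_k) \cdot I\{Yf(X) \leq \delta_k\} & \text{if } Y = +1 \\[1ex] \dfrac{2}{K} \sum_{k=1}^{K} \pi_k \cdot I\{Yf(X) < -\delta_k\} & \text{if } Y = -1. \end{cases} \tag{9}$$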
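The agreement between the margin-based loss (9) and the composition of the theoretical loss (7) with a threshold-based prediction rule can be checked with the following sketch. The cut points are taken to be one increasing value per boundary, and the predicted region is indexed by the number of cut points strictly below $f(x)$, so that $f(x) \leq \delta_k$ exactly when the predicted region lies below $\pi_k$. This indexing is the convention implied by the indicators in (9) and is an assumption here, as are the numerical values and function names.

```python
from bisect import bisect_left

def predict_region(f_x, deltas):
    """Region index predicted from the margin value f(x).

    `deltas` holds increasing cut points, one per boundary. The returned
    index j counts the cut points strictly below f(x), so f(x) <= delta_k
    exactly when j < k.
    """
    return bisect_left(deltas, f_x)

def loss_on_region(y, region, pis):
    """Theoretical loss (7) for a predicted region index."""
    K = len(pis)
    if y == 1:
        total = sum(1.0 - p for k, p in enumerate(pis, start=1) if region < k)
    else:
        total = sum(p for k, p in enumerate(pis, start=1) if region >= k)
    return 2.0 * total / K

def margin_loss(y, margin, pis, deltas):
    """Margin-based loss (9), evaluated at margin = y * f(x)."""
    K = len(pis)
    if y == 1:
        total = sum((1.0 - p) * (margin <= d) for p, d in zip(pis, deltas))
    else:
        total = sum(p * (margin < -d) for p, d in zip(pis, deltas))
    return 2.0 * total / K

pis = [0.25, 0.5, 0.75]
deltas = [-1.0, 0.0, 1.0]               # hypothetical cut points
grid = [k / 4.0 for k in range(-8, 9)]  # includes the cut points themselves
consistent = all(
    abs(margin_loss(y, y * f, pis, deltas)
        - loss_on_region(y, predict_region(f, deltas), pis)) < 1e-12
    for f in grid
    for y in (-1, 1)
)
```

For negative observations the margin is $-f(x)$, so the condition $Yf(X) < -\delta_k$ is the same as $f(X) > \delta_k$. This is the source of the reflection between the positive and negative class losses.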

In Figure 3, we plot the corresponding margin-based formulations of the theoretical loss functions shown in Figure 2, with well chosen . Intuitively, both and are non-increasing on . We also note that and differ by a reflection along the vertical axis since is defined with respect to . Given the margin-based formulation (9), we propose to solve our class of problems using convex surrogate loss functions. In the following section, we first present necessary and sufficient conditions for a surrogate loss to be consistent to (7). We then introduce a class of consistent piecewise linear surrogates, which includes the SVM hinge loss as a special case.

## 3 Convex Surrogate Loss Functions

Since the proposed theoretical loss function (7) and its margin-based reformulation (9) are discontinuous and non-convex for any finite choice of and , empirical minimization can quickly become intractable. Therefore, we propose to instead minimize a convex surrogate loss over the class of margin functions, as in hard and soft classification. In this section, we first provide necessary and sufficient conditions for a surrogate loss to be consistent for (7) with fixed and . Then, we introduce a class of convex piecewise linear surrogates which includes the SVM hinge loss as a special case. Intuitively, the piecewise linear surrogates each consist of non-zero segments, corresponding to the boundaries, . In the limit, as becomes dense on , the piecewise linear surrogate tends towards a smooth loss, as in Panel C of Figures 2 and 3.

### 3.1 Consistency

Throughout this section, we assume and to be fixed. First, let and denote a pair of convex surrogate loss functions for and . Further, let denote the -optimal rule over the class of all measurable functions. We call consistent if there exists such that the prediction rule (8) satisfies , i.e. if there exists a known monotone mapping from the -optimal rule to the partition of to . The following result provides necessary and sufficient conditions for the consistency of the surrogate loss to .

###### Theorem 2.

A pair of convex surrogate loss functions, , are consistent for if and only if there exists such that for each : and exist, and , and

$$\frac{\phi^{-\prime}(-\delta_k)}{\phi^{-\prime}(-\delta_k) + \phi^{+\prime}(\delta_k)} = \pi_k. \tag{10}$$

Naturally, any surrogate loss satisfying the conditions of Theorem 2 for some , must also satisfy the set of conditions for any subset of the boundaries, . Thus, for surrogate loss functions consistent for soft classification, i.e. when , there exists an appropriate for any possible and . Similar intuition is used to justify the use of soft classification based plug-in classifiers described in Section 1. Examples of surrogate losses consistent for soft classification include the logistic, squared hinge, exponential, and DWD losses. Values of such that the conditions of Theorem 2 are met for these loss functions are provided in Corollaries 3-8 of [12]. In the next section, we introduce a class of piecewise linear surrogates which, similar to the SVM loss for hard classification, satisfy consistency for the of interest, but not for any . We refer to such a piecewise linear surrogate as being minimally consistent for a corresponding set of boundaries, . In contrast to soft classification losses which satisfy consistency for all , minimally consistent surrogates are well-tuned for a given , and may provide improved stratification of to the sets, .
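As a worked instance of condition (10), consider the logistic loss used for both classes. The sketch below assumes the standard form $\log(1 + e^{-z})$ and the cut points $\delta_k = \log\{\pi_k / (1 - \pi_k)\}$. Both are assumptions made for illustration rather than values quoted from [12], and the function names are hypothetical.

```python
import math

def logistic_loss_grad(z):
    """Derivative of the logistic loss log(1 + exp(-z)): -1 / (1 + exp(z))."""
    return -1.0 / (1.0 + math.exp(z))

def consistency_ratio(delta, grad_pos, grad_neg):
    """Left-hand side of (10) at the cut point delta."""
    return grad_neg(-delta) / (grad_neg(-delta) + grad_pos(delta))

def logit(pi):
    """Log-odds of pi, the assumed cut point for boundary pi."""
    return math.log(pi / (1.0 - pi))

pis = [0.1, 0.25, 0.5, 0.75, 0.9]
ratios = [
    consistency_ratio(logit(pi), logistic_loss_grad, logistic_loss_grad)
    for pi in pis
]
```

At $\delta = \log\{\pi/(1-\pi)\}$ the two derivatives equal $-\pi$ and $-(1-\pi)$, so the ratio is exactly $\pi$. Since such a cut point exists for every $\pi \in (0, 1)$, the same loss satisfies (10) for any set of boundaries.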

### 3.2 Piecewise Linear Surrogates

Throughout, we use and to denote piecewise linear surrogates. To build intuition, in the columns of Figure 4, we show examples of for , corresponding to hard classification, rejection-option classification, and the new problem shown in Figure 1B. Circles are used to highlight the hinges, i.e. non-differentiable points, along the piecewise linear loss functions. The corresponding margin-based theoretical loss, , is also shown in each panel using appropriately chosen . First, note that the losses in Panels A and B of Figure 4 correspond to the standard SVM hinge loss and generalized hinge loss of [11], respectively. Consider the new surrogate losses in Figure 4C for boundaries at . Note that and each consist of non-zero linear segments. Furthermore, each linear segment only spans a single or for and , respectively. We will refer to these pairs of linear segments as the -consistent segments. This construction allows for the consistency of the surrogate loss for each to be controlled separately by the pairs of -consistent segments along the piecewise linear loss.

We formulate our collection of piecewise linear surrogate losses as the maximum of the linear segments and 0. Consider first the surrogate loss for positive observations, . Using to denote the intercept and slope of the -consistent segment, we express the piecewise linear loss as:

$$\varphi^{+}(z) = \max\{0,\ A^{+}(\pi_1) + B^{+}(\pi_1) \cdot z,\ \dots,\ A^{+}(\pi_K) + B^{+}(\pi_K) \cdot z\}. \tag{11}$$

We similarly use and to denote the intercept and slope of the -consistent segment for the negative class loss such that:

$$\varphi^{-}(z) = \max\{0,\ A^{-}(\pi_1) + B^{-}(\pi_1) \cdot z,\ \dots,\ A^{-}(\pi_K) + B^{-}(\pi_K) \cdot z\}. \tag{12}$$

By construction, the resulting piecewise linear losses are non-negative, convex and continuous. While (11) and (12) define a general class of piecewise linear losses, we focus on a subset of minimally consistent piecewise linear surrogates. In the following theorem, we provide a set of sufficient conditions for a piecewise linear loss to be minimally consistent for a specified .

###### Theorem 3.

Let denote the location of the hinges along the respective loss functions between consecutive boundaries, . Then, is a minimally consistent piecewise linear surrogate for if the intercept and slope parameters, and , satisfy the following conditions:

1. is non-decreasing, and is non-increasing in .

2. The hinge points are such that:

 −H−(πk−1,πk) =H+(πk−1,πk)     for k=2,…,K, H+(πk−1,πk) H−(πK−1,πK).
3. satisfy:

$$\frac{B^{-}(\pi_k)}{B^{-}(\pi_k) + B^{+}(\pi_k)} = \pi_k \quad \text{for } 1 \leq k \leq K.$$

Conditions (C1) and (C2) guarantee that the linear segments are well-ordered and non-degenerate along with appropriately aligned hinge points. Condition (C3) guarantees the consistency of to the corresponding . Most importantly, by aligning the hinge points, and , we ensure that there does not exist a such that (10) is satisfied for any . Next, we present an approach to obtaining and which satisfy the conditions of Theorem 3 using the logistic loss as an example.

### 3.3 Logistic Derived Surrogates

In this section, we propose to construct piecewise linear losses by choosing to be the tangent lines to the logistic loss at . A similar approach was used by [22] to construct a piecewise linear loss for the rejection-option problem. The following Proposition states that piecewise linear loss functions constructed using this approach satisfy the conditions of Theorem 3 for any choice of and .

###### Proposition 1.

For fixed and , let be the piecewise linear loss constructed from the tangent lines to the logistic loss such that and are defined as:

$$\begin{aligned} A^{+}(\pi) &= A^{-}(1 - \pi) = -\pi \log(\pi) - (1 - \pi) \log(1 - \pi), \\ B^{+}(\pi) &= B^{-}(1 - \pi) = -(1 - \pi). \end{aligned}$$

Then, is a minimally consistent piecewise linear surrogate for satisfying the conditions of Theorem 3.
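The construction in Proposition 1 can be sketched in a few lines of Python. The code builds the intercepts and slopes from the stated formulas, evaluates the piecewise linear losses (11) and (12), and checks three properties numerically: tangency to the logistic loss, the slope ratio condition of Theorem 3, and the sign-flipped alignment of the hinge points. The boundaries $(0.25, 0.5, 0.75)$, the logistic loss form $\log(1 + e^{-z})$, and all function names are illustrative assumptions.

```python
import math

def entropy(pi):
    """Intercept from Proposition 1: -pi log(pi) - (1 - pi) log(1 - pi)."""
    return -pi * math.log(pi) - (1.0 - pi) * math.log(1.0 - pi)

def segments(pis):
    """(intercept, slope) pairs for the positive and negative class losses."""
    pos = [(entropy(p), -(1.0 - p)) for p in pis]  # A+(pi), B+(pi)
    # A-(pi) = A+(1 - pi), which equals entropy(pi) by symmetry; B-(pi) = -pi.
    neg = [(entropy(p), -p) for p in pis]
    return pos, neg

def piecewise_linear(z, segs):
    """Losses (11) and (12): maximum of zero and the linear segments at z."""
    return max([0.0] + [a + b * z for a, b in segs])

def logistic_loss(z):
    """Assumed logistic loss: log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))

def hinge_point(seg_a, seg_b):
    """Crossing point of two linear segments."""
    (a1, b1), (a2, b2) = seg_a, seg_b
    return (a2 - a1) / (b1 - b2)

pis = [0.25, 0.5, 0.75]
pos, neg = segments(pis)

# Tangent points: log-odds of each boundary for the positive class loss,
# and their negatives for the negative class loss.
logits = [math.log(p / (1.0 - p)) for p in pis]
tangent_gap = max(
    max(abs(piecewise_linear(t, pos) - logistic_loss(t)),
        abs(piecewise_linear(-t, neg) - logistic_loss(-t)))
    for t in logits
)

# Slope ratio B- / (B- + B+) for each boundary.
slope_ratios = [bn / (bn + bp) for (_, bp), (_, bn) in zip(pos, neg)]

# Hinge points between consecutive segments of each loss.
hinges_pos = [hinge_point(pos[k - 1], pos[k]) for k in range(1, len(pis))]
hinges_neg = [hinge_point(neg[k - 1], neg[k]) for k in range(1, len(pis))]
```

Because the logistic loss is convex, each tangent line lies below it, so the maximum of the tangent lines touches the logistic loss only at the tangent points and is a lower bound elsewhere.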

In Figure 5, we illustrate the logistic-derived piecewise linear loss for . The logistic loss is shown by dotted lines, with the piecewise linear surrogate functions for the positive and negative classes shown in solid black. Thin vertical lines are used to denote the tangent points where the losses are equal, and thin dashed lines give the tangent lines to the logistic loss corresponding to for . Additionally, the non-differentiable hinge points are highlighted by circles. While the loss functions appear roughly equivalent within the region of the tangent points, the difference is non-negligible above and below these bounds. Notably, the piecewise linear losses diverge more slowly as $z$ tends to $-\infty$, suggesting the losses may be more robust to outliers [7]. Additionally, the logistic-derived loss functions provide a natural spectrum for comparing the impact of targeting different partitions, , on the same dataset. We explore these issues using simulation in Section 6.

## 4 Statistical Properties

We next derive statistical properties for surrogate loss functions to the theoretical loss, . In Subsection 4.1, we first show that the excess risk with respect to may be bounded by the excess risk of a consistent surrogate loss. Then, in Subsection 4.2, we use these risk bounds to derive convergence rates for the empirical minimizer of a surrogate loss to the Bayes optimal rule. Our results generalize and extend those derived for the particular case of rejection-option classification in [10, 11, 12], to an arbitrary number of boundaries.

### 4.1 Excess Risk Bounds

For a rule , we define the -risk of to be the expected loss of the rule, denoted by . In statistical machine learning, a natural measure of the performance of a rule is its excess risk: , where such that . In this section, we derive convergence rates on for rules obtained using consistent surrogate loss functions. For a surrogate loss , we similarly define the -risk and excess -risk over the class of margin functions, , to be and . To obtain convergence rates on , we first show that under certain conditions, the excess -risk of a margin function can be used to bound the corresponding excess -risk of . Using this bound, we then derive rates of convergence on through rates of convergence on . The following additional notation is used to denote excess conditional -risk and excess conditional -risk:

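$$\begin{aligned} R_p(g) &:= E_{Y|X}\{\ell^{Y}_{\pi}(g(X))\}, & Q_p(f) &:= E_{Y|X}\{\phi^{Y}(Yf(X))\}, \\ \Delta R_p(g) &:= R_p(g) - R_p(W^{*}_{\pi}), & \Delta Q_p(f) &:= Q_p(f) - Q_p(f^{*}_{\phi}). \end{aligned}$$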

In the following results, we provide conditions under which there exists some function, , such that can be used to bound the corresponding .

###### Theorem 4.

Let be a consistent surrogate loss for satisfying the conditions for Theorem 2 at . Furthermore, suppose there exist constants and such that for all ,

$$|p(X) - \pi_k|^{s} \leq C^{s} \Delta Q_p(\delta_k). \tag{13}$$

Then,

$$\Delta R(C(f; \delta)) \leq C \left[2 \cdot \Delta Q(f)\right]^{1/s}.$$

The above bound may be tightened as in [12] by the additional assumption:

$$P\{|p(X) - \pi_k| \leq t\} \leq A t^{\alpha}, \quad k = 1, \dots, K, \tag{14}$$

for some , . The bound (14) generalizes the margin condition introduced by [23] and used in [10].

###### Theorem 5.

In addition to the assumptions of Theorem 4, assume that there exists and , such that (14) holds for . Then, for some depending on ,

$$\Delta R(C(f;\delta)) \le D \cdot \Delta Q(f)^{1/(s+\beta-\beta s)}$$

where .

Note that when $\beta = 0$, Theorem 5 provides the same bound as Theorem 4. However, as $\beta \to 1$, the bound becomes tighter, with the exponent $1/(s+\beta-\beta s)$ limiting to 1. While neither result depends explicitly on , Theorem 5 suggests that tighter bounds may be achieved by only targeting $\pi_k$ such that the margin condition is satisfied with large $\alpha$. This reiterates the motivating intuition for our proposed framework, in which we formalize a class of learning problems for settings where more information than hard classification is desired, but soft classification may not be appropriate.
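The effect of the exponent in Theorem 5 can be tabulated directly. The sketch below assumes $s \ge 1$ and $\beta \in [0, 1]$ (ranges that are not stated in this section), uses a placeholder constant $D = 1$, and picks hypothetical values for $s$ and the excess surrogate risk. It only evaluates the right-hand side of the bound.

```python
def risk_exponent(s, beta):
    """Exponent 1 / (s + beta - beta * s) from the bound in Theorem 5."""
    return 1.0 / (s + beta - beta * s)

def theorem5_bound(excess_q, s, beta, D=1.0):
    """D * excess_q ** exponent; D = 1.0 is a placeholder constant."""
    return D * excess_q ** risk_exponent(s, beta)

s = 2.0          # hypothetical value of s
excess_q = 0.01  # hypothetical excess surrogate risk
table = {
    beta: (risk_exponent(s, beta), theorem5_bound(excess_q, s, beta))
    for beta in (0.0, 0.5, 1.0)
}
for beta, (exponent, bound) in table.items():
    print(f"beta={beta:.1f}  exponent={exponent:.3f}  bound={bound:.4f}")
```

With these values the exponent rises from $1/s$ at $\beta = 0$ to 1 at $\beta = 1$, so for a small excess surrogate risk the bound shrinks as $\beta$ grows.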

Corresponding values of $C$ and $s$ for the exponential, logistic, squared hinge and DWD losses are provided in Corollaries 13–16 of [12]. In the following result, we derive values of $C$ and $s$ for our class of minimally consistent piecewise linear surrogates.

###### Corollary 1.

For minimally consistent piecewise linear loss, , defined as in (11) and (12) and satisfying the conditions of Theorem 3 for boundaries , the inequality (13) is satisfied by and

 C=max{−πkB−(πk)⋅|δk−Hj|:k=1,…,K; j=0,…,K},

where is used to denote , to denote for , and to denote .

Consider now a sequence of margin functions, $\{f_n\}$. By Theorems 4 and 5, to show that the excess $\ell_\pi$-risk, $\Delta R(C(f_n;\delta))$, converges to 0 as $n \to \infty$, it suffices to show that $\Delta Q(f_n) \to 0$ as $n \to \infty$. In the following results, we derive convergence rates for $\Delta Q(\hat f_n)$ for the sequence of functions, $\{\hat f_n\}$, where $\hat f_n$ is used to denote the empirical minimizer of the surrogate loss over a training set of size $n$.

### 4.2 Rates of Convergence

In this section, we derive convergence results for two classes of surrogate loss functions separately. We first consider Lipschitz continuous and differentiable surrogate loss functions which satisfy a modulus of convexity condition specified below. Examples of such loss functions include the exponential, logistic, squared hinge and DWD losses. We then separately consider the class of piecewise linear surrogates described in Section 3.

Let $\phi$ denote a Lipschitz continuous and differentiable surrogate loss function. Assume that the corresponding $\phi$-risk, $Q$, has modulus of convexity,

$$\delta(\epsilon) = \inf\left\{ \frac{Q(f) + Q(g)}{2} - Q\!\left(\frac{f+g}{2}\right) : E\big[(f-g)^2(X)\big] \ge \epsilon^2 \right\} \tag{15}$$

satisfying for some . Furthermore, let denote the Lipschitz constant, such that for all and . Letting denote the class of uniformly bounded functions such that for all , we use to denote the cardinality of the set of closed balls with radius in needed to cover . Finally, as stated above, let denote the empirical minimizer of over the training set . For the following corollary, we make use of Theorem 18 from [12] which provides a bound on the expected estimation error, , for consistent loss functions satisfying the modulus of convexity condition stated above. Combining Theorem 18 of [12] with the excess risk bounds of Theorems 4 and 5, we obtain the following result.

###### Corollary 2.

If $\phi$ satisfies the assumptions of Theorems 2 and 4, and has modulus of convexity (15) satisfying $\delta(\epsilon) \ge c\epsilon^2$ for some $c > 0$, then with probability at least $1-\gamma$,

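$$\Delta R(C(\hat f_n;\delta)) \le C \cdot 2^{1/s} \left\{ \inf_{f \in \mathcal{F}_B} \Delta Q(f) + \frac{3L}{n} + 8\left(\frac{L^2}{2c} + \frac{LB}{3}\right) \frac{\log(N_n/\gamma)}{n} \right\}^{1/s}.$$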

Furthermore, if the generalized margin condition of Theorem 5 holds, then with probability at least $1-\gamma$,

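$$\Delta R(C(\hat f_n;\delta)) \le D \left\{ \inf_{f \in \mathcal{F}_B} \Delta Q(f) + \frac{3L}{n} + 8\left(\frac{L^2}{2c} + \frac{LB}{3}\right) \frac{\log(N_n/\gamma)}{n} \right\}^{1/(s+\beta-\beta s)}, \tag{16}$$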

for constants defined as in Theorems 4 and 5.
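To see how the right-hand side of (16) behaves as the sample size grows, the bound can be evaluated numerically. The sketch below reads the bracketed term as the approximation error plus $3L/n + 8\,(L^2/(2c) + LB/3)\,\log(N_n/\gamma)/n$. Every numerical value, including the polynomial covering number $N_n = n^{10}$ and the constant $D = 1$, is a hypothetical placeholder rather than a quantity taken from the paper.

```python
import math

def estimation_term(n, L, c, B, log_covering, gamma):
    """3L/n + 8 (L^2/(2c) + L B/3) * log(N_n / gamma) / n, as read from (16)."""
    log_ratio = log_covering - math.log(gamma)  # log(N_n / gamma)
    return 3.0 * L / n + 8.0 * (L ** 2 / (2.0 * c) + L * B / 3.0) * log_ratio / n

def corollary2_bound(n, approx_error, L, c, B, log_covering, gamma, D, s, beta):
    """Right-hand side of (16) for given (hypothetical) constants."""
    inner = approx_error + estimation_term(n, L, c, B, log_covering, gamma)
    return D * inner ** (1.0 / (s + beta - beta * s))

# All numerical values below are hypothetical placeholders.
params = dict(approx_error=0.0, L=1.0, c=0.5, B=1.0, gamma=0.05, D=1.0, s=2.0, beta=0.5)
bounds = {
    n: corollary2_bound(n, log_covering=10.0 * math.log(n), **params)
    for n in (10**3, 10**4, 10**5)
}
```

The covering number is passed on the log scale so that large values of $N_n$ do not overflow. With a zero approximation error the bound decays with $n$ at a rate governed by $\log(N_n)/n$ and the exponent $1/(s+\beta-\beta s)$.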

From the bound on excess risk obtained in Corollary 2, corresponding rates of convergence can be derived based on the cardinality, $N_n$, of the class of functions, $\mathcal{F}_B$.

Due to the non-differentiability of the loss at hinge points, our class of piecewise linear surrogates does not satisfy the modulus of convexity condition (15). The following theorem provides separate convergence results for our class of minimally consistent piecewise linear surrogates. Again, we use $\mathcal{F}_B$ to denote a class of uniformly bounded functions, and let $\hat f_n$ denote the empirical minimizer of the surrogate loss.

###### Theorem 6.

If $\phi$ is a minimally consistent piecewise linear loss satisfying the conditions of Theorem 3 and the generalized margin condition of Theorem 5, then with probability at least $1-\gamma$,

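$$\Delta Q(\hat f_n) \le \frac{3L}{n} + \frac{4LB}{3} \cdot G(\gamma) + \left( \left( \frac{4LB}{3} \cdot G(\gamma) \right)^2 + 8 \cdot B' \cdot G(\gamma) \right)^{1/2},$$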

where , and is some constant depending on , , and margin constants .

Combining Theorems 4, 5, and 6, we obtain the following corollary.

###### Corollary 3.

If $\phi$ is a minimally consistent piecewise linear loss satisfying the assumptions of Theorems 2, 4, and 5, then with probability at least $1-\gamma$,

 (17)

for constants defined as in Theorems 4 and 5.

As in Theorem 5, while the convergence rate of Theorem 6 does not depend on explicitly, it does depend on the parameters of the margin condition (14). Therefore, Theorem 6 further suggests the advantage of targeting $\pi_k$ for which the data show strong separation with large $\alpha$. Furthermore, in contrast to Theorem 18 of [12], which provides a bound on the expected estimation error, Theorem 6 bounds the total $\phi$-risk, including both the expected estimation error and the expected approximation error of the class of functions $\mathcal{F}_B$. As a result, while the bounds in Corollary 2 include the separate approximation error term, $\inf_{f \in \mathcal{F}_B} \Delta Q(f)$, the piecewise linear bound in Corollary 3 does not.

Based on the bounds in (16) and (17), rates of convergence can be obtained as in [12]. As an example, we consider the case when is the class of linear combinations of decision stumps, ,

$$f_\lambda(x) = \sum_{j=1}^{M} \lambda_j f_j(x)$$

where , and . By (16) and (17), the same rate,