1 Introduction
Classification is one of the most widely applied and well-studied problems in supervised learning. Given a training set of observed covariates and outcomes, classification, like the usual regression problem, models the outcome as a function of the covariates. However, in contrast to standard regression with a continuous response variable, classification describes the setting where the outcome is a discrete class label. While generalizations to more than two classes exist, in this paper we focus on the standard binary problem where the label takes one of two possible values, typically denoted by $+1$ and $-1$. Given such a dataset, the goal is commonly to build a model, either to predict the class of a new observation from the covariate space, or to estimate the probability of each class as a function of the covariates. These tasks correspond respectively to hard and soft classification. Briefly, we refer to methods which only target the optimal prediction rule as hard classifiers, and those which produce estimates of the class probability as soft classifiers. Examples of hard classifiers include the support vector machine (SVM)
[1, 2] and $\psi$-learning [3, 4], and examples of soft classifiers include logistic regression and other likelihood-based approaches. Often, soft classifiers are also used to obtain hard classification rules by predicting the class with the greater estimated probability. These rules are commonly referred to as plug-in classifiers. While hard classification rules do not directly provide conditional class probability estimates, several approaches have been proposed for estimating class probabilities based on hard classifiers, including those of
[5] and [6]. As such, methods which may be traditionally viewed as soft and hard classifiers are often used for either task. Naturally, a question of interest is: how are hard and soft classifiers related, and how do they differ in practice? Recently, [7] introduced the Large-margin Unified Machines (LUM) family of margin-based classifiers, shedding some light on the relationship between hard and soft classifiers. The LUM family connects several popular margin-based classification methods, including the SVM, distance-weighted discrimination (DWD) [8], and a new hybrid logistic loss. Their approach was further extended to the multicategory case in [9]. Margin-based approaches to classification are popular in practice for their accuracy and computational efficiency in both low- and high-dimensional settings. While a flexible family of margin-based classifiers, the LUM approach examines only a specific parameterized collection of classifiers along the gradient from soft to hard classification. In this paper, we similarly focus on connecting hard and soft margin-based methods. However, we consider a more natural approach based on connecting the tasks of hard and soft classification rather than specific hard and soft classifiers. Specifically, we propose a novel framework of binary learning problems which may be formulated as partial or full estimation of the conditional class probability based on fitting an arbitrary number of boundaries to the data. As an example, suppose we are interested in separating patients into four disease risk groups based on clinical measurements. One possible approach is to group patients according to whether their conditional probability of being positive for the disease is less than 25%, between 25% and 50%, between 50% and 75%, or greater than 75%. In this setting, the emphasis is not on the accuracy of class probability estimates, but instead on the correct stratification of individuals into risk groups.
Therefore, only partial estimation of the conditional class probability is required; in particular, at the three boundaries, 25%, 50%, and 75%. While stratification of the patient classes is possible using a soft classifier, an approach directly targeting the three boundaries may provide improved stratification by requiring weaker assumptions on the entire form of the underlying conditional class probability.
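To make the stratification concrete, the grouping above can be sketched as simple thresholding of an estimated conditional class probability (a hypothetical illustration; `p_hat` stands in for any probability estimate):

```python
import bisect

def risk_group(p_hat, boundaries=(0.25, 0.50, 0.75)):
    """Map an estimated conditional class probability to a risk group.

    Group k collects probabilities falling between consecutive
    boundaries; only the behavior of p_hat near the three boundaries
    matters for correct stratification, not its accuracy elsewhere.
    """
    return bisect.bisect_right(list(boundaries), p_hat)

# Patients with estimated probabilities 0.10, 0.40, 0.60, 0.90 fall
# into risk groups 0, 1, 2, and 3, respectively.
groups = [risk_group(p) for p in (0.10, 0.40, 0.60, 0.90)]
```

A method targeting only the three boundaries needs `p_hat` to be accurate only near 25%, 50%, and 75%, which is precisely the weaker requirement described above.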
In addition to hard and soft classification, the proposed framework also encompasses rejection-option classification [10, 11, 12, 13] and weighted classification [14, 15], two other well-studied binary learning problems. Briefly, the rejection-option problem expands on standard binary classification by introducing a third option to reject, where neither label is predicted. Notably, it can be shown that the decision to reject directly corresponds to a prediction that the probability of belonging to either class does not exceed a specified threshold. Since the task requires estimation of more than a single classification boundary, but less than the full conditional class probability, it may be viewed as a problem intermediate between hard and soft classification, as in the example given above. Applications of rejection-option classification include certain medical settings where predictions should only be made once a level of certainty is attained. Additionally, weighted classification extends the standard classification problem by accounting for differences or biases in the class populations. We define these problems more formally, along with hard and soft classification, in Section 2.
The remainder of this paper is organized as follows. In the first part of Section 2 we provide a review of margin-based learning. Then, in the remainder of Section 2, we define our family of binary learning problems and introduce a corresponding theoretical loss, which generalizes the standard misclassification error to connect class prediction with probability estimation. In Section 3
we provide necessary and sufficient conditions for consistency of a surrogate loss function, and propose a class of consistent piecewise linear surrogates akin to the SVM hinge loss for binary classification. In Section
4, we present theoretical bounds on the empirical performance of classification rules obtained using surrogate loss functions. In Section 5, we provide a subgradient descent (SGD) algorithm for solving the corresponding optimization problem using the proposed piecewise linear surrogates. We then illustrate the behavior of our generalized family of classifiers using simulation in Section 6, and a real data example from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database in Section 7. We conclude in Section 8 with a discussion of the proposed framework.

2 Methodology
In this section, we first briefly introduce margin-based classifiers and formally define the notion of classification consistency for loss functions. We then state the general form of our unified framework of problems and introduce a corresponding family of theoretical loss functions which encompasses the standard misclassification error as a special case.
2.1 Margin-Based Classifiers
Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote a training set of covariate–label pairs drawn from $\mathcal{X} \times \mathcal{Y}$ according to some unknown distribution $\mathbb{P}$. For binary problems, $\mathcal{Y}$ is used to denote the label space, and often $\mathcal{Y} = \{+1, -1\}$, with $\mathcal{X} \subseteq \mathbb{R}^d$. Given a training set, margin-based classifiers minimize a penalized loss over a class, $\mathcal{F}$, of margin functions, $f: \mathcal{X} \to \mathbb{R}$. Typically, the corresponding optimization problem is written as:
$\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i f(x_i)) + \lambda J(f)$ (1)
where $\ell$ is a loss function defined with respect to the functional margin, $u = yf(x)$, and $J(f)$ is some roughness measure on $f$ with corresponding tuning parameter $\lambda > 0$. Both hard and soft classification may be formulated as margin-based problems. In the case of hard classification, with a slight abuse of notation, we use $\hat{y}$ to denote a predicted class label, and $\hat{y}(\cdot)$ to denote a prediction rule on $\mathcal{X}$. In margin-based classification, $\hat{y}(\cdot)$ is combined with a margin function, $f$, to obtain predictions in $\{+1, -1\}$. Most commonly, in hard classification the sign rule, $\hat{y}(x) = \mathrm{sign}(f(x))$, is used, assuming $f(x) \neq 0$ almost surely (a.s.). Thus, given a new pair $(x, y)$, correct classification is obtained if and only if $yf(x) > 0$. Since the functional margin, $yf(x)$, serves as an approximate measure of classification correctness, the loss function, $\ell$, in (1) is often chosen to be a non-increasing function over the margin. A natural choice of loss in hard classification is the misclassification error, or 0-1 loss, given by:
$\ell_{0\text{-}1}(y, \hat{y}) = 1\{y \neq \hat{y}\}$ (2)
where $1\{\cdot\}$ is used to denote the indicator function. Using the sign rule, the loss may be equivalently written over the class of margin functions as $1\{yf(x) < 0\}$. However, direct optimization of this non-convex and discontinuous loss is NP-hard and often infeasible in practice. Thus, continuous convex losses, called surrogates, are commonly used instead. Choices of the surrogate loss function corresponding to existing margin-based classifiers include the SVM hinge loss, $(1 - u)_+$, the logistic loss, $\log(1 + e^{-u})$, and the DWD loss. Finally, the penalty term, $J(f)$, is used to prevent overfitting and improve the generalizability of the resulting classifier. The amount of penalization is commonly determined by cross-validation over a grid of $\lambda$ values. Here, we note that while the literature offers a natural theoretical loss for hard classification, namely the 0-1 loss, there is no equivalent theoretical loss targeting consistent probability estimation for soft classification. In addition to providing a spectrum of theoretical loss functions covering soft and hard classification at the two extremes, our proposed framework also naturally defines precisely such a theoretical loss for the soft classification problem (Figure 2C).
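For reference, the surrogate losses named above can be written as simple functions of the margin $u = yf(x)$; the DWD loss is shown in its standard form (with exponent $q = 1$), which may differ from the exact parameterization used here:

```python
import math

def hinge(u):
    """SVM hinge loss: (1 - u)_+ ."""
    return max(0.0, 1.0 - u)

def logistic(u):
    """Logistic loss: log(1 + exp(-u))."""
    return math.log(1.0 + math.exp(-u))

def dwd(u):
    """DWD loss (standard q = 1 form): 1 - u for u <= 1/2, else 1/(4u)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)
```

All three are convex and non-increasing in the margin, and each upper-bounds the 0-1 loss up to a scaling constant, which is what makes them usable as surrogates.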
In Section 1, we briefly discussed the learning tasks of rejection-option and weighted classification. As with hard and soft classification, these tasks may also be formulated as margin-based problems. We next describe how rejection-option classification may be formulated as a problem of the form (1). Borrowing the notation of [12], we use $r$ to denote the rejection option such that a prediction, $\hat{y}$, takes values in $\{+1, -1, r\}$. Then, for some prespecified rejection cost $d \in (0, 1/2)$, they propose the following theoretical loss for rejection-option classification:
$\ell_d(y, \hat{y}) = 1\{\hat{y} \neq y\} \, 1\{\hat{y} \neq r\} + d \, 1\{\hat{y} = r\}$ (3)
To express the loss as a function over the margin, [12] propose the prediction rule $\hat{y}(x) = +1$ if $f(x) > \delta$, $\hat{y}(x) = r$ if $|f(x)| \leq \delta$, and $\hat{y}(x) = -1$ if $f(x) < -\delta$, for some appropriately chosen $\delta > 0$. Then, (3) may be written as the generalized 0-1 loss $\ell_d(u) = 1\{u < -\delta\} + d \, 1\{|u| \leq \delta\}$ on the margin $u = yf(x)$.
We finally consider the task of weighted classification. In contrast to the problems mentioned thus far, to fit the form of (1), weighted classification requires specifying separate theoretical loss functions for observations from the positive and negative classes, denoted by $\ell_w^+$ and $\ell_w^-$. For simplicity, we use $\ell_w^\pm$ to denote the loss for both classes. Similar to hard classification, the task is to predict class labels in $\{+1, -1\}$. The loss function depends on a weight parameter, $w$, which accounts for imbalances between the two classes. Commonly, $w$ is constrained to the interval $(0, 1)$ without loss of generality. Then, for fixed weight $w$, the weighted loss is given by:
$\ell_w^+(\hat{y}) = 2(1 - w) \, 1\{\hat{y} = -1\}, \qquad \ell_w^-(\hat{y}) = 2w \, 1\{\hat{y} = +1\}$ (4)
Note that the standard 0-1 loss corresponds to the special case of the weighted loss (4) when equal weight is assigned to the two classes with $w = 1/2$. Using the same prediction rule as for hard classification, $\hat{y}(x) = \mathrm{sign}(f(x))$, the loss over the functional margin may be written $\ell_w^+(f(x)) = 2(1 - w) \, 1\{f(x) < 0\}$ and $\ell_w^-(f(x)) = 2w \, 1\{f(x) > 0\}$.
As with the usual 0-1 loss, optimization of $\ell_w^+$ and $\ell_w^-$ is NP-hard, and in practice the loss should be approximated using a convex surrogate. In the next section, we introduce the notion of consistency, an important statistical property of surrogate loss functions.
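As a concrete sketch of the weighted loss, using one common weighting convention (chosen so that the Bayes rule thresholds the class probability at $w$ and so that $w = 1/2$ recovers the 0-1 loss; the paper's exact weighting is specified in (4)):

```python
def weighted_01_loss(y, y_hat, w):
    """Weighted 0-1 loss with class weight w in (0, 1).

    Misclassifying a positive observation costs 2*(1 - w) and
    misclassifying a negative observation costs 2*w, so the Bayes
    rule predicts +1 exactly when P(Y = +1 | X = x) > w.  At
    w = 1/2 both errors cost 1, recovering the ordinary 0-1 loss.
    """
    if y == +1:
        return 2.0 * (1.0 - w) if y_hat == -1 else 0.0
    return 2.0 * w if y_hat == +1 else 0.0
```

Shrinking `w` makes positive-class errors relatively more expensive, pushing the optimal boundary toward lower class probabilities, which is the intended correction for class imbalance.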
2.2 Classification Consistency
Much work has been done to study the statistical properties of classifiers of the form given in (1) [16, 17, 18, 19]. Of these, consistency of loss functions is one of the most fundamental. In general, a loss function is called consistent for a margin-based learning problem if it recovers in expectation the optimal rule to the theoretical loss function, often called the Bayes rule. More formally, for a theoretical loss and a surrogate loss $\phi$, let $\hat{y}^*$ and $f^*_\phi$ denote the Bayes rule and the optimal margin function, respectively. Then, we call $\phi$ consistent if $\hat{y}(f^*_\phi(x)) = \hat{y}^*(x)$, where $\hat{y}(\cdot)$ is the appropriate prediction rule, e.g. the sign function. Equivalently, using the margin-based formulation of the theoretical loss and letting $f^*$ denote its optimal margin function, consistency may be expressed as $\hat{y}(f^*_\phi(x)) = \hat{y}(f^*(x))$. For rejection-option classification, with $p(x) = P(Y = +1 \mid X = x)$ denoting the conditional class probability, the Bayes optimal rule is given by:
$\hat{y}^*(x) = +1$ if $p(x) > 1 - d$; $\hat{y}^*(x) = r$ if $d \leq p(x) \leq 1 - d$; and $\hat{y}^*(x) = -1$ if $p(x) < d$. (5)
The Bayes optimal rule for weighted classification is given by:
$\hat{y}^*(x) = +1$ if $p(x) > w$, and $\hat{y}^*(x) = -1$ otherwise. (6)
For hard classification, the Bayes optimal rule corresponds to $\hat{y}^*(x) = \mathrm{sign}(p(x) - 1/2)$, and consistency is often referred to as Fisher consistency or classification calibration [18]. While no theoretical loss has been proposed for soft classification, a surrogate loss is commonly called consistent for soft classification if there exists some monotone mapping, $m$, such that $m(f^*_\phi(x)) = p(x)$, where $p(x)$ again denotes the conditional class probability at $x$. Naturally, such a mapping may be viewed as an extension of the prediction rules given above for hard and rejection-option classification. Necessary and sufficient conditions for Fisher, rejection-option, and probability estimation consistency have been described in [20, 12, 21].
In this paper, we propose a novel framework for unifying hard, soft, rejection-option, and weighted classification through a generalized formulation of their corresponding theoretical losses, Bayes optimal rules, and necessary and sufficient conditions for consistency. Our generalized formulation not only provides a platform for comparing existing binary classification tasks, but also introduces an entire family of new tasks which fills the gap between these problems. We next formally introduce our unified framework of binary learning problems.
2.3 Unified Framework
First, we note that all of the classification tasks described in Section 2.1 may be formulated as learning problems which target partial or complete estimation of the conditional class probability, $p(x)$. We propose our framework of unified margin-based learning problems based on this insight. Let $\{R_0, R_1, \ldots, R_K\}$ denote the ordered partition of the interval $[0, 1]$ obtained by splitting at $\pi = (\pi_1, \ldots, \pi_K)$, where $0 < \pi_1 < \cdots < \pi_K < 1$. Assume $p(x) \neq \pi_j$ a.s. for all $j$, such that observations belong to only a single region of the interval. Letting $\pi_0 = 0$ and $\pi_{K+1} = 1$ for ease of notation, we write:
where $R_j = (\pi_j, \pi_{j+1})$, for $j = 0, \ldots, K$. As our framework, we propose the class of problems which target a partition of the covariate space, $\mathcal{X}$, into the regions $\{x : p(x) \in R_j\}$. In Figure 1, we show a sample of observations drawn from the same underlying distribution, along with optimal solutions to three representative problems from our proposed framework. Note that the extreme cases of $K = 1$ with $\pi_1 = 1/2$ (Figure 1A), and the limit with $\pi$ dense on $(0, 1)$ (Figure 1C), correspond to hard and soft classification, respectively. We discuss these connections in more detail later in this section. To illustrate the spectrum of problems in our framework, we also show a new problem in Figure 1B, for an intermediate choice of $K$ and $\pi$.
Formally, we define our framework as the collection of minimization tasks of a theoretical loss which generalizes the 0-1 loss, over the collection of rules $g: \mathcal{X} \to \{R_0, \ldots, R_K\}$. Recall the weighted 0-1 loss for weighted classification described above. For positive and negative class weights $2(1 - \pi)$ and $2\pi$, where $\pi \in (0, 1)$, the weighted 0-1 loss has corresponding Bayes boundary at $p(x) = \pi$. Problems under our framework may be viewed as the task of simultaneously estimating $K$ such boundaries. Intuitively, we formulate our theoretical loss as the average of $K$ weighted 0-1 loss functions with corresponding weights determined by $\pi_1, \ldots, \pi_K$. Throughout, we use $V^+$ and $V^-$ to denote the loss for positive and negative class observations, respectively. As with the weighted loss, we use $V^\pm$ to denote the loss for both classes:
$V^+(g(x)) = \frac{2}{K} \sum_{j=1}^{K} (1 - \pi_j) \, 1\{g(x) < \pi_j\}, \qquad V^-(g(x)) = \frac{2}{K} \sum_{j=1}^{K} \pi_j \, 1\{g(x) > \pi_j\}$ (7)
where the notion of inequalities is extended to elements of $\{R_0, \ldots, R_K\}$ such that $R_k < \pi_j$ if $k < j$ and $R_k > \pi_j$ if $k \geq j$. As we show in Supplementary Section S1, our theoretical loss encompasses the usual 0-1 loss, its weighted variant, and the rejection-option loss proposed by [12]. The multiplicative constant, 2, is included in (7) such that $V^\pm$ reduces precisely to the usual 0-1 loss when $K = 1$ and $\pi_1 = 1/2$. Note that since $V^\pm$ is effectively the average of indicator functions scaled by 2, the function takes values in the interval $[0, 2]$. In Figure 2, we show $V^\pm$ as a function of $p(x)$, corresponding to the problems in Figure 1. Along the horizontal axis, the range is split into the corresponding intervals. Note that the loss function is constant within each interval, giving the appearance of a step function, except in the extreme case when $\pi$ is dense on $(0, 1)$. As $K$ increases, the theoretical loss becomes smoother, with the limit corresponding to the proposed theoretical loss for consistent soft classification described in Section 2.1. Additionally, note that while the loss functions, $V^+$ and $V^-$, are symmetric in Panels A and C of Figure 2, the same is not true for the loss functions in Panel B. This is due to the fact that the boundaries of interest are symmetric between the two classes, i.e. $\pi_j = 1 - \pi_{K+1-j}$ for all $j$, in Panels A and C, but not in Panel B.
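Under this reading of the loss (an average of $K$ weighted 0-1 losses, scaled by 2), $V^\pm$ can be sketched with the prediction encoded as a region index $g \in \{0, \ldots, K\}$:

```python
def generalized_loss(y, g, pis):
    """Average of K weighted 0-1 losses with boundaries pis.

    g is the index of the predicted region: g >= j means the
    prediction lies above boundary pis[j-1].  With pis = (0.5,)
    this reduces to the ordinary 0-1 loss; with pis = (d, 1 - d)
    the middle (reject) region incurs cost d.
    """
    total = 0.0
    for j, pi in enumerate(pis, start=1):
        if y == +1 and g < j:        # predicted below boundary j
            total += 2.0 * (1.0 - pi)
        elif y == -1 and g >= j:     # predicted above boundary j
            total += 2.0 * pi
    return total / len(pis)
```

The two special cases in the docstring are exactly the reductions to the 0-1 and rejection-option losses discussed above.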
The following result states that the class of problems defined by our theoretical loss indeed corresponds to the proposed framework of learning tasks. That is, the Bayes optimal rule minimizing the expected theoretical loss is precisely the partitioning rule described above.
Theorem 1.
For fixed $K$ and $\pi$ defined as above, the Bayes optimal rule for the theoretical loss (7) is given by $g^*(x) = R_j$ if and only if $\pi_j < p(x) < \pi_{j+1}$; that is, $g^*(x)$ is the region containing $p(x)$.
In addition to the results of Theorem 1, the theoretical loss functions for hard (2), rejection-option (3), and weighted (4) classification can be derived as special cases of (7). This is shown by first noting the equivalence of the parameters $d$ and $w$ to corresponding choices of $\pi$, based on the Bayes optimal rules, (5) and (6). From this equivalence, (3) and (4) can be obtained directly from (7). For soft classification, we derive a new theoretical loss from the limiting form of (7): $V^+(g) = (1 - g)^2$ and $V^-(g) = g^2$, for predictions $g \in [0, 1]$.
The resulting theoretical loss is shown in Figure 2C. Since the expected loss $p(x)(1 - g)^2 + (1 - p(x))g^2$ is uniquely minimized at $g = p(x)$, the Bayes rule is simply the conditional class probability, $p(x)$, corresponding to soft classification. All proofs, and a more complete derivation of these results, may be found in the Supplementary Materials.
As with the problems described in Section 2.1, optimization of $V^\pm$ with respect to the rule $g$ is NP-hard. Thus, we first reformulate $V^\pm$ as a function on the margin to express the optimization over a collection of margin functions, $\mathcal{F}$. We then propose in Section 3 to solve the approximate problem using convex surrogate loss functions. Generalizing the approach of [12] for rejection-option classification, we frame the optimization task over the class of margin functions using a prediction rule of the form:
$g(x) = R_j \quad \text{if } \delta_j < f(x) < \delta_{j+1}$ (8)
for monotone increasing thresholds $\delta_1 < \cdots < \delta_K$, with $\delta_0 = -\infty$ and $\delta_{K+1} = +\infty$. Intuitively, each $\delta_j$ corresponds to the boundary $\pi_j$ along the range of the margin function, $f$. As is common in margin-based learning, we write the theoretical loss as the following function over the margin:
$V^+(f(x)) = \frac{2}{K} \sum_{j=1}^{K} (1 - \pi_j) \, 1\{f(x) < \delta_j\}, \qquad V^-(f(x)) = \frac{2}{K} \sum_{j=1}^{K} \pi_j \, 1\{f(x) > \delta_j\}$ (9)
In Figure 3, we plot the corresponding margin-based formulations of the theoretical loss functions shown in Figure 2, with well-chosen $\delta$. Intuitively, $V^+$ is non-increasing and $V^-$ is non-decreasing over the margin. We also note that the two losses differ by a reflection along the vertical axis, since the loss for the negative class is defined with respect to $-f$. Given the margin-based formulation (9), we propose to solve our class of problems using convex surrogate loss functions. In the following section, we first present necessary and sufficient conditions for a surrogate loss to be consistent for (7). We then introduce a class of consistent piecewise linear surrogates, which includes the SVM hinge loss as a special case.
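A rule of the form (8) amounts to locating the margin value among the thresholds; a minimal sketch (the threshold values below are illustrative):

```python
import bisect

def region_from_margin(f_x, deltas):
    """Map a margin value f(x) to a region index via increasing
    thresholds delta_1 < ... < delta_K, as in a rule of the form (8).
    With deltas = (0.0,) this is the usual sign rule."""
    return bisect.bisect_right(list(deltas), f_x)
```

With a single threshold at zero the rule reduces to hard classification; with two thresholds it reproduces the accept/reject/accept structure of rejection-option classification.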
3 Convex Surrogate Loss Functions
Since the proposed theoretical loss function (7) and its margin-based reformulation (9) are discontinuous and non-convex for any finite choice of $K$ and $\pi$, empirical minimization can quickly become intractable. Therefore, we propose to instead minimize a convex surrogate loss over the class of margin functions, as in hard and soft classification. In this section, we first provide necessary and sufficient conditions for a surrogate loss to be consistent for (7) with fixed $K$ and $\pi$. Then, we introduce a class of convex piecewise linear surrogates which includes the SVM hinge loss as a special case. Intuitively, the piecewise linear surrogates each consist of $K$ non-zero segments, corresponding to the boundaries, $\pi_1, \ldots, \pi_K$. In the limit, as $\pi$ becomes dense on $(0, 1)$, the piecewise linear surrogate tends towards a smooth loss, as in Panel C of Figures 2 and 3.
3.1 Consistency
Throughout this section, we assume $K$ and $\pi$ to be fixed. First, let $\phi^+$ and $\phi^-$ denote a pair of convex surrogate loss functions for the positive and negative classes, with $\phi^\pm$ denoting the pair. Further, let $f^*_\phi$ denote the optimal rule over the class of all measurable functions. We call $\phi^\pm$ consistent if there exists $\delta = (\delta_1, \ldots, \delta_K)$ such that the prediction rule (8) applied to $f^*_\phi$ recovers the Bayes rule $g^*$, i.e. if there exists a known monotone mapping from the optimal rule to the partition of $\mathcal{X}$ into the regions $\{x : p(x) \in R_j\}$. The following result provides necessary and sufficient conditions for the consistency of the surrogate loss for $V^\pm$.
Theorem 2.
A pair of convex surrogate loss functions, $\phi^\pm$, is consistent for $V^\pm$ if and only if there exists $\delta = (\delta_1, \ldots, \delta_K)$ such that for each $j$: the derivatives $\phi^{+\prime}(\delta_j)$ and $\phi^{-\prime}(\delta_j)$ exist, $\phi^{+\prime}(\delta_j) < 0$, and
$\pi_j \, \phi^{+\prime}(\delta_j) + (1 - \pi_j) \, \phi^{-\prime}(\delta_j) = 0.$ (10)
Naturally, any surrogate loss satisfying the conditions of Theorem 2 for some $\pi$ must also satisfy the set of conditions for any subset of the boundaries. Thus, for surrogate loss functions consistent for soft classification, i.e. consistent at every boundary in $(0, 1)$, there exists an appropriate $\delta$ for any possible $K$ and $\pi$. Similar intuition is used to justify the use of soft classification based plug-in classifiers described in Section 1. Examples of surrogate losses consistent for soft classification include the logistic, squared hinge, exponential, and DWD losses. Values of $\delta$ such that the conditions of Theorem 2 are met for these loss functions are provided in Corollaries 3–8 of [12]. In the next section, we introduce a class of piecewise linear surrogates which, similar to the SVM loss for hard classification, satisfy consistency for the $\pi$ of interest, but not for any other boundaries. We refer to such a piecewise linear surrogate as being minimally consistent for a corresponding set of boundaries, $\pi$. In contrast to soft classification losses which satisfy consistency for all $\pi$, minimally consistent surrogates are well-tuned for a given $\pi$, and may provide improved stratification of $\mathcal{X}$ into the sets $\{x : p(x) \in R_j\}$.
3.2 Piecewise Linear Surrogates
Throughout, we use $\phi^+$ and $\phi^-$ to denote piecewise linear surrogates. To build intuition, in the columns of Figure 4, we show examples of $\phi^\pm$ corresponding to hard classification, rejection-option classification, and the new problem shown in Figure 1B. Circles are used to highlight the hinges, i.e. non-differentiable points, along the piecewise linear loss functions. The corresponding margin-based theoretical loss, (9), is also shown in each panel using appropriately chosen $\delta$. First, note that the losses in Panels A and B of Figure 4 correspond to the standard SVM hinge loss and the generalized hinge loss of [11], respectively. Consider the new surrogate losses in Figure 4C. Note that $\phi^+$ and $\phi^-$ each consist of $K$ non-zero linear segments. Furthermore, each linear segment spans only a single threshold, $\delta_j$, for $\phi^+$ and $\phi^-$, respectively. We will refer to these pairs of linear segments as the consistent segments. This construction allows the consistency of the surrogate loss at each boundary to be controlled separately by the pairs of consistent segments along the piecewise linear loss.
We formulate our collection of piecewise linear surrogate losses as the maximum of $K$ linear segments and 0. Consider first the surrogate loss for positive observations, $\phi^+$. Using $a_j$ and $b_j$ to denote the intercept and slope of the $j$th consistent segment, we express the piecewise linear loss as:
$\phi^+(u) = \max\left\{0, \ \max_{1 \leq j \leq K} \left(a_j - b_j u\right)\right\}$ (11)
We similarly use $c_j$ and $d_j$ to denote the intercept and slope of the $j$th consistent segment for the negative class loss such that:
$\phi^-(u) = \max\left\{0, \ \max_{1 \leq j \leq K} \left(c_j + d_j u\right)\right\}$ (12)
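The max-of-segments construction in (11) and (12) can be sketched directly; here `a`, `b` (and `c`, `d`) hold the intercepts and slopes of the consistent segments:

```python
def pw_linear_pos(u, a, b):
    """Positive-class surrogate: max of 0 and the segments a_j - b_j*u.
    With a = (1,), b = (1,) this is exactly the SVM hinge loss."""
    return max(0.0, max(aj - bj * u for aj, bj in zip(a, b)))

def pw_linear_neg(u, c, d):
    """Negative-class surrogate: max of 0 and the segments c_j + d_j*u."""
    return max(0.0, max(cj + dj * u for cj, dj in zip(c, d)))
```

Being a pointwise maximum of affine functions and 0, each loss is automatically convex, continuous, and non-negative, regardless of the segment parameters.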
By construction, the resulting piecewise linear losses are non-negative, convex, and continuous. While (11) and (12) define a general class of piecewise linear losses, we focus on a subset of minimally consistent piecewise linear surrogates. In the following theorem, we provide a set of sufficient conditions for a piecewise linear loss to be minimally consistent for a specified $\pi$.
Theorem 3.
Let $t_1, \ldots, t_{K-1}$ denote the locations of the hinges along the respective loss functions between consecutive boundaries, $\delta_j$ and $\delta_{j+1}$. Then, $\phi^\pm$ is a minimally consistent piecewise linear surrogate for $V^\pm$ if the intercept and slope parameters, $(a_j, b_j)$ and $(c_j, d_j)$, satisfy the following conditions:

(C1) $d_j$ is non-decreasing, and $b_j$ is non-increasing, in $j$.

(C2) The hinge points are such that:
$t_j = \frac{a_j - a_{j+1}}{b_j - b_{j+1}} = \frac{c_{j+1} - c_j}{d_{j+1} - d_j} \in (\delta_j, \delta_{j+1}), \quad j = 1, \ldots, K - 1.$

(C3) $(b_j, d_j)$ satisfy:
$\pi_j \, b_j = (1 - \pi_j) \, d_j, \quad j = 1, \ldots, K.$
Conditions (C1) and (C2) guarantee that the linear segments are well-ordered and non-degenerate, with appropriately aligned hinge points. Condition (C3) guarantees the consistency of $\phi^\pm$ at the corresponding boundaries, $\pi$. Most importantly, by aligning the hinge points of $\phi^+$ and $\phi^-$, we ensure that there does not exist a $\delta$ such that (10) is satisfied for any boundary outside $\pi$. Next, we present an approach to obtaining intercept and slope parameters which satisfy the conditions of Theorem 3, using the logistic loss as an example.
3.3 Logistic Derived Surrogates
In this section, we propose to construct piecewise linear losses by choosing the consistent segments to be the tangent lines to the logistic loss at the thresholds $\delta_1, \ldots, \delta_K$. A similar approach was used by [22] to construct a piecewise linear loss for the rejection-option problem. The following proposition states that piecewise linear loss functions constructed using this approach satisfy the conditions of Theorem 3 for any choice of $K$ and $\pi$.
Proposition 1.
For fixed $K$ and $\pi$, let $\phi^\pm$ be the piecewise linear loss constructed from the tangent lines to the logistic loss at $\delta_j = \log(\pi_j / (1 - \pi_j))$, such that $(a_j, b_j)$ and $(c_j, d_j)$ are defined as:
$a_j = \log(1 + e^{-\delta_j}) + \frac{\delta_j}{1 + e^{\delta_j}}, \quad b_j = \frac{1}{1 + e^{\delta_j}}, \quad c_j = \log(1 + e^{\delta_j}) - \frac{\delta_j e^{\delta_j}}{1 + e^{\delta_j}}, \quad d_j = \frac{e^{\delta_j}}{1 + e^{\delta_j}}.$
Then, $\phi^\pm$ is a minimally consistent piecewise linear surrogate for $V^\pm$ satisfying the conditions of Theorem 3.
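The tangent-line construction for the positive-class loss can be sketched as follows; the tangent points are left as free parameters here, since in the paper they are determined by $\pi$ and $\delta$:

```python
import math

def logistic_loss(u):
    """Logistic loss log(1 + exp(-u))."""
    return math.log(1.0 + math.exp(-u))

def tangent_segment(u0):
    """Intercept a and slope b of the tangent line a - b*u to the
    logistic loss at u0, so b = -(d/du) log(1 + exp(-u)) at u0."""
    b = 1.0 / (1.0 + math.exp(u0))
    a = logistic_loss(u0) + b * u0
    return a, b

def logistic_pw_linear(u, tangent_points):
    """Piecewise linear surrogate built from tangent lines to the
    logistic loss, clipped below at 0 as in (11)."""
    segs = [tangent_segment(u0) for u0 in tangent_points]
    return max(0.0, max(a - b * u for a, b in segs))
```

By convexity of the logistic loss, the resulting surrogate touches it at each tangent point and lies below it elsewhere until clipped at 0, matching the Figure 5 description of the two losses agreeing near the tangent points.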
In Figure 5, we illustrate the logistic-derived piecewise linear loss. The logistic loss is shown by dotted lines, with the piecewise linear surrogate functions for the positive and negative classes shown in solid black. Thin vertical lines denote the tangent points where the losses are equal, and thin dashed lines give the tangent lines to the logistic loss at each $\delta_j$. Additionally, the non-differentiable hinge points are highlighted by circles. While the loss functions appear roughly equivalent within the region of the tangent points, the difference is non-negligible above and below these bounds. Notably, the piecewise linear losses diverge more slowly as the margin tends to $-\infty$, suggesting the losses may be more robust to outliers [7]. Additionally, the logistic-derived loss functions provide a natural spectrum for comparing the impact of targeting different partitions, $\pi$, on the same dataset. We explore these issues using simulation in Section 6.

4 Statistical Properties
We next derive statistical properties of surrogate loss functions for the theoretical loss, $V^\pm$. In Subsection 4.1, we first show that the excess risk with respect to $V^\pm$ may be bounded by the excess risk of a consistent surrogate loss. Then, in Subsection 4.2, we use these risk bounds to derive convergence rates for the empirical minimizer of a surrogate loss to the Bayes optimal rule. Our results generalize and extend those derived for the particular case of rejection-option classification in [10, 11, 12] to an arbitrary number of boundaries.
4.1 Excess Risk Bounds
For a rule $g$, we define the risk of $g$ to be the expected loss of the rule, denoted by $R(g)$. In statistical machine learning, a natural measure of the performance of a rule is its excess risk, $R(g) - R(g^*)$, where $g^*$ is the Bayes rule such that $R(g^*) = \inf_g R(g)$. In this section, we derive convergence rates on the excess risk for rules obtained using consistent surrogate loss functions. For a surrogate loss $\phi^\pm$, we similarly define the risk and excess risk over the class of margin functions to be $R_\phi(f)$ and $R_\phi(f) - R_\phi(f^*_\phi)$. To obtain convergence rates on the excess risk, we first show that under certain conditions, the excess surrogate risk of a margin function can be used to bound the corresponding excess risk of the induced rule. Using this bound, we then derive rates of convergence on the excess risk through rates of convergence on the excess surrogate risk. The following additional notation is used to denote the excess conditional risk and the excess conditional surrogate risk: $\Delta C(x) = C(g(x); x) - C(g^*(x); x)$ and $\Delta C_\phi(x) = C_\phi(f(x); x) - C_\phi(f^*_\phi(x); x)$, where $C(\cdot; x)$ and $C_\phi(\cdot; x)$ denote the conditional risks at $x$.
In the following results, we provide conditions under which a function of the excess conditional surrogate risk can be used to bound the corresponding excess conditional risk.
Theorem 4.
Let $\phi^\pm$ be a consistent surrogate loss for $V^\pm$ satisfying the conditions of Theorem 2 at $\delta$. Furthermore, suppose there exist constants $c > 0$ and $s \geq 1$ such that for all $x \in \mathcal{X}$,
$|\Delta C(x)|^s \leq c^s \, \Delta C_\phi(x).$ (13)
Then, $R(g) - R(g^*) \leq c \left( R_\phi(f) - R_\phi(f^*_\phi) \right)^{1/s}$.
The above bound may be tightened as in [12] under the additional assumption:
$P\left( |p(X) - \pi_j| \leq t \right) \leq c_j \, t^{\alpha} \quad \text{for all } t > 0, \ j = 1, \ldots, K,$ (14)
for some $\alpha \geq 0$ and constants $c_j > 0$. The condition (14) generalizes the margin condition introduced by [23] and used in [10].
Theorem 5.
Note that when $\alpha = 0$, Theorem 5 provides the same bound as Theorem 4. However, as $\alpha$ increases, the bound becomes tighter. While neither result depends explicitly on $K$, Theorem 5 suggests that tighter bounds may be achieved by only targeting boundaries $\pi$ for which the margin condition is satisfied with large $\alpha$. This reiterates the motivating intuition for our proposed framework, in which we formalize a class of learning problems for settings where more information than hard classification is desired, but soft classification may not be appropriate.
Corresponding values of $c$ and $s$ for the exponential, logistic, squared hinge, and DWD losses are provided in Corollaries 13–16 of [12]. In the following result, we derive values of $c$ and $s$ for our class of minimally consistent piecewise linear surrogates.
Corollary 1.
Consider now a sequence of margin functions, $\hat{f}_1, \hat{f}_2, \ldots$. By Theorems 4 and 5, to show that the excess risk converges to 0 as $n \to \infty$, it suffices to show that the excess surrogate risk converges to 0 as $n \to \infty$. In the following results, we derive convergence rates for the excess surrogate risk of the sequence of functions, $\hat{f}_n$, where $\hat{f}_n$ is used to denote the empirical minimizer of the surrogate loss over a training set of size $n$.
4.2 Rates of Convergence
In this section, we derive convergence results for two classes of surrogate loss functions separately. We first consider Lipschitz continuous and differentiable surrogate loss functions which satisfy a modulus of convexity condition specified below. Examples of such loss functions include the exponential, logistic, squared hinge and DWD losses. We then separately consider the class of piecewise linear surrogates described in Section 3.
Let $\phi^\pm$ denote a Lipschitz continuous and differentiable surrogate loss function. Assume that the corresponding risk, $R_\phi$, has modulus of convexity,
$\delta(\epsilon) = \inf\left\{ \frac{R_\phi(f) + R_\phi(g)}{2} - R_\phi\!\left(\frac{f + g}{2}\right) : \|f - g\| \geq \epsilon \right\},$ (15)
satisfying $\delta(\epsilon) \geq a \epsilon^2$ for some $a > 0$. Furthermore, let $L$ denote the Lipschitz constant such that $|\phi^\pm(u) - \phi^\pm(v)| \leq L |u - v|$ for all $u, v$. Letting $\mathcal{F}$ denote a class of uniformly bounded functions such that $\|f\|_\infty \leq B$ for all $f \in \mathcal{F}$, we use $N(\epsilon, \mathcal{F})$ to denote the cardinality of the set of closed balls with radius $\epsilon$ needed to cover $\mathcal{F}$. Finally, as stated above, let $\hat{f}_n$ denote the empirical minimizer of the surrogate risk over the training set. For the following corollary, we make use of Theorem 18 from [12], which provides a bound on the expected estimation error for consistent loss functions satisfying the modulus of convexity condition stated above. Combining Theorem 18 of [12] with the excess risk bounds of Theorems 4 and 5, we obtain the following result.
Corollary 2.
From the bound on excess risk obtained in Corollary 2, corresponding rates of convergence can be derived based on the cardinality, $N(\epsilon, \mathcal{F})$, of the class of functions, $\mathcal{F}$.
Due to the non-differentiability of the loss at the hinge points, our class of piecewise linear surrogates does not satisfy the modulus of convexity condition (15). The following theorem provides separate convergence results for our class of minimally consistent piecewise linear surrogates. Again, we use $\mathcal{F}$ to denote a class of uniformly bounded functions, and let $\hat{f}_n$ denote the empirical minimizer of the piecewise linear surrogate risk.
Theorem 6.
Corollary 3.
As in Theorem 5, while the convergence rate of Theorem 6 does not depend on $\pi$ explicitly, it does depend on the parameters of the margin condition (14). Therefore, Theorem 6 further suggests the advantage of targeting boundaries $\pi$ for which the data show strong separation with large $\alpha$. Furthermore, in contrast to Theorem 18 of [12], which provides a bound on the expected estimation error, Theorem 6 bounds the total risk, including both the expected estimation error and the expected approximation error of the class of functions $\mathcal{F}$. As a result, while the bounds in Corollary 2 include a separate approximation error term, the piecewise linear bound in Corollary 3 does not.