Classification is one of the founding pillars of statistical learning. In binary classification, a training data set $\{(x_i, y_i);\ i = 1, \ldots, n\}$ is obtained from an unknown distribution $P(X, Y)$, where $X \in \mathbb{R}^p$ denotes the observed covariates and $Y \in \{+1, -1\}$ is the class label. The learning goal is to obtain a classifier $\hat{\phi}(\cdot)$ based on the training data such that, for any new observation with only $X$ available, its class label can be accurately predicted by $\hat{\phi}(X)$. The goodness of a classifier $\phi$ is commonly measured by the misclassification rate, $\mathrm{pr}\{Y \neq \phi(X)\}$, where the probability is taken with respect to $P$. We aim to find the best classifier, which minimizes the expected value of the 0-1 loss $1\{Y \neq \phi(X)\}$.
There are many classification methods in the literature. For an overall introduction, see Hastie et al. (2009). Among these methods, margin-based classifiers are very popular. For a binary margin-based classifier, one typically finds a classification function $f$ and defines the classifier as $\phi(x) = \mathrm{sign}\{f(x)\}$. A correct classification occurs when the functional margin $y f(x) > 0$. Since directly minimizing the empirical 0-1 loss is difficult due to the discontinuity of the 0-1 loss function, a surrogate loss is often used to encourage large values of the functional margin $y f(x)$. Many binary margin-based classifiers using different surrogate loss functions have been proposed in the literature, such as Support Vector Machines (SVM; Cortes and Vapnik, 1995; Vapnik, 1998), AdaBoost (Freund and Schapire, 1997), $\psi$-learning (Shen et al., 2003), Distance-Weighted Discrimination (DWD; Marron et al., 2007), Large-margin Unified Machine (LUM; Liu et al., 2011) and Flexible High-dimensional Classification Machines (FLAME; Qiao and Zhang, 2015).
When there are $k > 2$ classes, the class label can be coded as $Y \in \{1, \ldots, k\}$ instead. In this article, we focus on multicategory classifiers that consider all $k$ classes simultaneously in a single optimization problem. A common approach is to train a vector-valued function $f = (f_1, \ldots, f_k)^T$, and define the classifier as $\phi(x) = \arg\max_{j} f_j(x)$. A sum-to-zero constraint, $\sum_{j=1}^{k} f_j = 0$, is often imposed for theoretical and practical concerns. See, for example, Vapnik (1998), Crammer and Singer (2001), Lee et al. (2004), Zhu and Hastie (2005), Liu and Shen (2006), Liu and Yuan (2011), Zhang and Liu (2013), among others. Recently, Zhang and Liu (2014) proposed the angle-based classification framework. The angle-based classifiers are free of the sum-to-zero constraint, and can be advantageous in terms of computational speed and classification performance, especially for high-dimensional problems. In this paper, our proposed method is based on the angle-based classification framework.
In real applications, it is often the case that an accurate decision is hard to reach, and the consequence of misclassification is too severe to bear. In these situations, it may be wise to resort to a reject option, i.e., to report “I don’t know” (denoted as ® hereafter), to avoid such a consequence. With a reject option, future resources can be allocated to the rejected subjects to improve their classification. For example, in cancer diagnosis, an oncologist should send a patient whose condition is difficult to diagnose from preliminary results for more tests, or seek a second opinion, instead of telling the patient, with little confidence, that she probably has or does not have the cancer.
To adopt a reject option, a possible approach is to modify the 0-1 loss such that when ® occurs, a positive cost is present (otherwise, ® would always be preferred). For instance, Herbei and Wegkamp (2006) considered the 0-d-1 loss, which equals 0 for a correct prediction, $d$ for a rejection and 1 for a misclassification, where $d$ is the cost for a rejection (e.g., this may be the cost for the additional tests that the oncologist orders for the patient).
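As a concrete illustration of the 0-d-1 loss, the following minimal sketch computes its empirical average on a handful of predictions. The names `REJECT`, `zero_d_one_loss` and `empirical_risk` are hypothetical and chosen only for this example; the string `"reject"` stands in for ®.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def zero_d_one_loss(y_true, y_pred, d):
    """Return 0 for a correct label, d for a rejection, 1 for a mistake."""
    if y_pred == REJECT:
        return d
    return 0.0 if y_pred == y_true else 1.0


def empirical_risk(labels, predictions, d):
    """Average 0-d-1 loss over paired true labels and predictions."""
    losses = [zero_d_one_loss(y, p, d) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)


# Four cases: two correct, one rejection (cost d = 0.2), one mistake (cost 1).
risk = empirical_risk([1, -1, 1, 1], [1, REJECT, -1, 1], d=0.2)
```

With a small $d$, a rejection is much cheaper than a mistake, which is what makes ® attractive for hard-to-classify observations.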
Recently, there have been a number of works on the reject option for binary classification in the literature (Fumera and Roli, 2002; Herbei and Wegkamp, 2006; Wegkamp, 2007; El-Yaniv and Wiener, 2010; Yuan and Wegkamp, 2010; Wegkamp and Yuan, 2011). However, much less attention has been paid to multicategory classification. In the literature, Fumera et al. (2000), Tax and Duin (2008) and Le Capitaine and Frélicot (2010) considered the reject option in multicategory classification using methods that depend on explicit class conditional probability estimation. However, probability or density estimation is often much more difficult than class label prediction (Fürnkranz and Hüllermeier, 2010), especially when the dimension is high (Zhang et al., 2013). Hence, it is desirable to have a multicategory classifier with a reject option that does not rely on explicit class probability estimation. The current article fills this gap.
Our first contribution is to propose multicategory classifiers with a reject option. Our methods are based on angle-based multicategory methods and do not involve estimating the class conditional probability, hence can be robust and efficient for high-dimensional problems.
Secondly, we introduce a new notion that is unique to the multicategory problem (and absent in the binary case), namely, a refine option. A refinement predicts the class label to be from a set of $r$ labels, where $1 \le r \le k$. When $r = 1$, it reduces to the regular definite classification; when $r = k$, no information is provided and a refinement is the same as ®; when $1 < r < k$, we have refined the number of classes that an observation most likely belongs to, from $k$ to $r$. A smaller $r$ leads to more useful information, yet it increases the chance of misclassification. In this paper, we introduce a data-adaptive approach that can automatically select the size $r$ for a new prediction.
The usefulness of the refine option can be understood from two sides. In contrast to a definite but potentially reckless answer (a single predicted label), a refinement is more cautious and risk-averse; catastrophic consequences of misclassification can be effectively avoided. On the other hand, compared with a complete reject option (the full label set), which tells little about an observation, a refinement provides constructive information; future investigation can be conducted on a set of originally confusable classes, which can improve the classification performance.
Our next contribution is a thorough investigation of the theoretical properties of our methods, focusing on the asymptotic behavior of the excess $\ell$-risk when the number of classes $k$ and the dimension $p$ both diverge. In particular, we calibrate the difficulty of classification when $k$ increases. This helps to shed some light on the usefulness of our new refine option, that is, one can focus on a subset of classes in a refined further analysis, which can in turn improve the classification accuracy. Moreover, we demonstrate that if the number of noise predictors diverges faster than the number of classes does, then the $L_1$ penalty can perform better than the $L_2$ regularization. On the other hand, if the number of noise predictors is negligible with respect to the number of classes, then the $L_1$ and $L_2$ methods are comparable.
The rest of the article is organized as follows. Section 2 provides some background information. The main methods are introduced in Section 3. Section 4 presents the algorithms and tuning parameter selection. A novel statistical learning theory is provided in Section 5. Section 6 includes all the numerical studies. Some concluding remarks are given in Section 7. Most technical proofs are collected in the Supplementary Materials.
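Let $P_j(x) = \mathrm{pr}(Y = j \mid X = x)$ be the class conditional probability of observation $x$ for class $j$ ($j = \pm 1$ or $j = 1, \ldots, k$). In the binary case, it can be shown that the Bayes decision under the 0-d-1 loss is $+1$ if $P_{+1}(x) > 1 - d$, $-1$ if $P_{+1}(x) < d$, or ® otherwise (Herbei and Wegkamp, 2006).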
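The binary Bayes decision with a reject option can be sketched as a simple threshold rule on the class conditional probability. The function name `bayes_rule_binary` and the `REJECT` placeholder are assumptions made for this illustration, and the sketch presumes $0 \le d \le 1/2$ so that the two thresholds do not cross.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def bayes_rule_binary(p_plus, d):
    """Bayes decision under the 0-d-1 loss, given p_plus = pr(Y = +1 | x)."""
    if not 0.0 <= d <= 0.5:
        raise ValueError("the sketch assumes 0 <= d <= 1/2")
    if p_plus > 1.0 - d:
        return 1
    if p_plus < d:
        return -1
    # Neither class is convincing enough: d <= p_plus <= 1 - d.
    return REJECT
```

For example, with $d = 0.2$ an observation with `p_plus = 0.6` is rejected, because committing to either label would carry an expected cost of at least 0.4, which exceeds the rejection cost.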
Note that where for binary classification or for the multicategory case. Hence, for each , must fall on a simplex in . Throughout this article, we define the Bayes reject region to be , a region on this simplex. For example, in the binary case, we have .
While it is possible to achieve the reject option by first estimating the conditional probabilities $P_j(x)$ for each $j$ and then plugging the estimates into the Bayes rule (whose form in the multicategory case will be formally presented in Proposition 2), it is well known that probability estimation can be more difficult than mere label prediction (Wang et al., 2008; Fürnkranz and Hüllermeier, 2010; Wu et al., 2010), especially when the dimension is large (Zhang et al., 2013). Hence our goal here is to propose multicategory classifiers with a reject option that do not require explicit probability estimation.
We first briefly introduce the state-of-the-art for binary classification with a reject option. Section 2.2 reviews the angle-based multicategory classification methods.
2.1 Binary Margin-based Classification with a Reject Option
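The seminal paper of Bartlett and Wegkamp (2008) proposed a novel method that employed a modified hinge loss $\phi_d$ for binary classification with a reject option. In particular, $\phi_d(u) = 1 - a u$ if $u < 0$, $\phi_d(u) = 1 - u$ if $0 \le u < 1$, and $\phi_d(u) = 0$ otherwise, where $a = (1 - d)/d$ (see Figure 1). Define $f^*$ to be the minimizer of the conditional expected loss $E[\phi_d\{Y f(X)\} \mid X = x]$ over $f \in \mathcal{F}$ (for an appropriate space $\mathcal{F}$), and define the associated classifier to be $\mathrm{sign}\{f^*(x)\}$ if $f^*(x) \neq 0$, or ® otherwise. The reject region induced by $\phi_d$ is then the set of $x$ with $f^*(x) = 0$. Bartlett and Wegkamp (2008) showed that their classifier coincided with the Bayes rule and hence that this reject region coincides with the Bayes reject region.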
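To see numerically why the modified hinge loss produces rejections, the sketch below evaluates the conditional expected loss on a grid and locates its minimizer. It assumes the piecewise form of the Bartlett and Wegkamp (2008) loss with slope $a = (1 - d)/d$ on the negative side; the function names and the grid search are illustrative choices, not part of the original method.

```python
def modified_hinge(u, d):
    """Hinge loss with the steeper slope a = (1 - d) / d for u < 0."""
    a = (1.0 - d) / d
    if u < 0.0:
        return 1.0 - a * u
    if u < 1.0:
        return 1.0 - u
    return 0.0


def conditional_risk(f, p_plus, d):
    """Expected loss at one point x, where p_plus = pr(Y = +1 | x)."""
    return p_plus * modified_hinge(f, d) + (1.0 - p_plus) * modified_hinge(-f, d)


def grid_minimizer(p_plus, d, steps=400):
    """Minimize the conditional risk over an even grid on [-2, 2]."""
    grid = [-2.0 + 4.0 * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda f: conditional_risk(f, p_plus, d))
```

With $d = 0.2$, the minimizer sits at 0 when `p_plus = 0.6` (a rejection) and moves to 1 when `p_plus = 0.9` (a confident positive label), which mirrors the Bayes rule thresholds $d$ and $1 - d$.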
2.2 Angle-based Multicategory Classification
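Zhang and Liu (2014) showed that multicategory margin-based classification methods with $k$ classification functions under the sum-to-zero constraint can be inefficient, and proposed the angle-based classification framework. They showed that angle-based classifiers are competitive in terms of classification accuracy and computational speed, especially in high-dimensional problems. The idea of angle-based classifiers is briefly introduced here. For a problem with $k$ classes, consider a centered simplex in $\mathbb{R}^{k-1}$ with $k$ vertices, $W = \{W_1, \ldots, W_k\}$. Here
$$
W_j = \begin{cases}
(k-1)^{-1/2}\,\mathbf{1}, & j = 1, \\
-\dfrac{1 + \sqrt{k}}{(k-1)^{3/2}}\,\mathbf{1} + \sqrt{\dfrac{k}{k-1}}\; e_{j-1}, & 2 \le j \le k,
\end{cases}
$$
where $\mathbf{1}$ is a vector of all 1’s of length $k-1$, and $e_j \in \mathbb{R}^{k-1}$ has 1 on its $j$th element and 0 elsewhere. One can verify that the $W_j$’s have unit norms, and the pairwise distances between $W_i$ and $W_j$ are the same for all $i \neq j$. Therefore, $W$ forms a simplex with $k$ vertices in $\mathbb{R}^{k-1}$. We use $W_j$ as the surrogate coding vector for the class label ‘$j$’. In angle-based methods, a vector-valued classification function $f$ maps $x$ to $\mathbb{R}^{k-1}$. Each $f(x)$ induces $k$ angles with $W_1, \ldots, W_k$, namely, $\angle(f(x), W_j)$, $j = 1, \ldots, k$. Zhang and Liu (2014) proposed to use the prediction rule $\hat{y} = \arg\max_j \langle \hat{f}(x), W_j \rangle$, which assigns $x$ to the class whose vertex has the smallest angle with $\hat{f}(x)$. Here, the inner product $\langle f(x), W_y \rangle$ can be viewed as an analog to the functional margin in a non-angle-based method, and hence is referred to as an angle margin hereafter. From this point of view, Zhang and Liu (2014) proposed to solve the following optimization problem to find $\hat{f}$ within some functional space $\mathcal{F}$,
$$
\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} \ell\{\langle f(x_i), W_{y_i} \rangle\} + \lambda J(f), \qquad (1)
$$
where $\ell$ is a common binary margin-based surrogate loss function, $J(f)$ is a penalty on $f$ to prevent overfitting, and $\lambda$ is a tuning parameter to balance the goodness of fit and the complexity of the model. The optimization (1) encourages a large value for $\langle f(x), W_y \rangle$.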
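The simplex coding can be checked numerically. The sketch below assumes the standard vertex construction of Zhang and Liu (2014), builds the $k$ vertices in $\mathbb{R}^{k-1}$, and applies the largest-angle-margin prediction rule; the function names are hypothetical and the value of $f(x)$ is supplied by hand rather than fitted.

```python
import math


def simplex_vertices(k):
    """Return the k unit-norm simplex vertices W_1, ..., W_k in R^(k-1)."""
    dim = k - 1
    vertices = [[1.0 / math.sqrt(dim)] * dim]           # W_1
    shift = -(1.0 + math.sqrt(k)) / dim ** 1.5
    scale = math.sqrt(k / dim)
    for j in range(2, k + 1):                           # W_2, ..., W_k
        w = [shift] * dim
        w[j - 2] += scale                               # add scale * e_{j-1}
        vertices.append(w)
    return vertices


def angle_margins(f_x, vertices):
    """Inner products <f(x), W_j> for j = 1, ..., k."""
    return [sum(a * b for a, b in zip(f_x, w)) for w in vertices]


def predict_label(f_x, vertices):
    """Class (1-based) whose vertex has the largest angle margin."""
    margins = angle_margins(f_x, vertices)
    return 1 + max(range(len(margins)), key=margins.__getitem__)
```

Because the vertices sum to zero, the $k$ angle margins of any $f(x)$ also sum to zero, which is how the angle-based framework removes the need for an explicit sum-to-zero constraint.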
3.1 Multicategory Classification with a Reject Option
Given an observation , recall the definition of . Let be the th greatest value among ’s, let be the class label corresponding to , and define to be the coding vector for . Note that is not necessarily the true class label for , but is its th most plausible class. Lastly, we define , and .
Our approach is inspired by the work of Bartlett and Wegkamp (2008) for binary problems. In particular, their loss function was , where was the hinge loss function for SVM and was an additional slope added to the hinge loss for . One can view as the hinge loss, bent at so that the left derivative is (negatively) larger than the right derivative . Denote the theoretical minimizer . The bent loss function can keep at if is not significantly different from (), thus leading to a rejection in this case. In particular, is positive if , is negative if , and remains if . Note that comparing with is equivalent to comparing with .
Inspired by these observations, to realize a reject option for multicategory classification, we employ a similar technique, namely, to use a bent loss function that has different left and right derivatives at . Specifically, we equip an angle-based multicategory classifier with a modified loss, with the aim to have the angle margin for all , where is the theoretical minimizer of the loss (to be defined more precisely later) if the class conditional probability ’s are not significantly different from each other; note that this implies that is not large enough and that ’s are similar as well. We will show in Proposition 1 that this is indeed the case.
For any observation and function , we propose a loss function defined as
where . Here and is the loss function for any Fisher consistent binary margin-based classifier (such as the hinge loss, the DWD loss, the LUM loss and the FLAME loss.) Throughout this paper we assume for simplicity. Furthermore, is defined so that for , and for . Hence is the result of bending using . This will be illustrated in Figure 2 using two typical loss functions. The loss function (2) is the sum of over all class ’s not equal to the true class . With this loss function, our classification function is obtained by,
The monotonically increasing loss function encourages a small value of for which indirectly maximizes since .
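With $\delta$ a small positive constant, define the soft thresholding operator (Donoho, 1995) as $H_\delta(u) = \mathrm{sign}(u)\,(|u| - \delta)_+$. The induced classifier can be summarized as,
$$
\hat{\phi}(x) = \begin{cases}
\text{®}, & \text{if } H_\delta\{\langle \hat{f}(x), W_j \rangle\} = 0 \text{ for all } j, \\
\arg\max_j \langle \hat{f}(x), W_j \rangle, & \text{otherwise.}
\end{cases} \qquad (4)
$$
That is, we report a rejection when all $\langle \hat{f}(x), W_j \rangle$’s are close to 0.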
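The thresholding step can be sketched as follows. The code assumes the usual soft thresholding operator $\mathrm{sign}(u)(|u| - \delta)_+$ and a rule that rejects exactly when every thresholded angle margin is zero and otherwise predicts the class with the largest angle margin; the function names and the `REJECT` placeholder are hypothetical.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def soft_threshold(u, delta):
    """Shrink u toward zero by delta; values within [-delta, delta] become 0."""
    if u > delta:
        return u - delta
    if u < -delta:
        return u + delta
    return 0.0


def classify_with_reject(margins, delta):
    """Reject if all angle margins are within delta of 0, else pick the top class."""
    thresholded = [soft_threshold(m, delta) for m in margins]
    if all(t == 0.0 for t in thresholded):
        return REJECT
    return 1 + max(range(len(margins)), key=margins.__getitem__)
```

For instance, margins `[0.02, -0.01, -0.01]` with `delta = 0.05` lead to a rejection, while `[0.6, -0.2, -0.4]` lead to class 1.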
Our method is very general, as one can use any Fisher consistent binary margin-based loss and extend the binary classifier to the multicategory case, meanwhile allowing for a reject option. For the purpose of illustration, in this section we generalize two popular binary margin-based classifiers, SVM and DWD. The bent SVM and DWD losses are,
We plot and in Figure 2.
To provide more insights to the new classifier, we first study the population version of , namely, the theoretical minimizer , and its associated reject region. We will compare the reject region of our method with the Bayes reject region under a generalized -- loss, and show that our methods mimic the latter, which helps to justify our approach from a theoretical view.
Proposition 1.
Let be a bent loss function as defined in (2), with and . For the sequence , if there exists some such that and , then the theoretical minimizer of the conditional expected loss satisfies that , , and for all ; otherwise, for all .
Proposition 1 indicates that for all when , that is, the class conditional probability of the most plausible class is not significantly different from that of the least plausible class , by a ratio not exceeding .
Hence, the corresponding -reject region is which depends on the parameter . When the context is clear, we may use notation without explicitly declaring its dependence on . On Panels (b) and (c) of Figure 3, we plot for a three-class example with two values of , and , defined in Proposition 3.
For each $a$, the reject region induced by the loss lies near the center of the simplex, which is where the class conditional probabilities $P_j(x)$ are close to each other. Intuitively, such an observation is difficult to classify. Next, consider a natural generalization of the (binary) 0-d-1 loss in Herbei and Wegkamp (2006) to the multicategory case, which assigns 0 for correct decisions, 1 for mistakes, and $d$ for ®. In a $k$-class problem, we must have $d \le (k-1)/k$ to prevent the reject option from being inadmissible: since the largest class conditional probability is at least $1/k$, a larger $d$ would make a rejection more costly than any label prediction. The next proposition gives the Bayes classifier under the generalized 0-d-1 loss for multicategory classification, whose reject decision depends only on the largest class conditional probability. The Bayes reject region is $\{x : \max_j P_j(x) \le 1 - d\}$ (see Panel (a) of Figure 3).
Proposition 2 (Chow, 1970). For the 0-d-1 loss in multicategory classification, the Bayes classifier is $\arg\max_j P_j(x)$ if $\max_j P_j(x) > 1 - d$, and ® otherwise.
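Chow's rule is straightforward to state in code when the class conditional probabilities are known. The sketch below is illustrative only: `chow_rule` and `REJECT` are hypothetical names, the probabilities are supplied by hand, and ties at the threshold are sent to ® by convention.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def chow_rule(probs, d):
    """Bayes decision under the multicategory 0-d-1 loss.

    probs holds the class conditional probabilities of classes 1..k.
    """
    k = len(probs)
    if d > (k - 1) / k:
        # With such a large d a rejection can never be optimal.
        raise ValueError("the reject option is inadmissible for this d")
    best = max(range(k), key=probs.__getitem__)
    return best + 1 if probs[best] > 1.0 - d else REJECT
```

With $d = 0.4$, the vector `[0.7, 0.2, 0.1]` is assigned to class 1, whereas `[0.4, 0.35, 0.25]` is rejected because its largest probability does not exceed $1 - d = 0.6$.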
One would expect a good classifier with a reject option to have a reject region that resembles (or even coincides with) that of the Bayes rule (under an appropriate loss function). Indeed, one can deduct from Proposition 1 that for any Fisher consistent binary loss function with and , our coincides with the Bayes reject region under the -- loss. However, in the multicategory case, this property generally does not hold. The next proposition gives the greatest and smallest such that and bound from two sides.
For a -class problem with the cost for rejection , define and . Then we have . The bounds are tight in the sense that for any such that , and .
Panels (b) and (c) in Figure 3 show the -reject regions for and . From the comparison between these two reject regions and the Bayes reject region shown in Panel (a), one can see that our method induces a reject region that closely approximates the Bayes reject region. In practice, one can choose from for such an approximation. The issue of tuning the parameter is deferred to Section 4.2.
In each panel among (a), (b) and (c), the reject region occupies the center of the simplex, where all class conditional probabilities are close to each other (i.e., the largest one is not large enough). Outside that area, some or all of the classes other than the dominating class appear unlikely and hence are ruled out. In this case, a rejection is not yielded by (4).
3.2 Classification with a Refinement Option
The previous subsection is built on the assumption that a reject option is necessary when an observation falls into the reject region, depicted in Figure 3, where all classes seem to be equally likely and it is difficult to distinguish one class from another. On the other hand, even if an observation is not in the reject region, it is not necessarily the case that a definite classification is desirable. This is the main point of the current subsection. In each of Panels (a)-(c) of Figure 3, outside the blue reject region, there are still areas where some confusion may occur between two classes. For example, many observations near the boundary between the black (class 1) and the red (class 2) regions are not likely to be from class 3, but we still have difficulty deciding between class 1 and class 2. A method that is only capable of yielding rejections cannot effectively avoid an expensive misclassification, which is very likely to happen in this situation. This naturally motivates a new refine option for multicategory classification, in which we may rule out class 3 and predict the observation to be from either class 1 or class 2. On one hand, we can avoid a potential misclassification by using a set of classes as the prediction; on the other hand, the set prediction provides additional information compared to what a rejection would provide (which is almost none).
The discussion above suggests that the complement of the reject region (the previous definite regions) be further partitioned into definite regions and refine regions. In Figure 3, for example, in addition to rejections, we should have (a) three definite regions where the prediction is a single class label, 1, 2 or 3, and (b) three refine regions where the prediction is a set of two classes, namely, $\{1, 2\}$, $\{1, 3\}$ or $\{2, 3\}$.
To this end, we review the results of Proposition 1: a rejection occurs (all the angle margins ) when the most plausible class cannot be distinguished from the least plausible one (since ); otherwise, the angle margin for the most plausible class is positive, the angle margins for some less plausible classes are zero, although the conditional probabilities of these classes are still close to that of (since ), and the angle margins for the implausible classes are all negative. Hence we may use the angle margins to define predictions, since they reflect the plausibility of a class label for an observation. A general guideline is that a positively large angle margin suggests a label prediction, the presence of some angle margins close to 0 and some angle margins negatively large suggests refinement (and ruling out those implausible), and the case of all angle margin close to 0 indicates rejection.
In reality, since the empirical angle margin may deviate from its theoretical counterpart in a finite sample problem, the gap between angle margins may not be obvious. In this case, we employ a soft-thresholding technique to distinguish significantly large and small angle margins. In particular, with the thresholded angle margins, our new classifier with both reject and refine options is defined as,
Note that the reject rule, the first line in (5), is identical to that in (4). This corresponds to the case that all angle margins are close to 0, implying that all the class conditional probabilities are close to each other. The second line attempts to find the most significantly large margin, and hence the most plausible class. In our numerical experience, we occasionally observe cases with multiple significantly large margins which are close to each other. In this case, we have chosen to include all those plausible classes (if any) as a set prediction. The third line corresponds to the case where the most plausible class is not significantly different from some other classes and we resort to ruling out those implausible classes (those with significantly negatively large margins) instead.
It can be seen that the union of the second and third cases in (5) is identical to the definite label prediction region in (4) (the second line therein). However, when an observation belongs to the third case in (5), the rule in (4) recklessly reports a single label as the prediction, while the novel refinement rule (5) here uses a set prediction. This is the main difference between the classifiers in (4) and (5).
For illustration, we plot the reject, refine and definite regions for a three-class problem on Panel (d) of Figure 3 for . It can be seen that the three refine (cyan) regions are cut from the previous definite regions in Panel (c) and hence the current definite regions are smaller than in (c) as well. More importantly, one may hold more confidence for a label prediction made by the new classifier (5). In Section 6, we demonstrate through numerical examples that the classification accuracy on the refine region in (5) can be significantly improved, compared to the classification accuracy on the counterpart of (4).
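The three-way rule with reject and refine options can be sketched from its verbal description: reject when all thresholded angle margins are zero, predict the class (or tied classes) with the largest positive thresholded margin when one exists, and otherwise rule out the classes with negative thresholded margins. The exact form of (5) is not reproduced here, so the code below is an assumption-laden reading of that description; `classify_with_refine`, `REJECT` and the tie tolerance `tie_tol` are hypothetical.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def soft_threshold(u, delta):
    """Shrink u toward zero by delta; values within [-delta, delta] become 0."""
    if u > delta:
        return u - delta
    if u < -delta:
        return u + delta
    return 0.0


def classify_with_refine(margins, delta, tie_tol=0.0):
    """Return REJECT, or a set of plausible class labels (1-based)."""
    t = [soft_threshold(m, delta) for m in margins]
    top = max(t)
    if top == 0.0 and min(t) == 0.0:
        return REJECT                      # all margins close to 0
    if top > 0.0:
        # Definite prediction; near-ties (within tie_tol) are all kept.
        return {j + 1 for j, v in enumerate(t) if v > 0.0 and top - v <= tie_tol}
    # No significantly large margin: keep classes not ruled out as implausible.
    return {j + 1 for j, v in enumerate(t) if v == 0.0}
```

A set of size one is a definite label, and a larger set is a refinement. For example, margins `[0.04, 0.03, -0.07]` with `delta = 0.05` yield `{1, 2}`: class 3 is ruled out, but classes 1 and 2 cannot be separated.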
4 Optimization and Tuning Parameter Selection
In this section, we discuss how to solve the optimization problem (3), from which both our methods (4) and (5) are derived. Various approaches are possible, depending on the choice of , and . For demonstration purpose, in this section we use the reversed hinge loss for . We let be the norm penalty in linear learning, and the squared norm penalty in kernel learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). For other cases with some general properties, such as one with a differentiable and a separable penalty function, one can solve (3) by the alternating direction method of multipliers (Boyd et al., 2011). We have developed fast implementations for our methods based on the hinge loss, the DWD loss and the Soft classifier loss (Liu et al., 2011). These algorithms will be publicly available in R.
We start our discussion from linear learning. Suppose for . Notice that we include the intercept terms in the ’s by catenating to . The penalty can be written as . The bent hinge loss can be decomposed as . After a series of introduction of Lagrangian multiplier and slack variables, and manipulations due to the KKT conditions (detailed derivations of the algorithms can be found in the Supplementary Materials), we can show that the optimization problem (3) is equivalent to
where Observe that the objective function is quadratic in terms of ’s and ’s, and the constraints are box constraints. Therefore, one can solve (4.1) via the very fast coordinate descent algorithm (Friedman et al., 2010). Moreover, as the objective function is quadratic, for each coordinate-wise update, the solution can be explicitly calculated. This greatly boosts the computational speed.
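To illustrate why a quadratic objective with box constraints admits closed-form coordinate updates, the sketch below solves a generic problem of the form $\min_x \tfrac{1}{2} x^T Q x + c^T x$ subject to $l \le x \le u$. This is a stand-in under stated assumptions, not the dual problem derived in the paper: $Q$ is taken to be symmetric with positive diagonal entries, and all names are hypothetical.

```python
def box_qp_coordinate_descent(Q, c, lower, upper, sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for a box-constrained quadratic program."""
    n = len(c)
    # Start from the point of the box closest to the origin.
    x = [min(max(0.0, lower[i]), upper[i]) for i in range(n)]
    for _ in range(sweeps):
        max_change = 0.0
        for i in range(n):
            # Exact one-dimensional minimizer with the other coordinates fixed.
            rest = sum(Q[i][j] * x[j] for j in range(n) if j != i)
            new_value = -(c[i] + rest) / Q[i][i]
            # Project back onto the box constraint for coordinate i.
            new_value = min(max(new_value, lower[i]), upper[i])
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:
            break
    return x
```

Each update costs one pass over a row of $Q$ and needs no line search, which is the property the text refers to when it says that every coordinate-wise solution can be calculated explicitly.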
Similarly, for kernel learning, we can use , , for kernel function , where the square norm penalty is , and is the th element of . In the same manner as above, one can derive a fast solution to this problem.
4.2 Tuning Parameter Selection
There are three tuning parameters in our methods, namely , and . Here is associated with the cost of rejection , where the latter should be fixed a priori. In the numerical study, we find that the choice of does not affect the result much, as long as . We recommend to try both and and use the one with a better result.
Parameter $\lambda$ restricts the model space over which the classifier is searched. Typically $\lambda$ is tuned over a grid of many candidate values. The optimal $\lambda$ is the one that minimizes the 0-d-1 loss on a separate tuning data set or via cross-validation.
Lastly, $\delta$ is a small positive constant used to distinguish significantly large and small angle margins. Similar to $\lambda$, we tune $\delta$ by choosing the value that leads to the smallest 0-d-1 loss on a separate tuning data set or via cross-validation. However, note that solving the optimization problem (3) to obtain $\hat{f}$ does not involve $\delta$; only the conversion from $\hat{f}$ to the classifier in (4) or (5) does. Hence tuning $\delta$ hardly adds to the computational cost.
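Because the threshold only enters when angle margins are converted into predictions, it can be tuned by re-scoring fixed margins on a tuning set. The sketch below illustrates this with the reject-only rule; the margins, labels and candidate grid are made-up inputs, and `tune_delta` and its helpers are hypothetical names.

```python
REJECT = "reject"  # hypothetical stand-in for the reject symbol


def classify_with_reject(margins, delta):
    """Reject if every angle margin lies within delta of 0, else pick the top class."""
    if all(abs(m) <= delta for m in margins):
        return REJECT
    return 1 + max(range(len(margins)), key=margins.__getitem__)


def tuning_risk(margin_rows, labels, d, delta):
    """Empirical 0-d-1 loss on the tuning set for one value of delta."""
    total = 0.0
    for margins, y in zip(margin_rows, labels):
        pred = classify_with_reject(margins, delta)
        total += d if pred == REJECT else float(pred != y)
    return total / len(labels)


def tune_delta(margin_rows, labels, d, grid):
    """Pick the grid value with the smallest tuning risk (first one on ties)."""
    return min(grid, key=lambda delta: tuning_risk(margin_rows, labels, d, delta))


# Fitted angle margins are computed once and reused for every candidate delta.
rows = [[0.6, -0.3, -0.3], [0.04, -0.01, -0.03], [-0.3, 0.5, -0.2]]
best = tune_delta(rows, labels=[1, 2, 2], d=0.2, grid=[0.0, 0.05, 0.1])
```

Here the second observation has margins close to zero and a wrong top class, so any positive threshold that turns it into a rejection lowers the tuning loss.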
5 Statistical Learning Theory
In this section, we first study the convergence rate of the excess $\ell$-risk under various settings. In particular, we study the cases of linear learning with $L_1$ and $L_2$ penalties, and kernel learning with the squared norm penalty. Then, we improve our results with an additional low noise assumption, analogous to Tsybakov’s margin condition (Tsybakov, 2004).
5.1 General Convergence Rate of the Excess $\ell$-Risk
In the literature, the excess $\ell$-risk for a learning procedure has been studied by many authors in different settings. See Zhang (2004) and Bartlett et al. (2006) for standard binary classification, Liu and Shen (2006), Wang and Shen (2007), and Zhang and Liu (2014) for multicategory classification, and Herbei and Wegkamp (2006), Wegkamp (2007), and Wegkamp and Yuan (2011) for binary classification with a reject option. We focus on the excess $\ell$-risk for multicategory classification with a reject option.
We first consider linear learning with a diverging number of predictors $p$ and a diverging number of classes $k$. In the statistical learning literature, it is becoming increasingly popular to let $p \to \infty$ as $n \to \infty$ (for example, Fan and Lü, 2008; Mai and Zou, 2012; Cai et al., 2014, among others). On the other hand, for classification problems, not much attention has been paid to the large $k$ situation. Recently, Gupta et al. (2014) studied classification problems with tens of thousands of classes. However, the theoretical properties of classifiers with diverging $k$ remain largely unknown.
First, we assume that each predictor is bounded within , though our theory can be generalized to cases where it is uniformly bounded. As the number of predictors and the number of classes diverge, we let the underlying distribution be defined on , where is the -field generated by open balls with the topology under the uniform metric , and is the power set of and hence a -field.
For linear learning, we have with , . We define . For the penalty, , and for the penalty, . Let be the full -dimensional model with classes. Recall that Let the best classification function be denoted by .
For any classification function , the excess -risk is defined as
We denote as the approximation error between and . Theorem 1 establishes the convergence rate of as .
Theorem 1.
Assume as . For linear learning with the penalty, , almost surely under . For the penalty, , almost surely under .
In Theorem 1, controls the balance between the estimation error, that is or , and the approximation error . As increases, decreases. The best trade off is one such that for the penalty, and for the penalty. The convergence of the excess -risk requires that and for the penalized method, and and for the method.
Theorem 1 suggests that classification with a large number of classes can be very difficult. This helps to shed some light on the usefulness of our refine option. In particular, if a set of class labels frequently appears in set predictions (for instance, see Examples 2 and 3 in Section 6), one can consider a refined classification problem (with labels restricted in the prediction set) and use a richer functional space if desired. Theorem 1 suggests that the new classifier can have better performance since the number of classes is smaller.
When is bounded, and the classification signal is sparse, Theorem 1 demonstrates the effectiveness of the method: it can be verified that if the true classification signal is sparse, then one can choose a large enough but fixed , such that the approximation error is . In other words, for some . In this case, Theorem 1 can be greatly simplified.
Corollary 1.
Assume that is bounded, and the true classification signal depends on finitely many predictors. Assume as . We can choose for all large , such that . Consequently, for the penalty, , almost surely under , and for the penalty, , almost surely under .
On the other hand, for as , we cannot have a fixed such that , even if the dimensionality is bounded. The next corollary considers a special situation where the number of true signal grows linearly with the number of classes. In this case, we can let , such that the approximation error is zero.
Corollary 2.
Consider any classification sub-problem where the label is restricted in , for any . Suppose that the classification signal for the restricted sub-problem depends on at most predictors, where is a fixed positive integer that is universal for all . Then for the complete problem with classes, one can choose with a fixed constant , such that the approximation error . Consequently, for the penalty, and for the penalty, almost surely under .
A common scenario in which the assumptions of Corollary 2 hold is when each class has its own identifying attributes, and the number of signature attributes for each class is uniformly bounded by . For instance, in cancer research, one may identify each cancer subtype with mutations on a small and non-overlapping group of feature genes. In this case, we can choose as a linear function of such that the approximation error is . Another insight of Corollary 2 is that when there is no noise variable, that is, when , we have that the performance of the and regularization methods is comparable since the corresponding estimation errors have the same convergence rate.
Next, we study the convergence rate of the excess -risk for kernel learning. To this end, we impose an assumption that the kernel is separable, and its corresponding kernel function is uniformly upper bounded. In other words, . Steinwart and Scovel (2007) and Blanchard et al. (2008), among others, used a similar assumption.
For kernel learning with the squared norm penalty, recall from Section 4.1 that the estimated classification functions are of the form with . We define , where . Note that the intercepts are included in the penalty. In the RKHS learning literature, many theoretical results are derived without the intercept term (Bousquet and Elisseeff, 2002; Chen et al., 2004; Steinwart and Christmann, 2008). Our theory can incorporate regularized intercepts in the classification functions, hence is more general. Let , and be defined analogously as in the linear learning case. The next theorem gives the convergence rate of the excess -risk for kernel learning.
Theorem 2.
Assume as . For RKHS learning, assume that the kernel is separable, and the corresponding kernel function is uniformly upper bounded. We then have , almost surely under .
In Theorem 2, the dimension of the predictors does not directly affect the estimation error . Instead, it is implicitly involved in the approximation error . This is because the proof of Theorem 2 relies on the complexity of the function space , in terms of its covering number (van der Vaart and Wellner, 2000). Note that Theorem 2 requires only that the kernel is separable and the kernel function is upper bounded, hence can be very general. On the other hand, if we restrict our consideration on a specific kernel, then more refined results can be obtained. For instance, many theoretical properties of the well known Gaussian kernel have been established. In Zhou (2002) and Steinwart and Scovel (2007), the relation between the covering number of the corresponding function space and has been obtained. Therefore, one can modify the proof of Theorem 2 and explore the explicit effect of on the estimation error accordingly.
So far, we have obtained the convergence rate of the estimation error for our classifiers. For linear learning and kernel learning, the rate can be close to the parametric rate , if and are negligible as . In the next section, we consider stronger assumptions, including a low noise assumption for multicategory classification problems. We show that faster rates are possible under these additional conditions.
5.2 Fast Rate under Low Noise Assumption
In the literature, many theoretical results have been established for binary SVMs with assumptions similar to Tsybakov’s margin condition (see Steinwart and Scovel, 2007; Bartlett and Wegkamp, 2008; Wegkamp and Yuan, 2011; Zhao et al., 2012, and the references therein). In this paper, we consider the margin condition in multicategory problems with a reject option for a general loss function in (2). We show that when the classification function is in certain RKHSs, for example the Gaussian kernel space, a faster rate of convergence of the excess -risk can be obtained.
Assumption 1 (Low noise assumption). For -class classification problems, we say that the distribution satisfies the margin condition at threshold level with exponent , if there exists a constant such that for all ,
Intuitively, under Assumption 1 with large , little probability mass is put around the boundary between the reject region and its complement. Thus, the classification signal is strong, and we expect that the estimation error can have a faster convergence rate. For binary SVM with a reject option, (7) reduces to the low noise assumption introduced in Bartlett and Wegkamp (2008) with .
Because we intend to consider a general loss function, we impose some minor restrictions on the loss and some assumptions on the distribution. The next assumption is needed to prevent from being too large, which yields a lower bound for the second order derivative of at the theoretical minimizer .
Assumption 2. The loss function in (2) is twice differentiable for . Furthermore, for any , the class conditional probability for any class is bounded away from . In other words, for a small and positive .
Theorem 3 improves the convergence rate under the new assumptions.
Hence, the estimation error can converge at a rate faster than . In particular, for a problem with fixed and for a non-diverging , if , then the rate can become arbitrarily close to .
We remark that for a differentiable loss function whose derivative is strictly positive for small , one may have if some goes to zero. From this perspective, Assumption 2 helps to bound . However, Assumption 2 may not be needed for some special loss functions. For example, if is the reversed hinge loss, or the reversed FLAME loss proposed by Qiao and Zhang (2015), we can drop Assumption 2 while the result in Theorem 3 remains valid. In general, if the loss function is flat for small enough , we can remove Assumption 2 from Theorem 3. See the proof and discussion of Theorem 3 in the Supplementary Materials for more details.
6 Numerical Studies
In this section, we study the numerical performance of our proposed classifiers (one with a reject option only, and one with both reject and refinement options). For classification problems with weak signals, we show that the empirical -- loss for classifiers with a reject option can be smaller than that for regular classifiers. Furthermore, we show that the refinement option can often provide refined set predictions with very high accuracy. Because of this reliable performance, in practical problems the refinement option can be used to identify classes that are highly confusable with each other, so that future tests can be dedicated to these classes for potential improvement in classification accuracy.
6.1 Method of Comparisons
For all numerical problems in the current section, we study the performance of regular classifiers, classifiers with only the reject option, and classifiers with both reject and refine options. There are three possible prediction outcomes: (definite) label predictions, (refined) set predictions and rejections. Different types of predictions are shown using different colors in Figure 4. It can be seen that different classifiers have different capacities: regular classifiers can only provide label predictions, while our classifiers with a refinement option can yield all three prediction types. We report classification performance on three disjoint subsets of observations, namely, , and . The three subsets are defined as the observations which are label predicted, set predicted and rejected, respectively, by the classifier with both reject and refine options.
We report the misclassification error for each observation subset for each classifier. No misclassification rate is reported for rejected observations. For observations that are refined by our classifier (), we report the mis-refinement rate, which is defined as the proportion of observations whose true class labels are not in the prediction sets. We also report the empirical -- loss for the whole test data set for each classifier, where we count cost for each misclassification or mis-refinement, and cost for each rejection. The proportions of , and are reported, since one may want to avoid large proportions of and unless necessary. We also report the proportion and mis-refinement rate for selected sets of class labels when they are of interest for the discussion.
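The reporting scheme above can be illustrated with a minimal sketch. It assumes an encoding that is not from the paper: `None` marks a rejection, a one-element set is a label prediction, and a larger set is a refined set prediction. The function name and the parameters `error_cost` and `reject_cost` are hypothetical placeholders for the two costs mentioned in the text; their values are supplied by the caller.

```python
from typing import Dict, Optional, Sequence, Set


def summarize_predictions(
    predictions: Sequence[Optional[Set[int]]],
    labels: Sequence[int],
    error_cost: float,
    reject_cost: float,
) -> Dict[str, float]:
    """Summarize label predictions, set predictions and rejections.

    Encoding (an assumption of this sketch): None is a rejection, a set with
    one element is a label prediction, a larger set is a set prediction.
    """
    n = len(labels)
    counts = {"label": 0, "set": 0, "reject": 0}
    errors = {"label": 0, "set": 0}
    total_cost = 0.0

    for pred, y in zip(predictions, labels):
        if pred is None:
            # Rejections incur the rejection cost and no error is recorded.
            counts["reject"] += 1
            total_cost += reject_cost
            continue
        kind = "label" if len(pred) == 1 else "set"
        counts[kind] += 1
        if y not in pred:
            # Misclassification (label) or mis-refinement (set): the true
            # class is not contained in the prediction.
            errors[kind] += 1
            total_cost += error_cost

    def rate(num: int, den: int) -> float:
        # Undefined when the subset is empty.
        return num / den if den else float("nan")

    return {
        "prop_label": counts["label"] / n,
        "prop_set": counts["set"] / n,
        "prop_reject": counts["reject"] / n,
        "misclassification_rate": rate(errors["label"], counts["label"]),
        "mis_refinement_rate": rate(errors["set"], counts["set"]),
        "empirical_loss": total_cost / n,
    }
```

Averaging the returned dictionary over replications would give summaries of the kind described in this section.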
For comparison purposes, we also use classifiers with probability estimation, and plug the estimates into the Bayes rule (in Proposition 2) to achieve a reject option. The proportions of the label predicted and rejected observations under this approach are calculated, and the misclassification rate for the label prediction set is reported. The overall empirical -- loss is reported as well.
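As a rough illustration of the plug-in approach, the sketch below assumes the rule takes the familiar threshold form: predict the most probable class when its estimated probability reaches a threshold, and reject otherwise. The exact rule of Proposition 2 is not restated in this section, so the function name and the `threshold` parameter are hypothetical.

```python
from typing import Optional, Sequence


def plug_in_reject(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the most probable class, or None to reject.

    Assumption of this sketch: rejection happens when the largest estimated
    class probability falls below `threshold`.
    """
    best = max(range(len(probs)), key=lambda j: probs[j])
    return best if probs[best] >= threshold else None
```

Applied to each test observation, the returned values give the label predicted and rejected subsets for this comparison method.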
We conduct 100 replications for each example and report the average.
We consider three simulated examples to assess the performance of the proposed methods. We focus on linear learning here and consider the Soft-LUM classifier loss (Soft; Liu et al., 2011), the DWD loss, and the SVM loss. Each loss is associated with one regular classifier, one with rejection only and one with both reject and refine options. Moreover, we implement the probability estimation method associated with Theorem 3 of Zhang and Liu (2014).
To select the best tuning parameters and , we choose from a candidate set the best pair that minimizes the empirical -- loss on a separate tuning data set, where consists of values, and . The multiplicative constant is used to scale for the magnitude of the angle margins. This is because when is large, a severe regularization is often needed, which would shrink the magnitude of (Zhang et al., 2013). In this case, using a fixed set of candidate values in could be suboptimal. Note that letting shuts off the reject and refine options. To illustrate the effect of on the reject and refinement results, we fit the classifiers with several values between and , but show the results for the best one only to save space. More details are included in the Supplementary Materials.
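The selection of the tuning pair amounts to an exhaustive search over a two-dimensional candidate grid. A minimal sketch follows, in which `penalty_values`, `option_values` and `tuning_loss` are hypothetical names: the caller provides the two candidate sets and a function that fits the classifier on the training data for a given pair and returns its empirical loss on the tuning data.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def select_tuning_pair(
    penalty_values: Iterable[float],
    option_values: Iterable[float],
    tuning_loss: Callable[[float, float], float],
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Return the candidate pair with the smallest loss on the tuning set.

    `tuning_loss(a, b)` is assumed to train with parameters (a, b) and
    evaluate the empirical loss on a separate tuning data set.
    """
    best_pair: Optional[Tuple[float, float]] = None
    best_loss = float("inf")
    for a, b in product(penalty_values, option_values):
        loss = tuning_loss(a, b)
        if loss < best_loss:  # ties keep the earlier candidate
            best_pair, best_loss = (a, b), loss
    return best_pair, best_loss
```

Any rescaling of the second candidate set, such as the multiplicative constant described above, would be applied by the caller before the search.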
A four-class example with equal prior probabilities. We first generate two covariates that determine the true class distributions. In particular,,
, are uniformly distributed in, , , and respectively. See the left panel of Figure 5 for a typical example on the first two dimensions. We then add 98 noise covariates. The training and tuning data sets are of size respectively, and the test data set is of size . In this example we let , and report the behavior of the Soft loss using the penalty only.