Standard classification tasks focus on building a classifier which predicts well on future examples; the overall goal is to minimize the number of misclassifications. However, when the cost of misclassification is very high, even a good generic classifier may incur very high risk on ambiguous examples. In such cases it makes more sense not to classify such examples at all. This choice given to the classifier is called the reject option, and classifiers which can also reject examples are called reject option classifiers. Rejection also has a cost, but it is much smaller than the cost of misclassification. For example, in the medical domain, a poor decision based on diagnostic reports can cost a huge amount of money in further treatments, or even a life. If the reports are ambiguous, or rare symptoms are seen which are unexplainable without further investigation, then the physician might choose not to risk misdiagnosing the patient, and might instead perform further medical tests or refer the case to an appropriate specialist. Reject option classifiers may also be found useful in financial services. Consider a banker looking at the loan application of a customer. He may choose not to decide on the basis of the information available, and instead ask for a credit bureau score or further recommendations from the stakeholders. In both settings, these actions can be viewed as a classifier refusing to return a prediction in order to avoid a potential misclassification. While the follow-up actions might vary (asking for more features to describe the example, or using a different classifier), the principal response in these cases is to “reject” the example. The focus of this paper is on learning a classifier with a reject option.
A reject option classifier can be viewed as a combination of a classifier and a rejection function. The rejection region impacts the proportion of examples that are likely to be rejected, as well as the proportion of predicted examples that are likely to be correctly classified. An optimal reject option classifier is one which minimizes both the rejection rate and the misclassification rate on the predicted examples.
Let $\mathcal{X}$ be the feature space and $\mathcal{Y}$ be the label space. Typically, for binary classification, we use $\mathcal{Y}=\{+1,-1\}$. Examples are generated from an unknown joint distribution $\mathcal{D}$ on the product space $\mathcal{X}\times\mathcal{Y}$. A typical reject option classifier is defined using a decision surface ($f:\mathcal{X}\rightarrow\mathbb{R}$) and a bandwidth parameter ($\rho\ge 0$). The bandwidth parameter determines the rejection region. Then, a reject option classifier is defined as:
$$h(f(x),\rho)=\begin{cases} 1 & \text{if } f(x)>\rho\\ \text{reject} & \text{if } |f(x)|\le\rho\\ -1 & \text{if } f(x)<-\rho.\end{cases}$$
A reject option classifier can thus be viewed as two parallel surfaces, with the area between them as the rejection region. The goal is to determine $f$ and $\rho$ simultaneously. The performance of a reject option classifier is measured using the loss function $L_d$ defined as:
$$L_d(yf(x),\rho)=\mathbb{I}_{\{yf(x)<-\rho\}}+d\,\mathbb{I}_{\{|f(x)|\le\rho\}},$$
where $d$ is the cost of rejection. If $d=0$, then the classifier will always reject; $d$ is therefore chosen in the range $(0,0.5)$. $L_d$ (described in equation (1)) has been shown to be infinite-sample consistent with respect to the generalized Bayes classifier (Yuan and Wegkamp, 2010). A reject option classifier is learnt by minimizing the risk, which is the expectation of $L_d$ with respect to the joint distribution $\mathcal{D}$. The risk under $L_d$ is minimized by the generalized Bayes discriminant (Chow, 1970), which is
$$f_d^*(x)=\begin{cases} 1 & \text{if } \eta(x)>1-d\\ \text{reject} & \text{if } d\le\eta(x)\le 1-d\\ -1 & \text{if } \eta(x)<d,\end{cases}$$
where $\eta(x)=P(Y=1\,|\,X=x)$. However, in general we do not know $\mathcal{D}$; we only have access to a finite set of examples drawn from $\mathcal{D}$, called the training set. We find the reject option classifier by minimizing the empirical risk. Minimizing the empirical risk under $L_d$ is computationally hard. To overcome this problem, convex surrogates of $L_d$ have been proposed. A generalized-hinge-based convex loss has been proposed for reject option classification (Bartlett and Wegkamp, 2008), along with an algorithm for minimizing the regularized risk under the generalized hinge loss. A sparse reject option classifier can be learnt by minimizing the $l_1$-regularized risk under the generalized hinge loss (Wegkamp and Yuan, 2011). In that approach, a classifier is first learnt by empirical risk minimization under the generalized hinge loss, and a threshold for rejection is learnt afterwards. Ideally, the classifier and the rejection threshold should be learnt simultaneously, so this approach might not give the optimal parameters. Also, only limited experimental results have been provided to show the effectiveness of the proposed approach (Wegkamp and Yuan, 2011). A cost-sensitive convex surrogate of $L_d$, called the double hinge loss, has been proposed in Grandvalet et al. (2008). The double hinge loss remains an upper bound to $L_d$ only under a very strict condition on its parameters. The approaches discussed so far learn a threshold for rejection along with the classifier. However, in general, the rejection region may not be symmetrically located near the classification boundary. A generic convex approach has been proposed which simultaneously learns the classifier as well as the rejection function (Cortes et al., 2016). The main challenge with convex surrogates is that, in contrast to $L_d$, they are not constant even in the reject region. Moreover, convex losses have been shown to perform poorly in the presence of label noise (Manwani and Sastry, 2013; Ghosh et al., 2015).
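These definitions can be made concrete with a short sketch in Python (the function names, and the convention that 0 denotes "reject", are ours); the final loop checks numerically that Chow's rule minimizes the conditional risk:

```python
def reject_classifier(fx, rho):
    """Reject option classifier h(f(x), rho): predict +1 / -1, or 0 for reject."""
    if fx > rho:
        return 1
    if fx < -rho:
        return -1
    return 0  # reject

def loss_0_d_1(y, fx, rho, d):
    """The 0-d-1 loss L_d: 1 for a misclassification, d for a rejection, else 0."""
    pred = reject_classifier(fx, rho)
    if pred == 0:
        return d
    return 0.0 if pred == y else 1.0

def generalized_bayes(eta, d):
    """Chow's rule: predict +1 if eta > 1-d, -1 if eta < d, otherwise reject."""
    if eta > 1 - d:
        return 1
    if eta < d:
        return -1
    return 0

# Brute-force check that Chow's rule minimizes the conditional risk:
# predicting +1 costs (1 - eta), predicting -1 costs eta, rejecting costs d.
d = 0.2
for eta in [i / 100 for i in range(101)]:
    risks = {1: 1 - eta, -1: eta, 0: d}
    assert risks[generalized_bayes(eta, d)] <= min(risks.values()) + 1e-12
```

At the boundary values $\eta\in\{d,1-d\}$ the rejection decision ties with a prediction, which is why the check allows equality.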
A non-convex formulation for learning a reject option classifier using the logistic function was proposed by Fumera and Roli (2002b). However, theoretical guarantees for the approach are not known, and only a very limited set of experiments is provided in its support. A bounded non-convex surrogate called the double ramp loss $L_{dr}$ was proposed in Manwani et al. (2015), along with a regularized risk minimization algorithm using $l_2$ regularization. The proposed approach was shown to have interesting geometric properties and robustness to label noise. However, the statistical properties of $L_{dr}$ have not been studied so far. Moreover, with $l_2$ regularization, it does not give sparse classifiers.
In this paper, we propose a sparse reject option classifier learning algorithm using the double ramp loss with an $l_1$ regularization term. The overall objective function is non-convex, but can be written as a difference of convex functions. We use difference of convex (DC) programming (Thi Hoai An and Dinh Tao, 1997) to minimize the regularized risk. The final algorithm amounts to solving successive linear programs. By sparseness, we mean that the number of support vectors needed to express the classifier is much smaller. We prove that the generalized Bayes classifier minimizes the risk under $L_{dr}$. We also derive an excess risk bound for $L_{dr}$. Finally, we show experimentally that the proposed approach is robust against label noise.
2 Proposed Approach
We propose a new algorithm for learning a reject option classifier which minimizes the $l_1$-regularized risk under the double ramp loss function $L_{dr}$ (Manwani et al., 2015). $L_{dr}$ is a bounded non-convex surrogate of $L_d$: it takes value 1 when an example is confidently misclassified, value $d$ inside the rejection band, and value 0 when the example is confidently correctly classified, with linear transitions between these plateaus, where $\mu>0$ controls the slope of the loss in the linear regions. Note that the shape of $L_{dr}$ depends on the specific choice of $\mu$. Also, for a valid reject region, we want $\rho\ge\mu$. Figure 1 shows the plot of $L_{dr}$ for different parameter values.
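The plateau structure can be sketched numerically. The parameterization below — two shifted ramp functions, $L_{dr}(t,\rho)=d\,R_\mu(t-\rho)+(1-d)\,R_\mu(t+\rho)$ with $R_\mu(z)=\min(1,\max(0,1-z/\mu))$ and margin $t=yf(x)$ — is our own reconstruction, consistent with the plateaus and slopes described above but not necessarily the paper's exact formula:

```python
def ramp(z, mu):
    """Ramp: 1 for z <= 0, linear with slope -1/mu on [0, mu], 0 for z >= mu."""
    return min(1.0, max(0.0, 1.0 - z / mu))

def double_ramp_loss(t, rho, d, mu):
    """Bounded non-convex surrogate of the 0-d-1 loss; t = y*f(x) is the margin."""
    return d * ramp(t - rho, mu) + (1.0 - d) * ramp(t + rho, mu)

d, rho, mu = 0.2, 1.0, 0.5   # rho >= mu gives a valid reject region

# Plateaus: loss 1 for confidently wrong, d inside the reject band, 0 when correct.
assert double_ramp_loss(-2.0, rho, d, mu) == 1.0   # t <= -rho: misclassified
assert double_ramp_loss(0.0, rho, d, mu) == d      # |t| small: rejection band
assert double_ramp_loss(2.0, rho, d, mu) == 0.0    # t >= rho + mu: confident, correct
# The surrogate is bounded in [0, 1] everywhere, unlike convex surrogates.
assert all(0.0 <= double_ramp_loss(t / 10, rho, d, mu) <= 1.0
           for t in range(-50, 51))
```

With this shape, $L_{dr}$ upper-bounds $L_d$ pointwise, which is the property used later in the excess risk analysis.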
2.1 Sparse Double Ramp SVM (SDR-SVM)
Let $S=\{(x_i,y_i)\}_{i=1}^{n}$ be the training set, where $x_i\in\mathcal{X}$ and $y_i\in\{+1,-1\}$. Let the reject option classifier be of the form $h(f(x),\rho)$. Let $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ be a Mercer kernel (continuous, symmetric and positive semi-definite), used to produce nonlinear classifiers, and let $\mathcal{H}_K$ be the reproducing kernel Hilbert space (RKHS) induced by the Mercer kernel $K$, with norm $\|\cdot\|_K$ (Aronszajn, 1950). To learn a sparse reject option classifier, we use an $l_1$ regularization term, and find the classifier by solving a regularized risk minimization problem over $\mathcal{H}_K$.
By the representer theorem, the optimal $f$ lies in a finite dimensional subspace of $\mathcal{H}_K$ (Scholkopf and Smola, 2001), so that $f(x)=\sum_{j=1}^{n}\alpha_j K(x,x_j)+b$. Given such an $f$, the $l_1$ regularizer is defined as $\Omega(f)=\sum_{j=1}^{n}|\alpha_j|$ (Wu and Zhou, 2005). Thus, the sparse double ramp SVM can be learnt by minimizing the following regularized risk:
$$\min_{\alpha,b,\rho}\ \hat{R}_{reg}(\alpha,b,\rho)=\sum_{i=1}^{n}L_{dr}(y_i f(x_i),\rho)+\lambda\,\Omega(f),$$
where $f(x_i)=\sum_{j=1}^{n}\alpha_j K(x_i,x_j)+b$ and $\lambda>0$ is the regularization parameter. We see that $\hat{R}_{reg}$ is a non-convex function. However, $\hat{R}_{reg}$ can be decomposed as a difference of two convex functions $g$ and $h$, as $\hat{R}_{reg}=g-h$ (each ramp in $L_{dr}$ is itself a difference of two hinge functions).
To minimize such a function which can be expressed as a difference of two convex functions, we can use difference of convex (DC) programming. In this case, DC programming is guaranteed to find a local optimum of the objective function (Thi Hoai An and Dinh Tao, 1997). The simplified DC algorithm uses the convexity of $h$ to build an upper bound of $\hat{R}_{reg}$ at the current iterate: for any subgradient $v^{(t)}\in\partial h(\theta^{(t)})$, convexity of $h$ gives $h(\theta)\ge h(\theta^{(t)})+\langle v^{(t)},\theta-\theta^{(t)}\rangle$, and hence $\hat{R}_{reg}(\theta)\le g(\theta)-h(\theta^{(t)})-\langle v^{(t)},\theta-\theta^{(t)}\rangle$, where $\theta^{(t)}$ is the parameter vector after iteration $t$. The next iterate $\theta^{(t+1)}$ is found by minimizing this upper bound. Since the bound is tight at $\theta^{(t)}$, we get $\hat{R}_{reg}(\theta^{(t+1)})\le g(\theta^{(t)})-h(\theta^{(t)})=\hat{R}_{reg}(\theta^{(t)})$. Thus, the DC program reduces the value of $\hat{R}_{reg}$ in every iteration. Now, we derive a DC algorithm for minimizing $\hat{R}_{reg}$. Given $\theta^{(t)}$, we find a subgradient $v^{(t)}$ of $h$ at $\theta^{(t)}$ and use the linearized objective:
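A minimal numerical illustration of the DC iteration on a toy scalar objective of our own choosing (not the paper's $\hat{R}_{reg}$): minimize $F(x)=x^2-2|x|$, a difference of the convex functions $g(x)=x^2$ and $h(x)=2|x|$. Linearizing $h$ at $x^{(t)}$ gives the convex upper bound $x^2-v^{(t)}x+\text{const}$ with $v^{(t)}\in\partial h(x^{(t)})$, whose exact minimizer is $x^{(t+1)}=v^{(t)}/2$:

```python
def F(x):
    """Toy DC objective F(x) = g(x) - h(x) with g(x) = x^2, h(x) = 2|x|."""
    return x * x - 2.0 * abs(x)

def dc_minimize(x0, iters=20):
    """DC (convex-concave) iterations: linearize h, minimize the convex bound."""
    x = x0
    for _ in range(iters):
        # Subgradient of h(x) = 2|x| at the current iterate.
        v = 2.0 if x > 0 else (-2.0 if x < 0 else 0.0)
        # Minimize the convex upper bound x^2 - v*x: derivative 2x - v = 0.
        x_next = v / 2.0
        if x_next == x:          # fixed point reached
            break
        x = x_next
    return x

# From any positive start, DC converges to the local minimum x = 1, F(1) = -1;
# each iteration never increases F, mirroring the descent property above.
x_star = dc_minimize(0.3)
assert x_star == 1.0 and F(x_star) == -1.0
assert F(dc_minimize(0.3, iters=1)) <= F(0.3)
```

The same monotone-descent argument is what makes the successive linear programs below safe: every DC step can only decrease $\hat{R}_{reg}$.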
where $v^{(t)}$ is a subgradient of $h$ at $\theta^{(t)}=(\alpha^{(t)},b^{(t)},\rho^{(t)})$. Note that the upper bound is tight at $\theta^{(t)}$. The new parameters $\theta^{(t+1)}$ are found by minimizing this bound subject to the problem constraints. Since both the hinge terms in $g$ and the $l_1$ regularizer are piece-wise linear in $(\alpha,b,\rho)$, the resulting subproblem can be written, using auxiliary slack variables, as a linear program.
Thus, each DC iteration reduces to solving a linear program, and the overall algorithm solves a sequence of linear programs to learn a sparse reject option classifier. The complete approach is described in Algorithm 1.
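The standard reformulation that makes each DC subproblem a linear program can be sketched as follows (the variable names $\alpha_j^{\pm}$, $\xi_i$ and the generic hinge offset $c_i$ are ours):

```latex
% l1 term: split each coefficient into nonnegative parts
\alpha_j = \alpha_j^{+} - \alpha_j^{-}, \qquad \alpha_j^{+},\,\alpha_j^{-} \ge 0,
\qquad \sum_{j=1}^{n}|\alpha_j| \;\longrightarrow\; \sum_{j=1}^{n}\bigl(\alpha_j^{+}+\alpha_j^{-}\bigr),
% hinge terms: one slack variable per example
\bigl[\,c_i - y_i f(x_i)\,\bigr]_{+} \;\longrightarrow\; \xi_i,
\qquad \xi_i \ge 0, \quad \xi_i \ge c_i - y_i f(x_i).
```

Since $f(x_i)$ is linear in $(\alpha,b)$, the objective and all constraints are linear in $(\alpha^{+},\alpha^{-},b,\rho,\xi)$, and at the optimum $\alpha_j^{+}\alpha_j^{-}=0$, so $\alpha_j^{+}+\alpha_j^{-}=|\alpha_j|$.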
In this section, we establish certain important properties of $L_{dr}$. We first show that $L_{dr}$ is classification consistent, which means that the minimizer of the risk under $L_{dr}$ is the generalized Bayes classifier.

Theorem 1. The generalized Bayes discriminant function $f_d^*$ (described in equation (3)) minimizes the risk under $L_{dr}$ over all measurable functions $f$.
Proof. Let $\eta=\eta(x)$ and $t=f(x)$. Thus, the conditional risk of $L_{dr}$ is $C(\eta,t)=\eta\,L_{dr}(t,\rho)+(1-\eta)\,L_{dr}(-t,\rho)$. The function $C(\eta,\cdot)$ can take different values in the different cases described in eq. (5).
From the above equations, we can say that
The minimum of a piece-wise linear function occurs either where its slope changes sign from negative to positive, or at a boundary point. Thus, the minimum of $C(\eta,\cdot)$ can occur only in three intervals of $t$, corresponding to predicting $-1$, rejecting, and predicting $+1$. Thus,
Now, comparing the values of $C(\eta,\cdot)$ over these three intervals: when $\eta>1-d$, the minimum is attained on the interval corresponding to predicting $+1$; when $\eta<d$, it is attained on the interval corresponding to predicting $-1$; and when $d\le\eta\le 1-d$, it is attained in the rejection band. Combining the above statements,
and the Bayes discriminant function for the double ramp loss will be
which has the same sign and rejection region as $f_d^*$. Therefore, $f_d^*$ minimizes the risk under $L_{dr}$. $\blacksquare$

We will now derive the excess risk bound for $L_{dr}$. We know that $L_d(yf(x),\rho)\le L_{dr}(yf(x),\rho)$ pointwise. This relation remains preserved when we take expectations on both sides, i.e., the risk under $L_d$ is bounded by the risk under $L_{dr}$. A similar relation also holds for the excess risk. To show this, we first define the following terms.
We know that $f_d^*$ attains the minimum risk under $L_d$ and, by Theorem 1, also under $L_{dr}$. Furthermore, we define
We observe the following relationship.
It can easily be seen that the graph of the conditional $L_{dr}$-risk is piece-wise linear in $t$. Therefore, its infimum occurs only at the corners of the graph. Thus, comparing the slopes of the different linear pieces, we get
We now analyze different cases as follows.
We know that
Now we can easily see that
Thus, in this case, we get
We again use the piece-wise linear property of the conditional risk, which can be written as
For further analysis, we can divide this case into two parts, according to the points at which the minimum function changes value.
Thus, combining the two parts, we get
Part 3: the corresponding term is expressed as
Using the piece-wise linearity argument again,
Now, we can again divide the analysis at the points where the minimum function changes its value.