In classification problems, one often encounters cases where it would be better for the classifier to take no decision and abstain from predicting rather than making a wrong prediction. For example, in the problem of medical diagnosis with inexpensive tests as features, a conclusive decision is good, but in the face of uncertainty it is better to not make a prediction and go for costlier tests.
For the case of binary actions, this problem has been called ‘classification with a reject option’ (Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Grandvalet et al., 2008; Fumera & Roli, 2002, 2004; Fumera et al., 2000, 2003; Golfarelli et al., 1997). Yuan and Wegkamp (2010)
show that many standard convex-optimization-based procedures for binary classification, such as logistic regression, least squares classification and exponential loss minimization (AdaBoost), yield consistent algorithms for this problem. But as Bartlett and Wegkamp (2008) show, the algorithm based on minimizing the hinge loss (SVM) requires a modification to be consistent. The suggested modification is rather simple: use a double hinge loss with three linear segments instead of the two segments of the standard hinge loss, where the ratio of the slopes of the two non-flat segments depends on the cost of abstaining.
In the case of multiclass classification, however, there exist no such results, and it is not straightforward to generalize the double hinge loss to this setting. To the best of our knowledge, there has been only empirical and heuristic work on the multiclass version of this problem (Zou et al., 2011; Simeone et al., 2012; Wu et al., 2007). In this paper, we give a formal treatment of the multiclass problem with a ‘reject’ option and provide consistent algorithms for this problem.
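The reject option is accommodated into the problem of $n$-class classification through the evaluation metric. We now seek a function $h:\mathcal{X}\to[n]\cup\{\bot\}$, where $\mathcal{X}$ is the instance space, the $n$ classes are denoted by $[n]=\{1,\ldots,n\}$, and $\bot$ denotes the action of abstaining, or the ‘reject’ option. The loss incurred by such a function on an example $(x,y)$ with $h(x)=t$ is given by
$$\ell^{\alpha}(y,t)=\begin{cases}1 & \text{if } t\neq y \text{ and } t\neq\bot\\ \alpha & \text{if } t=\bot\\ 0 & \text{if } t=y\end{cases}\tag{1}$$
where $\alpha$ denotes the cost of abstaining. We will call this loss the abstain($\alpha$) loss.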
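It can easily be shown that the Bayes optimal risk for the above loss is attained by the function $h^*_\alpha:\mathcal{X}\to[n]\cup\{\bot\}$ given by
$$h^*_\alpha(x)=\begin{cases}\arg\max_{y\in[n]}p_x(y) & \text{if } \max_{y\in[n]}p_x(y)\geq 1-\alpha\\ \bot & \text{otherwise}\end{cases}\tag{2}$$
where $p_x(y)=P(Y=y\mid X=x)$. The above can be seen as a natural extension of ‘Chow’s rule’ (Chow, 1970) for the binary case. It can also be seen that the interesting range of values for $\alpha$ is $[0,\frac{n-1}{n}]$, as for all $\alpha>\frac{n-1}{n}$ the Bayes optimal classifier for the abstain($\alpha$) loss never abstains. For example, in binary classification, only $\alpha\leq\frac12$ is meaningful, as higher values of $\alpha$ imply it is never optimal to abstain.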
For small $\alpha$, the classifier $h^*_\alpha$ acts as a high-confidence classifier and would be useful in applications like medical diagnosis. For example, if one wishes to learn a classifier for diagnosing an illness with $80\%$ confidence, and recommend further medical tests if it is not possible, the ideal classifier would be $h^*_{0.2}$, which is the minimizer of the abstain(0.2) loss. If $\alpha=\frac12$, the Bayes classifier $h^*_\alpha$ has a very appealing structure: a class is predicted only if the class has a simple majority. The abstain($\alpha$) loss is also useful in applications where a ‘greater than $1-\alpha$ conditional probability detector’ can be used as a black box. For example, a greater than $\frac12$ conditional probability detector plays a crucial role in hierarchical classification (Ramaswamy et al., 2015). (Details in supplementary material.)
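As an illustration, the thresholding rule behind the Bayes optimal classifier for the abstain($\alpha$) loss (predict the most probable class when its conditional probability is at least $1-\alpha$, and abstain otherwise) can be sketched in a few lines. The sketch assumes the conditional class probabilities are already available as a list; the function name `bayes_abstain_predict`, the 0-based class indices and the use of `None` for the reject option are illustrative choices, not part of the paper.

```python
from typing import Optional, Sequence


def bayes_abstain_predict(probs: Sequence[float], alpha: float) -> Optional[int]:
    """Thresholded prediction for the abstain(alpha) loss.

    probs: conditional class probabilities for one instance (0-based classes).
    alpha: cost of abstaining.
    Returns the most probable class, or None (abstain) when its probability
    is below 1 - alpha.
    """
    # Index of the most probable class; ties go to the smaller index.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= 1.0 - alpha else None


# With alpha = 0.2 a class is predicted only at 80% confidence or more.
confident = bayes_abstain_predict([0.85, 0.10, 0.05], alpha=0.2)
uncertain = bayes_abstain_predict([0.70, 0.20, 0.10], alpha=0.2)
```

Here `confident` is class `0`, while `uncertain` is `None` because the top probability 0.7 falls short of 0.8.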
As the Bayes classifier $h^*_\alpha$ depends only on the conditional distribution of $Y\mid X$, any algorithm that gives a consistent estimator of the conditional probability of the classes, e.g. minimizing the one vs all squared loss (Ramaswamy & Agarwal, 2012; Vernet et al., 2011), can be made into a consistent algorithm (with a suitable change in the decision) for this problem.
However, smooth surrogates that estimate the conditional probability do much more than what is necessary to solve this problem. Consistent piecewise linear surrogate minimizing algorithms, on the other hand, do only what is needed and can be expected to be more successful. For example, least squares classification, logistic regression and SVM are all consistent for standard binary classification, but the SVM (which minimizes a piecewise linear hinge loss surrogate) is arguably the most widely used method. Piecewise linear surrogates have other advantages as well, such as easier optimization and sparsity (in the dual). Hence finding consistent piecewise linear surrogates for the abstain loss is an important and interesting task.
We show that the $n$-dimensional multiclass surrogate of Crammer and Singer (Crammer & Singer, 2001) and the simple one vs all hinge surrogate loss (Rifkin & Klautau, 2004) both yield a consistent algorithm for the abstain loss. It is interesting to note that neither of these surrogates is consistent for the standard multiclass classification problem (Tewari & Bartlett, 2007; Lee et al., 2004; Zhang, 2004).
More interestingly, we construct a new convex piecewise linear surrogate, which we call the binary encoded predictions (BEP) surrogate, that operates on a $\log_2(n)$-dimensional space and yields a consistent algorithm for the $n$-class abstain loss. When optimized over comparable function classes, this algorithm is more efficient than the Crammer-Singer and one vs all algorithms because it only needs to find $\log_2(n)$ functions over the instance space, as opposed to $n$ functions. This result is surprising because it has been shown that one needs to minimize at least an $(n-1)$-dimensional convex surrogate to get a consistent algorithm for the standard $n$-class problem, i.e. without the reject option (Ramaswamy & Agarwal, 2012). Also, the only known generic way of generating consistent surrogate minimizing algorithms for a given loss matrix (Ramaswamy & Agarwal, 2012), when applied to the $n$-class abstain loss, would give an $n$-dimensional surrogate here.
It is important to note the role of $\alpha$, the cost of abstaining. While conditional probability estimation based surrogates can be used for designing consistent algorithms for the $n$-class problem with the reject option with any $\alpha$, the Crammer-Singer surrogate, the one vs all hinge and the BEP surrogate and their corresponding variants all yield consistent algorithms only for $\alpha\leq\frac12$. While this may seem restrictive, we contend that these form an interesting and useful set of problems to solve. We also suspect that abstain($\alpha$) problems with $\alpha>\frac12$ are fundamentally more difficult than those with $\alpha\leq\frac12$, for the reason that evaluating the Bayes classifier can be done for $\alpha\leq\frac12$ without finding the maximum conditional probability: just check if any class has conditional probability greater than $1-\alpha$, as there can only be one. This is also evidenced by the more complicated partitions of the simplex induced by the Bayes optimal classifier for $\alpha>\frac12$, as shown in Figure 1.
We start with some preliminaries and notation in Section 2. In Section 3 we give excess risk bounds relating the excess Crammer-Singer multiclass surrogate risk and one vs all hinge surrogate risk to the excess abstain risk. In Section 4 we give our $\log_2(n)$-dimensional BEP surrogate, and give similar excess risk bounds. In Section 5, we frame the learning problem with the BEP surrogate as an optimization problem, derive its dual and give a block co-ordinate descent style algorithm for solving it. In Section 6 we give generalizations of the Crammer-Singer, one vs all hinge and BEP surrogates that are consistent for the abstain($\alpha$) loss for $\alpha\leq\frac12$. In Section 7 we include experimental results for all three algorithms. We conclude in Section 8 with a summary.
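Let the instance space be $\mathcal{X}$, the finite set of class labels be $\mathcal{Y}=[n]=\{1,\ldots,n\}$, and the finite set of target labels be given by $\mathcal{T}=[n]\cup\{\bot\}$. Given training examples $(X_1,Y_1),\ldots,(X_m,Y_m)$ drawn i.i.d. from a distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, the goal is to learn a prediction model $h:\mathcal{X}\to\mathcal{T}$.
For any given $\alpha$, the performance of a prediction model $h$ is measured via the abstain($\alpha$) loss $\ell^\alpha$ from Equation (1). $\ell^\alpha(y,t)$ denotes the loss incurred on predicting $t$ when the truth is $y$. We will find it convenient to represent the loss function $\ell^\alpha$ as a loss matrix with elements $\ell^\alpha(y,t)$ for $y\in[n]$, $t\in\mathcal{T}$, and column vectors $\boldsymbol{\ell}^\alpha_t=(\ell^\alpha(1,t),\ldots,\ell^\alpha(n,t))^\top$ for $t\in\mathcal{T}$. The abstain($\alpha$) loss matrix and a schematic representation of the Bayes classifier for various values of $\alpha$, given by Equation (2), are given in Figure 1 for $n=3$.
Specifically, the goal is to learn a model $h$ with low expected loss or $\ell^\alpha$-error
$$\mathrm{er}^{\ell^\alpha}_D[h]=\mathbf{E}_{(X,Y)\sim D}\big[\ell^\alpha(Y,h(X))\big].$$
Ideally, one wants the $\ell^\alpha$-error of the learned model to be close to the optimal $\ell^\alpha$-error
$$\mathrm{er}^{\ell^\alpha,*}_D=\inf_{h:\mathcal{X}\to\mathcal{T}}\mathrm{er}^{\ell^\alpha}_D[h].$$
An algorithm, which outputs a (random) model $h_m$ on being given a random training sample as above, is said to be consistent w.r.t. $\ell^\alpha$ if the $\ell^\alpha$-error of the learned model $h_m$ converges in probability to the optimal for all distributions $D$: $\mathrm{er}^{\ell^\alpha}_D[h_m]\xrightarrow{P}\mathrm{er}^{\ell^\alpha,*}_D$. Here the convergence in probability is over the learned classifier $h_m$ as a function of the training sample distributed i.i.d. according to $D$.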
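The loss-matrix view above can be made concrete with a small sketch: build the matrix with one row per true class and one column per action (the last column being the reject option), then pick the action with the smallest expected loss under a given class-probability vector. The helper names and the 0-based indexing are illustrative assumptions; the point is only that minimizing the expected loss column-wise reproduces the thresholding behaviour of the Bayes classifier.

```python
def abstain_loss_matrix(n, alpha):
    """n x (n+1) loss matrix: row y is the true class, columns 0..n-1 are
    class predictions and column n is the abstain action."""
    return [[0.0 if t == y else 1.0 for t in range(n)] + [alpha]
            for y in range(n)]


def expected_losses(p, loss):
    """Expected loss of every action under class probabilities p."""
    num_actions = len(loss[0])
    return [sum(p[y] * loss[y][t] for y in range(len(p)))
            for t in range(num_actions)]


def bayes_action(p, loss):
    """Action with the smallest expected loss (ties go to the smaller index)."""
    losses = expected_losses(p, loss)
    return min(range(len(losses)), key=lambda t: losses[t])


L = abstain_loss_matrix(3, alpha=0.5)
dominant = bayes_action([0.7, 0.2, 0.1], L)       # a class has majority
no_majority = bayes_action([0.4, 0.35, 0.25], L)  # no class has majority
```

With $n=3$ and $\alpha=\frac12$, `dominant` is class `0`, and `no_majority` is action `3`, the abstain column, since every class prediction has expected loss above 0.5.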
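However, minimizing the discrete $\ell^\alpha$-error directly is computationally difficult; therefore one uses instead a surrogate loss function $\psi:\mathcal{Y}\times\mathbb{R}^d\to\mathbb{R}_+$ (where $\mathbb{R}_+=[0,\infty)$), for some $d\in\mathbb{Z}_+$, and learns a model $f:\mathcal{X}\to\mathbb{R}^d$ by minimizing (approximately, based on the training sample) the $\psi$-error
$$\mathrm{er}^{\psi}_D[f]=\mathbf{E}_{(X,Y)\sim D}\big[\psi(Y,f(X))\big].$$
Predictions on new instances $x\in\mathcal{X}$ are then made by applying the learned model $f$ and mapping back to predictions in the target space $\mathcal{T}$ via some mapping $\mathrm{pred}:\mathbb{R}^d\to\mathcal{T}$, giving $h=\mathrm{pred}\circ f$.
Under suitable conditions, algorithms that approximately minimize the $\psi$-error based on a training sample are known to be consistent with respect to $\psi$, i.e. to converge in probability to the optimal $\psi$-error
$$\mathrm{er}^{\psi,*}_D=\inf_{f:\mathcal{X}\to\mathbb{R}^d}\mathrm{er}^{\psi}_D[f].$$
Also, when $\psi$ is convex in its second argument, the resulting optimization problem is convex and can be efficiently solved.
Hence, we seek a surrogate $\psi$ and a predictor $\mathrm{pred}$, with $\psi$ convex over its second argument, and satisfying a bound of the following form holding for all $f:\mathcal{X}\to\mathbb{R}^d$ and distributions $D$
$$\mathrm{er}^{\ell^\alpha}_D[\mathrm{pred}\circ f]-\mathrm{er}^{\ell^\alpha,*}_D\leq\xi\big(\mathrm{er}^{\psi}_D[f]-\mathrm{er}^{\psi,*}_D\big)$$
where $\xi:\mathbb{R}\to\mathbb{R}$ is increasing, continuous at $0$ and $\xi(0)=0$. A surrogate and a predictor $(\psi,\mathrm{pred})$ satisfying such a bound, known as an excess risk transform bound, would immediately give an algorithm consistent w.r.t. $\ell^\alpha$ from an algorithm consistent w.r.t. $\psi$. We derive such bounds w.r.t. the abstain($\frac12$) loss $\ell^{1/2}$ for the Crammer-Singer surrogate, the one vs all hinge surrogate, and the BEP surrogate, with $\xi$ as a linear function.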
3 Excess Risk Bounds for the Crammer-Singer and One vs All Hinge Surrogates
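In this section we give an excess risk bound relating the abstain($\frac12$) loss to the Crammer-Singer surrogate $\psi^{CS}$ (Crammer & Singer, 2001), and also to the one vs all hinge loss $\psi^{OVA}$.
Define the surrogate $\psi^{CS}:[n]\times\mathbb{R}^n\to\mathbb{R}_+$ and predictor $\mathrm{pred}^{CS}_\tau:\mathbb{R}^n\to[n]\cup\{\bot\}$ as
$$\psi^{CS}(y,u)=\Big(\max_{j\neq y}u_j-u_y+1\Big)_+$$
$$\mathrm{pred}^{CS}_\tau(u)=\begin{cases}\arg\max_{i\in[n]}u_i & \text{if } u_{(1)}-u_{(2)}>\tau\\ \bot & \text{otherwise}\end{cases}$$
where $(a)_+=\max(a,0)$, $u_{(i)}$ is the $i$th element of the components of $u$ when sorted in descending order, and $\tau\in(0,1)$ is a threshold parameter.
We proceed further and also define the surrogate and predictor for the one vs all hinge loss. The surrogate $\psi^{OVA}:[n]\times\mathbb{R}^n\to\mathbb{R}_+$ and predictor $\mathrm{pred}^{OVA}_\tau:\mathbb{R}^n\to[n]\cup\{\bot\}$ are defined as
$$\psi^{OVA}(y,u)=\sum_{i=1}^{n}\Big(\mathbf{1}(y=i)\,(1-u_i)_+ + \mathbf{1}(y\neq i)\,(1+u_i)_+\Big)$$
$$\mathrm{pred}^{OVA}_\tau(u)=\begin{cases}\arg\max_{i\in[n]}u_i & \text{if } \max_{j}u_j>\tau\\ \bot & \text{otherwise}\end{cases}$$
where $(a)_+=\max(a,0)$ and $\tau\in(-1,1)$ is a threshold parameter, and ties are broken arbitrarily, say, in favor of the label with the smaller index.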
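A minimal sketch of the two surrogate/predictor pairs may help fix ideas. It assumes the standard forms of these losses: the Crammer-Singer surrogate as the hinge of the largest competing score minus the true-class score plus one, with abstention when the gap between the two largest scores is at most $\tau$; and the one vs all hinge as a sum of binary hinge losses, with abstention when the largest score is at most $\tau$. Function names, 0-based labels and `None` for the reject option are illustrative.

```python
def hinge(a):
    """(a)_+ = max(a, 0)."""
    return max(a, 0.0)


def cs_surrogate(y, u):
    """Crammer-Singer surrogate: hinge of (best competing score - u[y] + 1)."""
    best_other = max(u[j] for j in range(len(u)) if j != y)
    return hinge(best_other - u[y] + 1.0)


def cs_predict(u, tau):
    """Predict the top class if it beats the runner-up by more than tau."""
    order = sorted(range(len(u)), key=lambda i: -u[i])  # descending scores
    top, second = order[0], order[1]
    return top if u[top] - u[second] > tau else None


def ova_surrogate(y, u):
    """One vs all hinge: one binary hinge loss per class."""
    return sum(hinge(1.0 - ui) if i == y else hinge(1.0 + ui)
               for i, ui in enumerate(u))


def ova_predict(u, tau):
    """Predict the top class if its score exceeds tau (ties: smaller index)."""
    top = max(range(len(u)), key=lambda i: u[i])
    return top if u[top] > tau else None


scores = [1.0, 0.0, -1.0]
cs_loss_true0 = cs_surrogate(0, scores)    # margin satisfied
cs_pred = cs_predict(scores, tau=0.5)      # gap of 1.0 exceeds tau
cs_reject = cs_predict([0.3, 0.0, -1.0], tau=0.5)
```

In this example `cs_loss_true0` is `0.0`, `cs_pred` is class `0`, and `cs_reject` is `None` because the gap of 0.3 does not exceed the threshold.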
Let , and . Then for all
Remark: It has been pointed out previously by Zhang (2004) that if the data distribution is such that $\max_{y}p_x(y)>\frac12$ for all $x$, the Crammer-Singer surrogate and the one vs all hinge loss are consistent with the zero-one loss when used with the standard argmax predictor. Our Theorem 1 implies the above observation. However, it also gives more: in the case that the distribution does not satisfy the dominant class assumption, the model learned by using the surrogate and predictor $(\psi^{CS},\mathrm{pred}^{CS}_\tau)$ or $(\psi^{OVA},\mathrm{pred}^{OVA}_\tau)$ asymptotically still gives the right answer for instances having a dominant class, and fails in a graceful manner by abstaining for instances that do not have a dominant class.
4 Excess Risk Bounds for the BEP Surrogate
The Crammer-Singer surrogate and the one vs all hinge surrogate, just like surrogates designed for conditional probability estimation, are defined over an $n$-dimensional domain. Thus any algorithm that minimizes these surrogates must learn $n$ real valued functions over the instance space. In this section, we construct a $\lceil\log_2(n)\rceil$-dimensional convex surrogate, which we call the binary encoded predictions (BEP) surrogate, and give an excess risk bound relating this surrogate and the abstain loss. In particular these results show that the BEP surrogate is calibrated w.r.t. the abstain loss; this in turn implies that the convex calibration dimension (CC-dimension) (Ramaswamy & Agarwal, 2012) of the abstain loss is at most $\lceil\log_2(n)\rceil$.
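For the purpose of simplicity let us assume $n=2^d$ for some positive integer $d$. (If $n$ is not a power of $2$, just add enough dummy classes that never occur.) Let $B:[n]\to\{+1,-1\}^d$ be any one-one and onto mapping, with an inverse mapping $B^{-1}:\{+1,-1\}^d\to[n]$. Define the BEP surrogate $\psi^{BEP}:[n]\times\mathbb{R}^d\to\mathbb{R}_+$ and its corresponding predictor $\mathrm{pred}^{BEP}_\tau:\mathbb{R}^d\to[n]\cup\{\bot\}$ as
$$\psi^{BEP}(y,u)=\Big(\max_{j\in[d]}B_j(y)\,u_j+1\Big)_+$$
$$\mathrm{pred}^{BEP}_\tau(u)=\begin{cases}\bot & \text{if } \min_{i\in[d]}|u_i|\leq\tau\\ B^{-1}(\mathrm{sign}(-u)) & \text{otherwise}\end{cases}$$
where $\mathrm{sign}(u)$ is the sign of $u$, with $\mathrm{sign}(0)=1$, and $\tau\in(0,1)$ is a threshold parameter.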
Define the sets , where . Which evaluates to
To make the above definition clear we will see how the surrogate and predictor look like for the case of and . We have . Let us fix the mapping such that is the standard -bit binary representation of , with in the place of . Then we have,
Figure 2 gives the partition induced by the predictor .
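The binary-encoding idea can be sketched for small $n$ as follows. The sketch assumes the encoding described above (the $d$-bit binary representation of $y-1$, with $-1$ in the place of $0$), a surrogate of the form $(\max_j B_j(y)u_j+1)_+$, and a predictor that abstains when any coordinate has magnitude at most $\tau$ and otherwise decodes the sign pattern of $-u$. These forms, the most-significant-bit-first ordering and all function names are assumptions made for illustration.

```python
def encode(y, d):
    """d-bit binary code of y - 1 (y is 1-based), with -1 in place of 0.
    Most significant bit first (an illustrative convention)."""
    bits = [((y - 1) >> (d - 1 - j)) & 1 for j in range(d)]
    return [1 if b else -1 for b in bits]


def decode(code):
    """Inverse of encode: map a +1/-1 code back to a 1-based class."""
    value = 0
    for c in code:
        value = 2 * value + (1 if c == 1 else 0)
    return value + 1


def bep_surrogate(y, u):
    """(max_j B_j(y) * u_j + 1)_+ ; it is zero when u = -B(y)."""
    code = encode(y, len(u))
    return max(max(c * uj for c, uj in zip(code, u)) + 1.0, 0.0)


def bep_predict(u, tau):
    """Abstain (None) if some coordinate is within tau of zero,
    otherwise decode the sign pattern of -u."""
    if min(abs(uj) for uj in u) <= tau:
        return None
    return decode([1 if -uj >= 0 else -1 for uj in u])


# n = 4 classes, d = 2 coordinates.
pred_confident = bep_predict([1.0, 1.0], tau=0.5)
pred_reject = bep_predict([1.0, 0.2], tau=0.5)
loss_class1 = bep_surrogate(1, [1.0, 1.0])
```

For $n=4$, the point $u=(1,1)$ decodes to class `1` and incurs zero surrogate loss for that class, while $u=(1,0.2)$ is rejected because its second coordinate is too close to zero.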
The following is the main result of this section; its proof is in Appendix C.
Let and . Let . Then for all
Remark: The excess risk bounds for the CS, OVA, and BEP surrogates suggest that $\tau=\frac12$ is the best choice for the CS and BEP surrogates, while $\tau=0$ is the best choice for the OVA surrogate. However, intuitively $\tau$ is the threshold converting confidence values to predictions, and so it makes sense to use $\tau$ values closer to $0$ (or $-1$ in the case of OVA) to predict aggressively in low-noise situations, and use larger $\tau$ to predict conservatively in noisy situations. Practically, it makes sense to choose the parameter $\tau$ via cross-validation.
5 BEP Surrogate Optimization Algorithm
In this section we frame the problem of finding the linear (vector valued) function that minimizes the BEP surrogate loss over a training set , with and , as a convex optimization problem. Once again, for simplicity we assume that the size of the label space for some . The primal and dual of the resulting optimization problem with a norm squared regularizer is given below:
We optimize the dual as it can be easily extended to work with kernels. The structure of the constraints in the dual lends itself easily to a block co-ordinate ascent algorithm, where we optimize over and fix every other variable in each iteration. Such methods have been recently proven to have exponential convergence rate for SVM-type problems (Wang & Lin, 2014), and we expect results of those type to apply to our problem as well.
The problem to be solved at every iteration reduces to a projection of a vector on to the set , where is such that . The projection problem is a simple variant of projecting a vector on the ball of radius , which can be solved efficiently in time (Duchi et al., 2008). The vector is such that for any ,
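For reference, the plain Euclidean projection onto an $\ell_1$ ball can be computed with the sorting-based routine of Duchi et al. (2008), sketched below. This is the standard projection only, not the variant needed by the block update in this section, which the text describes as a simple modification of this problem; the function name and the `radius` argument are illustrative assumptions.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} using the
    sorting-based method (O(d log d) for a d-dimensional vector)."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball

    # Find the soft-threshold theta from the sorted magnitudes.
    sorted_mags = sorted(magnitudes, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, m in enumerate(sorted_mags, start=1):
        running_sum += m
        candidate = (running_sum - radius) / j
        if m - candidate > 0:
            theta = candidate  # the largest such j gives the final theta

    # Shrink every coordinate toward zero by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


projected = project_onto_l1_ball([3.0, 1.0], radius=2.0)
```

For the input `[3.0, 1.0]` and radius 2, the threshold works out to 1, so the projection is `[2.0, 0.0]`.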
6 Abstain($\alpha$) Loss for $\alpha\neq\frac12$
The excess risk bounds derived for the CS, OVA hinge and BEP surrogates apply only to the abstain($\frac12$) loss. But it is possible to derive such excess risk bounds for abstain($\alpha$) with $\alpha\leq\frac12$ with slight modifications to the CS, OVA and BEP surrogates.
Define , and , with as
where, and is any bijection. Note that , and .
Let and . Let . Then for all ,
Remark: When $n=2$, the Crammer-Singer surrogate, the one vs all hinge and the BEP surrogate all reduce to the hinge loss, and $\alpha$ is restricted to be at most $\frac12$ to ensure the relevance of the abstain option. Applying the above extension for $\alpha\leq\frac12$ to the hinge loss, we get the ‘generalized hinge loss’ of Bartlett and Wegkamp (2008).
7 Experimental Results
In this section we give our experimental results for the proposed algorithms on both synthetic and real datasets.
7.1 Synthetic Data
We optimize the Crammer-Singer surrogate, the one vs all hinge surrogate and the BEP surrogate over appropriate kernel spaces on a 2-dimensional, 8-class synthetic data set, and show that the abstain loss incurred by the trained model for all three algorithms approaches the Bayes optimal under various thresholds.
The dataset we used was generated as follows. We randomly sample 8 prototype vectors, with each
drawn independently from a zero mean, unit variance 2D Gaussian distribution. These 8 prototype vectors correspond to the 8 classes. Each example is generated by first picking from one of the 8 classes uniformly at random, and the instance is set as , where is independently drawn from . We generated 12800 such pairs for training, and another 10000 instances for testing.
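A data set of this kind can be generated along the following lines. The sketch assumes that each instance is its class prototype plus independent Gaussian noise; the exact noise form and scale used in the experiments are not reproduced here, so `noise_scale` is a hypothetical parameter, as are the function name and the fixed seed.

```python
import random


def make_synthetic(num_examples, num_classes=8, noise_scale=1.0, seed=0):
    """Sample 2D class prototypes from a standard Gaussian, then draw
    (instance, label) pairs as prototype + noise_scale * Gaussian noise.

    noise_scale is an assumed knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    prototypes = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                  for _ in range(num_classes)]
    data = []
    for _ in range(num_examples):
        y = rng.randrange(num_classes)  # class picked uniformly at random
        px, py = prototypes[y]
        x = (px + noise_scale * rng.gauss(0.0, 1.0),
             py + noise_scale * rng.gauss(0.0, 1.0))
        data.append((x, y))
    return prototypes, data


prototypes, train = make_synthetic(12800, seed=0)
_, test = make_synthetic(10000, seed=1)
```

Note that, as written, the second call draws fresh prototypes from a different seed; to evaluate on the same problem one would reuse `prototypes` for the test set, which is left out to keep the sketch short.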
The CS, OVA and BEP surrogates were all optimized over a reproducing kernel Hilbert space (RKHS) with a Gaussian kernel and the standard norm-squared regularizer. The kernel width parameter and the regularization parameter were chosen by grid search using a separate validation set. (We used Joachims’ SVM-light package (Joachims, 1999) for the OVA and CS algorithms.)
As Figure 3 indicates, the expected abstain risk incurred by the trained model approaches the Bayes risk with increasing training data for all three algorithms and intermediate $\tau$ values. The excess risk bounds in Theorems 1 and 2 break down when the threshold parameter $\tau\in\{0,1\}$ for the CS and BEP surrogates, and when $\tau\in\{-1,1\}$ for the OVA surrogate. This is supported by the observation that, in Figure 3, the curves corresponding to these thresholds perform poorly. In particular, using $\tau=0$ for the CS and BEP algorithms implies that the resulting algorithms never abstain.
Though all three surrogate minimizing algorithms we consider are consistent w.r.t. the abstain loss, we find that the BEP and OVA algorithms use less computation time and fewer samples than the CS algorithm to attain the same error. However, the BEP surrogate performs poorly when optimized over a linear function class (experiments not shown here), due to its much more restricted representation power.
7.2 Real Data
We ran experiments on real multiclass datasets from the UCI repository, the details of which are in Table 1. For each of these datasets, if a train/test split is not indicated in the dataset, we make one ourselves by splitting at random.
[Table of dataset details; columns: # Train, # Test, # Feat, # Class]
All three algorithms (CS, OVA and BEP) were optimized over an RKHS with a Gaussian kernel and the standard norm-squared regularizer. The kernel width and regularization parameters were chosen through validation: 10-fold cross-validation in the case of the satimage, yeast, vehicle and image datasets, and a 75-25 split of the train set into train and validation for the letter and covertype datasets. For simplicity we set $\tau=0$ (or $\tau=-1$ for OVA) during the validation phase.
The results of the experiment with the CS, OVA and BEP algorithms are given in Table 2. The rejection rate is fixed at some given level by choosing the threshold $\tau$ for each algorithm and dataset appropriately. As can be seen from the table, the BEP algorithm’s performance is comparable to that of the OVA algorithm, and is better than that of the CS algorithm. Moreover, Table 3, which gives the training times for the algorithms, reveals that the BEP algorithm runs the fastest, thus making the BEP algorithm a good option for large datasets. The main reason for the observed speedup of the BEP is that it learns only $\log_2(n)$ functions for an $n$-class problem, and hence the speedup factor of the BEP over the OVA would potentially be better for larger $n$.
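One simple way to fix the rejection rate at a given level is to pick the threshold as an empirical quantile of the confidence values on a held-out set, as sketched below. The confidence value would be, for instance, the gap between the two largest scores for CS, the largest score for OVA, or the smallest coordinate magnitude for BEP. The function name and the convention of rejecting when the confidence is at most the threshold are assumptions for illustration, not the procedure used to produce the tables.

```python
def threshold_for_rejection_rate(confidences, rate):
    """Return a threshold tau such that rejecting every example with
    confidence <= tau rejects about `rate` of the examples
    (ties in the confidence values can push the realized rate higher)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    ranked = sorted(confidences)
    num_rejected = int(rate * len(ranked))
    if num_rejected == 0:
        return float("-inf")  # nothing is rejected
    return ranked[num_rejected - 1]


held_out_confidences = [0.9, 0.1, 0.5, 0.3]
tau = threshold_for_rejection_rate(held_out_confidences, rate=0.5)
realized = sum(c <= tau for c in held_out_confidences) / len(held_out_confidences)
```

On this toy input the threshold is `0.3` and exactly half of the examples are rejected.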
The multiclass classification problem with a reject option is a powerful abstraction that captures controlling the uncertainty of the classifier and is very useful in applications like medical diagnosis. We formalized this problem via an evaluation metric, called the abstain loss, and gave excess risk bounds relating the abstain loss to the Crammer-Singer surrogate, the one vs all hinge surrogate and also to the BEP surrogate, which is a new surrogate that operates on a much smaller dimension. Extending these results to other such evaluation metrics, in particular the abstain($\alpha$) loss for $\alpha>\frac12$, is an interesting future direction.
- Bartlett & Wegkamp (2008) Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.
- Chow (1970) Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16:41–46, 1970.
- Crammer & Singer (2001) Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
- Duchi et al. (2008) Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008.
- Fumera & Roli (2002) Fumera, G. and Roli, F. Support vector machines with embedded reject option. Pattern Recognition with Support Vector Machines, pp. 68–82, 2002.
- Fumera & Roli (2004) Fumera, G. and Roli, F. Analysis of error-reject trade-off in linearly combined multiple classifiers. Pattern Recognition, 37:1245–1265, 2004.
- Fumera et al. (2000) Fumera, G., Roli, F., and Giacinto, G. Reject option with multiple thresholds. Pattern Recognition, 33:2099–2101, 2000.
- Fumera et al. (2003) Fumera, G., Pillai, I., and Roli, F. Classification with reject option in text categorisation systems. In IEEE International Conference on Image Analysis and Processing, pp. 582–587, 2003.
- Golfarelli et al. (1997) Golfarelli, M., Maio, D., and Maltoni, D. On the error-reject trade-off in biometric verification systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:786–796, 1997.
- Grandvalet et al. (2008) Grandvalet, Y., Rakotomamonjy, A., Keshet, J., and Canu, S. Support vector machines with a reject option. In Neural Information Processing Systems, 2008.
- Joachims (1999) Joachims, T. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
- Lee et al. (2004) Lee, Y., Lin, Y., and Wahba, G. Multicategory support vector machines: Theory and application to the classification of microarray data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- Ramaswamy & Agarwal (2012) Ramaswamy, H. G. and Agarwal, S. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems 25, pp. 2087–2095. 2012.
- Ramaswamy et al. (2015) Ramaswamy, H. G., Tewari, A., and Agarwal, S. Convex calibrated surrogates for hierarchical classification. In International Conference on Machine Learning, 2015.
- Rifkin & Klautau (2004) Rifkin, R. and Klautau, A. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
- Simeone et al. (2012) Simeone, P., Marrocco, C., and Tortorella, F. Design of reject rules for ECOC classification systems. Pattern Recognition, 45:863–875, 2012.
- Tewari & Bartlett (2007) Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
- Vernet et al. (2011) Vernet, E., Williamson, R. C., and Reid, M. D. Composite multiclass losses. In Neural Information Processing Systems, 2011.
- Wang & Lin (2014) Wang, P.-W. and Lin, C.-J. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523–1548, 2014.
- Wu et al. (2007) Wu, Q., Jia, C., and Chen, W. A novel classification-rejection sphere SVMs for multi-class classification problems. In IEEE International Conference on Natural Computation, 2007.
- Yuan & Wegkamp (2010) Yuan, M. and Wegkamp, M. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111–130, 2010.
- Zhang (2004) Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
- Zou et al. (2011) Zou, C., hui Zheng, E., wei Xu, H., and Chen, L. Cost-sensitive multi-class SVM with reject option: A method for steam turbine generator fault diagnosis. International Journal of Computer Theory and Engineering, 2011.
Appendix A Proof of Excess Risk Bounds for the Crammer-Singer Surrogate
Define the sets such that is the set of vectors in , for which
The following lemma gives some crucial, but straightforward to prove, (in)equalities satisfied by the Crammer-Singer surrogate.
where is the vector in with in the position and everywhere else.
The part of Theorem 1 proved here is restated below.
Let and . Then for all
We will show that and all
The Theorem simply follows from linearity of expectation.
Case 1: for some .
We have that .
The RHS of equation (7) is zero, and hence becomes trivial.
We have that .
Let . We then have
The last inequality follows from and the following observations. If then , and if we have .