# Online Active Learning of Reject Option Classifiers

Active learning is an important technique for reducing the number of labeled examples needed in supervised learning. Active learning for binary classification has been well studied. However, active learning of reject option classifiers remains an unsolved problem. In this paper, we propose novel algorithms for active learning of reject option classifiers. We develop an active learning algorithm based on the double ramp loss function and provide mistake bounds for it. We also propose a new loss function for reject option classification, called the double sigmoid loss, along with a corresponding active learning algorithm. We provide extensive experimental results showing that the proposed algorithms efficiently reduce the number of labeled examples required.


## 1 Introduction

In standard binary classification, an algorithm returns a prediction for every example and incurs a cost for every misprediction. Many real-life applications involve very high misclassification costs. Thus, for some confusing examples, not predicting anything may be less costly than a misclassification. The choice of not predicting anything for an example is called the reject option in the machine learning literature. Classifiers with the reject option are called reject option classifiers.

Reject option classification is useful in many applications. Consider a doctor diagnosing a patient based on the observed symptoms and a preliminary diagnosis. If the observations and preliminary diagnosis are ambiguous, the doctor can withhold the decision on treatment and recommend advanced tests or a specialist consultation to avoid the risk of misdiagnosing the patient. The doctor's withheld response corresponds to the reject option for that patient (da Rocha Neto et al., 2011), whereas a misprediction can lead to large costs for further treatment or even cost the patient's life. As another example, a banker can use the reject option while evaluating a customer's loan application (Rosowsky and Smith, 2013): because of the high misclassification cost, the banker may choose not to decide based on the available information and instead ask the stakeholders for further recommendations or a credit bureau score. Applications of reject option classifiers include healthcare (Hanczar and Dougherty, 2008; da Rocha Neto et al., 2011), text categorization (Fumera et al., 2003), crowdsourcing (Li et al., 2017), etc.

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{-1, +1\}$ be the label space. Examples of the form $(x, y)$ are generated from an unknown fixed distribution on $\mathcal{X} \times \mathcal{Y}$. A reject option classifier can be described with the help of a function $f : \mathcal{X} \to \mathbb{R}$ and a rejection width parameter $\rho \geq 0$ as below.

$$h_\rho(f(x)) = 1 \cdot \mathbb{I}_{\{f(x) > \rho\}} \;-\; 1 \cdot \mathbb{I}_{\{f(x) < -\rho\}} \;+\; 0 \cdot \mathbb{I}_{\{|f(x)| \leq \rho\}} \qquad (1)$$
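For concreteness, the decision rule in eq. (1) can be written as a few lines of code (a minimal sketch; the function name is our own):

```python
def reject_option_predict(f_x, rho):
    """Predict +1, -1, or 0 (reject) according to eq. (1)."""
    if f_x > rho:
        return 1
    if f_x < -rho:
        return -1
    return 0  # |f(x)| <= rho: abstain from predicting
```

The value 0 here simply encodes the "reject" outcome alongside the two class labels.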

The goal is to learn $f$ and $\rho$ simultaneously. For a given example $(x, y)$, the performance of the reject option classifier is measured using the loss $L_d$ as follows.

$$L_d(yf(x), \rho) = \mathbb{I}_{\{yf(x) \leq -\rho\}} + d\,\mathbb{I}_{\{|f(x)| \leq \rho\}} \qquad (2)$$

where $d$ is the cost of rejection. A reject option classifier is learnt by minimizing the risk (expectation of the loss) under $L_d$. As $L_d$ is not continuous, optimization of the empirical risk under $L_d$ is difficult. Thus, convex and non-convex surrogate loss functions have been proposed. Bartlett and Wegkamp (2008); Wegkamp and Yuan (2011) propose risk minimization algorithms based on the generalized hinge loss. Grandvalet et al. (2008) propose a double hinge loss based approach for reject option classification. Both the generalized hinge loss and the double hinge loss are convex surrogates of $L_d$. Manwani et al. (2015); Shah and Manwani (2019) propose double ramp loss based approaches for reject option classification. The double ramp loss is a non-convex bounded loss function.
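A direct transcription of the loss in eq. (2), useful for evaluating a learned classifier (a sketch; `L_d` is our own naming):

```python
def L_d(y, f_x, rho, d):
    """Reject option loss of eq. (2): cost 1 for a misclassification, d for a rejection."""
    return float(y * f_x <= -rho) + d * float(abs(f_x) <= rho)
```

For example, with `rho = 0.3` and `d = 0.2`, a correctly classified example outside the band incurs loss 0, a rejected one incurs 0.2, and a misclassified one incurs 1.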

In general, classifiers learned with a large amount of training data generalize better on test data. However, in many applications it can be costly and difficult to obtain a large amount of labeled data. If we can extract useful information from unlabeled data, we reduce the dependence on labeled examples. This idea motivates the field of active learning. In active learning, the label of an example is queried only if it is not explained well by the existing classifier. Active learning of standard binary classifiers has been widely studied (Dasgupta et al., 2009; Bachrach et al., 1999). In El-Yaniv and Wiener (2012), the authors reduce active learning for the usual binary classification problem to learning a reject option classifier in order to achieve faster convergence rates. However, active learning of reject option classifiers has not been addressed in the literature. In this paper, we propose online active learning algorithms for reject option classification.

A broad class of active learning algorithms is inspired by the concept of a margin between the two classes. An example which falls in the margin area with respect to the current classifier carries more information than an example which is correctly classified with a good margin or misclassified by a large margin. Margin examples can bring larger changes to the existing classifier. Thus, querying the labels of margin examples is more desirable than querying the other two kinds of examples.

A reject option classifier can be viewed as two parallel surfaces with the rejection area in between. Thus, active learning of a reject option classifier becomes active learning of two surfaces in parallel with a shared objective: minimizing the sum of losses over a sequence of examples. Manwani et al. (2015) propose a risk minimization approach based on the double ramp loss ($L_{dr}$) for learning the reject option classifier, and show that at optimality, the two surfaces can be represented using only those examples which are close to them. This motivates us to use the double ramp loss for developing an active learning approach for reject option classifiers.

We propose an active learning algorithm based on $L_{dr}$. We give bounds on the number of rejected examples and on the misclassification rate for the un-rejected examples. As the $L_{dr}$-based active learning approach does not query labels in the constant regions of the loss, it may lose some informative examples in those regions. Thus, we propose a smooth non-convex loss called the double sigmoid loss $L_{ds}$ and offer an active learning algorithm based on it. We present extensive simulation results for the proposed active learning algorithms and compare them with the corresponding online algorithm. We observe that the active learning algorithms effectively reduce the number of labels required.

## 2 Proposed Approach: Active Learning of Reject Option Classifier Inspired by Double Ramp Loss

In active learning, we do not ask for the label in every trial. We can think of an active learning algorithm as a stochastic optimization method as follows. We denote the instance presented to the algorithm at trial $t$ by $x_t \in \mathbb{R}^d$. Each $x_t$ is associated with a unique label $y_t \in \{-1, +1\}$. We denote by $w_t$ the weight vector used by the algorithm at trial $t$, and by $\rho_t$ the width of the rejection region at time $t$. At any trial $t$, if the active learning algorithm does not query the label $y_t$, the parameters $(w_t, \rho_t)$ are also not updated. Therefore, examples for which the active learning algorithm does not query a label or perform an update must be examples at which the gradient is zero. For an active learning algorithm which queries the label with probability Query$(f, x)$ and updates using Update$(f, x, y)$ (Guillory et al., 2009), we require

$$\forall f \quad -\nabla O(f) \approx \mathbb{E}_{x,y}\big[\text{Update}(f, x, y)\,\text{Query}(f, x)\big] \qquad (3)$$

Our first approach is based on the double ramp loss $L_{dr,\rho}$ (Manwani et al., 2015).

$$L_{dr,\rho}(z) = d\Big[[1 - z + \rho]_+ - [-1 - z + \rho]_+\Big] + (1 - d)\Big[[1 - z - \rho]_+ - [-1 - z - \rho]_+\Big]$$

Here $z = yf(x)$, $[a]_+ = \max(a, 0)$, $\rho$ indicates the width of the rejection region, and $d$ is the cost of rejection. We first consider developing an active learning algorithm for linear classifiers, $f(x) = w \cdot x$. The objective function for learning the reject option classifier is as follows.

$$O(f, \rho) = \mathbb{E}_{x,y}\big[L_{dr,\rho}(f(x), y)\big] \qquad (4)$$

To find the update rules, we compute the subgradient of the objective function described in eq. (4) with respect to $w$ and $\rho$.

$$\begin{aligned} -\nabla_w O(f, \rho) &= \mathbb{E}_{x,y}\Big[d\,yx\,\mathbb{I}_{\{\rho - 1 \leq yf(x) \leq \rho + 1\}} + (1 - d)\,yx\,\mathbb{I}_{\{-\rho - 1 \leq yf(x) \leq -\rho + 1\}}\Big] \\ -\nabla_\rho O(f, \rho) &= \mathbb{E}_{x,y}\Big[-d\,\mathbb{I}_{\{\rho - 1 \leq yf(x) \leq \rho + 1\}} + (1 - d)\,\mathbb{I}_{\{-\rho - 1 \leq yf(x) \leq -\rho + 1\}}\Big] \end{aligned}$$

Thus, from eq.(3),

$$\text{Update}(w, x, y) = \begin{cases} \eta\, d\, yx & \text{if } \rho - 1 \leq yf(x) \leq \rho + 1 \\ \eta(1 - d)\, yx & \text{if } -\rho - 1 \leq yf(x) \leq -\rho + 1 \end{cases} \qquad (5)$$

$$\text{Update}(\rho, x, y) = \begin{cases} -\eta\, d & \text{if } \rho - 1 \leq yf(x) \leq \rho + 1 \\ \eta(1 - d) & \text{if } -\rho - 1 \leq yf(x) \leq -\rho + 1 \end{cases} \qquad (6)$$

Here, $\eta$ is the step size. Thus,

$$w_{t+1} = \begin{cases} w_t + \eta\, d\, y_t x_t & \text{if } \rho_t - 1 \leq y_t f(x_t) \leq \rho_t + 1 \\ w_t + \eta(1 - d)\, y_t x_t & \text{if } -\rho_t - 1 \leq y_t f(x_t) \leq -\rho_t + 1 \\ w_t & \text{otherwise} \end{cases}$$

$$\rho_{t+1} = \begin{cases} \rho_t - \eta\, d & \text{if } \rho_t - 1 \leq y_t f(x_t) \leq \rho_t + 1 \\ \rho_t + \eta(1 - d) & \text{if } -\rho_t - 1 \leq y_t f(x_t) \leq -\rho_t + 1 \\ \rho_t & \text{otherwise} \end{cases}$$

Here, if $y_t f(x_t)$ is between $\rho - 1$ and $\rho + 1$, we decrease $\rho$, which is intuitive: such an example is correctly classified and yet may be rejected, so the rejection width should shrink. Now, combining eq. (3), (5) and (6), we get Query$(w, \rho, x)$ as follows.

$$\text{Query}(w, \rho, x) = \begin{cases} 1 & \text{if } \rho - 1 \leq |f(x)| \leq \rho + 1 \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

Thus, we ask for the label of the current example only if it falls in a linear region of the loss $L_{dr,\rho}$. The detailed algorithm is given in Algorithm 1. We call it DRAL (double ramp loss based active learning). Note that for a trial $t$, $Q_t = 1$ means the label is queried for $x_t$.
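A single DRAL trial can be sketched as follows (our own illustrative naming, not the paper's reference implementation): query only when $|f(x)|$ falls in a linear region of the loss, and update $(w, \rho)$ according to which region $y f(x)$ lies in.

```python
import numpy as np

def dral_step(w, rho, x, y, eta, d):
    """One trial of DRAL: query only when |f(x)| lies in a linear region of L_dr."""
    fx = float(w @ x)
    queried = (rho - 1 <= abs(fx) <= rho + 1)      # query rule, eq. (7)
    if queried:
        z = y * fx
        if rho - 1 <= z <= rho + 1:                # linear region near the +rho surface
            w = w + eta * d * y * x
            rho = rho - eta * d
        elif -rho - 1 <= z <= -rho + 1:            # linear region near the -rho surface
            w = w + eta * (1 - d) * y * x
            rho = rho + eta * (1 - d)
    return w, rho, queried
```

When the label is not queried, the parameters are returned unchanged, matching the "otherwise" cases of the update equations.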

### 2.1 Mistake Bounds for DRAL

In this section, we theoretically analyze the mistake bounds of DRAL. We begin with a lemma which facilitates the subsequent mistake bound proofs. Let $\mathbb{I}$ denote the indicator function; we define the following notation:

$$C_t = \mathbb{I}_{\{\rho_t \leq y_t w_t \cdot x_t \leq \rho_t + 1\}} \qquad R_{1t} = \mathbb{I}_{\{\rho_t - 1 \leq y_t w_t \cdot x_t \leq \rho_t\}} \qquad (8)$$

$$R_{2t} = \mathbb{I}_{\{-\rho_t \leq y_t w_t \cdot x_t \leq -\rho_t + 1\}} \qquad M_t = \mathbb{I}_{\{-\rho_t - 1 \leq y_t w_t \cdot x_t \leq -\rho_t\}}$$
• **Lemma 1.** Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$ for all $t$. Given $C_t$, $R_{1t}$, $R_{2t}$ and $M_t$ as defined in eq. (8), the following bound holds for any $w \in \mathbb{R}^d$, $\rho > 0$ and $\alpha > 0$.

$$\begin{aligned} \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 + \sum_{t=1}^{T} 2\alpha\eta\, L_{dr,\rho}(w \cdot x_t, y_t) \;\geq\; &\sum_{t=1}^{T}[C_t + R_{1t}]\Big[2\alpha\eta d + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d\big) - \eta^2 d^2(\|x_t\|^2 + 1)\Big] \\ +\; &\sum_{t=1}^{T}[R_{2t} + M_t]\Big[2\alpha\eta(1 + d) + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d - 1\big) - \eta^2(1 - d)^2(\|x_t\|^2 + 1)\Big] \end{aligned} \qquad (9)$$

The proof of this lemma is given in the Appendix. Now, we derive bounds on the rejection rate and the misclassification rate.

• **Theorem 1.** Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$ for all $t$. Assume that there exist a vector $w$ and a $\rho > 0$ such that $L_{dr,\rho}(w \cdot x_t, y_t) = 0$ for all $t$. Then the following hold.

1. The number of examples rejected by DRAL (Algorithm 1), among those for which the label was asked, is upper bounded as follows:

$$\sum_{t: Q_t = 1} [R_{1t} + R_{2t}] \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2$$

where $\alpha > 0$ is an appropriately chosen constant.

2. The number of examples misclassified by DRAL (Algorithm 1), among those for which the label was asked, is upper bounded as follows:

$$\sum_{t: Q_t = 1} M_t \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2$$

where $\alpha > 0$ is an appropriately chosen constant.

The proof of this theorem is given in the Appendix. The above theorem assumes that there exist $w$ and $\rho$ such that $L_{dr,\rho}(w \cdot x_t, y_t) = 0$ for all $t$, which means the data is linearly separable. In that case, the number of mistakes made by the algorithm on un-rejected examples, as well as the number of rejected examples, is upper bounded by a complexity term. Now we consider the bounds when no such $(w, \rho)$ achieves zero loss on all examples.

• **Theorem 2.** Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$ for all $t$. Then, for any vector $w$ and any $\rho > 0$, the following hold.

1. The number of examples rejected by DRAL (Algorithm 1), among those for which the label was asked, is upper bounded as follows:

$$\sum_{t: Q_t = 1} [R_{1t} + R_{2t}] \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 + \sum_{t=1}^{T} 2\eta\alpha\, L_{dr,\rho}(w \cdot x_t, y_t)$$

where $\alpha > 0$ is an appropriately chosen constant.

2. The number of examples misclassified by DRAL (Algorithm 1), among those for which the label was asked, is upper bounded as follows:

$$\sum_{t: Q_t = 1} M_t \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 + \sum_{t=1}^{T} 2\eta\alpha\, L_{dr,\rho}(w \cdot x_t, y_t)$$

where $\alpha > 0$ is an appropriately chosen constant.

The proof is given in the Appendix. We see that when the data is not linearly separable, the number of mistakes made by the algorithm is upper bounded by the sum of a complexity term and the cumulative loss of a fixed classifier.

## 3 Active Learning Using Double Sigmoid Loss Function

We observe that the double ramp loss is not smooth. Moreover, it is constant in three regions, namely $z \geq \rho + 1$, $-\rho + 1 \leq z \leq \rho - 1$, and $z \leq -\rho - 1$. Thus, when an example falls in any of these three regions, the gradient of the loss is zero, so in the context of the double ramp loss there is no benefit in asking for labels there. However, we do not want to ignore these regions completely. To capture the information in these regions, we need a loss function whose gradient is non-zero there. We also want the loss function to be bounded, so that the effect of label noise does not degrade the overall performance.

Thus, we propose a new non-convex loss function for reject option classification by combining two sigmoids. We call it the double sigmoid loss function $L_{ds}$.

$$L_{ds}(w \cdot x, y) = 2d\,\sigma(y\,w \cdot x - \rho) + 2(1 - d)\,\sigma(y\,w \cdot x + \rho)$$

where $\sigma(z) = \frac{1}{1 + e^{z}}$ is the (decreasing) sigmoid function. We also see that $L_{ds}$ upper bounds the loss $L_d$. Figure 1(a) shows the double sigmoid loss function.
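A sketch of the double sigmoid loss, assuming the decreasing parameterisation $\sigma(z) = 1/(1 + e^z)$ (with this choice the loss vanishes for confidently correct examples and stays bounded by 2 for confidently wrong ones); the function names are ours:

```python
import numpy as np

def sigmoid(z):
    # Decreasing parameterisation (our assumption): tends to 0 as z -> +infinity,
    # so the loss below vanishes for confidently correct predictions.
    return 1.0 / (1.0 + np.exp(z))

def L_ds(w, x, y, rho, d):
    """Double sigmoid loss: smooth, bounded, with non-zero gradient everywhere."""
    z = y * float(w @ x)
    return 2 * d * sigmoid(z - rho) + 2 * (1 - d) * sigmoid(z + rho)
```

A large positive margin drives the loss toward 0, while a large negative margin drives it toward its maximum of 2.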

### 3.1 Query Probability Function

At a trial $t$, set $f_t(x_t) = w_t \cdot x_t$. Given $f_t(x_t)$ and $\rho_t$, we ask for the label of example $x_t$ with the following probability.

$$p_t = 4\,\sigma\big(|f_t(x_t)| - \rho_t\big)\Big(1 - \sigma\big(|f_t(x_t)| - \rho_t\big)\Big) \qquad (10)$$

Figure 1(b) shows the graph of the query probability function. The probability function in eq. (10) has two peaks: one at $f_t(x_t) = \rho_t$ (the decision boundary between the positive class and rejection) and another at $f_t(x_t) = -\rho_t$ (the decision boundary between rejection and the negative class). The key idea behind such a query function is as follows. Examples which fall close to one of these boundaries carry more information about the two decision surfaces of the classifier. Examples which are correctly classified with a good margin, examples misclassified with a huge margin, and examples in the middle of the reject region do not carry much information. Thus, we ask for the label in these regions with lower probability.

Using eq. (3) and the query probability (10), we find the parameter update equations as follows.

$$\begin{aligned} \text{Update}(w, x_t, y_t) = -\eta\,\frac{\partial L_{ds}(w_t \cdot x_t, y_t)}{\partial w} = 2\eta\, y_t x_t\Big[&d\,\sigma(y_t f_t(x_t) - \rho_t)\big(1 - \sigma(y_t f_t(x_t) - \rho_t)\big) \\ +\; &(1 - d)\,\sigma(y_t f_t(x_t) + \rho_t)\big(1 - \sigma(y_t f_t(x_t) + \rho_t)\big)\Big] \end{aligned}$$

$$\begin{aligned} \text{Update}(\rho, x_t, y_t) = -\eta\,\frac{\partial L_{ds}(w_t \cdot x_t, y_t)}{\partial \rho} = -2\eta\Big[&d\,\sigma(y_t f_t(x_t) - \rho_t)\big(1 - \sigma(y_t f_t(x_t) - \rho_t)\big) \\ -\; &(1 - d)\,\sigma(y_t f_t(x_t) + \rho_t)\big(1 - \sigma(y_t f_t(x_t) + \rho_t)\big)\Big] \end{aligned}$$

Now we give an intuitive explanation of the updates of $w$ and $\rho$. When an example is correctly classified with a good margin (i.e. $y_t f_t(x_t) \gg \rho_t$), the algorithm updates $w$ by only a small amount and reduces the rejection width, since Update$(\rho, x_t, y_t) < 0$ in this case. Similarly, when an example is misclassified with a large margin (i.e. $y_t f_t(x_t) \ll -\rho_t$), the algorithm updates $w$ by a larger amount and increases the rejection width, since Update$(\rho, x_t, y_t) > 0$ there. Therefore, the updates of double sigmoid loss based active learning are in line with intuition. We use the acronym DSAL (double sigmoid loss based active learning). Due to space constraints, we skip the detailed description of DSAL.
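Putting the query probability of eq. (10) together with these updates, one DSAL trial can be sketched as follows (our own naming, and the same decreasing-sigmoid assumption $\sigma(z) = 1/(1+e^z)$ as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))   # decreasing parameterisation (our assumption)

def dsal_step(w, rho, x, y, eta, d, rng):
    """One trial of DSAL: stochastic query via eq. (10), then a gradient step on L_ds."""
    ft = float(w @ x)
    s = sigmoid(abs(ft) - rho)
    p = 4 * s * (1 - s)                      # query probability; peaks at |f(x)| = rho
    if rng.random() >= p:
        return w, rho, False                 # label not queried, no update
    s_minus = sigmoid(y * ft - rho)
    s_plus = sigmoid(y * ft + rho)
    g = d * s_minus * (1 - s_minus)
    h = (1 - d) * s_plus * (1 - s_plus)
    w = w + 2 * eta * y * x * (g + h)        # descent step on L_ds w.r.t. w
    rho = rho - 2 * eta * (g - h)            # descent step on L_ds w.r.t. rho
    return w, rho, True
```

Unlike DRAL, every region has a non-zero query probability, so no example is ignored outright.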

## 4 Experiments

In this section, we show the effectiveness of the proposed active learning approaches. We show results on the Phishing (68 dimensions) and Gisette (5000 dimensions) datasets, available from the UCI machine learning repository (Lichman, 2013).

### 4.1 Experimental Setup

We evaluate the performance of our approaches for learning linear classifiers. In all our simulations, we initialize the step size with a small value and decrease it by a small constant after every round. The parameters of the double sigmoid loss function are chosen to minimize the average risk and the average fraction of queried labels (averaged over 100 runs).

We need to show that the proposed active learning algorithms effectively reduce the number of labeled examples required while achieving the same accuracy as online learning. Thus, we compare the active learning approaches with an online algorithm which updates the parameters using gradient descent on the double sigmoid loss at every trial. We call this online algorithm DSOL (double sigmoid loss based online learning).

### 4.2 Simulation Results

We report results for three different values of $d$. The results provided here are based on 100 repetitions with the total number of rounds $T$ equal to 10000. For every value of $d$, we compute the average risk, fraction of asked labels, fraction of misclassified examples, and fraction of rejected examples over the 100 repetitions. We plot the average of each quantity (e.g. risk, fraction of asked labels, etc.), with the standard deviation denoted by error bars in the figures. Figures 2 and 3 show the experimental results for the Gisette and Phishing datasets. We observe the following.

• Average Risk: In all cases, the risk increases with increasing $d$. The average risk of DSAL is higher than that of DRAL for both datasets and all values of $d$. DRAL gives a lower risk than DSOL for all values of $d$ on the Phishing dataset. On the Gisette dataset, DSOL does better than DRAL except for one value of $d$. DSOL always does better than DSAL in terms of risk.

• Average Fraction of Asked Labels: The fraction of asked labels decreases with increasing $d$. We also see that DSAL asks significantly fewer labels to achieve a similar risk as DRAL. This happens because DRAL always asks for the label in one specific region and completely ignores the other regions, whereas DSAL asks for labels in every region with some probability.

• Average Fraction of Misclassified Examples: DRAL achieves the minimum average misclassification rate for all values of $d$ and both datasets. DSAL gives a higher average misclassification rate than DSOL in all cases. The misclassification rate goes up with increasing $d$.

• Average Fraction of Rejected Examples: The average fraction of rejected examples is higher for DRAL than for DSAL and DSOL. Also, the rejection rate decreases with increasing $d$.

Thus, we see that the proposed active learning algorithms DRAL and DSAL effectively reduce the number of labels required for learning the reject option classifier and perform well compared to online learning.

## 5 Conclusion and Future Work

In this paper, we have proposed novel active learning algorithms DRAL and DSAL. We presented mistake bounds for DRAL. We experimentally showed that the proposed active learning algorithms reduce the number of labels required while maintaining performance similar to online learning. In future work, we plan to derive mistake bounds for DSAL and to carry out further theoretical analysis of the double sigmoid loss function.

## Appendix A Proof of Lemma 1

$$\|w_t - \alpha w\|^2 - \|w_{t+1} - \alpha w\|^2 = \|w_t - \alpha w\|^2 - \big\|w_t + \eta d\, y_t x_t [C_t + R_{1t}] + \eta(1 - d)\, y_t x_t [R_{2t} + M_t] - \alpha w\big\|^2$$

Note that only one of the four indicators can be true at time $t$; therefore, the following equations hold.

$$[C_t + R_{1t}]^2 = [C_t + R_{1t}], \qquad [R_{2t} + M_t]^2 = [R_{2t} + M_t], \qquad [C_t + R_{1t}][R_{2t} + M_t] = 0$$

Using the above facts,

$$\begin{aligned} \|w_t - \alpha w\|^2 - \|w_{t+1} - \alpha w\|^2 &= \|w_t\|^2 + \alpha^2\|w\|^2 - 2\alpha(w \cdot w_t) - \Big[\|w_t\|^2 + \eta^2 d^2 y_t^2 \|x_t\|^2 [C_t + R_{1t}] \\ &\quad + \eta^2(1 - d)^2 y_t^2 \|x_t\|^2 [R_{2t} + M_t] + \alpha^2\|w\|^2 + 2\eta d\, y_t (w_t \cdot x_t)[C_t + R_{1t}] \\ &\quad + 2\eta(1 - d)\, y_t (w_t \cdot x_t)[R_{2t} + M_t] - 2\alpha(w \cdot w_t) - 2\alpha\eta d\, y_t (w \cdot x_t)[C_t + R_{1t}] \\ &\quad - 2\alpha\eta(1 - d)\, y_t (w \cdot x_t)[R_{2t} + M_t]\Big] \\ &= 2\alpha\eta d\, y_t (w \cdot x_t)[C_t + R_{1t}] + 2\alpha\eta(1 - d)\, y_t (w \cdot x_t)[R_{2t} + M_t] \\ &\quad - 2\eta d\, y_t (w_t \cdot x_t)[C_t + R_{1t}] - 2\eta(1 - d)\, y_t (w_t \cdot x_t)[R_{2t} + M_t] \\ &\quad - \eta^2 d^2 \|x_t\|^2 [C_t + R_{1t}] - \eta^2(1 - d)^2 \|x_t\|^2 [R_{2t} + M_t] \end{aligned}$$

Grouping the coefficients of $[C_t + R_{1t}]$ and $[R_{2t} + M_t]$,

$$\begin{aligned} \|w_t - \alpha w\|^2 - \|w_{t+1} - \alpha w\|^2 &= [C_t + R_{1t}]\Big[2\alpha\eta d\, y_t (w \cdot x_t) - 2\eta d\, y_t (w_t \cdot x_t) - \eta^2 d^2 \|x_t\|^2\Big] \\ &\quad + [R_{2t} + M_t]\Big[2\alpha\eta(1 - d)\, y_t (w \cdot x_t) - 2\eta(1 - d)\, y_t (w_t \cdot x_t) - \eta^2(1 - d)^2\|x_t\|^2\Big] \end{aligned}$$

The same procedure can be applied for $\rho$:

$$\begin{aligned} (\rho_t - \alpha\rho)^2 - (\rho_{t+1} - \alpha\rho)^2 &= (\rho_t - \alpha\rho)^2 - \big(\rho_t - \eta d\, [C_t + R_{1t}] + \eta(1 - d)[R_{2t} + M_t] - \alpha\rho\big)^2 \\ &= [C_t + R_{1t}]\Big[-\eta^2 d^2 + 2\eta d\,\rho_t - 2\alpha\eta d\,\rho\Big] \\ &\quad + [R_{2t} + M_t]\Big[-\eta^2(1 - d)^2 - 2\eta(1 - d)\rho_t + 2\alpha\eta(1 - d)\rho\Big] \end{aligned}$$

$$\begin{aligned} \|w_t - \alpha w\|^2 - \|w_{t+1} - \alpha w\|^2 + (\rho_t - \alpha\rho)^2 - (\rho_{t+1} - \alpha\rho)^2 &= [C_t + R_{1t}]\Big[2\alpha\eta d\big(y_t(w \cdot x_t) - \rho\big) - 2\eta d\big(y_t(w_t \cdot x_t) - \rho_t\big) - \eta^2 d^2(\|x_t\|^2 + 1)\Big] \\ &\quad + [R_{2t} + M_t]\Big[2\alpha\eta(1 - d)\big(y_t(w \cdot x_t) + \rho\big) - 2\eta(1 - d)\big(y_t(w_t \cdot x_t) + \rho_t\big) \\ &\qquad - \eta^2(1 - d)^2(\|x_t\|^2 + 1)\Big] \end{aligned}$$

If $\rho - 1 \leq y_t(w \cdot x_t) \leq \rho + 1$, then the double ramp loss satisfies $L_{dr,\rho}(w \cdot x_t, y_t) = d\big(1 - y_t(w \cdot x_t) + \rho\big)$, so $d\big(y_t(w \cdot x_t) - \rho\big) = d - L_{dr,\rho}(w \cdot x_t, y_t)$. Similarly, when $-\rho - 1 \leq y_t(w \cdot x_t) \leq -\rho + 1$, we have $(1 - d)\big(y_t(w \cdot x_t) + \rho\big) = 1 + d - L_{dr,\rho}(w \cdot x_t, y_t)$. Using these facts to substitute for $y_t(w \cdot x_t)$ and $y_t(w_t \cdot x_t)$, we get
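As a sanity check, the linear-region identities used in this substitution (that $d(z - \rho) = d - L_{dr,\rho}(z)$ for $z$ near $\rho$, and $(1-d)(z + \rho) = 1 + d - L_{dr,\rho}(z)$ for $z$ near $-\rho$) can be verified numerically; the helper below is our own sketch of the loss definition:

```python
def L_dr(z, rho, d):
    """Double ramp loss, written with hinge pieces [a]_+ = max(a, 0)."""
    plus = lambda a: max(a, 0.0)
    return (d * (plus(1 - z + rho) - plus(-1 - z + rho))
            + (1 - d) * (plus(1 - z - rho) - plus(-1 - z - rho)))

rho, d = 1.5, 0.3
for z in (rho - 0.7, rho + 0.4):             # linear region around +rho
    assert abs(d * (z - rho) - (d - L_dr(z, rho, d))) < 1e-12
for z in (-rho - 0.6, -rho + 0.9):           # linear region around -rho
    assert abs((1 - d) * (z + rho) - (1 + d - L_dr(z, rho, d))) < 1e-12
```

Both identities hold exactly inside the respective linear regions of the loss, which is all the substitution below relies on.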

$$\begin{aligned} \|w_t - \alpha w\|^2 - \|w_{t+1} - \alpha w\|^2 + (\rho_t - \alpha\rho)^2 - (\rho_{t+1} - \alpha\rho)^2 &= [C_t + R_{1t}]\Big[-2\alpha\eta\big(L_{dr,\rho}(w \cdot x_t, y_t) - d\big) + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d\big) - \eta^2 d^2(\|x_t\|^2 + 1)\Big] \\ &\quad + [R_{2t} + M_t]\Big[-2\alpha\eta\big(L_{dr,\rho}(w \cdot x_t, y_t) - d - 1\big) + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d - 1\big) \\ &\qquad - \eta^2(1 - d)^2(\|x_t\|^2 + 1)\Big] \end{aligned}$$

Summing the above equation over all $t = 1, \ldots, T$ and telescoping,

$$\begin{aligned} &\sum_{t=1}^{T}[C_t + R_{1t}]\Big[2\alpha\eta\big(d - L_{dr,\rho}(w \cdot x_t, y_t)\big) + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d\big) - \eta^2 d^2(\|x_t\|^2 + 1)\Big] \\ &\qquad + \sum_{t=1}^{T}[R_{2t} + M_t]\Big[2\alpha\eta\big(1 + d - L_{dr,\rho}(w \cdot x_t, y_t)\big) + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d - 1\big) - \eta^2(1 - d)^2(\|x_t\|^2 + 1)\Big] \\ &\quad = \big(\|w_1 - \alpha w\|^2 - \|w_{T+1} - \alpha w\|^2\big) + \big((\rho_1 - \alpha\rho)^2 - (\rho_{T+1} - \alpha\rho)^2\big) \\ &\quad \leq \|w_1 - \alpha w\|^2 + (\rho_1 - \alpha\rho)^2 = \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 \end{aligned}$$

Here, we used the fact that we initialize with $w_1 = 0$ and $\rho_1 = 1$. Rearranging terms, we get the required inequality.

$$\begin{aligned} &\sum_{t=1}^{T}[C_t + R_{1t}]\Big[2\alpha\eta d + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d\big) - \eta^2 d^2(\|x_t\|^2 + 1)\Big] + \sum_{t=1}^{T}[R_{2t} + M_t]\Big[2\alpha\eta(1 + d) \\ &\qquad + 2\eta\big(L_{dr,\rho_t}(w_t \cdot x_t, y_t) - d - 1\big) - \eta^2(1 - d)^2(\|x_t\|^2 + 1)\Big] \\ &\quad \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 + \sum_{t=1}^{T} 2\alpha\eta\, L_{dr,\rho}(w \cdot x_t, y_t)[C_t + R_{1t}] + \sum_{t=1}^{T} 2\alpha\eta\, L_{dr,\rho}(w \cdot x_t, y_t)[R_{2t} + M_t] \\ &\quad \leq \alpha^2\|w\|^2 + (1 - \alpha\rho)^2 + \sum_{t=1}^{T} 2\alpha\eta\, L_{dr,\rho}(w \cdot x_t, y_t) \end{aligned}$$

1. Putting