# Cost Sensitive Learning in the Presence of Symmetric Label Noise

In the binary classification framework, we are interested in making cost sensitive label predictions in the presence of uniform/symmetric label noise. We first observe that 0-1 Bayes classifiers are not (uniform) noise robust in the cost sensitive setting. To circumvent this impossibility result, we present two schemes; unlike the existing methods, our schemes do not require the noise rate. The first one uses the α-weighted γ-uneven margin squared loss function, lα,usq, which can handle cost sensitivity arising due to domain requirement (using user given α) or class imbalance (by tuning γ) or both. However, we observe that lα,usq Bayes classifiers are also not cost sensitive and noise robust. We show that regularized ERM of this loss function over the class of linear classifiers yields a cost sensitive uniform noise robust classifier as a solution of a system of linear equations. We also provide a performance bound for this classifier. The second scheme that we propose is a re-sampling based scheme that exploits the special structure of the uniform noise models and uses in-class probability estimates. Our computational experiments on some UCI datasets with class imbalance show that the classifiers of our two schemes are on par with the existing methods, and in fact better in some cases, w.r.t. Accuracy and Arithmetic Mean, without using/tuning the noise rate. We also consider other cost sensitive performance measures, viz., F measure and Weighted Cost, for evaluation.


## 1 Introduction

We are interested in cost sensitive label predictions when only noise corrupted labels are available. The labels might be corrupted when the data has been collected by crowdsourcing with not-so-high labeling expertise. We consider the basic case where the induced label noise is independent of the class or example, i.e., the symmetric/uniform noise model. In the real world, there are various applications requiring differential misclassification costs due to class imbalance or domain requirement or both; we elaborate on these below.

In case 1, there is no explicit need for different penalization of classes but the data has imbalance. Example: Predicting whether the age of an applicant for a vocational training course is above or below 15 years. As the general tendency is to apply after high school, data is imbalanced but there is no need for asymmetric misclassification cost. Here, asymmetric cost should be learnt from data.

In case 2, there is no imbalance in the data but one class's misclassification cost is higher than that of the other. Example: Product recommendation by paid web advertisements. Even though the product is liked or disliked by approximately equal proportions of the population, losing a potential customer by not recommending the product is costlier than showing a paid advertisement to a customer who doesn't like it. Here, the misclassification cost has to come from the domain.

In case 3, there is both imbalance and a need for differential costing. Example: Rare (imbalanced) disease diagnosis, where the cost of missing a patient with the disease is higher than the cost of wrongly diagnosing a healthy person with the disease. The model should incorporate both the cost from the domain and the cost due to imbalance.
In Section 1.1, we provide a summary of how these 3 cases are handled. For cost and uniform noise, we have considered real datasets belonging to cases 2 and 3.
Contributions

• Show that, unlike the 0-1 loss, the α-weighted 0-1 loss is not cost sensitive uniform noise robust.

• Show that the α-weighted γ-uneven margin squared loss lα,usq with linear classifiers is both uniform noise robust and handles cost sensitivity. Present a performance bound for a classifier obtained from lα,usq based regularized ERM.

• Propose a re-sampling based scheme for cost sensitive label prediction in the presence of uniform noise using in-class probability estimates.

• Unlike existing work, neither of the proposed schemes needs the true noise rate.

• Using a balanced dataset (Bupa), which requires domain cost too, we demonstrate that tuning γ on such corrupted datasets can be beneficial.

Related work For classification problems with label noise, particularly in the Empirical Risk Minimization framework, the most recent work ([9, 13, 21, 14, 5]) aims to make the loss function noise robust and then develop algorithms. Cost sensitive learning has been widely studied by [4, 10, 17] and many more. An extensive empirical study on the effect of label noise on cost sensitive learning is presented in [23]. The problem of cost sensitive uniform noise robustness is considered in [12], where the asymmetric misclassification cost is tuned and class dependent noise rates are cross validated over corrupted data. However, our work incorporates cost due to both imbalance (γ) and domain requirement (α), with the added benefit that there is no need to know the true noise rate.
Organization Section 1.1 has some details about weighted uneven margin loss functions and in-class probability estimates. In Section 2, the weighted 0-1 loss is shown to be not cost sensitive uniform noise robust. Sections 3 and 4 present two different schemes that make cost sensitive predictions in the presence of uniform label noise. Section 5 has empirical evidence of the performance of the proposed methods. Some discussion and future directions are presented in Section 6.
Notations Let D be the joint distribution over X × Y with X ⊆ Rⁿ and Y = {−1, +1}. Let the in-class probability and class marginal on D be denoted by η(x) = P(Y = 1|X = x) and π = P(Y = 1). Let the decision function be f : X → R, the hypothesis class of all measurable functions be H, and the class of linear hypotheses be Hlin. Let ~D denote the distribution on X × Y obtained by inducing noise to D with P(~y = −y | y, x) = ρ. The corrupted sample is {(xi, ~yi) : i = 1, …, m}. The noise rate ρ is constant across classes and the model is referred to as the Symmetric Label Noise (SLN) model. In such cases, the corrupted in-class probability is ~η(x) = (1−2ρ)η(x) + ρ and the corrupted class marginal is ~π = (1−2ρ)π + ρ. Symmetric and uniform noise are synonymous in this work.

### 1.1 Some relevant background

The first choice of loss function for cost sensitive learning is the α-weighted 0-1 loss defined as follows:

 l0−1,α(f(x),y)=(1−α)1{y=1,f(x)≤0}+α1{y=−1,f(x)>0},  ∀α∈(0,1) (1)

Let the α-weighted 0-1 risk be RD,α(f) = ED[l0−1,α(f(x),y)]. The minimizer of this risk is f∗α = sign(η(x)−α), referred to as the cost-sensitive Bayes classifier. The corresponding surrogate lα based risk and its minimizer are denoted RD,lα(f) and f∗lα respectively. Consider the following notion of α-classification calibration.

###### Definition 1 (α-Classification Calibration [17])

For α ∈ (0,1) and a loss function l, define the α-weighted loss:

 lα(f(x),y)=(1−α)l1(f(x))+αl−1(f(x)), (2)

where l1(f(x)) = l(f(x), 1) and l−1(f(x)) = l(f(x), −1). The loss lα is α-classification calibrated (α-CC) iff there exists a convex, non-decreasing and invertible transformation ψlα with ψlα(0) = 0, such that

 ψlα(RD,α(f)−RD,α(f∗α))≤RD,lα(f)−RD,lα(f∗lα). (3)

If the classifiers obtained from α-CC losses are consistent w.r.t. the lα-risk, then they are also consistent w.r.t. the α-weighted 0-1 risk. We consider the α-weighted γ-uneven margin squared loss lα,usq [17], which is by construction α-CC and defined as follows:

 lα,usq(f(x),y)=(1−α)1{y=1}(1−f(x))²+α1{y=−1}(1/γ)(1+γf(x))²,  γ>0 (4)

Interpretation of α and γ The roles of α and γ can be related to the three cases of differential costing described at the start of this paper. In case 1, there are 3 options: fix α = 0.5 and let a tuned γ pick up the imbalance; fix γ = 1 and tune α; or tune both α and γ. Our experimental results suggest that the latter two perform equally well. For case 2, α is given and γ can be fixed at 1. However, we observe that even in this case tuning γ can be more informative. For case 3, γ is tuned and α is given a priori. There would be a trade-off between α and γ, i.e., for a given α, there would be an optimal γ in a suitable sense. The above observations are based on the experiments described in Supplementary material Section D.
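For concreteness, the loss in equation (4) can be written out as a small function; the name and vectorized form below are ours:

```python
# alpha-weighted, gamma-uneven margin squared loss of Eq. (4).
import numpy as np

def l_alpha_usq(f, y, alpha, gamma):
    """(1-alpha)*1{y=1}(1-f)^2 + alpha*1{y=-1}(1/gamma)(1+gamma*f)^2."""
    f, y = np.asarray(f, float), np.asarray(y, float)
    pos = (1 - alpha) * (y == 1) * (1 - f) ** 2
    neg = alpha * (y == -1) * (1 + gamma * f) ** 2 / gamma
    return pos + neg

# gamma = 1 recovers the usual alpha-weighted squared loss with margins +-1
assert l_alpha_usq(0.0, 1, alpha=0.25, gamma=1.0) == 0.75
assert l_alpha_usq(0.0, -1, alpha=0.25, gamma=2.0) == 0.125
```

With γ > 1 the negative class is penalized at margin −1/γ, which is how the loss encodes uneven margins.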
Choice of ~η estimation method As ~η estimates are required in the re-sampling scheme, we investigated the performance of 4 methods: Lk-fun [16], which uses a classifier to get the estimate ~η; interpreting ~η as a conditional expectation and obtaining it by a suitable squared deviation minimization; LSPC [19], density ratio estimation by l2 norm minimization; and KLIEP [20], density ratio estimation by KL divergence minimization. We chose to use Lk-fun with logistic loss and squared loss, LSPC, and a normalized version of KLIEP, because in the re-sampling algorithm we are concerned with label prediction and these estimators performed equally well on the Accuracy measure. A detailed study is available in Supplementary material Section F.

## 2 Cost sensitive Bayes classifiers using l0−1,α and lα,usq need not be uniform noise robust

The robustness notion for risk minimization in cost insensitive scenarios was introduced by [9], who also proved that cost insensitive 0-1 loss based risk minimization is SLN robust. We extend this definition to cost sensitive learning.

###### Definition 2 (Cost sensitive noise robust)

Let f∗α,A and ~f∗α,A be obtained from the clean distribution D and the corrupted distribution ~D using any arbitrary scheme A; then the scheme A is said to be cost sensitive noise robust if

 RD,α(~f∗α,A)=RD,α(f∗α,A).

If f∗α,A and ~f∗α,A are obtained from a cost sensitive loss function lα and the induced noise is symmetric, then lα is said to be cost sensitive uniform noise robust.

Let the lα risk on ~D be denoted by R~D,lα(f). If one is interested in cost sensitive learning with noisy labels, then the sufficient condition of [5] becomes lα(f(x),1) + lα(f(x),−1) = K for some constant K. This condition is satisfied if and only if α = 0.5, implying that it cannot be a sufficient condition for SLN robustness if there is a differential costing of classes (α ≠ 0.5).

Let f∗α and ~f∗α be the minimizers of RD,α(f) and R~D,α(f). Then, it is known that they have the following form:

 f∗α = sign(η(x)−α) (5)
 ~f∗α = sign(~η(x)−α) = sign(η(x)−(α−ρ)/(1−2ρ)) (6)

The last equality in (6) follows from the fact that ~η(x) = (1−2ρ)η(x)+ρ. In Example 1 below, we show that l0−1,α is not cost sensitive uniform noise robust with H.

###### Example 1

Let Y have a Bernoulli distribution with parameter p = 0.2, independently of X. Then, the in-class probability is given as follows:

 η(x)=P(Y=1|X=x)=p=0.2

Suppose α = 0.25. Then, f∗α(x) = sign(η(x)−α) ≤ 0 for all x. If ρ > 1/12, then ~η(x) = (1−2ρ)p+ρ > α, so ~f∗α(x) > 0 for all x. Consider the α-weighted 0-1 risks of f∗α and ~f∗α:

 RD,α(f∗α) = ED[l0−1,α(f∗α(x),y)] = (1−α)p,    since f∗α(x)≤0 ∀x
 RD,α(~f∗α) = ED[l0−1,α(~f∗α(x),y)] = α(1−p),    since ~f∗α(x)>0 ∀x

Therefore, RD,α(~f∗α) = 0.2 ≠ 0.15 = RD,α(f∗α), implying that the α-weighted 0-1 loss function is not uniform noise robust with H. Details are in Supplementary material Section B.1. A linearly inseparable variant can also be constructed; another linearly inseparable distribution based counter example is available in Supplementary material Section B.4.
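The risk comparison in Example 1 can be replayed numerically. The sketch below assumes α = 0.25 and p = 0.2, values consistent with the risks 0.15 and 0.2 quoted in the text, and takes ρ = 0.2 (> 1/12) for illustration:

```python
# Numerical check of Example 1 under the stated assumptions.
p, alpha = 0.2, 0.25

# Clean Bayes classifier sign(eta - alpha) is <= 0 everywhere since p < alpha,
# so it always predicts -1 and errs only on the positive class.
risk_clean_bayes = (1 - alpha) * p            # = 0.15

# For rho > 1/12, eta_tilde = (1-2*rho)*p + rho exceeds alpha, so the
# corrupted-distribution Bayes classifier always predicts +1; its risk on
# the CLEAN distribution is then alpha*(1-p).
rho = 0.2
eta_tilde = (1 - 2 * rho) * p + rho
assert eta_tilde > alpha
risk_corrupt_bayes = alpha * (1 - p)          # = 0.2

assert risk_clean_bayes != risk_corrupt_bayes
```

The gap 0.2 vs. 0.15 is exactly the failure of cost sensitive uniform noise robustness that the example exhibits.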

In view of the above example, one can try to use the principle of inductive bias, i.e., consider a strict subset of the above set of classifiers; however, Example 2 below shows that even the class of linear classifiers need not be cost sensitive uniform noise robust.

###### Example 2

Consider a one-dimensional training set with a uniform probability distribution over its examples, and let the linear classifier be of the form fl(x) = x + b. Let α and the uniform noise rate ρ be suitably fixed. Then,

 f∗lα = argmin_fl ED[l0−1,α(y,fl)] has b∗∈(−12,−8)   with   Rα,D(f∗lα)=0.
 ~f∗lα = argmin_~fl E~D[l0−1,α(~y,~fl)] has ~b∗∈(−3,∞)   with   Rα,D(~f∗lα)=0.2.

Details of Example 2 are available in Supplementary material Section B.2. To avoid the above counter examples, we resort to a convex surrogate loss function and a type of regularization which restricts the hypothesis class. Consider an α-weighted uneven margin loss function lα,un [17] with its optimal classifiers on D and ~D denoted by f∗lα,un and ~f∗lα,un respectively. Regularized risk minimization, defined below, is known to avoid over-fitting.

 RrD,lα,un(f)=ED[lα,un(f(x),y)]+λ∥f∥22,     where   λ>0 (7)

Let the regularized risk of f on ~D be Rr~D,lα,un(f). Also, let the minimizers of the clean and corrupted lα,un-regularized risks be f∗r,lα,un and ~f∗r,lα,un. Now, Definition 2 can be specialized to lα,un so as to assure cost sensitivity, classification calibration and uniform noise robustness as follows:

###### Definition 3 ((α,γ,ρ)-robustness of risk minimization)

For a loss function lα,un and classifiers f∗lα,un and ~f∗lα,un, risk minimization is said to be (α,γ,ρ)-robust if

 RD,α(~f∗lα,un)=RD,α(f∗lα,un). (8)

Further, if the classifiers in equation (8) are f∗r,lα,un and ~f∗r,lα,un, then we say that regularized risk minimization under lα,un is (α,γ,ρ)-robust.

Due to the squared loss's SLN robustness property [9], we check whether (lα,usq, H) is (α,γ,ρ)-robust or not. It is not, as shown with γ = 1 in Example 3; details of Example 3 are available in Supplementary material Section B.3.

###### Example 3

Consider the settings as in Example 1 and let γ = 1. Then, for all x, f∗lα,usq(x) < 0 and ~f∗lα,usq(x) > 0.

 And,   RD,α(f∗lα,usq)=(1−α)p=0.15,    RD,α(~f∗lα,usq)=α(1−p)=0.2.

Hence, RD,α(~f∗lα,usq) ≠ RD,α(f∗lα,usq), implying that lα,usq based ERM may not be cost sensitive uniform noise robust.

We again have a negative result with l0−1,α when we consider the hypothesis class Hlin. In the next section, we present a positive result and show that regularized risk minimization under the loss function lα,usq is (α,γ,ρ)-robust if the hypothesis class is restricted to Hlin.

## 3 (lα,usq,Hlin) is (α,γ,ρ) robust

In this section, we consider the α-weighted γ-uneven margin squared loss function lα,usq from equation (4) with the restricted hypothesis class Hlin and show a positive result: regularized cost sensitive risk minimization under lα,usq is (α,γ,ρ)-robust. A proof is available in Supplementary material Section A.1.

###### Theorem 3.1

(lα,usq, Hlin) is (α,γ,ρ)-robust, i.e., linear classifiers obtained from α-weighted γ-uneven margin squared loss based regularized risk minimization are SLN robust.

###### Remark 1

The above results relating to the counter examples and Theorem 3.1 about cost sensitive uniform noise robustness can be summarized as follows: there are two loss functions, l0−1,α and lα,usq, and two hypothesis classes, H and Hlin. Out of the four combinations of loss functions and hypothesis classes, only lα,usq with Hlin is cost sensitive uniform noise robust; the others are not.

Next, we provide a closed form expression for the classifier learnt on corrupted data by minimizing the empirical lα,usq-regularized risk. We also provide a performance bound on the clean risk of this classifier.

### 3.1 lα,usq based classifier from corrupted data & its performance

In this subsection, we present a descriptive scheme to learn a cost-sensitive linear classifier in the presence of noisy labels by minimizing the lα,usq based regularized empirical risk, i.e.,

 ^fr:=argminf∈Hlin^Rr~D,lα,usq(f), (9)

where α is user given, and γ and the regularization parameter λ are to be tuned by cross validation. A proof is available in Supplementary material Section A.2.

###### Proposition 1

Consider the corrupted regularized empirical risk ^Rr~D,lα,usq(f) of f ∈ Hlin. Then, the optimal (α,γ,ρ)-robust linear classifier in Hlin has the following form:

 ^fr=¯w∗=(A+λI)−1c,    λ>0 (10)

where ¯w∗ = (w∗, b∗) is an (n+1)-dimensional vector of variables; A, an (n+1)×(n+1)-dimensional known symmetric matrix, and c, an (n+1)-dimensional known vector, are as follows:

A+λI=\begin{bmatrix}
\sum_{i=1}^m x_{i1}^2 a_i+\lambda & \sum_{i=1}^m x_{i1}x_{i2}a_i & \dots & \sum_{i=1}^m x_{i1}x_{in}a_i & \sum_{i=1}^m x_{i1}a_i\\
\sum_{i=1}^m x_{i2}x_{i1}a_i & \sum_{i=1}^m x_{i2}^2 a_i+\lambda & \dots & \sum_{i=1}^m x_{i2}x_{in}a_i & \sum_{i=1}^m x_{i2}a_i\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\sum_{i=1}^m x_{in}x_{i1}a_i & \sum_{i=1}^m x_{in}x_{i2}a_i & \dots & \sum_{i=1}^m x_{in}^2 a_i+\lambda & \sum_{i=1}^m x_{in}a_i\\
\sum_{i=1}^m x_{i1}a_i & \sum_{i=1}^m x_{i2}a_i & \dots & \sum_{i=1}^m x_{in}a_i & \sum_{i=1}^m a_i+\lambda
\end{bmatrix},
\qquad
c=\begin{bmatrix}\sum_{i=1}^m x_{i1}c_i\\ \sum_{i=1}^m x_{i2}c_i\\ \vdots\\ \sum_{i=1}^m x_{in}c_i\\ \sum_{i=1}^m c_i\end{bmatrix}

with the scalars ai and ci computed from (xi, ~yi) and the costs α, γ; their exact expressions appear in the proof in Supplementary material Section A.2.
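The closed form of equation (10) is a regularized weighted least squares solve over the augmented features (x, 1). The sketch below shows only this algebra; the per-example scalars a_i and c_i are placeholders (the paper derives their exact expressions from the corrupted labels and the costs α, γ), so this is not a full implementation of the method:

```python
# Solving w_bar = (A + lambda*I)^{-1} c with A = Xa^T diag(a) Xa and
# c = Xa^T c_vec for augmented features Xa = [X, 1].
import numpy as np

def fit_linear(X, a, c_vec, lam):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    A = Xa.T @ (a[:, None] * Xa)                   # sum_i a_i x_i x_i^T
    c = Xa.T @ c_vec                               # sum_i c_i x_i
    return np.linalg.solve(A + lam * np.eye(Xa.shape[1]), c)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
a = np.ones(50)                # PLACEHOLDER weights, not the paper's a_i
c_vec = rng.normal(size=50)    # PLACEHOLDER targets, not the paper's c_i
w_bar = fit_linear(X, a, c_vec, lam=0.1)
assert w_bar.shape == (4,)     # n + 1 coefficients (weights + bias)
```

Since A + λI is symmetric positive definite for λ > 0, the solve is well posed; no noise rate appears anywhere in the computation, matching the claim that the scheme does not need ρ.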

Next, we provide a result on the performance of ^fr in terms of the Rademacher complexity of the function class Hlin. For this, we need Lemmas 1 and 2, whose proofs are available in Supplementary material Sections A.3 and A.4 respectively.

###### Lemma 1

Consider the α-weighted γ-uneven margin squared loss lα,usq, which is locally L-Lipschitz over linear classifiers with bounded norm ∥w∥₂ ≤ W. Then, with probability at least 1−δ,

 maxf∈Hlin|^R~D,lα,usq(f)−R~D,lα,usq(f)|≤2LR(Hlin)+√log(1/δ)2m,

where R(Hlin) is the Rademacher complexity of the function class Hlin, defined with σi's as independent uniform random variables taking values in {−1, +1}.

###### Lemma 2

For a classifier f ∈ Hlin and user given α, the risks on the clean and corrupted distributions satisfy the following equation:

 R~D,lα,usq(f) = (1−2ρ)RD,lα,usq(f) + ρ EX[(1−α)(1−f(x))² + (α/γ)(1+γf(x))²] (11)
###### Theorem 3.2

Under the settings of Lemma 1, with probability at least 1−δ,

 RD,lα,usq(^fr) ≤minf∈HlinRD,lα,usq(f)+4LR(Hlin)+2√log(1/δ)2m+2λW2+ 4ρ(1−2ρ)EX[(~fl∗lα,usq(x)−(1−2ρ)^fr(x))(η(x)−α)]

where ~fl∗lα,usq is the linear minimizer of R~D,lα,usq(f) and η(x) is the in-class probability for D. Furthermore, as lα,usq is α-CC, there exists a non-decreasing and invertible function ψlα,usq with ψlα,usq(0) = 0 such that,

 RD,α(^fr)−RD,α(f∗α) ≤ ψ−1lα,usq(minf∈HlinRD,lα,usq(f)−minf∈HRD,lα,usq(f)+4LR(Hlin) (12) +2√log(1/δ)2m+4ρ(1−2ρ)EX[(~fl∗lα,usq(x)−^fr(x))(η(x)−α)] +8ρ2(1−2ρ)EX[^fr(x)(η(x)−α)]+2λW2).

A proof of Theorem 3.2 is available in Supplementary material Section A.5. The first two terms (involving the difference of minimal risks) on the right hand side of equation (12) denote the approximation error, which is small if Hlin is large; the third term, involving the Rademacher complexity, denotes the estimation error, which is small if Hlin is small. The fourth term denotes the sample complexity, which vanishes as the sample size increases. The bound in (12) can be used to show consistency of lα,usq based regularized ERM if the argument of ψ⁻¹lα,usq tends to zero as the sample size increases. However, in this case, this is not obvious because the last two terms involving the noise rate may not vanish with increasing sample size. In spite of this, our empirical experience with this algorithm is very good.

## 4 A re-sampling based Algorithm (~η,α)

In this section, we present a cost sensitive label prediction algorithm based on re-balancing the noisy training set (guided by the costs) given to the learning algorithm. Let us consider the uneven margin version of the α-weighted 0-1 loss from equation (1), defined as follows:

 l0−1,α,γ(f(x),y)=(1−α)1{y=1,f(x)≤0}+(α/γ)1{y=−1,γf(x)>0},  ∀α∈(0,1)

where α is the user given cost and γ, a tunable cost, handles the class imbalance. This definition is along the lines of the uneven margin losses defined in [17]. Let the l0−1,α,γ-risk on D be RD,α,γ(f) and the corresponding optimal classifier be f∗0−1,α,γ:

 f∗0−1,α,γ=argminf∈H RD,α,γ(f)=sign(η(x)−α/(γ+(1−γ)α)). (13)

Also, let the l0−1,α,γ-risk on ~D be R~D,α,γ(f) and the corresponding optimal classifier be ~f∗0−1,α,γ, as given below:

 ~f∗0−1,α,γ = sign(~η(x)−α/(γ+(1−γ)α)). (14)

We propose Algorithm (~η,α), which is mainly based on two ideas: (i) predictions based on a threshold of 0.5 can be made to correspond to predictions based on a different threshold if the number of negative examples in the training set is multiplied by a suitable factor (Theorem 1 of [4]); (ii) for a given x, η(x) and ~η(x) lie on the same side of the threshold 0.5 when the noise rate is ρ < 0.5. We first formalize the latter idea in terms of a general result. A proof is available in Supplementary material Section A.6.

###### Lemma 3

In SLN models, for a given noise rate ρ < 0.5, the clean and corrupted class marginals π and ~π satisfy the following condition:

 π⋚0.5⇒~π⋚0.5.

Further, the above monotonicity holds for η(x) and ~η(x) too.

In our case, cost sensitive label prediction requires the desired threshold on ~η to be α/(γ+(1−γ)α), but the threshold we can use is 0.5. If m+ and m− are the numbers of positive and negative examples in the corrupted sample, we re-sample the negative class so that thresholding at 0.5 on the re-balanced data corresponds to thresholding at α/(γ+(1−γ)α) on the original data. As we have access to only corrupted data, the learning scheme is: re-balance the corrupted data as above and then threshold ~η at 0.5. Since, for the SLN model, predictions made by thresholding η at 0.5 are the same as the predictions made by thresholding ~η at 0.5 (Lemma 3), a test point is assigned the label predicted by the classifier learnt on the re-balanced corrupted data. The main advantage of this algorithm is that it doesn't require knowledge of the true noise rate. Also, unlike Section 3's scheme involving lα,usq based regularized ERM, this algorithm uses ~η estimates and hence is a generative learning scheme.

Since we do not want to lose any minority (rare) class examples, we reassign, WLOG if needed, positive labels to the minority class, implying that negative class examples are always under-sampled. The performance of Algorithm (~η,α) depends mainly on the sampling procedure and the ~η estimation method used.
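A minimal sketch of the re-balancing step is given below; the under-sampling factor t/(1−t) for a desired threshold t is our reading of the Theorem 1 of [4] style argument, not the paper's exact pseudocode, and the data here are synthetic:

```python
# Under-sample the negative class so that thresholding the corrupted
# in-class probability estimate at 0.5 emulates the cost-sensitive
# threshold t = alpha / (gamma + (1 - gamma) * alpha).
import numpy as np

def resample_negatives(X, y_tilde, alpha, gamma, rng):
    t = alpha / (gamma + (1 - gamma) * alpha)   # desired threshold
    neg = np.flatnonzero(y_tilde == -1)
    pos = np.flatnonzero(y_tilde == 1)
    k = int(round(len(neg) * t / (1 - t)))      # negatives to keep
    keep = rng.choice(neg, size=min(k, len(neg)), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y_tilde[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(rng.random(100) < 0.5, 1, -1)
Xb, yb = resample_negatives(X, y, alpha=0.2, gamma=1.0, rng=rng)
assert np.all(np.unique(yb) == np.array([-1, 1]))
```

With α < 0.5 and γ = 1 the factor t/(1−t) is below 1, so only negatives are ever dropped, in line with the text's remark that positive (minority) examples are never lost.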

###### Remark 2

Algorithm (~η,α) exploits the fact that the classifier learnt on the re-sampled data thresholds ~η at 0.5, so its corrupted risk matches that of ~f∗0−1,α,γ. However, due to the counter examples in Section 2, these risks may not be equal to the clean optimal risk; hence, this scheme is not in contradiction to Section 2. Moreover, as the ~η estimation methods use a subset of H (e.g., LSPC and KLIEP use linear combinations of finitely many Gaussian kernels as basis functions), these risks may instead equal the risk of the best estimate obtainable from a strict subset of the hypothesis class. Also, based on the very good empirical performance of the scheme, we believe that the learnt classifier is close to the cost sensitive Bayes classifier.

## 5 Comparison of lα,usq based regularized ERM and Algorithm (~η,α) to existing methods on UCI datasets

In this section, we consider some UCI datasets [2] and demonstrate that (lα,usq, Hlin) is (α,γ,ρ)-robust. Also, we demonstrate the performance of Algorithm (~η,α) with ~η estimated using Lk-fun, LSPC and KLIEP. In addition to Accuracy (Acc) and the Arithmetic mean (AM) of the True positive rate (TPR) and True negative rate (TNR), we also consider two measures suited for evaluating classifiers learnt on imbalanced data, viz., the F measure and Weighted cost (WC), defined as below:

 F=2TP/(2TP+FP+FN)    and    WC=(1−α)FN+(α/γ)FP

where TP, TN, FP, FN are the numbers of true positives, true negatives, false positives and false negatives for a classifier.
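These four measures can be computed directly from the confusion counts; the WC line below assumes the α/γ weighting used for the uneven margin loss:

```python
# Acc, AM, F and WC from a confusion matrix (counts chosen for illustration).
def metrics(TP, TN, FP, FN, alpha, gamma):
    acc = (TP + TN) / (TP + TN + FP + FN)
    tpr = TP / (TP + FN)
    tnr = TN / (TN + FP)
    am = (tpr + tnr) / 2                          # arithmetic mean of rates
    f = 2 * TP / (2 * TP + FP + FN)               # F measure
    wc = (1 - alpha) * FN + (alpha / gamma) * FP  # weighted cost
    return acc, am, f, wc

acc, am, f, wc = metrics(TP=40, TN=30, FP=10, FN=20, alpha=0.25, gamma=1.0)
assert abs(acc - 0.7) < 1e-12
assert abs(wc - 17.5) < 1e-12
```

Note that AM, unlike Accuracy, is insensitive to class imbalance, which is why both are reported.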

To account for randomness in the flips used to simulate a given noise rate, we repeat each experiment 10 times, with independent corruptions of the data set for the same noise rate ρ. In every trial, the data is partitioned into train and test sets. Uniform noise induced data is used for training and validation (if there are any parameters to be tuned, like γ). Finally, clean test data is used for evaluation. The regularization parameter λ is tuned over a finite grid. On a synthetic dataset, we observed that the performance of our methods and the cost sensitive Bayes classifier on clean data w.r.t. the Accuracy, AM, F and Weighted Cost measures is comparable for moderate noise rates; details are in Supplementary material Section E.3. In all the tables, values close to the best across a row are in bold.
Class imbalance and domain requirement of cost

We report the Accuracy and AM values of the logistic loss based unbiased estimator (MUB) approach and the approach of surrogates for the weighted 0-1 loss (S-W0-1), as reported in [12], and compare them to the Accuracy and AM of our cost sensitive learning schemes. It is to be noted that [12] assumes that the true noise rate is known and the cost is tuned. We are more flexible and user friendly, as we don't need the noise rate, allow for a user given misclassification cost α, and tune γ.

It can be observed in Table 1 that, as far as Accuracy is concerned, Algorithm (~η,α) and lα,usq based regularized ERM have values comparable to those from MUB and S-W0-1 on all datasets. As depicted in Table 2, the proposed algorithms have marginally better AM values than the MUB and S-W0-1 methods. Due to the lack of a benchmark w.r.t. F and WC, on these measures we compared our schemes to SVMs trained on clean data and observed that our schemes fare well w.r.t. these measures too. However, due to space constraints, the details are presented in Supplementary material Section E.1.