1 Introduction
We are interested in cost sensitive label prediction when only noise corrupted labels are available. Labels might be corrupted when the data has been collected by crowdsourcing with limited labeling expertise. We consider the basic case in which the induced label noise is independent of the class and the example, i.e., the symmetric/uniform noise model. In the real world, various applications require differential misclassification costs due to class imbalance, domain requirements, or both; we elaborate on these below.
In case 1, there is no explicit need for different penalization of the classes, but the data is imbalanced. Example: predicting whether the age of an applicant for a vocational training course is above or below 15 years. As the general tendency is to apply after high school, the data is imbalanced, but there is no domain-given asymmetric misclassification cost; here, the asymmetric cost should be learnt from the data.
In case 2, there is no imbalance in the data, but one class's misclassification cost is higher than the other's. Example: product recommendation by paid web advertisements. Even though the product is liked and disliked by approximately equal proportions of the population, losing a potential customer by not recommending the product is costlier than showing a paid advertisement to a customer who doesn't like it. Here, the misclassification cost has to come from the domain.
In case 3, there is both imbalance and a need for differential costing. Example: rare disease diagnosis (imbalance), where the cost of missing a patient with the disease is higher than the cost of wrongly diagnosing a healthy person with the disease. The model should incorporate both the cost from the domain and the cost due to imbalance.
In Section 1.1, we provide a summary of how these 3 cases are handled. For our experiments on cost sensitive learning under uniform noise, we consider real datasets belonging to cases 2 and 3.
Contributions

Show that, unlike the $0$-$1$ loss, the weighted $0$-$1$ loss is not cost sensitive uniform noise robust.

Show that the weighted uneven margin squared loss with linear classifiers is both uniform noise robust and handles cost sensitivity. Present a performance bound for a classifier obtained from regularized ERM based on this loss.

Propose a resampling based scheme for cost sensitive label prediction in the presence of uniform noise, using in-class probability estimates.

Unlike existing work, neither of the proposed schemes needs the true noise rate.

Using a balanced dataset (Bupa) which also requires a domain cost, we demonstrate that tuning the costs on such corrupted datasets can be beneficial.
Related work
For classification problems with label noise, particularly in the Empirical Risk Minimization (ERM) framework, the most recent work ([9, 13, 21, 14, 5]) aims to make the loss function noise robust and then develop algorithms. Cost sensitive learning has been widely studied by [4, 10, 17] and many others.
An extensive empirical study on the effect of label noise on cost sensitive learning is presented in [23].
The problem of cost sensitive uniform noise robustness is considered in [12], where the asymmetric misclassification cost is tuned and class dependent noise rates are cross validated over corrupted data. In contrast, our work incorporates the cost due to both imbalance and domain requirements, with the added benefit that the true noise rate need not be known.
Organization
Section 1.1 gives some details about weighted uneven margin loss functions and in-class probability estimates. In Section 2, the weighted $0$-$1$ loss is shown not to be cost sensitive uniform noise robust. Sections 3 and 4 present two different schemes that make cost sensitive predictions in the presence of uniform label noise. Section 5 has empirical evidence of the performance of the proposed methods. Some discussion and future directions are presented in Section 6.
Notations Let $D$ be the joint distribution over $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. Let the in-class probability and class marginal on $D$ be denoted by $\eta(x) := P(Y = 1 \mid X = x)$ and $\pi := P(Y = 1)$. Let the decision function be $f : \mathcal{X} \to \mathbb{R}$, the hypothesis class of all measurable functions be $\mathcal{H}$, and the class of linear hypotheses be $\mathcal{H}_{lin}$. Let $D_\rho$ denote the distribution on $\mathcal{X} \times \mathcal{Y}$ obtained by inducing symmetric label noise into $D$ with rate $\rho < 0.5$. The corrupted sample is $\tilde{S} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$. The noise rate $\rho$ is constant across classes and the model is referred to as the Symmetric Label Noise (SLN) model. In such cases, the corrupted in-class probability is $\tilde{\eta}(x) = (1 - 2\rho)\eta(x) + \rho$ and the corrupted class marginal is $\tilde{\pi} = (1 - 2\rho)\pi + \rho$. Symmetric and uniform noise are synonymous in this work.

1.1 Some relevant background
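The affine relation $\tilde{\eta}(x) = (1 - 2\rho)\eta(x) + \rho$ can be checked by a quick simulation at a single point $x$, with assumed illustrative values $\eta = 0.7$ and $\rho = 0.2$:

```python
import random

random.seed(0)
eta, rho, n = 0.7, 0.2, 200000  # illustrative values, not from the paper

flipped_pos = 0
for _ in range(n):
    y = 1 if random.random() < eta else -1  # clean label drawn from eta
    if random.random() < rho:               # symmetric flip with prob. rho
        y = -y
    flipped_pos += (y == 1)

eta_tilde_hat = flipped_pos / n             # Monte Carlo estimate
eta_tilde = (1 - 2 * rho) * eta + rho       # = 0.6*0.7 + 0.2 = 0.62
print(eta_tilde_hat, eta_tilde)
```

The empirical frequency of positive corrupted labels concentrates around the predicted value $0.62$.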
The first choice of loss function for cost sensitive learning is the weighted $0$-$1$ loss, defined as follows:
$l_\alpha(y, f(x)) := \alpha\, \mathbf{1}_{[y = 1,\, f(x) \le 0]} + (1-\alpha)\, \mathbf{1}_{[y = -1,\, f(x) > 0]}. \quad (1)$
Let the weighted $0$-$1$ risk be $R_\alpha(f) := E_D[l_\alpha(y, f(x))]$. The minimizer of this risk is $f^*_\alpha(x) = \mathrm{sign}(\eta(x) - (1-\alpha))$, referred to as the cost-sensitive Bayes classifier. For a surrogate loss $l$, the corresponding risk and its minimizer are defined as $R_l(f) := E_D[l(y, f(x))]$ and $f^*_l := \arg\min_f R_l(f)$ respectively. Consider the following notion of classification calibration.
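The threshold $1-\alpha$ in the cost-sensitive Bayes classifier follows by comparing conditional expected costs; a short standard derivation:

```latex
% Conditional expected cost of each prediction at a point x:
\mathbb{E}[l_\alpha \mid x,\ \hat{y} = -1] = \alpha\,\eta(x), \qquad
\mathbb{E}[l_\alpha \mid x,\ \hat{y} = +1] = (1-\alpha)\,(1-\eta(x)).
% Predicting +1 is optimal iff its expected cost is no larger:
(1-\alpha)\,(1-\eta(x)) \le \alpha\,\eta(x)
\;\Longleftrightarrow\; \eta(x) \ge 1-\alpha.
% Hence the cost-sensitive Bayes classifier thresholds eta(x) at 1 - alpha.
```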
Definition 1 (Classification Calibration [17])
For $\alpha \in (0, 1)$ and a loss function $l$, define the weighted loss:
$l^{\alpha}(y, f(x)) := \alpha\, \mathbf{1}_{[y=1]}\, l(y, f(x)) + (1-\alpha)\, \mathbf{1}_{[y=-1]}\, l(y, f(x)), \quad (2)$
where $R_{l^\alpha}(f) := E_D[l^\alpha(y, f(x))]$ and $R^*_{l^\alpha} := \min_f R_{l^\alpha}(f)$. $l^\alpha$ is classification calibrated (CC) iff there exists a convex, non-decreasing and invertible transformation $\psi$ with $\psi(0) = 0$, such that
$\psi\big(R_\alpha(f) - R^*_\alpha\big) \le R_{l^\alpha}(f) - R^*_{l^\alpha}. \quad (3)$
If the classifiers obtained from CC losses are consistent w.r.t. the surrogate risk, then they are also consistent w.r.t. the weighted $0$-$1$ risk. We consider the weighted uneven margin squared loss [17], which is CC by construction and defined as follows:
$l_{\alpha,\gamma}(y, f(x)) := \alpha\, \mathbf{1}_{[y=1]}\, (1 - f(x))^2 + (1-\alpha)\, \mathbf{1}_{[y=-1]}\, \tfrac{1}{\gamma}\,(1 + \gamma f(x))^2, \quad \gamma > 0. \quad (4)$
Interpretation of $\alpha$ and $\gamma$
The role of $\alpha$ and $\gamma$ can be related to the three cases of differential costing described at the start of this paper. In case 1, there are 3 options: fix $\gamma$ and let the tuned $\alpha$ pick up the imbalance; fix $\alpha$ and tune $\gamma$; or tune both $\alpha$ and $\gamma$. Our experimental results suggest that the latter two perform equally well. For case 2, $\alpha$ is given and $\gamma$ can be fixed at $1$. However, we observe that even in this case tuning $\gamma$ can be more informative. For case 3, $\gamma$ is tuned and $\alpha$ is given a priori.
There would be a tradeoff between $\alpha$ and $\gamma$, i.e., for a given $\alpha$, there would be an optimal $\gamma$ in a suitable sense. The above observations are based on the experiments described in Supplementary material Section D.
Choice of in-class probability estimation method
As in-class probability estimates $\hat{\eta}$ are required in the resampling scheme, we investigated the performance of 4 methods:
Lkfun [16], which uses a classifier to get the estimate $\hat{\eta}$; a least squares method, interpreting $\eta(x)$ via a conditional expectation and obtaining it by a suitable squared deviation minimization; LSPC [19], density ratio estimation by $l_2$ norm minimization; and KLIEP [20], density ratio estimation by KL divergence minimization.
We chose to use Lkfun with the logistic loss and the squared loss, LSPC, and a normalized version of KLIEP, because in the resampling algorithm we are concerned with label prediction, and these estimators performed equally well on the Accuracy measure. A detailed study is available in Supplementary material Section F.
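As an illustration of the conditional expectation route mentioned above (a minimal least squares sketch with a linear model, not the paper's exact estimator): since $E[Y \mid x] = 2\eta(x) - 1$ for $Y \in \{-1, +1\}$, an estimate of this expectation yields $\hat{\eta}(x) = (1 + \hat{E}[Y \mid x])/2$.

```python
import numpy as np

def eta_hat_least_squares(X, y):
    """Estimate eta(x) via E[Y|x]: fit f(x) = x.w to labels y in {-1,+1}
    by least squares, then map eta_hat = (1 + f)/2, clipped to [0, 1]."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    def eta_hat(Xq):
        return np.clip((1 + Xq @ w) / 2, 0.0, 1.0)
    return eta_hat

rng = np.random.default_rng(1)
X = np.c_[np.ones(500), rng.uniform(-1, 1, 500)]  # intercept + one feature
eta_true = 0.5 + 0.4 * X[:, 1]                    # linear eta in [0.1, 0.9]
y = np.where(rng.random(500) < eta_true, 1, -1)
eta_hat = eta_hat_least_squares(X, y)
print(np.abs(eta_hat(X) - eta_true).mean())       # small average error
```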
2 Cost sensitive Bayes classifiers using the weighted $0$-$1$ loss need not be uniform noise robust
The robustness notion for risk minimization in cost insensitive scenarios was introduced by [9], who also proved that cost insensitive $0$-$1$ loss based risk minimization is SLN robust. We extend this definition to cost-sensitive learning.
Definition 2 (Cost sensitive noise robust)
Let $f^*$ and $f^*_\rho$ be classifiers obtained from the clean and corrupted distributions $D$ and $D_\rho$ using any arbitrary scheme $A$; the scheme $A$ is said to be cost sensitive noise robust if
$R_\alpha(f^*_\rho) = R_\alpha(f^*).$
If $f^*$ and $f^*_\rho$ are obtained from a cost sensitive loss function and the induced noise is symmetric, then $A$ is said to be cost sensitive uniform noise robust.
Let the weighted $0$-$1$ risk on $D_\rho$ be denoted by $R_{\alpha,\rho}$. If one is interested in cost sensitive learning with noisy labels, then the sufficient condition of [5] becomes $l_\alpha(f(x), 1) + l_\alpha(f(x), -1) = K$ for some constant $K$. For the weighted $0$-$1$ loss this condition is satisfied if and only if $\alpha = 0.5$, implying that it cannot be a sufficient condition for SLN robustness if there is a differential costing of the classes.
Let $f^*_\alpha$ and $\tilde{f}^*_\alpha$ be the minimizers of $R_\alpha$ and $R_{\alpha,\rho}$. Then, it is known that they have the following form:
$f^*_\alpha(x) = \mathrm{sign}(\eta(x) - (1-\alpha)), \qquad \tilde{f}^*_\alpha(x) = \mathrm{sign}(\tilde{\eta}(x) - (1-\alpha)).$
Example 1
Let $X$ have a Bernoulli distribution, and let the in-class probability $\eta(x)$, the cost $\alpha \neq 0.5$ and the noise rate $\rho$ be chosen as in Supplementary material Section B.1. Comparing the weighted $0$-$1$ risks of $f^*_\alpha$ and $\tilde{f}^*_\alpha$ then gives $R_\alpha(\tilde{f}^*_\alpha) \neq R_\alpha(f^*_\alpha)$, implying that the weighted $0$-$1$ loss function is not uniform noise robust when $\alpha \neq 0.5$. Details are in Supplementary material Section B.1. Note that the distribution in this example is linearly separable; a linearly inseparable variant can also be constructed. Another linearly inseparable distribution based counter example is available in Supplementary material Section B.4.
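The mechanism behind such counterexamples can be seen concretely with illustrative values (hypothetical, not those of Example 1): symmetric noise pulls $\eta$ toward $0.5$, which can move a point across the cost sensitive threshold $1-\alpha$ whenever $\alpha \neq 0.5$.

```python
# Illustrative values (hypothetical): under SLN the corrupted in-class
# probability is eta_tilde(x) = (1 - 2*rho)*eta(x) + rho, while the
# cost-sensitive Bayes classifier thresholds eta(x) at 1 - alpha.
alpha, rho = 0.8, 0.3
threshold = 1 - alpha           # = 0.2

def corrupt(eta):
    return (1 - 2 * rho) * eta + rho

eta = 0.15                      # clean in-class probability at some x
eta_t = corrupt(eta)            # = 0.4*0.15 + 0.3 = 0.36

clean_pred = 1 if eta > threshold else -1    # eta below 0.2 -> predict -1
noisy_pred = 1 if eta_t > threshold else -1  # eta_tilde above 0.2 -> predict +1
print(clean_pred, noisy_pred)
```

The two predictions disagree at this point, so the corrupted cost-sensitive Bayes classifier incurs a different weighted $0$-$1$ risk than the clean one.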
In view of the above example, one can try to use the principle of inductive bias, i.e., consider a strict subset of the above set of classifiers; however, Example 2 below shows that even the class of linear classifiers need not be cost sensitive uniform noise robust.
Example 2
Consider a training set with a uniform probability distribution over its examples, and let the linear classifier be of the form $f(x) = w^{\top}x + b$. For a suitable cost $\alpha$ and uniform noise rate $\rho$, the weighted $0$-$1$ risks of the optimal linear classifiers on the clean and corrupted distributions differ. Details of Example 2 are available in Supplementary material Section B.2.

To avoid the above counter examples, we resort to a convex surrogate loss function and a type of regularization which restricts the hypothesis class. Consider a weighted uneven margin loss function [17] with its optimal classifiers on the clean and corrupted distributions. Regularized risk minimization, defined below, is known to avoid overfitting:
$\min_{f \in \mathcal{H}_{lin}}\ E_D[l(y, f(x))] + \lambda\, \lVert f \rVert^2. \quad (7)$
Let the regularized risk of $f$ on $D_\rho$ be denoted by $\tilde{R}_{reg}(f)$. Also, let the minimizers of the clean and corrupted regularized risks be $f^*_{reg}$ and $\tilde{f}^*_{reg}$. Now, Definition 2 can be specialized to assure cost sensitivity, classification calibration and uniform noise robustness as follows:
Definition 3 (Robustness of risk minimization)
For a loss function $l$ and classifiers $f^*$ and $\tilde{f}^*$ learnt from the clean and corrupted distributions, risk minimization is said to be noise robust if
$R_\alpha(\tilde{f}^*) = R_\alpha(f^*). \quad (8)$
Further, if the classifiers in equation (8) are the minimizers of the clean and corrupted regularized risks, then we say that regularized risk minimization under $l$ is noise robust.
Due to the squared loss's SLN robustness property [9], we check whether weighted squared loss based risk minimization is robust or not. It is not when $\alpha \neq 0.5$, as shown in Example 3; details of Example 3 are available in Supplementary material Section B.3.
Example 3
Consider the settings of Example 1, with the cost $\alpha$ and noise rate $\rho$ chosen as in Supplementary material Section B.3. Then the minimizers of the clean and corrupted weighted squared risks differ in their weighted $0$-$1$ risks, implying that weighted squared loss based ERM may not be cost sensitive uniform noise robust.
We thus again have a negative result when the hypothesis class is that of all measurable functions. In the next section, we present a positive result and show that regularized risk minimization under the weighted uneven margin squared loss is robust if the hypothesis class is restricted to linear classifiers.
3 Weighted uneven margin squared loss on linear classifiers is robust
In this section, we consider the weighted uneven margin squared loss function from equation (4) with the hypothesis class restricted to linear classifiers and show a positive result: regularized cost sensitive risk minimization under this loss function is uniform noise robust. A proof is available in Supplementary material Section A.1.
Theorem 3.1
Linear classifiers obtained from weighted uneven margin squared loss based regularized risk minimization are cost sensitive uniform noise (SLN) robust.
Remark 1
The above counter examples and Theorem 3.1 about cost sensitive uniform noise robustness can be summarized as follows. There are two loss functions, the weighted $0$-$1$ loss and the weighted uneven margin squared loss, and two hypothesis classes, the class of all measurable functions and the class of linear classifiers. Out of the four combinations of loss functions and hypothesis classes, only the weighted uneven margin squared loss with linear classifiers is cost sensitive uniform noise robust; the others are not.
Next, we provide a closed form expression for the classifier learnt on corrupted data by minimizing empirical regularized risk. We also provide a performance bound on the clean risk of this classifier.
3.1 Weighted uneven margin squared loss based classifier from corrupted data and its performance
In this subsection, we present a descriptive scheme to learn a cost-sensitive linear classifier in the presence of noisy labels by minimizing the weighted uneven margin squared loss based regularized empirical risk, i.e.,
$\hat{f} := \arg\min_{f \in \mathcal{H}_{lin}}\ \frac{1}{n} \sum_{i=1}^{n} l_{\alpha,\gamma}(\tilde{y}_i, f(x_i)) + \lambda\, \lVert f \rVert^2, \quad (9)$
where the cost $\alpha$ is user given, and $\gamma$ and the regularization parameter $\lambda$ are to be tuned by cross validation. A proof is available in Supplementary material Section A.2.
Proposition 1
Consider the corrupted regularized empirical risk of equation (9). Then, the optimal robust linear classifier has the following form:
$\hat{f}(x) = \hat{w}^{\top} x, \qquad \hat{w} = A^{-1} b, \quad (10)$
where $\hat{w}$ is a $d$ dimensional vector of variables, $A$ is a $d \times d$ known symmetric matrix, and $b$ is a $d$ dimensional known vector; both $A$ and $b$ are computable from the corrupted sample, the costs $\alpha, \gamma$ and the regularization parameter $\lambda$ (explicit expressions are in Supplementary material Section A.2).
Next, we provide a result on the performance of $\hat{f}$ in terms of the Rademacher complexity of the linear function class. For this, we need Lemmas 1 and 2, whose proofs are available in Supplementary material Sections A.3 and A.4 respectively.
Lemma 1
The weighted uneven margin squared loss is locally Lipschitz on bounded sets, with a Lipschitz constant $L$ depending on $\alpha$ and $\gamma$. Then, with probability at least $1-\delta$, the gap between the empirical and population corrupted risks, uniformly over $f \in \mathcal{H}_{lin}$, is bounded by a term of order $L\, \mathcal{R}(\mathcal{H}_{lin}) + \sqrt{\log(1/\delta)/n}$,
where $\mathcal{R}(\mathcal{H}_{lin})$ is the Rademacher complexity of the function class, defined using independent uniform random variables $\sigma_i$ taking values in $\{-1, +1\}$.

Lemma 2
For a classifier $f$ and user given costs, the risks on the clean and corrupted distributions satisfy the following equation:
$E_{D_\rho}[l_{\alpha,\gamma}(\tilde{y}, f(x))] = (1-\rho)\, E_{D}[l_{\alpha,\gamma}(y, f(x))] + \rho\, E_{D}[l_{\alpha,\gamma}(-y, f(x))]. \quad (11)$
Theorem 3.2
Under the settings of Lemma 1, with probability at least $1-\delta$, the clean weighted $0$-$1$ risk of $\hat{f}$ satisfies a bound of the form
$R_\alpha(\hat{f}) - R^*_\alpha \le \psi^{-1}\big(\text{approximation error} + \text{estimation error} + \text{sample complexity term} + \text{noise rate terms}\big), \quad (12)$
where the approximation error is the difference between the best surrogate risk over linear classifiers and over all measurable functions, the estimation error involves the Rademacher complexity of $\mathcal{H}_{lin}$, and, as the weighted uneven margin squared loss is CC, $\psi$ is the non-decreasing and invertible function with $\psi(0) = 0$ guaranteed by classification calibration. The full expression is in Supplementary material Section A.5.
A proof of Theorem 3.2 is available in Supplementary material Section A.5. The first terms on the right hand side of equation (12), involving the difference of surrogate risks, constitute the approximation error, which is small if the hypothesis class is large. The term involving the Rademacher complexity is the estimation error, which is small if the hypothesis class is small, and the sample complexity term vanishes as the sample size increases. The bound in (12) can be used to show consistency of the regularized ERM if the argument of $\psi^{-1}$ tends to zero as the sample size increases. However, this is not obvious here, because the terms involving the noise rate may not vanish with increasing sample size. In spite of this, our empirical experience with this algorithm is very good.
4 A resampling based Algorithm
In this section, we present a cost sensitive label prediction algorithm based on rebalancing the noisy training set given to the learning algorithm, where the rebalancing is guided by the costs. Consider an uneven margin version of the weighted $0$-$1$ loss from equation (1), with user given cost $\alpha$ and a tunable cost $\gamma$ that handles the class imbalance; this definition is along the lines of the uneven margin losses defined in [17]. Let the corresponding risk on the clean distribution $D$ be $R_{\alpha,\gamma}(f)$ with optimal classifier
$f^*_{\alpha,\gamma} := \arg\min_{f}\ R_{\alpha,\gamma}(f), \quad (13)$
and let the risk on the corrupted distribution $D_\rho$ be $\tilde{R}_{\alpha,\gamma}(f)$ with the corresponding optimal classifier
$\tilde{f}^*_{\alpha,\gamma} := \arg\min_{f}\ \tilde{R}_{\alpha,\gamma}(f). \quad (14)$
We propose an algorithm which is mainly based on two ideas: (i) predictions based on the threshold $1-\alpha$ on $\eta(x)$ can correspond to predictions based on the threshold $0.5$ if the number of negative examples in the training set is multiplied by $(1-\alpha)/\alpha$ (Theorem 1 of [4]); (ii) for a given $x$, $\eta(x)$ and $\tilde{\eta}(x)$ lie on the same side of the threshold $0.5$ when the noise rate $\rho < 0.5$. We first formalize the latter idea in terms of a general result. A proof is available in Supplementary material Section A.6.
Lemma 3
In SLN models, for a given noise rate $\rho < 0.5$, the clean and corrupted class marginals $\pi$ and $\tilde{\pi}$ satisfy the following condition:
$\tilde{\pi} - 0.5 = (1 - 2\rho)(\pi - 0.5),$ so that $\pi$ and $\tilde{\pi}$ always lie on the same side of $0.5$.
Further, the above monotonicity holds for $\eta(x)$ and $\tilde{\eta}(x)$ too.
In our case, the cost sensitive label prediction requires the desired threshold to be $1-\alpha$, but the threshold which we can use is $0.5$. If $n_+$ and $n_-$ are the numbers of positive and negative examples in the corrupted training set, then we should resample so that the rebalanced dataset has $n_+$ positives and approximately $n_-(1-\alpha)/\alpha$ negatives. As we have access to only corrupted data, the learning scheme is: rebalance the corrupted data in this proportion and then threshold the estimate $\hat{\tilde{\eta}}(x)$ at $0.5$. Since, for the SLN model, predictions made by thresholding $\tilde{\eta}(x)$ at $0.5$ are the same as the predictions made by thresholding $\eta(x)$ at $0.5$, for a test point $x$ the predicted label is $\mathrm{sign}(\hat{\tilde{\eta}}(x) - 0.5)$. The main advantage of this algorithm is that it doesn't require knowledge of the true noise rate. Also, unlike Section 3's scheme involving regularized ERM, this algorithm uses in-class probability estimates and hence is a generative learning scheme.
Since we do not want to lose any minority (rare) class examples, we reassign positive labels to the minority class WLOG, if needed, implying that negative class examples are always undersampled. The performance of the algorithm depends mainly on the sampling procedure and the in-class probability estimation method used.
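The scheme above can be sketched end to end. Everything below is an illustrative stand-in: logistic regression by gradient descent in place of Lkfun/LSPC/KLIEP, and an Elkan-style undersampling factor $(1-\alpha)/\alpha$ under the assumption $\alpha \ge 0.5$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_eta_hat(X, y, lr=0.5, steps=3000):
    """Plain logistic regression by gradient descent, used here as a
    stand-in eta-hat estimator (the paper uses Lkfun / LSPC / KLIEP)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - (y == 1)) / len(y)
    return w

def resample_and_predict(X, y_noisy, X_test, alpha, rng):
    """Sketch of the resampling scheme: undersample negatives by the
    (assumed) factor (1 - alpha)/alpha so that thresholding eta-hat at
    1/2 mimics the desired threshold 1 - alpha; under SLN, eta and
    eta-tilde lie on the same side of 1/2."""
    pos = np.where(y_noisy == 1)[0]
    neg = np.where(y_noisy == -1)[0]
    n_keep = max(1, int(len(neg) * (1 - alpha) / alpha))  # alpha >= 0.5
    keep = rng.choice(neg, size=min(n_keep, len(neg)), replace=False)
    idx = np.concatenate([pos, keep])
    w = fit_eta_hat(X[idx], y_noisy[idx])
    return np.where(sigmoid(X_test @ w) > 0.5, 1, -1)

rng = np.random.default_rng(0)
X = np.c_[np.ones(400), rng.normal(size=(400, 2))]
y = np.sign(X @ np.array([0.2, 1.5, -1.0]))       # clean labels
flip = rng.random(400) < 0.2                      # symmetric noise, rho = 0.2
y_noisy = np.where(flip, -y, y)
preds = resample_and_predict(X, y_noisy, X, alpha=0.7, rng=rng)
print((preds == y).mean())                        # agreement with clean labels
```

Note that the noise rate $\rho$ is used only to generate the corrupted labels; the learning procedure itself never sees it.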
Remark 2
The resampling algorithm exploits the fact that thresholding $\hat{\tilde{\eta}}$, learnt on the resampled data, at $0.5$ yields the desired cost sensitive predictions. Due to the counter examples in Section 2, the risk of the resulting classifier may not equal the cost sensitive Bayes risk over the class of all measurable functions; hence, this scheme is not in contradiction to Section 2. However, as the estimation methods use a strict subset of this class (e.g., LSPC and KLIEP use linear combinations of finitely many Gaussian kernels as basis functions), these risks may equal the optimal risk over that restricted hypothesis class. Also, based on the very good empirical performance of the scheme, we believe that the learnt classifier is a good estimate of the cost sensitive Bayes classifier.
5 Comparison of the regularized ERM scheme and the resampling algorithm with existing methods on UCI datasets
In this section, we consider some UCI datasets [2] and demonstrate that the proposed regularized ERM scheme is robust. Also, we demonstrate the performance of the resampling algorithm with $\hat{\eta}$ estimated using Lkfun, LSPC and KLIEP. In addition to Accuracy (Acc) and the Arithmetic Mean (AM) of the True Positive Rate (TPR) and True Negative Rate (TNR), we also consider two measures suited for evaluating classifiers learnt on imbalanced data, viz., the F measure and the Weighted Cost (WC), computed from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) of a classifier.
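These measures can be computed from the confusion counts; in the sketch below, WC is taken to be the normalized weighted cost $(\alpha\,\mathrm{FN} + (1-\alpha)\,\mathrm{FP})/n$, our assumed form consistent with the weighted $0$-$1$ loss:

```python
def metrics(tp, tn, fp, fn, alpha=0.7):
    """Compute Acc, AM, F and an assumed weighted-cost WC from the
    confusion counts of a classifier; alpha is the user given cost."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    am = (tpr + tnr) / 2                      # arithmetic mean of TPR, TNR
    f1 = 2 * tp / (2 * tp + fp + fn)          # F measure
    wc = (alpha * fn + (1 - alpha) * fp) / n  # assumed weighted-cost form
    return acc, am, f1, wc

print(metrics(40, 30, 10, 20))
```

Unlike Acc, the AM and WC measures do not let a classifier hide errors on the rare class, which is why they are used for the imbalanced datasets.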
To account for randomness in the flips used to simulate a given noise rate, we repeat each experiment 10 times, with independent corruptions of the dataset for the same noise rate $\rho$. In every trial, the data is partitioned into train and test splits. Uniform noise induced data is used for training and validation (if there are any parameters to be tuned, like $\gamma$ and $\lambda$).
Finally, clean test data is used for evaluation.
The regularization parameter $\lambda$ is tuned over a finite grid by cross validation.
On a synthetic dataset, we observed that the performance of our methods and of the cost sensitive Bayes classifier on clean data w.r.t. the Accuracy, AM, F and Weighted Cost measures is comparable for moderate noise rates; details are in Supplementary material Section E.3. In all the tables, values within a small margin of the best across a row are in bold.
Class imbalance and domain requirement of cost
We report the Accuracy and AM values of the logistic loss based method of unbiased estimators (MUB) and the method of surrogates for the weighted 0-1 loss (SW01), as reported in [12], and compare them to the Accuracy and AM of our cost sensitive learning schemes. It is to be noted that [12] assumes that the true noise rate is known and the cost is tuned. We are more flexible and user friendly, as we do not need the noise rate, allow for a user given misclassification cost $\alpha$, and tune $\gamma$. It can be observed in Table 1 that, as far as Accuracy is concerned, the resampling algorithm and the regularized ERM scheme have values comparable to those from MUB and SW01 on all datasets. As depicted in Table 2, the proposed algorithms have marginally better AM values than MUB and SW01. Due to the lack of a benchmark w.r.t. F and WC, on these measures we compared our schemes to SVMs trained on clean data and observed that our schemes fare well w.r.t. these measures too. However, due to space constraints, the details are presented in Supplementary material Section E.1.
[Table 1: Accuracy on UCI datasets. Columns: Dataset, Cost, the resampling algorithm with $\hat{\eta}$ estimated from Lkfun, LSPC and KLIEP, and the baselines MUB and SW01; costs include 0.2, 0.16, 0.3 and 0.25. Numeric entries omitted.]
[Table 2: AM measure on the same datasets, with the same columns as Table 1. Numeric entries omitted.]