# Curriculum Loss: Robust Learning and Generalization against Label Corruption

Generalization is vitally important for many deep network models, and it becomes more challenging when high robustness is required for learning with noisy labels. Under the 0-1 loss, the empirical risk has a monotonic relationship with the empirical adversarial (reweighted) risk, and the loss is robust to outliers. However, it is also difficult to optimize. To optimize the 0-1 loss efficiently while keeping its robust properties, we propose a simple and efficient loss, the curriculum loss (CL). Our CL is a tighter upper bound of the 0-1 loss than conventional summation-based surrogate losses. Moreover, CL adaptively selects samples for training, in the manner of curriculum learning. To handle a large rate of label corruption, we extend the curriculum loss to a more general form that automatically prunes the estimated noisy samples during training. Experimental results on the noisy MNIST, CIFAR10 and CIFAR100 datasets validate the robustness of the proposed loss.

## 1 Introduction

Noise corruption is a common phenomenon in daily life. For instance, noisy (wrong) labels may result from annotating similar objects su2012crowdsourcing , crawling images and labels from websites Robustwebimage ; tanaka2018joint and creating training sets programmatically ratner2016data ; khetan2017learning . Learning with noisy labels is thus a promising area; however, it is challenging to train deep networks robustly with noisy labels.

Deep networks have great expressive power (model complexity) to learn challenging tasks. However, they also carry a higher risk of overfitting to the data. Although many regularization techniques have been proposed, such as adding regularization terms, data augmentation, weight decay, dropout and batch normalization, generalization remains vitally important for deep learning to fully exploit this expressive power. It becomes more challenging when robustness is required for learning with noisy labels. Zhang et al. zhang2016understanding show that deep networks can even fully memorize samples with corrupted labels. This significantly degrades the generalization performance of deep models.

Robustness of 0-1 loss: The problem caused by label corruption is that the test distribution differs from the training distribution. Hu et al. hu2016does analyzed the adversarial risk in which the test distribution density is adversarially changed within a limited $f$-divergence (e.g. KL-divergence) from the training distribution density. They show that there is a monotonic relationship between the (empirical) risk and the (empirical) adversarial risk when the 0-1 loss function is used. This suggests that minimizing the empirical risk with the 0-1 loss is equivalent to minimizing the empirical adversarial risk. Note that we evaluate models on test data in terms of the 0-1 loss (an empirical approximation of the Bayes risk w.r.t. the test distribution). Therefore, without further assumptions, the 0-1 loss is the most robust loss against label corruption. Moreover, the 0-1 loss is more robust to outliers than an unbounded (convex) loss (e.g. hinge loss) masnadi2009design . This is because an unbounded convex loss puts much weight on the outliers (with large loss) during minimization masnadi2009design . If an unbounded (convex) loss is employed in deep network models, this effect becomes more prominent: since the training loss of deep networks can often be minimized to zero, an outlier with a large loss has a large impact on the model. The 0-1 loss, in contrast, treats each training sample equally, so no single sample has too much influence on the model, and the model is tolerant to a small number of outliers.

Although the 0-1 loss has many robust properties, it is difficult to optimize. One possible way to alleviate this problem is to seek a tighter upper bound of the 0-1 loss that is still efficient to optimize. Such a bound reduces the influence of noisy outliers compared with conventional (convex) losses, while being easier to optimize than the 0-1 loss itself. When minimizing the upper bound surrogate, we expect that the 0-1 loss objective is also minimized.
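The outlier argument above can be made concrete with a small sketch. The margins below are hypothetical values chosen for illustration (they are not taken from the paper), and `loss_shares` is an illustrative helper name. The sketch compares how much of the total loss a single extreme sample contributes under the hinge loss and under the 0-1 loss.

```python
def zero_one(u):
    """0-1 loss on a classification margin u: 1 if misclassified, else 0."""
    return 1.0 if u < 0 else 0.0


def hinge(u):
    """Hinge loss, an unbounded convex upper bound of the 0-1 loss."""
    return max(0.0, 1.0 - u)


def loss_shares(margins, loss):
    """Fraction of the total loss contributed by each sample."""
    values = [loss(u) for u in margins]
    total = sum(values)
    return [value / total for value in values]


# Hypothetical margins: two mildly misclassified samples and one extreme outlier.
margins = [-0.5, -0.5, -20.0]
hinge_shares = loss_shares(margins, hinge)
zero_one_shares = loss_shares(margins, zero_one)
print("hinge shares:", hinge_shares)
print("0-1 shares:  ", zero_one_shares)
```

Under these assumed margins the outlier accounts for most of the hinge objective, whereas the 0-1 objective weights the three samples equally.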

Learnability under large noise rate: Even the 0-1 loss cannot deal with a large noise rate. When the noise rate becomes large, the systematic error (due to label corruption) grows and is no longer negligible. As a result, the model's generalization performance degrades. To reduce the systematic error produced by training with noisy labels, several methods have been proposed. They can be categorized into three kinds: transition matrix based methods sukhbaatar2014training ; patrini2017making ; goldberger2016training , regularization based methods miyato2016virtual and sample selection based methods jiang2017mentornet ; han2018co . Among them, sample selection is a promising direction: it selects samples so as to reduce the noise ratio in training. These methods are based on the idea of curriculum learning bengio2009curriculum , a successful approach that trains the model gradually with samples ordered in a meaningful sequence. Although they achieve success to some extent, most of these methods are heuristic.

To efficiently minimize the 0-1 loss while keeping its robust properties, we propose a novel loss that is a tighter upper bound of the 0-1 loss than conventional surrogate losses. Specifically, given any base loss function $l(u) \ge \mathbb{1}(u<0)$, our loss $Q(\mathbf{u})$ satisfies $\sum_{i=1}^{n} \mathbb{1}(u_i<0) \le Q(\mathbf{u}) \le \sum_{i=1}^{n} l(u_i)$, where $\mathbf{u} = [u_1,\dots,u_n]$ with $u_i$ being the classification margin of the $i$-th sample. We name it Curriculum Loss (CL) because it automatically and adaptively selects samples for training, in the manner of curriculum learning. The selection procedure can be done by a very simple and efficient algorithm in $O(n \log n)$. Moreover, since our loss is tighter than conventional surrogate losses, it is more robust than they are. To handle learning with a large noise rate, we propose Noise Pruned Curriculum Loss (NPCL) by extending our basic curriculum loss to a more general form. It reduces to the basic CL when the estimated noise rate is zero. Our NPCL automatically prunes estimated noisy samples during training. NPCL is simple and efficient, and it can be used as a plug-in for many deep models for robust learning. Our contributions are listed as follows:

• We propose a novel loss (i.e. curriculum loss) that is a tighter upper bound of 0-1 loss compared with conventional summation based surrogate loss. Our curriculum loss can automatically and adaptively select samples for training as a curriculum learning.

• We further propose Noise Pruned Curriculum Loss (NPCL) to address a large rate of noise (label corruption) by extending our curriculum loss to a more general form. Our NPCL automatically prunes the estimated noisy samples during training. Moreover, NPCL is simple and efficient, and it can be used as a plug-in in many deep models.

## 2 Related Literature

Curriculum Learning: Curriculum learning is a general learning methodology that has achieved success in many areas. The earliest work on curriculum learning bengio2009curriculum trains a model gradually with samples ordered in a meaningful sequence, which has improved performance on many problems. Since the curriculum in bengio2009curriculum is predetermined by prior knowledge and remains fixed afterwards, ignoring feedback from the learner, Kumar et al. kumar2010self further propose self-paced learning, which selects samples by alternating minimization of an augmented objective. Jiang et al. jiang2014self propose a self-paced learning method that selects samples with diversity. After that, Jiang et al. jiang2015self propose a self-paced curriculum strategy that takes different priors into consideration. Although these methods achieve success, the relation between the augmented objective of self-paced learning and the original objective (e.g. cross entropy loss for classification) is not clear. In addition, as stated in jiang2017mentornet , the alternating update in self-paced learning is not efficient for training deep networks.

Learning with Noisy Labels: The most related works are the sample selection based methods for robust learning. These works are inspired by curriculum learning bengio2009curriculum . Among them, Jiang et al. jiang2017mentornet propose to learn the curriculum from data with a mentor net, which they use to select samples for training with noisy labels. Co-teaching han2018co employs two networks that select samples to train each other, and achieves good generalization performance against a large rate of label corruption. Compared with Co-teaching, our CL is a simple plug-in for a single network. Thus both the space and time complexity of CL are half of Co-teaching's.

Construction of tighter bounds of 0-1 loss: Many methods have been proposed for constructing tighter bounds of the 0-1 loss. To name a few, Masnadi-Shirazi et al. masnadi2009design propose the Savage loss, which is a non-convex upper bound of the 0-1 loss function. Bartlett et al. bartlett2006convexity analyze the properties of the truncated loss for conventional convex losses. Wu et al. wu2007robust study the truncated hinge loss for SVM. Although the results are fruitful, these works mainly focus on the loss function at individual data points and do not have a sample selection property. In contrast, our curriculum loss can automatically select samples for training. Moreover, it can be made tighter than these individual losses by employing them as the base loss function.

## 3 Curriculum Loss

In this section, we present our Curriculum Loss (CL), which automatically selects samples for training. We begin by discussing the robustness of the 0-1 loss. We then show that our CL is a tighter upper bound of the 0-1 loss than the conventional summation-based surrogate loss. A tighter bound of the 0-1 loss is less sensitive to noisy outliers and better preserves the robustness of the 0-1 loss against label corruption. Thus, it can deal with noisy samples under a small rate of label corruption. When the label corruption rate becomes large, even the 0-1 loss suffers. We therefore propose Noise Pruned Curriculum Loss (NPCL) to address this issue. Our NPCL automatically prunes the estimated noisy samples during training. It reduces to the basic CL when the estimated rate of label corruption is zero. Our CL and NPCL are simple and efficient, and they support mini-batch updates. They can be used as plug-ins for many deep models. A simple multi-class extension and a novel soft multi-hinge loss are included in the supplementary material, together with all the detailed proofs.

### 3.1 Robustness of 0-1 loss against label corruption

We rephrase Theorem 1 in hu2016does from a different perspective, which motivates us to employ the 0-1 loss for training against label corruption.

###### Theorem 1.

(Monotonic Relationship) (Hu et al. hu2016does ) Let and be the training and test density,respectively. Define and . Let and be 0-1 loss for binary classification and multi-class classification, respectively. Let be convex with . Define risk , empirical risk , adversarial risk and empirical adversarial risk as

where and . Then we have that

The same monotonic relationship holds between their empirical approximation: and .

Theorem 1 hu2016does shows a monotonic relationship between the (empirical) risk and the (empirical) adversarial risk when the 0-1 loss function is used. It means that minimizing the (empirical) risk is equivalent to minimizing the (empirical) adversarial risk for the 0-1 loss. Note that the problem caused by label corruption is that the test distribution differs from the training distribution. Without further assumptions about the corruption distribution, the 0-1 loss is the most robust loss function, because minimizing the 0-1 loss is equivalent to minimizing the worst-case risk, i.e., the (empirical) adversarial risk for a test distribution that changes within a limited $f$-divergence from the given (empirical) training distribution. For label corruption with a small noise rate, the $f$-divergence between the test distribution and the corrupted training distribution is small. In this situation, training with the 0-1 loss is the most robust choice against an adversarially changing test distribution without other assumptions. This motivates us to employ the 0-1 loss for training against label corruption.

### 3.2 Tighter upper bounds of 0-1 Loss

The 0-1 loss is difficult to optimize, so we propose a tighter upper bound surrogate loss. We use the classification margin to define the 0-1 loss. For binary classification, the classification margin is $u = y\hat{y}$, where $\hat{y}$ and $y \in \{+1,-1\}$ denote the prediction and ground truth, respectively. (A simple multi-class extension is discussed in the supplement.) Let $u_i$ be the classification margin of the $i$-th sample for $i \in \{1,\dots,n\}$. Denote $\mathbf{u} = [u_1,\dots,u_n]$. The 0-1 loss objective can be defined as follows:

$$J(\mathbf{u}) = \sum_{i=1}^{n} \mathbb{1}(u_i < 0). \tag{7}$$

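Given a base upper bound function $l(u) \ge \mathbb{1}(u<0)$, the conventional surrogate of the 0-1 loss can be defined as

$$\hat{J}(\mathbf{u}) = \sum_{i=1}^{n} l(u_i). \tag{8}$$

Our curriculum loss $Q(\mathbf{u})$ can be defined as Eq.(9). $Q(\mathbf{u})$ is a tighter upper bound of the 0-1 loss $J(\mathbf{u})$ than the conventional surrogate loss $\hat{J}(\mathbf{u})$, which is summarized in Theorem 2: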

###### Theorem 2.

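(Tighter Bound) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. Let $u_i$ be the classification margin of the $i$-th sample for $i \in \{1,\dots,n\}$. Denote $\max(\cdot,\cdot)$ as the maximum between two inputs. Let $\mathbf{u} = [u_1,\dots,u_n]$. Define $Q(\mathbf{u})$ as follows:

$$Q(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right). \tag{9}$$

Then $J(\mathbf{u}) \le Q(\mathbf{u}) \le \hat{J}(\mathbf{u})$ holds true.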
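As a sanity check on Theorem 2, the following sketch evaluates Eq. (9) by exhaustive search over all selections on a tiny example. It assumes the hinge loss as the base loss, and the margins are hypothetical values chosen for illustration. Exhaustive search is only feasible for very small $n$; it is used here to illustrate the definition, not as a practical algorithm.

```python
from itertools import product


def hinge(u):
    """Hinge loss, an upper bound of the 0-1 loss on the margin u."""
    return max(0.0, 1.0 - u)


def zero_one_objective(margins):
    """J(u) in Eq. (7): number of samples with negative margin."""
    return sum(1 for u in margins if u < 0)


def surrogate_objective(margins, loss=hinge):
    """Conventional summation-based surrogate, Eq. (8)."""
    return sum(loss(u) for u in margins)


def curriculum_loss_bruteforce(margins, loss=hinge):
    """Q(u) in Eq. (9), by enumerating every v in {0,1}^n."""
    n = len(margins)
    errors = zero_one_objective(margins)
    best = float("inf")
    for v in product((0, 1), repeat=n):
        selected = sum(vi * loss(u) for vi, u in zip(v, margins))
        best = min(best, max(selected, n - sum(v) + errors))
    return best


# Hypothetical margins: two correct samples, two misclassified ones.
margins = [2.0, 0.5, -0.25, -3.0]
J = zero_one_objective(margins)
Q = curriculum_loss_bruteforce(margins)
J_hat = surrogate_objective(margins)
print(J, Q, J_hat)
```

For these margins the minimizing selection drops the sample with margin $-3.0$, which is the one with the largest hinge loss.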

Remark: For any fixed $\mathbf{u}$, we can obtain an optimum solution $\mathbf{v}^*$ of the partial optimization. The index indicator $\mathbf{v}^*$ naturally selects samples as a curriculum for training models. The partial optimization w.r.t. the index indicator $\mathbf{v}$ can be solved by a very simple and efficient algorithm (Algorithm 1) in $O(n \log n)$. Thus, the loss $Q(\mathbf{u})$ is very efficient to compute. Moreover, since $Q(\mathbf{u})$ is tighter than the conventional surrogate loss $\hat{J}(\mathbf{u})$, it is less sensitive to outliers than $\hat{J}(\mathbf{u})$. Furthermore, it better preserves the robust property of the 0-1 loss against label corruption.

Updating with all the samples at once is not efficient for deep models, whereas training with mini-batches is more efficient and well supported by many deep learning tools. We thus propose a batch-based curriculum loss $\hat{Q}(\mathbf{u})$ given as Eq.(10). We show that $\hat{Q}(\mathbf{u})$ is also a tighter upper bound of the 0-1 loss objective $J(\mathbf{u})$ than the conventional loss $\hat{J}(\mathbf{u})$. This property is summarized in Corollary 1.

###### Corollary 1.

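(Mini-batch Update) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. Let $b$, $m$ be the number of batches and the batch size, respectively. Let $u_{ij}$ be the classification margin of the $i$-th sample in batch $j$ for $i \in \{1,\dots,m\}$ and $j \in \{1,\dots,b\}$. Denote $\mathbf{u} = [u_{11},\dots,u_{mb}]$. Let $n = mb$. Define $\hat{Q}(\mathbf{u})$ as follows:

$$\hat{Q}(\mathbf{u}) = \sum_{j=1}^{b} \min_{\mathbf{v}\in\{0,1\}^m} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; m - \sum_{i=1}^{m} v_{ij} + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right). \tag{10}$$

Then $J(\mathbf{u}) \le Q(\mathbf{u}) \le \hat{Q}(\mathbf{u}) \le \hat{J}(\mathbf{u})$ holds true.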

Remark: Corollary 1 shows that a batch-based curriculum loss is also a tighter upper bound of the 0-1 loss than the conventional surrogate loss $\hat{J}(\mathbf{u})$. This enables us to train deep models with mini-batch updates. Note that random shuffling in different epochs results in a different batch-based curriculum loss. Nevertheless, we at least know that all the induced losses are upper bounds of the 0-1 loss objective and are tighter than $\hat{J}(\mathbf{u})$. Moreover, all these losses are induced by the same base loss function $l(\cdot)$. Note that our goal is to minimize the 0-1 loss; random shuffling leads to a multiple-surrogate training scheme. Training deep models without shuffling does not have this issue.
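The chain of bounds in Corollary 1 can be checked numerically on a small example. The sketch below is a minimal illustration under stated assumptions: the hinge loss as base loss, hypothetical margins, consecutive (unshuffled) batches, and exhaustive search in place of an efficient solver.

```python
from itertools import product


def hinge(u):
    """Hinge loss, an upper bound of the 0-1 loss on the margin u."""
    return max(0.0, 1.0 - u)


def cl_bruteforce(margins, loss=hinge):
    """Eq. (9) on one set of margins, by exhaustive search over v."""
    n = len(margins)
    errors = sum(1 for u in margins if u < 0)
    return min(
        max(sum(vi * loss(u) for vi, u in zip(v, margins)), n - sum(v) + errors)
        for v in product((0, 1), repeat=n)
    )


def batched_cl(margins, batch_size, loss=hinge):
    """Eq. (10): sum of per-batch curriculum losses over consecutive batches."""
    return sum(
        cl_bruteforce(margins[start:start + batch_size], loss)
        for start in range(0, len(margins), batch_size)
    )


# Hypothetical margins with n = 6, split into b = 3 batches of size m = 2.
margins = [2.0, -3.0, 0.5, -0.25, 1.5, -1.0]
J = sum(1 for u in margins if u < 0)
Q = cl_bruteforce(margins)
Q_hat = batched_cl(margins, batch_size=2)
J_hat = sum(hinge(u) for u in margins)
print(J, Q, Q_hat, J_hat)
```

A different split of the same samples into batches would generally give a different value of the batched loss, which is the point the remark makes about shuffling.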

We now present another curriculum loss $E(\mathbf{u})$, which is tighter than $Q(\mathbf{u})$. $E(\mathbf{u})$ is a (scaled) upper bound of the 0-1 loss $J(\mathbf{u})$. This property is summarized as Theorem 3.

###### Theorem 3.

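(Scaled Bound) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. Let $u_i$ be the classification margin of the $i$-th sample for $i \in \{1,\dots,n\}$. Denote $\mathbf{u} = [u_1,\dots,u_n]$. Define $E(\mathbf{u})$ as follows:

$$E(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i\right). \tag{11}$$

Then $J(\mathbf{u}) \le 2E(\mathbf{u}) \le 2\hat{J}(\mathbf{u})$ holds true.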

Remark: $E(\mathbf{u})$ has similar properties to $Q(\mathbf{u})$ discussed above. Moreover, it is tighter than $Q(\mathbf{u})$, i.e. $E(\mathbf{u}) \le Q(\mathbf{u})$. Thus, it is less sensitive to outliers than $Q(\mathbf{u})$. However, $Q(\mathbf{u})$ can construct a more adaptive curriculum by taking the 0-1 loss into consideration during the training process.
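To see why the bound in Theorem 3 is only a scaled one, note that for any selection $\mathbf{v}$ the number of misclassified samples is at most $\sum_i v_i\, l(u_i) + (n - \sum_i v_i)$, which is at most twice the maximum of the two terms in Eq. (11). The sketch below, assuming the hinge loss and two hypothetical margins, shows a case where the loss of Eq. (11) falls below the 0-1 objective while twice its value does not.

```python
from itertools import product


def hinge(u):
    """Hinge loss, an upper bound of the 0-1 loss on the margin u."""
    return max(0.0, 1.0 - u)


def scaled_cl_bruteforce(margins, loss=hinge):
    """E(u) in Eq. (11), by exhaustive search over v in {0,1}^n."""
    n = len(margins)
    return min(
        max(sum(vi * loss(u) for vi, u in zip(v, margins)), n - sum(v))
        for v in product((0, 1), repeat=n)
    )


# Hypothetical margins: both samples are slightly misclassified.
margins = [-0.1, -0.1]
J = sum(1 for u in margins if u < 0)
E = scaled_cl_bruteforce(margins)
J_hat = sum(hinge(u) for u in margins)
print(J, E, J_hat)
```

Here the best selection keeps exactly one of the two samples, so the unscaled value is smaller than the number of errors.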

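As with $Q(\mathbf{u})$, directly optimizing $E(\mathbf{u})$ over all samples is not efficient. We now present a batch loss objective $\hat{E}(\mathbf{u})$ given as Eq.(12). $\hat{E}(\mathbf{u})$ is also a tighter (scaled) upper bound of the 0-1 loss objective $J(\mathbf{u})$ than the conventional surrogate loss $\hat{J}(\mathbf{u})$.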

###### Corollary 2.

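(Mini-batch Update for Scaled Bound) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. Let $b$, $m$ be the number of batches and the batch size, respectively. Let $u_{ij}$ be the classification margin of the $i$-th sample in batch $j$ for $i \in \{1,\dots,m\}$ and $j \in \{1,\dots,b\}$. Denote $\mathbf{u} = [u_{11},\dots,u_{mb}]$. Let $n = mb$. Define $\hat{E}(\mathbf{u})$ as follows:

$$\hat{E}(\mathbf{u}) = \sum_{j=1}^{b} \min_{\mathbf{v}\in\{0,1\}^m} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; m - \sum_{i=1}^{m} v_{ij}\right). \tag{12}$$

Then $J(\mathbf{u}) \le 2E(\mathbf{u}) \le 2\hat{E}(\mathbf{u}) \le 2\hat{J}(\mathbf{u})$ holds true.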

All the curriculum losses defined above rely on solving a partial optimization problem (Eq.(13)) to find the selection index set $\mathbf{v}^*$. We now show that the optimization of $\mathbf{v}$ with given classification margins $u_i$ can be done in $O(n \log n)$.

###### Theorem 4.

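(Partial Optimization) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. For fixed $u_i$, $i \in \{1,\dots,n\}$, a minimum solution $\mathbf{v}^*$ of the minimization problem in Eq. (13) can be achieved by Algorithm 1:

$$\min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; C - \sum_{i=1}^{n} v_i\right), \tag{13}$$

where $C$ is the threshold parameter.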

Remark: The time complexity of Algorithm 1 is $O(n \log n)$. Moreover, it does not involve complex operations, and is very simple and efficient to compute.
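Algorithm 1 itself is not reproduced in this section. The sketch below is therefore an assumed reconstruction, not the authors' listing: it sorts the losses in ascending order and keeps adding samples while the running sum $L_i$ satisfies $L_i \le C + 1 - i$, which is the stopping rule suggested by Eqs. (14) and (15). The function names and the test values are hypothetical, and an exhaustive search is included only to cross-check the greedy result on a tiny input.

```python
from itertools import product


def select_samples(losses, C):
    """Greedy partial optimization for Eq. (13).

    Returns a 0/1 list v in the original sample order. Assumes every loss
    is non-negative, which holds for any upper bound of the 0-1 loss.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    v = [0] * len(losses)
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += losses[i]
        # Stop at the first prefix whose sum violates L_rank <= C + 1 - rank.
        if running > C + 1 - rank:
            break
        v[i] = 1
    return v


def objective(losses, v, C):
    """Value of the max(...) term in Eq. (13) for a given selection v."""
    return max(sum(vi * li for vi, li in zip(v, losses)), C - sum(v))


def bruteforce_min(losses, C):
    """Exhaustive minimum of Eq. (13); only feasible for tiny n."""
    return min(objective(losses, v, C) for v in product((0, 1), repeat=len(losses)))


# Hypothetical per-sample losses and threshold.
losses = [0.0, 4.0, 0.5, 1.25, 0.0, 2.0]
C = 6
v = select_samples(losses, C)
print(v, objective(losses, v, C), bruteforce_min(losses, C))
```

The sort dominates the cost, so the sketch runs in $O(n \log n)$ time; the loop afterwards is linear.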

Algorithm 1 can adaptively select samples for training. It has some useful properties that help us better understand the objective after partial minimization; we present them in Proposition 1.

###### Proposition 1.

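(Optimum of Partial Optimization) Suppose that the base loss function $l(u) \ge \mathbb{1}(u<0)$ is an upper bound of the 0-1 loss function. Let $u_i$ for $i \in \{1,\dots,n\}$ be fixed values. Without loss of generality, assume $l(u_1) \le l(u_2) \le \dots \le l(u_n)$. Let $\mathbf{v}^*$ be an optimum solution of the partial optimization problem in (13). Let $T^* = \sum_{i=1}^{n} v_i^*$ and $L_{T^*} = \sum_{i=1}^{T^*} l(u_i)$. Then we have

$$\begin{aligned}
L_{T^*} &\le C + 1 - T^* && (14) \\
L_{T^*+1} &> C - T^* && (15) \\
L_{T^*+1} &> \max(L_{T^*},\, C - T^*) && (16) \\
\min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; C - \sum_{i=1}^{n} v_i\right) &= \max(L_{T^*},\, C - T^*). && (17)
\end{aligned}$$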

Remark: When , Eq.(17) is tighter than the conventional loss . When , Eq. (17) is a scaled upper bound of 0-1 loss . From Eq.(17) , we know the optimum of the partial optimization problem (13) (i.e. our objective) is . When , we can directly optimize with the selected samples for training. When , note that from Eq.(16), we can optimize for training . Note that when , we have that , which is still tighter than the conventional loss . When , for the parameter , we have that . Thus we can optimize . In practice, when training with random mini-batch, we find that optimizing in both cases instead of does not have much influence.

### 3.3 Noise Pruned Curriculum Loss

The curriculum losses in Eq.(9) and Eq.(11) aim to minimize an upper bound of the 0-1 loss over all the training samples. When model capacity (complexity) is high, a (deep network) model will still attain small (zero) training loss and overfit to the noisy samples.

Suppose that the rate of noisy samples (by label corruption) is $\epsilon$. The ideal model correctly classifies the clean training samples and misclassifies the noisy training samples, because their labels are corrupted. Correctly classifying training samples with corrupted (wrong) labels means that the model has already overfitted to the noisy samples, which harms generalization to unseen data.

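For the above reasons, we propose the Noise Pruned Curriculum Loss (NPCL) as

$$L(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; C - \sum_{i=1}^{n} v_i\right), \tag{18}$$

where $C = (1-\epsilon)n$ or $C = (1-\epsilon)^2 n + (1-\epsilon)\sum_{i=1}^{n} \mathbb{1}(u_i<0)$.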

When we know there are noisy samples in the training set, we can leverage this as our prior. (The impact of misspecification of the prior is included in the supplement.) When (assume are integers for simplicity), from the selection procedure in Algorithm 1, we know 111When , samples will be pruned. Otherwise, samples will be pruned. samples with largest losses will be pruned. This is because when . Without loss of generality, assume . After pruning, we have , the pruned loss becomes

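$$\tilde{L}(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^{(1-\epsilon)n}} \max\left(\sum_{i=1}^{(1-\epsilon)n} v_i\, l(u_i),\; (1-\epsilon)n - \sum_{i=1}^{(1-\epsilon)n} v_i\right). \tag{19}$$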

It is the basic CL for the remaining $(1-\epsilon)n$ samples, and it is a (scaled) upper bound of the 0-1 loss over those samples. If we prune more noisy samples than clean samples, the noise ratio is reduced, and the basic CL can then handle the remaining noise. Fortunately, this assumption is supported by the "memorization" effect in deep networks arpit2017closer , i.e. deep networks tend to learn clean and easy patterns first. Thus, the loss of noisy or hard data tends to remain high for a period (before being overfitted), and the pruned samples with the largest losses are more likely to be the noisy samples. After this rough pruning, the problem becomes optimizing the basic CL for the remaining samples as in Eq.(19). Since our CL is a tight upper bound approximation to the 0-1 loss, it preserves the robust property to some extent and can handle a small noise rate. Specifically, our CL (Eq.(19)) further selects samples from the remaining samples adaptively, according to the state of the training process. This will generally reduce the noise ratio further. Thus, we may expect our NPCL to be robust to noisy samples. Note that all of the above is done by the simple and efficient Algorithm 1, without explicitly pruning samples in a separate step. That is to say, our loss does all of this automatically under the unified objective form in Eq.(18).

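When $C = (1-\epsilon)n$, the NPCL in Eq.(18) reduces to the basic CL $E(\mathbf{u})$ in Eq.(11) with $\epsilon = 0$. When $C = (1-\epsilon)^2 n + (1-\epsilon)\sum_{i=1}^{n} \mathbb{1}(u_i<0)$, for a target ideal model (that misclassifies noisy samples only), we know that $C = (1-\epsilon)n$. This choice therefore has properties similar to choosing $C = (1-\epsilon)n$. Moreover, it is more adaptive, because it considers the 0-1 loss at different stages of training. In this case, the NPCL in Eq.(18) reduces to the CL $Q(\mathbf{u})$ in Eq.(9) when $\epsilon = 0$. Note that $\epsilon$ is a prior; users can define it based on their domain knowledge.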
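The claim about the ideal model can be verified in one line. The derivation below assumes that the adaptive threshold is the full-sample analogue of Eq. (21), with the batch size $m$ replaced by $n$, and that an ideal model misclassifies exactly the $\epsilon n$ noisy samples, so that $\sum_{i=1}^{n} \mathbb{1}(u_i<0) = \epsilon n$:

```latex
\begin{aligned}
C &= (1-\epsilon)^2 n + (1-\epsilon)\sum_{i=1}^{n}\mathbb{1}(u_i<0) \\
  &= (1-\epsilon)^2 n + (1-\epsilon)\,\epsilon n \\
  &= (1-\epsilon)\,n\,\bigl[(1-\epsilon)+\epsilon\bigr] \\
  &= (1-\epsilon)\,n .
\end{aligned}
```

Setting $\epsilon = 0$ in the first line gives $C = n + \sum_{i=1}^{n}\mathbb{1}(u_i<0)$, which is the second argument of the maximum in Eq. (9) before $\sum_i v_i$ is subtracted.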

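To leverage the benefit of deep learning, we present the batched NPCL as

$$\hat{L}(\mathbf{u}) = \sum_{j=1}^{b} \min_{\mathbf{v}\in\{0,1\}^m} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; \hat{C}_j - \sum_{i=1}^{m} v_{ij}\right), \tag{20}$$

where $\hat{C}_j = (1-\epsilon)m$ or $\hat{C}_j$ is given as in Eq.(21):

$$\hat{C}_j = (1-\epsilon)^2 m + (1-\epsilon)\sum_{i=1}^{m} \mathbb{1}(u_{ij}<0). \tag{21}$$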

Similar to Corollary 1, we know that $L(\mathbf{u}) \le \hat{L}(\mathbf{u})$. Thus, optimizing the batched NPCL minimizes an upper bound of NPCL. This enables us to train the model with mini-batch updates, which is very efficient with modern deep learning tools. The training procedure is summarized in Algorithm 2. It uses Algorithm 1 to select a subset of samples from every mini-batch, and then uses the selected samples to perform the gradient update.
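Algorithm 2 is not reproduced in this section, so the following is only a sketch of one mini-batch step under stated assumptions: the hinge loss as base loss, an assumed greedy selection rule consistent with Eqs. (14) and (15), and the sum of the selected losses as the quantity one would differentiate. The names `npcl_batch` and `select_samples`, the margins and the noise rate are all hypothetical. The gradient update itself is omitted because it depends on the training framework.

```python
def hinge(u):
    """Hinge loss, an upper bound of the 0-1 loss on the margin u."""
    return max(0.0, 1.0 - u)


def select_samples(losses, C):
    """Assumed greedy selection: keep the smallest losses while L_i <= C + 1 - i."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    v = [0] * len(losses)
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += losses[i]
        if running > C + 1 - rank:
            break
        v[i] = 1
    return v


def npcl_batch(margins, noise_rate, adaptive=True, loss=hinge):
    """Return (selection v, summed loss of selected samples) for one mini-batch."""
    m = len(margins)
    keep = 1.0 - noise_rate
    if adaptive:
        errors = sum(1 for u in margins if u < 0)
        C = keep * keep * m + keep * errors  # threshold of Eq. (21)
    else:
        C = keep * m  # fixed threshold (1 - eps) * m
    losses = [loss(u) for u in margins]
    v = select_samples(losses, C)
    return v, sum(vi * li for vi, li in zip(v, losses))


# Hypothetical mini-batch of 8 margins with an assumed noise rate of 0.25.
margins = [2.0, 1.5, 0.8, 0.6, -0.2, 1.2, -4.0, -6.0]
v_fixed, loss_fixed = npcl_batch(margins, noise_rate=0.25, adaptive=False)
v_adaptive, loss_adaptive = npcl_batch(margins, noise_rate=0.25, adaptive=True)
print(v_fixed, loss_fixed)
print(v_adaptive, loss_adaptive)
```

In this example both thresholds drop the three samples with negative margins, which are the ones with the largest losses; in an automatic differentiation framework the selection mask would be treated as a constant when backpropagating through the selected losses.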

## 4 Empirical Study

Evaluation of robustness against label corruption: We evaluate our NPCL by comparing it with Co-teaching han2018co , MentorNet jiang2017mentornet and standard network training on the MNIST, CIFAR10 and CIFAR100 datasets, as in han2018co ; patrini2017making ; goldberger2016training . Two types of random label corruption, i.e. Symmetry flipping van2015learning and Pair flipping han2018masking , are considered in this work. In Symmetry flipping, the corrupted label is uniformly assigned to one of the incorrect classes. In Pair flipping, the corrupted label is assigned to one specific class similar to the ground truth. The noise rate of label flipping is chosen from

. We employ same network architecture and network hyperparameters as in Co-teaching

han2018co for all the methods in comparison. Specifically, the batch size and the number of epochs is set to and , respectively. The Adam optimizer with same parameter as han2018co is employed. For NPCL, we employ hinge loss as the base upper bound function of 0-1 loss. At first a few epochs, we train model using full batch with soft hinge loss (in the supplement) as a burn-in period suggested in jiang2017mentornet . Specifically, we start NPCL at epoch on MNIST and epoch on CIFAR10 and CIFAR100, respectively. For Co-teaching han2018co and MentorNet in jiang2017mentornet , we employ the open sourced code provided by the authors of Co-teaching han2018co

. We implement NPCL in PyTorch. Code of NPCL will be released on GitHub. Each experiment is repeated over five independent runs, and the shaded error bar shows the standard deviation (STD).
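The two corruption types described above can be sketched as follows. This is an illustrative generator, not the code used in the experiments: the function name `corrupt_labels`, the seed, and the noise rates in the usage lines are arbitrary assumptions, and Pair flipping is modelled by mapping each class to the next class index, which is only one possible choice of "a specific similar class".

```python
import random


def corrupt_labels(labels, num_classes, noise_rate, mode="symmetric", seed=0):
    """Flip each label independently with probability noise_rate.

    mode="symmetric": the new label is drawn uniformly from the other classes.
    mode="pair": the new label is the next class index (cyclically), an
    assumed stand-in for "one specific class similar to the ground truth".
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            if mode == "symmetric":
                y = rng.choice([c for c in range(num_classes) if c != y])
            elif mode == "pair":
                y = (y + 1) % num_classes
            else:
                raise ValueError(f"unknown mode: {mode}")
        noisy.append(y)
    return noisy


# Hypothetical label vector with 10 classes; the noise rates are illustrative.
labels = [i % 10 for i in range(1000)]
noisy_sym = corrupt_labels(labels, 10, 0.2, mode="symmetric")
noisy_pair = corrupt_labels(labels, 10, 0.45, mode="pair")
flipped_sym = sum(a != b for a, b in zip(labels, noisy_sym))
flipped_pair = sum(a != b for a, b in zip(labels, noisy_pair))
print(flipped_sym / len(labels), flipped_pair / len(labels))
```

Because a flipped label is never allowed to equal the original one, the realized fraction of corrupted labels concentrates around the requested noise rate.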

For performance measurement, we employ both test accuracy and label precision, as in han2018co . Label precision is defined as the number of clean samples divided by the number of selected samples, and it measures the selection accuracy of sample selection based methods. A higher label precision in the mini-batch after sample selection leads to an update with fewer noisy samples, which means that the model is less influenced by noisy samples and thus performs more robustly under label corruption.
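The label precision defined above is straightforward to compute once the selection mask and the clean/noisy status of each sample are known. The sketch below uses hypothetical inputs; returning NaN for an empty selection is an assumption made here, since the definition is silent on that case.

```python
def label_precision(selected, is_clean):
    """Number of clean selected samples divided by number of selected samples."""
    chosen = [clean for keep, clean in zip(selected, is_clean) if keep]
    if not chosen:
        return float("nan")  # undefined when nothing is selected
    return sum(chosen) / len(chosen)


# Hypothetical mini-batch: 5 samples selected, 4 of them with clean labels.
selected = [1, 1, 1, 1, 0, 1, 0, 0]
is_clean = [True, True, True, False, False, True, False, True]
precision = label_precision(selected, is_clean)
print(precision)
```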

Test accuracy and label precision versus the number of epochs on MNIST are presented in Figure 1. NPCL consistently outperforms Co-teaching in terms of test accuracy in all three cases. Moreover, NPCL achieves higher label precision than Co-teaching, which means that NPCL selects more clean samples than Co-teaching. The average test accuracy over the last ten epochs for the different methods is reported in Table 1. Again, NPCL obtains higher test accuracy than the other methods in all three cases, which shows the robustness of our NPCL. More experimental results can be found in the supplemental material. Note that NPCL is a simple plug-in for a single network, while Co-teaching employs two networks trained concurrently. Thus, both the space complexity and the time complexity of Co-teaching are double those of our NPCL.

## 5 Conclusion and Further Work

In this work, we proposed a curriculum loss (CL) for robust learning. Theoretically, we analyzed the properties of CL and proved that it is a tighter upper bound of the 0-1 loss than conventional summation-based surrogate losses. We extended our CL to a more general form (NPCL) to handle a large rate of label corruption. Empirically, experimental results on benchmark datasets show the robustness of the proposed loss. As a by-product, we proposed a novel soft multi-hinge loss. Experimental results on CIFAR100 show that our soft hinge loss is easier to optimize than the hard hinge loss. As future work, we may improve our CL to handle imbalanced distributions by considering diversity for each class.

## References

• [1] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, pages 233–242, 2017.
• [2] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
• [3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
• [4] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
• [5] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pages 5836–5846, 2018.
• [6] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537, 2018.
• [7] Mengqiu Hu, Yang Yang, Fumin Shen, Luming Zhang, Heng Tao Shen, and Xuelong Li. Robust web image annotation via exploring multi-facet and structural knowledge. IEEE Transactions on Image Processing, 26(10):4871–4884, 2017.
• [8] Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? 2018.
• [9] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
• [10] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
• [11] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. 2018.
• [12] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. ICLR, 2018.
• [13] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
• [14] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In Advances in neural information processing systems, pages 1049–1056, 2009.
• [15] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Virtual adversarial training for semi-supervised text classification. In ICLR, 2016.
• [16] Robert Moore and John DeNero. L1 and l2 regularization for multiclass hinge loss models. In Symposium on Machine Learning in Speech and Language Processing, 2011.
• [17] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
• [18] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575, 2016.
• [19] Hao Su, Jia Deng, and Li Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
• [20] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
• [21] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.
• [22] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10–18, 2015.
• [23] Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
• [24] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.

## Appendix A Proof of Theorem 2

###### Proof.

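Because $l(u) \ge \mathbb{1}(u<0)$, we have $\sum_{i=1}^{n} l(u_i) \ge \sum_{i=1}^{n} \mathbb{1}(u_i<0)$. Then

$$\begin{aligned}
Q(\mathbf{u}) &= \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right) && (22) \\
&\le \max\left(\sum_{i=1}^{n} l(u_i),\; n - \sum_{i=1}^{n} 1 + \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right) && (23) \\
&= \max\left(\sum_{i=1}^{n} l(u_i),\; \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right) && (24) \\
&= \sum_{i=1}^{n} l(u_i) && (25)
\end{aligned}$$

Since the conventional surrogate loss is $\hat{J}(\mathbf{u}) = \sum_{i=1}^{n} l(u_i)$, we obtain $Q(\mathbf{u}) \le \hat{J}(\mathbf{u})$.

On the other hand, we have that

$$\begin{aligned}
Q(\mathbf{u}) &= \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right) \\
&\ge \min_{\mathbf{v}\in\{0,1\}^n} \left(n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbb{1}(u_i<0)\right) && (26) \\
&= \sum_{i=1}^{n} \mathbb{1}(u_i<0) && (27)
\end{aligned}$$

Since $J(\mathbf{u}) = \sum_{i=1}^{n} \mathbb{1}(u_i<0)$, we obtain $Q(\mathbf{u}) \ge J(\mathbf{u})$.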

## Appendix B Proof of Corollary 1

###### Proof.

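Since $l(u) \ge \mathbb{1}(u<0)$, similar to the proof of $Q(\mathbf{u}) \le \hat{J}(\mathbf{u})$, we have

$$\begin{aligned}
\hat{Q}(\mathbf{u}) &= \sum_{j=1}^{b} \min_{\mathbf{v}\in\{0,1\}^m} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; m - \sum_{i=1}^{m} v_{ij} + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) \\
&\le \sum_{j=1}^{b} \max\left(\sum_{i=1}^{m} l(u_{ij}),\; m - \sum_{i=1}^{m} 1 + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) && (28) \\
&= \sum_{j=1}^{b} \max\left(\sum_{i=1}^{m} l(u_{ij}),\; \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) && (29) \\
&= \sum_{j=1}^{b} \sum_{i=1}^{m} l(u_{ij}) = \hat{J}(\mathbf{u}) && (30)
\end{aligned}$$

On the other hand, owing to the group (batch) separable sum structure, we have that

$$\begin{aligned}
\hat{Q}(\mathbf{u}) &= \sum_{j=1}^{b} \min_{\mathbf{v}\in\{0,1\}^m} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; m - \sum_{i=1}^{m} v_{ij} + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) \\
&= \min_{\mathbf{v}\in\{0,1\}^n} \sum_{j=1}^{b} \max\left(\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; m - \sum_{i=1}^{m} v_{ij} + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) && (31) \\
&\ge \min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{j=1}^{b}\sum_{i=1}^{m} v_{ij}\, l(u_{ij}),\; n - \sum_{j=1}^{b}\sum_{i=1}^{m} v_{ij} + \sum_{j=1}^{b}\sum_{i=1}^{m} \mathbb{1}(u_{ij}<0)\right) && (32) \\
&= Q(\mathbf{u}) \ge J(\mathbf{u}) && (33)
\end{aligned}$$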
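The step from (31) to (32) uses the elementary fact that a sum of maxima dominates the maximum of the sums. Spelled out, with $a_j$ and $b_j$ introduced here as shorthand for the two arguments of the maximum in batch $j$ (this notation is not in the original):

```latex
\begin{aligned}
a_j &= \sum_{i=1}^{m} v_{ij}\, l(u_{ij}), \qquad
b_j = m - \sum_{i=1}^{m} v_{ij} + \sum_{i=1}^{m} \mathbb{1}(u_{ij}<0), \\
\sum_{j=1}^{b} \max(a_j, b_j) &\ge \sum_{j=1}^{b} a_j
\quad\text{and}\quad
\sum_{j=1}^{b} \max(a_j, b_j) \ge \sum_{j=1}^{b} b_j, \\
\text{hence}\quad \sum_{j=1}^{b} \max(a_j, b_j) &\ge \max\Bigl(\sum_{j=1}^{b} a_j,\; \sum_{j=1}^{b} b_j\Bigr),
\qquad \sum_{j=1}^{b} b_j = n - \sum_{j=1}^{b}\sum_{i=1}^{m} v_{ij} + \sum_{j=1}^{b}\sum_{i=1}^{m} \mathbb{1}(u_{ij}<0),
\end{aligned}
```

where the last identity uses $n = mb$.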

## Appendix C Proof of Partial Optimization Theorem (Theorem 4)

###### Proof.

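For simplicity, let $l_i = l(u_i)$, $i \in \{1,\dots,n\}$. Without loss of generality, assume $l_1 \le l_2 \le \dots \le l_n$. Let $\mathbf{v}^*$ be the solution obtained by Algorithm 1. Assume there exists a $\mathbf{v}$ such that

$$\max\left(\sum_{i=1}^{n} v_i l_i,\; C - \sum_{i=1}^{n} v_i\right) < \max\left(\sum_{i=1}^{n} v_i^* l_i,\; C - \sum_{i=1}^{n} v_i^*\right). \tag{34}$$

Let $T = \sum_{i=1}^{n} v_i$ and $T^* = \sum_{i=1}^{n} v_i^*$.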

Case 1: If , then there exists an and . From Algorithm 1, we know () and . Then we know . Thus, we can achieve that

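$$\begin{aligned}
\max\left(\sum_{i=1}^{n} v_i^* l_i,\; C - \sum_{i=1}^{n} v_i^*\right) &= \max\left(\sum_{i=1}^{n} v_i^* l_i,\; C - \sum_{i=1}^{n} v_i\right) && (35) \\
&\le \max\left(\sum_{i=1}^{n} v_i l_i,\; C - \sum_{i=1}^{n} v_i\right). && (36)
\end{aligned}$$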

This contradicts the assumption in Eq.(34)

Case 2: If , then there exists an and . Let . Since , we have . From Algorithm 1, we know that . Thus we obtain that

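$$\begin{aligned}
\max\left(\sum_{i=1}^{n} v_i l_i,\; C - \sum_{i=1}^{n} v_i\right) &\ge L_{T^*} + l_k && (37) \\
&\ge \max(L_{T^*},\, C - T^*) && (38) \\
&= \max\left(\sum_{i=1}^{n} v_i^* l_i,\; C - \sum_{i=1}^{n} v_i^*\right) && (39)
\end{aligned}$$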

This contradicts the assumption in Eq.(34)

Case 3: If , we obtain . Then we can achieve that

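$$\begin{aligned}
\max\left(\sum_{i=1}^{n} v_i^* l_i,\; C - \sum_{i=1}^{n} v_i^*\right) &= \max(L_{T^*},\, C - T^*) && (40) \\
&\le C + 1 - T^* && (41) \\
&\le C - T && (42) \\
&= C - \sum_{i=1}^{n} v_i && (43) \\
&\le \max\left(\sum_{i=1}^{n} v_i l_i,\; C - \sum_{i=1}^{n} v_i\right). && (44)
\end{aligned}$$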

This contradicts the assumption in Eq.(34).

Finally, we conclude that obtained by Algorithm 1 is the minimum of the optimization problem given in (13). ∎

## Appendix D Proof of Proposition 1

###### Proof.

Note that , from the condition of in Algorithm 1, we know that . From the condition of in Algorithm 1, we know that . Because for , we have . Thus, we obtain . By substitute the optimum into the optimization function, we obtain that

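$$\begin{aligned}
&\min_{\mathbf{v}\in\{0,1\}^n} \max\left(\sum_{i=1}^{n} v_i\, l(u_i),\; C - \sum_{i=1}^{n} v_i\right) && (45) \\
&= \max\left(\sum_{i=1}^{n} v_i^*\, l(u_i),\; C - \sum_{i=1}^{n} v_i^*\right) && (46) \\
&= \max(L_{T^*},\, C - T^*) && (47)
\end{aligned}$$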

## Appendix E Proof of Theorem 3

###### Proof.

We first prove that objective (11) is tighter than the loss objective in Eq.(8). After this, we prove that objective (11) is an upper bound of the 0/1 loss defined in equation (7).

For simplicity, let , we obtain that

 E(u) =minv∈{0,1}nmax(n∑i=1vil(ui),n−n∑i=1vi) (48) ≤max(n∑i=1l(ui),(n−n∑i=11)) (49) =n∑