1 Introduction
Noise corruption is a common phenomenon in daily life. For instance, noisy (corrupted or wrong) labels may result from annotating similar objects su2012crowdsourcing , crawling images and labels from websites Robustwebimage ; tanaka2018joint , and creating training sets programmatically ratner2016data ; khetan2017learning . Learning with noisy labels is thus a promising area; however, it is challenging to train deep networks robustly with noisy labels.
Deep networks have great expressive power (model complexity) to learn challenging tasks. However, they also run a greater risk of overfitting to the data. Although many regularization techniques, such as adding regularization terms, data augmentation, weight decay, dropout and batch normalization, have been proposed, generalization remains vitally important for deep learning to fully exploit this expressive power. It becomes more challenging when robustness is required for learning with noisy labels. Zhang et al. zhang2016understanding show that deep networks can even fully memorize samples with incorrectly corrupted labels. This significantly degrades the generalization performance of deep models.
Robustness of 0-1 loss: The problem caused by label corruption is that the test distribution differs from the training distribution. Hu et al. hu2016does analyzed the adversarial risk in which the test distribution density is adversarially changed within a limited $f$-divergence (e.g. KL-divergence) from the training distribution density. They show that there is a monotonic relationship between the (empirical) risk and the (empirical) adversarial risk when the 0-1 loss function is used. This suggests that minimizing the empirical risk with the 0-1 loss function is equivalent to minimizing the empirical adversarial risk. Note that we evaluate models on test data in terms of the 0-1 loss (an empirical approximation of the Bayes risk w.r.t. the test distribution). Therefore, without making other assumptions, the 0-1 loss is the most robust loss against label corruption. Moreover, the 0-1 loss is more robust to outliers than an unbounded (convex) loss (e.g. hinge loss) masnadi2009design . This is because an unbounded convex loss puts much weight on the outliers (with large loss) when minimizing the loss masnadi2009design . If an unbounded (convex) loss is employed in deep network models, this becomes more prominent. Since the training loss of deep networks can often be minimized to zero, an outlier with a large loss has a large impact on the model. The 0-1 loss treats each training sample equally. Thus, no single sample has too much influence on the model, and the model is tolerant to a small number of outliers.
Although the 0-1 loss has many robust properties, it is difficult to optimize. One possible way to alleviate this problem is to seek a tighter upper bound of the 0-1 loss that is still efficient to optimize. Such a bound can reduce the influence of noisy outliers compared with conventional (convex) losses, while being easier to optimize than the 0-1 loss. When minimizing the upper bound surrogate, we expect that the 0-1 loss objective is also minimized.
Learnability under large noise rate: Even the 0-1 loss cannot deal with a large noise rate. When the noise rate becomes large, the systematic error (due to label corruption) grows and is no longer negligible. As a result, the model's generalization performance degrades. To reduce the systematic error produced by training with noisy labels, several methods have been proposed. They can be categorized into three kinds: transition matrix based methods sukhbaatar2014training ; patrini2017making ; goldberger2016training , regularization based methods miyato2016virtual and sample selection based methods jiang2017mentornet ; han2018co . Among them, sample selection is one promising direction that selects samples to reduce the noise ratio for training. These methods are based on the idea of curriculum learning bengio2009curriculum , a successful method that trains the model gradually with samples ordered in a meaningful sequence. Although they achieve success to some extent, most of these methods are heuristic.
To efficiently minimize the 0-1 loss while keeping its robust properties, we propose a novel loss that is a tighter upper bound of the 0-1 loss compared with conventional surrogate losses. Specifically, given any base loss function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, our loss $Q(\mathbf{u})$ satisfies $\sum_{i=1}^{n} \mathbf{1}(u_i<0) \le Q(\mathbf{u}) \le \sum_{i=1}^{n} l(u_i)$, where $\mathbf{u} = [u_1, \dots, u_n]$ with $u_i$ being the classification margin of the $i$-th sample. We name it Curriculum Loss (CL) because it automatically and adaptively selects samples for training, as in curriculum learning. The selection procedure can be done by a very simple and efficient algorithm in $O(n \log n)$. Moreover, since our loss is tighter than conventional surrogate losses, it is more robust than they are. To handle the case of learning with a large noise rate, we propose Noise Pruned Curriculum Loss (NPCL) by extending our basic curriculum loss to a more general form. It reduces to our basic CL when the estimated noise rate is zero. Our NPCL automatically prunes the estimated noisy samples during the training process. Remarkably, our NPCL is very simple and efficient, and can be used as a plug-in for many deep models for robust learning. Our contributions are listed as follows:

We propose a novel loss (i.e. curriculum loss) that is a tighter upper bound of the 0-1 loss compared with conventional summation based surrogate losses. Our curriculum loss can automatically and adaptively select samples for training, as in curriculum learning.

We further propose Noise Pruned Curriculum Loss (NPCL) to address a large rate of noise (label corruption) by extending our curriculum loss to a more general form. Our NPCL automatically prunes the estimated noisy samples during training. Moreover, our NPCL is very simple and efficient, and can be used as a plug-in in many deep models.
2 Related Literature
Curriculum Learning: Curriculum learning is a general learning methodology that has achieved success in many areas. The earliest work on curriculum learning bengio2009curriculum trains a model gradually with samples ordered in a meaningful sequence, which has improved performance on many problems. Since the curriculum in bengio2009curriculum is predetermined by prior knowledge and remains fixed afterwards, ignoring the feedback of learners, Kumar et al. kumar2010self further propose self-paced learning, which selects samples by alternating minimization of an augmented objective. Jiang et al. jiang2014self propose a self-paced learning method to select samples with diversity. After that, Jiang et al. jiang2015self propose a self-paced curriculum strategy that takes different priors into consideration. Although these methods achieve success, the relation between the augmented objective of self-paced learning and the original objective (e.g. cross entropy loss for classification) is not clear. In addition, as stated in jiang2017mentornet , the alternating update in self-paced learning is not efficient for training deep networks.
Learning with Noisy Labels: The most related works are the sample selection based methods for robust learning. This line of work is inspired by curriculum learning bengio2009curriculum . Among them, Jiang et al. jiang2017mentornet propose to learn the curriculum from data with a mentor net. They use the mentor net to select samples for training with noisy labels. Co-teaching han2018co employs two networks that select samples to train each other, and achieves good generalization performance against a large rate of label corruption. Compared with Co-teaching, our CL is a simple plug-in for a single network. Thus both the space and time complexity of CL are half of Co-teaching's.
Construction of tighter bounds of 0-1 loss: Along the line of constructing tighter bounds of the 0-1 loss, many methods have been proposed. To name a few, Masnadi-Shirazi et al. masnadi2009design propose the Savage loss, which is a non-convex upper bound of the 0-1 loss function. Bartlett et al. bartlett2006convexity analyze the properties of the truncated loss for conventional convex losses. Wu et al. wu2007robust study the truncated hinge loss for SVM. Although the results are fruitful, these works mainly focus on the loss function at an individual data point and do not have a sample selection property. In contrast, our curriculum loss can automatically select samples for training. Moreover, it can be constructed in a tighter way than these individual losses by employing them as the base loss function.
3 Curriculum Loss
In this section, we present our Curriculum Loss (CL), which automatically selects samples for training. We begin with a discussion of the robustness of the 0-1 loss. We then show that our CL is a tighter upper bound of the 0-1 loss compared with conventional summation based surrogate losses. A tighter bound of the 0-1 loss is less sensitive to noisy outliers and better preserves the robustness of the 0-1 loss against label corruption. Thus, it can deal with noisy samples under a small rate of label corruption. When the label corruption rate becomes large, even the 0-1 loss suffers. Thus, we propose Noise Pruned Curriculum Loss (NPCL) to address this issue. Our NPCL can automatically prune the estimated noisy samples during the training process. It reduces to our basic CL when the estimated rate of label corruption is zero. Our CL and NPCL are very simple and efficient, and support mini-batch updates. They can be used as a plug-in for many deep models. A simple multi-class extension and a novel soft multi-hinge loss are included in the supplementary material. All the detailed proofs are also included in the supplementary material.
3.1 Robustness of 0-1 loss against label corruption
We rephrase Theorem 1 in hu2016does from a different perspective, which motivates us to employ the 0-1 loss for training against label corruption.
Theorem 1.
(Monotonic Relationship) (Hu et al. hu2016does ) Let and be the training and test density, respectively. Define and . Let and be the 0-1 loss for binary classification and multi-class classification, respectively. Let be convex with . Define the risk , empirical risk , adversarial risk and empirical adversarial risk as
(1)  
(2)  
(3)  
(4) 
where and . Then we have that
(5)  
(6) 
The same monotonic relationship holds between their empirical approximation: and .
Theorem 1 hu2016does shows the monotonic relationship between the (empirical) risk and the (empirical) adversarial risk when the 0-1 loss function is used. It means that minimizing the (empirical) risk is equivalent to minimizing the (empirical) adversarial risk for the 0-1 loss. Note that the problem caused by label corruption is that the test distribution differs from the training distribution. Without further assumptions about the corruption distribution, the 0-1 loss is the most robust loss function, because minimizing the 0-1 loss is equivalent to minimizing the worst-case risk, i.e., the (empirical) adversarial risk for a test distribution that changes within a limited divergence from the given (empirical) training distribution. For label corruption with a small noise rate, the divergence between the test distribution and the corrupted training distribution is small. In this situation, training with the 0-1 loss is the most robust against an adversarially changing test distribution without other assumptions. This motivates us to employ the 0-1 loss for training against label corruption.
3.2 Tighter upper bounds of 0-1 Loss
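The 0-1 loss is difficult to optimize; we thus propose a tighter upper bound surrogate loss. We use the classification margin to define the 0-1 loss. For binary classification, the classification margin is $u = \hat{y} y$, where $\hat{y}$ and $y \in \{+1, -1\}$ denote the prediction and ground truth, respectively. (A simple multi-class extension is discussed in the supplement.) Let $u_i \in \mathbb{R}$ be the classification margin of the $i$-th sample for $i \in \{1, \dots, n\}$. Denote $\mathbf{u} = [u_1, \dots, u_n]$. The 0-1 loss objective can be defined as follows:
$$J(\mathbf{u}) = \sum_{i=1}^{n} \mathbf{1}(u_i < 0). \tag{7}$$
Given a base upper bound function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, the conventional surrogate of the 0-1 loss can be defined as
$$\hat{J}(\mathbf{u}) = \sum_{i=1}^{n} l(u_i). \tag{8}$$
Our curriculum loss $Q(\mathbf{u})$ is defined in Eq.(9). $Q(\mathbf{u})$ is a tighter upper bound of the 0-1 loss $J(\mathbf{u})$ compared with the conventional surrogate loss $\hat{J}(\mathbf{u})$, which is summarized in Theorem 2:
Theorem 2.
(Tighter Bound) Suppose that the base loss function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, is an upper bound of the 0-1 loss function. Let $u_i \in \mathbb{R}$ be the classification margin of the $i$-th sample for $i \in \{1, \dots, n\}$. Denote $\max(\cdot,\cdot)$ as the maximum between two inputs. Let $\mathbf{u} = [u_1, \dots, u_n]$. Define $Q(\mathbf{u})$ as follows:
$$Q(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\Big(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbf{1}(u_i<0)\Big). \tag{9}$$
Then $J(\mathbf{u}) \le Q(\mathbf{u}) \le \hat{J}(\mathbf{u})$ holds true.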
Remark: For any fixed $\mathbf{u}$, we can obtain an optimal solution $\mathbf{v}^*$ of the partial optimization. The index indicator $\mathbf{v}^*$ naturally selects samples as a curriculum for training models. The partial optimization w.r.t. the index indicator $\mathbf{v}$ can be solved by a very simple and efficient algorithm (Alg. 1) in $O(n \log n)$. Thus, the loss is very efficient to compute. Moreover, since the curriculum loss $Q(\mathbf{u})$ is tighter than the conventional surrogate loss $\hat{J}(\mathbf{u})$, it is less sensitive to outliers. Furthermore, it better preserves the robust property of the 0-1 loss against label corruption.
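As a small illustration of the quantities in this subsection, the sketch below evaluates the 0-1 objective, the hinge surrogate and a curriculum-style loss on four margins by brute force. It assumes the curriculum loss takes the min-max form $\min_{\mathbf{v}\in\{0,1\}^n}\max(\sum_i v_i l(u_i),\, n-\sum_i v_i+\sum_i \mathbf{1}(u_i<0))$. The function names and the example margins are hypothetical, and exhaustive enumeration is used only to keep the example transparent (it is exponential in $n$).

```python
from itertools import product


def hinge(u):
    """Hinge loss max(0, 1 - u), an upper bound of the 0-1 loss 1(u < 0)."""
    return max(0.0, 1.0 - u)


def zero_one_objective(margins):
    """Number of misclassified samples, i.e. negative margins."""
    return sum(1 for u in margins if u < 0)


def surrogate_objective(margins, base_loss=hinge):
    """Conventional surrogate: sum of the base loss over all samples."""
    return sum(base_loss(u) for u in margins)


def curriculum_loss_bruteforce(margins, base_loss=hinge):
    """Assumed form: min over v of max(sum_i v_i*l(u_i), n - sum_i v_i + J)."""
    n = len(margins)
    losses = [base_loss(u) for u in margins]
    J = zero_one_objective(margins)
    best_value, best_v = None, None
    for v in product((0, 1), repeat=n):  # all 2^n index indicators
        selected = sum(l for l, vi in zip(losses, v) if vi)
        value = max(selected, n - sum(v) + J)
        if best_value is None or value < best_value:
            best_value, best_v = value, v
    return best_value, best_v


margins = [2.0, 0.5, -0.3, -4.0]       # hypothetical margins u_i = y_i * y_hat_i
J = zero_one_objective(margins)         # two negative margins
J_hat = surrogate_objective(margins)    # 0 + 0.5 + 1.3 + 5.0
Q, v_star = curriculum_loss_bruteforce(margins)
print(J, Q, J_hat, v_star)
```

In this toy case the sample with margin -4.0 is left out of the selection, and the resulting value lies between the 0-1 objective and the hinge surrogate.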
Updating with all the samples at once is not efficient for deep models, whereas training with mini-batches is more efficient and well supported by many deep learning tools. We thus propose a batch based curriculum loss $\tilde{Q}(\mathbf{u})$, given in Eq.(10). We show that $\tilde{Q}(\mathbf{u})$ is also a tighter upper bound of the 0-1 loss objective $J(\mathbf{u})$ compared with the conventional loss $\hat{J}(\mathbf{u})$. This property is summarized in Corollary 1.
Corollary 1.
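(Mini-batch Update) Suppose that the base loss function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, is an upper bound of the 0-1 loss function. Let $m$, $b$ be the number of batches and the batch size, respectively. Let $u_{ij} \in \mathbb{R}$ be the classification margin of the $i$-th sample in batch $j$ for $i \in \{1, \dots, b\}$ and $j \in \{1, \dots, m\}$. Denote $\mathbf{u} = [u_{11}, \dots, u_{bm}]$. Let $n = mb$. Define $\tilde{Q}(\mathbf{u})$ as follows:
$$\tilde{Q}(\mathbf{u}) = \sum_{j=1}^{m} \min_{\mathbf{v}_j\in\{0,1\}^b} \max\Big(\sum_{i=1}^{b} v_{ij}\, l(u_{ij}),\; b - \sum_{i=1}^{b} v_{ij} + \sum_{i=1}^{b} \mathbf{1}(u_{ij}<0)\Big). \tag{10}$$
Then $J(\mathbf{u}) \le Q(\mathbf{u}) \le \tilde{Q}(\mathbf{u}) \le \hat{J}(\mathbf{u})$ holds true.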
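The effect of splitting the selection across batches can be checked numerically on a toy input. The sketch below is only an illustration under the assumption that each batch contributes $\min_{\mathbf{v}}\max(\sum_i v_i l(u_i),\, b-\sum_i v_i+\sum_i \mathbf{1}(u_i<0))$ and that the full loss uses the same form over all $n$ samples. The hinge base loss, the helper names and the margins are hypothetical choices.

```python
from itertools import product


def hinge(u):
    return max(0.0, 1.0 - u)


def curriculum_loss(margins):
    """Brute-force min over v of max(sum v_i*l(u_i), n - sum v_i + #negatives)."""
    n = len(margins)
    losses = [hinge(u) for u in margins]
    errors = sum(1 for u in margins if u < 0)
    return min(
        max(sum(l for l, vi in zip(losses, v) if vi), n - sum(v) + errors)
        for v in product((0, 1), repeat=n)
    )


def batch_curriculum_loss(batches):
    """Sum of per-batch curriculum losses (selection happens inside each batch)."""
    return sum(curriculum_loss(batch) for batch in batches)


batches = [[2.0, 0.5], [-0.3, -4.0]]    # m = 2 batches of size b = 2
all_margins = [u for batch in batches for u in batch]

J = sum(1 for u in all_margins if u < 0)
Q_full = curriculum_loss(all_margins)
Q_batch = batch_curriculum_loss(batches)
J_hat = sum(hinge(u) for u in all_margins)
print(J, Q_full, Q_batch, J_hat)
```

On this input the batch-based value is slightly larger than the full-sample value but still well below the plain hinge sum, which is the ordering the corollary describes.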
Remark: Corollary 1 shows that a batch-based curriculum loss is also a tighter upper bound of the 0-1 loss $J(\mathbf{u})$ compared with the conventional surrogate loss $\hat{J}(\mathbf{u})$. This enables us to train deep models with mini-batch updates. Note that random shuffling in different epochs results in a different batch-based curriculum loss. Nevertheless, we at least know that all the induced losses are upper bounds of the 0-1 loss objective and are tighter than $\hat{J}(\mathbf{u})$. Moreover, all these losses are induced by the same base loss function $l(\cdot)$. Note that our goal is to minimize the 0-1 loss. Random shuffling leads to a multiple-surrogate training scheme. In addition, training deep models without shuffling does not have this issue.
We now present another curriculum loss $\hat{Q}(\mathbf{u})$, which is tighter than $Q(\mathbf{u})$. $\hat{Q}(\mathbf{u})$ is a (scaled) upper bound of the 0-1 loss. This property is summarized in Theorem 3.
Theorem 3.
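(Scaled Bound) Suppose that the base loss function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, is an upper bound of the 0-1 loss function. Let $u_i \in \mathbb{R}$ be the classification margin of the $i$-th sample for $i \in \{1, \dots, n\}$. Denote $\mathbf{u} = [u_1, \dots, u_n]$. Define $\hat{Q}(\mathbf{u})$ as follows:
$$\hat{Q}(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\Big(\sum_{i=1}^{n} v_i\, l(u_i),\; n - \sum_{i=1}^{n} v_i\Big). \tag{11}$$
Then $J(\mathbf{u}) \le 2\hat{Q}(\mathbf{u}) \le 2\hat{J}(\mathbf{u})$ holds true.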
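Remark: The scaled curriculum loss $\hat{Q}(\mathbf{u})$ has similar properties to the curriculum loss $Q(\mathbf{u})$ discussed above. Moreover, it is tighter than $Q(\mathbf{u})$, i.e. $\hat{Q}(\mathbf{u}) \le Q(\mathbf{u})$. Thus, it is less sensitive to outliers compared with $Q(\mathbf{u})$. However, $Q(\mathbf{u})$ can construct a more adaptive curriculum by taking the 0-1 loss into consideration during the training process.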
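The relation between the two curriculum losses can be seen by treating both as the same min-max problem with different thresholds. The sketch below assumes the generic form $\min_{\mathbf{v}}\max(\sum_i v_i l_i,\, C-\sum_i v_i)$, with $C=n$ for the scaled variant and $C=n+\sum_i\mathbf{1}(u_i<0)$ for the variant that includes the 0-1 term. The helper `partial_min` and the example margins are hypothetical, and brute force is used only for clarity.

```python
from itertools import product


def hinge(u):
    return max(0.0, 1.0 - u)


def partial_min(losses, C):
    """Brute-force min over v in {0,1}^n of max(sum v_i*l_i, C - sum v_i)."""
    return min(
        max(sum(l for l, vi in zip(losses, v) if vi), C - sum(v))
        for v in product((0, 1), repeat=len(losses))
    )


margins = [2.0, 0.5, -0.3, -4.0]
losses = [hinge(u) for u in margins]
n = len(margins)
J = sum(1 for u in margins if u < 0)

Q_hat = partial_min(losses, n)       # threshold without the 0-1 term
Q = partial_min(losses, n + J)       # threshold that adds the 0-1 objective
J_hat = sum(losses)
print(J, Q_hat, Q, J_hat)
```

Here the scaled variant is the smaller of the two, while twice its value still upper-bounds the number of misclassified samples.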
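Directly optimizing $\hat{Q}(\mathbf{u})$ is not efficient, similar to $Q(\mathbf{u})$. We now present a batch loss objective $\tilde{\hat{Q}}(\mathbf{u})$, given in Eq.(12). $\tilde{\hat{Q}}(\mathbf{u})$ is also a tighter upper bound of the 0-1 loss objective compared with the conventional surrogate loss $\hat{J}(\mathbf{u})$.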
Corollary 2.
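(Mini-batch Update for Scaled Bound) Suppose that the base loss function $l(u) \ge \mathbf{1}(u<0)$, $u \in \mathbb{R}$, is an upper bound of the 0-1 loss function. Let $m$, $b$ be the number of batches and the batch size, respectively. Let $u_{ij} \in \mathbb{R}$ be the classification margin of the $i$-th sample in batch $j$ for $i \in \{1, \dots, b\}$ and $j \in \{1, \dots, m\}$. Denote $\mathbf{u} = [u_{11}, \dots, u_{bm}]$. Let $n = mb$. Define $\tilde{\hat{Q}}(\mathbf{u})$ as follows:
$$\tilde{\hat{Q}}(\mathbf{u}) = \sum_{j=1}^{m} \min_{\mathbf{v}_j\in\{0,1\}^b} \max\Big(\sum_{i=1}^{b} v_{ij}\, l(u_{ij}),\; b - \sum_{i=1}^{b} v_{ij}\Big). \tag{12}$$
Then $J(\mathbf{u}) \le 2\hat{Q}(\mathbf{u}) \le 2\tilde{\hat{Q}}(\mathbf{u}) \le 2\hat{J}(\mathbf{u})$ holds true.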
All the curriculum losses defined above rely on minimizing a partial optimization problem (Eq.(13)) to find the selection index set $\mathbf{v}^*$. We now show that the optimization of $\mathbf{v}$ with given classification margins $u_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$, can be done in $O(n \log n)$.
Theorem 4.
Remark: The time complexity of Algorithm 1 is $O(n \log n)$. Moreover, it does not involve complex operations, and is very simple and efficient to compute.
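Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of a sort-and-scan selection that is consistent with the stated complexity. It assumes that the partial optimization has the form $\min_{\mathbf{v}\in\{0,1\}^n}\max(\sum_i v_i l_i,\, C-\sum_i v_i)$ with non-negative losses $l_i$ and a threshold $C$. The names `select_samples` and `brute_force` are hypothetical, and the brute-force routine is included only to check the sketch on tiny inputs.

```python
from itertools import product


def select_samples(losses, C):
    """Sort-and-scan selection for min_v max(sum v_i*l_i, C - sum v_i).

    Assumes non-negative losses. Returns (v, objective value).
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i])  # O(n log n)
    v = [0] * n
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        running += losses[idx]
        if running <= C + 1 - rank:   # adding this sample does not hurt
            v[idx] = 1
        else:
            break                     # the condition can never hold again
    selected = sum(l for l, vi in zip(losses, v) if vi)
    return v, max(selected, C - sum(v))


def brute_force(losses, C):
    """Exhaustive reference used only to check the sketch on tiny inputs."""
    return min(
        max(sum(l for l, vi in zip(losses, v) if vi), C - sum(v))
        for v in product((0, 1), repeat=len(losses))
    )


losses = [5.0, 0.0, 1.3, 0.5]
v, value = select_samples(losses, C=4)
print(v, value, brute_force(losses, 4))
```

The scan keeps a sample as long as the running sum of the smallest losses stays below the shrinking threshold, so samples with large losses are the ones left out. Sorting dominates the cost, and the scan itself is linear.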
Algorithm 1 can adaptively select samples for training. It has some useful properties that help us better understand the objective after partial minimization; we present them in Proposition 1.
Proposition 1.
(Optimum of Partial Optimization) Suppose that the base loss function is an upper bound of the 0-1 loss function. Let for be fixed values. Without loss of generality, assume . Let be an optimal solution of the partial optimization problem in (13). Let and . Then we have
(14)  
(15)  
(16)  
(17) 
Remark: When , Eq.(17) is tighter than the conventional loss . When , Eq.(17) is a scaled upper bound of the 0-1 loss . From Eq.(17), we know the optimum of the partial optimization problem (13) (i.e. our objective) is . When , we can directly optimize with the selected samples for training. When , note that from Eq.(16), we can optimize for training . Note that when , we have that , which is still tighter than the conventional loss . When , for the parameter , we have that . Thus we can optimize . In practice, when training with random mini-batches, we find that optimizing in both cases instead of does not have much influence.
3.3 Noise Pruned Curriculum Loss
The curriculum losses in Eq.(9) and Eq.(11) aim to minimize the upper bound of the 0-1 loss over all the training samples. When the model capacity (complexity) is high, a (deep network) model will still attain small (zero) training loss and overfit to the noisy samples.
The ideal model correctly classifies the clean training samples and misclassifies the noisy samples with wrong labels. Suppose that the rate of noisy samples (by label corruption) is $\epsilon$. Because the labels of the noisy samples are corrupted, correctly classifying the training samples with corrupted (wrong) labels means that the model has already overfitted to the noisy samples. This will harm the generalization to unseen data.
Considering all the above reasons, we thus propose the Noise Pruned Curriculum Loss (NPCL) as
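$$\mathcal{L}(\mathbf{u}) = \min_{\mathbf{v}\in\{0,1\}^n} \max\Big(\sum_{i=1}^{n} v_i\, l(u_i),\; C - \sum_{i=1}^{n} v_i\Big), \tag{18}$$
where $\epsilon$ is the rate of noisy samples and $C = (1-\epsilon)n$ or $C = (1-\epsilon)n + \sum_{i=1}^{n} \mathbf{1}(u_i<0)$.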
When we know there are noisy samples in the training set, we can leverage this as our prior. (The impact of misspecification of the prior is included in the supplement.) When (assume are integers for simplicity), from the selection procedure in Algorithm 1, we know samples with largest losses will be pruned. (Footnote 1: When , samples will be pruned. Otherwise, samples will be pruned.) This is because when . Without loss of generality, assume . After pruning, we have , the pruned loss becomes
(19) 
It is the basic CL for samples and it is the upper bound of . If we prune more noisy samples than clean samples, the noise ratio is reduced, and the basic CL can then handle the remaining noise. Fortunately, this assumption is supported by the "memorization" effect in deep networks arpit2017closer , i.e. deep networks tend to learn clean and easy patterns first. Thus, the loss of noisy or hard data tends to remain high for a period (before being overfitted). Therefore, the pruned samples with the largest losses are more likely to be the noisy samples. After this rough pruning, the problem becomes optimizing the basic CL for the remaining samples as in Eq.(19). Note that our CL is a tight upper bound approximation to the 0-1 loss, so it preserves the robust property to some extent. Thus, it can handle the case of a small noise rate. Specifically, our CL (Eq.(19)) further selects samples from the remaining samples for training, adaptively according to the state of the training process. This will generally further reduce the noise ratio. Thus, we may expect our NPCL to be robust to noisy samples. Note that all of the above can be done by the simple and efficient Algorithm 1 without explicitly pruning samples in a separate step. That is to say, our loss can do all of this automatically under the unified objective form in Eq.(18).
When , the NPCL in Eq.(18) reduces to the basic CL in Eq.(11) with . When , for a target ideal model (that misclassifies noisy samples only), we know that . It has similar properties as choosing . Moreover, it is more adaptive by considering the 0-1 loss during training at different stages. In this case, the NPCL in Eq.(18) reduces to the CL in Eq.(9) when . Note that is a prior; users can define it based on their domain knowledge.
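The pruning behaviour described above can be illustrated on a handful of numbers. The sketch below assumes that NPCL uses the partial problem $\min_{\mathbf{v}}\max(\sum_i v_i l_i,\, C-\sum_i v_i)$ with threshold $C=(1-\epsilon)n$, and that setting the noise-rate prior to zero gives $C=n$. The selection routine is a sort-and-scan sketch (not the paper's listing), and the loss values and the prior `eps` are hypothetical.

```python
def select_samples(losses, C):
    """Sort-and-scan selection for min_v max(sum v_i*l_i, C - sum v_i)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    v = [0] * len(losses)
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        running += losses[idx]
        if running > C + 1 - rank:
            break
        v[idx] = 1
    return v


# Hypothetical per-sample losses; the last two samples play the role of
# mislabeled data whose loss stays comparatively high.
losses = [0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 1.2, 1.5]
n = len(losses)
eps = 0.2                                  # assumed noise-rate prior

C_pruned = round((1 - eps) * n)            # C = (1 - eps) * n = 8
v_npcl = select_samples(losses, C_pruned)  # selection with pruning
v_cl = select_samples(losses, n)           # eps = 0, i.e. threshold C = n
print(v_npcl, sum(v_npcl))
print(v_cl, sum(v_cl))
```

With the smaller threshold both high-loss samples are excluded, whereas the threshold $C=n$ still admits one of them. This is the sense in which the prior on the noise rate makes the selection more aggressive.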
To leverage the benefit of deep learning, we present the batched NPCL as
(20) 
where or as in Eq.(21):
(21) 
Similar to Corollary 1, we know that . Thus, optimizing the batched NPCL indeed minimizes an upper bound of NPCL. This enables us to train the model with mini-batch updates, which is very efficient with modern deep learning tools. The training procedure is summarized in Algorithm 2. It uses Algorithm 1 to select a subset of samples from every mini-batch. Then, it uses the selected samples to perform a gradient update.
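Algorithm 2 is not listed in this text. The sketch below only mirrors the two steps described above (select within the mini-batch, then update on the selected samples) for a deliberately tiny model. It assumes a one-dimensional linear scorer $f(x)=wx$, the hinge base loss, a batch threshold $C=(1-\epsilon)b$, and a gradient averaged over the selected samples. All names and numbers are hypothetical.

```python
def hinge(u):
    return max(0.0, 1.0 - u)


def select_samples(losses, C):
    """Sort-and-scan selection for min_v max(sum v_i*l_i, C - sum v_i)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    v = [0] * len(losses)
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        running += losses[idx]
        if running > C + 1 - rank:
            break
        v[idx] = 1
    return v


def npcl_step(w, xs, ys, eps, lr):
    """One mini-batch update of a 1-D linear scorer f(x) = w * x."""
    margins = [y * w * x for x, y in zip(xs, ys)]
    losses = [hinge(u) for u in margins]
    C = (1 - eps) * len(xs)                 # batch-level threshold
    v = select_samples(losses, C)
    kept = sum(v)
    if kept == 0:
        return w, v
    # Hinge subgradient w.r.t. w is -y*x where the loss is active, else 0.
    grad = sum(-y * x for x, y, u, vi in zip(xs, ys, margins, v)
               if vi and u < 1.0) / kept
    return w - lr * grad, v


def train(batches, eps, lr=0.1, epochs=5, w=0.5):
    """Repeat the select-then-update step over all batches."""
    for _ in range(epochs):
        for xs, ys in batches:
            w, _ = npcl_step(w, xs, ys, eps, lr)
    return w


# Toy batch: the second sample (x = 2, label -1) carries a flipped label.
xs, ys = [1.0, 2.0, -1.0, -2.0], [1, -1, -1, -1]
w_new, v = npcl_step(0.5, xs, ys, eps=0.25, lr=0.1)
print(v, w_new)
```

In this toy batch the flipped sample has the largest loss and is excluded, so the update moves the weight in the direction supported by the three consistent samples. Starting from a non-trivial weight plays the role of the burn-in period mentioned in the experiments.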
4 Empirical Study
Evaluation of robustness against label corruption: We evaluate our NPCL by comparing it with Co-teaching han2018co , MentorNet jiang2017mentornet and standard network training on the MNIST, CIFAR10 and CIFAR100 datasets, as in han2018co ; patrini2017making ; goldberger2016training . Two types of random label corruption, i.e. Symmetry flipping van2015learning and Pair flipping han2018masking , are considered in this work. In Symmetry flipping, the corrupted label is uniformly assigned to one of the incorrect classes. In Pair flipping, the corrupted label is assigned to one specific class similar to the ground truth. The noise rate of label flipping is chosen from . We employ the same network architecture and network hyperparameters as in Co-teaching han2018co for all the methods in comparison. Specifically, the batch size and the number of epochs are set to and , respectively. The Adam optimizer with the same parameters as in han2018co is employed. For NPCL, we employ the hinge loss as the base upper bound function of the 0-1 loss. In the first few epochs, we train the model using the full batch with the soft hinge loss (in the supplement) as a burn-in period, as suggested in jiang2017mentornet . Specifically, we start NPCL at epoch on MNIST and epoch on CIFAR10 and CIFAR100, respectively. For Co-teaching han2018co and MentorNet jiang2017mentornet , we employ the open-source code provided by the authors of Co-teaching han2018co . We implement NPCL in PyTorch. Code of NPCL will be released on GitHub. Experiments are repeated over five independent runs. The error bar for STD is shaded.
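The two corruption types can be sketched as follows. This is an illustrative generator rather than the code used in the experiments. In particular, mapping class $y$ to $(y+1) \bmod K$ for Pair flipping is an assumption made here, since the text only says that the partner is one specific similar class. The function names, the seed and the toy label list are hypothetical.

```python
import random


def flip_symmetric(labels, num_classes, rate, rng):
    """With probability `rate`, replace a label by a uniformly chosen wrong class."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            r = rng.randrange(num_classes - 1)   # one of the K-1 other classes
            noisy.append(r if r < y else r + 1)  # skip the true class y
        else:
            noisy.append(y)
    return noisy


def flip_pair(labels, num_classes, rate, rng):
    """With probability `rate`, replace label y by one fixed partner class.

    Using (y + 1) % K as the partner is an illustrative assumption.
    """
    return [(y + 1) % num_classes if rng.random() < rate else y for y in labels]


rng = random.Random(0)
clean = [i % 10 for i in range(1000)]
sym = flip_symmetric(clean, num_classes=10, rate=0.2, rng=rng)
pair = flip_pair(clean, num_classes=10, rate=0.35, rng=rng)
print(sum(a != b for a, b in zip(clean, sym)) / len(clean))
print(sum(a != b for a, b in zip(clean, pair)) / len(clean))
```

The printed fractions are the empirical corruption rates of the two noisy label lists, which should be close to the requested rates for a list of this size.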
For performance measurements, we employ both test accuracy and label precision as in han2018co . Label precision is defined as: number of clean samples / number of selected samples, which measures the selection accuracy for sample selection based methods. A higher label precision in the mini-batch after sample selection leads to an update with fewer noisy samples, which means that the model is less influenced by noisy samples and thus performs more robustly under label corruption.
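Label precision as defined above is a simple ratio. A minimal sketch, assuming a 0/1 selection vector and a per-sample flag telling whether the training label is clean, is given below. The function name and the convention of returning 0.0 for an empty selection are hypothetical choices.

```python
def label_precision(selected, is_clean):
    """Fraction of selected samples whose training label is actually clean."""
    chosen = [clean for clean, s in zip(is_clean, selected) if s]
    if not chosen:
        return 0.0                      # convention for an empty selection
    return sum(chosen) / len(chosen)


is_clean = [True, False, True, True]    # the second label is corrupted
print(label_precision([1, 0, 1, 1], is_clean))
print(label_precision([1, 1, 0, 0], is_clean))
```

The first selection avoids the corrupted sample entirely, while the second one includes it among two selected samples.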
Test accuracy and label precision vs. number of epochs on MNIST are presented in Figure 1. It shows that NPCL consistently outperforms Co-teaching in terms of test accuracy in all three cases. Moreover, NPCL achieves higher label precision than Co-teaching, which means that NPCL can select more clean samples than Co-teaching. The average test accuracy over the last ten epochs for the different methods is reported in Table 1. Again, we can observe that NPCL obtains higher test accuracy than the other methods in all three cases, which shows the robustness of our NPCL. More experimental results can be found in the supplementary material. Note that NPCL is a simple plug-in for a single network, while Co-teaching employs two networks to train the model concurrently. Thus, both the space complexity and the time complexity of Co-teaching are doubled compared with our NPCL.
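Table 1: Average test accuracy over the last ten epochs.
Flipping-Rate | Standard | MentorNet | Co-teaching | NPCL
Symmetry-20% | 93.78% ± 0.04% | 96.71% ± 0.04% | 97.18% ± 0.03% | 99.41% ± 0.01%
Symmetry-50% | 65.81% ± 0.14% | 90.27% ± 0.16% | 91.36% ± 0.09% | 98.53% ± 0.02%
Pair-35% | 70.50% ± 0.16% | 89.09% ± 0.20% | 91.55% ± 0.17% | 97.90% ± 0.04%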
5 Conclusion and Further Work
In this work, we proposed a curriculum loss (CL) for robust learning. Theoretically, we analyzed the properties of CL and proved that it is a tighter upper bound of the 0-1 loss compared with conventional summation based surrogate losses. We extended our CL to a more general form (NPCL) to handle a large rate of label corruption. Empirically, experimental results on benchmark datasets show the robustness of the proposed loss. As a by-product, we proposed a novel soft multi-hinge loss. Experimental results on CIFAR100 show that our soft hinge loss is easier to optimize than the hard hinge loss. As future work, we may improve our CL to handle imbalanced distributions by considering diversity for each class.
References
[1] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, pages 233–242, 2017.
[2] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
[4] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
[5] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pages 5836–5846, 2018.
[6] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537, 2018.
[7] Mengqiu Hu, Yang Yang, Fumin Shen, Luming Zhang, Heng Tao Shen, and Xuelong Li. Robust web image annotation via exploring multi-facet and structural knowledge. IEEE Transactions on Image Processing, 26(10):4871–4884, 2017.
[8] Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? 2018.
[9] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[10] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[11] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. 2018.
[12] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. ICLR, 2018.
[13] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
[14] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In Advances in Neural Information Processing Systems, pages 1049–1056, 2009.
[15] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Virtual adversarial training for semi-supervised text classification. In ICLR, 2016.
[16] Robert Moore and John DeNero. L1 and l2 regularization for multiclass hinge loss models. In Symposium on Machine Learning in Speech and Language Processing, 2011.
[17] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
[18] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575, 2016.
[19] Hao Su, Jia Deng, and Li Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[20] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
[21] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.
[22] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10–18, 2015.
[23] Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
[24] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
Appendix A Proof of Theorem 2
Proof.
Because , we have . Then
(22)  
(23)  
(24)  
(25) 
Since loss , we obtain .
On the other hand, we have that
(26)  
(27) 
Since , we obtain
∎
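The displayed steps of this proof did not survive in the text above. As a hedged reading rather than the authors' exact derivation, the two bounds can be obtained as follows, assuming the curriculum loss has the min-max form $Q(\mathbf{u})=\min_{\mathbf{v}\in\{0,1\}^n}\max(\sum_i v_i l(u_i),\, n-\sum_i v_i+J(\mathbf{u}))$ with $J(\mathbf{u})=\sum_i\mathbf{1}(u_i<0)$ and $\hat{J}(\mathbf{u})=\sum_i l(u_i)$:

```latex
\begin{align*}
Q(\mathbf{u})
  &= \min_{\mathbf{v}\in\{0,1\}^n}
     \max\Big(\sum_{i=1}^{n} v_i\, l(u_i),\; n-\sum_{i=1}^{n} v_i + J(\mathbf{u})\Big) \\
  &\ge \min_{\mathbf{v}\in\{0,1\}^n}\Big(n-\sum_{i=1}^{n} v_i + J(\mathbf{u})\Big)
   \;\ge\; J(\mathbf{u}),
  && \text{since } \sum_{i=1}^{n} v_i \le n, \\
Q(\mathbf{u})
  &\le \max\Big(\sum_{i=1}^{n} l(u_i),\; J(\mathbf{u})\Big)
   = \hat{J}(\mathbf{u}),
  && \text{taking } \mathbf{v}=\mathbf{1} \text{ and using } l(u)\ge \mathbf{1}(u<0).
\end{align*}
```

The lower bound only uses the second argument of the maximum, and the upper bound follows from evaluating the objective at the feasible point that selects every sample.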
Appendix B Proof of Corollary 1
Proof.
Since , similar to the proof of , we have
(28)  
(29)  
(30) 
On the other hand, owing to the group (batch) separable sum structure, we have that
(31)  
(32)  
(33) 
∎
Appendix C Proof of Partial Optimization Theorem (Theorem 4)
Proof.
For simplicity, let , . Without loss of generality, assume . Let be the solution obtained by Algorithm 1. Assume there exists a such that
(34) 
Let and .
Case 1: If , then there exists an and . From Algorithm 1, we know () and . Then we know . Thus, we can achieve that
(35)  
(36) 
This contradicts the assumption in Eq.(34).
Case 2: If , then there exists an and . Let . Since , we have . From Algorithm 1, we know that . Thus we obtain that
(37)  
(38)  
(39) 
This contradicts the assumption in Eq.(34).
Case 3: If , we obtain . Then we can achieve that
(40)  
(41)  
(42)  
(43)  
(44) 
This contradicts the assumption in Eq.(34).