Recent studies can be roughly divided into two classes according to the number of classifiers, i.e., single-network and dual-network structures. The former usually covers robust surrogate losses, noise transition and self-paced learning. For instance, Bootstrap [reed2014training] adds perceptual consistency to the cross-entropy loss to mitigate the influence of label noise. Forward [patrini2017making] builds a transition matrix on top of the classifier to absorb the noise. Self-paced MentorNet [jiang2018mentornet] selects small-loss samples as clean instances and learns only from these instances. The latter introduces twin classifiers that teach each other, which boosts classification performance over a single network. Decouple [malach2017decoupling] utilizes the prediction disagreement of twin networks to select more informative samples as supervision. CLC [wu2019collaborative] leverages the entropy criterion to collaboratively correct the labels. Co-distillation [anil2018large, song2018collaborative] distills the knowledge of one network to supervise the other and vice versa. Co-teaching [han2018co] leverages two networks to select small-loss instances for cross update.
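The small-loss selection used by self-paced MentorNet and exchanged between peers in Co-teaching can be sketched as follows. This is a minimal illustration, not the cited implementations: the function names `select_small_loss` and `cross_select` and the fixed `keep_ratio` are assumptions made for the example.

```python
def select_small_loss(losses, keep_ratio):
    """Return the indices of the keep_ratio fraction of samples with the smallest loss."""
    num_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def cross_select(losses_a, losses_b, keep_ratio):
    """Cross update (assumed form): each network's selection supervises its peer."""
    picks_for_b = select_small_loss(losses_a, keep_ratio)  # chosen by A, used to update B
    picks_for_a = select_small_loss(losses_b, keep_ratio)  # chosen by B, used to update A
    return picks_for_a, picks_for_b


# Per-sample losses of two networks on the same mini-batch (made-up values).
losses_a = [0.1, 2.0, 0.3, 1.5]
losses_b = [1.0, 0.2, 0.4, 3.0]
picks_for_a, picks_for_b = cross_select(losses_a, losses_b, keep_ratio=0.5)
```

With a single network, only `select_small_loss` is needed; the dual-network variant differs only in which network consumes the selected indices.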
Although the current dual-network structure empirically improves over the single-network version, it lacks theoretical analysis and a guarantee that it always works. Besides, a natural question is whether introducing more learners can further benefit learning with noisy supervision. To explore these limits and give a more general scope, we propose a Cooperative Learning (CooL) paradigm in which multiple classifiers work cooperatively under noisy supervision. Specifically, we first demonstrate that the dual-network structure yields lower risk than the single network under suitable cooperation. Then, we give a sufficient condition for the case of more learners, where the risk is negatively correlated with the number of classifiers. In general, even though the classifiers are imperfect under noisy supervision, a lower risk can be achieved when more disagreement is introduced. Finally, based on these analyses, we introduce a cooperative learning framework in which the cooperation supervision is used to improve performance. The main contributions can be summarized in the following three points. First, we demonstrate that, in the presence of noisy supervision, the linear combination of predictions from multiple networks yields more reliable supervision than the predictions of any single classifier under some conditions. Second, we introduce a Cooperative Learning framework, in which multiple different imperfect classifiers cooperatively produce supervision to iteratively boost the performance. Third, we empirically verify the proposed method on CIFAR-10 and CIFAR-100 with synthetic noise and on three large-scale real-world datasets, namely Clothing1M, Food-101N and WebVision. Comprehensive experiments show that CooL outperforms several state-of-the-art methods.
2 The Proposed Method
Given a noisy dataset , where is the sample number, denotes an image instance and is the corresponding noisy label, we aim to produce more reliable supervision for the classifiers to learn from in the presence of label noise. We use to represent a classifier with index , and denotes the prediction of . We explore achieving this goal via the cooperation of multiple classifiers.
2.2 Dual-Network Cooperative Learning
As empirically indicated in several works [anil2018large, song2018collaborative, han2018co, malach2017decoupling], the dual-network structure readily attains robust classification performance when learning with noisy supervision. To understand this phenomenon, we develop a theoretical analysis from the perspective of risk minimization. We term this methodology dual-network Cooperative Learning (CooL) to ease the explanation and unify the notions of this work.
Suppose there are two classifiers and and we use the combination of the predictions and respectively output from and as the new cooperation supervision:
is the cooperation parameter that balances the predictions from the two classifiers. To measure the reliability of a certain supervision with respect to the ground-truth label , we define the noisy supervision risk on the training set. (We take inspiration from Li et al. [li2017learning], although our scenarios are different: we propose to measure risks on the training set with no clean labels.) In the following, we show that, with a suitable choice of , leveraging the cooperation supervision in Eq. (1) yields lower risk than the individual risk of either model.
Theorem 1. There always exists a that makes the risk of the dual-network cooperation lower than the individual risks of two non-identical networks that do not always produce incorrect predictions at the same time (we exclude the situation where one classifier strictly dominates the other; otherwise there is no need for cooperation, since will be set to 0 or 1 and the cooperation risk will equal the lower one), i.e.,
Proof. First, the risks of the individual predictions from and are respectively quantified as follows,
Then, the cooperation risk can be decomposed as follows,
where . For two non-identical classifiers and that do not always produce incorrect predictions at the same time, since , are label distributions and is the one-hot label, we have
By setting , we obtain
which concludes the proof of Theorem 1. ∎
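The steps of this argument can be sketched with explicit symbols. The symbol names ($\hat{y}_1$, $\hat{y}_2$, $y^*$, $\lambda$, $R_1$, $R_2$, $R_{12}$) and the squared-error form of the risk are assumptions made for this sketch, so it should be read as one consistent instantiation of the proof rather than the exact notation of Eqs. (1) to (3).

```latex
\begin{align*}
\hat{y}_c &= \lambda \hat{y}_1 + (1-\lambda)\hat{y}_2, \qquad \lambda \in [0,1], \\
R_k &= \frac{1}{N}\sum_{i=1}^{N} \lVert \hat{y}_{k,i} - y^*_i \rVert^2 \quad (k = 1, 2), \qquad
R_{12} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_{1,i} - y^*_i)^\top (\hat{y}_{2,i} - y^*_i), \\
R_c(\lambda) &= \lambda^2 R_1 + (1-\lambda)^2 R_2 + 2\lambda(1-\lambda) R_{12}, \\
\lambda^\star &= \frac{R_2 - R_{12}}{R_1 + R_2 - 2R_{12}}, \qquad
R_c(\lambda^\star) = \frac{R_1 R_2 - R_{12}^2}{R_1 + R_2 - 2R_{12}}, \\
R_1 - R_c(\lambda^\star) &= \frac{(R_1 - R_{12})^2}{R_1 + R_2 - 2R_{12}}, \qquad
R_2 - R_c(\lambda^\star) = \frac{(R_2 - R_{12})^2}{R_1 + R_2 - 2R_{12}}.
\end{align*}
```

Under these assumptions the denominator $R_1 + R_2 - 2R_{12}$ equals the mean squared distance between the two predictions, which is positive for non-identical classifiers. Both differences in the last line are therefore nonnegative, and $\lambda^\star$ lies strictly between 0 and 1 whenever $R_{12} < \min(R_1, R_2)$, i.e., when neither classifier dominates the other.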
Remark 1. From Theorem 1, we know that a suitable dual-network cooperation achieves a lower risk than either individual network. Furthermore, the minimum risk obtained in Eq. (3) is positively correlated with . This term reflects the divergence between the two classifiers on the samples where both classifiers make incorrect predictions. The optimal situation is that the two classifiers never make mistakes at the same time, in which case we have . Intuitively, both and deviate from the true label . However, because these deviations point in random directions in the presence of stochastic label noise, the proposed cooperation supervision can be closer to the true label.
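A small numerical check makes the claim of Theorem 1 concrete. The sketch below assumes a squared-error risk averaged over the training samples and a closed-form choice of the cooperation parameter obtained by minimizing the resulting quadratic; the function names and the toy predictions are hypothetical.

```python
def cross_term(p, q, y):
    """Inner product of the two deviation vectors (p - y) and (q - y)."""
    return sum((a - t) * (b - t) for a, b, t in zip(p, q, y))


def cooperation_risks(preds1, preds2, truths):
    """Return (r1, r2, r12, lam, rc) under an assumed squared-error risk."""
    n = len(truths)
    r1 = sum(cross_term(p, p, y) for p, y in zip(preds1, truths)) / n
    r2 = sum(cross_term(q, q, y) for q, y in zip(preds2, truths)) / n
    r12 = sum(cross_term(p, q, y) for p, q, y in zip(preds1, preds2, truths)) / n
    # Minimizer of lam^2 * r1 + (1 - lam)^2 * r2 + 2 * lam * (1 - lam) * r12.
    lam = (r2 - r12) / (r1 + r2 - 2 * r12)
    mixed = [[lam * a + (1 - lam) * b for a, b in zip(p, q)]
             for p, q in zip(preds1, preds2)]
    rc = sum(cross_term(m, m, y) for m, y in zip(mixed, truths)) / n
    return r1, r2, r12, lam, rc


# Two imperfect classifiers that err on different samples (made-up predictions).
truths = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
preds1 = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2]]
preds2 = [[0.4, 0.6], [0.9, 0.1], [0.7, 0.3]]
r1, r2, r12, lam, rc = cooperation_risks(preds1, preds2, truths)
```

In this toy case the cross term is smaller than both individual risks, so the mixing weight falls strictly between 0 and 1 and the risk of the mixed supervision is below both individual risks.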
2.2.1 Connection and Difference.
Here, we rethink two representative dual-network methods namely Co-distillation and Co-teaching through the lens of Cooperative Learning.
Co-distillation represents a branch of studies that leverage model predictions to rectify the noisy labels. It is the case where in Eq. (1) is replaced by the noisy label . Correspondingly setting , the optimal risk for can be simplified as , which is lower than the risk . This explains why Co-distillation shows improvement. Nevertheless, it also reveals one defect: is fixed and cannot be improved as decreases.
Co-teaching represents a line of studies that select samples for training with a certain criterion. Thus, we can adjust the risk by modifying the associated dataset (removing the unreliable samples). According to [han2018co], the supervision of is a candidate set selected by and vice versa. The risk for is then denoted as . Generally, thanks to the small-loss trick, it is lower than the risk on . If we linearly combine the supervision as in CooL, the cooperation risk is also a linear combination with respect to , so the minimum is obtained at the boundaries, i.e., it is the smaller of and . In this case, a more reasonable way is to choose whichever of the two candidate sets has the lower risk. Co-teaching instead uses both sets for cross update, which may impair the performance.
2.3 Generalized Cooperative Learning
In this section, we generalize our dual-network CooL to a multi-network variant, which is able to achieve an even lower risk. Given non-identical classifiers, we denote the new cooperation supervision as , where is a -dimensional row vector whose entries sum to 1 and is a stack of in rows. We define where . The corresponding diagonal elements are the individual risks associated with the predictions of the classifiers. In the following, we analyze the cooperation risk for multiple classifiers.
Theorem 2. Given the cooperation supervision , the associated risk is . An invertible yields,
If all non-identical classifiers are independently trained in the same settings, so that the following conditions are satisfied,
then the minimum cooperation risk in Eq. (4) will be
which is lower than the individual risks of all classifiers.
Proof. The risk associated with the new supervision is
Using a Lagrange multiplier, we now minimize,
By setting and , we obtain,
Thus, the minimum risk associated with is
which concludes the proof of Theorem 2. ∎
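The Lagrangian step of this proof can be written out explicitly. The symbols below ($\alpha$ for the weight row vector, $R$ for the symmetric risk matrix, $\mathbf{1}$ for a row vector of ones, $\mu$ for the multiplier) are assumed names for this sketch and may differ from the notation of Eqs. (4) to (6).

```latex
\begin{align*}
\mathcal{L}(\alpha, \mu) &= \alpha R \alpha^\top + \mu\,(1 - \alpha \mathbf{1}^\top), \\
\frac{\partial \mathcal{L}}{\partial \alpha} &= 2\alpha R - \mu \mathbf{1} = 0
\;\Rightarrow\; \alpha = \tfrac{\mu}{2}\,\mathbf{1} R^{-1}, \\
\alpha \mathbf{1}^\top = 1 &\;\Rightarrow\; \tfrac{\mu}{2} = \frac{1}{\mathbf{1} R^{-1} \mathbf{1}^\top}, \qquad
\alpha^\star = \frac{\mathbf{1} R^{-1}}{\mathbf{1} R^{-1} \mathbf{1}^\top}, \\
\alpha^\star R\, \alpha^{\star\top} &= \frac{1}{\mathbf{1} R^{-1} \mathbf{1}^\top}.
\end{align*}
```

If, as an additional assumption, every diagonal entry of $R$ equals $r$ and every off-diagonal entry equals $r'$ with $r > r'$, then $\alpha^\star$ is the uniform vector and the minimum becomes $r/n + (n-1)\,r'/n$, which decreases toward $r'$ as $n$ grows.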
Remark 2. From Theorem 2, we can see that the first term on the RHS of Eq. (6) indicates that, by leveraging more classifiers (increasing n), we can monotonically obtain a lower risk. In this case, being diagonally dominant yields a necessary and sufficient condition where the risk is inversely proportional to the number of classifiers. Besides, similar to the claim in Remark 1, the off-diagonal element in characterizes the divergence of two networks. If two classifiers are complementary to each other, they will work better cooperatively even though they are imperfect.
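The effect of adding classifiers can be illustrated numerically. The sketch below assumes the symmetric case in which all classifiers share the same individual risk (`diag`) and the same pairwise cross term (`off_diag`), and it uses uniform weights, which are optimal under that assumption; the names and the numbers are hypothetical.

```python
def cooperation_risk(alpha, risk_matrix):
    """Quadratic form alpha R alpha^T for weights alpha that sum to 1."""
    n = len(alpha)
    return sum(alpha[i] * risk_matrix[i][j] * alpha[j]
               for i in range(n) for j in range(n))


def equal_risk_matrix(n, diag, off_diag):
    """Risk matrix with identical individual risks and identical cross terms."""
    return [[diag if i == j else off_diag for j in range(n)] for i in range(n)]


def uniform_risk(n, diag, off_diag):
    """Cooperation risk of n classifiers combined with equal weights 1/n."""
    alpha = [1.0 / n] * n
    return cooperation_risk(alpha, equal_risk_matrix(n, diag, off_diag))


# Individual risk 0.4 and cross term 0.1 are made-up values for illustration.
risks = [uniform_risk(n, diag=0.4, off_diag=0.1) for n in range(1, 5)]
```

The resulting sequence decreases with the number of classifiers and approaches the cross term, which is why diverse classifiers (a small cross term) matter as much as their number.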
2.4 The Cooperative Learning Framework
The theoretical analysis in the previous sections tells us that the cooperation of multiple classifiers can lower the supervision risk and that the lower bound is determined by the divergence between the classifiers. Based on this, we introduce a new Cooperative Learning (CooL) framework in which the proposed cooperation supervision, namely the combination of the predictions from the multiple classifiers, is adopted to re-train the individual networks. As claimed in Remarks 1 and 2, the prerequisite for better performance on noisy datasets via cooperation is to generate diverse classifiers. We thus make the classifiers learn from different sources of information to construct pattern bias. Specifically, we pre-train the classifiers respectively on , the different partitions of . Note that this pre-training style relies on the assumption that each subset is still sufficient to learn a classifier exhibiting the same risk as on the whole dataset. Thus, we cannot add classifiers indefinitely to lower the risk, since the data is limited.
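The partition-based pre-training described above can be sketched as a split of the sample indices, one subset per classifier. The source does not state how the partitions are drawn; the sketch assumes disjoint, nearly equal-sized random subsets, and `partition_indices` is a hypothetical helper.

```python
import random


def partition_indices(num_samples, num_parts, seed=0):
    """Split sample indices into num_parts disjoint, nearly equal-sized random subsets."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list keeps subset sizes within one of each other.
    return [sorted(indices[k::num_parts]) for k in range(num_parts)]


# Each classifier would be pre-trained only on its own subset of the noisy dataset.
parts = partition_indices(num_samples=10, num_parts=3)
```

Each additional classifier shrinks every subset, which is the practical limit mentioned above: with too many partitions, no subset is large enough to train a good individual learner.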
After obtaining multiple different pre-trained classifiers, we can use the combination of their predictions to train better classifiers. Instead of training another student network with such supervision as in [li2017learning], we iteratively train the classifiers with the following objective function,
In the first term of Eq. (7), the network is supervised by the cooperation supervision , which is the key module of this paper. The second term is an auxiliary term that supervises with the original labels in the early phase but is gradually canceled out, as the model is capable of memorizing the noisy labels. The third term is the entropy of the model predictions, which prevents the output of the network . For and , we empirically assign small weights as in [song2018collaborative]. The complete training process is summarized in Algorithm 1.
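The three-term structure of the objective can be sketched for a single sample as follows. Since Eq. (7) itself is not reproduced here, everything specific in the sketch is an assumption: the use of cross-entropy for the first two terms, the additive form, the sign of the entropy term, and the weight names `beta` and `gamma`.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a (soft) target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def entropy(pred, eps=1e-12):
    """Shannon entropy of a predicted distribution."""
    return -sum(p * math.log(p + eps) for p in pred)


def combine(all_preds, weights):
    """Cooperation supervision: weighted sum of the classifiers' predictions."""
    num_classes = len(all_preds[0])
    return [sum(w * p[c] for w, p in zip(weights, all_preds))
            for c in range(num_classes)]


def cool_style_loss(pred, all_preds, weights, noisy_onehot, beta, gamma):
    """Assumed form: cooperation term + beta * noisy-label term + gamma * entropy term."""
    coop = combine(all_preds, weights)
    return (cross_entropy(coop, pred)
            + beta * cross_entropy(noisy_onehot, pred)
            + gamma * entropy(pred))


# Made-up predictions of two classifiers on one sample with noisy label 0.
all_preds = [[0.8, 0.1, 0.1], [0.4, 0.5, 0.1]]
coop = combine(all_preds, weights=[0.5, 0.5])
loss = cool_style_loss([0.7, 0.2, 0.1], all_preds, [0.5, 0.5], [1.0, 0.0, 0.0],
                       beta=0.1, gamma=0.1)
```

In the sketch, setting `beta` to zero removes the noisy-label term entirely, which mirrors the idea that the original labels only assist in the early phase.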
Complexity Analysis. The time complexity of CooL is not a major concern, since the computation can be distributed across the individual classifiers in parallel. Let be the mini-batch size and the parameter size; then, in each mini-batch update, the time complexity for each classifier is . The space complexity, however, might be a bottleneck, as the storage cost grows linearly with the number of classifiers, i.e., . Thus, when implementing multiple-network CooL in practice, we have to consider the resource limit.
3.1 Datasets and Baselines
| noise ratio | 0.2 | 0.45 | 0.2 | 0.45 | 0.2 | 0.5 | 0.2 | 0.45 | 0.2 | 0.45 | 0.2 | 0.5 |
To demonstrate the effectiveness of CooL, we experiment with CIFAR-10 and CIFAR-100 [krizhevsky2009learning] with pairwise noise [han2018co], asymmetric noise [patrini2017making] and symmetric noise [van2015learning], and with Clothing1M [xiao2015learning], Food-101N [lee2018cleannet] and WebVision [li2017webvision] for real-world noise. We compare CooL with the following two categories of noisy-supervised learning methods.
Single-network Methods: Standard, which directly trains a vanilla classifier on noisy datasets; Forward [patrini2017making], which uses a noise transition matrix for the forward loss correction; LCCN [yao2019safeguarded], which dynamically adjusts the transition matrix to safeguard the learning process (we directly report the result of LCCN on WebVision, since we adopt the same settings, to avoid the considerable effort of reproducing it); Bootstrap [reed2014training], which linearly combines the model predictions and original labels; MentorNet [jiang2018mentornet], for which we deploy the self-paced variant, in which a single model determines small-loss samples as useful information and learns from these samples.
Dual-network Methods: Decouple [malach2017decoupling], which updates the parameters when the two models disagree; Co-distillation [anil2018large, song2018collaborative], which is the dual-network version of Bootstrap; Co-teaching [han2018co], which is the dual-network version of self-paced MentorNet; Bagging [breiman1996bagging], which takes a vote of multiple classifiers pre-trained on random data partitions; specifically, we use two classifiers.
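For the synthetic settings, label corruption is commonly implemented with a class transition matrix. The sketch below shows one common construction of symmetric and pairwise noise, assumed here rather than taken from the cited protocols: symmetric noise spreads the flipped mass evenly over the other classes, and pairwise noise flips each class to the next one. All function names are hypothetical.

```python
import random


def symmetric_matrix(num_classes, ratio):
    """Keep a label with probability 1 - ratio, else flip uniformly to another class."""
    off = ratio / (num_classes - 1)
    return [[1.0 - ratio if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]


def pairwise_matrix(num_classes, ratio):
    """Keep a label with probability 1 - ratio, else flip it to the next class (cyclic)."""
    matrix = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        matrix[i][i] = 1.0 - ratio
        matrix[i][(i + 1) % num_classes] = ratio
    return matrix


def corrupt(labels, matrix, seed=0):
    """Sample a noisy label for every clean label from its row of the matrix."""
    rng = random.Random(seed)
    classes = list(range(len(matrix)))
    return [rng.choices(classes, weights=matrix[y])[0] for y in labels]


# 1000 samples of class 0 under pairwise noise can only stay 0 or become 1.
noisy = corrupt([0] * 1000, pairwise_matrix(3, 0.45))
```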
For CIFAR-10 and CIFAR-100, we follow the same implementation as in [han2018co]. For real-world datasets, a 50-layer ResNet architecture [he2016deep] pre-trained on ImageNet is adopted as the classifier. The images are resized so that the shorter side is 256 and then randomly cropped to 224 × 224, with random flipping and random brightness, contrast and saturation adjustments. For Clothing1M and Food-101N, the batch size is set to 64, and we run 20 epochs on Clothing1M and 40 epochs on Food-101N. For WebVision, we align the learning settings with Yao et al. [yao2019safeguarded] to avoid the effort of reproducing the results.
For all experiments, we use the same architecture for the two networks when implementing Decouple, Co-teaching, Co-distillation and CooL, as done in [malach2017decoupling, song2018collaborative, han2018co]. To be fair, all methods use pre-training as a warm-up, following [patrini2017making, song2018collaborative, han2018co]. Specifically, the models are trained as Standard for 4 epochs on Clothing1M and 10 epochs on all other datasets. For dual-network methods, the two branches are pre-trained separately. For Forward, we use the normalized ground-truth confusion matrix provided in [xiao2015learning] on Clothing1M. For MentorNet and Co-teaching, the noise ratio is provided as side information, as required in [han2018co] for the pre-defined curriculum. For CooL, we set since the two networks have the same architecture and are trained in the same manner. We empirically set on CIFAR-10 and Clothing1M, while we set on datasets containing many more categories. For the CIFAR datasets, we linearly decrease since deep models easily fit the small-scale datasets, and we fix .
3.3 Results on CIFAR-10 and CIFAR-100
Table 1 summarizes the average test accuracy over the last ten epochs of dual-network CooL and all baselines on CIFAR-10 and CIFAR-100. The results show that dual-network methods such as Co-distillation and Co-teaching generally perform better than their single-network versions, namely Bootstrap and MentorNet. Bagging works well on CIFAR-10 but fails on CIFAR-100. The reason is that CIFAR-100 contains more classes, so the data after random partition is insufficient to train a good individual learner. However, adopting a similar style of pre-training, CooL achieves the best performance in all of the noise settings. This indicates that CooL can effectively boost two imperfect individual learners. In particular, for low-level pairwise noise , CooL outperforms the best baseline by 6.24% on CIFAR-10 and 7.24% on CIFAR-100. When rises to 0.45, all baselines degrade sharply, while CooL remains robust to high-level noise. According to the results, CooL outperforms the best baseline by 15.88% on CIFAR-10, even surpassing the results of all baselines under low-level noise. For asymmetric noise, CooL achieves the best performance in all settings and outperforms the best baseline by 14.47% when on CIFAR-100. Symmetric noise is the hardest noise pattern, as the overall test accuracies are low. However, CooL still outperforms all the baselines.
3.3.1 Counteracting the Memorization Effects
Memorization effects [arpit2017closer] refer to the behavior of DNNs under noisy supervision whereby the model first learns from clean data and eventually fits the noisy labels. This phenomenon shows up as a rise followed by a drop in the test accuracy curve. In the left panel of Figure 1, we trace the test accuracy of all baselines and CooL under high-level pairwise noise on CIFAR-10. For all baselines, the test accuracy increases at first and then decreases as training proceeds, which matches the memorization effects. However, the curve of CooL keeps increasing and then stays at a high level, which indicates that the proposed cooperation supervision is reliable enough to counteract the memorization effects.
3.3.2 Reliability of the Supervision
To further assess the reliability of the cooperation supervision in CooL, we report the label precision, which is the ratio of correct supervisions to total supervisions. We compare CooL with the sample selection methods that leverage different criteria to select the reliable supervision. The label precision curves in setting 2 are depicted in the middle panel of Figure 1. We can see that the label precision of CooL consistently increases over the optimization iterations and surpasses all the sample selection methods. In the late stage of training, the label precision of the cooperation supervision is 94.29%, which means that CooL lets the classifiers learn on a relatively clean dataset. This empirically verifies the reliability of the proposed cooperation supervision.
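Label precision as defined above can be computed with a few lines. Because the cooperation supervision is a soft distribution, the sketch assumes that a supervision counts as correct when its most probable class equals the true label; the helper name `label_precision` is hypothetical.

```python
def label_precision(supervisions, true_labels):
    """Fraction of supervisions whose most probable class matches the true label."""
    correct = 0
    for dist, label in zip(supervisions, true_labels):
        predicted = max(range(len(dist)), key=dist.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(true_labels)


# Three made-up soft supervisions over two classes; the last one is wrong.
precision = label_precision([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]], [0, 1, 0])
```

For sample selection methods, the same ratio would be taken over the selected subset only, using the given labels of the selected samples as the supervisions.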
3.3.3 Two Learners and Beyond
Here we carry out experiments to examine our theoretical findings on the effects of utilizing multiple networks. We implement triple-network CooL (CooL-3) and quadruple-network CooL (CooL-4) and plot their accuracy along with CooL and Standard in the right panel of Figure 1. The curves show that CooL-3 generally improves on or is comparable with CooL. This matches our analysis that increasing the number of classifiers results in a smaller risk, closer to the lower bound. CooL-4 shows improvement in setting 4, while its accuracy drops slightly in settings 2 and 6. This is due to insufficient information under the partition of the limited data, as discussed earlier. As training four classifiers also requires more resources, we may resort to only CooL or CooL-3 in practice.
3.4 Results on Clothing1M, Food-101N and WebVision
In this section, we empirically verify the effectiveness of CooL on three large-scale datasets with real-world noise.
For Clothing1M, the results are reported in the left column of Table 3. We can see that dual-network methods generally perform better than the single-network versions from which they are derived. Although MentorNet does not work well in this setting, Co-teaching manages to surpass Standard by 1.16%. Among all baselines, Forward achieves the best performance by using the ground-truth transition matrix. However, CooL surpasses Forward by 3.09% without using any side information. CooL-3 further improves on CooL by 0.33%, indicating the effectiveness of leveraging multiple classifiers.
For Food-101N, Bootstrap degrades slightly compared to Standard, while its dual-network version, Co-distillation, enjoys a 1.86% gain. Adopting the same small-loss trick, both MentorNet and Co-teaching perform well on Food-101N. MentorNet outperforms Standard by 1.48%, and Co-teaching outperforms MentorNet by 0.94% thanks to the dual-network structure. Without knowledge of the ground-truth transition matrix, Forward improves the test accuracy by only 0.47% compared to Standard. Again, CooL outperforms the best baseline by 3.07%. With one more classifier, CooL-3 obtains a further 0.14% gain.
For WebVision, we report both top-1 and top-5 accuracies in Table 3. The result of Co-teaching is absent because the ground-truth noise ratio is not available. The results show that CooL achieves the best performance and CooL-3 further surpasses CooL. However, the gap between all the methods is small, which may be due to the strong open-set noise, as suggested by Yao et al. [yao2019safeguarded].
4 Conclusion and Future Work
In this paper, we propose a Cooperative Learning paradigm in which multiple classifiers work cooperatively under noisy supervision. We demonstrate that our proposed cooperation risk is lower than that associated with individual learners. Then we present a sufficient condition under which the risk is negatively correlated with the number of classifiers. Finally, we introduce the Cooperative Learning framework, in which the reliable cooperation supervision iteratively boosts the performance of the classifiers. We conduct a range of experiments on the CIFAR datasets to demonstrate the robustness of CooL under synthetic noise, and we verify the effectiveness of CooL on three real-world large-scale datasets. We further implement CooL-3 and CooL-4 to show that leveraging more classifiers can bring potential gains; nonetheless, adding more classifiers consumes more resources. Future research directions include finding new means to generate multiple divergent classifiers to achieve lower risk and reducing the parameter space for multiple-network CooL via parameter sharing.