In many supervised learning applications, a clean labeled dataset is the key to success. However, in real-world scenarios, label noise inevitably originates from different sources such as inconsistent labelers or the difficulty of the labeling task itself. In many classification tasks, for example, samples that cannot be squeezed into a strict categorical scheme will lead to inconsistent labels.
With traditional supervised learning, label noise decreases the performance of classification models since they tend to over-fit to the samples with noisy labels. This results in lower accuracy and inferior generalization properties. To avoid the negative influence of noisy labels, a common approach is to use sample-dependent loss weights as learning regularizers (Jiang et al., 2017; Ren et al., 2018). However, the performance of these mechanisms strongly depends on the respective hyperparameters, which are difficult to set.
Typically the loss weights are restricted to [0, 1] by design to resemble a probability of a noisy label given a sample. In a supervised learning framework, however, even with tiny loss weights, the model could still receive a strong learning signal from noisy samples (as, e.g., in (Ren et al., 2018)). The ideal case is assigning zero weights to samples with noisy labels, which, however, implies ignoring those samples and results in a smaller training dataset.
In this paper, instead of training in a supervised framework, we learn from the samples with noisy labels in an unsupervised way. Since only the labels are noisy and not the input data, semi-supervised learning can still exploit the raw data samples. By keeping those samples rather than removing them from training, our proposed method can be stricter when it comes to removing potentially noisy labels.
In more detail, we propose a learning scheme consisting of (1) iterative filtering of noisy labels and (2) semi-supervised learning to regularize the problem in the vicinity of noisy samples. Fig. 2 shows a simplified overview of the concept. We refer to the proposed training procedure as Iterative Filtering with Semi-Supervised Learning (IF-SSL). To the best of our knowledge, we propose the first approach that only removes the noisy labels instead of the complete data samples using filtering. Our approach requires no new dedicated mechanism for robust learning and utilizes only existing standard components for learning.
The proposed algorithm was evaluated on classification tasks for CIFAR-10 & CIFAR-100 with a varying label noise ratio from 0% to 80%. We show results both for a clean validation set and a noisy one. In both cases, we show that using the filtered data as unlabeled samples significantly outperforms complete removal of the data. As a consequence, the proposed model consistently outperforms state of the art at all levels of label noise; see Fig. 1. Despite the simplicity of the training pipeline, our approach shows robust performance even in case of high noise ratios. The source code will be made available together with the published paper.
2 Robust Learning with Iterative noise-filtering
Fig. 2 shows an overview of our proposed approach. In the beginning, we assume that the labels of the training set are noisy (up to a certain noise ratio). We use a small validation set to measure the improvement in model performance. In each iteration, we first apply semi-supervised training until we find the best model w.r.t. the performance on the validation set (e.g., by early stopping). In the next step, we use the moving-average prediction results of the best model to filter out potentially noisy labels based on the strategy defined in Section 2.2. In the next iteration, we again use all data and the new filtered label set as input for the model training. The iterative training procedure stops when no better model can be found. Our filtering pipeline only requires standard components for training deep learning models.
To provide a powerful regularizer against label noise, the semi-supervised model treats all data points as additional unlabeled samples. Concretely, in the first iteration, the model learns from supervised and unsupervised learning objectives on the complete dataset. Subsequently, the unsupervised learning objective continuously derives learning signals from all data points while the supervised learning objective is computed only on a filtered set of labeled samples. Over these iterations, the label noise in the training set is expected to reduce.
In the following, we give more details about the combination of this training and filtering procedure with existing techniques from semi-supervised learning.
2.2 Iterative Filtering
Let us start with an initial noisy training dataset and a validation set. Assume each example carries either one of the class labels or a special marker that denotes the unlabeled/noisy case. The model, in the current training epoch, maps each example to a set of per-class scores. We further track the accuracy of the model over the validation set.
The training procedure applied in each iteration is explained in detail in Section 2.3.
Using these notations, the label filtering algorithm is given in Algorithm 1.
The label filtering is always performed on the original label set provided at the beginning, not on the label set left over from the previous iteration. In this way, clean labels erroneously removed in an earlier iteration (e.g., labels of hard-to-classify samples) can be used for the model training again. This is a major difference from typical iterative filtering approaches, where the filtering at a given iteration is restricted to the training samples of the respective iteration only.
We apply a variant of easy sample mining and filter out training samples based on the model’s agreement with the provided label. That means the labels are only used for supervised training if in the current epoch the model predicts the respective label to be the correct class with the highest likelihood. This is reflected in Algorithm 1 line 12 to line 14.
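The agreement rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: the function name `filter_labels` and the use of `None` as the marker for a removed label are assumptions made for this example.

```python
from typing import List, Optional, Sequence


def filter_labels(
    original_labels: Sequence[int],
    predicted_scores: Sequence[Sequence[float]],
) -> List[Optional[int]]:
    """Keep a label only if the model ranks that class highest.

    A removed label is represented by None (hypothetical convention);
    the sample itself stays in the dataset as an unlabeled example.
    """
    filtered: List[Optional[int]] = []
    for label, scores in zip(original_labels, predicted_scores):
        top_class = max(range(len(scores)), key=lambda c: scores[c])
        filtered.append(label if top_class == label else None)
    return filtered


labels = [0, 1, 2]
scores = [
    [0.7, 0.2, 0.1],  # model agrees with label 0
    [0.6, 0.3, 0.1],  # model prefers class 0, label says 1
    [0.1, 0.1, 0.8],  # model agrees with label 2
]
print(filter_labels(labels, scores))  # [0, None, 2]
```

Only the label of the second sample is dropped; its input remains available for the unsupervised objective.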
The model’s predictions required for filtering can be stored during training directly. However, the predictions for noisy samples tend to fluctuate. For example, take a cat wrongly labeled as a tiger. Other cat samples would encourage the model to predict the given cat image as a cat. In contrast, the wrong label tiger regularly pulls the model back to predict the cat as a tiger. Hence, using the model’s predictions gathered in one single training epoch for filtering is sub-optimal.
Instead, we propose to collect the sample predictions over multiple training epochs. This scheme is displayed in Fig. 3. For each sample, we store the moving-averaged predictions, accumulated over the most recent training iterations. Besides providing a more stable basis for the filtering step, our proposed procedure adds only negligible memory and computation overhead.
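To illustrate why averaged predictions are a more stable basis for filtering, the sketch below follows the cat/tiger example. It assumes an exponential moving average with a hypothetical `momentum` parameter; the text only states that a moving average is used, so the exact averaging rule is an assumption.

```python
from typing import List, Sequence


def update_average(
    average: Sequence[float], new_prediction: Sequence[float], momentum: float
) -> List[float]:
    """One exponential-moving-average step over a per-class prediction."""
    return [momentum * a + (1.0 - momentum) * p for a, p in zip(average, new_prediction)]


def averaged_prediction(
    per_epoch_predictions: Sequence[Sequence[float]], momentum: float = 0.9
) -> List[float]:
    """Fold the predictions of several epochs into one averaged prediction."""
    average = list(per_epoch_predictions[0])
    for prediction in per_epoch_predictions[1:]:
        average = update_average(average, prediction, momentum)
    return average


# Classes: index 0 = cat, index 1 = tiger. The image shows a cat but is
# labeled "tiger", so the per-epoch prediction flips back and forth.
history = [[0.8, 0.2], [0.3, 0.7], [0.9, 0.1]]
average = averaged_prediction(history, momentum=0.5)
print(average)  # approximately [0.725, 0.275]
```

A filter that looked only at the second epoch would see "tiger" as the top class and keep the wrong label, whereas the averaged prediction favors "cat". Only one running vector per sample has to be stored.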
Because training continues from the best model of the previous iteration, computation time can be significantly reduced compared to re-training the model from scratch. On the newly filtered dataset, the model only has to adapt gradually to the new noise ratio contained in the training set. Depending on the computation budget, a maximal number of filtering iterations can be set to save time.
Moreover, the new training procedure does not require specific mechanisms or algorithms which need to be implemented or fine-tuned. Implementation-wise, it can be realized by looping the standard training procedure and filtering potentially noisy samples at the end of each training run.
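The outer loop can be sketched as follows. This is a simplified illustration under stated assumptions: `train` is a hypothetical callable that runs one full (early-stopped) training run on the given labels and returns the validation accuracy together with the predicted class per training sample, and `None` marks a removed label. Continuing training from the previous best model is left inside `train` and is not modeled here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Labels = List[Optional[int]]  # None marks a removed (unlabeled) label


def iterative_filtering(
    original_labels: Sequence[int],
    train: Callable[[Labels], Tuple[float, List[int]]],
    max_iterations: int = 10,
) -> Tuple[Labels, float]:
    """Loop a standard training run and filter labels after each run."""
    labels: Labels = list(original_labels)
    best_labels: Labels = labels
    best_accuracy = float("-inf")
    for _ in range(max_iterations):
        accuracy, predictions = train(labels)
        if accuracy <= best_accuracy:
            break  # no better model found: stop iterating
        best_accuracy, best_labels = accuracy, labels
        # Always filter the ORIGINAL label set, so that a label removed
        # in an earlier iteration can be restored later.
        labels = [y if p == y else None for y, p in zip(original_labels, predictions)]
    return best_labels, best_accuracy


# Toy stand-in for training: the "model" predicts the true class, and the
# purity of the kept labels serves as a proxy for validation accuracy.
TRUE_CLASSES = [0, 1, 0, 1]


def toy_train(labels: Labels) -> Tuple[float, List[int]]:
    kept = [(y, t) for y, t in zip(labels, TRUE_CLASSES) if y is not None]
    accuracy = sum(y == t for y, t in kept) / len(kept)
    return accuracy, list(TRUE_CLASSES)


final_labels, final_accuracy = iterative_filtering([0, 1, 1, 1], toy_train)
print(final_labels, final_accuracy)  # [0, 1, None, 1] 1.0
```

The `max_iterations` argument corresponds to the optional cap on filtering iterations for a limited computation budget.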
2.3 Unsupervised learning to counter label noise
Although the proposed learning procedure is not restricted to classification tasks, in this work, we explain the procedure for classification as a use-case.
Model training is performed using two types of learning objectives: (1) supervised and (2) unsupervised losses. Supervised learning from noisy-labeled samples is straightforward and can be done with typical n-way classification losses. The unsupervised learning objective, however, requires a design choice of which data to use (defined in Section 2.2) and how to learn from them.
2.3.1 Learning from unlabeled data
We learn from all data points in a semi-supervised fashion. Concretely, in addition to supervised learning with filtered labels, unsupervised learning is applied to the entire dataset. Our learning strategy can take advantage of unsupervised learning from a large dataset, and therefore it has a potentially large regularization effect against label noise. Unsupervised learning objectives impose additional constraints on all samples, which are hard to follow for wrongly labeled samples. These constraints could be a preference of extreme predictions (Entropy-loss) or non-fluctuating model predictions over many past iterations (Mean-teacher-loss). Both constraints are explained in the following.
The typical entropy loss for semi-supervised learning is shown in Fig. 8. It encourages the model to provide extreme predictions (such as 0 or 1) for each sample. Over a large number of samples, the model should balance its predictions over all classes.
The entropy loss can easily be applied to all samples to express the uncertainty about the provided labels. Alternatively, the loss can be combined with a strict filtering strategy, as in our work, which removes the labels of potentially wrongly labeled samples.
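As a small worked example of why the entropy loss prefers extreme predictions, the sketch below computes the Shannon entropy of a few predicted class distributions. The function names are illustrative, and the class-balancing aspect mentioned above is not modeled.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_loss(batch: Sequence[Sequence[float]]) -> float:
    """Mean prediction entropy over a batch; no labels are needed."""
    return sum(entropy(probs) for probs in batch) / len(batch)


confident = [1.0, 0.0]   # extreme prediction: zero entropy
uncertain = [0.5, 0.5]   # maximally uncertain for two classes: log(2)
print(entropy(confident), entropy(uncertain))
print(entropy_loss([confident, uncertain]))
```

Minimizing this quantity pushes each prediction towards a one-hot distribution, regardless of whether the provided label is correct.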
For a large noise ratio, predictions of wrongly labeled samples fluctuate strongly over previous training iterations. Amplifying these network decisions could lead to even noisier models. Combined with iterative filtering, the framework will have to rely on a single noisy model snapshot. In the case of an unsuitable snapshot, the filtering step will make many wrong decisions.
Mean Teacher model
A better way to perform semi-supervised learning and counteract label noise is to employ the Mean Teacher model (Tarvainen & Valpola, 2017). The Mean Teacher model follows the student-teacher learning procedure from (Hinton et al., 2015). The main idea is to create a virtuous learning cycle, in which the student continually learns to surpass the (better) teacher. Concretely, the Mean Teacher is an exponential moving average of the student models over training iterations.
In contrast to learning from the entropy loss, the Mean Teacher solves precisely the problem of noisy model snapshots. The teacher model is a moving average over the past training iterations and hence much more stable than a single snapshot. The training of such a model is shown in Fig. 5.
Mean Teacher model for iterative filtering
Given the setting in Section 2.2, we apply the Mean Teacher algorithm in each iteration in the procedure as follows.
Input: examples with potentially clean labels from the filtering procedure. In the first iteration, these are all training examples with their provided labels.
Initialize a supervised neural network as the student model.
Initialize the Mean Teacher model as a copy of the student model with all weights detached.
Let the loss function be the sum of the normal classification loss of the student and the consistency loss between the outputs of the student and the teacher.
Select an optimizer.
In each training iteration:
Update the weights of the student using the selected optimizer.
Update the weights of the teacher as an exponential moving average of the student weights.
Evaluate the performance of the student and the teacher on the validation set to check the early-stopping criterion.
Return the best model.
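The teacher update in the procedure above is an exponential moving average over the student weights. The sketch below shows one such update on plain Python dictionaries; the parameter names and the `decay` value are hypothetical, and a real implementation would apply the same rule to every tensor of the network.

```python
from typing import Dict


def update_teacher(
    teacher: Dict[str, float], student: Dict[str, float], decay: float = 0.99
) -> Dict[str, float]:
    """Exponential moving average of the student weights.

    The teacher receives no gradient; it only tracks the student.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}  # student after one optimizer step (toy values)
teacher = update_teacher(teacher, student, decay=0.9)
print(teacher)  # approximately {'w': 0.9, 'b': 0.1}
```

A decay close to 1 makes the teacher change slowly, which is what makes it more stable than any single student snapshot.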
The consistency loss between the student's and the teacher's output distributions can be realized with the mean squared error or the Kullback-Leibler divergence.
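Both options for the consistency loss can be written down directly for a pair of predicted class distributions. This is a minimal sketch; the direction of the Kullback-Leibler divergence (teacher as the reference distribution) is an assumption, since the text does not specify it.

```python
import math
from typing import Sequence


def mse_consistency(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two predicted class distributions."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def kl_consistency(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(teacher || student); assumes strictly positive student probabilities."""
    return sum(t * math.log(t / s) for s, t in zip(student, teacher) if t > 0.0)


student_probs = [0.9, 0.1]
teacher_probs = [0.6, 0.4]
print(mse_consistency(student_probs, teacher_probs))  # approximately 0.09
print(kl_consistency(student_probs, teacher_probs))   # positive; 0 only if both agree
```

Both losses vanish when student and teacher agree and grow as their predictions diverge, so either can serve as the consistency term.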
Overlapping data split between labeled and unlabeled samples
While traditionally the dataset is strictly divided into non-overlapping labeled and unlabeled sets, we treat all samples also as unsupervised samples, even if they are in the set of filtered, labeled samples.
This is important since despite the filtering the provided labels can be wrong. By considering them additionally as unsupervised samples, the consistency of the model prediction for a potentially noisy sample is evaluated among many other samples, resulting in more consistent model predictions. Therefore, learning from all samples in an unsupervised fashion provides a stronger regularization effect against label noise.
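The overlapping split can be made concrete with a small loss computation: the supervised term uses only samples whose label survived filtering, while the consistency term uses every sample, labeled or not. The sketch below assumes cross-entropy for the supervised part, mean squared error for the consistency part, and `None` as the marker for a removed label; all names are illustrative.

```python
import math
from typing import Optional, Sequence


def combined_loss(
    student_probs: Sequence[Sequence[float]],
    teacher_probs: Sequence[Sequence[float]],
    labels: Sequence[Optional[int]],
    consistency_weight: float = 1.0,
) -> float:
    """Supervised loss on filtered labels plus consistency loss on ALL samples."""
    # Supervised term: only samples that still carry a label.
    # Assumes the probability of the labeled class is strictly positive.
    supervised_terms = [
        -math.log(probs[label])
        for probs, label in zip(student_probs, labels)
        if label is not None
    ]
    supervised = sum(supervised_terms) / len(supervised_terms) if supervised_terms else 0.0

    # Consistency term: every sample, including the labeled ones.
    consistency_terms = [
        sum((s - t) ** 2 for s, t in zip(ps, pt)) / len(ps)
        for ps, pt in zip(student_probs, teacher_probs)
    ]
    consistency = sum(consistency_terms) / len(consistency_terms)
    return supervised + consistency_weight * consistency


student = [[0.9, 0.1], [0.5, 0.5]]
teacher = [[0.8, 0.2], [0.3, 0.7]]
labels = [0, None]  # the second label was removed by filtering
print(combined_loss(student, teacher, labels))
```

The first sample contributes to both terms, which is the overlap described above; the second contributes only through the consistency term.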
3 Related Works
Different approaches to counter label noise have been proposed in (Azadi et al., 2015; Reed et al., 2014; Ren et al., 2018; Jiang et al., 2017; Jenni & Favaro, 2018). Some of these works (Azadi et al., 2015; Ren et al., 2018) require additional clean training data. Often, the loss for potentially noisy labels is re-weighted softly to push the model away from the wrong label (Jiang et al., 2017; Ren et al., 2018).
Compared to these works, we perform an extreme filtering by setting the sample weight of the potentially wrongly labeled samples to zero. These labels are no longer used for the supervised objective of the task. Moreover, we perform the filtering step very seldom, in contrast to the epoch-wise sample re-weighting of previous approaches. Furthermore, contrary to all previous robust learning approaches, we utilize iterative training combined with semi-supervised learning to combat label noise for the first time.
Despite recent advances in semi-supervised learning (Rasmus et al., 2015; Makhzani et al., 2015; Kingma et al., 2014; Kumar et al., 2017; Springenberg, 2015; Miyato et al., 2018; Dai et al., 2017), it has not been considered as a regularization technique against label noise. Semi-supervised learning often uses generative modeling (Kingma & Welling, 2013; Kingma et al., 2016; Rezende et al., 2014; Goodfellow et al., 2014) as an auxiliary task. In contrast to using generative models, the Mean Teacher model proposed in (Tarvainen & Valpola, 2017) has a more stable training procedure. The Mean Teacher does not require any additional generative model. More details are explained in Section 2.3.
Typically, unsupervised learning is only applied to unlabeled data. In contrast, in our approach, unsupervised learning is applied to all samples to express the uncertainty of the provided labels.
Although previous robust learning approaches such as (Wang et al., 2018) also use iterative training and filtering, their approach does not employ learning from removed samples in an unsupervised fashion. Furthermore, they always filter strictly, i.e., each sample removal decision is final.
In IF-SSL we only filter potentially noisy labels from the original label set, but still use the corresponding instances for unsupervised learning. This gives the model a chance to revert a wrong filtering decision made in earlier iterations.
Further, our framework is intentionally kept simpler and more generic than previous techniques. The focus of our framework is the iterative filtering of noisy labels while learning from all samples in an unsupervised fashion as a form of regularization. This paradigm is hence easily transferable to tasks other than classification.
4.1 Description of Experiments
Dataset description. Classification tasks on CIFAR-10 and CIFAR-100 with uniform noise. Note that the noise on the training and validation set is not correlated. Hence, maximizing the accuracy on the noisy set provides a useful (but noisy) estimate for the generalization ability on unseen test data.
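A uniform label-noise corruption of this kind can be simulated as below. The sketch is an assumption about the protocol rather than the authors' exact procedure: each label is replaced with probability `noise_ratio`, and the replacement is drawn uniformly from the other classes (some works instead draw from all classes, including the true one).

```python
import random
from typing import List, Sequence


def add_uniform_label_noise(
    labels: Sequence[int], num_classes: int, noise_ratio: float, seed: int = 0
) -> List[int]:
    """Flip each label with probability noise_ratio to a different class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    noisy: List[int] = []
    for label in labels:
        if rng.random() < noise_ratio:
            # Assumption: the wrong class is drawn uniformly from the others.
            candidates = [c for c in range(num_classes) if c != label]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(label)
    return noisy


clean = [0, 1, 2, 3, 4] * 4
noisy = add_uniform_label_noise(clean, num_classes=10, noise_ratio=0.4)
flipped = sum(c != n for c, n in zip(clean, noisy))
print(f"{flipped} of {len(clean)} labels were changed")
```

Using independent seeds for the training and validation sets keeps their noise uncorrelated, which is the property the validation-based model selection relies on.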
4.1.2 Comparisons to related works
We compare our framework IF-SSL (Iterative Filtering + Semi-supervised Learning) to previous robust learning approaches such as MentorNet (Jiang et al., 2017), Learned and random sample weights from (Ren et al., 2018), S-Model (Goldberger & Ben-Reuven, 2016), bi-level learning (Jenni & Favaro, 2018), Reed-Hard (Reed et al., 2014) and Iterative learning in open-set problems (Wang et al., 2018).
Hyperparameters and early-stopping are determined on the noisy validation set. This is possible because the noise of the validation and training sets is not correlated. Hence, higher validation performance often results in superior test performance.
Additionally, (Ren et al., 2018) considered the setting of having a small clean validation set of 1000 images. For comparison purposes, we also experiment with a small clean set for early stopping.
Whenever possible, we adopt the performance numbers of these methods from the corresponding publications. Not all numbers are reported in these publications, so some entries are missing.
4.1.3 Network configuration and training
For the basic training of semi-supervised models, we use a Mean Teacher model (Tarvainen & Valpola, 2017) available on GitHub (https://github.com/CuriousAI/mean-teacher). The student and teacher networks are residual networks (He et al., 2016) with 26 layers. They are trained with Shake-Shake regularization (Gastaldi, 2017). We use the PyTorch (Paszke et al., 2017) implementation of the network and keep the training settings close to (Tarvainen & Valpola, 2017). The network is trained with Stochastic Gradient Descent. In each filtering iteration, the model is trained for a maximum of 300 epochs, with a patience of 50 epochs. For more training details, see the appendix.
To filter the noise iteratively, we use the early stopping strategy based on the validation set. After the best model is found, we use it to filter out potentially noisy samples from the original noisy training label set. In the next iteration, the previously best model is fine-tuned on the new dataset. All data is used for unsupervised learning, while supervised learning only considers the filtered label set of the current iteration. We stop the iterative filtering if no better model is found.
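The early-stopping criterion used inside each filtering iteration can be sketched as a scan over per-epoch validation accuracies. The function below is illustrative; in practice the accuracies come from evaluating the model after each epoch rather than from a precomputed list.

```python
from typing import Sequence, Tuple


def early_stopping(val_accuracies: Sequence[float], patience: int) -> Tuple[int, float]:
    """Return (best_epoch, best_accuracy), stopping after `patience`
    consecutive epochs without improvement."""
    best_epoch, best_accuracy = -1, float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy = epoch, accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # training for this filtering iteration ends here
    return best_epoch, best_accuracy


curve = [0.50, 0.60, 0.55, 0.58, 0.70]
print(early_stopping(curve, patience=2))  # (1, 0.6)
print(early_stopping(curve, patience=3))  # (4, 0.7)
```

A larger patience tolerates longer plateaus, which matters when the validation set itself is noisy and its accuracy fluctuates from epoch to epoch.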
4.1.4 Structure of analysis
We start with the analysis of our model’s performance under different noise ratios. We compare our performance to other previously reported approaches in learning under different noise ratios using the accuracy metric on CIFAR-10 and CIFAR-100. The subsequent ablation study highlights the importance of each component in our framework.
Further, we analyze the consequence of applying our iterative filtering scheme to different network architectures. Afterwards, we show the performance of simple unsupervised learning objectives, with and without our iterative filtering scheme. For more experiments, we refer to the supplemental material.
4.2 Robust Learning Performance Evaluation
4.2.1 Model accuracy under label noise
|Noise ratio|40% (CIFAR-10)|80% (CIFAR-10)|40% (CIFAR-100)|80% (CIFAR-100)|
|USING NOISY DATASET ONLY|
|Reed-Hard (Reed et al., 2014)|69.66|-|51.34|-|
|S-model (Goldberger & Ben-Reuven, 2016)|70.64|-|49.10|-|
|(Wang et al., 2018)|78.15|-|-|-|
|Rand. weights (Ren et al., 2018)|86.06|-|58.01|-|
|Bi-level-model (Jenni & Favaro, 2018)|89|20|61|13|
|MentorNet (Jiang et al., 2017)|89|49|68|35|
|USING 1000 CLEAN IMAGES|
|MentorNet (Jiang et al., 2017)*|78|-|59|-|
|Rand. weights (Ren et al., 2018)*|86.55|-|58.34|-|
|Ren et al. (Ren et al., 2018)*|86.92|-|61.31|-|
Results for typical scenarios with a noise ratio of 40% or 80% on CIFAR-10 and CIFAR-100 are shown in Tab. 2. More results are visualized in Fig. 1 (CIFAR-10) and Fig. 5(a) (CIFAR-100). The baseline model is the typical ResNet-26 with an n-way classification loss (negative log-likelihood objective).
Compared to the model baseline and other previously reported approaches, IF-SSL outperforms them by a large margin. Even in areas of high noise ratio up to 80%, the classification performance of our model remains highly robust. Despite the noisy validation set, our model still identifies the noisy labels and filters them out. On CIFAR-10 and CIFAR-100, our model IF-SSL achieves 20% and 7% absolute improvement over previously reported results.
A small clean validation set gives the model an even better estimate of the generalization error on unseen data (IF-SSL*). Due to the iterative filtering scheme, our model always attempts to improve the performance on the validation set as much as possible, without taking gradient steps on it. At the time of convergence, the model always has a loss very close to zero. In contrast, a simple early-stopping scheme, used to prevent over-fitting, usually leaves a high remaining training loss. Our filtering framework indicates that it is meaningful to learn further from easy samples and to treat the other samples as unlabeled. See the appendix for training visualizations.
Previous works utilize strict filtering, where removed samples are not re-considered in later filtering iterations, whereas our iterative filtering always filters based on the originally provided label set. The experiments show the enormous benefit of this. When the filtered samples are completely removed, IF-SSL* (using a clean validation set) achieves only 70.93% at 80% noise. The improvement also stagnates after one single filtering iteration. Hence, for a fair comparison with all filtering baselines, we always use the filtered data as unlabeled samples if not stated otherwise. More details and experiments can be found in the appendix.
4.2.2 Ablation Study
|noise ratio||40%||80 %||40%||80 %|
Tab. 3 indicates the importance of the iterative filtering and semi-supervised learning procedure in our framework. Performing semi-supervised learning (on all samples) or iterative filtering alone leads to similar performances. When combined (IF-SSL without moving-average-predictions), the model is highly robust at 40% noise.
With a higher noise ratio of 80%, however, the model’s predictions on training samples fluctuate strongly. Hence, merely taking the model’s predictions at one specific epoch leads to a sub-optimal filtering step. In contrast, our approach IF-SSL utilizes moving-average predictions, which are significantly more stable. Compared to the baseline IF-SSL without moving-average predictions, this technique leads to 12% and 3.5% absolute improvement on CIFAR-10 and CIFAR-100, respectively.
Naive training or leaving out any of the proposed mechanisms leads to a rapid performance decrease. Our framework combines the strengths of both techniques to form an extremely effective regularizer against learning from label noise.
4.2.3 Iterative filtering with different architectures
Tab. 4 shows the effect of iterative filtering on various architectures. For traditional network training, Resnet26 performs best and slightly better than its shallower counterpart Resnet18. Extremely deep architectures like Resnet101 suffer more from the high-noise ratios.
|Noise ratio||40%||80 %||40%||80 %|
|With Iterative Filtering|
With the proposed iterative filtering, the performance gaps between different models are massively reduced. After iterative filtering, Resnet26 and Resnet18 perform similarly well and provide a very strong baseline. IF-SSL achieves up to 19% absolute improvement over the best Resnet26+IF-baseline at 80% noise ratio.
4.2.4 Semi-supervised learning techniques + iterative filtering
|Noise ratio|40%|80%|
|Mean Teacher (all samples)|90.4|52.5|
|With Iterative Filtering|
|Entropy (all samples) + IF|90.4|52.46|
|Entropy (unlabeled samples) + IF|90.02|53.44|
|Mean Teacher + IF (ours)|93.7|69.91|
Tab. 5 shows different semi-supervised learning strategies with and without iterative filtering. The push-away-loss corresponds to assigning negative weights to potentially noisy labels. The entropy loss minimizes the network’s uncertainty on a set of samples. Since our labels are all potentially noisy, it is meaningful to apply this loss to all training samples instead of removed samples only. Hence we compare both variants. The Mean-teacher loss is always applied to all samples (details in the appendix).
Without filtering: Learning from the entropy loss performs second-best when the uncertainty is minimized on all samples. Without a previous filtering step, there is no set of unlabeled samples on which to perform traditional semi-supervised learning. The Mean Teacher performs best since the teacher represents a stable model state, aggregated over multiple iterations.
With filtering: Applying entropy-loss to all samples or only unsupervised samples leads to very similar performance. Both are better than the standard push-away-loss. Our Mean Teacher achieves by far the best performance, due to the temporal ensemble of models and sample predictions for filtering.
In this work, we propose a training pipeline for robust learning. Our method relies on two key components: (1) iterative filtering of potentially noisy labels, and (2) regularization by learning from all raw data samples in an unsupervised fashion.
We have shown that neither iterative noise filtering (IF) nor semi-supervised learning (SSL) alone is sufficient to achieve competitive performance. Instead, we combine IF and SSL and extend them with crucial novel components for more robust learning.
Unlike previous filtering approaches, we always filter the initial label set provided at the beginning. Furthermore, we utilize a temporal ensemble of model predictions as the basis for the filtering step.
The proposed algorithm is evaluated on classification tasks for CIFAR-10 and CIFAR-100 with a varying label noise ratio from 0% to 80%. We show results both for a clean validation set and a noisy one. In both cases, we show that using the filtered data as unlabeled samples significantly outperforms complete removal of the data. As a consequence, the proposed model consistently outperforms state of the art at all levels of label noise. Despite the simplicity of the training pipeline, our approach shows robust performance even in case of high noise ratios.
- Azadi et al. (2015) Azadi, S., Feng, J., Jegelka, S., and Darrell, T. Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069, 2015.
- Dai et al. (2017) Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pp. 6510–6520, 2017.
- Gastaldi (2017) Gastaldi, X. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- Goldberger & Ben-Reuven (2016) Goldberger, J. and Ben-Reuven, E. Training deep neural-networks using a noise adaptation layer. 2016.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Jenni & Favaro (2018) Jenni, S. and Favaro, P. Deep bilevel learning. In ECCV, 2018.
- Jiang et al. (2017) Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. arXiv:1712.05055 [cs], December 2017. URL http://arxiv.org/abs/1712.05055. arXiv: 1712.05055.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589, 2014.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
- Kumar et al. (2017) Kumar, A., Sattigeri, P., and Fletcher, T. Semi-supervised learning with gans: manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pp. 5534–5544, 2017.
- Loshchilov & Hutter (2016) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- Miyato et al. (2018) Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
- Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
- Reed et al. (2014) Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
- Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to Reweight Examples for Robust Deep Learning. arXiv:1803.09050 [cs, stat], March 2018. URL http://arxiv.org/abs/1803.09050. arXiv: 1803.09050.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Springenberg (2015) Springenberg, J. T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
- Sutskever et al. (2013) Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. On the importance of initialization and momentum in deep learning. ICML (3), 28(1139-1147):5, 2013.
- Tarvainen & Valpola (2017) Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204, 2017.
- Wang et al. (2018) Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., and Xia, S.-T. Iterative learning with open-set noisy labels. arXiv preprint arXiv:1804.00092, 2018.
- Xie et al. (2017) Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.
Appendix A Large-scale classification on ImageNet-ILSVRC-2015
Tab. 7 shows the precision@1 and precision@5 of various models, given 40% label noise in the training set. Our networks are based on ResNext18 and ResNext50. Note that MentorNet (Jiang et al., 2017) uses ResNet101 (P@1: 78.25) (Goyal et al., 2017), which has similar performance compared to ResNext50 (P@1: 77.8) (Xie et al., 2017) on the standard ImageNet validation set. Although ResNext50 is a weaker model, we opt for the ResNext counterparts because of the significantly shorter training time. Hence, our performance reported with ResNext50 is a lower bound of our approach with ResNet101. The results with ResNext18 and ResNext50 indicate that stronger models result in higher accuracy in our framework.
Despite the weaker model, IF-SSL (ResNext50) surpasses the best previously reported results by more than 5% absolute improvement. Even the significantly weaker model ResNext18 outperforms MentorNet based on a very powerful ResNet101 network.
Appendix B Complete removal of samples
|Noise ratio||40%||80 %||40%||80 %|
|Using noisy data only|
|With clean validation set|
Tab. 7 shows the results of deleting samples from the training set. It leads to large performance gaps compared to our strategy (IF-SSL), which considers the removed samples as unlabeled data. In case of a considerable label noise of 80%, the gap is close to 9%.
Continuously using the filtered samples leads to significantly better results. The unsupervised loss provides meaningful learning signals, which should be used for better model training.
Appendix C Training process
Fig. 7 shows sample training processes of IF-SSL under 60% and 80% noise on CIFAR-100. The mean teacher always outperforms the student model. Further, note that regular training leads to rapid over-fitting to label noise.
In contrast, with our effective filtering strategy, both models slowly increase their performance while the training accuracy approaches 100%. Hence, by using iterative filtering, our model can erase the inconsistencies in the provided label set.
Appendix D Training details
D.1 CIFAR-10 and CIFAR-100
To train our model IF-SSL, we use the standard configuration provided by Tarvainen & Valpola (2017) (https://github.com/CuriousAI/mean-teacher). Concretely, we use the SGD optimizer with Nesterov momentum (Sutskever et al., 2013), a learning rate of 0.05 with cosine learning rate annealing (Loshchilov & Hutter, 2016), a weight decay of 2e-4, a maximum of 300 iterations per filtering step, a patience of 50 epochs, and a total of 600 epochs.
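As a rough illustration of the cosine learning rate annealing mentioned above, the following sketch computes the annealed rate starting from 0.05. The function name `cosine_lr`, annealing down to zero, and the choice of 300 steps as the schedule length are assumptions made for illustration; the schedule actually used is whatever the referenced mean-teacher configuration implements.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.05) -> float:
    """Cosine annealing without restarts (assumed form).

    Returns base_lr at step 0 and decays smoothly to 0 at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


if __name__ == "__main__":
    # Print the learning rate at a few points of a hypothetical 300-step schedule.
    for step in (0, 75, 150, 225, 300):
        print(step, cosine_lr(step, total_steps=300))
```

The rate is halved at the midpoint of the schedule and changes most slowly near the start and the end.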
For the basic training of baseline models without semi-supervised learning, we had to set the learning rate to 0.01, since the loss typically explodes with higher learning rates. All other options are kept the same.
For the mean teacher training, additional hyperparameters are required. For both CIFAR-10 and CIFAR-100, we again take the standard configuration: the consistency loss is set to mean-squared-error with a consistency weight of 100.0, a logit distance cost of 0.01, and a consistency ramp-up of 5. The total batch size is 512, with 124 samples reserved for labeled data and 388 for unlabeled data. Each epoch is defined as one complete pass over all unlabeled data. When training without semi-supervised learning, the entire batch is used for labeled data.
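To make the role of the consistency weight and ramp-up more concrete, the sketch below scales the consistency weight from a small value up to 100.0 over the first 5 epochs. The sigmoid-shaped ramp `exp(-5 (1 - t)^2)` is an assumption (it is a common choice in mean-teacher style training); the text only fixes the maximum weight, the ramp-up length, and the batch composition. All names are hypothetical.

```python
import math

# Batch composition stated in the text: 512 samples in total.
LABELED_PER_BATCH = 124
UNLABELED_PER_BATCH = 388
BATCH_SIZE = LABELED_PER_BATCH + UNLABELED_PER_BATCH


def sigmoid_rampup(epoch: float, rampup_length: float) -> float:
    """Assumed ramp shape: rises from exp(-5) at epoch 0 to 1 at rampup_length."""
    if rampup_length == 0:
        return 1.0
    current = min(max(epoch, 0.0), rampup_length)
    phase = 1.0 - current / rampup_length
    return math.exp(-5.0 * phase * phase)


def consistency_weight(epoch: float, max_weight: float = 100.0,
                       rampup_epochs: float = 5.0) -> float:
    """Weight applied to the consistency (mean-squared-error) loss term."""
    return max_weight * sigmoid_rampup(epoch, rampup_epochs)


if __name__ == "__main__":
    print("batch size:", BATCH_SIZE)
    for epoch in range(7):
        print(epoch, consistency_weight(epoch))
```

With such a ramp, the supervised loss dominates early training, and the consistency term reaches its full weight once the teacher's predictions have had time to stabilize.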
The data are normalized to zero mean and unit variance. Further, we use real-time data augmentation with random translation and reflection, followed by a random horizontal flip. The standard PyTorch library provides these transformations.
The networks used for training and evaluation were ResNet (He et al., 2016) and ResNext (Xie et al., 2017). All ResNext variants use a cardinality of 32 and a base width of 4 (32x4d). The ResNext models follow the same structure as their ResNet counterparts, except for the cardinality and base width.
For ImageNet, due to the large images, the batch size is set to 40 in total, with 20 labeled and 20 unlabeled samples. We found that using the Kullback-Leibler divergence as the consistency loss leads to no meaningful network training. Hence, we set the consistency loss to mean-squared-error with a weight of 1000. We use a consistency ramp-up of 5 epochs to give the mean teacher more time in the beginning. Weight decay is set to 5e-5, and a patience of four epochs is used to stop training in the current filtering iteration.
We filter noisy samples with the topk=5 strategy instead of topk=1 as on CIFAR-10 and CIFAR-100. That is, a sample is kept for supervised training if its provided label lies within the top 5 predictions of the model. The main reason is that each ImageNet image might contain multiple objects. Filtering with topk=1 is too strict and would lead to a low recall when detecting correctly labeled samples.
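The top-k filtering rule described above can be sketched as follows. This is a minimal illustration on plain Python lists; the function names `keep_label` and `filter_labels` are hypothetical, and the per-class scores stand in for whatever model predictions are used for filtering.

```python
from typing import List, Sequence


def keep_label(scores: Sequence[float], label: int, topk: int = 1) -> bool:
    """Return True if the provided label is among the topk highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:topk]


def filter_labels(all_scores: Sequence[Sequence[float]],
                  labels: Sequence[int], topk: int = 1) -> List[bool]:
    """Per-sample mask: True keeps the label for supervised training,
    False drops the label so the sample is treated as unlabeled."""
    return [keep_label(s, y, topk) for s, y in zip(all_scores, labels)]


if __name__ == "__main__":
    scores = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
    labels = [1, 2]
    print(filter_labels(scores, labels, topk=1))
    print(filter_labels(scores, labels, topk=2))
```

In this toy example both labels are rejected under topk=1 but kept under topk=2, which mirrors why a looser k can raise recall when an image plausibly belongs to several classes.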
For all data, we normalize the RGB images by the mean (0.485, 0.456, 0.406) and the standard deviation (0.229, 0.224, 0.225). For the training data, we perform random rotation of up to 10 degrees, randomly resize images to 224x224, and apply random horizontal flip and random color jittering. This noise is needed in regular mean-teacher training. The jittering settings are: brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1. The validation data are resized to 256x256 and randomly cropped to 224x224.
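For reference, the per-channel normalization above amounts to the following computation on a single RGB pixel. This sketch assumes pixel values already scaled to [0, 1]; the function name `normalize_pixel` is hypothetical, and in practice a library transform would apply the same arithmetic to whole image tensors.

```python
from typing import Tuple

# Per-channel statistics stated in the text (R, G, B).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Tuple[float, float, float]) -> Tuple[float, ...]:
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((value - m) / s for value, m, s in zip(rgb, MEAN, STD))


if __name__ == "__main__":
    print(normalize_pixel((0.485, 0.456, 0.406)))  # a pixel equal to the mean
    print(normalize_pixel((1.0, 1.0, 1.0)))        # a pure white pixel
```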
Appendix E Losses
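For the learning of wrongly labeled samples, Fig. 8 shows the relationship between the typical reweighting scheme and our baseline push-away-loss. Typically, reweighting is applied directly to the losses with a sample weight $w_i^{(t)}$ for each sample $i$, as shown in Eq. 4:
$$\min_\theta \frac{1}{|D|} \sum_{(x_i, y_i) \in D} w_i^{(t)} \, \mathrm{NLL}(y_i \mid x_i; \theta) \qquad (4)$$
$D$ is the dataset, and $x_i$ and $y_i$ are a sample and its noisy label. $w_i^{(t)}$ is the sample weight for sample $i$ at step $t$. Negative sample weights are often assigned to push the network away from the wrong labels. Let $w_i^{(t)} = -c_i^{(t)}$ with $c_i^{(t)} > 0$; then we have:
$$\min_\theta \; -\frac{1}{|D|} \sum_{(x_i, y_i) \in D} c_i^{(t)} \, \mathrm{NLL}(y_i \mid x_i; \theta)$$
which results in:
$$\max_\theta \; \frac{1}{|D|} \sum_{(x_i, y_i) \in D} c_i^{(t)} \, \mathrm{NLL}(y_i \mid x_i; \theta)$$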
In other words, we perform gradient ascent for wrongly labeled samples. However, the negative log-likelihood is not designed for gradient ascent. Hence, the gradients of wrongly labeled samples vanish if the prediction is too close to the noisy label. This effect is similar to the training of Generative Adversarial Networks (GANs) (Goodfellow et al., ). In the GAN framework, the generator loss is not simply set to the negated discriminator loss for the same reason.
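The vanishing-gradient effect can be checked numerically. The sketch below assumes a softmax classifier, for which the gradient of the negative log-likelihood with respect to the logits is the predicted distribution minus the one-hot label; gradient ascent uses the same gradient with flipped sign, so its magnitude is identical. The logit values and function names are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_grad(logits: Sequence[float], label: int) -> List[float]:
    """Gradient of -log softmax(logits)[label] w.r.t. the logits: p - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


if __name__ == "__main__":
    noisy_label = 0
    fitted = [10.0, 0.0, 0.0]    # model has already fit the noisy label
    undecided = [0.0, 0.0, 0.0]  # model is still undecided
    print("fitted:   ", norm(nll_grad(fitted, noisy_label)))
    print("undecided:", norm(nll_grad(undecided, noisy_label)))
```

When the model has already fit the noisy label, the gradient norm is close to zero, so ascending the negative log-likelihood provides almost no signal in exactly the case where pushing away matters most.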
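Therefore, to provide a fair comparison with our framework, we suggest the push-away-loss with improved gradients as follows, for a sample $x_i$ with noisy label $y_i$ and weight magnitude $c_i^{(t)} > 0$:
$$\min_\theta \frac{1}{|Y| - 1} \sum_{y \in Y,\, y \neq y_i} c_i^{(t)} \, \mathrm{NLL}(y \mid x_i; \theta)$$
Here, $Y$ is the set of all classes in the training set. This loss has improved gradients to push the model away from the potentially wrong labels.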
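To illustrate why a push-away style loss keeps useful gradients, the sketch below assumes one plausible form: the negative log-likelihood averaged over all classes other than the provided label, with the sample weight omitted. This form, the function names, and the logit values are assumptions for illustration and may differ from the exact loss used in the experiments.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def push_away_loss(logits: Sequence[float], noisy_label: int) -> float:
    """Assumed form: mean NLL over every class except the noisy label."""
    probs = softmax(logits)
    others = [k for k in range(len(logits)) if k != noisy_label]
    return sum(-math.log(probs[k]) for k in others) / len(others)


def push_away_grad(logits: Sequence[float], noisy_label: int) -> List[float]:
    """Gradient w.r.t. the logits: p_k for the noisy label,
    p_k - 1/(K-1) for every other class k."""
    probs = softmax(logits)
    n_other = len(logits) - 1
    return [p if k == noisy_label else p - 1.0 / n_other
            for k, p in enumerate(probs)]


if __name__ == "__main__":
    fitted = [10.0, 0.0, 0.0]  # model has already fit the noisy label 0
    grad = push_away_grad(fitted, noisy_label=0)
    print("loss:", push_away_loss(fitted, 0))
    print("grad norm:", math.sqrt(sum(g * g for g in grad)))
```

Under this assumed form, a gradient descent step lowers the logit of the noisy label and raises the others, and the gradient stays large even when the model is confident in the noisy label, unlike the negated negative log-likelihood.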