1 Introduction
In many supervised learning applications, a clean labeled dataset is the key to success. However, in real-world scenarios, label noise inevitably originates from different sources, such as inconsistent labelers or the difficulty of the labeling task itself. In many classification tasks, for example, samples that cannot be squeezed into a strict categorical scheme will lead to inconsistent labels.
With traditional supervised learning, label noise decreases the performance of classification models since they tend to overfit to the samples with noisy labels. This results in lower accuracy and inferior generalization properties. To avoid the negative influence of noisy labels, a common approach is to use sample-dependent loss weights as learning regularizers (Jiang et al., 2017; Ren et al., 2018). However, the performance of these mechanisms strongly depends on the respective hyperparameters, which are difficult to set.
Typically, the loss weights are restricted to [0, 1] by design to resemble a probability of a noisy label given a sample. In a supervised learning framework, however, even with tiny loss weights, the model could still receive a strong learning signal from noisy samples (as, e.g., in (Ren et al., 2018)). The ideal case is to assign zero weights to samples with noisy labels, which, however, implies ignoring those samples and results in a smaller training dataset.
In this paper, instead of training in a supervised framework, we learn from the samples with noisy labels in an unsupervised way. Since only the labels are noisy and not the input data, semi-supervised learning can still exploit the raw data samples. By keeping those samples rather than removing them from training, our proposed method can be stricter when it comes to removing potentially noisy labels.
In more detail, we propose a learning scheme consisting of (1) iterative filtering of noisy labels and (2) semi-supervised learning to regularize the problem in the vicinity of noisy samples. Fig. 2 shows a simplified overview of the concept. We refer to the proposed training procedure as Iterative Filtering with Semi-Supervised Learning (IFSSL). To the best of our knowledge, we propose the first filtering approach that removes only the noisy labels instead of the complete data samples. Our approach requires no new dedicated mechanism for robust learning and utilizes only existing standard components for learning.
The proposed algorithm was evaluated on classification tasks for CIFAR10 and CIFAR100 with a label noise ratio varying from 0% to 80%. We show results both for a clean validation set and a noisy one. In both cases, we show that using the filtered data as unlabeled samples significantly outperforms complete removal of the data. As a consequence, the proposed model consistently outperforms the state of the art at all levels of label noise; see Fig. 1. Despite the simplicity of the training pipeline, our approach shows robust performance even in the case of high noise ratios. The source code will be made available together with the published paper.
2 Robust Learning with Iterative Noise Filtering
2.1 Overview
Fig. 2 shows an overview of our proposed approach. In the beginning, we assume that the labels of the training set are noisy (up to a certain noise ratio). We use a small validation set to measure the improvement in model performance. In each iteration, we first apply semi-supervised training until we find the best model w.r.t. the performance on the validation set (e.g., by early stopping). In the next step, we use the moving-average predictions of the best model to filter out potentially noisy labels based on the strategy defined in Section 2.2. In the next iteration, we again use all data and the newly filtered label set as input for the model training. The iterative training procedure stops when no better model can be found. Our filtering pipeline only requires standard components for training deep learning models.
To provide a powerful regularizer against label noise, the semi-supervised model treats all data points as additional unlabeled samples. Concretely, in the first iteration, the model learns from supervised and unsupervised learning objectives on the complete dataset. Subsequently, the unsupervised learning objective continuously derives learning signals from all data points while the supervised learning objective is computed only on a filtered set of labeled samples. Over these iterations, the label noise in the training set is expected to decrease.
In the following, we give more details about the combination of this training and filtering procedure with existing techniques from semi-supervised learning.
2.2 Iterative Filtering
Let us start with an initial noisy training dataset and a validation set. Each example carries either one of the class labels or a special label that marks the unlabeled/noisy case. In the current training epoch, the model maps each example to a set of scores, one per class. We measure the accuracy of the model on the validation set. The training procedure is explained in detail in Section 2.3.
Using these notations, the label filtering algorithm is given in Algorithm 1.
The label filtering is always performed on the original label set from the first iteration. In this way, clean labels erroneously removed in an earlier iteration (e.g., labels of hard-to-classify samples) can be used for the model training again. This is a major difference to typical iterative filtering approaches, where the filtering at a given iteration is restricted to the training samples of the respective iteration only.
We apply a variant of easy sample mining and filter out training samples based on the model’s agreement with the provided label. That is, a label is only used for supervised training if, in the current epoch, the model predicts the respective label to be the correct class with the highest likelihood. This is reflected in lines 12 to 14 of Algorithm 1.
The model’s predictions required for filtering can be stored directly during training. However, the predictions for noisy samples tend to fluctuate. For example, take a cat wrongly labeled as a tiger. Other cat samples would encourage the model to predict the given cat image as a cat. Conversely, the wrong label tiger regularly pulls the model back to predicting the cat as a tiger. Hence, using the model’s predictions gathered in one single training epoch for filtering is suboptimal.
Instead, we propose to collect the sample predictions over multiple training epochs. This scheme is displayed in Fig. 3. For each sample, we store the moving-average predictions, accumulated over the last iterations. Besides providing a more stable basis for the filtering step, our proposed procedure adds only negligible memory and computation overhead.
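The filtering step described above can be illustrated with a minimal sketch. It assumes an exponential moving average with a hypothetical smoothing factor `alpha` (the exact averaging window is not specified here), and it marks a removed label with `None`; the function names are illustrative and not taken from the released code.

```python
def update_moving_average(avg, new_probs, alpha=0.9):
    """Blend this epoch's class probabilities into the running average of one sample."""
    if avg is None:  # first epoch: nothing to average yet
        return list(new_probs)
    return [alpha * a + (1.0 - alpha) * p for a, p in zip(avg, new_probs)]


def filter_labels(original_labels, avg_predictions):
    """Keep a label only if the averaged prediction's top class agrees with it."""
    filtered = []
    for label, probs in zip(original_labels, avg_predictions):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        filtered.append(label if predicted == label else None)
    return filtered


# Toy example: class 0 = cat, class 1 = tiger; a cat image wrongly labeled as tiger.
# The per-epoch predictions fluctuate, but their average favors "cat".
avg = None
for epoch_probs in ([0.8, 0.2], [0.3, 0.7], [0.9, 0.1]):
    avg = update_moving_average(avg, epoch_probs, alpha=0.5)

kept = filter_labels([1, 0], [avg, [0.7, 0.3]])
```

In this toy run, the second epoch alone would have agreed with the wrong label, whereas the averaged prediction does not, so the label of the first sample is removed and the sample is kept as unlabeled data.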
Because training continues from the best model of the previous iteration, computation time can be significantly reduced compared to retraining the model from scratch. On the newly filtered dataset, the model only has to adapt slowly to the new noise ratio contained in the training set. Depending on the computation budget, a maximal number of filtering iterations can be set to save time.
Moreover, the new training procedure does not require specific mechanisms or algorithms that need to be implemented or fine-tuned. Implementation-wise, it can be realized by looping the standard training procedure and filtering potentially noisy samples at the end of each training run.
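As a rough sketch of this loop, the outer procedure could look as follows. The callables `train_until_early_stop` and `filter_labels` are hypothetical placeholders for the standard training run and the filtering rule; their signatures are assumptions made for illustration only.

```python
def iterative_filtering(original_labels, train_until_early_stop, filter_labels,
                        max_iterations=10):
    """Loop training and filtering until the validation accuracy stops improving.

    train_until_early_stop(model, labels) -> (model, val_accuracy, avg_predictions)
    filter_labels(original_labels, avg_predictions) -> labels with None for removed ones
    """
    labels = list(original_labels)
    best_model, best_accuracy = None, float("-inf")
    model = None
    for _ in range(max_iterations):
        # Continue training from the previous best model on the current label set.
        model, val_accuracy, avg_predictions = train_until_early_stop(model, labels)
        if val_accuracy <= best_accuracy:
            break  # no better model found: stop the procedure
        best_model, best_accuracy = model, val_accuracy
        # Always filter the ORIGINAL labels, so earlier removals can be reverted.
        labels = filter_labels(original_labels, avg_predictions)
    return best_model, labels, best_accuracy
```

Passing the original labels to the filter in every round, rather than the previously filtered ones, is what allows a wrongly removed label to return in a later iteration.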
2.3 Unsupervised learning to counter label noise
Although the proposed learning procedure is not restricted to classification tasks, in this work, we explain the procedure for classification as a use case.
Model training is performed using two types of learning objectives: (1) supervised and (2) unsupervised losses. Supervised learning from samples with noisy labels is straightforward and can be done with typical n-way classification losses. The unsupervised learning objective, however, requires a design choice of which data to use (defined in Section 2.2) and how to learn from them.
2.3.1 Learning from unlabeled data
We learn from all data points in a semi-supervised fashion. Concretely, in addition to supervised learning with filtered labels, unsupervised learning is applied to the entire dataset. Our learning strategy can take advantage of unsupervised learning from a large dataset, and therefore it has a potentially large regularization effect against label noise. Unsupervised learning objectives impose additional constraints on all samples, which are hard to satisfy for wrongly labeled samples. These constraints could be a preference for extreme predictions (entropy loss) or non-fluctuating model predictions over many past iterations (Mean Teacher loss). Both constraints are explained in the following.
Entropy minimization
The typical entropy loss for semi-supervised learning is shown in Fig. 8. It encourages the model to provide extreme predictions (such as 0 or 1) for each sample. Over a large number of samples, the model should balance its predictions over all classes.
The entropy loss can easily be applied to all samples to express the uncertainty about the provided labels. Alternatively, the loss can be combined with a strict filtering strategy, as in our work, which removes the labels of potentially wrongly labeled samples.
For a large noise ratio, predictions of wrongly labeled samples fluctuate strongly over previous training iterations. Amplifying these network decisions could lead to even noisier models. Combined with iterative filtering, the framework would have to rely on a single noisy model snapshot. In the case of an unsuitable snapshot, the filtering step will make many wrong decisions.
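For illustration, the per-sample part of such an entropy objective can be sketched as below. This is a minimal version that only covers the preference for extreme predictions; it does not include any term that balances predictions across classes, and the small constant `eps` is an assumption added for numerical safety.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy of one predicted class distribution (natural logarithm)."""
    return -sum(p * math.log(p + eps) for p in probs)


def entropy_loss(batch_probs):
    """Mean prediction entropy over a batch; lower means more confident predictions."""
    return sum(entropy(p) for p in batch_probs) / len(batch_probs)


confident = entropy([1.0, 0.0])   # extreme prediction, entropy close to zero
uncertain = entropy([0.5, 0.5])   # maximally uncertain for two classes, close to ln(2)
```

Minimizing this quantity sharpens whatever the current model predicts, which is why it can amplify wrong decisions when the underlying snapshot is noisy.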
Mean Teacher model
A better way to perform semi-supervised learning and counteract label noise is to employ the Mean Teacher model (Tarvainen & Valpola, 2017). The Mean Teacher model follows the student-teacher learning procedure from (Hinton et al., 2015). The main idea is to create a virtuous learning cycle, in which the student continually learns to surpass the (better) teacher. Concretely, the Mean Teacher is an exponential moving average of the student models over training iterations.
In contrast to learning from the entropy loss, the Mean Teacher solves precisely the problem of noisy model snapshots. The teacher model is a moving average over the past training iterations and hence much more stable than a single snapshot. The training of such a model is shown in Fig. 5.
Mean Teacher model for iterative filtering
Given the setting in Section 2.2, we apply the Mean Teacher algorithm in each iteration of the procedure as follows.
- Input: examples with potentially clean labels from the filtering procedure. In the first iteration, these are all training examples with their original labels.
- Initialize a supervised neural network as the student model.
- Initialize the Mean Teacher model as a copy of the student model with all weights detached.
- Let the loss function be the sum of the normal classification loss of the student and the consistency loss between the outputs of the student and the teacher.
- Select an optimizer.
- In each training iteration:
  - Update the weights of the student using the selected optimizer.
  - Update the weights of the teacher as an exponential moving average of the student weights.
  - Evaluate the performance of the student and the teacher on the validation set to verify the early stopping criteria.
- Return the best model.
The consistency loss between the output distributions of the student and the teacher can be realized with the mean squared error or the Kullback-Leibler divergence.
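The teacher update and the two consistency options mentioned above can be sketched on plain Python lists. The decay value, the direction of the KL divergence (teacher relative to student), and the function names are assumptions for illustration; the referenced Mean Teacher implementation may differ in these details.

```python
import math


def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step towards the current student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


def mse_consistency(student_probs, teacher_probs):
    """Mean squared error between the two output distributions."""
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


def kl_consistency(student_probs, teacher_probs, eps=1e-12):
    """KL divergence of the teacher distribution from the student distribution."""
    return sum(t * math.log((t + eps) / (s + eps))
               for s, t in zip(student_probs, teacher_probs))


teacher = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
```

Because the teacher changes only by a small fraction of the student's update per step, its predictions fluctuate far less than those of any single student snapshot.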
Overlapping data split between labeled and unlabeled samples
While the dataset is traditionally divided strictly into non-overlapping labeled and unlabeled sets, we also treat all samples as unsupervised samples, even if they are in the set of filtered, labeled samples.
This is important since despite the filtering the provided labels can be wrong. By considering them additionally as unsupervised samples, the consistency of the model prediction for a potentially noisy sample is evaluated among many other samples, resulting in more consistent model predictions. Therefore, learning from all samples in an unsupervised fashion provides a stronger regularization effect against label noise.
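A minimal sketch of this overlapping split is given below: the supervised term is computed only for samples whose label survived filtering, while the consistency term is computed for every sample. Removed labels are marked with `None`, and the weighting factor `consistency_weight` is a hypothetical parameter introduced for illustration.

```python
import math


def combined_loss(student_probs, teacher_probs, labels,
                  consistency_weight=1.0, eps=1e-12):
    """Return (total, supervised, consistency) for one batch."""
    # Supervised negative log-likelihood: only samples that still carry a label.
    sup_terms = [-math.log(p[y] + eps)
                 for p, y in zip(student_probs, labels) if y is not None]
    supervised = sum(sup_terms) / len(sup_terms) if sup_terms else 0.0
    # Consistency (mean squared error): ALL samples, labeled or not.
    cons_terms = [sum((s - t) ** 2 for s, t in zip(sp, tp)) / len(sp)
                  for sp, tp in zip(student_probs, teacher_probs)]
    consistency = sum(cons_terms) / len(cons_terms)
    return supervised + consistency_weight * consistency, supervised, consistency


student = [[0.8, 0.2], [0.4, 0.6]]
teacher = [[0.6, 0.4], [0.4, 0.6]]
total, supervised, consistency = combined_loss(student, teacher, labels=[0, None])
```

Here the second sample contributes nothing to the supervised term but still takes part in the consistency term, so its raw input continues to regularize the model.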
3 Related Works
Different approaches to counter label noise have been proposed in (Azadi et al., 2015; Reed et al., 2014; Ren et al., 2018; Jiang et al., 2017; Jenni & Favaro, 2018). Some of these works (Azadi et al., 2015; Ren et al., 2018) require additional clean training data. Often, the loss for potentially noisy labels is reweighted softly to push the model away from the wrong label (Jiang et al., 2017; Ren et al., 2018).
Compared to these works, we perform an extreme filtering by setting the sample weight of the potentially wrongly labeled samples to zero. These labels are no longer used for the supervised objective of the task. Moreover, we perform the filtering step very seldom, in contrast to the epoch-wise sample reweighting of previous approaches. Furthermore, contrary to all previous robust learning approaches, we utilize iterative training combined with semi-supervised learning to combat label noise for the first time.
Despite recent advances in semi-supervised learning (Rasmus et al., 2015; Makhzani et al., 2015; Kingma et al., 2014; Kumar et al., 2017; Springenberg, 2015; Miyato et al., 2018; Dai et al., 2017), it has not been considered as a regularization technique against label noise. Semi-supervised learning often uses generative modeling (Kingma & Welling, 2013; Kingma et al., 2016; Rezende et al., 2014; Goodfellow et al., 2014) as an auxiliary task. In contrast to using generative models, the Mean Teacher model proposed in (Tarvainen & Valpola, 2017) has a more stable training procedure. The Mean Teacher does not require any additional generative model. More details are explained in Section 2.3.
Typically, unsupervised learning is only applied to unlabeled data. In contrast, in our approach, unsupervised learning is applied to all samples to express the uncertainty of the provided labels.
Although previous robust learning approaches such as (Wang et al., 2018) also use iterative training and filtering, their approach does not employ learning from removed samples in an unsupervised fashion. Furthermore, they always filter strictly, i.e., each sample removal decision is final.
In IFSSL, we only filter potentially noisy labels from the original label set but still use the corresponding instances for unsupervised learning. This gives the model a chance to revert a wrong filtering decision made in earlier iterations.
Further, our framework is intentionally kept simpler and more generic than previous techniques. The focus of our framework is the iterative filtering of noisy labels while learning from all samples in an unsupervised fashion as a form of regularization. This paradigm is hence easily transferable to tasks other than classification.
4 Evaluation
4.1 Description of Experiments
4.1.1 Tasks
Type  CIFAR10  CIFAR100
Task  10-way classification  100-way classification
Resolution  32x32  32x32
Train (noisy)  45000  45000
Valid (noisy)  5000  5000
Test (clean)  10000  10000
Dataset description. Classification tasks on CIFAR10 and CIFAR100 with uniform noise. Note that the noise on the training and validation set is not correlated. Hence, maximizing the accuracy on the noisy set provides a useful (but noisy) estimate for the generalization ability on unseen test data.
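One common way to simulate uniform label noise is sketched below. It assumes that each label is replaced, with probability equal to the noise ratio, by a class drawn uniformly from all classes (so a replaced label may coincide with the true one); the exact corruption protocol used in the experiments may differ, and the function name is illustrative.

```python
import random


def add_uniform_label_noise(labels, num_classes, noise_ratio, seed=0):
    """Return a copy of labels where a noise_ratio fraction (in expectation) is resampled."""
    rng = random.Random(seed)  # fixed seed so that the corruption is reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_ratio:
            noisy.append(rng.randrange(num_classes))  # uniform over all classes
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(1000)]
noisy_train = add_uniform_label_noise(clean, num_classes=10, noise_ratio=0.4, seed=1)
noisy_valid = add_uniform_label_noise(clean, num_classes=10, noise_ratio=0.4, seed=2)
```

Using independent random draws for the training and validation splits is what keeps their noise uncorrelated, which is the property the validation-based model selection relies on.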
4.1.2 Comparisons to related works
We compare our framework IFSSL (Iterative Filtering + Semi-supervised Learning) to previous robust learning approaches such as MentorNet (Jiang et al., 2017), learned and random sample weights from (Ren et al., 2018), S-Model (Goldberger & BenReuven, 2016), bilevel learning (Jenni & Favaro, 2018), Reed-Hard (Reed et al., 2014) and iterative learning in open-set problems (Wang et al., 2018).
Hyperparameters and early stopping are determined on the noisy validation set. This is possible because the noise of the validation and training sets is not correlated. Hence, higher validation performance often results in superior test performance.
Additionally, (Ren et al., 2018) considered the setting of having a small clean validation set of 1000 images. For comparison purposes, we also experiment with a small clean set for early stopping.
Whenever possible, we adopt the performances of their methods from the corresponding publications. Sometimes, not all numbers are reported in these publications.
4.1.3 Network configuration and training
For the basic training of semi-supervised models, we use a Mean Teacher model (Tarvainen & Valpola, 2017) available on GitHub (https://github.com/CuriousAI/meanteacher). The student and teacher networks are residual networks (He et al., 2016) with 26 layers. They are trained with Shake-Shake regularization (Gastaldi, 2017). We use the PyTorch (Paszke et al., 2017) implementation of the network and keep the training settings close to (Tarvainen & Valpola, 2017). The network is trained with stochastic gradient descent. In each filtering iteration, the model is trained for a maximum of 300 epochs, with a patience of 50 epochs. For more training details, see the appendix.
To filter the noise iteratively, we use the early stopping strategy based on the validation set. After the best model is found, we use it to filter out potentially noisy samples from the original noisy training label set. In the next iteration, the previously best model is fine-tuned on the new dataset. All data is used for unsupervised learning, while supervised learning only considers the filtered label set at the current iteration. We stop the iterative filtering if no better model is found.
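The early stopping rule used inside each filtering iteration can be sketched as follows. The callable `run_epoch` is a hypothetical stand-in for one epoch of training followed by validation; the defaults of 300 epochs and a patience of 50 follow the training details in the appendix.

```python
def train_with_early_stopping(run_epoch, max_epochs=300, patience=50):
    """Train until the validation accuracy has not improved for `patience` epochs.

    run_epoch(epoch) -> validation accuracy after that epoch.
    Returns (best_epoch, best_accuracy).
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        accuracy = run_epoch(epoch)
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted: keep the best model seen so far
    return best_epoch, best_accuracy


# Toy validation curve with a peak at epoch 1.
curve = [0.1, 0.5, 0.4, 0.3, 0.2, 0.6]
seen = []
def toy_epoch(epoch):
    seen.append(epoch)
    return curve[epoch]

best_epoch, best_accuracy = train_with_early_stopping(toy_epoch, max_epochs=6, patience=3)
```

With a patience of 3, the toy run stops after epoch 4 and never reaches the later peak, which illustrates why the patience has to be chosen generously under strongly fluctuating validation accuracy.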
4.1.4 Structure of analysis
We start with the analysis of our model’s performance under different noise ratios. We compare our performance to other previously reported approaches in learning under different noise ratios using the accuracy metric on CIFAR10 and CIFAR100. The subsequent ablation study highlights the importance of each component in our framework.
Further, we analyze the consequence of applying our iterative filtering scheme to different network architectures. Afterwards, we show the performance of simple unsupervised learning objectives, with and without our iterative filtering scheme. For more experiments, we refer to the supplemental material.
4.2 Robust Learning Performance Evaluation
4.2.1 Model accuracy under label noise
Noise ratio  CIFAR10 40%  CIFAR10 80%  CIFAR100 40%  CIFAR100 80%
USING NOISY DATASET ONLY
Reed-Hard (Reed et al., 2014)  69.66  -  51.34  -
S-Model (Goldberger & BenReuven, 2016)  70.64  -  49.10  -
(Wang et al., 2018)  78.15  -  -  -
Rand. weights (Ren et al., 2018)  86.06  -  58.01  -
Bilevel model (Jenni & Favaro, 2018)  89  20  61  13
MentorNet (Jiang et al., 2017)  89  49  68  35
ResNet26 baseline  83.2  41.37  53.18  19.12
(Ours) IF+SSL  93.7  69.91  71.98  42.09
USING 1000 CLEAN IMAGES
MentorNet (Jiang et al., 2017)*  78  -  59  -
Rand. weights (Ren et al., 2018)*  86.55  -  58.34  -
Ren et al. (Ren et al., 2018)*  86.92  -  61.31  -
(Ours) IF+SSL*  95.1  79.93  74.76  46.43
Results for typical scenarios with a noise ratio of 40% or 80% on CIFAR10 and CIFAR100 are shown in Tab. 2. More results are visualized in Fig. 1 (CIFAR10) and Fig. 5(a) (CIFAR100). The baseline model is the typical ResNet26 with an n-way classification loss (negative log-likelihood objective).
Compared to the baseline model and other previously reported approaches, IFSSL performs better by a large margin. Even at high noise ratios of up to 80%, the classification performance of our model remains highly robust. Despite the noisy validation set, our model still identifies the noisy labels and filters them out. At 80% noise, IFSSL achieves about 20% and 7% absolute improvement over the best previously reported results on CIFAR10 and CIFAR100, respectively.
A small clean validation set gives the model an even better estimate of the generalization error on unseen data (IFSSL*). Due to the iterative filtering scheme, our model always attempts to improve the performance on the validation set as much as possible, without doing gradient steps on it. At the time of convergence, the model always has a training loss very close to zero. In contrast, a simple early stopping scheme that prevents overfitting usually leads to a high remaining training loss. Our filtering framework indicates that it is meaningful to learn further from easy samples and to treat the other samples as unlabeled. See the appendix for training visualizations.
Previous works utilize strict filtering, where removed samples are not reconsidered in later filtering iterations, whereas our iterative filtering always filters based on the originally provided label set. The experiments show the enormous benefit of this. With a clean validation set, IFSSL* only achieves 70.93% at 80% noise when the samples are completely removed. The improvement also stagnates after one single filtering iteration. Hence, for a fair comparison with all filtering baselines, we always use the filtered data as unlabeled samples if not stated otherwise. More details and experiments can be found in the appendix.
4.2.2 Ablation Study
Noise ratio  CIFAR10 40%  CIFAR10 80%  CIFAR100 40%  CIFAR100 80%
(Ours) IF+SSL  93.7  69.91  71.98  42.09
IF+SSL without moving-average predictions  93.77  57.4  71.69  38.61
SSL only  93.7  52.5  65.85  26.31
IF  87.35  49.58  61.4  23.42
ResNet26  83.2  41.37  53.18  19.92
Tab. 3 indicates the importance of the iterative filtering and semi-supervised learning procedure in our framework. Performing semi-supervised learning (on all samples) or iterative filtering alone leads to similar performances. When combined (IFSSL without moving-average predictions), the model is highly robust at 40% noise.
With a higher noise ratio of 80%, however, the model’s predictions on training samples fluctuate strongly. Hence, merely taking the model’s predictions at one specific epoch leads to a suboptimal filtering step. In contrast, IFSSL utilizes moving-average predictions, which are significantly more stable. Compared to IFSSL without moving-average predictions, this technique leads to about 12% and 3.5% absolute improvement at 80% noise on CIFAR10 and CIFAR100, respectively.
Naive training or leaving out any of the proposed mechanisms leads to a rapid decrease in performance. Our framework combines the strengths of both techniques to form an extremely effective regularizer against learning from label noise.
4.2.3 Iterative filtering with different architectures
Tab. 4 shows the effect of iterative filtering on various architectures. For traditional network training, ResNet26 performs best, ahead of its shallower counterpart ResNet18. Extremely deep architectures like ResNet101 suffer more from high noise ratios.
Noise ratio  CIFAR10 40%  CIFAR10 80%  CIFAR100 40%  CIFAR100 80%
ResNet18  75.03  34.9  43.34  5.34
ResNet26  83.2  41.37  53.18  19.92
ResNet101  68.14  32.5  36.02  13.24
With Iterative Filtering
ResNet18+IF  85.75  42.84  57.86  21.27
ResNet26+IF  87.35  49.58  61.4  23.42
ResNet101+IF  82.46  33.2  47.11  6.50
(Ours) IF+SSL  93.7  69.91  71.98  42.09
With the proposed iterative filtering, the performance gaps between different models are massively reduced. After iterative filtering, ResNet26 and ResNet18 perform similarly well and provide a very strong baseline. IFSSL achieves up to 19% absolute improvement over the best ResNet26+IF baseline at 80% noise ratio.
4.2.4 Semisupervised learning techniques + iterative filtering
Noise ratio  CIFAR10 40%  CIFAR10 80%
ResNet26  83.2  41.37
Entropy (all samples)  85.98  46.93
Mean Teacher (all samples)  90.4  52.5
With Iterative Filtering
Push-away + IF  90.47  50.79
Entropy (all samples) + IF  90.4  52.46
Entropy (unlabeled samples) + IF  90.02  53.44
Mean Teacher + IF (ours)  93.7  69.91
Tab. 5 shows different semi-supervised learning strategies with and without iterative filtering. The push-away loss corresponds to assigning negative weights to potentially noisy labels. The entropy loss minimizes the network’s uncertainty on a set of samples. Since our labels are all potentially noisy, it is meaningful to apply this loss to all training samples instead of removed samples only. Hence, we compare both variants. The Mean Teacher loss is always applied to all samples (details in the appendix).
Without filtering: Learning from the entropy loss performs second-best when the uncertainty is minimized on all samples. Without a previous filtering step, there is no set of unlabeled samples on which to perform traditional semi-supervised learning. The Mean Teacher performs best since the teacher represents a stable model state, aggregated over multiple iterations.
With filtering: Applying the entropy loss to all samples or only to unlabeled samples leads to very similar performance. Both variants perform on par with or slightly better than the standard push-away loss. Our Mean Teacher achieves by far the best performance, due to the temporal ensemble of models and sample predictions for filtering.
5 Conclusion
In this work, we propose a training pipeline for robust learning. Our method relies on two key components: (1) iterative filtering of potentially noisy labels, and (2) regularization by learning from all raw data samples in an unsupervised fashion.
We have shown that neither iterative noise filtering (IF) nor semi-supervised learning (SSL) alone is sufficient to achieve competitive performance. Instead, we combine IF and SSL and extend them with crucial novel components for more robust learning.
Unlike previous filtering approaches, we always filter the initial label set provided at the beginning. Furthermore, we utilize a temporal ensemble of model predictions as the basis for the filtering step.
The proposed algorithm is evaluated on classification tasks for CIFAR10 and CIFAR100 with a label noise ratio varying from 0% to 80%. We show results both for a clean validation set and a noisy one. In both cases, we show that using the filtered data as unlabeled samples significantly outperforms complete removal of the data. As a consequence, the proposed model consistently outperforms the state of the art at all levels of label noise. Despite the simplicity of the training pipeline, our approach shows robust performance even in the case of high noise ratios.
References
 Azadi et al. (2015) Azadi, S., Feng, J., Jegelka, S., and Darrell, T. Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069, 2015.
 Dai et al. (2017) Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. Good semisupervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pp. 6510–6520, 2017.
 Gastaldi (2017) Gastaldi, X. Shakeshake regularization. arXiv preprint arXiv:1705.07485, 2017.
 Goldberger & BenReuven (2016) Goldberger, J. and BenReuven, E. Training deep neuralnetworks using a noise adaptation layer. 2016.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Jenni & Favaro (2018) Jenni, S. and Favaro, P. Deep bilevel learning. In ECCV, 2018.
 Jiang et al. (2017) Jiang, L., Zhou, Z., Leung, T., Li, L.J., and FeiFei, L. MentorNet: Learning DataDriven Curriculum for Very Deep Neural Networks on Corrupted Labels. arXiv:1712.05055 [cs], December 2017. URL http://arxiv.org/abs/1712.05055. arXiv: 1712.05055.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semisupervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Kumar et al. (2017) Kumar, A., Sattigeri, P., and Fletcher, T. Semisupervised learning with gans: manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pp. 5534–5544, 2017.
 Loshchilov & Hutter (2016) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Miyato et al. (2018) Miyato, T., Maeda, S.i., Ishii, S., and Koyama, M. Virtual adversarial training: a regularization method for supervised and semisupervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
 Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semisupervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
 Reed et al. (2014) Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to Reweight Examples for Robust Deep Learning. arXiv:1803.09050 [cs, stat], March 2018. URL http://arxiv.org/abs/1803.09050. arXiv: 1803.09050.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Springenberg (2015) Springenberg, J. T. Unsupervised and semisupervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
 Sutskever et al. (2013) Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. On the importance of initialization and momentum in deep learning. ICML (3), 28(11391147):5, 2013.
 Tarvainen & Valpola (2017) Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weightaveraged consistency targets improve semisupervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204, 2017.
 Wang et al. (2018) Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., and Xia, S.T. Iterative learning with openset noisy labels. arXiv preprint arXiv:1804.00092, 2018.
 Xie et al. (2017) Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.
Appendix A Large-scale classification on ImageNet-ILSVRC-2015
Accuracy  ResNeXt18 P@1  ResNeXt18 P@5  ResNeXt50 P@1  ResNeXt50 P@5
MentorNet*  -  -  65.10  85.90
Basic model  50.6  75.99  56.25  80.90
SSL  58.04  81.82  62.96  85.72
IFSSL (Ours)  66.92  86.65  71.31  89.92
Tab. 6 shows the precision@1 and precision@5 of various models, given 40% label noise in the training set. Our networks are based on ResNeXt18 and ResNeXt50. Note that MentorNet (Jiang et al., 2017) uses ResNet101 (P@1: 78.25) (Goyal et al., 2017), which has similar performance to ResNeXt50 (P@1: 77.8) (Xie et al., 2017) on the standard ImageNet validation set. Although ResNeXt50 is a weaker model, we opt for the ResNeXt counterparts because of the significantly shorter training time. Hence, our performance reported with ResNeXt50 is a lower bound of our approach with ResNet101. The results with ResNeXt18 and ResNeXt50 indicate that stronger models result in higher accuracy in our framework.
Despite the weaker model, IFSSL (ResNeXt50) surpasses the best previously reported results by more than 5% absolute improvement. Even the significantly weaker ResNeXt18 model outperforms MentorNet, which is based on a very powerful ResNet101 network.
Appendix B Complete removal of samples
                            CIFAR-10          CIFAR-100
Noise ratio                 40%     80%       40%     80%
Using noisy data only
  Compl. Removal            93.4    59.98     68.99   35.53
  IFSSL (Ours)              93.7    69.91     71.98   42.09
With clean validation set
  Compl. Removal            94.39   70.93     71.86   36.61
  IFSSL (Ours)              95.1    79.93     74.76   46.43
Tab. 7 shows the results of deleting the filtered samples from the training set entirely. This leads to large performance gaps compared to our strategy (IFSSL), which treats the removed samples as unlabeled data. Under considerable label noise of 80%, the gap is close to 9 percentage points (e.g., 79.93 vs. 70.93 on CIFAR-10 with a clean validation set).
Continuing to use the filtered samples leads to significantly better results. The unsupervised loss provides meaningful learning signals, which should be used for better model training.
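The difference between the two strategies compared above can be sketched in a few lines. This is only an illustration: the function name `split_after_filtering` and the boolean flag list `is_noisy` are hypothetical and do not come from the original implementation.

```python
def split_after_filtering(is_noisy, keep_as_unlabeled=True):
    """Split sample indices into a labeled and an unlabeled pool.

    is_noisy[i] is True if sample i was flagged as having a noisy label.
    keep_as_unlabeled=True  -> only the label is dropped (IFSSL-style).
    keep_as_unlabeled=False -> the whole sample is dropped (complete removal).
    """
    labeled = [i for i, noisy in enumerate(is_noisy) if not noisy]
    if keep_as_unlabeled:
        unlabeled = [i for i, noisy in enumerate(is_noisy) if noisy]
    else:
        unlabeled = []  # flagged samples disappear from training
    return labeled, unlabeled
```

With complete removal the training set shrinks by the number of flagged samples, whereas keeping them as unlabeled data leaves the number of training inputs unchanged.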
Appendix C Training process
Fig. 7 shows sample training processes of IFSSL under 60% and 80% noise on CIFAR-100. The mean teacher always outperforms the student model. Further, note that regular training leads to rapid overfitting to the label noise.
In contrast, with our effective filtering strategy, both models slowly increase their performance while the training accuracy approaches 100%. Hence, by using iterative filtering, our model is able to remove the inconsistency in the provided label set.
Appendix D Training details
D.1 CIFAR-10 and CIFAR-100
Network training
To train our IFSSL model, we use the standard configuration provided by Tarvainen & Valpola (2017) (footnote 2: https://github.com/CuriousAI/meanteacher). Concretely, we use the SGD optimizer with Nesterov momentum (Sutskever et al., 2013), a learning rate of 0.05 with cosine learning rate annealing (Loshchilov & Hutter, 2016), a weight decay of 2e-4, a maximum of 300 iterations per filtering step, a patience of 50 epochs, and a total of 600 epochs.
For the basic training of baseline models without semi-supervised learning, we had to set the learning rate to 0.01, since the loss typically explodes with higher learning rates. Every other option is kept the same.
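The cosine annealing mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `cosine_annealed_lr` is hypothetical, and annealing to zero over the full 600 epochs is an assumption, since the text only states the initial learning rate and the total epoch count.

```python
import math

def cosine_annealed_lr(epoch: int, total_epochs: int = 600, base_lr: float = 0.05) -> float:
    """Half-cosine decay from base_lr at epoch 0 down to 0 at total_epochs.

    The schedule length and the final value of 0 are assumptions.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

Under these assumptions the learning rate starts at 0.05, passes through half of that value at the midpoint, and decays smoothly towards zero at the end of training.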
Semi-supervised learning
For the mean teacher training, additional hyperparameters are required. For both CIFAR-10 and CIFAR-100, we again take the standard configuration, with the consistency loss set to mean squared error, a consistency weight of 100.0, a logit distance cost of 0.01, and a consistency ramp-up of 5 epochs. The total batch size is 512, with 124 slots reserved for labeled samples and 388 for unlabeled data. Each epoch is defined as one complete pass over all unlabeled data. When training without semi-supervised learning, the entire batch is used for labeled data.
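To make the role of these hyperparameters concrete, the following sketch shows how a consistency weight of 100.0 with a 5-epoch ramp-up could scale a mean-squared-error consistency term. The exponential ramp-up shape is an assumption (it is a form commonly used with mean teacher training, but the text does not specify it), and all function names are hypothetical.

```python
import math

LABELED_PER_BATCH = 124    # from the text
UNLABELED_PER_BATCH = 388  # from the text; 124 + 388 = 512

def sigmoid_rampup(epoch: float, rampup_epochs: float = 5.0) -> float:
    """Assumed ramp-up: exp(-5 * (1 - t)^2) with t clipped to [0, 1]."""
    if rampup_epochs == 0:
        return 1.0
    t = min(max(epoch, 0.0), rampup_epochs) / rampup_epochs
    return math.exp(-5.0 * (1.0 - t) ** 2)

def consistency_weight(epoch: float, max_weight: float = 100.0,
                       rampup_epochs: float = 5.0) -> float:
    """Weight applied to the consistency loss at a given epoch."""
    return max_weight * sigmoid_rampup(epoch, rampup_epochs)

def mse_consistency(student_probs, teacher_probs) -> float:
    """Mean squared error between student and teacher class probabilities."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("probability vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(squared) / len(squared)
```

In this sketch the consistency term contributes very little during the first epochs and reaches its full weight of 100.0 once the ramp-up has finished.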
Data augmentation
The data are normalized to zero mean and unit variance. Further, we use real-time data augmentation with random translation and reflection, followed by a random horizontal flip. These transformations are provided by the standard PyTorch library.
D.2 ImageNet-ILSVRC-2015
Network Training
Semi-supervised learning
Due to the large images, the total batch size is set to 40, with 20 labeled and 20 unlabeled samples. We found that using the Kullback-Leibler divergence leads to no meaningful network training. Hence, we set the consistency loss to mean squared error, with a weight of 1000. We use a consistency ramp-up of 5 epochs to give the mean teacher more time in the beginning. The weight decay is set to 5e-5, and a patience of four epochs is used to stop training in the current filtering iteration.
Filtering
We filter noisy samples with the top-k=5 strategy, instead of top-k=1 as on CIFAR-10 and CIFAR-100. That is, a sample is kept for supervised training if its provided label lies within the top 5 predictions of the model. The main reason is that each ImageNet image might contain multiple objects. Filtering with top-k=1 is too strict and would lead to a low recall in detecting correctly labeled samples.
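The top-k criterion described above can be sketched with NumPy. This is an illustrative sketch, not the original implementation: the function name `topk_filter` and the array `probs` (one row of hypothetical model class probabilities per sample) are assumptions.

```python
import numpy as np

def topk_filter(probs: np.ndarray, labels: np.ndarray, k: int = 5) -> np.ndarray:
    """Return a boolean mask that is True where the provided label
    is among the k most probable classes of the model prediction.

    probs:  array of shape (n_samples, n_classes)
    labels: integer array of shape (n_samples,)
    """
    if not 1 <= k <= probs.shape[1]:
        raise ValueError("k must lie in [1, n_classes]")
    # Indices of the k largest entries per row (order inside the top k is irrelevant).
    topk = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    return (topk == labels[:, None]).any(axis=1)
```

Samples where the mask is True keep their label for supervised training; for the remaining samples only the label is discarded, so they can still be used as unlabeled data. Setting `k=1` reproduces the stricter criterion used on CIFAR-10 and CIFAR-100.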
Data Augmentation
For all data, we normalize the RGB images with the mean (0.485, 0.456, 0.406) and the standard deviation (0.229, 0.224, 0.225). For training data, we perform random rotation of up to 10 degrees, randomly resize images to 224x224, and apply a random horizontal flip and random color jittering. This noise is needed in regular mean teacher training. The jittering settings are: brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1. The validation data are resized to 256x256 and randomly cropped to 224x224.
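The per-channel normalization with the statistics quoted above can be written out explicitly. The sketch below assumes images stored as height x width x 3 arrays with values already scaled to [0, 1]; the function name `normalize_rgb` is hypothetical.

```python
import numpy as np

# Per-channel statistics quoted in the text (R, G, B).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_rgb(image: np.ndarray) -> np.ndarray:
    """Subtract the channel mean and divide by the channel standard deviation.

    Assumes an H x W x 3 array with values in [0, 1].
    """
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError("expected an H x W x 3 image")
    return (image.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD
```

The mean and standard deviation broadcast over the last axis, so each color channel is normalized independently.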
Appendix E Losses
For the learning of wrongly labeled samples, Fig. 8 shows the relationship between the typical reweighting scheme and our baseline push-away loss. Typically, reweighting is applied directly to the losses, with a sample weight $w_i^{(k)}$ for each sample, as shown in Eq. 1:
$$\min \sum_{(x_i, \tilde{y}_i) \in \mathcal{D}} w_i^{(k)} \, \mathrm{NLL}(\tilde{y}_i \mid x_i) \qquad (1)$$
Here, $\mathcal{D}$ is the dataset, $x_i$ and $\tilde{y}_i$ are a sample and its noisy label, and $w_i^{(k)}$ is the weight of sample $i$ at step $k$. Negative sample weights are often assigned to push the network away from the wrong labels. Let $w_i^{(k)} = -c_i^{(k)}$ with $c_i^{(k)} > 0$; then, for a wrongly labeled sample, we have:
$$\min \; -c_i^{(k)} \, \mathrm{NLL}(\tilde{y}_i \mid x_i) \qquad (2)$$
which results in:
$$\max \; c_i^{(k)} \, \mathrm{NLL}(\tilde{y}_i \mid x_i) \qquad (3)$$
In other words, we perform gradient ascent for wrongly labeled samples. However, the negative log-likelihood (NLL) is not designed for gradient ascent. Hence, the gradients of wrongly labeled samples vanish if the prediction is too close to the noisy label. This effect is similar to the training of Generative Adversarial Networks (GANs) (Goodfellow et al., ). In the GAN framework, the generator loss is not simply set to the negated version of the discriminator's loss for the same reason.
Therefore, to provide a fair comparison with our framework, we suggest the push-away loss with improved gradients as follows:
$$\min \; c_i^{(k)} \, \frac{1}{|\mathcal{Y}| - 1} \sum_{y \in \mathcal{Y},\, y \neq \tilde{y}_i} \mathrm{NLL}(y \mid x_i) \qquad (4)$$
where $\mathcal{Y}$ is the set of all classes in the training set. This loss has improved gradients to push the model away from the potentially wrong labels.
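The vanishing-gradient argument can be checked numerically. The sketch below assumes a softmax classifier and assumes that the push-away loss is the average NLL over all classes other than the noisy label; both function names are hypothetical. It compares the gradients with respect to the logits for a prediction that is already very confident in the noisy label.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def grad_negated_nll(logits: np.ndarray, noisy_label: int) -> np.ndarray:
    """Gradient w.r.t. the logits of -NLL(noisy_label | x), i.e. gradient ascent on the NLL."""
    p = softmax(logits)
    onehot = np.eye(len(logits))[noisy_label]
    return onehot - p  # the NLL gradient is p - onehot; negating the loss flips the sign

def grad_push_away(logits: np.ndarray, noisy_label: int) -> np.ndarray:
    """Gradient w.r.t. the logits of the mean NLL over all classes except noisy_label."""
    p = softmax(logits)
    n_classes = len(logits)
    target = np.full(n_classes, 1.0 / (n_classes - 1))
    target[noisy_label] = 0.0  # uniform target over the remaining classes
    return p - target

# A prediction that is almost certain about the (wrong) label 0.
confident_logits = np.array([10.0, 0.0, 0.0])
ascent_norm = np.linalg.norm(grad_negated_nll(confident_logits, 0))
push_norm = np.linalg.norm(grad_push_away(confident_logits, 0))
```

For the confident prediction, the negated-NLL gradient is close to zero, while the push-away gradient keeps a norm on the order of one, so it still moves the prediction away from the noisy label.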