Transform consistency for learning with noisy labels

03/25/2021 · Rumeng Yi, et al. · Beijing Jiaotong University

It is crucial to distinguish mislabeled samples when dealing with noisy labels. Previous methods such as Co-teaching and JoCoR introduce two different networks to select clean samples out of the noisy ones and only use these clean ones to train the deep models. Different from these methods, which require training two networks simultaneously, we propose a simple and effective method to identify clean samples using only a single network. We observe that clean samples tend to reach consistent predictions for the original images and the transformed images, while noisy samples usually suffer from inconsistent predictions. Motivated by this observation, we constrain the transform consistency between the original images and the transformed images during network training, and then select small-loss samples to update the parameters of the network. Furthermore, in order to mitigate the negative influence of noisy labels, we design a classification loss that combines off-line hard labels and on-line soft labels to provide more reliable supervision for training a robust model. We conduct comprehensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets. Compared with the baselines, we achieve state-of-the-art performance, and in most cases our proposed method outperforms the baselines by a large margin.


I Introduction

Deep Neural Networks (DNNs) have achieved remarkable success on many computer vision tasks thanks to large-scale datasets with reliable and clean annotations [chen2018encoder] [girshick2015fast] [he2016deep]. However, collecting such datasets with precise annotations is expensive and time-consuming. There are two alternative ways to alleviate this issue: crowd-sourcing from non-experts and online queries via search engines. Unfortunately, the obtained annotations inevitably contain noisy labels. Since DNNs have the capacity to memorize all training samples, they will eventually overfit the noisy labels, leading to poor generalization performance [tanaka2018joint] [zhang2016understanding].

Fig. 1: KL Divergence between two inputs (original and horizontally flipped images) on CIFAR-10 under (a) symmetric 50% and (b) asymmetric 40% label noise, for clean and noisy samples.

In order to reduce the effect of noisy labels, an effective strategy is to select clean samples out of the noisy ones and then only use these clean ones to update the parameters of the network. Following this research line, many studies have shown impressive progress, among which Decoupling [malach2017decoupling], Co-teaching [han2018co] and Co-teaching+ [yu2019does] are three representative methods. All of them train two networks simultaneously, but they differ in how samples are selected: Decoupling and Co-teaching+ select training samples depending on the disagreement between the two networks, while in Co-teaching, each network views its small-loss instances as clean ones and teaches them to its peer network. Recently, the state-of-the-art method JoCoR [wei2020combating] achieves excellent performance by applying an agreement maximization principle: it trains two networks with a joint loss (including the conventional supervised loss and a co-regularization loss) and uses the joint loss to select small-loss instances.

In this paper, different from the above methods, which require training two networks simultaneously, we propose a simple and effective method to distinguish clean samples using only a single network. We find that the prediction consistency under different image transforms (such as scaling, rotation, flipping) within one network is beneficial for selecting clean samples. As shown in Figure 1, we feed the original and transformed (horizontally flipped) images into one single network and observe the Kullback-Leibler (KL) Divergence of clean and noisy samples on the CIFAR-10 dataset with 50% symmetric noise and 40% asymmetric noise. From Figure 1, we can see that the KL Divergence of clean samples is much smaller than that of noisy samples under both noise levels. This suggests that the network can reach an agreement on the original and transformed images of clean samples during training, whereas for noisy samples the KL Divergence remains large. This observation motivates us to distinguish clean and noisy samples by considering the KL Divergence between the original images and the transformed images.
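To make the probe behind Figure 1 concrete, the following minimal PyTorch sketch (not the authors' code) computes a per-sample symmetric KL divergence between a model's predictions on the original images and their horizontally flipped versions; `model` and a batch of `images` in NCHW layout are assumed to exist.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(p, q, eps=1e-8):
    # Per-sample symmetric KL divergence between two batches of
    # softmax distributions of shape (batch, num_classes).
    kl_pq = (p * ((p + eps).log() - (q + eps).log())).sum(dim=1)
    kl_qp = (q * ((q + eps).log() - (p + eps).log())).sum(dim=1)
    return kl_pq + kl_qp

@torch.no_grad()
def consistency_gap(model, images):
    # images: a batch in NCHW layout; dims=[3] flips the width axis.
    p = F.softmax(model(images), dim=1)                        # original view
    q = F.softmax(model(torch.flip(images, dims=[3])), dim=1)  # horizontally flipped view
    return symmetric_kl(p, q)  # clean samples tend to give small values
```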

Based on the above observation, we propose a novel approach to identify mislabeled samples. Specifically, we follow the agreement maximization principle to train a network with a joint loss, including the classification losses and the KL loss between two inputs (the original images and the transformed images). Moreover, we adopt a self-ensembling method to construct a teacher model, whose predictions are used as on-line soft labels. We then combine off-line hard labels with on-line soft labels to provide robust supervisory information for training and further alleviate the effect of noise.

Intuitively, our proposed method can be regarded as adding perturbations to the training samples. We observe that such perturbations have a stronger effect on noisy samples than on clean samples, making clean and noisy samples easier to distinguish. Furthermore, our method only requires training one single network, which is more suitable for real applications.

In summary, our contributions in this paper are three-fold:

  • We propose a simple and effective approach for learning from noisy labels, which leverages the prediction consistency of images under different spatial transforms (such as scaling, rotation and flipping) within one single network to select clean samples efficiently.

  • We design a classification loss that combines off-line hard labels and on-line soft labels to provide reliable supervision for network training.

  • We conduct comprehensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets and achieve state-of-the-art performance. In most cases, such as Symmetry-50%, Symmetry-80% and Asymmetry-40% on CIFAR-10, and Symmetry-20%, Symmetry-50% and Symmetry-80% on CIFAR-100, our method exceeds the second best method by 3.70%-12.78%.

II Related Works

II-A Learning with Noisy Labels

The techniques for alleviating the effect of noisy labels can be divided into two categories: (1) detecting noisy labels and then cleansing them or reducing their impact; (2) directly training noise-robust models with noisy labels.

One solution for detecting and cleansing noisy labels is to select clean samples out of the noisy ones and then use them to update the network. For example, Decoupling [malach2017decoupling], Co-teaching [han2018co], Co-teaching+ [yu2019does] and the recent state-of-the-art JoCoR [wei2020combating] train two networks with "agreement" or "disagreement" strategies to select clean samples for training. O2U-Net [huang2019o2u] adjusts the hyperparameters of the network and ranks the normalized average loss of every sample to detect noisy samples. MentorNet [jiang2018mentornet] learns a data-driven curriculum to supervise the training of a student network. A second line of work is based on re-weighting or relabeling. Han et al. [han2019deep] propose a deep self-learning framework that relabels noisy samples by extracting multiple prototypes for each category. Ren et al. [ren2018learning] weight each sample in the loss function based on its gradient direction compared with those on a clean set. Arazo et al. [arazo2019unsupervised] model the per-sample loss distribution with a beta mixture model and correct the loss by relying on the network prediction. In a similar spirit, Li et al. [li2020dividemix] use two networks to perform sample selection via a Gaussian mixture model and further apply the semi-supervised learning technique MixMatch [berthelot2019mixmatch] to handle noisy labels.

For the second category, Xiao et al. [xiao2015learning] model the relationships between images, class labels and label noise with a probabilistic graphical model and integrate it into an end-to-end deep learning model. Guo et al. [guo2018curriculumnet] propose CurriculumNet, where the training data are divided into several subsets by ranking their complexity via distribution density, and the subsets then form a curriculum that teaches the model to understand noisy labels gradually. Li et al. [li2019learning] propose a meta-learning based method that avoids overfitting to specific noise by generating synthetic noisy labels.

Motivated by the above studies, we propose a simple and effective method to select clean samples out of the noisy ones by observing the category distributions and visual attention maps of the original images and the transformed images in one single network. This also yields an aggregated inference that combines the predictions from different spatial transforms to improve classification accuracy.

II-B Image Transformation

Human visual perception shows good consistency under certain spatial transforms, such as scaling, rotation and flipping, which has motivated the data augmentation strategies widely used in DNNs. Gidaris et al. [gidaris2018unsupervised] learn image representations by training convolutional networks to recognize the geometric transformation applied to the input image. Guo et al. [guo2019visual] propose a two-branch network that takes original images and their transformed versions as inputs, and introduce an attention consistency loss that measures the consistency of attention heatmaps between the two branches, achieving a new state of the art for multi-label image classification. Lee et al. [lee20_sla] augment the original labels via self-supervision of input transformations to learn a single unified task with respect to the joint distribution of the original and self-supervised labels.

In this paper, we propose the first usage of transform consistency for noisy-labeled image classification. Specifically, we regard the transformation applied to the images as an added disturbance. For clean samples, the network's predictions are not affected by this disturbance, resulting in consistent predictions between the two inputs, while for noisy samples, the predictions may be inconsistent or oscillate strongly, which allows us to distinguish the clean samples from the noisy ones.

Fig. 2: Overview of the proposed approach. An original image and its transformed image are both fed into one single network, which is trained with a joint loss, i.e., two classification losses ($L_{cls}^{1}$ and $L_{cls}^{2}$) and one KL loss ($L_{kl}$). We then select the samples with a small joint loss to update the model's parameters $\theta$. Besides, we use the exponential moving average (EMA) to construct a teacher model, whose predictions on the two inputs are used as the on-line soft labels for the soft classification losses.

II-C Self-ensemble Learning

Self-ensembling has been widely studied in semi-supervised learning. These methods form a consensus prediction of the unknown labels using the outputs of the network-in-training at different epochs. For example, temporal ensembling [laine2016temporal] maintains an exponential moving average (EMA) of the label predictions for each training example and penalizes predictions that are inconsistent with this target. Mean teacher [tarvainen2017mean] averages model weights over training steps to form a target-generating teacher model. Recently, the self-ensembling strategy has been used for noisy-labeled image classification. MLNT [li2019learning] designs a noise-tolerant algorithm that constructs a teacher model to give more reliable predictions unaffected by synthetic noise. SELF [nguyen2019self] identifies clean labels progressively using self-forming ensembles of models and predictions.

To effectively mitigate the negative influence of noisy labels, we introduce the self-ensembling method to construct a teacher model using the exponential moving average (EMA) of the model snapshots at each training iteration. The predictions of the teacher model are then used as on-line soft labels, which provide a more stable supervisory signal than the noisy hard labels to guide the model's training. Note that this teacher model does not require additional training.

In this section, we give the details of our proposed method. Our approach aims at identifying potentially noisy labels during training and ensuring that the network receives supervision mainly from the clean labels. The overall framework is illustrated in Figure 2. Specifically, an original image and its transformed image are both fed into one single network, which is trained with a joint loss, i.e., two classification losses and one KL loss. Based on our observation that the model keeps consistent predictions on the two inputs of clean samples and inconsistent predictions on noisy samples, we select the samples with small joint loss to update the model's parameters. In order to further alleviate the negative influence of label noise, apart from the off-line hard labels, our framework also incorporates on-line soft pseudo labels generated by a self-ensembling strategy into the training process.

Problem statement. Image classification can be formulated with a training set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an image and $y_i$ is the one-hot label over $C$ classes, which may contain noise. Let $f(x;\theta)$ denote the model's output softmax probability, parameterized by $\theta$.

Classification loss. In our framework, the two inputs of the network are $x_i$ and $T(x_i)$, respectively, where $T(\cdot)$ represents the transformation operation such as scaling, rotation or flipping. Their label confidences can be predicted as $p_i^{1} = f(x_i;\theta)$ and $p_i^{2} = f(T(x_i);\theta)$. In general, the objective function is the empirical risk of the cross-entropy loss, which is formulated as:

$L_{hard}^{1} = -\frac{1}{N}\sum_{i=1}^{N} y_i^{\top} \log p_i^{1}$,   (1)
$L_{hard}^{2} = -\frac{1}{N}\sum_{i=1}^{N} y_i^{\top} \log p_i^{2}$,   (2)

where $N$ is the total number of samples. However, since $y_i$ contains noise, the model would overfit the noisy labels, resulting in poor classification performance.

To address this problem, we construct a teacher model parameterized by $\theta'$ following the self-ensembling method to generate reliable soft pseudo labels for supervising the model's training. Specifically, at the current iteration $t$, we update $\theta'$ with

$\theta_{t}' = \alpha\,\theta_{t-1}' + (1-\alpha)\,\theta_{t}$,   (3)

where $\alpha$ is a smoothing coefficient hyperparameter within the range [0,1), and $\theta_{t-1}'$ indicates the parameters of the teacher model at the previous iteration $t-1$. Then the soft classification losses for optimizing $\theta$ with the soft pseudo labels generated by the teacher model can be formulated as:

$L_{soft}^{1} = -\frac{1}{N}\sum_{i=1}^{N} \big(f(x_i;\theta')\big)^{\top} \log p_i^{1}$,   (4)
$L_{soft}^{2} = -\frac{1}{N}\sum_{i=1}^{N} \big(f(T(x_i);\theta')\big)^{\top} \log p_i^{2}$.   (5)

Therefore, the classification losses of $x_i$ and $T(x_i)$ are given by:

$L_{cls}^{1} = L_{hard}^{1} + L_{soft}^{1}$,   (6)
$L_{cls}^{2} = L_{hard}^{2} + L_{soft}^{2}$.   (7)
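As an illustration, a minimal PyTorch sketch of the teacher update and the resulting per-sample classification loss is given below; the function names, the momentum default and the unweighted sum of the hard and soft terms are assumptions of this sketch, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    # Eq. (3): theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def classification_loss(logits, labels, teacher_logits):
    # Per-sample hard-label cross-entropy (Eq. (1)/(2)) plus a soft term that
    # uses the teacher's prediction as an on-line soft label (Eq. (4)/(5)),
    # summed as in Eq. (6)/(7).
    l_hard = F.cross_entropy(logits, labels, reduction='none')
    soft_targets = F.softmax(teacher_logits.detach(), dim=1)
    l_soft = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1)
    return l_hard + l_soft  # one value per sample
```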

KL loss. Based on the observation that the model makes consistent predictions on the original and transformed images of clean samples, and inconsistent predictions on most noisy samples, we apply the Kullback-Leibler (KL) Divergence as the KL loss to measure the discrepancy between the probability distributions predicted by the network for the two inputs. In practice, the KL loss can be expressed as

$L_{kl} = D_{KL}(p^{1}\,\|\,p^{2}) + D_{KL}(p^{2}\,\|\,p^{1})$,   (8)

where

$D_{KL}(p^{1}\,\|\,p^{2}) = \sum_{i=1}^{N}\sum_{c=1}^{C} p_{ic}^{1} \log \frac{p_{ic}^{1}}{p_{ic}^{2}}$,   (9)
$D_{KL}(p^{2}\,\|\,p^{1}) = \sum_{i=1}^{N}\sum_{c=1}^{C} p_{ic}^{2} \log \frac{p_{ic}^{2}}{p_{ic}^{1}}$.   (10)

Here $C$ is the number of categories. $p^{1}$ and $p^{2}$ represent the predicted probability distributions of the two inputs, i.e., $f(x;\theta)$ and $f(T(x);\theta)$, respectively.

Overall loss. We integrate the above-mentioned losses. The total loss can be formulated as:

$L_{total} = (1-\lambda)\,(L_{cls}^{1} + L_{cls}^{2}) + \lambda\,L_{kl}$,   (11)

where $\lambda$ is the parameter weighting the classification losses and the KL loss.
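The KL loss and the joint loss can be sketched as follows; the direction of the weighting (λ on the KL term and 1−λ on the classification terms, mirroring a JoCoR-style joint loss) is an assumption of this sketch.

```python
import torch.nn.functional as F

def kl_loss(logits1, logits2):
    # Per-sample symmetric KL divergence between the predictions for the
    # two views (Eqs. (8)-(10)).
    p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    kl_12 = F.kl_div(p2.log(), p1, reduction='none').sum(dim=1)  # KL(p1 || p2)
    kl_21 = F.kl_div(p1.log(), p2, reduction='none').sum(dim=1)  # KL(p2 || p1)
    return kl_12 + kl_21

def joint_loss(cls1, cls2, kl, lam):
    # Eq. (11): weighted combination of the two per-sample classification
    # losses and the per-sample KL loss.
    return (1.0 - lam) * (cls1 + cls2) + lam * kl
```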

Small-loss selection. We use the total loss to select the small-loss instances. Specifically, the network first feeds forward a mini-batch of data $D_b$ from $D$, and then we select the proportion of instances in $D_b$ that have a small total loss:

$\hat{D}_b = \arg\min_{D': |D'| \geq R(T)\,|D_b|,\ D' \subset D_b} \sum_{(x,y) \in D'} L_{total}(x, y)$,   (12)

where $R(T)$ is a parameter that controls how many small-loss instances are selected. For $R(T)$, we follow the same update strategy as [han2018co] [yu2019does] [wei2020combating]. Specifically, since DNNs learn from clean samples in the initial phase and gradually adapt to noisy ones during training, we apply a large $R(T)$ to use more small-loss instances to train the model at the beginning. As training proceeds and the DNNs gradually begin to overfit the noisy data, we decrease $R(T)$ so that fewer small-loss instances are kept in each mini-batch. The update strategy for $R(T)$ is given by:

$R(T) = 1 - \min\left\{\frac{T}{T_k}\tau,\ \tau\right\}$,   (13)

where $T$ is the current epoch, $T_k$ denotes the number of epochs over which the keep rate drops linearly, and $\tau$ is the actual noise rate in the whole dataset, which is closely related to the noise ratio within the noisy classes. If $\tau$ is not known in advance, it can be inferred using a validation set [liu2015classification] [yu2018efficient].
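A small sketch of the keep-rate schedule and the per-mini-batch small-loss selection is shown below; `per_sample_loss` is assumed to hold the joint loss of Eq. (11) evaluated without reduction, and the helper names are hypothetical.

```python
import torch

def keep_rate(epoch, tau, t_k):
    # Eq. (13): R(T) = 1 - min(T / T_k * tau, tau); keep more samples early,
    # and keep a fraction 1 - tau of each mini-batch once T >= T_k.
    return 1.0 - min(epoch / t_k * tau, tau)

def select_small_loss(per_sample_loss, r):
    # Eq. (12): keep the fraction r of instances with the smallest joint loss.
    num_keep = max(1, int(r * per_sample_loss.numel()))
    _, idx = torch.topk(per_sample_loss, num_keep, largest=False)
    return idx  # indices of the presumed-clean instances in the mini-batch
```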

1: Input: network parameters $\theta$ and teacher parameters $\theta'$, training set $D$, learning rate $\eta$, fixed $\tau$, epochs $T_k$ and $T_{max}$, iterations $I_{max}$, ensembling momentum $\alpha$;
2: for $T$ = 1, …, $T_{max}$ do
3:     Shuffle training set $D$;
4:     for $I$ = 1, …, $I_{max}$ do
5:         Fetch mini-batch $D_b$ from $D$;
6:         Calculate the total loss $L_{total}$ by Eq. (11);
7:         Obtain the small-loss set $\hat{D}_b$ by Eq. (12) from $D_b$;
8:         Obtain $L$ by Eq. (14) on $\hat{D}_b$;
9:         Update $\theta = \theta - \eta \nabla L$;
10:     end for
11:     Update $\theta'$ by Eq. (3);
12:     Update $R(T)$ by Eq. (13);
13: end for
14: Output: $\theta$ and $\theta'$.
Algorithm 1

Finally, we calculate the average loss of the selected instances and back-propagate it to update the network's parameters:

$L = \frac{1}{|\hat{D}_b|}\sum_{(x,y) \in \hat{D}_b} L_{total}(x, y)$.   (14)

Algorithm 1 delineates the overall training process.
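Putting the pieces together, the following condensed sketch mirrors Algorithm 1 using the hypothetical helpers from the previous sketches (ema_update, classification_loss, kl_loss, joint_loss, keep_rate, select_small_loss); the student/teacher models, data loader, optimizer and hyperparameters are assumed to be provided by the caller.

```python
import torch

def train(student, teacher, loader, optimizer, epochs, tau, t_k, lam, alpha, device):
    for epoch in range(1, epochs + 1):
        r = keep_rate(epoch, tau, t_k)                      # Eq. (13)
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            flipped = torch.flip(images, dims=[3])          # transformed view T(x)
            logits1, logits2 = student(images), student(flipped)
            with torch.no_grad():
                t_logits1, t_logits2 = teacher(images), teacher(flipped)

            joint = joint_loss(classification_loss(logits1, labels, t_logits1),
                               classification_loss(logits2, labels, t_logits2),
                               kl_loss(logits1, logits2), lam)  # Eq. (11), per sample
            idx = select_small_loss(joint.detach(), r)      # Eq. (12)
            loss = joint[idx].mean()                        # Eq. (14)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        ema_update(teacher, student, alpha)                 # Eq. (3)
```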

III Experiments

III-A Datasets and implementation details

Datasets. We extensively evaluate our approach on the CIFAR-10, CIFAR-100 [krizhevsky2009learning] and Clothing1M [xiao2015learning] datasets. Both CIFAR-10 and CIFAR-100 contain 50K training images and 10K test images of size 32×32, covering 10 and 100 classes, respectively. Clothing1M contains 1 million images with real-world noisy labels and additionally provides 50k training, 14k validation and 10k test images. Human annotators were asked to mark a set of 25k labels as a clean set, but we do not use these clean labels in our experiments. Note that the overall label accuracy of Clothing1M is 61.54%.

Implementation. For all experiments in our paper, we follow the same settings as JoCoR [wei2020combating]. Specifically, for CIFAR-10 and CIFAR-100, we use a 7-layer CNN architecture. For Clothing1M, we use an 18-layer ResNet.

We use the Adam optimizer with a momentum of 0.9 for all experiments. For CIFAR-10 and CIFAR-100, the network is trained for 200 epochs with an initial learning rate of 0.001 and a batch size of 128. For Clothing1M, the network is trained for 15 epochs with a constant learning rate and a batch size of 64. In all experiments, we take horizontal flipping as the transformation operation, and the ensembling momentum $\alpha$ in Eq. (3) is set to 0.999. The hyperparameters, i.e., the total loss weight $\lambda$ and the linear drop rate $T_k$, for all experiments are given in Table I.

Dataset      Flipping-Rate      $\lambda$   $T_k$
CIFAR-10     Symmetry-20%       0.8         10
             Symmetry-50%       0.5         20
             Symmetry-80%       0.3         40
             Asymmetry-40%      0.7         50
CIFAR-100    Symmetry-20%       0.8         40
             Symmetry-50%       0.8         30
             Symmetry-80%       0.2         30
             Asymmetry-40%      0.9         30
Clothing1M   -                  0.6         5
TABLE I: The hyperparameters of the total loss weight $\lambda$ and the linear drop rate $T_k$ for all experiments.

Following previous works [han2018co] [yu2019does] [wei2020combating], we verify our method under two types of label noise: symmetric and asymmetric. Symmetric label noise is generated by replacing the ground-truth label of a sample with a random one-hot vector. Asymmetric label noise is designed to mimic the structure of real-world label noise, such as CAT → DOG and BIRD → AIRPLANE. In each experiment, we also follow the same setting and assume the noise rate is known, so that $\tau$ in Eq. (13) can be obtained. Specifically, for symmetric label noise, $\tau$ equals the noise ratio. For asymmetric label noise, since the actual noise rate in the whole dataset is half of the noise rate within the noisy classes, we set $\tau$ to half of the noise ratio (e.g., $\tau = 0.2$ for Asymmetry-40%).
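For illustration, a simple NumPy sketch of how such label noise can be injected is given below; the exact sampling protocol (e.g., whether the true class is excluded for symmetric noise, and the specific class mapping for asymmetric noise) follows common practice and is an assumption, not necessarily the protocol used in these experiments.

```python
import numpy as np

def symmetric_noise(labels, noise_rate, num_classes, seed=0):
    # Replace the labels of a randomly chosen fraction of samples with a
    # uniformly drawn class (the random one-hot vector described above).
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < noise_rate
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy

def asymmetric_noise(labels, noise_rate, mapping, seed=0):
    # Flip each label to a visually similar class (e.g., CAT -> DOG) with the
    # given probability; `mapping` is a dict from class index to class index.
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < noise_rate
    for i in np.where(flip)[0]:
        noisy[i] = mapping.get(int(noisy[i]), noisy[i])
    return noisy
```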

Measurements. We use the test accuracy and the label precision to measure performance, i.e., test accuracy = (# of correct predictions) / (# of test samples) and label precision = (# of clean labels) / (# of all selected labels). Intuitively, an algorithm with higher label precision is more robust to label noise. All experiments are repeated five times, and the standard deviation is highlighted as a shaded area in each figure.
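Both metrics are straightforward to compute; a small NumPy sketch is given below, where `is_clean` is assumed to be a boolean array indicating whether each training label is correct and `selected_idx` indexes the samples chosen by the small-loss selection.

```python
import numpy as np

def test_accuracy(predictions, targets):
    # test accuracy = (# of correct predictions) / (# of test samples)
    return float(np.mean(predictions == targets))

def label_precision(selected_idx, is_clean):
    # label precision = (# of clean labels among the selected samples)
    #                   / (# of all selected samples)
    return float(np.mean(is_clean[selected_idx]))
```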

Baselines. We use conventional training with the cross-entropy loss on the noisy datasets (abbreviated as Standard) as our baseline and compare our method with the following state-of-the-art approaches:

  1. Decoupling [malach2017decoupling], which trains two networks simultaneously and updates the models only with the instances on which the two networks give different predictions.

  2. Co-teaching [han2018co], which trains two networks simultaneously, where each network uses its small-loss instances to update its peer network's parameters.

  3. Co-teaching+ [yu2019does], which trains two networks simultaneously using a disagreement-update step (data update) and a cross-update step (parameter update).

  4. JoCoR [wei2020combating], which trains two networks in a pseudo-siamese paradigm and updates their parameters simultaneously with a joint loss.

Fig. 3: Results on the CIFAR-10 dataset under (a) Symmetry-20%, (b) Symmetry-50%, (c) Symmetry-80% and (d) Asymmetry-40% label noise. Top: test accuracy (%) vs. epochs; bottom: label precision (%) vs. epochs.
Flipping-Rate Standard Decoupling Co-teaching Co-teaching+ JoCoR Ours
Symmetry-20%
Symmetry-50%
Symmetry-80%
Asymmetry-40%
TABLE II: Average test accuracy (%) on CIFAR-10 over the last 10 epochs.

III-B Comparison with state-of-the-art methods

III-B1 Results on CIFAR-10 dataset

The test accuracy on the CIFAR-10 dataset is shown in Table II. From the comparison results, we can clearly see that our method performs the best in all four cases. Specifically, in the easiest Symmetry-20% case, all methods work well, but our method still achieves an improvement of 1.85% over the best baseline, JoCoR (89.85% vs. 88.00%). When the noise rate rises to 50%, the test accuracy of Decoupling drops below 80%, while the other three methods remain above 80%, with Co-teaching performing better than JoCoR and Co-teaching+. Our method exceeds Co-teaching by 3.70% (86.85% vs. 83.15%). In the Symmetry-80% case, where the network is trained with extremely noisy labels, the methods based on the "disagreement" strategy, i.e., Decoupling and Co-teaching+, do not work well. Co-teaching and JoCoR only reach 63.01% and 60.59%, respectively, while our method exceeds them by 10.83% and 13.25%. In the hardest Asymmetry-40% case, Decoupling and Co-teaching+ also fail, with Decoupling performing much worse than the Standard method. In contrast, Co-teaching, JoCoR and our method perform better. Among them, our method is still the best and exceeds the second best method, Co-teaching, by 6.64% (85.87% vs. 79.23%).

In the top of Figure 3, we show test accuracy vs. the number of epochs, which clearly illustrates the memorization effect of DNNs. For example, the Standard method first learns from clean samples in the initial stage and then gradually adapts to the noisy ones, so its test accuracy curve first reaches a high level and then decreases gradually. For Decoupling, the test accuracy first rises during the first 100 epochs and then gradually decreases in two cases (Symmetry-50% and Symmetry-80%); in the other two cases (Symmetry-20% and Asymmetry-40%), the test accuracy gradually increases and finally stabilizes. The test accuracy curves of the other four methods follow similar trends, i.e., they first increase and then level off, although Co-teaching+ drops slightly in the Asymmetry-40% case.

In order to further explain such good performance, we plot label precision vs. the number of epochs at the bottom of Figure 3. Only Decoupling, Co-teaching, Co-teaching+, JoCoR and our method are considered here, because these algorithms contain a clean-instance selection step. We can see that Decoupling and Co-teaching+ fail to select clean instances in all four cases: the label precision curve of Co-teaching+ first rises to a high level and then gradually decreases, while for Decoupling, the label precision always remains at a low level. These curves show that the "disagreement" strategy cannot deal with noisy labels well, because it does not exploit the memorization effect of DNNs during training. In contrast, the label precision curves of the other three methods increase during training and then remain at a high level, which means they can pick out clean instances successfully. Among them, our method achieves a higher label precision after training for only a few epochs and consistently provides more accurate supervision for the subsequent training, resulting in the best classification performance.

III-B2 Results on CIFAR-100 dataset

Table III shows the test accuracy on the CIFAR-100 dataset. Different from CIFAR-10, CIFAR-100 is more difficult for noisy-labeled image classification because it contains 100 classes. Nevertheless, our method again achieves excellent performance. Specifically, in the easiest Symmetry-20% case, all methods work well, but our method exceeds the second best method, JoCoR, by 4.85% (62.00% vs. 57.15%). When the noise rate rises to 50%, our method is still the best and outperforms JoCoR by 6.59% (55.54% vs. 48.95%). In the extreme Symmetry-80% case, Decoupling, Co-teaching+ and JoCoR fail to achieve good performance, all falling below 20%. Co-teaching only achieves 22.08%, whereas our method achieves 34.86%, exceeding Co-teaching by 12.78%. In the hardest case (Asymmetry-40%), Decoupling, Co-teaching and JoCoR are below 35%. Co-teaching+ achieves the best performance (45.19%), and our method is comparable with it (43.56%). Meanwhile, our method achieves better label precision than Co-teaching+ in this case.

Fig. 4: Results on the CIFAR-100 dataset under (a) Symmetry-20%, (b) Symmetry-50%, (c) Symmetry-80% and (d) Asymmetry-40% label noise. Top: test accuracy (%) vs. epochs; bottom: label precision (%) vs. epochs.
Flipping-Rate Standard Decoupling Co-teaching Co-teaching+ JoCoR Ours
Symmetry-20%
Symmetry-50%
Symmetry-80%
Asymmetry-40%
TABLE III: Average test accuracy (%) on CIFAR-100 over the last 10 epochs.

Figure 4 shows the test accuracy and label precision vs. epochs, where the memorization effect of DNNs can be observed again for the different methods. Similar to the results on CIFAR-10, the Standard method first rises to a high level and then decreases gradually. Decoupling and Co-teaching+ are consistently worse than the other three approaches in the Symmetry-20%, 50% and 80% cases, and their label precision curves explain why: they are unable to deal with the noisy labels. In contrast, Co-teaching, JoCoR and our method mitigate the memorization issue; they can select clean instances out of the noisy ones successfully, resulting in good performance. Only in the Asymmetry-40% case is our method slightly lower than Co-teaching+, yet our method achieves better label precision than Co-teaching+. Besides, by observing the convergence of the models trained with the five methods, we can see that our method stabilizes and reaches better classification performance after only about 100 epochs.

III-B3 Results on Clothing1M dataset

We demonstrate the effectiveness of our proposed method on real-world noisy labels using the Clothing1M dataset. The results are shown in Table IV, where we report the best test accuracy across all epochs and the last test accuracy at the end of training. Our method obtains the best performance among the compared methods in both best and last test accuracy, with an improvement of 3.14% in last accuracy over the Standard method and of 0.24% over the second best method, JoCoR.

 Methods   Best Last
 Standard   67.44   66.99
 Decoupling   68.48   67.32
 Co-teaching   69.21   68.51
 Co-teaching+   59.32   58.79
 JoCoR   70.67   69.89
 Ours   70.77   70.13
TABLE IV: Classification accuracy (%) on the Clothing1M test set.

III-C Ablation study

In this section, we evaluate the two key components of our proposed method, i.e., the transform consistency (KL loss) and the soft classification loss, by conducting ablation studies on the CIFAR-10 and CIFAR-100 datasets. The results are shown in Table V. We create a baseline model that utilizes only the off-line hard labels for training (the first row).

Dataset                          CIFAR-10                           CIFAR-100
Noise type                       Symmetry               Asymmetry   Symmetry               Asymmetry
Methods / Noise ratio            20%     50%     80%    40%         20%     50%     80%    40%
w/o $L_{soft}$ & w/o $L_{kl}$    87.38   84.21   72.64  79.26       55.52   48.32   25.20  34.79
w/o $L_{kl}$                     87.92   84.87   73.45  80.25       55.58   48.89   25.99  35.62
w/o $L_{soft}$                   89.23   86.21   72.85  84.37       61.62   54.09   33.15  42.42
Ours                             89.85   86.85   73.84  85.87       62.00   55.54   34.86  43.56
TABLE V: Ablation study results in terms of test accuracy (%) on the CIFAR-10 and CIFAR-100 datasets.
Dataset                  CIFAR-10                           CIFAR-100
Noise type               Symmetry               Asymmetry   Symmetry               Asymmetry
Transform / Noise ratio  20%     50%     80%    40%         20%     50%     80%    40%
Scaling                  83.06   81.78   69.63  71.96       49.52   43.53   22.72  34.18
Rotation                 86.51   83.02   68.70  78.77       54.97   49.40   29.13  35.93
Rotation                 87.23   82.36   67.24  78.51       57.46   51.77   32.88  40.21
Rotation                 86.38   83.09   68.37  77.86       54.83   48.70   29.56  35.87
Rotation                 87.31   83.33   71.63  80.29       56.99   50.53   31.41  41.87
Vertical flipping        86.13   80.59   67.03  78.25       55.95   49.09   30.09  38.07
Horizontal flipping      89.85   86.85   73.84  85.87       62.00   55.54   34.86  43.56
TABLE VI: Test accuracy (%) based on various transforms on the CIFAR-10 and CIFAR-100 datasets (the four rotation rows correspond to four different rotation angles).

Effectiveness of transform consistency. To observe the effectiveness of the KL loss, we remove it from Eq. (11) (i.e., we set $\lambda = 0$). The results are shown in the second row of Table V (denoted as w/o $L_{kl}$). The test accuracy in the four cases drops by 0.39% to 5.62% on the CIFAR-10 dataset and by 6.42% to 8.87% on the CIFAR-100 dataset. These experiments demonstrate that clean and noisy samples can be distinguished more easily by adding the KL loss, and further verify the effectiveness of the transform consistency.

Effectiveness of soft classification loss. In order to investigate the effectiveness of the soft classification loss, we perform an experiment that only utilizes the off-line hard labels as supervisory information (i.e., we drop $L_{soft}^{1}$ in Eq. (6) and $L_{soft}^{2}$ in Eq. (7)). The performance is presented in the third row of Table V (denoted as w/o $L_{soft}$). The test accuracy in the four cases drops by 0.62% to 1.50% on the CIFAR-10 dataset and by 0.38% to 1.71% on the CIFAR-100 dataset. These results suggest that the on-line soft labels produced by the temporally averaged teacher model are more reliable; they effectively mitigate the negative influence of noisy hard labels and avoid bias amplification even when the network produces many erroneous outputs in the early training epochs.

III-D Effectiveness of different transforms

We conduct experiments to investigate the prediction consistency under different image transforms on the CIFAR-10 and CIFAR-100 datasets. Specifically, we focus on a subset of frequently used transforms: scaling, rotation and flipping. Scaling means the input images are resized to 48×48 and then cropped back to 32×32, rotation is evaluated with four different angles, and flipping contains vertical flipping and horizontal flipping. The results are shown in Table VI. Among them, horizontal flipping gives the best test accuracy on both datasets, while scaling is the worst on both datasets except for two cases on CIFAR-10, i.e., Symmetry-50% and Symmetry-80%. Rotation and vertical flipping present comparable performance across the four cases on both datasets.
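For reference, the compared transforms can be sketched with torchvision's functional API as follows; the exact parameters (e.g., crop location and rotation angles) are assumptions rather than the authors' implementation.

```python
import torchvision.transforms.functional as TF

def scaling(img):            # resize to 48x48, then crop back to 32x32
    return TF.center_crop(TF.resize(img, [48, 48]), [32, 32])

def rotation(img, angle):    # rotate by a fixed angle in degrees
    return TF.rotate(img, angle)

def vertical_flip(img):
    return TF.vflip(img)

def horizontal_flip(img):    # the transform used in the main experiments
    return TF.hflip(img)
```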

IV Conclusion

In this paper, we propose a simple and effective approach that utilizes transform consistency to identify mislabeled samples. Specifically, we train one single network with a joint loss between two inputs (the original image and its transformed image), which includes two classification losses and one KL loss. Furthermore, we design a classification loss that combines off-line hard labels and on-line soft labels to provide more reliable supervision for training a robust model. We conduct comprehensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets and achieve state-of-the-art performance.

References