Jo-SRC: A Contrastive Approach for Combating Noisy Labels

03/24/2021
by   Yazhou Yao, et al.

Due to the memorization effect in Deep Neural Networks (DNNs), training with noisy labels usually results in inferior model performance. Existing state-of-the-art methods primarily adopt a sample selection strategy, which selects small-loss samples for subsequent training. However, prior literature tends to perform sample selection within each mini-batch, neglecting the imbalance of noise ratios in different mini-batches. Moreover, valuable knowledge within high-loss samples is wasted. To this end, we propose a noise-robust approach named Jo-SRC (Joint Sample Selection and Model Regularization based on Consistency). Specifically, we train the network in a contrastive learning manner. Predictions from two different views of each sample are used to estimate its "likelihood" of being clean or out-of-distribution. Furthermore, we propose a joint loss to advance the model generalization performance by introducing consistency regularization. Extensive experiments have validated the superiority of our approach over existing state-of-the-art methods.

1 Introduction

Figure 1: Existing small-loss based sample selection methods (upper) regard a human-defined proportion of samples within each mini-batch as clean ones, ignoring the fluctuation of noise ratios across mini-batches. In contrast, our proposed method (bottom) selects clean samples in a global manner. Moreover, in-distribution (ID) noisy samples and out-of-distribution (OOD) ones are also selected and leveraged to enhance the model generalization performance.

DNNs have recently led to tremendous progress in various computer vision tasks [krizhevsky2012imagenet, ren2015faster, yao2021weakly, redmon2017yolo9000, xie2020region, luo2019segeqa]. These successes are largely attributable to large-scale datasets with reliable annotations (e.g., ImageNet [deng2009]). However, collecting well-annotated datasets is extremely labor-intensive and time-consuming, especially in domains where expert knowledge is required (e.g., fine-grained categorization [cub200-2011, inat17]). The high cost of acquiring large-scale well-labeled data poses a bottleneck to employing DNNs in real-world scenarios.

As an alternative, employing web images to train DNNs has received increasing attention recently [liu2021exploiting, yang2018recognition, yao2020bridging, tanaka2018joint, yao2018extracting, yao2020exploiting, zhang2020web, zhang2020data, sun2020crssc]. Unfortunately, while web images are cheaper and easier to obtain via image search engines [fergus2010learning, schroff2010harvesting, yao2017exploiting, yao2016domain], they inevitably come with noisy labels due to error-prone automatic tagging systems or non-expert annotations [niu2018webly, sun2020crssc, yao2018extracting, yao2019towards]. Recent studies have shown that DNNs unavoidably overfit samples with noisy labels, which consequently degrades performance [motivation2017, zhang2016understanding].

To alleviate this issue, many methods have been proposed for learning with noisy labels. Early approaches primarily attempt to correct losses during training. Some methods correct losses by introducing a noise transition matrix [sukhbaatar2015iclr, patrini2017making, goldberger2017, hendrycks2018using]. However, estimating the noise transition matrix is challenging, requiring either prior knowledge or a subset of well-labeled data. Other methods design noise-robust loss functions that correct losses according to the predictions of DNNs [reed2015training, zhang2018generalized, tanaka2018joint]. However, these methods are prone to fail when the noise ratio is high.

Another active research direction in mitigating the negative effect of noisy labels is training DNNs with selected or reweighted training samples [mentornet, ren2018learning, decoupling, coteaching, coteachingplus, wei2020combating, sun2020crssc]. The challenge is to design a proper criterion for identifying clean samples. It has been observed that DNNs have a memorization effect and tend to learn clean and simple patterns before overfitting noisy labels [motivation2017, zhang2016understanding]. Thus, state-of-the-art methods (e.g., Co-teaching [coteaching], Co-teaching+ [coteachingplus], and JoCoR [wei2020combating]) select a human-defined proportion of small-loss samples as clean ones. Although promising performance gains have been achieved with this small-loss sample selection strategy, these methods assume that noise ratios are identical across mini-batches and hence perform sample selection within each mini-batch based on an estimated noise rate. However, this assumption may not hold in real-world cases, and the noise rate is also difficult to estimate accurately (e.g., on Clothing1M [xiao2015learning]). Furthermore, existing literature mainly focuses on closed-set scenarios, in which only in-distribution (ID) noisy samples are considered. In open-set cases (i.e., real-world cases), both in-distribution (ID) and out-of-distribution (OOD) noisy samples exist. High-loss samples do not necessarily have noisy labels. In fact, hard samples, ID noisy ones, and OOD noisy ones all produce large loss values, but the former two are potentially beneficial for making DNNs more robust [sun2020crssc].

Motivated by self-supervised contrastive learning [chen2020simple, grill2020bootstrap], we propose a simple yet effective approach named Jo-SRC (Joint Sample Selection and Model Regularization based on Consistency) to address the aforementioned issues. Specifically, we first feed two different views of an image into a backbone network and obtain two corresponding softmax probabilities. Then we divide samples based on two likelihood metrics. We measure the likelihood of a sample being clean using the Jensen-Shannon divergence between its predicted probability distribution and its label distribution, and the likelihood of a sample being OOD based on the prediction disagreement between its two views. Subsequently, clean samples are trained conventionally to fit their given labels, while ID and OOD noisy samples are re-labeled by a mean-teacher model before they are back-propagated for updating network parameters. Finally, we propose a joint loss, including a classification term and a consistency regularization term, to further advance model performance. A comparison between Jo-SRC and existing sample selection methods is provided in Figure 1. The major contributions of this work are:

(1) We propose a simple yet effective contrastive approach named Jo-SRC to alleviate the negative effect of noisy labels. Jo-SRC trains the network with a joint loss, including a cross-entropy term and a consistency term, to obtain higher classification and generalization performance.

(2) Our proposed Jo-SRC selects clean samples globally by adopting the Jensen-Shannon divergence to measure the likelihood of each sample being clean. We also propose to distinguish ID noisy samples and OOD noisy ones based on the prediction consistency between samples’ different views. ID and OOD noisy samples are relabeled by a mean-teacher network before being used for network update.

(3) By providing comprehensive experimental results, we show that Jo-SRC significantly outperforms state-of-the-art methods on both synthetic and real-world noisy datasets. Furthermore, extensive ablation studies are conducted to validate the effectiveness of our approach.

Figure 2: The overall framework of our proposed Jo-SRC approach (a), the clean sample selection module (b), and the ID/OOD sample selection module (c). Each image is augmented into two different views before being fed into the backbone network, which predicts two corresponding probability distributions. Afterwards, we obtain the likelihood of a sample being clean using the Jensen-Shannon (JS) divergence between its predicted distribution and its label distribution. If a sample is judged as “unclean”, we obtain its likelihood of being out-of-distribution (OOD) based on the prediction disagreement between its two views. Finally, the sample is re-labeled by a mean-teacher model. The final objective function is a joint loss, including a classification term and a consistency term.

2 Related Works

Existing works on learning with noisy labels can be briefly categorized into two groups [sun2020crssc]: 1) Loss Correction and 2) Sample Selection.

Loss correction. A large proportion of the existing literature on training with noisy labels focuses on loss correction approaches. Some methods endeavor to estimate the noise transition matrix [sukhbaatar2015iclr, chang2017active, patrini2017making, goldberger2017, hendrycks2018using]. For example, Patrini et al. [patrini2017making] provided a loss correction method that estimates the noise transition matrix using a deep network trained on the noisy dataset. However, these methods are limited in that the noise transition matrix is challenging to estimate accurately and may not be feasible in real-world scenarios. Other methods attempt to design noise-tolerant loss functions [reed2015training, zhang2018generalized, tanaka2018joint]. For example, the bootstrapping loss [reed2015training] extended the conventional cross-entropy loss with a perceptual term. However, these methods fail to perform well in real-world cases when the noise ratio is high.

Sample Selection. Another line of work on dealing with noisy labels selects and removes corrupted data; the problem is to find proper sample selection criteria. It has been shown that DNNs tend to learn simple patterns first before memorizing noisy data [motivation2017, zhang2016understanding]. Based on this observation, the small-loss sample selection criterion has been widely adopted: samples with lower loss values are more likely to have clean labels. For example, Co-teaching [coteaching] maintains two networks simultaneously during training, with one network learning from the other network's selected small-loss samples. JoCoR [wei2020combating] uses a joint loss, including the conventional cross-entropy loss and a co-regularization loss, to select small-loss samples. However, the above methods select samples within each mini-batch based on a human-defined drop rate. In real-world scenarios, noise ratios in different mini-batches are not guaranteed to be identical, and the drop rate is challenging to estimate.

3 The Proposed Method

Background. Generally, for a multi-class classification task with C classes, we train DNNs using a labeled dataset D = \{(x_i, y_i)\}_{i=1}^{N}, in which x_i is the i-th training sample and y_i is its corresponding one-hot label over the C classes. The conventional objective is the cross-entropy between the predicted softmax probability distributions of training samples and their corresponding label distributions:

\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} y_i^k \log p(k \mid x_i, \theta), \qquad (1)

in which p(k \mid x_i, \theta) denotes the predicted softmax probability of sample x_i for class k, given a model with parameters θ. However, for datasets with noisy labels (e.g., web image datasets), labels are not guaranteed to be correct. Thus, directly training DNNs on noisy datasets is problematic and usually leads to a dramatic performance drop, given that DNNs are capable of memorizing all training samples, including noisy ones [motivation2017].
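For concreteness, the following is a minimal PyTorch sketch of this soft-label cross-entropy (the function name and tensor shapes are our own illustration, not the authors' released code). Accepting full label distributions lets the smoothed and pseudo labels introduced later be plugged in unchanged.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted distributions and given label distributions.

    logits:      (N, C) raw network outputs for a mini-batch
    target_dist: (N, C) label distributions (one-hot, smoothed, or pseudo labels)
    """
    log_probs = F.log_softmax(logits, dim=1)              # log p(k | x, theta)
    return -(target_dist * log_probs).sum(dim=1).mean()   # Eq. (1) with soft targets
```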

Terminology. This paper adopts two consistency-based metrics to reveal how likely each sample is to be clean or OOD. We accordingly term them “likelihood”, which differs from the concept of likelihood in statistics.

3.1 Global clean sample selection

Regarding samples with small cross-entropy losses as clean ones is one of the most widely used sample selection criteria. This criterion is justified by the observation that DNNs tend to learn clean patterns first and then gradually fit noisy labels [motivation2017, zhang2016understanding]. Methods using this criterion (e.g., Co-teaching [coteaching] and Co-teaching+ [coteachingplus]) typically select a pre-defined proportion of small-loss samples within each mini-batch. Unfortunately, noise ratios in different mini-batches inevitably fluctuate in real-world scenarios. One solution is to record losses for all samples and select samples over the entire training set. However, this becomes impractical when the dataset is large.

To this end, we reformulate the clean sample selection criterion from another perspective. Specifically, we adopt the Jensen-Shannon (JS) divergence in Eq. (2) to quantify the difference between the predicted probability distribution p_i = p(\cdot \mid x_i, \theta) and the given label distribution y_i of the sample x_i:

d_i = D_{JS}(p_i \,\|\, y_i) = \frac{1}{2} D_{KL}\!\left(p_i \,\Big\|\, \frac{p_i + y_i}{2}\right) + \frac{1}{2} D_{KL}\!\left(y_i \,\Big\|\, \frac{p_i + y_i}{2}\right), \qquad (2)

in which D_{KL} is the Kullback-Leibler (KL) divergence. The JS divergence measures the difference between two probability distributions and is known to be bounded in [0, 1] when a base-2 logarithm is used [lin1991divergence]. Therefore, we can intuitively leverage it to measure the “likelihood” of x_i being clean as follows:

p_{clean}(x_i) = 1 - d_i. \qquad (3)

In fact, p_{clean}(x_i) reveals the consistency between p_i and y_i. Here, we adopt smoothed label distributions [szegedy2016] when calculating Eq. (2) to avoid the issue of \log 0. We finally define our clean sample selection criterion as follows:

Criterion 3.1.

The sample x_i is a clean one if its likelihood of being clean exceeds a threshold, i.e., p_{clean}(x_i) > \tau.

Why can we select clean samples globally based on p_{clean}? Similar to the cross-entropy, the JS divergence measures the difference between two probability distributions. Since the label distribution y_i in Eq. (2) is not updated during back-propagation, minimizing the JS divergence between p_i and y_i has the same effect as minimizing the cross-entropy between them. Accordingly, our proposed Criterion 3.1 is consistent with the small-loss sample selection criterion. However, whereas the cross-entropy is unbounded, the JS divergence is bounded in [0, 1], making it a natural global selection metric for describing how likely a sample is to be clean. By directly modeling the likelihood of a sample being clean via Eq. (3), clean samples are selected efficiently in a global manner, alleviating the issue caused by the imbalance of noise ratios across mini-batches.
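As an illustration, here is a minimal PyTorch sketch of Eqs. (2)-(3): the per-sample clean likelihood computed from the base-2 JS divergence between the prediction and a smoothed label distribution. The function name and the default smoothing value are ours (the paper reports an LSR parameter of 0.6); this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def clean_likelihood(probs: torch.Tensor, labels: torch.Tensor,
                     num_classes: int, epsilon: float = 0.6) -> torch.Tensor:
    """Per-sample likelihood of being clean: 1 - JS(p || y_smoothed), cf. Eqs. (2)-(3).

    probs:  (N, C) predicted softmax probabilities
    labels: (N,)   given (possibly noisy) class indices
    """
    # Smoothed label distributions (LSR) keep every entry positive, so the KL terms stay finite.
    y = F.one_hot(labels, num_classes).float()
    y = (1.0 - epsilon) * y + epsilon / num_classes

    m = 0.5 * (probs + y)
    # Base-2 logarithms keep the JS divergence bounded in [0, 1].
    kl_pm = (probs * (torch.log2(probs + 1e-12) - torch.log2(m + 1e-12))).sum(dim=1)
    kl_ym = (y * (torch.log2(y + 1e-12) - torch.log2(m + 1e-12))).sum(dim=1)
    js = 0.5 * kl_pm + 0.5 * kl_ym
    return 1.0 - js   # higher means more likely clean; compare against a global threshold tau
```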

3.2 Out-of-distribution detection

Real-world scenarios contain both in-distribution (ID) noisy samples and out-of-distribution (OOD) ones. Despite their noisy labels, they can contribute to the model if their labels are re-assigned properly, especially for ID samples. Therefore, dropping all “unclean” samples directly is not data-efficient.

DNNs are usually uncertain about OOD samples when making predictions, since their correct labels are outside the task scope. Conversely, while ID noisy samples have corrupted labels, they usually lead to consistent model predictions. Therefore, inspired by self-supervised contrastive learning [chen2020simple] and the agreement maximization principle [sindhwani2005co], we propose to use prediction consistency to distinguish OOD and ID samples. Specifically, we first generate two augmented views v(x_i) and v'(x_i) from a sample x_i by applying two different image transformations v and v'. These two views are subsequently fed into a DNN to produce their corresponding predictions p'_i and p''_i, respectively. Finally, we adopt the consistency between these two predictions to determine whether the sample is out-of-distribution. More explicitly, we define the “likelihood” of a sample being out-of-distribution (OOD) as:

p_{ood}(x_i) = 1 - \mathbb{1}\!\left[\arg\max_k p'_i(k) = \arg\max_k p''_i(k)\right]. \qquad (4)

Consequently, given p_{ood}(x_i), our OOD/ID sample selection criterion is defined as follows:

Criterion 3.2.

Given a sample x_i that is selected as an “unclean” one by Criterion 3.1, it is judged as an OOD noisy one if p_{ood}(x_i) = 1 (i.e., the predictions of its two differently augmented views disagree). If p_{ood}(x_i) = 0 (i.e., the predictions of its two differently augmented views are consistent), it is deemed an ID noisy sample.
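The following sketch illustrates how Criteria 3.1 and 3.2 could divide a mini-batch, assuming disagreement is measured by the argmax of the two views' predictions (one natural reading of Eq. (4); the helper names are ours):

```python
import torch

def ood_likelihood(probs_v1: torch.Tensor, probs_v2: torch.Tensor) -> torch.Tensor:
    """1 if the two augmented views predict different classes, else 0 (cf. Eq. (4))."""
    disagree = probs_v1.argmax(dim=1) != probs_v2.argmax(dim=1)
    return disagree.float()

def split_batch(clean_like: torch.Tensor, ood_like: torch.Tensor, tau: float):
    """Divide a mini-batch into clean / ID-noisy / OOD-noisy masks (Criteria 3.1 and 3.2)."""
    clean_mask = clean_like > tau                  # Criterion 3.1: globally thresholded
    ood_mask = (~clean_mask) & (ood_like > 0.5)    # Criterion 3.2: unclean and views disagree
    id_mask = (~clean_mask) & (~ood_mask)          # unclean and views agree
    return clean_mask, id_mask, ood_mask
```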

3.3 Label re-assignment

The proposed Criteria 3.1 and 3.2 jointly divide the training data into three subsets: a clean subset D_{clean}, an ID noisy subset D_{id}, and an OOD noisy subset D_{ood}. To leverage all training data efficiently, we treat their labels differently before feeding them into the network.

For samples in D_{clean}, we keep their labels unaltered. To enhance the generalization performance, we adopt label smoothing regularization (LSR) [szegedy2016] when calculating their losses. Therefore, the label distribution of a clean sample x_i with given label y_i is provided as Eq. (5):

\tilde{y}_i = (1 - \epsilon)\, y_i + \frac{\epsilon}{C}\,\mathbf{1}, \qquad (5)

in which ε is a hyper-parameter controlling the smoothness of the label distribution and \mathbf{1} denotes the all-ones vector.

For samples in the ID subset D_{id}, inspired by the mean-teacher model [tarvainen2017mean], we use the temporally averaged model (mean-teacher model) to generate reliable pseudo label distributions for supervision. Therefore, given an ID sample x_i, its pseudo label distribution is provided as:

\tilde{y}_i = p(\cdot \mid x_i, \bar{\theta}), \qquad (6)

where \bar{\theta} denotes the parameters of the mean-teacher model.

Finally, for samples in D_{ood}, we also use the mean-teacher model to create their corresponding pseudo label distributions. However, since the true labels of OOD samples are outside the task scope, the DNN should be highly confused when predicting their label assignments. Therefore, we propose to enforce the predictions of OOD samples to fit an approximately uniform distribution to boost generalization performance. In practice, given an OOD sample x_i with mean-teacher logits f(x_i, \bar{\theta}), we relabel it with the following pseudo label distribution:

\tilde{y}_i = \mathrm{softmax}\!\left(f(x_i, \bar{\theta}) / T\right), \qquad (7)

in which T is a large scaling constant. In our experiments, we empirically set T large enough to make this label distribution sufficiently smooth.
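Below is a sketch of the label re-assignment step under our reading of Eqs. (5)-(7): smoothed labels for clean samples, the mean-teacher prediction for ID-noisy samples, and a temperature-flattened (nearly uniform) teacher prediction for OOD samples. The function name and the value T = 10 are hypothetical, and the exact scaling in Eq. (7) follows the assumption stated above.

```python
import torch
import torch.nn.functional as F

def reassign_labels(labels, teacher_logits, clean_mask, id_mask, ood_mask,
                    num_classes: int, epsilon: float = 0.6, T: float = 10.0):
    """Build target label distributions for one mini-batch.

    labels:         (N,)   given class indices
    teacher_logits: (N, C) mean-teacher outputs
    *_mask:         (N,)   boolean masks from the clean/ID/OOD division
    """
    labels = labels.to(teacher_logits.device)  # keep everything on one device
    targets = torch.zeros(labels.size(0), num_classes, device=teacher_logits.device)

    # Eq. (5): smoothed one-hot labels for clean samples.
    one_hot = F.one_hot(labels, num_classes).float()
    smoothed = (1.0 - epsilon) * one_hot + epsilon / num_classes
    targets[clean_mask] = smoothed[clean_mask]

    # Eq. (6): mean-teacher softmax as pseudo label for ID-noisy samples.
    teacher_probs = F.softmax(teacher_logits, dim=1)
    targets[id_mask] = teacher_probs[id_mask]

    # Eq. (7) (assumed form): a large temperature T flattens the teacher
    # prediction towards a uniform distribution for OOD samples.
    flat_probs = F.softmax(teacher_logits / T, dim=1)
    targets[ood_mask] = flat_probs[ood_mask]

    return targets
```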

It should be noted that the mean-teacher model is not updated via loss back-propagation. Instead, its parameters \bar{\theta} are an exponential moving average of the student parameters θ. Specifically, given a decay rate m, \bar{\theta} is updated in each training step as follows:

\bar{\theta}_t = m\, \bar{\theta}_{t-1} + (1 - m)\, \theta_t. \qquad (8)
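A minimal sketch of the exponential moving average in Eq. (8); the decay value shown is only a placeholder, not the paper's reported setting:

```python
import torch

@torch.no_grad()
def update_mean_teacher(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.99):
    """theta_bar <- m * theta_bar + (1 - m) * theta, applied parameter-wise (Eq. (8))."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(m).add_(s_param, alpha=1.0 - m)
```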
Input: Network parameters θ, mean-teacher parameters \bar{\theta}, learning rate η, iteration number N_max, epochs T_k and T_max.
for t = 1, 2, ..., T_max do
        for n = 1, ..., N_max do
               Sample a mini-batch B_n randomly.
               Predict p'_i and p''_i for the two views of each sample in B_n.
               Divide samples into B_clean, B_id, and B_ood based on Criteria 3.1 and 3.2.
               Re-label samples by Eq. (5), (6), and (7).
               if t ≥ T_k then
                      Obtain the loss L on the entire B_n by Eq. (10).
                      Update θ ← θ - η∇L.
               else
                      Obtain the loss L on B_clean only by Eq. (11).
                      Update θ ← θ - η∇L.
               end if
               Update \bar{\theta} by Eq. (8).
        end for
end for
Output: Updated network parameters θ.
Algorithm 1 Jo-SRC

3.4 Consistency regularization

As stated above, we use each sample's prediction consistency to measure its likelihood of being OOD. We follow the intuition that in-distribution samples (both clean ones and ID noisy ones) tend to produce consistent predictions across views, while out-of-distribution samples do not. Thus, we propose an auxiliary consistency loss, Eq. (9), to provide joint supervision and enhance the separability between ID and OOD samples:

\mathcal{L}_{cons} = \frac{1}{N} \sum_{i=1}^{N} s_i \cdot D_{JS}\!\left(p'_i \,\|\, p''_i\right), \qquad (9)

in which s_i = 1 if x_i \in D_{clean} \cup D_{id}; otherwise, s_i = -1.

On the one hand, with this additional regularization term, clean samples and ID noisy ones are encouraged to make consistent predictions across views, while the prediction divergence of OOD noisy samples is enlarged. Our approach is accordingly able to select clean/ID/OOD samples more effectively. On the other hand, this auxiliary consistency loss also implicitly promotes representation learning in a self-supervised fashion.
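For illustration, a sketch of such a signed consistency term, assuming a bounded JS divergence between the two views' predictions as in Eq. (9); the helpers are ours and reuse the base-2 formulation from the clean-likelihood sketch:

```python
import torch

def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Per-sample JS divergence with base-2 logs, bounded in [0, 1]."""
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log2(p + 1e-12) - torch.log2(m + 1e-12))).sum(dim=1)
    kl_qm = (q * (torch.log2(q + 1e-12) - torch.log2(m + 1e-12))).sum(dim=1)
    return 0.5 * (kl_pm + kl_qm)

def consistency_loss(probs_v1: torch.Tensor, probs_v2: torch.Tensor,
                     ood_mask: torch.Tensor) -> torch.Tensor:
    """Pull the two views together for clean/ID samples, push them apart for OOD ones (cf. Eq. (9))."""
    div = js_divergence(probs_v1, probs_v2)
    sign = torch.where(ood_mask, torch.full_like(div, -1.0), torch.ones_like(div))
    return (sign * div).mean()
```

Using a bounded divergence keeps the repulsive (negative-sign) term for OOD samples from diverging during optimization.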

3.5 The overall framework

Combining all submodules together, our final objective loss function is

\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{cons}, \qquad (10)

in which λ is a hyper-parameter balancing the two terms, and the classification term is the cross-entropy between the predictions and the re-assigned label distributions:

\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} \tilde{y}_i^k \log p(k \mid x_i, \theta). \qquad (11)

Details of Jo-SRC are shown in Figure 2 and Algorithm 1.

In practice, the model becomes increasingly stronger during training and will eventually overfit noisy labels. Thus, we propose to dynamically adjust the selection threshold τ over epochs as Eq. (12):

(12)

in which the schedule of τ involves a hyper-parameter and a large constant (empirically set to 0.95 in our experiments). Accordingly, more samples are treated as clean ones in the initial epochs so that the model can learn simple and easy patterns from as many samples as possible. As training proceeds, fewer samples are fed into the model as clean ones, ensuring the quality of the selected data.
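Putting the pieces together, the following sketch assembles the joint objective of Eqs. (10)-(11), reusing the `soft_cross_entropy` and `consistency_loss` helpers sketched earlier; `lambda_cons` stands for the weighting hyper-parameter λ, and its default here is arbitrary:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_v1: torch.Tensor, logits_v2: torch.Tensor, targets: torch.Tensor,
               ood_mask: torch.Tensor, lambda_cons: float = 1.0) -> torch.Tensor:
    """L = L_cls + lambda * L_cons (Eq. (10)); L_cls is the soft-label cross-entropy of Eq. (11)."""
    probs_v1 = F.softmax(logits_v1, dim=1)
    probs_v2 = F.softmax(logits_v2, dim=1)
    l_cls = soft_cross_entropy(logits_v1, targets)           # classification term (one view shown)
    l_cons = consistency_loss(probs_v1, probs_v2, ood_mask)  # signed consistency term of Eq. (9)
    return l_cls + lambda_cons * l_cons
```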

Noise setting    Standard        Decoupling      Co-teaching     Co-teaching+    JoCoR           Jo-SRC
Symmetric-20%    35.14 ± 0.44    33.10 ± 0.12    43.73 ± 0.16    49.27 ± 0.03    53.01 ± 0.04    58.15 ± 0.14
Symmetric-50%    16.97 ± 0.40    15.25 ± 0.20    34.96 ± 0.50    40.04 ± 0.70    43.49 ± 0.46    51.26 ± 0.11
Symmetric-80%    4.41 ± 0.14     3.89 ± 0.16     15.15 ± 0.46    13.44 ± 0.37    15.49 ± 0.98    23.80 ± 0.05
Asymmetric-40%   27.29 ± 0.25    26.11 ± 0.39    28.35 ± 0.25    33.62 ± 0.39    32.70 ± 0.35    38.52 ± 0.20
Table 1: Average test accuracy (%) on CIFAR100N-C over the last 10 epochs.
Noise setting    Standard        Decoupling      Co-teaching     Co-teaching+    JoCoR           Jo-SRC
Symmetric-20%    29.37 ± 0.09    43.49 ± 0.39    60.38 ± 0.22    53.97 ± 0.26    59.99 ± 0.13    65.83 ± 0.13
Symmetric-50%    13.87 ± 0.08    28.22 ± 0.19    52.42 ± 0.51    46.75 ± 0.14    50.61 ± 0.12    58.51 ± 0.08
Symmetric-80%    4.20 ± 0.07     10.01 ± 0.29    16.59 ± 0.27    12.29 ± 0.09    12.85 ± 0.05    29.76 ± 0.09
Asymmetric-40%   22.25 ± 0.08    33.74 ± 0.26    42.42 ± 0.30    43.01 ± 0.59    39.37 ± 0.16    53.03 ± 0.25
Table 2: Average test accuracy (%) on CIFAR80N-O over the last 10 epochs.
Figure 3: Comparison on CIFAR80N-O: test accuracy (%) vs. epochs.

4 Experiments

4.1 Experiment setup

Datasets. We evaluate Jo-SRC on four benchmark datasets: CIFAR100N-C, CIFAR80N-O, Clothing1M [xiao2015learning], and Food101N [lee2018cleannet]. CIFAR100N-C and CIFAR80N-O are two synthetic datasets created from CIFAR100 [krizhevsky2009]. Specifically, we follow JoCoR [wei2020combating] to create the closed-set synthetic dataset CIFAR100N-C with a given noise ratio; the noise type is either “Symmetry” or “Asymmetry”. To create the open-set synthetic dataset CIFAR80N-O, we first regard the last 20 categories in CIFAR100 as out-of-distribution ones. Then we create in-distribution noisy samples by randomly corrupting a given percentage of the remaining samples' labels in the same fashion, which leads to an overall noise ratio that also accounts for the OOD portion. Details are provided in the supplementary materials.
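As a rough illustration of the corruption step used for the synthetic sets, a symmetric-noise sketch is given below; the function name and seed handling are ours, and the exact protocol (asymmetric mapping, OOD category handling) follows JoCoR and is not reproduced here.

```python
import numpy as np

def corrupt_labels_symmetric(labels: np.ndarray, noise_ratio: float,
                             num_classes: int, seed: int = 0) -> np.ndarray:
    """Flip a fraction of labels uniformly at random to a *different* class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_ratio
    for i in np.flatnonzero(flip):
        candidates = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(candidates)
    return noisy
```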

Evaluation Metrics. For evaluating the model classification performance, we take the test accuracy as the evaluation metric. Besides, we also adopt the label precision as the metric to evaluate our sample selection criteria.

Implementation Details. Following JoCoR [wei2020combating], we adopt a 7-layer DNN for CIFAR100N-C and CIFAR80N-O. During training, we use the Adam optimizer with a momentum of 0.9. The initial learning rate is 0.001, and the batch size is 128. We train the network for 200 epochs and start to decay the learning rate linearly after 80 epochs. The decay rate m used to update the mean-teacher network is set empirically, and the threshold-related constant and the LSR parameter ε are empirically set to 0.95 and 0.6, respectively. For Clothing1M, we follow the settings in JoCoR [wei2020combating] and use ResNet-18 [he2016deep] with ImageNet pre-trained weights for a fair comparison with the results presented in JoCoR. We also conduct experiments using ResNet-50 [he2016deep], following the experimental settings used in DivideMix [li2020dividemix] for a fair comparison. For Food101N, we use ResNet-50 [he2016deep] pre-trained on ImageNet and follow the experimental settings used in DeepSelf [han2019deep]. All experiments are repeated five times, and averaged results are reported accordingly. Our implementation is based on PyTorch.

Baselines. To evaluate Jo-SRC on CIFAR100N-C and CIFAR80N-O, we follow JoCoR [wei2020combating] and compare Jo-SRC with the following state-of-the-art sample selection methods: Decoupling [decoupling], Co-teaching [coteaching], Co-teaching+ [coteachingplus], and JoCoR [wei2020combating]. To evaluate our approach on Clothing1M, besides the above methods, other state-of-the-art methods such as F-correction [patrini2017making], M-correction [arazo2019unsupervised], Joint-Optim [tanaka2018joint], Meta-Cleaner [zhang2019metacleaner], Meta-Learning [li2019learning], P-correction [yi2019probabilistic], and DivideMix [li2020dividemix] are also compared. For the evaluation on Food101N, CleanNet [lee2018cleannet] and DeepSelf [han2019deep] are compared with our approach. Finally, training directly on the noisy datasets is also included as a simple baseline (denoted Standard).

4.2 Comparison on synthetic noisy datasets

Results on CIFAR100N-C. While our proposed Jo-SRC method is designed for open-set scenarios, it is also applicable and useful in closed-set cases. The comparison in test accuracy with state-of-the-art approaches on CIFAR100N-C is shown in Table 1. For simplicity, the results of existing methods are drawn directly from JoCoR [wei2020combating], and our method is evaluated using the same experimental settings. From Table 1, we can observe that our proposed Jo-SRC method consistently outperforms state-of-the-art methods. Although the performance of all methods drops dramatically in the most severe case (i.e., Symmetric-80%), our method still obtains the highest test accuracy.

Results on CIFAR80N-O. CIFAR80N-O is created to simulate the real-world scenario (i.e., the open-set problem). We present the comparison in test accuracy with state-of-the-art methods on CIFAR80N-O in Table 2. We implement all these methods with default parameters, and the results in Table 2 come from experiments under the same experimental settings. From this table, we can observe that our Jo-SRC method performs consistently better than other methods. In the simplest case (i.e., Symmetric-20%), while all methods except Standard work effectively and robustly, our method achieves the best test accuracy. When the noise scenario becomes harder (i.e., Symmetric-50% and Asymmetric-40%), model performance inevitably starts to drop, especially for Decoupling. However, our method remains effective and outperforms other methods. Finally, in the most challenging case (i.e., Symmetric-80%), all approaches struggle against the massive label noise, but Jo-SRC once again achieves significantly higher performance than other methods, demonstrating the superiority of our method in coping with extremely noisy scenarios. Figure 3 shows the test accuracy vs. epochs. From this figure, we can observe that Jo-SRC consistently outperforms other methods by a large margin. Moreover, these curves clearly demonstrate the superior robustness of our method.

Method Backbone Test accuracy
Standard ResNet-18 67.22
Decoupling [decoupling] ResNet-18 68.48
Co-teaching [coteaching] ResNet-18 69.21
Co-teaching+ [coteachingplus] ResNet-18 59.32
JoCoR [wei2020combating] ResNet-18 70.30
Standard ResNet-50 69.21
F-correction [patrini2017making] ResNet-50 69.84
M-correction [arazo2019unsupervised] ResNet-50 71.00
Joint-Optim [tanaka2018joint] ResNet-50 72.16
Meta-Cleaner [zhang2019metacleaner] ResNet-50 72.50
Meta-Learning [li2019learning] ResNet-50 73.47
P-correction [yi2019probabilistic] ResNet-50 73.49
DivideMix [li2020dividemix] ResNet-50 74.76
Jo-SRC ResNet-18 71.78
Jo-SRC ResNet-50 75.93
Table 3: Comparison with state-of-the-art methods in test accuracy (%) on Clothing1M.
Method Backbone Test accuracy
Standard ResNet-50 84.51
CleanNet [lee2018cleannet] ResNet-50 83.47
CleanNet [lee2018cleannet] ResNet-50 83.95
DeepSelf [han2019deep] ResNet-50 85.11
Jo-SRC ResNet-50 86.66
Table 4: Comparison with state-of-the-art methods in test accuracy (%) on Food101N using ResNet-50.

4.3 Comparison on real-world noisy datasets

Results on Clothing1M. To verify the effectiveness of our Jo-SRC in real-world scenarios, we provide experimental results on Clothing1M, a large-scale real-world dataset containing one million training images, a considerable portion of which carry noisy labels [xiao2015learning]. Table 3 shows the comparison with state-of-the-art methods using ResNet-18 and ResNet-50 as the backbone network. From this table, we can observe that our proposed Jo-SRC approach achieves the best score on both backbones. Using ResNet-18 as the backbone, our method achieves an improvement of 1.48% over the existing state-of-the-art (JoCoR, 70.30%). When ResNet-50 is adopted, Jo-SRC boosts the test accuracy from 74.76% to 75.93%.

Results on Food101N. Food101N is another real-world noisy dataset. It contains 310k training images in 101 food categories and also has a large proportion of noisy labels. Table 4 presents the performance comparison with state-of-the-art methods. As shown in Table 4, Jo-SRC achieves the best score and outperforms the state-of-the-art DeepSelf [han2019deep] by 1.55%, validating the effectiveness of our approach in dealing with real-world noisy cases.

Figure 4: Comparison on CIFAR80N-O: precision of clean sample selection (%) vs. epochs.
Figure 5: The prediction accuracy (%) on different groups of CIFAR80N-O (Symmetric-20%) training data during the training process.
Noise setting    ID sample (best / last)    OOD sample (best / last)
Symmetric-20%    60.91 / 42.62              59.54 / 54.38
Symmetric-50%    83.86 / 65.92              40.70 / 38.75
Symmetric-80%    96.31 / 72.84              26.67 / 24.60
Asymmetric-40%   45.86 / 45.52              63.97 / 45.37
Table 5: The precision of ID/OOD sample selection on CIFAR80N-O at the best and last epochs.
Model Test accuracy
Standard 29.37 0.09
Jo-SRC-C 57.12 0.33
Jo-SRC-CI 61.32 0.18
Jo-SRC-CIO 63.10 0.07
Jo-SRC 65.83 0.13
Table 6: Effect of different steps in test accuracy (%) on CIFAR80N-O (Symmetric-20%) over the last 10 epochs.

4.4 Ablation Study

Precision of sample selection. The key reason our approach obtains state-of-the-art performance is its accurate and reliable sample selection. To study and verify the superiority of our proposed sample selection strategy, we show the precision of sample selection in Figure 4 and Table 5. Figure 4 presents the precision of clean sample selection vs. epochs. From this figure, Jo-SRC is shown to be highly effective in selecting clean samples accurately and reliably. In all cases, our proposed Jo-SRC achieves the best clean-sample selection performance compared with state-of-the-art sample selection methods. Furthermore, in the most demanding scenario (i.e., Symmetric-80%), while all other methods struggle to find clean samples, the selection precision of our Jo-SRC increases steadily as training proceeds. These results validate the effectiveness of our clean sample selection strategy. Table 5 presents the precision of selecting ID/OOD samples, where best and last denote the selection precision at the best and last epochs, respectively. The results in this table verify the effectiveness of our Jo-SRC in selecting ID/OOD samples.

Prediction accuracy of different training samples. According to the memorization effect, DNNs would eventually memorize all samples (including noisy ones). Therefore, it is critical to prevent networks from overfitting noisy labels when training with noisy datasets. To further demonstrate the effectiveness of our proposed Jo-SRC, we show the prediction accuracy on different groups of training samples in Figure 5. As shown in this figure, all methods achieve increasing prediction accuracy on clean samples. JoCoR and our Jo-SRC achieve the lowest prediction accuracy on noisy samples (including both ID and OOD ones) with respect to their given (i.e., noisy) labels, indicating that JoCoR and Jo-SRC perform best in preventing networks from memorizing noisy labels. Although JoCoR obtains lower prediction accuracy on ID and OOD training samples, it suffers from under-fitting on clean samples, leading to sub-optimal final test accuracy. While Co-teaching fits clean samples slightly better than our Jo-SRC, it overfits noisy labels, which decreases its final performance on test samples. Moreover, the last sub-figure shows that our Jo-SRC achieves the best prediction accuracy on ID noisy samples with respect to their true labels. This further demonstrates the effectiveness of our sample selection and model regularization, given that ID noisy samples are not supervised by their true labels during training.

Influence of different steps. Table 6 reveals the effect of the different steps in our method. Jo-SRC-C denotes the case in which only selected clean samples are used in training. Jo-SRC-CI denotes the case where clean samples and ID noisy samples are used. Jo-SRC-CIO denotes the case when all samples are used; the mean-teacher-based re-labeling is performed accordingly whenever noisy samples are leveraged in training. Lastly, Jo-SRC denotes the final proposed method. From this table, we can observe that the proposed clean sample selection plays the most crucial role in addressing the label noise issue. Moreover, appropriately treated noisy samples (including ID and OOD ones) contribute to the model generalization performance. Finally, the consistency loss further promotes model performance through additional regularization.

5 Conclusion

In this paper, we proposed a simple yet effective approach named Jo-SRC to address the performance degradation caused by noisy labels. Jo-SRC trained DNNs in a contrastive manner. Clean samples were identified globally based on JS divergence, while ID and OOD noisy samples were distinguished based on consistency. Samples were selected and divided accordingly for subsequent network learning. Finally, a joint loss, including a classification term and a consistency regularization term, was proposed to further advance the performance and robustness. Comprehensive experiments on both synthetic and real-world noisy datasets validated the superiority of the proposed method.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61976116) and Fundamental Research Funds for the Central Universities (No. 30920021135).

References