Identifying Training Stop Point with Noisy Labeled Data

12/24/2020 ∙ by Sree Ram Kamabattula, et al. ∙ The University of Texas at Arlington; Baylor Scott & White Health

Training deep neural networks (DNNs) with noisy labels is a challenging problem due to over-parameterization. DNNs tend to essentially fit on clean samples at a higher rate in the initial stages, and later fit on the noisy samples at a relatively lower rate. Thus, with a noisy dataset, the test accuracy increases initially and drops in the later stages. To find an early stopping point at the maximum obtainable test accuracy (MOTA), recent studies assume either that i) a clean validation set is available or ii) the noise ratio is known, or, both. However, often a clean validation set is unavailable, and the noise estimation can be inaccurate. To overcome these issues, we provide a novel training solution, free of these conditions. We analyze the rate of change of the training accuracy for different noise ratios under different conditions to identify a training stop region. We further develop a heuristic algorithm based on a small-learning assumption to find a training stop point (TSP) at or close to MOTA. To the best of our knowledge, our method is the first to rely solely on the training behavior, while utilizing the entire training set, to automatically find a TSP. We validated the robustness of our algorithm (AutoTSP) through several experiments on CIFAR-10, CIFAR-100, and a real-world noisy dataset for different noise ratios, noise types and architectures.


I Introduction

DNNs have achieved remarkable performance in several computer vision domains due to the availability of large datasets and advanced deep learning techniques. However, in areas such as medical imaging, vast amounts of data are still unlabeled. Labeling by experts is time-consuming and expensive, and labeling is often performed by crowd sourcing [28], online queries [3], etc. Such datasets contain a large number of noisy labels, caused by errors during the labeling process. This work focuses on label noise; therefore, noise and label noise are used interchangeably throughout the paper.

Training DNNs robustly with noisy datasets has become a challenging and prominent line of work [1, 27]. While DNNs are robust to a certain amount of noise [20], they have sufficient capacity to fit the noisy samples [30, 2]. However, the performance of DNNs declines as they begin to learn a significant number of noisy samples.

One notable approach to deal with this problem is to select and train on only the clean samples from the noisy training data. In [4], the authors check for inconsistency in predictions. O2U-Net [9] uses a cyclic learning rate and gathers loss statistics to select clean samples. SELF [16] takes an ensemble of predictions at different epochs.

A few methods utilize two networks to select clean samples. MentorNet [10] pre-trains an extra network to supervise a StudentNet. Decoupling [15] trains two networks simultaneously, and the network parameters are updated only on the examples with different predictions. Co-teaching [7] updates one network's parameters based on the peer network's small-loss samples. Co-teaching is further improved in co-teaching+ [29].

The small-loss observation suggests that clean samples have smaller loss than noisy samples in the early stages of training. However, in the later stages, it is harder to distinguish clean and noisy samples based on the loss distribution. Therefore, it is desirable to stop the training in the initial stages. In [13], the authors theoretically show that DNNs can be robust to noise despite over-parameterization when early stopping is used with a first-order gradient-based optimization method.

However, identifying the point where the network starts to fit on noisy samples is a challenging problem. Most previous methods assume that either i) the noise ratio is known or ii) the noise type is known. For instance, when the noise ratio is unknown, the authors of [5] propose a cross-validation approach to estimate the noise based on the validation accuracy, which becomes inaccurate for larger numbers of classes and harder datasets. In [22], the authors assume that a clean validation set is available to find the early stopping point. However, a clean validation set is often not available. In any case, [23] suggests that the overall performance of the network is sensitive to the validation set quality. Therefore, neither a clean nor a noisy validation set is a reliable way to identify the best stop point.

In contrast, we attempt to stop the training before the network learns a significant number of noisy samples, without any of the above assumptions. We propose a novel heuristic approach called AutoTSP to identify a TSP by monitoring the training behavior of the DNN. Briefly, our main contributions are:

  • We define different memorization stages using clean and noisy label recall to understand the learning behavior on noisy samples.

  • We analyze the positive and negative rates of change of training accuracy to define a training stop region.

  • We further develop a heuristic algorithm to identify the TSP that does not require a clean validation set, and is independent of noise estimation and noise type.

  • We show that AutoTSP can be incorporated in other clean sample selection methods [5], [7] to further improve the performance.

We validate AutoTSP on two clean benchmark datasets: CIFAR-10 and CIFAR-100. Since CIFAR-10 and CIFAR-100 are clean, we synthetically inject noise by flipping the true label to a different label according to a noise transition matrix. We employ two types of noise to evaluate robustness: 1) Symmetric (Sym): the true label is flipped to any of the other classes with equal probability at a given noise rate. 2) Asymmetric (Asym): the noise is concentrated in a single class; we change the true label to the label of the following class at a given rate. Asymmetric noise is more realistic, as a labeler is likely to confuse one class with a particular other class. We also validate our algorithm on a real-world noisy dataset, ANIMAL-10N. We perform several experiments with different architectures: a 9-layer CNN (CNN9), ResNet32 and ResNet110 [8], to show that AutoTSP is architecture independent.
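The following is a minimal sketch of this synthetic label-noise injection, assuming integer labels 0..C−1; the function names and random-number handling are our own illustration, not the paper's code.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to a uniformly chosen *different* class with probability noise_rate."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy

def inject_asymmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to the *following* class (c -> c+1 mod C) with probability noise_rate."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    noisy[flip] = (labels[flip] + 1) % num_classes
    return noisy
```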

II Related Work

Several methods have been introduced to deal with the challenges of training with noisy labels. A detailed review of different methods is provided in [1].

A few methods attempt to correct the noisy labels. Bootstrapping [18] uses the predictions of the network to correct the labels, but the method is prone to overfitting. Joint optimization [24] corrects the labels by updating parameters and labels alternately, but requires prior knowledge of the noise distribution. SELFIE [21] is a hybrid method, which selects the clean samples based on the small-loss observation and re-labels the noisy samples using a loss-correction method. D2L [14] utilizes local intrinsic dimensionality. DivideMix [12] separates the training data into labeled and unlabeled sets based on the loss distribution and trains in a semi-supervised manner.

Another approach is to correct the loss function by estimating the noise transition matrix. [6] adds an extra softmax layer, and [17] introduces forward and backward loss corrections. However, the noise estimation becomes inaccurate, especially when the number of classes or the number of noisy samples is large.

Other related methods assign higher weights to clean samples for effective training. [19] assigns weights to training samples by minimizing the loss on a clean validation set. CleanNet [11] produces weights for each sample during network training but requires human supervision. A few other works develop noise-robust loss functions [25], [31], [26].

III Understanding the training stop point

Ideally, to find an early stopping point, a clean validation set is required, which is often not available. Thus, in this section, as an alternative, we analyze the behavior of the test accuracy with respect to the training accuracy when we possess the ground truth, i.e., information about which training samples are clean and which are noisy. We then utilize this analysis of the training behavior to define a training stop point (TSP) without the ground truth.

Let E be the total number of epochs and C denote the vector of the number of correctly predicted samples at each epoch. Let C_e represent the number of correctly predicted samples at epoch e, and let C_e^c and C_e^n be the number of clean and noisy samples in C_e. Let the training accuracy at each epoch be TA_e = C_e / D, where D represents the number of training samples in the dataset. For each epoch, we define the label recall of the clean and noisy samples as LR_c(e) = C_e^c / D_c and LR_n(e) = C_e^n / D_n respectively, where D_c and D_n represent the number of clean and noisy samples present in the training set. LR_c and LR_n denote the vectors of label recall values over epochs.

We observe LR_c and LR_n to understand the learning behavior on clean and noisy samples, i.e., to assess the region where the maximum obtainable test accuracy (MOTA) is achieved. In particular, we monitor the ratio LR_c / LR_n and divide the training period into three memorization regions: pre-memorization (PM), mild memorization (MM) and severe memorization (SM). LR_c and LR_n can be observed in the top plots of Fig. 1, where the training accuracy and test accuracy are in the bottom plots.
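A minimal sketch, under the notation above, of how the per-epoch label recall curves in Fig. 1 can be computed when the clean/noisy split is known; the array names (`pred_correct`, `clean_mask`) are illustrative, not from the authors' code.

```python
import numpy as np

def label_recall_curves(pred_correct, clean_mask):
    """pred_correct: (E, D) boolean array, True where sample d's *given* label is predicted at epoch e.
    clean_mask: (D,) boolean array, True for samples whose given label is the true label."""
    D_c = clean_mask.sum()           # number of clean samples D_c
    D_n = (~clean_mask).sum()        # number of noisy samples D_n
    lr_c = pred_correct[:, clean_mask].sum(axis=1) / D_c    # LR_c(e): clean label recall per epoch
    lr_n = pred_correct[:, ~clean_mask].sum(axis=1) / D_n   # LR_n(e): noisy label recall per epoch
    train_acc = pred_correct.sum(axis=1) / pred_correct.shape[1]   # TA_e
    return lr_c, lr_n, train_acc
```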

Fig. 1: The memorization observation of CIFAR-10 0.5 Sym. Label recall on top figures, train and test accuracy on bottom figures with CNN9 (left) and ResNet110 (right)

We calculate the ratio LR_c / LR_n and define two points: e_1, the red vertical dotted line at the beginning of the shaded region, and e_2, the blue vertical dotted line, as shown in Fig. 1.

e_1 separates the PM and MM regions and is obtained as the argmax of the label recall ratio, i.e., the point of highest gap between the blue and the orange line in the top plots of Fig. 1. e_2 separates the MM and SM regions and is obtained by monitoring the point at which the ratio begins to drop progressively. In some cases, with a high-capacity network, e_1 and e_2 might coincide, because the network fits noisy examples at a high rate, as shown in the right plot of Fig. 1.

In the PM stage, before e_1 (the unshaded region in Fig. 1), the learning is essentially on clean samples, and thus the increase in the training accuracy is proportional to the test accuracy. In particular, LR_c increases at a high rate, whereas LR_n remains minimal. In the MM stage (between e_1 and e_2), the rate at which LR_c increases drops and learning on noisy labels increases gradually. Thus, the proportionality between training and test accuracy begins to attenuate. Later, in the SM region, as LR_n keeps increasing, the proportionality fails completely, as the learning is mostly on noisy labels. In other words, the test accuracy keeps dropping as the training accuracy increases. This can be seen in the bottom plots of Fig. 1. The network learns a large number of noisy samples in the SM stage, whereas it might not yet have learned a significant number of clean samples in the PM stage. Therefore, the desired training stop point should lie between e_1 and e_2.
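As a rough illustration of how e_1 and e_2 could be located from the recall curves above: the exact criterion for e_2 is described only as "the point at which the ratio drops progressively", so the window-based test below is our own assumption rather than the paper's rule.

```python
import numpy as np

def memorization_boundaries(lr_c, lr_n, window=5, eps=1e-8):
    ratio = lr_c / (lr_n + eps)
    e1 = int(np.argmax(ratio))                      # PM/MM boundary: largest clean-vs-noisy gap
    e2 = e1
    for e in range(e1 + 1, len(ratio) - window):
        # declare the MM/SM boundary once the ratio decreases over a whole window of epochs
        if np.all(np.diff(ratio[e:e + window + 1]) < 0):
            e2 = e
            break
    return e1, e2
```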

However, e_1 and e_2 are obtained with the ground truth. Therefore, we utilize the learning behavior, specifically the rate of change of the training accuracy in each memorization stage, to find the desired TSP without the ground truth.

IV Ideal time to stop training

We monitor the rate of change of the training accuracy at each epoch e, which can be either positive or negative. We represent the magnitude of the rate of change at epoch e by p_e if positive, or n_e if negative. We separate all the p_e and n_e into vectors PROCE and NROCE respectively. We then observe the variation of the magnitudes in PROCE and NROCE across the different memorization stages.

In the PM stage, both p_e and n_e consistently have high magnitude. Thus, we observe high spikes in the unshaded region in Fig. 1. Therefore, we assume there would be very few epochs with small magnitude initially. We refer to these small-magnitude epochs as small-learning epochs (SLEs). As the network begins to fit the noisy samples in the MM region, the magnitudes of both p_e and n_e decrease, so there would be a higher number of SLEs. We would also observe fewer high-magnitude epochs, resulting in a smoothness of the training accuracy curve near the SM region. As the network severely memorizes the noisy labels in the SM region, there would still be a high number of SLEs; but towards the end, as the network fits all the training examples, the rate tends to increase slightly, causing the oscillation towards the end in Fig. 1. The magnitude in the SM region varies with different settings, but the drop in magnitude near the SM region (the smoothness) is consistent across different architectures, datasets, noise types and noise ratios, as shown in later figures. From our analysis in the previous section, the desired training stop point lies between e_1 and e_2. Therefore, we identify this smoothness, i.e., we find the initial reduction in PROCE and NROCE, to choose a training stop region.

We divide PROCE and NROCE into intervals of fixed length l and calculate the cumulative sum of magnitudes in each interval. Then, we standardize these sums (zero mean and unit variance) and identify the first interval where the standardized value becomes less than zero. Two different reduction points are obtained, one for PROCE and one for NROCE (labeled posline and negline in the figures). We define the region between these two points as the training stop region.
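A minimal numpy sketch of this step under the notation above; the interval size l and the "first standardized sum below zero" rule follow the description, while the function and variable names are ours.

```python
import numpy as np

def rates_of_change(train_acc):
    """Split the per-epoch change of training accuracy into positive and negative magnitudes."""
    diff = np.diff(train_acc)
    proce = [(e, d) for e, d in enumerate(diff, start=1) if d > 0]   # (epoch, magnitude p_e)
    nroce = [(e, -d) for e, d in enumerate(diff, start=1) if d < 0]  # (epoch, magnitude n_e)
    return proce, nroce

def reduction_point(roce, l):
    """Epoch at which the interval-wise cumulative magnitude first drops below its mean."""
    epochs = np.array([e for e, _ in roce])
    mags = np.array([m for _, m in roce])
    n_int = len(mags) // l
    sums = mags[:n_int * l].reshape(n_int, l).sum(axis=1)      # cumulative sum per interval
    z = (sums - sums.mean()) / (sums.std() + 1e-12)            # standardize interval sums
    below = np.where(z < 0)[0]
    k = below[0] if len(below) else n_int - 1                  # first interval with standardized sum < 0
    return int(epochs[min((k + 1) * l - 1, len(epochs) - 1)])

def training_stop_region(train_acc, l=5):
    proce, nroce = rates_of_change(train_acc)
    posline = reduction_point(proce, l)
    negline = reduction_point(nroce, l)
    return min(posline, negline), max(posline, negline)
```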

We plot NROCE and PROCE along with the test accuracy in Fig. 2. The blue asterisk points in the figure are located at the epoch numbers in NROCE and PROCE, spaced apart by intervals of length l. The height of each blue asterisk represents the cumulative sum in that interval. The red vertical dotted line is plotted at the point where the reduction happens.

Fig. 2: NROCE on left, PROCE on right with ResNet110 on CIFAR-10: 0.2 Sym on top, 0.2 Asym on bottom

The small-learning belief can be seen in Fig. 2: the asterisk points are higher in the beginning and drop in the later stages. It can be observed that the red dotted lines in both figures are close to the MOTA, which supports our choice of the training stop region.

Fig.3 shows plots for the higher noise ratio with a different architecture. The PROCE asterisk points are higher initially, but subsequently vary in the SM stage as discussed earlier. However, as the smoothness is consistent, the red dotted lines of both PROCE and NROCE are still close to MOTA.

Fig. 3: NROCE on left, PROCE on right with CNN9 0.5 Sym: CIFAR-10 on top, CIFAR-100 on bottom

Since our approach depends on the rate of change of the training accuracy, we check its sensitivity to different learning rate schedules. The top plots in Fig. 4 use an initial learning rate of 0.001, multiplied by 0.5, 0.25 and 0.1 at epochs 80, 120 and 160 respectively (LR2). The bottom plots use a constant learning rate of 0.001 (LR3). The reduction of the interval sums in the later stages can still be observed in both cases, which further verifies the robustness of our training stop region.

Fig. 4: NROCE on left and PROCE on right when trained on CIFAR 10 with ResNet32 on 0.2 symmetric noise with LR2 on top, LR3 on bottom

Now, we develop a heuristic algorithm (Alg. 1) to identify the training stop point (TSP) within the training stop region. We first compute the running maximum of the training accuracy, MTA, where MTA_e stores TA_e only if TA_e is greater than MTA_{e-1}, and otherwise retains MTA_{e-1}. We calculate the rate of change of this vector, denoted MT, where MT_e represents the magnitude at epoch e. As discussed earlier, there would be very few SLEs in MT initially, and a higher number of SLEs in the later stages. Since the test accuracy decreases in the later stages, we assume that the network learns on noisy samples at SLEs, which results in a drop in test accuracy. We define an epoch as a small-learning epoch (SLE) if MT_e is less than the threshold δ. We negate the values of these SLEs in MT, assigning them a negative penalty −λ, to indicate that the test accuracy drops at these epochs. To identify the TSP, we determine whether the epochs following the SLEs compensate for the drop in test accuracy at the SLEs and achieve a higher test accuracy. Therefore, we define a vector S that accumulates the sum of the SLEs and the following epochs until the sum becomes positive, which we take to indicate an increase in test accuracy. We store the cumulative sum at epoch e in S_e, and set S_e to zero if the running sum at epoch e is negative, to indicate that the test accuracy is not increasing. Finally, we calculate the sums of consecutive non-zero values of S within the training stop region and store them in a vector V. We take the epoch of the highest cumulative sum (argmax) in V as the training stop point.

Input: training accuracies TA_1, ..., TA_E; interval length l; thresholds δ and λ
/* Training stop region */
Split the per-epoch changes of TA into positive magnitudes PROCE and negative magnitudes NROCE
for each of PROCE and NROCE do
      Divide the magnitudes into intervals of length l and compute the cumulative sum of each interval
      Standardize the interval sums (zero mean, unit variance)
      Find the first interval whose standardized sum is below zero and record its epoch (posline for PROCE, negline for NROCE)
end for
Training stop region ← [min(posline, negline), max(posline, negline)]
/* Small-learning assumption */
MTA_e ← running maximum of TA up to epoch e;  MT_e ← MTA_e − MTA_{e−1}
csum ← 0
for e ← 1 to E do
      if MT_e < δ then MT_e ← −λ   /* small-learning epoch (SLE) */
      csum ← csum + MT_e
      if csum > 0 then S_e ← csum; csum ← 0
      else S_e ← 0
end for
V ← sums of consecutive non-zero elements of S within the training stop region
TSP ← epoch of the maximum element of V   /* training stop point */
Output: TSP
Algorithm 1 AutoTSP
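For concreteness, here is a small Python sketch of the TSP selection step, reusing `training_stop_region` from the sketch above; the interpretation of the SLE penalty as MT_e ← −λ is our reading of the description, not verified against the authors' code.

```python
import numpy as np

def auto_tsp(train_acc, l=5, delta=0.5, lam=0.5):
    """Return a training stop point (epoch index) from per-epoch training accuracies.
    train_acc is assumed to be in percent (0-100), so delta=0.5 means half a percentage point."""
    lo, hi = training_stop_region(train_acc, l)            # from the earlier sketch
    mta = np.maximum.accumulate(train_acc)                  # running maximum of training accuracy
    mt = np.diff(mta, prepend=mta[0])                       # its per-epoch rate of change MT
    mt = np.where(mt < delta, -lam, mt)                     # penalize small-learning epochs (SLEs)
    s = np.zeros_like(mt)
    csum = 0.0
    for e, m in enumerate(mt):
        csum += m
        if csum > 0:                                        # following epochs compensated the SLE drop
            s[e], csum = csum, 0.0
    # sums of consecutive non-zero runs of S inside the training stop region; pick the largest
    best_epoch, best_sum, run_sum = lo, -np.inf, 0.0
    for e in range(lo, min(hi, len(s) - 1) + 1):
        if s[e] > 0:
            run_sum += s[e]
            if run_sum > best_sum:
                best_sum, best_epoch = run_sum, e
        else:
            run_sum = 0.0
    return best_epoch
```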

V Experiments

We use the Adam optimizer with a momentum (β1) of 0.9. The batch size is set to 128 and the network is trained for 200 epochs. The initial learning rate is set to 0.001 and multiplied by 0.5, 0.25 and 0.1 at epochs 20, 30 and 40 respectively for all experiments, unless otherwise mentioned. We set the interval sizes l to 5, 6 and 7, denoted l = {5, 6, 7}, and the thresholds δ and λ to 0.5, based on experimental observations.
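A sketch of this training configuration in PyTorch, assuming the stated factors are applied to the initial rate (so the schedule is 1e-3, 5e-4, 2.5e-4, 1e-4); `model` is any classifier and the scheduler choice is ours.

```python
import torch

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

    # multiply the *initial* learning rate by 0.5, 0.25, 0.1 at epochs 20, 30, 40
    def lr_factor(epoch):
        if epoch >= 40:
            return 0.1
        if epoch >= 30:
            return 0.25
        if epoch >= 20:
            return 0.5
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```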

V-A Performance Comparison of AutoTSP:

In this section, we compare AutoTSP under equal conditions: use of the entire training data and no access to a clean validation set. Thus, we compare our method only with 1) MOTA, the point where the maximum test accuracy is obtained, represented in the figures by the green vertical line; 2) standard training, i.e., the test accuracy at the end of training without early stopping (labeled Standard in the tables); and 3) the Noise Heuristic Accuracy (NHA) [22], which assumes that the noise ratio is known, an assumption we believe should be avoided. NHA suggests that, when training on a dataset with noise ratio τ, the best point to stop is when the training accuracy reaches (1 − τ) × 100 percent.
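A one-line illustration of the NHA rule as reconstructed above (τ is the assumed-known noise ratio; the stopping target follows that reconstruction, not the original NHA code):

```python
def nha_stop_epoch(train_acc, noise_ratio):
    """Stop at the first epoch where training accuracy (percent) reaches (1 - noise_ratio) * 100."""
    target = (1.0 - noise_ratio) * 100.0
    for e, acc in enumerate(train_acc):
        if acc >= target:
            return e
    return len(train_acc) - 1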

We compare AutoTSP with the above-mentioned methods in terms of test accuracy (acc), and the label precision (LP) and label recall (LR) of the training data at the stop point, where LP = C_e^c / C_e is the fraction of learned (correctly predicted) samples that are clean, and LR = C_e^c / D_c is the fraction of clean samples that are learned.
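Under the same notation as the earlier sketch, LP and LR at a stop epoch can be computed as follows (illustrative helper, not the authors' code):

```python
def label_precision_recall(pred_correct_at_e, clean_mask):
    """pred_correct_at_e: (D,) boolean, True where the given label is fit at the stop epoch."""
    learned_clean = (pred_correct_at_e & clean_mask).sum()
    lp = learned_clean / max(pred_correct_at_e.sum(), 1)   # fraction of learned samples that are clean
    lr = learned_clean / clean_mask.sum()                  # fraction of clean samples that are learned
    return lp, lr
```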

Table I reports the results on the ResNet110 architecture. It can be seen that AutoTSP is either exactly at or close to MOTA across different noise ratios, datasets and noise types. It can also be observed that AutoTSP outperforms NHA in most cases. The results also confirm that standard training performs poorly when training with noisy labels.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
MOTA (acc) 85.01 75.86 88.28 77.78 58.96 46.65 57.99 43.7
AutoTSP (acc) 85.01 75.86 88.28 76.31 57.33 46.65 56.03 42.25
NHA (acc) 83.19 70.7 84.22 73.57 56.35 40 56.43 40.25
Standard (acc) 75 48.78 74.20 56.67 52.09 27.33 53.98 39.66
MOTA (LP) 97.6 95.2 97.54 87.94 99.14 96.45 92.76 71.81
AutoTSP (LP) 97.6 95.2 97.54 87.54 94.5 96.45 84.25 63.1
NHA (LP) 95.32 85.9 95.18 82.39 90.38 76.41 87.24 66.94
Standard (LP) 81.78 54.52 80.92 61.35 80.53 50.6 79.84 59.86
MOTA (LR) 94.23 80.23 96.56 87.62 76.07 61.48 78.77 64.96
AutoTSP (LR) 94.23 80.23 96.56 82.66 88.9 61.48 93.83 88.21
NHA (LR) 96.54 89.03 94.41 85.10 93.71 84.02 89.46 67.32
Standard (LR) 98.18 95.02 97.5 96.33 98.58 96.4 99.12 98.74
TABLE I: Comparison of results with ResNet110

On the CIFAR-10 dataset, with the easier 0.2 Sym noise, it can be seen that AutoTSP is able to exactly identify the MOTA point. High LP indicates that the learning on noisy samples is minimal, and high LR indicates that the network learned a significant number of clean samples at the TSP. The significance of AutoTSP can be observed in the harder cases. AutoTSP still finds the exact stop point at MOTA when the noise ratio is increased to 0.5, as shown in the top left plot of Fig. 5. It performs similarly for both asymmetric cases, as shown in the bottom left plot. On the other hand, NHA and standard training achieve higher LR than the MOTA point, because they stop the training in the later stages, where the number of clean samples learned is higher. But their lower LP indicates that they learn a higher number of noisy samples due to the stop point lying in the SM region, which is not desired.

Fig. 5: AutoTSP comparison with ResNet110: 0.5 Sym on top, 0.2 Asym on bottom, CIFAR-10 on left, CIFAR-100 on right

Results on the harder dataset CIFAR-100 are similar to CIFAR-10 for Sym, as shown in the top right plot of Fig. 5. For the Asym 0.2 and 0.4 cases, both AutoTSP and NHA are close to MOTA, as shown in the bottom right plot of Fig. 5. In this case, the LP and LR of AutoTSP are not as desired, because the network learns on noisy labels from the beginning of the training.

For harder asymmetric noise cases, the network learns noisy examples significantly along with clean examples from the initial stages of training, as shown in the left plot of Fig. 6. It can be observed that LR_n (orange line) increases along with LR_c (blue line). This suggests that the small-loss observation is not as strong in this case. Thus, the test accuracy remains the same even in the later stages. Surprisingly, in this case, NHA stops earlier than AutoTSP. However, AutoTSP is still close to MOTA, as shown in the right plot of Fig. 6. But it stops at a point later in the region, resulting in high LR and low LP, as shown in Table I.

Fig. 6: LR (left) and test accuracy (right) on CIFAR-100 with 0.4 Asym on ResNet110

We validated our algorithm with CNN9, as shown in Table II. It can be observed that AutoTSP is still close to MOTA across different cases. As shown in the bottom plot of Fig. 3, AutoTSP identifies a training stop region close to MOTA. However, AutoTSP finds a stop point at the beginning of the region, resulting in a lower LR and higher LP. This is likely due to the hyper-parameter (argmax) selection of the stop point.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
MOTA (acc) 85.69 79.19 87.58 81.82 60.84 52.03 61.02 43.61
AutoTSP (acc) 83.26 79.19 86.89 81.82 57.06 42.05 59.99 42.98
NHA (acc) 80.52 69.16 82.17 79.15 56.7 47.55 59.99 37.28
Standard (acc) 72.49 46.21 77.23 56.88 53.3 29.36 56.65 38.46
MOTA (LP) 97.78 95.48 96.93 93.35 95.62 96.6 81.4 63.41
AutoTSP (LP) 98.99 95.48 97.26 93.35 99.41 98.25 85.21 60.19
NHA (LP) 91.96 87.14 94.42 86.75 90.28 81.51 85.21 64.61
Standard (LP) 80.58 52.16 80.84 60.77 81.05 50.84 80.35 60.01
MOTA (LR) 94.21 84.71 96.32 87.02 88.48 68.42 98.46 85.11
AutoTSP (LR) 88.12 84.71 94.87 87.02 68.85 47.8 92.85 96.43
NHA (LR) 96.35 86.06 92.03 89.29 90.67 86.25 92.85 64.77
Standard (LR) 96.85 92.58 97.92 97.65 97.48 95.74 98.3 95.37
TABLE II: Comparison of results with CNN9

We also validated our algorithm with the ResNet32 architecture, where the network does not fit all the training samples. Results are presented in Table III. The network's learning drops significantly when it begins fitting noisy samples, and thus the PROCE reduction point is close to MOTA in Fig. 4. Similar to the CNN9 case, AutoTSP identifies the beginning of the memorization region, as shown in Fig. 7. However, we can still observe that AutoTSP works fairly well on both CIFAR-10 and CIFAR-100, and outperforms NHA in both the Sym and Asym cases.

CIFAR 10 CIFAR 100
Sym Asym Sym Asym
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
MOTA (acc) 85.45 78.38 87.22 80.71 56.53 47.35 56.71 42.94
AutoTSP (acc) 80.42 77.82 86.21 77.09 52.45 45.77 52.73 33.24
NHA (acc) 81 68.23 81.47 76.47 52.05 36.44 51.87 38.61
Standard (acc) 76.65 59.09 78.87 56.21 52.05 35.99 51.18 38.1
MOTA (LP) 98.44 93.75 98.41 91.9 99.14 96.44 95.66 73.27
AutoTSP (LP) 99.07 95.64 98.87 90.74 99.58 97.31 97.64 77.67
NHA (LP) 95.01 85.93 93.72 86.22 96.51 85.9 89.63 69.12
Standard (LP) 91.61 73.33 88.06 63.55 96.51 85.9 89.56 66.81
MOTA (LR) 93.83 87.83 94.42 87.07 74.19 62.29 73.09 61.04
AutoTSP (LR) 84.3 83.58 91.79 80.9 59.83 55.59 61.37 37.94
NHA (LR) 94.71 84.73 92.98 86.42 80.49 70.97 80.34 67.12
Standard (LR) 93.55 84.36 94.75 82.99 80.49 70.97 80.48 69.13
TABLE III: Comparison of results with ResNet32
Fig. 7: AutoTSP comparison on CIFAR-10 with ResNet32: Left 0.2 Sym, Right 0.5 Sym

ANIMAL-10N: We also conducted experiments on the real-world noisy ANIMAL-10N dataset, with 50,000 training and 5,000 testing samples. The noise is caused by human labeling errors on 5 pairs of confusing classes, and the approximate noise rate is 8%. It can be seen that AutoTSP still exactly identifies MOTA with CNN9 in the left plot of Fig. 8. With ResNet32 in the right plot, AutoTSP stops a little earlier than MOTA, as discussed previously with the results in Table II. However, it is still close to MOTA: the accuracy at MOTA was observed to be 79.6, and the AutoTSP accuracy was 78.4. It can also be observed that NHA stops the training deep into the memorization region. The promising results on this real-world noisy dataset validate the robustness of AutoTSP.

Fig. 8: AutoTSP comparison on ANIMAL-10N with: Left CNN9, Right ResNet32

Clean dataset: We further validated our method on the two clean benchmark datasets CIFAR-10 and CIFAR-100 with the CNN9 architecture. In general, one utilizes a validation dataset and stops the training when the validation accuracy does not improve for a few epochs. Since utilizing a validation set is not reliable [23], we instead use AutoTSP to find the TSP on a clean dataset. It can be observed from Fig. 9 that AutoTSP is able to find the stop point at which the test accuracy no longer improves.

Fig. 9: AutoTSP comparison with CNN9: Left CIFAR-10, Right CIFAR-100

LR2 and LR3: Additionally, we trained CIFAR-10 0.5 Sym with ResNet32 using the LR2 and LR3 schedules discussed with Fig. 4. It can be observed that AutoTSP performs well in both cases. AutoTSP finds the beginning of the memorization region with LR2, as can be seen in the left plot of Fig. 10. The accuracy at MOTA is 77.65 and the accuracy of AutoTSP is 75.53. Similarly, with LR3, it can be observed that AutoTSP is very close to MOTA in the right plot. These results continue to validate our claim that the rate of change of the training accuracy can be utilized to avoid learning on noisy labels.

Fig. 10: AutoTSP comparison on CIFAR-10 dataset 0.5 Sym on ResNet32: Left LR2, Right LR3.

V-B Comparison of test accuracy with other baseline methods:

For additional validation, we combine AutoTSP with the INCV and co-teaching methods (referred to as AutoTSP+INCV in Table IV) to obtain higher test accuracy. We compare the test accuracy with the following baselines: F-correction, Decoupling, Co-teaching, MentorNet, D2L, INCV and standard training. Additionally, we also compare with the test accuracy obtained when training only on the clean subset of the noisy training data (labeled Train on clean in the table). The results are shown in Table IV.

The objective of AutoTSP is to find the best training stop point close to MOTA. Thus, it is not fair to compare AutoTSP directly with the other baseline methods, whose objective is to improve the test accuracy by re-labeling the noisy samples, selecting a larger number of samples iteratively, etc. We utilize the INCV discard measure to remove a percentage of high-loss samples at the TSP, which are believed to be noisy. Then, we retrain on the remaining samples, find the new TSP, and discard a few more high-loss samples, so the resulting dataset is less noisy than the given training dataset. We then train the network using the co-teaching method instead of standard training, which has proved to be effective with noisy labels.

Since INCV achieves the highest accuracy among the other baseline methods in Table IV, we compare our AutoTSP+INCV results only with INCV. For CIFAR-10, we can see that the AutoTSP+INCV and INCV accuracies are very close. The significance of AutoTSP can be observed for CIFAR-100, as our accuracy is higher than INCV's. This is because INCV depends largely on noise estimation, which becomes inaccurate for a larger number of classes. On the other hand, we find the TSP without using the noise information; we only utilized the noise information to adopt the INCV discard measure.

The claims in Table IV are supported by Table V. For CIFAR-10 Sym, the LP and LR of AutoTSP+INCV and INCV are very similar. The significance of AutoTSP can be observed with Asym 0.4 on CIFAR-10, as the LP and LR of AutoTSP+INCV are higher. On CIFAR-100, it can be seen that the LR of AutoTSP+INCV is much lower than that of INCV. This is because AutoTSP stops the training before several noisy labels are memorized, as discussed earlier. Thus, high LP is achieved for both symmetric noise cases on CIFAR-100.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric
0.2 0.5 0.4 0.2 0.5
Standard training 85 75 66 57 43
F-correction [17] 85.08* 79.3* 83.55* - -
Decoupling [15] 86.72* 79.31* 75.27* - -
Co-teaching [7] 89.15 82.23 83.95 57 51
MentorNet [10] 88.36* 77.10* 77.33* - -
D2L [14] 86.12* 67.39* 85.57* - -
INCV [5] 89.65 84.81 85.84 60.4 53
AutoTSP+INCV 89.49 84.5 86.34 62.5 53.8
Train on clean 90 89.4 89 64 58
TABLE IV: Test accuracy comparison with other baseline methods with ResNet32, *the values are taken from the INCV paper and the highest test accuracy is in bold.
CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
LP (INCV) 0.98 0.92 0.97 0.82 0.98 0.97 0.96 0.73
LR (INCV) 0.93 0.89 0.95 0.92 0.93 0.63 0.73 0.65
LP (AutoTSP+INCV) 0.98 0.95 0.98 0.89 0.99 0.96 0.95 0.71
LR (AutoTSP+INCV) 0.95 0.85 0.93 0.94 0.73 0.64 0.74 0.57
TABLE V: Comparison of LP and LR of AutoTSP+INCV and INCV with ResNet32

VI Conclusion

In this work, we have shown that the rate of change of the training accuracy can be essential in understanding the behavior of learning on noisy labels. Our key idea is to monitor the smoothness of the training accuracy to identify the training stop region. We further proposed a heuristic algorithm to automatically identify a training stop point based on the small-learning assumption. Our algorithm does not require a clean validation set and does not depend on noise estimation or the type of noise. We conducted several experiments to validate the robustness of our algorithm. One drawback of early stopping, in general, is that the network does not train on the harder clean samples learned in the later stages, which could improve the network's generalization performance. Therefore, one possible direction for future work is to discard noisy samples effectively without depending on noise estimation.

References

  • [1] G. Algan and I. Ulusoy (2019) Image classification with deep learning in the presence of noisy labels: a survey. arXiv preprint arXiv:1912.05170.
  • [2] D. Arpit, S. Jastrzȩbski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien (2017) A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
  • [3] A. Blum, A. Kalai, and H. Wasserman (2003) Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM 50 (4), pp. 506–519.
  • [4] H. Chang, E. Learned-Miller, and A. McCallum (2017) Active Bias: training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 1002–1012.
  • [5] P. Chen, B. Liao, G. Chen, and S. Zhang (2019) Understanding and utilizing deep neural networks trained with noisy labels. In 36th International Conference on Machine Learning (ICML).
  • [6] J. Goldberger and E. Ben-Reuven (2017) Training deep neural-networks using a noise adaptation layer. In International Conference on Learning Representations (ICLR).
  • [7] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. W. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [9] J. Huang, L. Qu, R. Jia, and B. Zhao (2019) O2U-Net: a simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3325–3333.
  • [10] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In 35th International Conference on Machine Learning (ICML), pp. 3601–3620.
  • [11] K. Lee, X. He, L. Zhang, and L. Yang (2018) CleanNet: transfer learning for scalable image classifier training with label noise. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5447–5456.
  • [12] J. Li, R. Socher, and S. C. H. Hoi (2020) DivideMix: learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations (ICLR).
  • [13] M. Li, M. Soltanolkotabi, and S. Oymak (2019) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680.
  • [14] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. In 35th International Conference on Machine Learning (ICML 2018), pp. 5332–5341.
  • [15] E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 960–970.
  • [16] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2019) SELF: learning to filter noisy labels with self-ensembling. arXiv preprint, pp. 1–15.
  • [17] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1944–1952.
  • [18] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2015) Training deep neural networks on noisy labels with bootstrapping. In 3rd International Conference on Learning Representations (ICLR 2015), Workshop Track Proceedings.
  • [19] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In 35th International Conference on Machine Learning (ICML 2018).
  • [20] D. Rolnick, A. Veit, S. Belongie, and N. Shavit (2017) Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
  • [21] H. Song, M. Kim, and J. Lee (2019) SELFIE: refurbishing unclean samples for robust deep learning. In 36th International Conference on Machine Learning (ICML).
  • [22] H. Song, M. Kim, D. Park, and J. Lee (2019) Prestopping: how does early stopping help generalization against label noise? arXiv preprint arXiv:1911.08059.
  • [23] Y. Sun, Y. Tian, Y. Xu, and J. Li (2019) Limited gradient descent: learning with noisy labels. IEEE Access 7, pp. 168296–168306.
  • [24] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5552–5560.
  • [25] X. Wang, Y. Hua, E. Kodirov, and N. M. Robertson (2019) IMAE for noise-robust learning: mean absolute error does not treat examples equally and gradient magnitude's variance matters. arXiv preprint.
  • [26] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey (2019) Symmetric cross entropy for robust learning with noisy labels. arXiv preprint, pp. 322–330.
  • [27] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699.
  • [28] Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy (2014) Learning from multiple annotators with varying expertise. Machine Learning 95 (3), pp. 291–327.
  • [29] X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption? In 36th International Conference on Machine Learning (ICML), pp. 7167–7173.
  • [30] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations (ICLR 2017).
  • [31] Z. Zhang and M. R. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems 31 (NIPS 2018), pp. 8778–8788.

VII Supplementary material

VII-A Choice of the interval size l:

To find the training stop region, we monitor the magnitude of PROCE and NROCE over different intervals and identify the point at which the magnitude decreases. Since we are looking for the reduction of the rate of change at different intervals of the training process, a few small-magnitude epochs within the same interval could result in finding the region too early. Thus, to avoid such an effect, we choose a large interval size to identify a consistent drop in magnitude. We also consider three different interval sizes for redundancy and compare the reduction points among the three intervals to choose the training stop region. We choose l to be 3, 4 and 5 for our experiments and denote this case as l = {3, 4, 5}. We then take the region between min(min(NROCE{3, 4, 5}), max(PROCE{3, 4, 5})) and max(max(NROCE{3, 4, 5}), max(PROCE{3, 4, 5})) as the training stop region, where NROCE{·} and PROCE{·} denote the reduction points obtained with each interval size.
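A sketch of this redundancy step, reusing `rates_of_change` and `reduction_point` from the earlier sketch; here we simply take the envelope of all reduction points as the region, which is our own simplification of the combination rule quoted above.

```python
def training_stop_region_multi(train_acc, interval_sizes=(3, 4, 5)):
    """Combine reduction points from several interval sizes into one training stop region."""
    proce, nroce = rates_of_change(train_acc)
    points = [reduction_point(roce, l)
              for roce in (proce, nroce)
              for l in interval_sizes]
    return min(points), max(points)
```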

Table VI shows the accuracy results when we set l to {5, 6, 7}. It can be observed that the accuracy is similar in both cases, which shows that the algorithm is not very sensitive to the interval size. This result makes qualitative sense, because the reduction happens at the same point in the training irrespective of the interval size.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
AutoTSP (l = {3, 4, 5}, acc) 85.01 75.86 88.28 76.31 57.33 46.65 56.03 42.25
AutoTSP (l = {5, 6, 7}, acc) 83.67 75.86 88.28 76.31 56.35 46.65 57.99 42.25
TABLE VI: Comparison of test accuracy at the TSP with different interval sizes on ResNet110

VII-B Small-learning assumption:

The small-learning assumption negates the magnitude MT_e when it is less than the threshold δ. Since there are only a few SLEs in the initial stages, the following epochs are able to compensate for the drop and achieve a positive sum, meaning a higher test accuracy. But in the later stages, there are a higher number of SLEs and fewer high-magnitude epochs, and these few high-magnitude epochs cannot overcome the large negative cumulative sum. Therefore, this results in several zeros in S in the later stages. Thus, we can claim that our small-learning assumption is strong. However, we cannot use this assumption alone to find the TSP: as the rate tends to increase slightly in the SM region, there could be a few false positives indicating that the test accuracy is increasing at the end. To avoid these false positives, we only apply the small-learning assumption within the training stop region, assuming the training stop region truncates the SM region, as shown in Figs. 1, 2 and 3.

VII-C Choice of δ and λ:

To show the impact of the thresholds, we set δ to 0.2, 0.5 and 0.8, and λ to 0 and 0.5. Note that, if λ is set to zero, the small-learning assumption is ignored.

Table VII shows accuracy results for different δ values. It can be observed that the results are similar for the three cases, showing that the algorithm is not very sensitive to the chosen thresholds. However, if the threshold is set even higher, a stop point is found too early in the training. The threshold should arguably increase with the epoch number, since, in the later stages, even the higher-magnitude epochs correspond to learning on noisy labels; but finding that relation is hard without knowing the noise rate. Both INCV [5] and co-teaching [7] utilize a threshold to select a percentage of loss values as small- or high-loss values; however, these two methods utilize the noise ratio in their thresholds, whereas our algorithm assumes that the noise rate is unknown. Thus, we observed the rate of change across different datasets, architectures and hyper-parameters to choose the threshold, and based on this observation we set δ to 0.5.

The results are similar for both λ values, 0 and 0.5. λ helps when the training stop region has an upper bound late in the training, by further penalizing the small-magnitude epochs. It can be ignored if the threshold is not kept constant but increased with the epoch number.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
AutoTSP (δ=0.2, acc) 83.86 74.76 84.85 76.36 57.33 46.65 56.03 42.25
AutoTSP (δ=0.5, acc) 85.01 75.86 88.28 76.31 57.33 46.65 56.43 42.25
AutoTSP (δ=0.8, acc) 83.7 75.86 84.85 76.31 58.04 46.65 56.03 42.25
AutoTSP (δ=0.2, LP) 98.11 93.57 98.42 88.1 94.5 96.45 87.24 63.1
AutoTSP (δ=0.5, LP) 97.6 95.2 97.54 87.54 94.5 96.45 84.25 63.1
AutoTSP (δ=0.8, LP) 98.27 95.2 98.42 87.54 95.61 97.16 84.25 63.1
AutoTSP (δ=0.2, LR) 92.47 82.72 91.12 83.56 88.9 61.48 89.46 88.21
AutoTSP (δ=0.5, LR) 94.23 80.23 96.56 82.66 88.9 61.48 93.83 88.21
AutoTSP (δ=0.8, LR) 91.42 80.23 91.12 82.66 88.08 60.9 93.83 88.21
TABLE VII: Comparison of test accuracy, LP and LR with ResNet110 for different δ values when λ is set to 0.5

VII-D Choosing a different training stop point:

Tables VIII, IX and X show the results for different architectures when a different training stop point is chosen within the training stop region. We implemented the last non-zero element within the training stop region as the training stop point instead of the maximum cumulative sum (argmax). However, we observed that argmax performs slightly better than the last non-zero element.

CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
AutoTSP (max acc) 85.01 75.86 88.28 76.31 57.33 46.65 56.03 42.25
AutoTSP (last non-zero acc) 84.7 74.47 86.07 76.31 57.33 46.65 56.03 42.25
AutoTSP (max LP) 97.6 95.2 97.54 87.54 94.5 96.45 84.25 63.1
AutoTSP (last non-zero LP) 98.27 93.94 95.64 87.54 94.5 97.16 84.25 63.1
AutoTSP (max LR) 94.23 80.23 96.56 82.66 88.9 61.48 93.83 88.21
AutoTSP (last non-zero LR) 91.42 82.39 97.04 82.66 88.9 60. 93.83 88.21
TABLE VIII: Comparison results of test accuracy, LP and LR with ResNet110 for different training stop points
CIFAR 10 CIFAR 100
Symmetric Asymmetric Symmetric Asymmetric
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
AutoTSP (max acc) 83.26 79.19 86.89 81.82 57.06 42.05 59.99 42.98
AutoTSP (last non-zero acc) 83.26 77.64 87.58 79.15 56.91 42.43 58.36 42.61
AutoTSP (max LP) 98.99 95.48 97.26 93.35 99.41 98.25 85.21 60.19
AutoTSP (last non-zero LP) 98.99 93.95 96.93 86.75 98.62 97.62 84 60.35
AutoTSP (max LR) 88.12 84.71 94.87 87.02 68.85 47.8 92.85 96.43
AutoTSP (last non-zero LR) 88.12 87.17 96.32 89.29 75.5 50.29 93.1 96.89
TABLE IX: Comparison results of test accuracy, LP and LR with CNN9 for different training stop points
CIFAR 10 CIFAR 100
Sym Asym Sym Asym
0.2 0.5 0.2 0.4 0.2 0.5 0.2 0.4
AutoTSP (max acc) 80.42 77.82 86.21 77.09 52.45 45.77 52.73 33.24
AutoTSP (last non-zero acc) 80.42 77.39 86.21 77.09 54.64 45.77 54 36.85
AutoTSP (max LP) 99.07 95.64 98.87 90.74 99.58 97.31 97.64 77.67
AutoTSP (last non-zero LP) 99.07 94.97 98.87 90.74 99.51 97.31 96.8 77.08
AutoTSP (max LR) 84.3 83.58 91.79 80.9 59.83 55.59 61.37 37.94
AutoTSP (last non-zero LR) 84.3 84.56 91.79 80.9 67.09 55.59 66.64 44.1
TABLE X: Comparison results of test accuracy, LP and LR with ResNet32 for different training stop points

VII-E Architecture description:

The 9-layer CNN architecture we used is described below in Table XI.

CNN-9
3x3 conv, 128, ReLU
Batchnorm
3x3 conv, 128, ReLU
Batchnorm
3x3 conv, 128, ReLU
Batchnorm
2x2 maxpool
3x3 conv, 256, ReLU
Batchnorm
3x3 conv, 256, ReLU
Batchnorm
3x3 conv, 256, ReLU
Batchnorm
2x2 maxpool
3x3 conv, 512, ReLU
Batchnorm
3x3 conv, 512, ReLU
Batchnorm
3x3 conv, 512, ReLU
Batchnorm
average pool
Fully connected layer
TABLE XI: 9-layer CNN architecture
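For reference, a plausible PyTorch rendering of this architecture, assuming CIFAR-sized 3-channel inputs and a configurable number of classes; we use the common conv→BatchNorm→ReLU ordering, and details not stated in the table (padding, pooling style) are our own assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution followed by batch normalization and ReLU
    return [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]

def cnn9(num_classes=10):
    layers = (conv_block(3, 128) + conv_block(128, 128) + conv_block(128, 128)
              + [nn.MaxPool2d(2)]
              + conv_block(128, 256) + conv_block(256, 256) + conv_block(256, 256)
              + [nn.MaxPool2d(2)]
              + conv_block(256, 512) + conv_block(512, 512) + conv_block(512, 512)
              + [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, num_classes)])
    return nn.Sequential(*layers)
```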