When does loss-based prioritization fail?

by Niel Teng Hu, et al.

Not all examples are created equal, but standard deep neural network training protocols treat each training point uniformly. Each example is propagated forward and backward through the network the same number of times, independent of how much the example contributes to the learning protocol. Recent work has proposed ways to accelerate training by deviating from this uniform treatment. Popular methods entail up-weighting examples that contribute more to the loss, with the intuition that examples with low loss have already been learned by the model, so their marginal value to the training procedure should be lower. This view assumes that updating the model with high-loss examples will be beneficial to the model. However, this may not hold for noisy, real-world data. In this paper, we theorize and then empirically demonstrate that loss-based acceleration methods degrade in scenarios with noisy and corrupted data. Our work suggests that measures of example difficulty need to correctly separate out noise from other types of challenging examples.




1 Introduction

Recent years have seen a rapid explosion in the size and training costs of deep neural networks. Training ever larger neural networks exacerbates the cost and time required for training (Amodei et al., 2018) and the carbon footprint of the model (Strubell et al., 2019). Long training times and high cost hinder the democratization of deep learning research and applications for people with limited budgets or few resources (Hooker, 2020; Obando-Ceron and Castro, 2020).

However, much of this cost could be avoided with more efficient training. A typical training regime makes the costly choice of treating all examples equally even though the value of the information in each example may not be the same. Assuming all examples are equally important and propagating each forward and backward through the network the same number of times results in redundancy and inefficient use of the training budget.

Recent work proposes accelerating training by treating data examples differently. Jiang et al. (2019) and Katharopoulos and Fleuret (2018) have proposed loss-based sampling methods to speed up deep neural network training by emphasizing the high loss examples. Although such methods achieve significant acceleration, the merits of each have been evaluated on small curated research datasets that do not necessarily represent the messiness of real world data. In large scale training settings, models are often trained on datasets with unknown quality and degree of input or label corruption (Tsipras et al., 2020; Hooker et al., 2020; Beyer et al., 2020).

Intuitively, these real-world settings present a particular challenge for loss-based sampling approaches because two distinct categories of examples have high loss: (a) difficult or low-frequency examples and (b) corrupted, noisy, or mislabeled examples. The former are more useful for training than the median example and are adeptly selected by loss-based approaches. The latter, however, are less useful than the median. In fact, mislabeled examples hurt training and decrease generalization, so loss-based methods that systematically boost their salience during training may do more harm than good.

In this paper we evaluate the robustness of loss-based sampling methods to varying levels of dataset noise and corruption. Our goal is to spur a more nuanced discourse about what we mean by hardness of examples and the assumptions motivating loss-based prioritization methods. We define three types of modifications to create artificial corrupted examples: (1) label randomization, (2) pixel shuffling, and (3) replacing inputs with Gaussian noise. We have observed that all these transformations corrupt examples such that a human can no longer identify the target class. Thus, our definition of noise centers on corruptions which introduce sufficient stochasticity that the mapping can no longer be learnt by a human.

We study the performance of two representative loss-based sampling methods, Selective Backprop (SB) (Jiang et al., 2019) and the variance reduction importance sampling method proposed by Katharopoulos and Fleuret (2018), on artificially corrupted datasets and try to explain the intuition behind the failure cases. We find that the acceleration provided by these methods breaks down on supervised learning tasks with corrupted or noisy information, although the variance reduction importance sampling method does not degrade as severely as SB. We show that for both methods considered, the degradation in performance occurs due to the up-weighting of these high-loss, out-of-distribution examples.

Implications of work.

Loss-based acceleration methods present the attractive proposition of being able to accelerate any loss-based training method, but as we show in this paper, this assumes that the highest loss examples are informative for the task at hand. Often, the source of noise in challenging examples is irreducible. Even with an up-weighting strategy, a useful mapping between input and output space cannot be learnt. Real world datasets contain varying degrees and types of corruption, and as we show here, corruption causes these acceleration methods to degrade. This suggests that either better acceleration methods are needed or that rigorous data pre-processing to remove corruptions should be applied before deploying these methods. A rich subject of future research is developing estimation methods that distinguish between noisy examples which can derail training and the more informative atypical examples.

2 Loss-based Sampling Methods

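In a supervised learning setting, let $(x_i, y_i)$ be the $i$-th example of the training set, where $x_i$ represents the input tensor to the network and $y_i$ represents the label. Let $f_\theta$ be the neural network model parameterized by learnable parameters $\theta$ and $\mathcal{L}$ the cross-entropy loss. The goal of training is to find

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big),$$

where $N$ represents the number of examples in the training set. In mini-batch SGD, we uniformly sample a batch $\mathcal{B}_t$ of $B$ examples from the dataset without replacement and use the average gradient from these examples to update the parameters, here using a learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta \, \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_{\theta} \mathcal{L}\big(f_{\theta_t}(x_i), y_i\big).$$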

As may be seen in the equation above, the model weights all training examples equally. In contrast, the loss-based sampling methods that we evaluate up-weight challenging examples. We introduce both methods below:

Selective Backprop (SB). SB (Jiang et al., 2019) is a framework proposed to prioritize learning high loss examples every iteration. In the original paper, SB converges to target error rates up to 3.5x faster than standard SGD and can be further accelerated by using stale forward pass information.

SB works by maintaining a moving histogram of size $R$ and a buffer for candidate examples. At each iteration, SB computes forward passes to calculate the loss of each example. Then, for each example, its loss is input to a function which outputs the probability that it should be sampled. Samples are then chosen and pushed into the candidate buffer. When the buffer size exceeds the batch size $B$, the first $B$ examples are used to compute the model gradient updates.

The probability of each example being sampled is calculated from the CDF of its loss with respect to the histogram:

$$P(\mathcal{L}_i) = \big[\mathrm{CDF}_R(\mathcal{L}_i)\big]^{\beta}.$$

$\beta$ is a hyper-parameter controlling the selectivity of the algorithm. The percentage of examples selected in a batch is approximately $1/(\beta + 1)$, so $\beta = 0$ corresponds to 100% selection (regular SGD), $\beta = 1$ corresponds to 50% selection with linearly ramping probability from 0% probability of the minimum-loss example being chosen to 100% probability of the highest-loss example being chosen, and $\beta = 2$ corresponds to 33% selection with a similarly arranged quadratic ramp.
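The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference Selective Backprop implementation: the class name `SelectiveBackpropSampler`, the parameter names `history_size` and `beta`, and the choice to re-sort the history on every call are all assumptions made for readability.

```python
import bisect
import random
from collections import deque


class SelectiveBackpropSampler:
    """Toy loss-percentile sampler in the spirit of Selective Backprop."""

    def __init__(self, history_size=1024, beta=1.0, seed=0):
        self.history = deque(maxlen=history_size)  # most recent losses
        self.beta = beta                            # selectivity exponent
        self.rng = random.Random(seed)

    def select_probability(self, loss):
        """Empirical CDF of `loss` over the history, raised to the power beta."""
        if not self.history:
            return 1.0  # nothing to compare against yet: always select
        ordered = sorted(self.history)  # simple but O(n log n) per call
        cdf = bisect.bisect_right(ordered, loss) / len(ordered)
        return cdf ** self.beta

    def maybe_select(self, loss):
        """Decide whether to keep the example, then record its loss."""
        prob = self.select_probability(loss)
        self.history.append(loss)
        return self.rng.random() < prob


def expected_selection_fraction(beta):
    """Mean of u**beta for u ~ Uniform(0, 1), which equals 1 / (beta + 1)."""
    return 1.0 / (beta + 1.0)
```

If losses are spread evenly over the history, the CDF value of a fresh loss is roughly uniform on [0, 1], which is where the 100%, 50%, and 33% selection rates for exponents 0, 1, and 2 come from. Note that a corrupted example with a persistently high loss sits near the top of the CDF and is therefore selected almost every time it is seen.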

Variance Reduction Importance Sampling (VR). We also consider the importance sampling approach of Katharopoulos and Fleuret (2018) using loss values. The paper proposes an importance sampling scheme that prioritizes computation on examples that reduce the variance of the gradient estimates.

Similarly to SB, the method maintains a pool of pre-sampling candidate examples. During training, mini-batches are pushed into this pool and, once the size of the pool exceeds a predefined size, the algorithm samples examples from a distribution proportional to their loss values. The size of the pre-sampling pool is an additional hyper-parameter. The paper also derives an estimator of the variance reduction and switches importance sampling on when it is estimated to produce a speedup.
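The pool-then-resample step can be illustrated with a short sketch. It assumes sampling with replacement via the standard library's `random.choices` and a uniform fallback for an all-zero pool; the function name `sample_proportional_to_loss` and both of those choices are illustrative assumptions, not details taken from Katharopoulos and Fleuret (2018).

```python
import random


def sample_proportional_to_loss(pool_losses, batch_size, seed=0):
    """Draw `batch_size` pool indices with probability proportional to loss.

    Sampling is done with replacement, so an index can appear more than once.
    """
    total = sum(pool_losses)
    if total <= 0:
        # Degenerate pool (all losses zero): fall back to uniform sampling.
        weights = [1.0] * len(pool_losses)
    else:
        weights = list(pool_losses)
    rng = random.Random(seed)
    return rng.choices(range(len(pool_losses)), weights=weights, k=batch_size)
```

A single example whose loss dwarfs the rest of the pool will dominate the resampled batch, which is the failure mode of interest when that example is corrupted.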

In this paper, we specifically test the loss-based, no up-weighting, no warm-up version of importance sampling from Katharopoulos and Fleuret (2018), because this is the version adopted by Jiang et al. (2019) in their tests.

Corruption                Pristine        Random Labels                   Shuffle Pixels                  Gaussian Generated
Algorithm                 0%              25%             50%             25%             50%             25%             50%
Standard SGD              1x (4.28%)      1x (8.53%)      1x (13.27%)     1x (4.98%)      1x (6.88%)      1x (5.27%)      1x (6.72%)
SB (50% selectivity)      2.0x (4.26%)    2.0x (9.03%)    1.5x (14.66%)   2.0x (5.48%)    2.0x (7.71%)    2.0x (5.68%)    - (8.47%)
SB (33% selectivity)      3.0x (4.39%)    - (12.28%)      - (33.73%)      - (6.31%)       - (9.95%)       - (7.18%)       - (17.92%)
VR (max 33% selectivity)  1.7x (4.81%)    1.0x (8.65%)    1.0x (13.20%)   1.1x (5.71%)    1.0x (6.95%)    1.0x (5.35%)    1.0x (6.89%)
VR (max 50% selectivity)  1.8x (4.87%)    1.0x (8.74%)    1.0x (13.14%)   - (6.27%)       1.1x (7.69%)    1.1x (6.30%)    1.0x (6.70%)

Table 1: Speedup and best test error (in parentheses) of each method on pristine and corrupted CIFAR10. A dash indicates that the method did not reach the target error.


3 Experiments

We evaluate the SB and VR methods with a standard image classification dataset, CIFAR10 (Krizhevsky and Hinton, 2009), as well as corrupted variations of it. We first explain how we create noisy examples.

3.1 Creating Noisy examples

We denote the clean training and test datasets as $D_{\text{train}}$ and $D_{\text{test}}$ respectively, and let $f_\theta$ be the neural network model trained on $D_{\text{train}}$, where $\theta$ are the learnable parameters. The objective is to apply modifications to examples $(x, y) \in D_{\text{train}}$, with inputs $x \in \mathbb{R}^{d}$ and outputs $y \in \{1, \dots, C\}$, where $C$ is the number of classes.

We consider the following modifications on an example $(x, y)$ to artificially introduce noise:

1. Random labels: The output $y$ is replaced by $\tilde{y}$, which is uniformly sampled from the set $\{1, \dots, C\}$, including the original label.

2. Shuffled pixels: A single random permutation $\pi$ is chosen for the task and is applied to turn each $x$ into a permuted version $\tilde{x} = \pi(x)$.

3. Gaussian: $x$ is replaced by an $\tilde{x}$ in which each pixel is sampled from a Gaussian distribution whose mean and variance match the mean and variance of pixel values in $x$.
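The three corruptions listed above can be sketched with the Python standard library as follows. Images are represented as flattened lists of pixel values for simplicity, and the Gaussian statistics are taken from the image being replaced; both choices, and all function names, are assumptions made for this illustration rather than the authors' code.

```python
import random
import statistics


def randomize_label(num_classes, rng):
    """Random labels: draw a label uniformly from all classes (original included)."""
    return rng.randrange(num_classes)


def make_pixel_permutation(num_pixels, rng):
    """Shuffled pixels: build one fixed permutation shared by all corrupted images."""
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def shuffle_pixels(image, perm):
    """Apply the shared permutation to a flattened image."""
    return [image[j] for j in perm]


def gaussian_image(image, rng):
    """Gaussian: resample every pixel from a normal distribution whose mean and
    standard deviation match those of the image being replaced (an assumption)."""
    mu = statistics.fmean(image)
    sigma = statistics.pstdev(image)
    return [rng.gauss(mu, sigma) for _ in image]
```

Because the random-label draw includes the original class, a "corrupted" example keeps its true label with probability one over the number of classes, so the effective label-noise rate is slightly below the nominal corruption rate.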


These noise-inducing methods, while artificial, attempt to recreate the real-world scenario where, for example, label error is commonly seen (Snow et al., 2008; Yan et al., 2014) or hardware collection noise adds artefacts to a subset of the data.

3.2 Evaluating loss-based sampling

We evaluate using CIFAR10, which contains 50,000 training images and 10,000 test images divided into 10 classes; each example is a 32×32 image with three color channels. We randomly sample 25% or 50% of the dataset and apply the modifications described in the previous subsection.

For each corruption type and percentage, we run five variants of the acceleration algorithms: SB-0 (Selective Backprop with 100% selection, which is standard SGD), SB-1 (Selective Backprop with 50% selection), SB-2 (33% selection), VR with max selectivity 33%, and VR with max selectivity 50%.

We train a Wide Residual Network (Zagoruyko and Komodakis, 2017) with depth = 28 and widen_factor = 10. We use batch_size = 128, lr = 0.1 and momentum = 0.9. We use a standard SGD optimizer with weight_decay = 0.0005. The learning rate is decayed at epochs 60 and 80, and training runs for 100 epochs. For each training image, we first crop the image at a random location with padding on the borders, then randomly flip it horizontally, and lastly normalize it. For the test set, we only normalize the images, without data augmentation. We ran each experiment with 5 random seeds and averaged the results.

In this paper, consistent with prior work (Jiang et al., 2019), we use the number of examples back-propagated to reach a target test error as a proxy measure for speedup. We record the best test error of standard non-accelerated training and set the target to 1.2x that error. When measuring the speedup of an acceleration method, we use the number of back-propagations it needs to reach this target as the measure of speed.
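The speedup measure can be made concrete with a small sketch. It assumes each training run is logged as a list of `(num_backprops, test_error)` pairs in training order; that logging format and the function names are hypothetical.

```python
def backprops_to_reach(error_curve, target_error):
    """Return the first back-propagation count at which the test error is at or
    below `target_error`, or None if the run never gets there."""
    for num_backprops, test_error in error_curve:
        if test_error <= target_error:
            return num_backprops
    return None


def speedup(baseline_curve, method_curve, slack=1.2):
    """Speedup of a method over the baseline at `slack` times the baseline's best error."""
    best_baseline_error = min(err for _, err in baseline_curve)
    target = slack * best_baseline_error
    baseline_cost = backprops_to_reach(baseline_curve, target)
    method_cost = backprops_to_reach(method_curve, target)
    if method_cost is None:
        return None  # the method never reached the target error
    return baseline_cost / method_cost
```

A return value of `None` corresponds to a run that never reaches the target error, so no speedup can be reported for it.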

3.3 Results

The results are shown in Table 1. A test error threshold is chosen by running standard SGD training and multiplying the best error achieved by 1.2. Speedup is the number of back-propagations standard SGD requires to reach this threshold divided by the number required by the given method. Dashes indicate that the network was unable to reach the threshold error. Numbers in parentheses indicate the best test error achieved by the method, averaged over five runs. With 0% corruption, both SB and VR significantly accelerate training. When corruption is applied to the data, both methods either fail to deliver a speedup or attain a worse test error.

We find that on the pristine (uncorrupted) CIFAR10 dataset, both SB and VR are able to reach the target test error with a speedup. As we increase the amount of corruption in the dataset, the speedup degrades relative to the pristine dataset and both algorithms converge to a higher test error. Training plots are included in the supplementary material.

Notably, VR did not degrade as severely as SB when more corruptions were introduced into the datasets. Katharopoulos and Fleuret (2018) show that the variance reduction is proportional to the squared L2 distance between the sampling distribution and the uniform distribution. By tracking this squared L2 distance, VR only enables importance sampling when there is a guaranteed variance reduction. We believe that in our experiments with artificially corrupted examples, VR disabled importance sampling for most steps, which leads to only minor degradation vs. standard training but does not produce a speedup.
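The quantity VR tracks can be illustrated as follows. The sketch computes the squared L2 distance between a loss-proportional sampling distribution and the uniform distribution; the gating function and its `threshold` argument are hypothetical stand-ins for the estimator derived by Katharopoulos and Fleuret (2018), which is not reproduced here.

```python
def squared_l2_to_uniform(losses):
    """Squared L2 distance between the loss-proportional distribution and uniform."""
    n = len(losses)
    total = sum(losses)
    if n == 0 or total <= 0:
        return 0.0  # treat a degenerate pool as already uniform
    return sum((loss / total - 1.0 / n) ** 2 for loss in losses)


def use_importance_sampling(losses, threshold):
    """Hypothetical gate: enable importance sampling only when the sampling
    distribution is far enough from uniform (threshold is an assumed knob)."""
    return squared_l2_to_uniform(losses) > threshold
```

When all losses in the pool are similar, the distance is near zero and the gate stays closed, so training falls back to uniform sampling.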

Figure 1: 16 images sampled from the 2000 most frequent picks by Selective-Backprop on (top) shuffle-pixel CIFAR10 vs. (bottom) the pristine dataset. SB tends to prioritize noisy images instead of the difficult (high-loss) noiseless examples that would lead to accelerated, generalizable learning.

4 Discussion

Why do the methods degrade on corrupted datasets? We hypothesize that the degradation is caused by the modified examples being prioritized by the acceleration method. Randomizing labels deliberately destroys the link between inputs and target labels, so the affected examples carry no value for generalization. For Gaussian-generated and pixel-shuffled images, the modification transforms the inputs into noise-like images while only preserving some global statistics. These modifications eliminate structure in the images and create out-of-distribution examples that do not contribute to generalization.

We visually examine the top picked examples from both datasets. For comparison, in Figure 1 we sample 16 images (belonging to one class) from the 2000 most frequent picks by SB on the modified and unmodified datasets. In Figure 2, we sample 16 of the most frequent and least frequent picks by SB for two classes (frogs and dogs).

Figure 2: Images sampled by SB and VR on the pristine CIFAR10 dataset. Shown are 16 images sampled from the 2000 (first row) most frequent picks by SB, (second row) least frequent picks by SB, (third row) most frequent picks by VR, and (fourth row) least frequent picks by VR for a given single class. The left column shows images from the “frog” class and the right shows images from the “dog” class.

It is observed that examples less likely to be picked often share similar visual characteristics: similar shape, contours, color or image composition. Their similarity makes them more redundant during training, so some may be dropped without damaging the training process. On pristine datasets, those examples more likely to be picked show more diverse compositions. Unfortunately, when datasets are corrupted, those corrupted examples tend to be chosen.

Figure 3: Percentage of corrupted examples selected for inclusion in CIFAR10 training batches where 25% of images are corrupted via (left) random labels, (middle) Gaussian generated pixels, and (right) shuffled pixels. Both SB and VR sample corrupted examples more often than clean examples, which degrades training performance.

We hypothesized that for corrupted datasets, both atypical examples and corrupted examples produce high losses, leading the acceleration method to prioritize both. To validate this hypothesis, we plot the percentage of corrupted examples included in each training iteration (Figure 3) and find that across different types of corruption, both SB and VR consistently over-sample corrupted examples compared to the constant percentage of standard training. The results suggest that, contrary to the promises of loss-based acceleration methods, these methods might actually hurt training on partially corrupted datasets.
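The per-iteration statistic described here is simple to compute if the indices of the corrupted examples are recorded when the dataset is built. A minimal sketch, with hypothetical names:

```python
def corrupted_fraction(batch_indices, corrupted_indices):
    """Fraction of a training batch made up of corrupted examples."""
    if not batch_indices:
        return 0.0
    corrupted = set(corrupted_indices)
    hits = sum(1 for i in batch_indices if i in corrupted)
    return hits / len(batch_indices)
```

Under uniform sampling this fraction hovers around the dataset's corruption rate, so values consistently above that rate indicate that the sampler is over-selecting corrupted examples.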

Figure 4: Test error over the course of training for SB with entropy prioritization target vs. vanilla SGD and SB with Loss. Results are shown on CIFAR10 with (left) 50% random labels, (middle) 50% Gaussian generated images, and (right) 50% images with shuffled pixels.

Entropy-Based Selective Backprop (SBE). In an attempt to fix SB’s vulnerability, we switch the prioritization target from the cross-entropy loss to the model’s uncertainty, calculated as the entropy of the prediction distribution. Intuitively, entropy-based sampling encourages the model to learn those examples that cause the most uncertainty over multiple classes.

The method works similarly to SB, except that at each iteration SBE computes forward passes, calculates the entropy of the prediction distribution for every example, and then updates the moving histogram. The probability of each example being sampled is calculated from the CDF of its entropy with respect to the histogram. Samples are then chosen and pushed into the candidate buffer. When the buffer size exceeds the batch size $B$, the first $B$ examples are fetched to compute the gradient.
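The difference between the two prioritization targets can be seen in a small sketch that computes both quantities from the same prediction. The helper names are assumptions, and the example probabilities in the usage note are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Cross-entropy loss of a predicted distribution against an integer label."""
    return -math.log(probs[label])
```

For instance, with predicted probabilities `[0.98, 0.01, 0.01]` and a (possibly wrong) label of class 1, the cross-entropy loss is large while the entropy is small. One plausible reading is that a confidently predicted but mislabeled example would rank high under loss-based SB yet low under SBE, since entropy does not depend on the label at all.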

As shown in Figure 4, we find that this approach improves training significantly in the face of random labeling corruptions, but it performs worse than loss-based SB when datasets contain Gaussian generated examples. In CIFAR10 with 50% random labels, entropy-based SB is able to achieve the 1.2x standard test error with a 2.0x speedup. In CIFAR10 with 50% Gaussian generated examples, entropy-based SB performs slightly worse than loss-based SB.

5 Related Work

Many methods have been proposed to improve neural network training. Curriculum learning (Bengio et al., 2009) supplies easy and prototypical examples first and gradually increases their difficulty, which has been shown to benefit the overall generalization of the model. In real-world applications where identifying easy and hard examples can be difficult, self-paced learning (Kumar et al., 2010) infers the difficulty of examples from their corresponding training loss during training.

Another common approach is to use importance sampling. The basic approach is to over-sample a subset of examples, then to weight them with the inverse of the sampling probability so the gradient estimator is unbiased (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018; Gao et al., 2015). Among these works, many (Katharopoulos and Fleuret, 2018; Loshchilov and Hutter, 2015; Schaul et al., 2016) use the loss to generate the sampling distribution and sample examples in proportion to the historical loss. Most methods require maintaining a data structure proportional to the training set in size, e.g. the full history of training loss for each example, or training an extra auxiliary DNN. In contrast, the method of Katharopoulos and Fleuret (2018) does not need to maintain historical data for each example and therefore may scale to large datasets and to the incremental learning setting.

In addition, Jiang et al. (2019) propose a novel framework to prioritize high-loss examples and speed up training without maintaining a data structure proportional to the training set in size. Chang et al. (2018) propose up-weighting examples based on estimates of model uncertainty. Yoon et al. (2019) propose a meta-learning framework which models the value of each example using a deep neural network.

Mitigating the impact of noisy labels has also been the subject of considerable research (Tanaka et al., 2018).

6 Conclusions

In this paper, we showed that the acceleration from prioritizing high loss examples does not always hold when we cannot guarantee the dataset is high quality. We showed the degradation of using loss-based acceleration through experiments with artificially corrupted datasets. In our experiments, two acceleration algorithms actually hurt the training by over-sampling the corrupted examples, which confirmed our hypothesis that loss is an imperfect estimate of how challenging an example is. Up-weighting based upon loss is not robust in the face of irreducible noise.


  • D. Amodei, D. Hernandez, G. Sastry, J. Clark, G. Brockman, and I. Sutskever (2018) AI and compute.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. New York, NY, USA, pp. 41–48.
  • L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. van den Oord (2020) Are we done with imagenet? arXiv:2006.07159.
  • H. Chang, E. Learned-Miller, and A. McCallum (2018) Active bias: training more accurate neural networks by emphasizing high variance samples. arXiv:1704.07433.
  • J. Gao, H. V. Jagadish, and B. C. Ooi (2015) Active sampler: light-weight accelerator for complex data analytics at scale. arXiv:1512.03880.
  • S. Hooker, A. Courville, G. Clark, Y. Dauphin, and A. Frome (2020) What do compressed deep neural networks forget? arXiv:1911.05248.
  • S. Hooker (2020) The hardware lottery. arXiv:2009.06489.
  • A. H. Jiang, D. L. -K. Wong, G. Zhou, D. G. Andersen, J. Dean, G. R. Ganger, G. Joshi, M. Kaminsky, M. Kozuch, Z. C. Lipton, and P. Pillai (2019) Accelerating deep learning by focusing on the biggest losers. arXiv:1910.00762.
  • T. B. Johnson and C. Guestrin (2018) Training deep models faster with robust, approximate importance sampling. pp. 7265–7275.
  • A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. CoRR abs/1803.00942.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto.
  • M. P. Kumar, B. Packer, and D. Koller (2010) Self-Paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 1189–1197.
  • I. Loshchilov and F. Hutter (2015) Online batch selection for faster training of neural networks. CoRR abs/1511.06343.
  • J. S. Obando-Ceron and P. S. Castro (2020) Revisiting rainbow: promoting more insightful and inclusive deep reinforcement learning research. arXiv:2011.14826.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico.
  • R. Snow, B. O’Connor, D. Jurafsky, and A. Ng (2008) Cheap and fast – but is it good? evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp. 254–263.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. arXiv:1906.02243.
  • D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. arXiv:1803.11364.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Ilyas, and A. Madry (2020) From imagenet to image classification: contextualizing progress on benchmarks. ArXiv abs/2005.11295.
  • Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy (2014) Learning from multiple annotators with varying expertise. Machine Learning 95.
  • J. Yoon, S. O. Arik, and T. Pfister (2019) Data valuation using reinforcement learning. arXiv:1909.11671.
  • S. Zagoruyko and N. Komodakis (2017) Wide residual networks. arXiv:1605.07146.

Appendix S1 Additional plots

In Figure S5 we show the performance of the methods under consideration on a dataset without any corruption. In Figures S6 to S11, we depict the performance in the face of various types and amounts of corruption.

Figure S5: No Dataset Corruption
Figure S6: Corrupted Labels (25%)
Figure S7: Corrupted Labels (50%)
Figure S8: Gaussian (25%)
Figure S9: Gaussian (50%)
Figure S10: Shuffle (25%)
Figure S11: Shuffle (50%)