1 Introduction
Recent years have seen a rapid growth in the size and training cost of deep neural networks. Training ever larger networks increases the cost and time required for training (Amodei et al., 2018) and the carbon footprint of the resulting models (Strubell et al., 2019). Long training times and high costs hinder the democratization of deep learning research and applications for people with limited budgets or resources (Hooker, 2020; Obando-Ceron and Castro, 2020). However, much of this cost could be avoided with more efficient training. A typical training regime makes the costly choice of treating all examples equally, even though the value of the information in each example may differ. Assuming all examples are equally important, and propagating each forward and backward through the network the same number of times, results in redundancy and an inefficient use of the training budget.
Recent work proposes accelerating training by treating data examples differently. Jiang et al. (2019) and Katharopoulos and Fleuret (2018) have proposed loss-based sampling methods to speed up deep neural network training by emphasizing the high loss examples. Although such methods achieve significant acceleration, the merits of each have been evaluated on small curated research datasets that do not necessarily represent the messiness of real world data. In large scale training settings, models are often trained on datasets with unknown quality and degree of input or label corruption (Tsipras et al., 2020; Hooker et al., 2020; Beyer et al., 2020).
Intuitively, these real-world settings present a particular challenge for loss-based sampling approaches because two distinct categories of examples have high-loss: (a) difficult or low-frequency examples and (b) corrupted, noisy, or mislabeled examples. The former are more useful for training than the median example and are adeptly selected by loss-based approaches. But the latter are less useful than the median. In fact, mislabeled examples hurt training and decrease generalization, and so loss-based methods that systematically boost their salience during training may do more harm than good.
In this paper we evaluate the robustness of loss-based sampling methods to varying levels of dataset noise and corruption. Our goal is to spur a more nuanced discourse about what we mean by the hardness of examples and the assumptions motivating loss-based prioritization methods. We define three types of modifications to create artificially corrupted examples: (1) label randomization, (2) pixel shuffling, and (3) replacing inputs with Gaussian noise. All three transformations corrupt examples such that a human can no longer identify the target class. Thus, our definition of noise centers on corruptions which introduce sufficient stochasticity that the mapping can no longer be learnt by a human.
We study the performance of two representative loss-based sampling methods, Selective Backprop (SB) (Jiang et al., 2019) and the variance-reduction importance sampling method of Katharopoulos and Fleuret (2018), on artificially corrupted datasets, and we explain the intuition behind the failure cases. We find that the acceleration these methods provide breaks down on supervised learning tasks with corrupted or noisy data, although the variance-reduction importance sampling method does not degrade as severely as SB. We show that for both methods, the degradation in performance occurs because these high-loss, out-of-distribution examples are up-weighted.
Implications of this work.
Loss-based acceleration methods present the attractive proposition of being able to accelerate any loss-based training method, but as we show in this paper, this assumes that the highest loss examples are informative for the task at hand. Often, the source of noise in challenging examples is irreducible. Even with an up-weighting strategy, a useful mapping between input and output space cannot be learnt. Real world datasets contain varying degrees and types of corruption, and as we show here, corruption causes these acceleration methods to degrade. This suggests that either better acceleration methods are needed or that rigorous data pre-processing to remove corruptions should be applied before deploying these methods. A rich subject of future research is developing estimation methods that distinguish between noisy examples which can derail training and the more informative atypical examples.
2 Loss-based Sampling Methods
In a supervised learning setting, let $(x_i, y_i)$ be the $i$-th example of the training set, where $x_i$ represents the input tensor to the network and $y_i$ represents the label. Let $f(\cdot\,; \theta)$ be the neural network model parameterized by learnable parameters $\theta$, and let $\mathcal{L}$ be the cross-entropy loss. The goal of training is to find

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \theta), y_i\big),$$

where $N$ represents the number of examples in the training set. In mini-batch SGD, we uniformly sample a batch $\mathcal{B}$ of $b$ examples from the dataset without replacement and use the average gradient from these examples to update the parameters with learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{b} \sum_{i \in \mathcal{B}} \nabla_{\theta}\, \mathcal{L}\big(f(x_i; \theta_t), y_i\big).$$
As may be seen in the equation above, the model weights all training examples equally. In contrast, the loss-based sampling methods that we evaluate up-weight challenging examples. We introduce both methods below:
Selective Backprop (SB). SB (Jiang et al., 2019) is a framework that prioritizes high-loss examples at every iteration. In the original paper, SB converges to target error rates up to 3.5x faster than standard SGD and can be accelerated further by using stale forward-pass information.
SB works by maintaining a moving histogram of recent losses and a buffer of candidate examples. At each iteration, SB runs forward passes to calculate the loss of each example; each loss is then mapped to a probability that the example should be sampled, and sampled examples are pushed into the candidate buffer. When the buffer size exceeds the batch size $b$, the first $b$ examples are used to compute the model gradient update. The probability of each example being sampled is computed from the CDF of its loss with respect to the histogram, raised to the power of a hyper-parameter $\beta$ that controls the selectivity of the algorithm. The expected percentage of examples selected in a batch is $\frac{1}{\beta + 1}$, so $\beta = 0$ corresponds to 100% selection (regular SGD), $\beta = 1$ corresponds to 50% selection with the selection probability ramping linearly from 0% for the minimum-loss example to 100% for the highest-loss example, and $\beta = 2$ corresponds to 33% selection with a similarly arranged quadratic ramp.
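To make the selection rule concrete, here is a minimal sketch of SB-style filtering in Python. It assumes losses are already available from a forward pass; `LossHistogram`, `select_probability`, and `selective_backprop_step` are illustrative names, not the authors' implementation.

```python
import random
from collections import deque

class LossHistogram:
    """Sliding window of recent example losses, used to estimate the loss CDF."""

    def __init__(self, capacity=1024):
        self.losses = deque(maxlen=capacity)

    def add(self, loss):
        self.losses.append(loss)

    def cdf(self, loss):
        # Fraction of recent losses that are <= this loss (its percentile).
        if not self.losses:
            return 1.0
        return sum(l <= loss for l in self.losses) / len(self.losses)

def select_probability(percentile, beta):
    # beta = 0 keeps everything (standard SGD); larger beta is more selective.
    return percentile ** beta

def selective_backprop_step(examples, losses, hist, beta, buffer, batch_size):
    """Filter a forward-pass batch into the candidate buffer, SB-style."""
    for example, loss in zip(examples, losses):
        hist.add(loss)
        if random.random() < select_probability(hist.cdf(loss), beta):
            buffer.append(example)
    if len(buffer) >= batch_size:
        # The first batch_size buffered examples get the (expensive) backward pass.
        return [buffer.pop(0) for _ in range(batch_size)]
    return None
```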
Variance Reduction Importance Sampling (VR). We also consider the loss-based importance sampling approach of Katharopoulos and Fleuret (2018). The paper proposes an importance sampling scheme that prioritizes computation on examples that reduce the variance of the gradient estimates. Similarly to SB, the method maintains a pool of pre-sampling candidate examples. During training, mini-batches are pushed into this pool, and once the size of the pool exceeds a predefined threshold, the algorithm samples a batch of examples from a distribution proportional to their loss values. The size of the pre-sampling pool is an additional hyper-parameter introduced by the method. The paper also derives an estimator of the variance reduction and switches importance sampling on only when it is estimated to produce a speedup.
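The sketch below illustrates this pool-based scheme. The gating rule shown here, thresholding the squared L2 distance between the loss-proportional distribution and the uniform distribution, is a simplification of the paper's variance-reduction estimator, and the function name and `tau` parameter are assumptions of this sketch.

```python
import numpy as np

def importance_sample_from_pool(pool_losses, batch_size, tau=0.1, rng=None):
    """Choose `batch_size` indices from a pool of candidate losses.

    Sampling is proportional to the loss, but only when the induced distribution
    is far enough from uniform that a variance reduction (and hence a speedup)
    is expected; otherwise we fall back to uniform sampling, i.e. plain SGD.
    """
    rng = rng or np.random.default_rng()
    losses = np.asarray(pool_losses, dtype=np.float64)
    probs = losses / losses.sum()
    uniform = np.full_like(probs, 1.0 / len(probs))

    # Simplified gate: squared L2 distance to uniform as a proxy for the
    # variance-reduction estimate used in the original paper.
    if np.sum((probs - uniform) ** 2) < tau:
        probs = uniform

    return rng.choice(len(losses), size=batch_size, replace=False, p=probs)
```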
In this paper, we specifically test the loss-based, no up-weighting, no warm-up version of importance sampling from Katharopoulos and Fleuret (2018), because this is the version adopted by Jiang et al. (2019) in their experiments.
| Algorithm | Pristine 0% | Random Labels 25% | Random Labels 50% | Shuffled Pixels 25% | Shuffled Pixels 50% | Gaussian 25% | Gaussian 50% |
|---|---|---|---|---|---|---|---|
| Standard | 1x (4.28) | 1x (8.53) | 1x (13.27) | 1x (4.98) | 1x (6.88) | 1x (5.27) | 1x (6.72) |
| SB (50% selectivity) | 2.0x (4.26) | 2.0x (9.03) | 1.5x (14.66) | 2.0x (5.48) | 2.0x (7.71) | 2.0x (5.68) | - (8.47) |
| SB (33% selectivity) | 3.0x (4.39) | - (12.28) | - (33.73) | - (6.31) | - (9.95) | - (7.18) | - (17.92) |
| VR (max 33% selectivity) | 1.7x (4.81) | 1.0x (8.65) | 1.0x (13.20) | 1.1x (5.71) | 1.0x (6.95) | 1.0x (5.35) | 1.0x (6.89) |
| VR (max 50% selectivity) | 1.8x (4.87) | 1.0x (8.74) | 1.0x (13.14) | - (6.27) | 1.1x (7.69) | 1.1x (6.30) | 1.0x (6.70) |

table:speedup — Speedup relative to standard training and best test error (in parentheses) for each method and corruption setting. Dashes indicate the method did not reach the target error threshold.
3 Experiments
We evaluate the SB and VR methods with a standard image classification dataset, CIFAR10 (Krizhevsky and Hinton, 2009), as well as corrupted variations of it. We first explain how we create noisy examples.

3.1 Creating Noisy Examples
We denote the clean training and test datasets as $D_{\text{train}}$ and $D_{\text{test}}$ respectively, and let $f(\cdot\,; \theta)$ be the neural network model trained on $D_{\text{train}}$, where $\theta$ are the learnable parameters. The objective is to apply modifications to examples $(x_i, y_i) \in D_{\text{train}}$ with input $x_i$ and output $y_i \in \{1, \dots, C\}$, where $C$ is the number of classes.
We consider the following modifications to an example $(x_i, y_i)$ to artificially introduce noise (a minimal code sketch of all three follows the list):
1. Random labels: the output $y_i$ is replaced by a label $\hat{y}_i$ sampled uniformly from the set of $C$ classes, including the original label.
2. Shuffled pixels: a single random permutation of pixel positions is chosen for the task and applied to each selected input $x_i$ to produce a permuted version $\hat{x}_i$.
3. Gaussian: the input $x_i$ is replaced by an image $\hat{x}_i$ in which each pixel is sampled from a Gaussian distribution whose mean and variance match those of the original pixel values.
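The sketch below implements the three corruptions, assuming images are float NumPy arrays of shape (H, W, C) and labels are integers; the function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_label(num_classes=10):
    """Return a label drawn uniformly from all classes (may equal the original)."""
    return int(rng.integers(num_classes))

# A single pixel permutation shared by every shuffled image, as described above.
_PERMUTATION = None

def shuffle_pixels(x):
    """Apply one fixed random permutation to the pixel positions of an image."""
    global _PERMUTATION
    flat = x.reshape(-1, x.shape[-1])                 # (H*W, C)
    if _PERMUTATION is None:
        _PERMUTATION = rng.permutation(len(flat))
    return flat[_PERMUTATION].reshape(x.shape)

def gaussian_image(x):
    """Replace an image with Gaussian noise matching its pixel mean and variance."""
    return rng.normal(loc=x.mean(), scale=x.std(), size=x.shape)
```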
3.2 Evaluating Loss-based Sampling
We evaluate on CIFAR10, which contains 50,000 training images and 10,000 test images divided into 10 classes; each example is a 32×32 image with three color channels. We randomly sample 25% or 50% of the dataset and apply one of the modifications described in the previous subsection.
For each corruption type and percentage, we run 5 variants of the acceleration algorithms: SB-0 (Selective Backprop with 100% selection, i.e. standard SGD), SB-1 (Selective Backprop with 50% selection), SB-2 (33% selection), VR with max selectivity 33%, and VR with max selectivity 50%.
We train a Wide Residual Network (Zagoruyko and Komodakis, 2017) with depth = 28 and widen_factor = 10. We use batch_size = 128, lr = 0.1 and momentum = 0.9, with a standard SGD optimizer and weight_decay = 0.0005. The learning rate is decayed at epochs 60 and 80, and training runs for 100 epochs. For each training image, we first crop the image at a random location with padding on the borders, then randomly flip it horizontally, and lastly normalize it. For the test set, we only normalize the images, without data augmentation. We run each experiment with 5 random seeds and average the results.

In this paper, consistent with prior work (Jiang et al., 2019), we use the number of examples back-propagated to reach a target test error as a proxy measure for speedup. We record the best test error of standard, non-accelerated training, and when measuring the speedup of an acceleration method, we count the number of back-propagated examples needed to reach 1.2x this non-accelerated best test error.
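As an illustration of this proxy metric, here is a small sketch assuming we log, after each evaluation, the cumulative number of back-propagated examples together with the test error; the helper names are ours, not from the original papers.

```python
def examples_to_reach(history, target_error):
    """history: list of (cumulative_backpropped_examples, test_error) pairs.

    Returns the number of back-propagated examples needed to first reach the
    target error, or None if it is never reached (shown as '-' in the table).
    """
    for backpropped, error in history:
        if error <= target_error:
            return backpropped
    return None

def speedup(standard_history, method_history, slack=1.2):
    """Speedup of a method relative to standard SGD at 1.2x the best standard error."""
    target = slack * min(error for _, error in standard_history)
    baseline = examples_to_reach(standard_history, target)
    accelerated = examples_to_reach(method_history, target)
    if accelerated is None:
        return None
    return baseline / accelerated
```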
3.3 Results
The results are shown in table:speedup. A test error threshold is chosen by running standard SGD training and multiplying the best error achieved by 1.2. Speedup is the number of back-propagations standard SGD requires to reach this threshold divided by the number required by the given method. Dashes indicate the network was unable to reach the threshold error. Numbers in parentheses indicate the best test error achieved by the method averaged over five runs. With 0% corruption both SB and VR significantly accelerate training. When corruption is applied to the data, both methods either fail to deliver a speedup or attain a worse test error.
We find that on the pristine (uncorrupted) CIFAR10 dataset, both SB and VR reach the target test error with a significant speedup. As we increase the amount of corruption in the dataset, the speedups degrade relative to the pristine dataset and both algorithms converge to a higher test error. Training plots are included in the supplementary material.
Notably, VR does not degrade as severely as SB when more corruption is introduced into the dataset. Katharopoulos and Fleuret (2018) show that the variance reduction is proportional to the squared L2 distance between the sampling distribution and the uniform distribution. By tracking this squared L2 distance, VR enables importance sampling only when there is a guaranteed variance reduction. We believe that in our experiments with artificially corrupted examples, VR cancelled importance sampling at most steps, which leads to only minor degradation versus standard training but also produces no speedup.

4 Discussion
Why the degradation on corrupted datasets. We hypothesize that the degradation is caused by the modified examples being prioritized by the acceleration methods. Randomizing the labels deliberately destroys the link between inputs and target labels, so these examples contain no value for generalization. For Gaussian-generated and pixel-shuffled images, the modification transforms the inputs into noise-like images that preserve only some global statistics. These modifications eliminate structure in the images and create out-of-distribution examples that do not contribute to generalization.
We visually examine the most frequently picked examples on both the modified and unmodified datasets, sampling 16 images belonging to one class from the 2000 most frequent picks by SB. In top_images, we show 16 of the most frequently and least frequently picked examples by SB for two classes (frogs and dogs).
We observe that examples less likely to be picked often share similar visual characteristics: similar shapes, contours, colors, or image composition. Their similarity makes them more redundant during training, so some can be dropped without damaging the training process. On the pristine dataset, the examples more likely to be picked show more diverse compositions. Unfortunately, when datasets are corrupted, the corrupted examples tend to be chosen.
We hypothesized that for corrupted datasets, both atypical examples and corrupted examples produce high losses, leading the acceleration methods to prioritize both. To validate this hypothesis, we plot the percentage of corrupted examples included in each training iteration (noise_pct) and find that, across the different types of corruption, both SB and VR consistently over-sample corrupted examples relative to the constant percentage seen in standard training. The results suggest that, contrary to the promise of loss-based acceleration methods, these methods can actually hurt training on partially corrupted datasets.
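This diagnostic is simple to compute; a minimal sketch, assuming each training example carries a boolean flag marking whether it was artificially corrupted:

```python
def oversampling_ratio(selected_corrupted_flags, dataset_corruption_rate):
    """Ratio of the corrupted fraction in a selected batch to the dataset-wide rate.

    selected_corrupted_flags: booleans for the examples chosen for the backward pass.
    A ratio above 1.0 means the sampler over-samples corrupted examples.
    """
    batch_fraction = sum(selected_corrupted_flags) / max(len(selected_corrupted_flags), 1)
    return batch_fraction / dataset_corruption_rate
```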
Entropy-Based Selective Backprop (SBE). In an attempt to fix SB's vulnerability, we switch the prioritization target from the cross-entropy loss to the model's uncertainty, calculated as the entropy of the prediction distribution. Intuitively, entropy-based sampling encourages the model to learn from the examples that cause the most uncertainty across multiple classes.
The method works like SB except that at each iteration, SBE computes forward passes, calculates the entropy of the prediction distribution for every example, and updates the moving histogram accordingly. The probability of each example being sampled is calculated from the CDF of its entropy with respect to the histogram. Samples are then chosen and pushed into the candidate buffer. When the buffer size exceeds the batch size $b$, the first $b$ examples are fetched to compute the gradient.
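A minimal sketch of the entropy score that replaces the loss in the selection rule; it could be plugged into the SB-style filter sketched in Section 2, and `predictive_entropy` is an illustrative name.

```python
import numpy as np

def predictive_entropy(logits):
    """Entropy of the softmax prediction distribution, one value per example.

    logits: array of shape (batch_size, num_classes).
    High entropy means the model is uncertain across classes, so the example is
    prioritized; examples the model is (even wrongly) confident about are not.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# SBE then proceeds exactly as SB, but feeds these entropies (instead of losses)
# into the moving histogram and the CDF-based selection probability.
```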
As shown in SBE, this approach improves training significantly in the face of random-label corruption, but it performs worse than loss-based SB when datasets contain Gaussian-generated examples. On CIFAR10 with 50% random labels, entropy-based SB reaches 1.2x the standard test error with a 2.0x speedup. On CIFAR10 with 50% Gaussian-generated examples, entropy-based SB performs slightly worse than loss-based SB.
5 Related Work
Many methods have been proposed to improve neural network training. Curriculum learning (Bengio et al., 2009) supplies easy, prototypical examples first and gradually increases their difficulty, which has been shown to benefit the overall generalization of the model. In real-world applications, where identifying easy and hard examples can itself be difficult, self-paced learning (Kumar et al., 2010) infers the difficulty of examples from their training loss during training.
Another common approach is importance sampling. The basic idea is to over-sample a subset of examples and then weight them by the inverse of the sampling probability so that the gradient estimator remains unbiased (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018; Gao et al., 2015). Among these works, many (Katharopoulos and Fleuret, 2018; Loshchilov and Hutter, 2015; Schaul et al., 2016) use the loss to generate the sampling distribution and sample examples proportionally to their historical loss. Although most methods require maintaining a data structure proportional to the training set in size (e.g., the full history of training losses for each example) or training an extra auxiliary DNN, Katharopoulos and Fleuret (2018) do not need to maintain historical data for each example and may therefore scale to large datasets and to the incremental learning setting.
In addition, Jiang et al. (2019) propose a framework that prioritizes high-loss examples and speeds up training without maintaining a data structure proportional to the training set in size. Chang et al. (2018) propose up-weighting examples based on estimates of model uncertainty. Yoon et al. (2019) propose a meta-learning framework that models the value of each example with a deep neural network.
Mitigating the impact of noisy labels has also been the subject of considerable research (Tanaka et al., 2018).
6 Conclusions
In this paper, we showed that the acceleration obtained from prioritizing high-loss examples does not always hold when the dataset cannot be guaranteed to be of high quality. We demonstrated the degradation of loss-based acceleration through experiments with artificially corrupted datasets. In our experiments, two acceleration algorithms actually hurt training by over-sampling the corrupted examples, confirming our hypothesis that loss is an imperfect estimate of how challenging an example is. Up-weighting based on loss is not robust in the face of irreducible noise.
References
- Amodei et al. (2018). AI and compute.
- Bengio et al. (2009). Curriculum learning. pp. 41–48.
- Beyer et al. (2020). Are we done with ImageNet? arXiv:2006.07159.
- Chang et al. (2018). Active bias: training more accurate neural networks by emphasizing high variance samples. arXiv:1704.07433.
- Gao et al. (2015). Active sampler: light-weight accelerator for complex data analytics at scale. arXiv:1512.03880.
- Hooker et al. (2020). What do compressed deep neural networks forget? arXiv:1911.05248.
- Hooker (2020). The hardware lottery. arXiv:2009.06489.
- Jiang et al. (2019). Accelerating deep learning by focusing on the biggest losers. arXiv:1910.00762.
- Johnson and Guestrin (2018). Training deep models faster with robust, approximate importance sampling. pp. 7265–7275.
- Katharopoulos and Fleuret (2018). Not all samples are created equal: deep learning with importance sampling. arXiv:1803.00942.
- Krizhevsky and Hinton (2009). Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto.
- Kumar et al. (2010). Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pp. 1189–1197.
- Loshchilov and Hutter (2015). Online batch selection for faster training of neural networks. arXiv:1511.06343.
- Obando-Ceron and Castro (2020). Revisiting rainbow: promoting more insightful and inclusive deep reinforcement learning research. arXiv:2011.14826.
- Schaul et al. (2016). Prioritized experience replay. In International Conference on Learning Representations.
- Snow et al. (2008). Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 254–263.
- Strubell et al. (2019). Energy and policy considerations for deep learning in NLP. arXiv:1906.02243.
- Tanaka et al. (2018). Joint optimization framework for learning with noisy labels. arXiv:1803.11364.
- Tsipras et al. (2020). From ImageNet to image classification: contextualizing progress on benchmarks. arXiv:2005.11295.
- Learning from multiple annotators with varying expertise. Machine Learning 95.
- Yoon et al. (2019). Data valuation using reinforcement learning. arXiv:1909.11671.
- Zagoruyko and Komodakis (2017). Wide residual networks. arXiv:1605.07146.
Appendix S1 Additional plots
In S6 we show the performance of the methods under consideration on a dataset without any corruption. In S6 to S11, we depict the performance in the face of various types and amounts of corruption.