Despite rapidly improving software and hardware, training machine learning models remains a costly and time-consuming task. Training a model on a large dataset with many classes, such as ImageNet, takes roughly a day; choosing between, for example, 7 different hyperparameter values would then take roughly a week of GPU time. For researchers and developers seeking to quickly iterate on algorithmic approaches or search for hyperparameters, it can be impractical to wait this long for a model to train. Additionally, cheaper experiments might reduce the cost of running large AutoML experiments, allowing a wider group of researchers to contribute to a growing field.
In this paper, we explore the idea of making such experiments cheaper by creating “proxy datasets” that exhibit two desirable qualities: (1) they should be relatively cheap to train on, and (2) the relationship between hyperparameters and accuracy on the proxy task should closely resemble the dynamics of the full dataset. More specifically, a set of parameters (here referring to hyperparameters, model architectures, or other algorithmic choices controlled by the researcher) that boosts validation accuracy over another set of parameters on the proxy dataset should cause a proportional boost in validation accuracy on the full dataset. Similarly, if a set of parameters performs poorly on a proxy dataset, it should perform poorly on the full dataset, and the reduction in accuracy compared with a “good” set of parameters should be proportional between the proxy and full datasets. In many hyperparameter search tasks all that matters is finding the best configuration, so it is especially important that running hyperparameter search on a proxy dataset produces hyperparameters that are nearly as good as the true best hyperparameters.
In Section 2, we discuss related work, which focuses more on the accuracy achievable when training on a reduced dataset than on the reduced dataset’s quality as a proxy. Section 3 discusses our datasets, our approach for generating proxy datasets, experimental settings, and metrics for measuring proxy quality. Section 4 presents results.
2 Related Work
Some training examples have negative value. In “Are All Training Examples Equally Valuable?”, Lapedriza et al. (2013) propose ranking each training example by the validation accuracy of a model trained on that example together with all examples from the other classes. They run experiments on Pascal-VOC using SVMs and find that they can create smaller subsets that outperform training on the full dataset. Using about 90% of the original examples works best in their setting, suggesting that some, potentially mislabeled, examples have negative value.
In “An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks”, Goodfellow et al. (2013) examine how often a neural network “forgets” the correct prediction for an example, that is, how often it predicts the correct class for an example in one epoch and then makes an incorrect prediction in a subsequent epoch. In experiments on CIFAR-10 (Krizhevsky and Hinton, 2009), they find that removing the 30% of examples that have been forgotten the least after each epoch does not impact validation accuracy, whereas removing a random 30% of examples causes a much more significant reduction in accuracy. They also find that the same examples are forgettable and unforgettable across different ResNet architectures. We do not generate proxy datasets based on forget frequency, but instead use the loss of a fully trained model to generate “easy” and “hard” subsets. This choice is based on our hypothesis that frequently forgotten examples will tend to have higher loss at the end of training.
Another group of works uses approximations of the per-example gradient norm to estimate the training value of each example, with the insight that examples that cause smaller weight updates are less valuable. In “Not All Samples Are Created Equal: Deep Learning with Importance Sampling”, to take one example, Katharopoulos and Fleuret (2018) present a new trick for approximating example value and show that the benefit of skipping backpropagation on low-value examples can outweigh the cost of approximating the per-example gradient norm. Since an example with a low gradient norm likely has a low loss, we hypothesize that this technique is similar to our training technique, but do not explicitly test that assumption. There is also a large but somewhat older body of work on coreset construction, summarized by Bachem et al. (2017).
Dataset Distillation (Wang et al., 2018) compresses a dataset to one example per class, optimizing synthetic inputs such that a neural network trains to reasonable accuracy with one gradient descent step per example. The authors show that AlexNet (Krizhevsky et al., 2012) trained on distilled data achieves 54% accuracy on CIFAR-10 if the distillation process is given access to AlexNet’s weight initialization, and 36% without access. On Imagenette, with AlexNet and Kaiming initialization (He et al., 2016), training on distilled examples achieved 37% accuracy, but we could not achieve results better than 20% (10% is random performance) with images larger than 32x32 or with a ResNet architecture.
More importantly, even with the best 37% setup, the distilled dataset turned out to be a poor proxy for the full dataset. Many changes that improved the default AlexNet implementation, like increasing the learning rate or using label smoothing loss, reduced proxy accuracy to random levels or caused gradient explosions. We interpret this result as evidence that (a) the images produced by dataset distillation are directly optimized to produce large gradients and should not be shown to the model multiple times and (b) distilled examples generated with a model that uses one set of hyperparameters do not transfer well to a model with different hyperparameters, as the original authors discuss.
To summarize our contribution: much previous work aims to reduce a dataset to a smaller dataset, either during or before training, and achieve good validation accuracy in less time. Our goal is not to train a strong final model on the smaller dataset, but rather to use the smaller dataset as a tool to accelerate hyperparameter search. Our results could also be used with smarter hyperparameter search methods such as Neural Architecture Search (Zoph and Le, 2016) and Bayesian Optimization (Bergstra et al., 2013).
3 Experimental Setup

Due to the large number of experiments required to demonstrate the statistical power of proxy results, we create proxy datasets from Imagenette and Imagewoof, two ImageNet subsets created by Jeremy Howard (github.com/fastai/imagenette). Imagenette contains ImageNet examples from 10 easy-to-distinguish classes, while Imagewoof, a harder dataset for classification, contains examples from 10 hard-to-distinguish classes: different dog breeds. We run all experiments on 128x128 images, except for the synthetic images generated by dataset distillation, which are 32x32. We ran 36 hyperparameter configurations x 2 datasets x 6 proxy creation strategies x 3 time budget levels, for a total of 1,296 training runs.
Hyperparameter Configurations (defaults in bold):
Architecture: XResnet50, XResnet18 or XResnet101.
Stem Channels: how many channels to output from each of ResNet’s first two convolutional layers. These are two parameters set independently: [[4, 32, 48], [4, 32, 48]].
Flip LR Probability: [0.0, 0.25, 0.5]. Controls data augmentation, specifically how often a training image is flipped horizontally.
Optimizer: the default optimizer, SGD, or RMSProp.
Loss Function: Cross Entropy Loss with or without label smoothing. Label Smoothing (Pereyra et al., 2017) attempts to decrease a model’s sensitivity to mislabeled examples by modifying the cross entropy loss function to use a target value of 0.9 for the labeled class instead of 1.0. (Label smoothing seemed to degrade performance slightly in our experiments, likely because they do not use mixup.)
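As a concrete illustration, label-smoothed cross entropy can be computed as below. This is a minimal NumPy sketch, not the training code we used; the function name is ours, and spreading the remaining probability mass uniformly over the other classes is one common variant of the technique.

```python
import numpy as np

def smoothed_cross_entropy(logits, label, eps=0.1):
    """Cross entropy against a smoothed target distribution: the labeled
    class receives probability 1 - eps (0.9 here) and the remaining eps
    is spread uniformly over the other classes."""
    n = logits.shape[0]
    z = logits - logits.max()            # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    target = np.full(n, eps / (n - 1))   # smoothed off-class mass
    target[label] = 1.0 - eps
    return float(-(target * log_probs).sum())
```

With eps=0 this reduces to ordinary cross entropy; with eps=0.1 even a perfectly confident correct prediction incurs a small loss, which is what discourages over-confidence.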
We do not run every possible combination of these hyperparameter values. Instead, we change one parameter at a time while leaving the others at their default (bolded) values, in order to increase the diversity of our search space. Table 1 shows the best-performing hyperparameters on each dataset, which are slightly different from the defaults.
|                                 | Imagenette | Imagewoof |
| Flip LR Probability             | 0.5        | 0.25      |
| Val Accuracy for Default Params | 0.920      | 0.784     |
Proxy Creation Strategies
All the Classes + Random Sampling: this is our baseline.
Half the Classes + Random Sampling
Easiest Examples: Both easy and hard examples are selected by training a model with the default hyperparameters on the full dataset then using that model to evaluate the loss on each example. Low loss examples are considered easy.
Hardest Examples: High loss examples selected using the same procedure.
Fewer Epochs: train models for 1, 5, or 10 epochs. We refer to this as a proxy creation strategy because it reduces the cost of obtaining an experimental result, and can therefore be thought of as a proxy for the target task.
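The easy/hard selection above can be sketched as follows. The helper name is hypothetical, and we assume the per-example losses have already been computed with the fully trained default model:

```python
import numpy as np

def select_subset(losses, frac=0.1, mode="easy", drop_top=0.0):
    """Pick a proxy subset from per-example losses of a reference model.

    mode="easy": the `frac` lowest-loss (easiest) examples.
    mode="hard": the `frac` highest-loss examples, optionally discarding
    the `drop_top` highest-loss fraction first (possibly mislabeled)."""
    order = np.argsort(losses)               # ascending: easiest first
    n = len(losses)
    if mode == "easy":
        return order[: int(n * frac)]
    keep = order[: n - int(n * drop_top)]    # drop the very hardest
    return keep[-int(n * frac):]
```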
Experimental Settings. For each hyperparameter setting and proxy creation strategy (including the baseline, which uses all the data), we train for the required epochs with the relevant hyperparameter setting. The validation data is always the same 500 examples, except for the half-classes proxy creation strategies, where we remove classes that do not appear in the training set. For each run, we take the best validation accuracy at any epoch as the proxy result. (This tends to increase top-1 accuracy for those proxies but does not impact proxy quality metrics.) All experiments are run with the One-Cycle learning rate schedule (Smith and Topin, 2018) and half-precision training, using a slightly modified version of the train_imagenette script from the fastai library. Running the target task (20 epochs) on either dataset takes about 30 minutes and costs 52 cents on a P100 GPU.
Measuring Proxy Quality. We propose three metrics for evaluating the quality of a proxy. The first is the $R^2$ of the regression

$\mathrm{TargetAcc}_i = \beta_0 + \beta_1 \cdot \mathrm{ProxyAcc}_i + \epsilon_i$

where $i$ denotes the $i$th hyperparameter configuration. To make statistics comparable across datasets, we normalize all accuracies within each dataset; this way, simply achieving lower accuracy on Imagewoof, a harder dataset, does not distort a proxy’s score. With normalized accuracies, the slope $\beta_1$ equals the covariance (equivalently, the correlation) of the Target Accuracy and Proxy Accuracy, and $R^2$ is its square. This metric can be interpreted as the amount of variance in target experiment results that can be explained by proxy experiment results. Higher is better.
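A sketch of this first metric (the function name is ours; with both accuracy vectors standardized, the $R^2$ of the regression is just the squared correlation):

```python
import numpy as np

def proxy_r2(proxy_acc, target_acc):
    """R^2 of regressing normalized target accuracy on normalized proxy
    accuracy; for standardized variables this equals the squared Pearson
    correlation (i.e. the squared covariance)."""
    p = (proxy_acc - np.mean(proxy_acc)) / np.std(proxy_acc)
    t = (target_acc - np.mean(target_acc)) / np.std(target_acc)
    r = np.mean(p * t)   # correlation of standardized variables
    return r ** 2
```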
The second metric is the Spearman correlation between Proxy and Target results, ignoring poorly performing hyperparameters:

$\mathrm{Spearman}\big(\{\mathrm{ProxyAcc}_i\}_{i \in K},\ \{\mathrm{TargetAcc}_i\}_{i \in K}\big)$

where $K$ is the set of best-performing hyperparameter configurations on the proxy for a given proxy creation strategy. For example, if we ran 4 experiments and achieved [.92, .93, .84, .86] on the proxy, and [.95, .99, .81] on the target, the statistic would be Spearman([.92, .93], [.95, .99]). This statistic is motivated by the insight that in most hyperparameter search use cases, the user mostly cares about the relative rankings of the best configurations, which is exactly what it measures. We also tried to measure regret, i.e., how the best hyperparameters found on the proxy task perform on the target task, but found this metric unstable, even averaged across randomly sampled hyperparameter configurations.
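The second metric could be computed as below. This is a self-contained sketch with a name of our choosing; we implement Spearman as the Pearson correlation of ranks, which matches the standard definition when there are no ties:

```python
import numpy as np

def topk_spearman(proxy_acc, target_acc, k):
    """Spearman correlation between proxy and target results, restricted
    to the k configurations with the best proxy accuracy."""
    proxy_acc, target_acc = np.asarray(proxy_acc), np.asarray(target_acc)
    top = np.argsort(proxy_acc)[-k:]   # best k configurations by proxy
    def ranks(x):
        return np.argsort(np.argsort(x)).astype(float)
    rp, rt = ranks(proxy_acc[top]), ranks(target_acc[top])
    rp, rt = rp - rp.mean(), rt - rt.mean()   # Pearson correlation of ranks
    return float((rp * rt).sum() / np.sqrt((rp ** 2).sum() * (rt ** 2).sum()))
```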
The third metric is a Cost-Adjusted $R^2$. Since the $R^2$ of a proxy is highly correlated with its computational cost, we extract the residual of a second regression:

$R^2_s = \gamma_0 + \gamma_1 c_s + \gamma_2 c_s^2 + \gamma_3 c_s^3 + \delta_s$

where $c_s$ is the computational cost of proxy creation strategy $s$, and the residual $\delta_s$ is the Cost-Adjusted $R^2$. We chose three polynomial terms by adding terms until further terms were given a 0 coefficient by Lasso Regression (scikit-learn’s LassoCV: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html). Visually, the Cost-Adjusted $R^2$ is the vertical distance between each point and the fitted line in Figure 2.
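A sketch of the third metric, using an ordinary least-squares polynomial fit in place of the Lasso term selection (the function name is ours):

```python
import numpy as np

def cost_adjusted_r2(r2, cost, degree=3):
    """Residuals of regressing each proxy strategy's R^2 on a cubic
    polynomial of its computational cost. A positive residual means the
    proxy is better than expected for its price."""
    coeffs = np.polyfit(cost, r2, degree)   # least-squares polynomial fit
    return np.asarray(r2) - np.polyval(coeffs, cost)
```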
4 Results

General Trends. Figure 2 shows our two metrics of proxy quality for all proxy creation strategies, plotted against the computational cost of the proxy. As expected, costlier proxies tend to be more correlated with the target than cheaper ones, but some cheaper proxies perform reasonably well. With the easiest 10% of examples as a proxy, one can explain 81% of the variance in target outcomes.
On 50% of the data, two strategies, Easy Examples and Half Classes, achieve $R^2$ above 95%, and these two strategies beat the others across different computational costs (the x-axis). All of these strategies recover the same best hyperparameters as the target task.
Figure 1 plots proxy performance vs. target performance for 4 different proxies that each take less than 3 minutes to train (10% of the target task). Each data point in each chart represents the proxy and target results for some hyperparameter setting. The easy-example (red) strategy’s regression line has an $R^2$ of 81%, explaining far more of the variance in target results than the 20% explained when using the 10% hardest examples.
Hard vs. Easy. Proxying by selecting the easiest examples is the best-performing strategy, as shown in Figure 1. Another way to make the task easier, sampling half of the classes, is the second-best proxy creation strategy, only slightly underperforming direct selection of easy examples on a cost-adjusted basis. These two strategies, restricting to the easiest examples and restricting to half the classes before randomly sampling, outperform the other proxy creation strategies at all cost levels. Random sampling and training for fewer epochs perform almost equivalently to each other, but worse than the easier strategies. Selecting only hard examples is the worst strategy. This result is not caused by mislabeled or unlearnable examples; discarding the 5% hardest examples helps the 50%-hardest-example proxy by 2% while reducing the computational cost, but that proxy still underperforms random sampling. These trends continue in Figure 2, which plots our second metric, the Spearman correlation between proxy result and target result for good hyperparameter configurations.
At first glance, these results seem to contradict the Forgetting paper’s evidence that easy, rarely forgotten examples can be discarded during training without harming accuracy. However, given the fairly small datasets we are working with, and the fact that our models never see the easy examples before they are discarded, we hypothesize that hard examples are much more useful later in the training process, after the model has already learned the basic features of the dataset from the easy examples.
These “easy” subsets are better proxies than training on the full dataset for a reduced number of epochs (but equivalent computational cost), and this advantage persists even if we remove the hardest 5% of examples, which might be mislabeled.
Table 2, in the appendix, shows all metrics for all proxy creation strategies at different complexity levels.
Proxy Quality Transfers to New Settings Figure 3 shows that all three of our proxy quality measurements generalize between datasets and between independent hyperparameter grids.
On the left-hand side we split our experimental results by dataset and then recompute the metrics. By all three metrics, including the cost-adjusted $R^2$, proxy creation strategies that measure well on Imagenette tend to measure well on Imagewoof, and vice versa. The right-hand side splits the hyperparameter configurations into two roughly equally sized non-overlapping subsets: configurations that change the learning rate (“Learning Rate Experiments”) and those that change any other hyperparameter. Again, proxy creation strategies that measure well on learning rate experiments tend to measure well on other experiments, by all three metrics. Although this result is not plotted, the easiest-example proxies continue to perform the best in all settings.
Although we do not investigate this phenomenon as deeply, another way to reduce the cost of hyperparameter search might be to cancel underperforming models before they are done training and redirect the saved resources to other experiments. Figure 4 examines how much we know about final model performance at a given epoch, and suggests two things: first, there is a significant jump in information at the second epoch, and second, late epochs continue to provide valuable information. These results are admittedly very dependent on our context; if we trained for 30 epochs, the curve would likely flatten out more quickly, but they should still inspire caution. If we randomly choose two runs that use all the data but different hyperparameters, the chance that the model with the higher validation accuracy after 1 epoch also outperforms the other model at the end of training is only 72%. After 10 epochs this number is 81%, and after 15 epochs it increases to 96%.
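The pairwise statistic quoted above (72% at epoch 1, and so on) can be computed from per-run validation curves roughly as follows; this is a sketch with a hypothetical function name, not the analysis code we ran:

```python
import numpy as np
from itertools import combinations

def pairwise_predictiveness(curves, epoch):
    """Fraction of run pairs for which the run leading at `epoch` also
    finishes with the higher final validation accuracy.
    curves: (n_runs, n_epochs) array of validation accuracies."""
    curves = np.asarray(curves)
    wins, total = 0, 0
    for a, b in combinations(range(len(curves)), 2):
        early = curves[a, epoch] - curves[b, epoch]
        final = curves[a, -1] - curves[b, -1]
        if early == 0:
            continue   # skip ties at the early epoch
        total += 1
        wins += (early > 0) == (final > 0)
    return wins / total
```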
5 Conclusion

Our results suggest that hyperparameter search can be accelerated by using small subsets of the data. Running hyperparameter search on the easiest 10% of examples explains 81% of the variance in experiment results on the target task, using the easiest 50% of examples explains 95% of the variance, and these strategies recover the same optimal hyperparameters as the target task. Proxy datasets built from the easiest examples are consistently higher quality than those built from the hardest examples. This pattern persists across datasets and across independent slices of the parameter grid.
6 Future Work
Imagenette and Imagewoof are both subsets of ImageNet, and it would be interesting to test whether the same high-level proxy quality patterns persist in other CV datasets, like CIFAR-10, or other domains like NLP. Additionally, it would be interesting to put these ideas into practice in the context of smarter hyperparameter selection systems and re-measure the speed/accuracy tradeoff. It would similarly be interesting to develop a regret-style metric that actually measures the accuracy lost by taking shortcuts, on a large enough set of experiments to establish consistency. Finally, the failure of the hard-example proxies suggests that a curriculum-learning-inspired approach, in which we switch from easier to harder proxy sets over the course of training, might lead to even better proxies. Relatedly, initializing from pretrained weights might reduce the usefulness of easy examples.
As an experienced data scientist and machine learning engineer, Sam was the creative force behind much of the project design and direction, including deciding on sampling strategies, hyperparameter grids to run, and metrics to evaluate proxy quality. He also made major contributions to the experimental framework code, reproduced Dataset Distillation (Wang et al., 2018), bootstrapped the analysis code, and led the charge on the final report writeup.
Eric contributed by bootstrapping the experiment framework code and later enhancing it to support multiple sampling strategies, running experiments and collecting results, analyzing results, and creating plots.
We’d like to thank Jeremy Howard for inspiring us with his Imagenette (Howard, 2019) dataset to further explore the idea of creating proxy datasets, and Tongzhou Wang for helping us get Dataset Distillation running and quickly responding to our GitHub issues.
- Bachem et al. (2017) Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
- Bergstra et al. (2013) James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference, pages 13–20. Citeseer, 2013.
- Goodfellow et al. (2013) Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Howard (2019) Jeremy Howard. Imagenette. Github repository with links to dataset, 2019. https://github.com/fastai/imagenette.
- Katharopoulos and Fleuret (2018) Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lapedriza et al. (2013) Agata Lapedriza, Hamed Pirsiavash, Zoya Bylinskii, and Antonio Torralba. Are all training examples equally valuable? arXiv preprint arXiv:1311.6510, 2013.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
- Smith and Topin (2018) Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates. arXiv preprint arXiv:1708.07120, 2018.
- Wang et al. (2018) Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
- Zoph and Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.