1 Introduction
Data sampling policies can greatly influence the performance of model training in computer vision tasks, so finding robust sampling policies is important. Handcrafted rules, e.g. data resampling, reweighting, and importance sampling, promote better model performance by adjusting the frequency and order of the training data (Estabrooks et al., 2004; Weiss et al., 2007; Bengio et al., 2009; Johnson and Guestrin, 2018; Katharopoulos and Fleuret, 2018; Shrivastava et al., 2016). However, they rely heavily on assumptions about the dataset and cannot adapt well to datasets with different characteristics. To handle this issue, learning-based methods (Li et al., 2019; Jiang et al., 2017; Fan et al., 2017) were designed to automatically reweight or select training data using meta-learning techniques or a policy network.
However, existing learning-based sampling methods still rely on human priors as proxies to optimize sampling policies, which may fail in practice. Such priors often include assumptions on policy network design for data selection (Fan et al., 2017), or dataset conditions like noisiness (Li et al., 2019; Loshchilov and Hutter, 2015) or imbalance (Wang et al., 2019). These approaches take image features, losses, importance or their representations as inputs and apply a policy network or another learning approach with a small number of parameters to estimate the sampling probability. For example, images with similar visual features can be redundant in training, yet their losses or features fed into the policy network are likely to be close, so redundant samples receive nearly the same sampling probability if we rely on the aforementioned priors. Therefore, we propose to directly optimize the sampling schedule itself so that no prior knowledge of the dataset is required. Specifically, the sampling schedule refers to the order in which data are selected over the entire course of training. In this way, we rely only on the data themselves to determine the optimal sampling schedule, without any prior.
Directly optimizing a sampling schedule is challenging due to its inherently high dimension. For example, for the ImageNet classification dataset (Deng et al., 2009) with around one million samples, the dimension of the parameters would be of the same order. While popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015), population-based training (Jaderberg et al., 2017) or simple random search (Bergstra and Bengio, 2012) have already been used to tune low-dimensional hyperparameters like augmentation schedules, their application to data sampling schedules remains unexplored. For instance, the dimension of a data augmentation policy is generally only in the dozens, yet thousands of training runs (Cubuk et al., 2018) are needed to sample enough rewards to find an optimal augmentation policy, because high-quality rewards require many epochs of training to obtain. As such, optimizing a sampling schedule may require orders of magnitude more rewards, and hence training runs, than data augmentation, which results in prohibitively slow convergence.
To overcome this challenge, we propose a data sampling policy search framework, named AutoSampling, to sufficiently learn an optimal sampling schedule in a population-based training fashion (Jaderberg et al., 2017). Unlike previous methods, which focus on collecting long-term rewards and updating hyperparameters or agents offline, AutoSampling collects rewards online with a shortened collection cycle and without priors. Specifically, AutoSampling collects rewards within several training iterations, a cycle tens or hundreds of times shorter than in existing works (Ho et al., 2019; Cubuk et al., 2018). In this manner, we provide the search process with much more frequent feedback to ensure sufficient optimization of the sampling schedule. Every few training iterations, we collect the rewards from the preceding iterations, accumulate them, and later update the sampling distribution using these rewards. Then, we perturb the sampling distribution to search in the distribution space, and use it to generate new mini-batches for later iterations, which are recorded into the output sampling schedule. As illustrated in Sec. 5.2, shortened collection cycles with less interference can also better reflect the training value of each data point.
Our contributions are as follows:

To the best of our knowledge, we are the first to propose directly learning a robust sampling schedule from the data themselves, without any human prior or condition on the dataset.

We propose the AutoSampling method to handle the optimization difficulty caused by the high dimension of sampling schedules, and to efficiently learn a robust sampling schedule through a shortened reward collection cycle and online updates of the sampling schedule.

Comprehensive experiments on the CIFAR10/100 and ImageNet datasets (Krizhevsky, 2009; Deng et al., 2009) with different networks show that AutoSampling can increase the top-1 accuracy by up to 2.85% on CIFAR10, 2.19% on CIFAR100, and 2.83% on ImageNet.
2 Background
2.1 Related Works
Data sampling is of great significance to deep learning and has been extensively studied. Approaches with human-designed rules use predefined heuristics to modify the frequency and order in which training data are presented. In particular, one intuitive method is to resample or reweight data according to their frequencies, difficulties or importance in training (Estabrooks et al., 2004; Weiss et al., 2007; Drummond et al., 2003; Bengio et al., 2009; Lin et al., 2017; Shrivastava et al., 2016; Loshchilov and Hutter, 2015; Wang et al., 2019; Johnson and Guestrin, 2018; Katharopoulos and Fleuret, 2018; Byrd and Lipton, 2018). These methods have been widely used in imbalanced training or hard mining problems. However, they are often restricted to the tasks and datasets for which they were proposed, and their ability to generalize to a broader range of tasks with different data distributions may be limited. In other words, these methods often implicitly assume certain conditions on the dataset, such as cleanness or imbalance. In addition, learning-based methods have been proposed for finding suitable sampling schemes automatically. For example, methods using meta-learning or reinforcement learning have been used to automatically select or reweight data during training (Li et al., 2019; Jiang et al., 2017; Ren et al., 2018; Fan et al., 2017), but they have only been tested on small-scale or noisy datasets. Whether they can generalize to tasks on other datasets remains untested. In this work, we directly study data sampling without any prior, and we also investigate its generalization ability across different datasets such as CIFAR10, CIFAR100 and ImageNet using many typical networks.
As for hyperparameter tuning, popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015) or simple random search (Bergstra and Bengio, 2012) have already been used to tune low-dimensional hyperparameters and have proven effective. Nevertheless, they have not been adopted to find good sampling schedules because of the inherently high dimension of such schedules. Some recent works tackle the challenge of optimizing high-dimensional hyperparameters. MacKay et al. (2019) use structured best-response functions, and Lorraine et al. (2019) achieve this goal by combining the implicit function theorem with efficient inverse Hessian approximations. However, these methods have not been tested on the task of optimizing sampling schedules, which is the major focus of this paper.
2.2 Population Based Training
The hyperparameter tuning task can be framed as a bilevel optimization problem with the following objective function,
(1) 
where represents the model weight and is the hyperparameter schedule for training intervals. Population based training (PBT) (Jaderberg et al., 2017) solves the bilevel optimization problem by training a population of child models in parallel with different hyperparameter schedules initialized:
(2) 
where respectively represents the child model weight, the corresponding hyperparameter schedule for the training interval on worker , and is the number of workers. PBT proceeds in intervals, which usually consists of several epochs of training. During the interval, the population of models are trained in parallel to finish the lowerlevel optimization of weights .
Between intervals, an exploitandexplore procedure is adopted to conduct the upperlevel optimization of the hyperparameter schedule. In particular for interval , to exploit child models are evaluated on a heldout validation dataset:
(3) 
The best-performing hyperparameter setting is recorded and the top-performing model is broadcast to all workers. To explore, new hyperparameter schedules are initialized for the next interval with different random seeds on all workers, which can be viewed as a search in the hyperparameter space. The next exploit-and-explore cycle then continues. In the end, the top-performing hyperparameter schedule is obtained.
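The exploit-and-explore cycle described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the PBT implementation of Jaderberg et al. (2017): the scalar "weight", the quadratic `evaluate` score, the learning rate as the only hyperparameter, and all function names are hypothetical choices made for the example.

```python
import random

def train_interval(weight, lr, steps=5):
    # Lower-level optimization: a few gradient steps on the toy loss (w - 3)^2.
    for _ in range(steps):
        weight -= lr * 2.0 * (weight - 3.0)
    return weight

def evaluate(weight):
    # Toy held-out "validation" score: higher is better.
    return -(weight - 3.0) ** 2

def pbt(num_workers=4, num_intervals=6, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * num_workers
    best_schedule = []  # best hyperparameter recorded for each interval
    for _ in range(num_intervals):
        # Explore: every worker draws a fresh hyperparameter for this interval.
        lrs = [rng.uniform(0.01, 0.4) for _ in range(num_workers)]
        # Train all children (sequentially here; in parallel in practice).
        weights = [train_interval(w, lr) for w, lr in zip(weights, lrs)]
        # Exploit: pick the top performer and broadcast its weight to all workers.
        best = max(range(num_workers), key=lambda i: evaluate(weights[i]))
        best_schedule.append(lrs[best])
        weights = [weights[best]] * num_workers
    return weights[0], best_schedule

final_weight, schedule = pbt()
```

The returned `schedule` holds one recorded hyperparameter per interval, which mirrors the idea that the output of the search is a schedule rather than a single setting.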
PBT has been applied to tune low-dimensional hyperparameters such as data augmentation schedules (Ho et al., 2019; Jaderberg et al., 2017). However, it cannot be directly used for finding sampling strategies because of their high dimension. For instance, PBA (Ho et al., 2019) needs 3200 epochs of training to optimize 60 hyperparameters for data augmentation, and a linear upscaling to learning a sampling schedule on CIFAR100 would require a prohibitive 2.67 million epochs. Unlike PBT, AutoSampling adopts a multi-exploitation-and-exploration structure, leading to much shorter reward collection cycles that yield many more effective rewards for sufficient optimization within a practical computational budget.
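The 2.67 million figure follows from simple proportional scaling. The short calculation below reproduces it under the assumption that the dimension of the sampling schedule is taken to be the number of CIFAR100 training images (50000, as stated in Sec. 5.1); the variable names are illustrative only.

```python
pba_epochs = 3200              # epochs of training reported for PBA (Ho et al., 2019)
pba_hyperparameters = 60       # augmentation hyperparameters tuned by PBA
cifar100_train_size = 50_000   # assumed schedule dimension: one per training image

epochs_per_dimension = pba_epochs / pba_hyperparameters
scaled_epochs = epochs_per_dimension * cifar100_train_size
print(f"{scaled_epochs / 1e6:.2f} million epochs")  # prints "2.67 million epochs"
```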
3 Preliminaries
Consider the bilevel optimization problem detailed by Equation.1 over a training dataset We define a sampling schedule to be an enumerated collection of data from the dataset for training the model, that is, , with each sampled from . We use to denote the total number of data trained for the lower task of optimizing the model weight . The product of copies of dataset constructs the search space of sampling schedules, from which we wish to find an optimal sampling schedule to solve the aforementioned bilevel optimization problem:
(4) 
Note a sampling schedule may also be represented as a enumerated collection of several sampling subschedules. For instance, a sampling schedule can be denoted as where
4 AutoSampling with Searching
The overview of AutoSampling is illustrated in Fig. 1. AutoSampling alternates between a multi-exploitation step and an exploration step. In the exploration step, we 1) update the sampling distribution using the rewards collected from the multi-exploitation step (the sampling distribution is initially uniform); 2) perturb the updated sampling distribution for the child models so that different child models have different sampling distributions; 3) use the corresponding perturbed sampling distribution for each child model to sample mini-batches of training data. In the multi-exploitation step, we 1) train multiple child models using the mini-batches sampled in the exploration step; 2) collect short-term rewards from the child models. AutoSampling finishes with a recorded top-performing sampling schedule, which can be transferred to other models.
4.1 Multi-Exploitation by Searching in the Data Space
In the multiexploitation step, we aim to search locally in the data space by collecting shortterm rewards and subschedules. Specifically, we wish to learn a sampling schedule for exploitation intervals. In each interval, there are a population of child models. We denote as the training data subschedule in the interval for the child model. When all of the exploitation intervals for the child model are considered, we have , where is the number of training data used for this multiexploitation step. Each interval consists of training iterations that is also equivalent to training minibatches, where is the length of the interval. AutoSampling is expected to produce a sequence of training samples, denoted by , so that a given model is optimally trained. The population {} forms the local search space, from which we aim to search for an optimal sampling schedule .
Given the population , we train them in parallel on workers for interval . Once the interval of data containing training batches have been used for training, we evaluate all child models and use the top evaluation performance as the reward. According to the reward, we record the topperforming weight and subschedule for the current interval , in particular,
(5) 
On the other hand, we update all child model weights of by cloning into them with the topperforming weight so we can continue searching based on the most promising child. We continue the exploit steps through all exploitation intervals, record and output the recorded optimal sampling schedule for the multiexploitation step. By using exploitation interval of minibatches rather than epochs or even entire training runs adopted by earlier methods, AutoSampling may yield a better and more robust sampling schedule. It should be pointed out that even though in AutoSampling rewards are collected within a much shorter interval, they remain effective. As we directly optimize the sampling schedule, we are concerned with only the data themselves. The shortterm rewards reflect the training value of data from the exploitation interval they are collected. But for global hyperparameters such as augmentation schedules, shortterm rewards may lead to inferior performance as these hyperparameters are concerned with the overall training outcome. We describe the multiexploitation step with details in Alg.1.
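The multi-exploitation step can be illustrated with a toy sketch. It is not the authors' Alg. 1: the scalar model, the squared-error "training", the `evaluate` reward and every name below are assumptions chosen to keep the example self-contained. What it does preserve is the structure described above: in each interval every child trains on its own sub-schedule from the same cloned weight, the top child is selected by its reward, its sub-schedule is recorded, and its weight is cloned into all children.

```python
import random

def train_on(weight, sub_schedule, data, lr=0.1):
    # Toy "training": one SGD step on (w - x)^2 per sampled data point, in order.
    for idx in sub_schedule:
        weight -= lr * 2.0 * (weight - data[idx])
    return weight

def evaluate(weight, val_target=1.0):
    # Toy validation reward: higher is better.
    return -(weight - val_target) ** 2

def multi_exploitation(data, sub_schedules_per_interval, init_weight=0.0):
    """sub_schedules_per_interval[t][i] is the sub-schedule of child i in interval t."""
    weight = init_weight
    recorded = []  # concatenation of the winning sub-schedules
    for candidates in sub_schedules_per_interval:
        # Every child starts the interval from the same (cloned) weight.
        trained = [train_on(weight, s, data) for s in candidates]
        rewards = [evaluate(w) for w in trained]
        best = max(range(len(candidates)), key=rewards.__getitem__)
        recorded.extend(candidates[best])  # record the top sub-schedule
        weight = trained[best]             # clone the top weight into all children
    return weight, recorded

rng = random.Random(0)
data = [rng.uniform(-1.0, 3.0) for _ in range(50)]
# 5 intervals, 4 children, 8 samples per sub-schedule, drawn uniformly at random.
candidates = [[[rng.randrange(len(data)) for _ in range(8)] for _ in range(4)]
              for _ in range(5)]
weight, schedule = multi_exploitation(data, candidates)
```

Replaying the recorded `schedule` from the initial weight reproduces the final weight, which is the sense in which the recorded schedule can be reused to train a model.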
4.2 Exploration by Searching in Sampling Distribution Space
In the exploration, we search in the sampling distribution space by updating and perturbing the sampling distribution. We first estimate the underlying sampling distribution from the top sampling schedule produced in the multiexploitation, that is, for ,
(6) 
where denotes the number of ’s appearances in . We further perturb the and generate the sampling schedules on each worker for the later multiexploitation. We introduce perturbations into the generated schedules by simply sampling from the multinomial distribution using different random seeds. However, in our experiments, we observe that the distribution produced by
tends to be extremely skewed and a majority of the data actually have zero frequencies. Such skewness causes highly imbalanced training minibatches, and therefore destabilizes subsequent model training.
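A minimal sketch of estimating a sampling distribution from appearance counts is given below, assuming the estimate is the relative frequency of each data index in the recorded schedule (the count of appearances divided by the schedule length). The function name and the toy schedule are hypothetical. Even in this tiny example, half of the data points receive zero probability, which is the skewness issue noted above.

```python
from collections import Counter

def estimate_distribution(schedule, dataset_size):
    # Relative frequency of each data index in the recorded schedule.
    counts = Counter(schedule)
    total = len(schedule)
    return [counts[i] / total for i in range(dataset_size)]

schedule = [0, 0, 0, 2, 2, 5]   # toy recorded schedule over 6 data points
probs = estimate_distribution(schedule, dataset_size=6)
zero_mass = sum(1 for p in probs if p == 0.0)   # data points never sampled
```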
Distribution Smoothing To tackle the above issue, we first smooth through the logarithmic function, and then apply a probability mixture with uniform distributions. In particular for the dataset ,
(7) 
where is the smoothing factor and denotes uniform multinomial distributions on the dataset . The smoothing through the function can greatly reduce the skewness, however, may still contain zero probabilities for some training data, resulting in unstable training. Therefore, we further smooth it through a probability mixture with uniform distribution to ensure presence of all data. This is equivalent to combining epochs of training data to the training batches sampled from , and shuffling the union. Once we have new diverse sampling schedules for the population, we proceed to the next multiexploitation step.
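The two smoothing stages can be sketched as follows. The exact functional form of the logarithmic smoothing is not recoverable from the text here, so this example assumes `log(1 + count)` followed by normalization, and then a mixture `(1 - beta) * p + beta * uniform`; `beta`, the function name and the toy counts are all illustrative assumptions rather than the authors' definitions.

```python
import math

def smooth_distribution(counts, beta=0.5):
    n = len(counts)
    # Stage 1 (assumed form): compress skewed counts with log(1 + c), then normalize.
    logged = [math.log1p(c) for c in counts]
    total = sum(logged)
    compressed = [v / total for v in logged] if total > 0 else [1.0 / n] * n
    # Stage 2: mix with the uniform distribution so every data point keeps
    # non-zero probability.
    return [(1.0 - beta) * p + beta / n for p in compressed]

counts = [90, 9, 1, 0, 0]                   # heavily skewed toy counts
raw = [c / sum(counts) for c in counts]     # unsmoothed relative frequencies
smoothed = smooth_distribution(counts, beta=0.5)
```

In this sketch the largest probability drops well below its raw value and no data point is left with zero probability, which are the two properties the smoothing is meant to provide.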
We continue this alternation between multi-exploitation and exploration steps until the end of training. Note that to generate the sampling schedules for the first multi-exploitation run, we initialize the sampling distribution to be a uniform multinomial distribution. In the end, we output a sequence of optimal sampling schedules, one for each alternation. The entire process is illustrated in detail in Alg. 2.
Table 1: Top-1 accuracy (%) on CIFAR100 for different numbers of workers, exploitation intervals and exploration types.
Network                        Workers  Interval    Exploration type  Top-1 (%)
ResNet18 (Zhang et al., 2019)  -        -           -                 78.34
ResNet18                       1        -           Uniform           78.46
ResNet18                       20       80 batches  Random            78.76
ResNet18                       20       20 batches  Random            78.99
ResNet18                       80       20 batches  Random            79.09
ResNet18                       20       20 batches  Mixture           79.44
ResNet50 (Jin et al., 2019)    -        -           -                 79.34
ResNet50                       1        -           Uniform           79.70
ResNet50                       20       80 batches  Random            80.55
ResNet50                       20       20 batches  Random            81.05
ResNet50                       80       20 batches  Random            81.19
ResNet50                       20       20 batches  Mixture           81.53
DenseNet121                    1        -           Uniform           80.13
DenseNet121                    20       80 batches  Random            80.62
DenseNet121                    20       20 batches  Random            81.11
DenseNet121                    80       20 batches  Random            81.08
DenseNet121                    20       20 batches  Mixture           80.97
Table 2: Top-1 accuracy (%) on CIFAR10 for different exploration types.
Network   Exploration type  Top-1 (%)
ResNet18  Uniform           93.01
ResNet18  Random            95.86
ResNet18  Mixture           95.80
ResNet50  Uniform           93.60
ResNet50  Random            96.10
ResNet50  Mixture           96.09
Table 3: Top-1 accuracy (%) on ImageNet for different exploration types.
Network   Exploration type  Top-1 (%)
ResNet18  Uniform           70.38
ResNet18  Random            72.07
ResNet18  Mixture           72.91
ResNet34  Uniform           74.09
ResNet34  Random            76.11
ResNet34  Mixture           76.92
Table 4: Top-1 accuracy (%) on CIFAR100 with uniform, static and dynamic sampling schedules.
Network   Uniform  Static  Dynamic
ResNet18  78.46    78.80   79.44
ResNet50  79.70    80.21   81.53

5 Experiments
In this section, we present comprehensive experiments on various datasets to illustrate the performance of AutoSampling, and also demonstrate the process of progressively learning better sampling distribution.
5.1 Implementation Details
Experiments on CIFAR We use the same training configuration for both the CIFAR100 and CIFAR10 datasets, each of which consists of 50000 training images. In particular, for model training we use a base learning rate of 0.1 and a step-decay learning rate schedule in which the learning rate is divided by 10 every 60 epochs. We run the experiments for 240 epochs. In addition, we set the training batch size to 128 per worker, and each worker uses one NVIDIA V100 GPU.
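For concreteness, the step-decay schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from zero and that the decay takes effect at epochs 60, 120 and 180; the function name and defaults are illustrative and not taken from the authors' code.

```python
def step_decay_lr(epoch, base_lr=0.1, decay_every=60, factor=10.0):
    # Learning rate used during `epoch` (0-indexed): divided by `factor`
    # once for every completed block of `decay_every` epochs.
    return base_lr / (factor ** (epoch // decay_every))

lr_per_epoch = [step_decay_lr(e) for e in range(240)]
```

Over the 240 epochs this yields four plateaus, from 0.1 down to 1e-4.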
We run the explore step for each epochs with , but note that we take the first explore step after the initial epochs to better accumulate enough rewards. The experiments require 4800 epochs of training for 20 workers, and roughly 14 hours of training time.
Experiments on ImageNet For ImageNet, which consists of 1.28 million training images, we adopt a base learning rate of 0.2 and a cosine-decay learning rate schedule. We run the experiments for 100 epochs of training. Each worker uses eight NVIDIA V100 GPUs and a total batch size of 512. Eight workers are used for all ImageNet experiments, and the remaining settings follow those of the CIFAR experiments. In addition, we use FP16 computation to speed up training, with almost no drop in accuracy in practice. The experiments require 800 epochs of training across the 8 workers, and roughly 4 days of training time.
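The stated compute budgets follow directly from the number of workers and the length of each run. The sketch below reproduces them and, as an additional illustration, estimates how many 20-batch exploitation intervals fit into one CIFAR epoch. The assumption that the last partial mini-batch of an epoch is kept, and all names used, are ours.

```python
import math

def total_epochs(epochs_per_run, num_workers):
    # Every worker trains its own child model for the full run.
    return epochs_per_run * num_workers

cifar_total = total_epochs(240, 20)     # CIFAR setting: 240 epochs, 20 workers
imagenet_total = total_epochs(100, 8)   # ImageNet setting: 100 epochs, 8 workers

# Assumption: the last partial mini-batch of an epoch is kept.
cifar_iters_per_epoch = math.ceil(50_000 / 128)
# Complete 20-batch exploitation intervals (reward collections) per CIFAR epoch.
intervals_per_epoch = cifar_iters_per_epoch // 20
```

Under these assumptions a reward is collected roughly 19 times per CIFAR epoch, compared with once every several epochs in interval-based PBT.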
Table 5: Top-1 accuracy (%) of ResNet50 trained with static sampling schedules learned on different source models.
Network   Uniform  ResNet18  ResNet50  DenseNet121
ResNet50  79.70    80.27     80.21     80.47
5.2 Ablation study
For this part, we gradually build up and test components of AutoSampling on CIFAR100, and then examine their performances on CIFAR10 and ImageNet datasets.
Exploration types We first introduce the three exploration types examined, corresponding to one baseline and two variants of AutoSampling.

Uniform Exploration corresponds to regular model training with minibatches uniformly sampled from the dataset.

Random Exploration adds the multi-exploitation step on top of the uniform exploration. In particular, the random exploration method conducts the multi-exploitation step (Sec. 4.1) among several workers. Between multi-exploitation steps, it generates the subsequent sampling schedules for each worker simply through uniform sampling. Note that random exploration with one worker is equivalent to uniform exploration.

Mixture Exploration adds the sampling distribution search (Sec. 4.2) on top of the random exploration, between multi-exploitation steps, completing the AutoSampling method.
Adding Workers To examine the influence of the number of workers, we conduct experiments using 1, 20 and 80 workers with the same setting (with random exploration). With one worker, the experiment is simply normal model training using stochastic gradient descent. To show the competitiveness of our baselines, we also include recent state-of-the-art results on CIFAR100 with ResNet18 and ResNet50 (Zhang et al., 2019; Jin et al., 2019). We observe a significant performance gain with 20 workers for ResNet18, ResNet50 and DenseNet121 (He et al., 2015; Huang et al., 2017), as shown in Table 1. However, increasing the number of workers from 20 to 80 brings at most marginal performance gains across the model structures. Therefore, we set the number of workers to 20 for the rest of the experiments.
Shortening Exploitation Intervals To study the effect of shortening the exploitation interval, we run experiments with exploitation intervals of 20 and 80 batches (iterations). As shown in Table 1, models trained with the shorter interval of 20 batches perform better than those trained with the longer interval across all three network structures, consistent with our assumption that the collected reward reflects the value of each data point used in the exploitation interval. This result also agrees with our intuition that a shorter exploitation interval lets the sampler accumulate more rewards and thus learn better sampling schedules. For the rest of this section we keep the exploitation interval at 20 batches.
Adding Exploration Type We further add mixture as the exploration type to see the effect of learning the underlying sampling distribution, which completes the proposed method. As shown in Table 1, with ResNet18 and ResNet50 the mixture exploration pushes performance higher, outperforming the uniform baseline by about 1 and 1.8 percentage points on CIFAR100, respectively. However, this does not hold for DenseNet121, which may be attributed to the larger capacity of DenseNet121.
Generalization Over Datasets In addition, we experiment on other datasets. We report the results on CIFAR10 in Table 2 and the results of ResNet18 and ResNet34 on ImageNet in Table 3. For CIFAR10, the mixture and random exploration methods are comparable and both outperform the uniform baseline, which we believe is due to the simplicity of the dataset. On the more challenging ImageNet, the mixture exploration outperforms the random exploration by a clear margin. We also compare AutoSampling with some recent non-uniform sampling methods on CIFAR100 in Sec. 5.6.
5.3 Static vs Dynamic Schedules
We aim to see whether the final sampling distribution estimated by AutoSampling is sufficient to produce robust sampling schedules. In other words, we wish to know whether training with AutoSampling is a process of learning a robust sampling distribution, or a process of dynamically adjusting the sampling schedule for optimal training. To this end, we conduct training using different sampling schedules. First, we calculate the sampling distribution estimated throughout the learning steps of AutoSampling and use it to generate the sampling schedule of a full training process, which we denote as static. Second, we denote the sampling schedule learned using AutoSampling as dynamic, since AutoSampling dynamically adjusts the sampling schedule alongside the training process. Finally, we denote the baseline method as uniform, which uses a sampling schedule generated from the uniform distribution.
We report results on CIFAR100 with ResNet18 and ResNet50 in Table 4. Models trained with static sampling schedules clearly exceed the uniform baseline, indicating that the learned sampling distribution is superior to the uniform distribution and that AutoSampling is able to learn a good sampling distribution. Nonetheless, models trained with dynamic sampling schedules outperform models trained with static ones by a margin larger than that between static and uniform. This result shows that, despite AutoSampling's capability of learning a good sampling distribution, its flexibility during training matters even more. Moreover, this phenomenon indicates that models at different stages of the learning process may require different sampling distributions to achieve optimal training. A single sampling distribution, even one gradually estimated using AutoSampling, seems incapable of covering the needs of different learning stages. In Fig. 2, we plot histograms of data counts in training, estimated from schedules at different learning stages with ResNet18 on CIFAR100; they show large differences between the optimized sampling distributions from different epochs.
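A static schedule in the sense used above can be generated by drawing every training index up front from one fixed distribution. The following sketch assumes the distribution is given as a list of per-image probabilities; the function name, the toy probabilities and the seed are hypothetical.

```python
import random

def static_schedule(probs, num_samples, seed=0):
    # Draw the whole training schedule up front from one fixed distribution.
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=num_samples)

probs = [0.5, 0.25, 0.25, 0.0]   # hypothetical final distribution over 4 images
schedule = static_schedule(probs, num_samples=1000)
```

A dynamic schedule differs in that the distribution is re-estimated and perturbed between multi-exploitation steps, so no single `probs` vector describes the whole run.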
Table 6: Comparison with existing non-uniform sampling methods on CIFAR100.
Methods              Network   Baseline (%)  With method (%)  Improvement (%)
DLIS                 WRN-28-2  66.0          68.0             2.0
AutoSampling (ours)  WRN-28-2  73.37         76.24            2.87
RAIS                 ResNet18  76.4          76.4             0.0
AutoSampling (ours)  ResNet18  78.46         79.44            0.98
5.4 Analyzing sampling schedules learned by AutoSampling
To further investigate the sampling schedule learned by AutoSampling, we review the images at the head and tail of the sampling spectrum. In particular, given a learned sampling schedule, we rank all images by their number of appearances in the training process. Training images at the top and bottom of the ranking are extracted, corresponding to high and low probabilities of being sampled, respectively. In Fig. 3, we show exemplary images from 4 classes. Conforming to our presumption, the sampling probability seems to indicate the difficulty of each training image. Images with low probability tend to have clearer visual features that enable easy recognition, while images with high probability tend to be more obscure. This result indicates that the sampling schedule learned by AutoSampling may possess some hard-example mining ability.
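The ranking procedure described above amounts to counting appearances and sorting. A minimal sketch, with a hypothetical function name and a toy schedule, is:

```python
from collections import Counter

def rank_by_frequency(schedule, dataset_size):
    counts = Counter(schedule)
    # Sort indices from most to least sampled; ties are broken by index
    # so that the ranking is deterministic.
    return sorted(range(dataset_size), key=lambda i: (-counts[i], i))

ranking = rank_by_frequency([3, 1, 3, 3, 0, 1], dataset_size=5)  # [3, 1, 0, 2, 4]
head, tail = ranking[:2], ranking[-2:]   # most and least sampled images
```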
We also compare the sampling frequency of each training image with its loss values at different training epochs on CIFAR100. As shown in Fig. 4, across different learning stages the correlation between loss values and sampling frequencies of training data is not strong. A high chance of being sampled by AutoSampling does not necessarily coincide with high loss values, which suggests that AutoSampling is not merely oversampling samples that the loss marks as difficult, and therefore has potential beyond simple mining of visually hard examples. The resulting sampling schedule learned by AutoSampling might be significantly different from one guided by loss.
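One way to quantify such a relationship is the Pearson correlation between per-image sampling frequency and per-image loss. The text does not state which statistic underlies Fig. 4, so the choice of Pearson correlation, the function below and the toy numbers are assumptions made purely for illustration.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equally long sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-image sampling frequencies and losses at one epoch.
frequencies = [12, 3, 7, 1, 9, 5]
losses = [0.4, 0.9, 0.2, 0.5, 0.8, 0.1]
r = pearson(frequencies, losses)
```

A value of `r` near zero would correspond to the weak relationship reported above, whereas a purely loss-driven sampler would be expected to give a clearly positive value.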
In addition, we notice that the low-probability images also include low-quality images. For instance, in Fig. 3 the leftmost image of the camel class contains only legs. This shows that AutoSampling may potentially rule out problematic training data for better training.
Furthermore, we examine the transferability of sampling distributions learned by AutoSampling to other network structures. Specifically, we train ResNet50 (He et al., 2015) using static sampling schedules generated from three distributions learned by AutoSampling on 3 different models. As shown in Table 5, sampling schedules learned by AutoSampling on other models yield similar improvements over the uniform baseline. This generalization across model structures, combined with the above observations on images of different sampling probabilities, indicates that there may exist a common optimal sampling schedule determined by the intrinsic properties of the data rather than by the model being optimized. AutoSampling is an effort to gradually converge to such an optimal schedule.
5.5 Discussions
The experimental results and observations from Sections 5.3 and 5.4 suggest the possible existence of an optimal sampling schedule that depends only on the intrinsic properties of the data and the learning stage of the model, regardless of the specific model structure or any human prior knowledge. Compared with related works, AutoSampling provides relatively many rewards during the search process, leading to sufficient convergence towards the desired sampling schedule. Once obtained, the sampling schedule may also generalize to other model structures for robust training, as shown in Table 5. Although AutoSampling requires a relatively large amount of computing resources to find a robust sampler, the efficiency of our method can be improved through better training techniques. Moreover, the possibility of an optimal sampling schedule relying solely on the data themselves suggests that more efficient sampling policy search algorithms may exist, if one can quickly and effectively determine the value of each data point from its properties.
5.6 Comparison with existing sampling methods
To better illustrate the effectiveness of AutoSampling, we conduct experiments in comparison with the recent non-uniform sampling methods DLIS (Katharopoulos and Fleuret, 2018) and RAIS (Johnson and Guestrin, 2018). DLIS achieves faster convergence by selecting data that reduce gradient norm variance, while RAIS does so by approximating the ideal sampling distribution using robust optimization. The comparison is recorded in Table 6.
First, we run AutoSampling using Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) on CIFAR100 with the training setting roughly aligned to that of Katharopoulos and Fleuret (2018). AutoSampling achieves an improvement of roughly 3 percentage points (73.37% → 76.24%), while Katharopoulos and Fleuret report an improvement of 2 percentage points (66.0% → 68.0%). Second, we compare AutoSampling with RAIS on CIFAR100. Johnson and Guestrin report no improvement in accuracy (76.4% → 76.4%) and a 0.027 decrease in validation loss (0.989 → 0.962), while our method improves accuracy by about 1 percentage point (78.46% → 79.44%, Table 6) and decreases validation loss by 0.014 (0.886 → 0.872). As such, our method demonstrates improvements over existing non-uniform sampling methods.
6 Conclusions
In this paper, we introduce a new search-based AutoSampling scheme that overcomes the issue of insufficient rewards for optimizing high-dimensional sampling hyperparameters by using a shorter period of reward collection. In particular, we leverage the population-based training framework (Jaderberg et al., 2017). We use a shortened exploitation interval to search in the local data space and provide sufficient rewards. For the exploration step, we estimate the sampling distribution from the searched sampling schedule and perturb it to search in the distribution space. We test our method on the CIFAR10/100 and ImageNet datasets (Krizhevsky, 2009; Deng et al., 2009) with different networks and show that it consistently outperforms the baseline methods across different benchmarks.
References
Bengio et al. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, New York, NY, USA, pp. 41–48. ISBN 9781605585161.
Bergstra and Bengio (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, pp. 281–305. ISSN 1532-4435.
Byrd and Lipton (2018). Weighted risk minimization & deep learning. CoRR abs/1812.03372.
Cubuk et al. (2018). AutoAugment: learning augmentation policies from data. CoRR abs/1805.09501.
Deng et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Drummond et al. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, Vol. 11, pp. 1–8.
Estabrooks et al. (2004). A multiple resampling method for learning from imbalanced data sets. Computational intelligence 20 (1), pp. 18–36.
Fan et al. (2017). Learning what data to learn. CoRR abs/1702.08635.
He et al. (2015). Deep residual learning for image recognition. CoRR abs/1512.03385.
Ho et al. (2019). Population based augmentation: efficient learning of augmentation policy schedules. CoRR abs/1905.05393.
Huang et al. (2017). Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jaderberg et al. (2017). Population based training of neural networks. CoRR abs/1711.09846.
Jiang et al. (2017). MentorNet: regularizing very deep neural networks on corrupted labels. CoRR abs/1712.05055.
Jin et al. (2019). Knowledge distillation via route constrained optimization. In The IEEE International Conference on Computer Vision (ICCV).
Johnson and Guestrin (2018). Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7265–7275.
Katharopoulos and Fleuret (2018). Not all samples are created equal: deep learning with importance sampling. CoRR abs/1803.00942.
Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
Li et al. (2019). LAW: learning to auto weight. CoRR abs/1905.11058.
Lin et al. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
Lorraine et al. (2019). Optimizing millions of hyperparameters by implicit differentiation. Proceedings of AISTATS 11.
Loshchilov and Hutter (2015). Online batch selection for faster training of neural networks. CoRR abs/1511.06343.
MacKay et al. (2019). Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. Proceedings of ICLR, abs/1903.03088.
Ren et al. (2018). Learning to reweight examples for robust deep learning. CoRR abs/1803.09050.
Shrivastava et al. (2016). Training region-based object detectors with online hard example mining. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Snoek et al. (2015). Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2171–2180.
Wang et al. (2019). Dynamic curriculum learning for imbalanced data classification. In The IEEE International Conference on Computer Vision (ICCV).
Weiss et al. (2007). Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs?. Dmin 7 (3541), pp. 24.
Zagoruyko and Komodakis (2016). Wide residual networks. CoRR abs/1605.07146.
Zhang et al. (2019). Lookahead optimizer: k steps forward, 1 step back. CoRR abs/1907.08610.
Zhang et al. (2020). Adversarial AutoAugment. In International Conference on Learning Representations.