
AutoSampling: Search for Effective Data Sampling Schedules

by   Ming Sun, et al.

Data sampling plays a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn because of the inherently high dimension of the parameters involved. In this paper, we propose an AutoSampling method to automatically learn sampling schedules for model training, which consists of a multi-exploitation step aiming for optimal local sampling schedules and an exploration step for the ideal sampling distribution. More specifically, we achieve sampling schedule search with a shortened exploitation cycle to provide enough supervision. In addition, we periodically estimate the sampling distribution from the learned sampling schedules and perturb it to search in the distribution space. The combination of the two searches allows us to learn a robust sampling schedule. We apply our AutoSampling method to a variety of image classification tasks, illustrating the effectiveness of the proposed method.



1 Introduction

Data sampling policies can greatly influence the performance of model training in computer vision tasks, and therefore finding robust sampling policies can be important. Handcrafted rules, e.g. data resampling, reweighting, and importance sampling, promote better model performance by adjusting the training data frequency and order (Estabrooks et al., 2004; Weiss et al., 2007; Bengio et al., 2009; Johnson and Guestrin, 2018; Katharopoulos and Fleuret, 2018; Shrivastava et al., 2016). But they rely heavily on assumptions about the dataset and cannot adapt well to datasets with their own characteristics. To handle this issue, learning-based methods (Li et al., 2019; Jiang et al., 2017; Fan et al., 2017) were designed to automatically reweight or select training data using meta-learning techniques or a policy network.

However, existing learning-based sampling methods still rely on human priors as proxies to optimize sampling policies, which may fail in practice. Such priors often include assumptions on policy network design for data selection (Fan et al., 2017), or dataset conditions like noisiness (Li et al., 2019; Loshchilov and Hutter, 2015) or imbalance (Wang et al., 2019). These approaches take image features, losses, importance or their representations as inputs and apply a policy network or other learning approaches with a small number of parameters to estimate the sampling probability. However, images with similar visual features can be redundant in training, and their losses or features fed into the policy network are likely to be close, so redundant samples receive nearly the same sampling probability if we rely on the aforementioned priors. Therefore, we propose to directly optimize the sampling schedule itself so that no prior knowledge of the dataset is required. Specifically, the sampling schedule refers to the order in which data are selected over the entire training course. In this way, we rely only on the data themselves to determine the optimal sampling schedule, without any prior.

Directly optimizing a sampling schedule is challenging due to its inherently high dimension. For example, for the ImageNet classification dataset (Deng et al., 2009) with around one million samples, the dimension of parameters would be of the same order. While popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015), population-based training (Jaderberg et al., 2017) or simple random search (Bergstra and Bengio, 2012) have already been used to tune low-dimensional hyper-parameters like augmentation schedules, their application to data sampling schedules remains unexplored. For instance, the dimension of a data augmentation policy is generally only in the dozens, yet it takes thousands of training runs (Cubuk et al., 2018) to sample enough rewards to find an optimal augmentation policy, because high-quality rewards require many epochs of training to obtain. As such, optimizing a sampling schedule may require orders of magnitude more rewards, and hence training runs, than data augmentation, which results in prohibitively slow convergence. To overcome the aforementioned challenge, we propose a data sampling policy search framework, named AutoSampling, to sufficiently learn an optimal sampling schedule in a population-based training fashion (Jaderberg et al., 2017). Unlike previous methods, which focus on collecting long-term rewards and updating hyper-parameters or agents offline, our AutoSampling method collects rewards online with a shortened collection cycle but without priors. Specifically, AutoSampling collects rewards within several training iterations, a cycle tens or hundreds of times shorter than that in existing works (Ho et al., 2019; Cubuk et al., 2018). In this manner, we provide the search process with much more frequent feedback to ensure sufficient optimization of the sampling schedule. Every few training iterations, we collect the rewards from the preceding iterations, accumulate them, and later update the sampling distribution using the rewards. Then, we perturb the sampling distribution to search in the distribution space, and use it to generate new mini-batches for later iterations, which are recorded into the output sampling schedule. As illustrated in Sec. 5.2, shortened collection cycles with less interference can also better reflect the training value of each data sample.

Our contributions are as follows:

  • To the best of our knowledge, we are the first to propose directly learning a robust sampling schedule from the data themselves without any human prior or condition on the dataset.

  • We propose the AutoSampling method to handle the optimization difficulty caused by the high dimension of sampling schedules, and efficiently learn a robust sampling schedule through a shortened reward collection cycle and online updates of the sampling schedule.

Comprehensive experiments on the CIFAR-10/100 and ImageNet datasets (Krizhevsky, 2009; Deng et al., 2009) with different networks show that AutoSampling can increase the top-1 accuracy by up to 2.85% on CIFAR-10, 2.19% on CIFAR-100, and 2.83% on ImageNet.

2 Background

2.1 Related Works

Data sampling is of great significance to deep learning, and has been extensively studied. Approaches with human-designed rules take pre-defined heuristic rules to modify the frequency and order in which training data is presented. In particular, one intuitive method is to resample or reweight data according to their frequencies, difficulties or importance in training (Estabrooks et al., 2004; Weiss et al., 2007; Drummond et al., 2003; Bengio et al., 2009; Lin et al., 2017; Shrivastava et al., 2016; Loshchilov and Hutter, 2015; Wang et al., 2019; Johnson and Guestrin, 2018; Katharopoulos and Fleuret, 2018; Byrd and Lipton, 2018). These methods have been widely used in imbalanced training or hard mining problems. However, they are often restricted to the tasks and datasets for which they were proposed, and their ability to generalize to a broader range of tasks with different data distributions may be limited. In other words, these methods often implicitly assume certain conditions on the dataset, such as cleanness or imbalance. In addition, learning-based methods have been proposed for finding suitable sampling schemes automatically. For example, methods using meta-learning or reinforcement learning automatically select or reweight data during training (Li et al., 2019; Jiang et al., 2017; Ren et al., 2018; Fan et al., 2017), but they have only been tested on small-scale or noisy datasets. Whether they can generalize to tasks on other datasets remains untested. In this work, we directly study data sampling without any prior, and we also investigate its generalization ability across different datasets such as CIFAR-10, CIFAR-100 and ImageNet using many typical networks.

As for hyper-parameter tuning, popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015) or simple random search (Bergstra and Bengio, 2012) have already been used to tune low-dimensional hyper-parameters and have proven effective. Nevertheless, they have not been adopted to find good sampling schedules because of the inherently high dimension of such schedules. Some recent works tackle the challenge of optimizing high-dimensional hyper-parameters. MacKay et al. (2019) use structured best-response functions, and Jonathan Lorraine (2019) achieves this goal by combining the implicit function theorem with efficient inverse Hessian approximations. However, these methods have not been tested on the task of optimizing sampling schedules, which is the major focus of our work in this paper.

Figure 1: Overview of AutoSampling illustrated through one multi-exploitation-and-exploration cycle. a) The multi-exploitation step, illustrated by the left half, is the process of learning optimal sampling schedule locally. The same color of model for each worker indicates that the same model weight is cloned into it. Also for simplicity, in this figure we adopt the exploitation interval of length 1. b) The exploration step, shown by the right half, is to search in the sampling distribution space. Specifically, we estimate the sampling distribution from the schedules collected in the multi-exploitation step and perturb it to generate new sampling schedules for all workers.

2.2 Population Based Training

The hyper-parameter tuning task can be framed as a bi-level optimization problem with the following objective function,

$$\max_{h \in \mathcal{H}} \ \mathrm{eval}(\theta^*, h) \quad \text{subject to} \quad \theta^* = \arg\min_{\theta \in \Theta} L(\theta, h), \tag{1}$$

where $\theta$ represents the model weight, $h = (h_1, h_2, \dots, h_T)$ is the hyper-parameter schedule for $T$ training intervals, $L$ denotes the training objective and $\mathrm{eval}$ the validation performance. Population based training (PBT) (Jaderberg et al., 2017) solves the bi-level optimization problem by training a population $\mathcal{P}$ of child models in parallel with different hyper-parameter schedules initialized:

$$\mathcal{P} = \{(\theta_i^t, h_i^t)\}_{i=1}^{N}, \tag{2}$$

where $\theta_i^t$ and $h_i^t$ respectively represent the child model weight and the corresponding hyper-parameter schedule for the training interval $t$ on worker $i$, and $N$ is the number of workers. PBT proceeds in intervals, each of which usually consists of several epochs of training. During an interval, the population of models is trained in parallel to finish the lower-level optimization of the weights $\theta_i^t$.

Between intervals, an exploit-and-explore procedure is adopted to conduct the upper-level optimization of the hyper-parameter schedule. In particular, for interval $t$, to exploit, the child models are evaluated on a held-out validation dataset:

$$h_t^*, \theta_t^* = \arg\max_{(\theta_i^t, h_i^t) \in \mathcal{P}} \mathrm{eval}(\theta_i^t, h_i^t), \qquad \theta_i^t \leftarrow \theta_t^*, \quad i = 1, \dots, N. \tag{3}$$

The best-performing hyper-parameter setting $h_t^*$ is recorded and the top-performing model $\theta_t^*$ is broadcast to all workers. To explore, new hyper-parameter schedules are initialized for interval $t+1$ with different random seeds on all workers, which can be viewed as a search in the hyper-parameter space. The next exploit-and-explore cycle then continues. In the end, the top-performing hyper-parameter schedule $h^* = (h_1^*, h_2^*, \dots, h_T^*)$ is obtained.
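The exploit-and-explore cycle described above can be sketched on a toy problem. This is a minimal illustration rather than the PBT implementation: the scalar model, the quadratic loss, the learning rate as the tuned hyper-parameter, and all function names are assumptions made for the example.

```python
import random


def train_interval(theta, lr, steps=5):
    """Lower level: a few gradient steps on the toy loss (theta - 3)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - 3.0)
    return theta


def evaluate(theta):
    """Held-out score, higher is better (negative toy validation loss)."""
    return -(theta - 3.0) ** 2


def pbt(num_workers=4, num_intervals=6, seed=0):
    rng = random.Random(seed)
    thetas = [0.0] * num_workers
    best_schedule = []  # best hyper-parameter recorded for each interval
    for _ in range(num_intervals):
        # Explore: every worker draws a fresh hyper-parameter for this interval.
        lrs = [rng.uniform(0.01, 0.4) for _ in range(num_workers)]
        # Lower level: train all children (sequentially here, in parallel in PBT).
        thetas = [train_interval(th, lr) for th, lr in zip(thetas, lrs)]
        # Exploit: select the best child on the held-out score ...
        best = max(range(num_workers), key=lambda i: evaluate(thetas[i]))
        best_schedule.append(lrs[best])
        # ... and broadcast its weight to every worker.
        thetas = [thetas[best]] * num_workers
    return thetas[0], best_schedule
```

Calling `pbt()` returns the final toy weight together with one recorded learning rate per interval, mirroring how the top-performing schedule is assembled interval by interval.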

PBT has been applied to tune low-dimensional hyper-parameters such as data augmentation schedules (Ho et al., 2019; Jaderberg et al., 2017). However, it cannot be directly used for finding sampling strategies due to the high dimension. For instance, in PBA (Ho et al., 2019) 3200 epochs of training are needed to optimize 60 hyper-parameters for data augmentation, and a linear up-scaling to learning a sampling schedule on CIFAR-100 would require a prohibitive 2.67 million epochs. Unlike PBT, our AutoSampling adopts a multi-exploitation-and-exploration structure, leading to much shorter reward collection cycles that provide far more effective rewards for sufficient optimization within a practical computational budget.
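The 2.67 million figure can be reproduced with a short calculation. The sketch below assumes that the linear up-scaling treats each of the 50000 CIFAR-100 training images as one dimension of the sampling schedule, which is an interpretation rather than something the text states explicitly.

```python
# Figures quoted in the text for PBA (Ho et al., 2019).
pba_epochs = 3200   # epochs of training used by PBA
pba_dims = 60       # augmentation hyper-parameters tuned by PBA

# Assumption: one schedule dimension per CIFAR-100 training image.
schedule_dims = 50_000

epochs_per_dim = pba_epochs / pba_dims
scaled_epochs = epochs_per_dim * schedule_dims
print(f"{scaled_epochs / 1e6:.2f} million epochs")
```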

3 Preliminaries

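Consider the bi-level optimization problem detailed by Equation 1 over a training dataset $\mathcal{D}$. We define a sampling schedule to be an enumerated collection of data from the dataset for training the model, that is, $h = (x_1, x_2, \dots, x_M)$, with each $x_j$ sampled from $\mathcal{D}$. We use $M$ to denote the total number of data trained for the lower task of optimizing the model weight $\theta$. The product of $M$ copies of the dataset, $\mathcal{D}^M$, constructs the search space of sampling schedules, from which we wish to find an optimal sampling schedule $h^*$ to solve the aforementioned bi-level optimization problem:

$$h^* = \arg\max_{h \in \mathcal{D}^M} \mathrm{eval}(\theta^*, h) \quad \text{subject to} \quad \theta^* = \arg\min_{\theta \in \Theta} L(\theta, h). \tag{4}$$

Note that a sampling schedule may also be represented as an enumerated collection of several sampling sub-schedules. For instance, a sampling schedule can be denoted as $h = (h_1, h_2, \dots, h_T)$, where each $h_t$ is a sub-schedule of consecutive training data.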
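As a concrete picture of these definitions, a sampling schedule can be stored as a list of dataset indices and cut into consecutive sub-schedules. This is only an illustrative sketch: the function names and the choice of representing data by integer indices are assumptions.

```python
import random


def uniform_schedule(dataset_size, num_samples, seed=0):
    """Draw a schedule of dataset indices uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(dataset_size) for _ in range(num_samples)]


def split_into_subschedules(schedule, interval_size):
    """Cut a schedule into consecutive sub-schedules of interval_size items."""
    return [
        schedule[start:start + interval_size]
        for start in range(0, len(schedule), interval_size)
    ]


schedule = uniform_schedule(dataset_size=10, num_samples=12)
sub_schedules = split_into_subschedules(schedule, interval_size=4)
```

Concatenating the sub-schedules in order recovers the full schedule, which is what allows a schedule to be searched one interval at a time.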

4 AutoSampling with Searching

The overview of our AutoSampling is illustrated in Fig. 1. AutoSampling alternately runs the multi-exploitation step and the exploration step. In the exploration step, we 1) update the sampling distribution using the rewards collected from the multi-exploitation step (the sampling distribution is initially uniform); 2) perturb the updated sampling distribution for the child models so that different child models have different sampling distributions; 3) use the corresponding perturbed sampling distribution for each child model to sample mini-batches of training data. In the multi-exploitation step, we 1) train multiple child models using the mini-batches sampled in the exploration step; 2) collect short-term rewards from the child models. AutoSampling finishes with a recorded top-performing sampling schedule, which can be transferred to other models.

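  Input: Training dataset $\mathcal{D}$, population $\mathcal{P} = \{(\theta_i, h_i)\}_{i=1}^{N}$, population size $N$, number of exploitation intervals $T$, exploitation interval length $K$
  Initialize $h^* \leftarrow ()$
  for $t = 1$ to $T$ do
     for $i = 1$ to $N$ do
        for each mini-batch $b$ in $h_i^t$ do
           $\theta_i \leftarrow \mathrm{train}(\theta_i, b)$        ▷ update the weight of child model i
        end for
     end for
     $\theta_t^*, h_t^* \leftarrow \arg\max_{(\theta_i, h_i^t)} \mathrm{eval}(\theta_i, h_i^t)$; append $h_t^*$ to $h^*$        ▷ update the recorded sampling schedule
     for $i = 1$ to $N$ do
        $\theta_i \leftarrow \theta_t^*$        ▷ clone the optimal weight
     end for
  end for
  Return $h^*$, $\mathcal{P}$
Algorithm 1 The Multi-Exploitation Step
  Input: Training dataset $\mathcal{D}$, population size $N$
  Initialize $H \leftarrow ()$, $P \leftarrow \mathrm{uniform}(\mathcal{D})$ and initialize child models $\theta_1, \dots, \theta_N$
  while not end of training do
     for $i = 1$ to $N$ do
        Sample $h_i$ from Mixture($P$, $\mathrm{uniform}(\mathcal{D})$)
     end for
     Run the multi-exploitation step (Algorithm 1) on $\{(\theta_i, h_i)\}_{i=1}^{N}$ to obtain $h^*$ and append it to $H$
     Estimate $P$ according to Equation (6)
     Update $P$ according to Equation (7)
  end while
  Return $H$, $P$
Algorithm 2 Search based AutoSampling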

4.1 Multi-Exploitation by Searching in the Data Space

In the multi-exploitation step, we aim to search locally in the data space by collecting short-term rewards and sub-schedules. Specifically, we wish to learn a sampling schedule for $T$ exploitation intervals. In each interval, there is a population $\mathcal{P}$ of $N$ child models. We denote by $h_i^t$ the training data sub-schedule in the $t$-th interval for the $i$-th child model. When all of the exploitation intervals for the $i$-th child model are considered, we have $h_i = (h_i^1, h_i^2, \dots, h_i^T) = (x_1, x_2, \dots, x_M)$, where $M$ is the number of training data used for this multi-exploitation step. Each interval consists of $K$ training iterations, or equivalently $K$ training mini-batches, where $K$ is the length of the interval. AutoSampling is expected to produce a sequence of training samples, denoted by $h^*$, so that a given model is optimally trained. The population $\{h_i\}_{i=1}^{N}$ forms the local search space, from which we aim to search for an optimal sampling schedule $h^*$.

Given the population $\mathcal{P}$, we train the child models in parallel on $N$ workers for interval $t$. Once the interval of data $h_i^t$, containing $K$ training batches, has been used for training, we evaluate all child models and use the top evaluation performance as the reward. According to the reward, we record the top-performing weight and sub-schedule for the current interval $t$, in particular,

$$\theta_t^*, h_t^* = \arg\max_{(\theta_i^t, h_i^t) \in \mathcal{P}} \mathrm{eval}(\theta_i^t, h_i^t). \tag{5}$$

We then update all child model weights in $\mathcal{P}$ by cloning the top-performing weight $\theta_t^*$ into them, so that we can continue searching based on the most promising child. We continue the exploit steps through all $T$ exploitation intervals, and output the recorded optimal sampling schedule $h^* = (h_1^*, h_2^*, \dots, h_T^*)$ for the multi-exploitation step. By using an exploitation interval of a few mini-batches rather than the epochs or even entire training runs adopted by earlier methods, AutoSampling may yield a better and more robust sampling schedule. It should be pointed out that even though rewards in AutoSampling are collected within a much shorter interval, they remain effective. As we directly optimize the sampling schedule, we are concerned only with the data themselves. The short-term rewards reflect the training value of the data in the exploitation interval from which they are collected. For global hyper-parameters such as augmentation schedules, however, short-term rewards may lead to inferior performance, as these hyper-parameters are concerned with the overall training outcome. We describe the multi-exploitation step in detail in Alg. 1.
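The multi-exploitation step can be sketched on a toy problem to make the train, evaluate, record and clone cycle concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is a single scalar, each "mini-batch" is one data point, the interval length is counted in samples, and all names are hypothetical.

```python
import random


def sgd_step(theta, x, lr=0.1):
    """One gradient step on the toy per-sample loss (theta - x)^2."""
    return theta - lr * 2.0 * (theta - x)


def evaluate(theta, val_data):
    """Validation reward, higher is better (negative mean squared error)."""
    return -sum((theta - v) ** 2 for v in val_data) / len(val_data)


def multi_exploitation(data, val_data, theta, schedules, interval_len):
    """schedules[i] is the index list of child i; all lists share one length."""
    num_workers = len(schedules)
    thetas = [theta] * num_workers
    best_schedule = []
    for start in range(0, len(schedules[0]), interval_len):
        subs = [s[start:start + interval_len] for s in schedules]
        # Train every child on its own sub-schedule for this interval.
        for i, sub in enumerate(subs):
            for idx in sub:
                thetas[i] = sgd_step(thetas[i], data[idx])
        # Short-term reward: validation score right after the interval.
        best = max(range(num_workers), key=lambda i: evaluate(thetas[i], val_data))
        # Record the winning sub-schedule and clone the winning weight.
        best_schedule.extend(subs[best])
        thetas = [thetas[best]] * num_workers
    return thetas[0], best_schedule
```

The returned `best_schedule` is stitched together from the winning sub-schedule of each interval, so it generally differs from every individual child's schedule.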

4.2 Exploration by Searching in Sampling Distribution Space

In the exploration step, we search in the sampling distribution space by updating and perturbing the sampling distribution. We first estimate the underlying sampling distribution $P$ from the top sampling schedule $h^*$ produced in the multi-exploitation step, that is, for $x \in \mathcal{D}$,

$$P(x) = \frac{\mathrm{cnt}(x)}{\sum_{u \in \mathcal{D}} \mathrm{cnt}(u)}, \tag{6}$$

where $\mathrm{cnt}(x)$ denotes the number of $x$'s appearances in $h^*$. We further perturb $P$ and generate the sampling schedules on each worker for the later multi-exploitation step. We introduce perturbations into the generated schedules by simply sampling from the multinomial distribution $P$ using different random seeds. However, in our experiments, we observe that the distribution produced by Equation (6) tends to be extremely skewed, and a majority of the data actually have zero frequencies. Such skewness causes highly imbalanced training mini-batches, and therefore destabilizes subsequent model training.

Distribution Smoothing To tackle the above issue, we first smooth $P$ through the logarithmic function, and then apply a probability mixture with the uniform distribution. In particular, for the dataset $\mathcal{D}$,

$$P_{\log}(x) = \frac{\log(\mathrm{cnt}(x) + \beta)}{\sum_{u \in \mathcal{D}} \log(\mathrm{cnt}(u) + \beta)}, \qquad x \sim \mathrm{Mixture}(P_{\log}, U), \tag{7}$$

where $\beta$ is the smoothing factor and $U$ denotes the uniform multinomial distribution on the dataset $\mathcal{D}$. The smoothing through the $\log$ function can greatly reduce the skewness; however, $P_{\log}$ may still contain zero probabilities for some training data, resulting in unstable training. Therefore, we further smooth it through a probability mixture with the uniform distribution to ensure the presence of all data. This is equivalent to combining epochs of training data with the training batches sampled from $P_{\log}$, and shuffling the union. Once we have new diverse sampling schedules for the population, we proceed to the next multi-exploitation step.

We continue this alternation between multi-exploitation and exploration steps until the end of training. Note that to generate the sampling schedules for the first multi-exploitation run, we initialize $P$ to be a uniform multinomial distribution. In the end, we output a sequence of optimal sampling schedules, one for each alternation. The entire process is illustrated in detail in Alg. 2.
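The exploration step can be sketched in plain Python: estimate the empirical distribution from a schedule, smooth it with a logarithm, mix it with the uniform distribution, and draw new schedules with different seeds. The exact smoothing form `log(count + beta)`, the default `beta`, the mixing weight `uniform_weight` and all function names are assumptions made for illustration; the text only states that a logarithmic smoothing and a uniform mixture are used.

```python
import math
import random
from collections import Counter


def estimate_distribution(schedule, dataset_size):
    """Empirical frequency of every dataset index in the schedule."""
    counts = Counter(schedule)
    return [counts[i] / len(schedule) for i in range(dataset_size)]


def log_smooth(schedule, dataset_size, beta=1.0):
    """Assumed smoothing: normalise log(count + beta) over the dataset."""
    counts = Counter(schedule)
    weights = [math.log(counts[i] + beta) for i in range(dataset_size)]
    total = sum(weights)
    return [w / total for w in weights]


def mixture(probs, uniform_weight):
    """Blend a distribution with the uniform one so no index has zero mass."""
    n = len(probs)
    return [(1.0 - uniform_weight) * p + uniform_weight / n for p in probs]


def sample_schedule(probs, num_samples, seed):
    """Perturbation: each worker samples with its own random seed."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=num_samples)


top_schedule = [0, 0, 0, 0, 0, 0, 1, 1, 2]
raw = estimate_distribution(top_schedule, dataset_size=5)
smoothed = log_smooth(top_schedule, dataset_size=5)
mixed = mixture(smoothed, uniform_weight=0.2)
worker_schedules = [sample_schedule(mixed, 9, seed=s) for s in range(3)]
```

In this toy case the logarithm lowers the share of the most frequent index, while indices that never appeared still have zero probability until the uniform mixture is applied.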

| Network | Workers | Interval | Exploration type | Top-1 (%) |
|---|---|---|---|---|
| ResNet-18 (Zhang et al., 2019) | - | - | - | 78.34 |
| ResNet-18 | 1 | - | Uniform | 78.46 |
| ResNet-18 | 20 | 80 batches | Random | 78.76 |
| ResNet-18 | 20 | 20 batches | Random | 78.99 |
| ResNet-18 | 80 | 20 batches | Random | 79.09 |
| ResNet-18 | 20 | 20 batches | Mixture | 79.44 |
| ResNet-50 (Jin et al., 2019) | - | - | - | 79.34 |
| ResNet-50 | 1 | - | Uniform | 79.70 |
| ResNet-50 | 20 | 80 batches | Random | 80.55 |
| ResNet-50 | 20 | 20 batches | Random | 81.05 |
| ResNet-50 | 80 | 20 batches | Random | 81.19 |
| ResNet-50 | 20 | 20 batches | Mixture | 81.53 |
| DenseNet-121 | 1 | - | Uniform | 80.13 |
| DenseNet-121 | 20 | 80 batches | Random | 80.62 |
| DenseNet-121 | 20 | 20 batches | Random | 81.11 |
| DenseNet-121 | 80 | 20 batches | Random | 81.08 |
| DenseNet-121 | 20 | 20 batches | Mixture | 80.97 |

Table 1: Performance on CIFAR-100 using different configurations of AutoSampling and baselines. Workers is the number of workers used, equivalent to the population size. Interval is the exploitation interval in terms of batches or equivalent training iterations.

| Network | Exploration type | Top-1 (%) |
|---|---|---|
| ResNet-18 | Uniform | 93.01 |
| ResNet-18 | Random | 95.86 |
| ResNet-18 | Mixture | 95.80 |
| ResNet-50 | Uniform | 93.60 |
| ResNet-50 | Random | 96.10 |
| ResNet-50 | Mixture | 96.09 |

Table 2: Experiments on CIFAR-10.

| Network | Exploration type | Top-1 (%) |
|---|---|---|
| ResNet-18 | Uniform | 70.38 |
| ResNet-18 | Random | 72.07 |
| ResNet-18 | Mixture | 72.91 |
| ResNet-34 | Uniform | 74.09 |
| ResNet-34 | Random | 76.11 |
| ResNet-34 | Mixture | 76.92 |

Table 3: Experiments on ImageNet.

| Network | Uniform | Static | Dynamic |
|---|---|---|---|
| ResNet-18 | 78.46 | 78.80 | 79.44 |
| ResNet-50 | 79.70 | 80.21 | 81.53 |

Table 4: Static vs Dynamic sampling schedule on CIFAR-100 (%). Columns give the sampling type.

5 Experiments

In this section, we present comprehensive experiments on various datasets to illustrate the performance of AutoSampling, and also demonstrate the process of progressively learning better sampling distribution.

5.1 Implementation Details

Experiments on CIFAR We use the same training configuration for both the CIFAR-100 and CIFAR-10 datasets, each of which consists of 50000 training images. In particular, for model training we use a base learning rate of 0.1 and a step decay learning rate schedule in which the learning rate is divided by 10 after every 60 epochs. We run the experiments for 240 epochs. In addition, we set the training batch size to 128 per worker, and each worker uses one Nvidia V100 GPU card.
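The step decay schedule just described can be written as a small function. This sketch assumes zero-indexed epochs and that the first division happens at epoch 60; the function and parameter names are illustrative.

```python
def step_decay_lr(epoch, base_lr=0.1, decay_every=60, factor=10.0):
    """Learning rate for a zero-indexed epoch: divide by `factor` every `decay_every` epochs."""
    return base_lr / (factor ** (epoch // decay_every))


# Learning rate at the start of each of the four 60-epoch stages in a 240-epoch run.
stage_lrs = [step_decay_lr(epoch) for epoch in (0, 60, 120, 180)]
```

Over 240 epochs this gives four stages, with the learning rate shrinking by a factor of 10 from one stage to the next.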

We run the explore step for each epochs with , but note that we take the first explore step after the initial epochs to better accumulate enough rewards. The experiments require 4800 epochs of training for 20 workers, and roughly 14 hours of training time.

Experiments on ImageNet For ImageNet which consists of 1.28 million training images, we adopted the base learning rate of 0.2 and a cosine decay learning rate schedule. We run the experiments with 100 epochs of training. For each worker we utilize eight Nvidia V100 GPU cards and a total batch size of 512. Eight workers are used for all ImageNet experiments, and the rest of the setting adheres to that of CIFAR experiments. In addition, we utilize FP16 computation to achieve faster training, which has almost no drop in accuracy in practice. The experiments require 800 epochs of training for 8 workers, and roughly 4 days of training time.
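The training budgets quoted for the two setups follow from multiplying the epochs per run by the number of workers. A quick check, in which the helper name is an illustrative assumption:

```python
def total_training_epochs(epochs_per_run, num_workers):
    """Total epochs trained across the whole population."""
    return epochs_per_run * num_workers


cifar_total = total_training_epochs(epochs_per_run=240, num_workers=20)
imagenet_total = total_training_epochs(epochs_per_run=100, num_workers=8)
imagenet_gpus = 8 * 8  # eight workers, each using eight GPU cards
```

The two totals match the 4800 and 800 epochs stated in the text, and the ImageNet setup occupies 64 GPU cards in total.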

Figure 2: The comparison between histograms estimated from the sampling schedules of Epoch 80, 160 and 240 from CIFAR-100 with ResNet-18. We divide the 50000 training images into 500 segments of 100 images, and calculate the histograms of total data counts of all segments. We reorder the x-axis based on the ranking of data counts for epoch 240 for easier comparison.

| Network | Uniform | ResNet-18 | ResNet-50 | DenseNet-121 |
|---|---|---|---|---|
| ResNet-50 | 79.70 | 80.27 | 80.21 | 80.47 |

Table 5: Transfer of sampling distributions learned by three model structures to ResNet-50 on CIFAR-100 (%). Columns give the sampling schedule source; Uniform denotes the baseline result using the uniform sampling distribution.

5.2 Ablation study

For this part, we gradually build up and test components of AutoSampling on CIFAR-100, and then examine their performances on CIFAR-10 and ImageNet datasets.

Exploration types We first introduce the three exploration types examined, corresponding to one baseline and two variants of AutoSampling.

  • Uniform Exploration corresponds to regular model training with mini-batches uniformly sampled from the dataset.

  • Random Exploration adds the multi-exploitation step on top of the uniform exploration. In particular, the random exploration method conducts the multi-exploitation step (Sec. 4.1) among several workers. In between multi-exploitation steps, the random exploration generates the subsequent sampling schedules for each worker simply through uniform sampling. Note that the random exploration with one worker is equivalent to the uniform exploration.

  • Mixture Exploration adds the sampling distribution search (4.2) upon the random exploration in between multi-exploitation steps, completing the AutoSampling method.

Adding Workers To look into the influence of the number of workers, we conduct experiments using 1, 20 and 80 workers respectively under the same setting (with random exploration). With one worker, the experiment is simply normal model training using stochastic gradient descent. To show the competitiveness of our baselines, we also include recent state-of-the-art results on CIFAR-100 with ResNet-18 and ResNet-50 (Zhang et al., 2019; Jin et al., 2019). We notice a significant performance gain using 20 workers for ResNet-18, ResNet-50 and DenseNet-121 (He et al., 2015; Huang et al., 2017), as illustrated in Table 1. However, increasing the number of workers from 20 to 80 brings at most marginal performance gains across the model structures, as shown in Table 1. Therefore, we set the number of workers to 20 for the rest of the experiments.

Shortening Exploitation Intervals To study the effects of the shortened exploitation interval, we run experiments using exploitation intervals of 20 and 80 batches (iterations) respectively. As shown in Table 1, models with the shorter exploitation interval of 20 batches (iterations) perform better than those with the longer exploitation interval across all three network structures, conforming to our assumption that the reward collected reflects the value of each data sample used in the exploitation interval. This result agrees with our intuition that a shorter exploitation interval lets the sampler accumulate more rewards and thus learn better sampling schedules. For the rest of this section we keep the exploitation interval of 20.

Adding Exploration Type We further add mixture as the exploration type to see the effects of learning the underlying sampling distribution, which completes the proposed method. As shown in Table 1, with ResNet-18 and ResNet-50 the mixture exploration pushes performance higher, and outperforms the uniform baseline by about 1 and 1.8 percentage points on CIFAR-100 respectively. However, this does not hold for DenseNet-121, where the mixture exploration (80.97) falls slightly below the random exploration (81.11); this may be attributed to the larger capacity of DenseNet-121.

Generalization Over Datasets In addition, we experiment on other datasets. We report the results on CIFAR-10 in Table 2 and the results of ResNet-18 and ResNet-34 on ImageNet in Table 3. For CIFAR-10, we notice that the mixture and random exploration methods are comparable, while both outperform the uniform baseline; we believe this is due to the simplicity of the dataset. On the more challenging ImageNet, the mixture exploration outperforms the random exploration by a clear margin. We also compare our AutoSampling with some recent non-uniform sampling methods on CIFAR-100, which can be found in Sec. 5.6.

5.3 Static vs Dynamic Schedules

We aim to see whether the final sampling distribution estimated by our AutoSampling is sufficient to produce robust sampling schedules. In other words, we wish to know whether training with AutoSampling is a process of learning a robust sampling distribution, or a process of dynamically adjusting the sampling schedule for optimal training. To this end, we conduct training using different sampling schedules. First, we calculate the sampling distribution estimated throughout the learning steps of AutoSampling, and use it to generate the sampling schedule of a full training process, which we denote as static. Moreover, we refer to the sampling schedule learned using AutoSampling as dynamic, since AutoSampling dynamically adjusts the sampling schedule alongside the training process. Finally, we denote the baseline method as uniform, which uses the sampling schedule generated from the uniform distribution.

We report results on CIFAR-100 with ResNet-18 and ResNet-50 in Table 4. Models trained with static sampling schedules exceed the uniform baseline significantly, indicating the superiority of the learned sampling distribution over the uniform distribution. This shows the ability of AutoSampling to learn a good sampling distribution. Nonetheless, note that models trained with dynamic sampling schedules outperform models trained with static ones, by a margin bigger than the one between static and uniform. This result shows that despite AutoSampling's capability of learning a good sampling distribution, its flexibility during training matters even more. Moreover, this phenomenon also indicates that models at different stages of the learning process may require different sampling distributions to achieve optimal training. One single sampling distribution, even gradually estimated using AutoSampling, seems incapable of covering the needs of different learning stages. We plot the histograms of data counts in training estimated from schedules of different learning stages with ResNet-18 on CIFAR-100 in Fig. 2, showing the great differences between optimized sampling distributions from different epochs.
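The segment histograms of Fig. 2 can be computed along the following lines. This is a sketch under assumptions: segments are taken to be contiguous index ranges, the schedule is a list of dataset indices, and the function names are hypothetical.

```python
from collections import Counter


def segment_counts(schedule, dataset_size, segment_size):
    """Total number of appearances of the images in each contiguous segment."""
    counts = Counter(schedule)
    return [
        sum(counts[i] for i in range(start, min(start + segment_size, dataset_size)))
        for start in range(0, dataset_size, segment_size)
    ]


def reorder_by_reference(hist, reference):
    """Reorder segments by descending count in a reference histogram."""
    order = sorted(range(len(reference)), key=lambda k: reference[k], reverse=True)
    return [hist[k] for k in order]


toy_hist = segment_counts([0, 1, 1, 5, 9, 9, 9], dataset_size=10, segment_size=5)
```

With 50000 images and a segment size of 100 this yields the 500 bins described in the caption, and `reorder_by_reference` applies the same bin ordering (taken from one epoch) to the histograms of the other epochs.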

| Methods | Network | Baseline (%) | With method (%) | Improvement (%) |
|---|---|---|---|---|
| DLIS | WRN-28-2 | 66.0 | 68.0 | 2.0 |
| AutoSampling (ours) | WRN-28-2 | 73.37 | 76.24 | 2.87 |
| RAIS | ResNet-18 | 76.4 | 76.4 | 0.0 |
| AutoSampling (ours) | ResNet-18 | 78.46 | 79.44 | 0.98 |

Table 6: Comparisons among AutoSampling and existing sampling methods on CIFAR-100

5.4 Analyzing sampling schedules learned by AutoSampling

To further investigate the sampling schedule learned by AutoSampling, we review the images at the tail and head of the sampling spectrum. In particular, given a learned sampling schedule, we rank all images based on their number of appearances in the training process. Training images at the top and bottom of the ranking are extracted, corresponding to high and low probabilities of being sampled respectively. In Fig. 3, we show 4 classes of exemplary images. Conforming to our presumption, the sampling probability seems to indicate the difficulty of each training image. The images of low probability tend to have clearer visual features enabling easy recognition, while the images of high probability tend to be more obscure. This result indicates that the sampling schedule learned by AutoSampling may possess some hard-sample mining ability.
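The ranking used to pick head and tail images can be sketched as follows, assuming the schedule is a list of dataset indices; the function name and the toy data are illustrative only.

```python
from collections import Counter


def rank_by_frequency(schedule, dataset_size):
    """Dataset indices ordered from most to least frequently sampled."""
    counts = Counter(schedule)
    return sorted(range(dataset_size), key=lambda i: counts[i], reverse=True)


ranking = rank_by_frequency([2, 2, 2, 0, 0, 1], dataset_size=4)
head, tail = ranking[:1], ranking[-1:]
```

The first entries of the ranking correspond to images with a high probability of being sampled, and the last entries to images that are rarely or never sampled.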

We also compare the sampling frequency of each training image with its loss values at different training epochs on CIFAR-100. As shown in Fig. 4, across different learning stages the correlation between loss values and sampling frequencies of training data is not strong. A high chance of being sampled by AutoSampling does not necessarily come with high loss values, which demonstrates that AutoSampling is not merely over-sampling difficult samples as indicated by the loss, and therefore has more potential beyond simple visually hard example mining. The resulting sampling schedule learned by AutoSampling might be significantly different from the one guided by loss.
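The comparison in Fig. 4 is visual. One hypothetical way to quantify it, which is not part of the reported experiments, is a Pearson correlation between per-image sampling frequencies and loss values, sketched here in plain Python.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

A value near zero for the frequencies and losses of the selected images would be consistent with the weak correlation described above, whereas a value near one would suggest the schedule mostly tracks the loss.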

In addition, we notice that the images of low probability also include low-quality images. For instance, in Fig. 3 the leftmost image of the Camel class contains only legs. This shows that AutoSampling may potentially rule out problematic training data for better training.

Furthermore, we examine how well the sampling distributions learned by AutoSampling transfer to other network structures. Specifically, we train ResNet-50 (He et al., 2015) using static sampling schedules generated from three distributions learned by AutoSampling on 3 different models. As shown in Table 5, sampling schedules learned by AutoSampling on other models yield similar improvements over the uniform baseline. This generalization across model structures, combined with the above observations on images of different sampling probability, indicates that there may exist a common optimal sampling schedule determined by the intrinsic properties of the data rather than the model being optimized. Our AutoSampling is an effort to gradually converge to such an optimal schedule.

Figure 3: Example images at the head and tail of the sampling spectrum. The images on the left are the ones with low sampling probability, while the images on the right are more likely to be sampled. We obtain these images using AutoSampling with the ResNet-18 model on CIFAR-100.

Figure 4: Comparison between the sampling frequency of each training image and its loss values at Epochs 80, 160 and 240 on CIFAR-100 with ResNet-18. We randomly select 500 training images and calculate their sampling frequencies and loss values. The x-axis shows the indexes of the 500 training images, while the left y-axis denotes loss values and the right y-axis denotes sampling frequency. The blue line represents the sampling frequencies and the red lines represent the loss values of all 500 images. As the figure shows, the two are not obviously correlated.

5.5 Discussions

The experimental results and observations from Sections 5.3 and 5.4 shed light on the possible existence of an optimal sampling schedule that relies only on the intrinsic properties of the data and the learning stage of the model, regardless of the specific model structure or any human prior knowledge. Compared to related works, the AutoSampling method provides more rewards during the search process, which leads to sufficient convergence towards the desired sampling schedule. Once obtained, the desired sampling schedule may also generalize to other model structures for robust training, as shown in Table 5. Although AutoSampling requires a relatively large amount of computing resources to find a robust sampler, we want to point out that the efficiency of our method can be improved through better training techniques. Moreover, the possibility of an optimal sampling schedule relying solely on the data themselves suggests that more efficient sampling policy search algorithms may exist, if one can quickly and effectively determine the value of each data point from its properties.

5.6 Comparison with existing sampling methods

To better illustrate the effectiveness of our AutoSampling method, we conduct experiments in comparison with the recent non-uniform sampling methods DLIS (Katharopoulos and Fleuret, 2018) and RAIS (Johnson and Guestrin, 2018). DLIS achieves faster convergence by selecting data that reduce gradient norm variance, while RAIS does so by approximating the ideal sampling distribution using robust optimization. The comparison is recorded in Table 6.

First, we run AutoSampling using Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) on CIFAR-100 with the training setting roughly aligned to that of Katharopoulos and Fleuret (2018). AutoSampling achieves an improvement of roughly 3 percentage points (73.37% → 76.24%), while Katharopoulos and Fleuret show an improvement of 2 percentage points (66.0% → 68.0%). Second, we compare AutoSampling with RAIS on CIFAR-100. Johnson and Guestrin show no improvement in accuracy (76.4% → 76.4%) and a 0.027 decrease in validation loss (0.989 → 0.962), while our method shows an improvement of 0.8 percentage points in accuracy (78.6% → 79.4%) and a 0.014 decrease in validation loss (0.886 → 0.872). As such, our method demonstrates improvements over existing non-uniform sampling methods.

6 Conclusions

In this paper, we introduce a new search-based AutoSampling scheme that overcomes the issue of insufficient rewards when optimizing high-dimensional sampling hyper-parameters by using a shorter period of reward collection. In particular, we leverage the population-based training framework (Jaderberg et al., 2017). We use a shortened exploitation interval to search in the local data space and provide sufficient rewards. For the exploration step, we estimate the sampling distribution from the searched sampling schedule and perturb it to search in the distribution space. We test our method on the CIFAR-10/100 and ImageNet datasets (Krizhevsky, 2009; Deng et al., 2009) with different networks and show that it consistently outperforms the baseline methods across different benchmarks.

References


  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 41–48. External Links: ISBN 9781605585161, Link, Document Cited by: §1, §2.1.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, pp. 281–305. External Links: ISSN 1532-4435 Cited by: §1, §2.1.
  • J. Byrd and Z. C. Lipton (2018) Weighted risk minimization & deep learning. CoRR abs/1812.03372. External Links: Link, 1812.03372 Cited by: §2.1.
  • E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2018) AutoAugment: learning augmentation policies from data. CoRR abs/1805.09501. External Links: Link, 1805.09501 Cited by: §1, §2.1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §1, §1, §6.
  • C. Drummond, R. C. Holte, et al. (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, Vol. 11, pp. 1–8. Cited by: §2.1.
  • A. Estabrooks, T. Jo, and N. Japkowicz (2004) A multiple resampling method for learning from imbalanced data sets. Computational intelligence 20 (1), pp. 18–36. Cited by: §1, §2.1.
  • Y. Fan, F. Tian, T. Qin, J. Bian, and T. Liu (2017) Learning what data to learn. CoRR abs/1702.08635. External Links: Link, 1702.08635 Cited by: §1, §1, §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §5.2, §5.4.
  • D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen (2019) Population based augmentation: efficient learning of augmentation policy schedules. CoRR abs/1905.05393. External Links: Link, 1905.05393 Cited by: §1, §2.2.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.2.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu (2017) Population based training of neural networks. CoRR abs/1711.09846. External Links: Link, 1711.09846 Cited by: §1, §2.2, §2.2, §6.
  • L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2017) MentorNet: regularizing very deep neural networks on corrupted labels. CoRR abs/1712.05055. External Links: Link, 1712.05055 Cited by: §1, §2.1.
  • X. Jin, B. Peng, Y. Wu, Y. Liu, J. Liu, D. Liang, J. Yan, and X. Hu (2019) Knowledge distillation via route constrained optimization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 1, §5.2.
  • T. B. Johnson and C. Guestrin (2018) Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7265–7275. External Links: Link Cited by: §1, §2.1, §5.6, §5.6.
  • J. Lorraine, P. Vicol, and D. Duvenaud (2019) Optimizing millions of hyperparameters by implicit differentiation. Proceedings of AISTATS. Cited by: §2.1.
  • A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. CoRR abs/1803.00942. External Links: Link, 1803.00942 Cited by: §1, §2.1, §5.6, §5.6.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §1, §6.
  • Z. Li, Y. Wu, K. Chen, Y. Wu, S. Zhou, J. Liu, and J. Yan (2019) LAW: learning to auto weight. CoRR abs/1905.11058. External Links: Link, 1905.11058 Cited by: §1, §1, §2.1.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §2.1.
  • I. Loshchilov and F. Hutter (2015) Online batch selection for faster training of neural networks. CoRR abs/1511.06343. External Links: Link, 1511.06343 Cited by: §1, §2.1.
  • M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. Proceedings of ICLR. CoRR abs/1903.03088. External Links: Link, 1903.03088 Cited by: §2.1.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. CoRR abs/1803.09050. External Links: Link, 1803.09050 Cited by: §2.1.
  • A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2171–2180. External Links: Link Cited by: §1, §2.1.
  • Y. Wang, W. Gan, J. Yang, W. Wu, and J. Yan (2019) Dynamic curriculum learning for imbalanced data classification. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1.
  • G. M. Weiss, K. McCarthy, and B. Zabar (2007) Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs?. Dmin 7 (35-41), pp. 24. Cited by: §1, §2.1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. CoRR abs/1605.07146. External Links: Link, 1605.07146 Cited by: §5.6.
  • M. R. Zhang, J. Lucas, G. E. Hinton, and J. Ba (2019) Lookahead optimizer: k steps forward, 1 step back. CoRR abs/1907.08610. External Links: Link, 1907.08610 Cited by: Table 1, §5.2.
  • X. Zhang, Q. Wang, J. Zhang, and Z. Zhong (2020) Adversarial autoaugment. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.1.