
How Important is Importance Sampling for Deep Budgeted Training?

10/27/2021
by   Eric Arazo, et al.

Long iterative training processes for Deep Neural Networks (DNNs) are commonly required to achieve state-of-the-art performance in many computer vision tasks. Importance sampling approaches might play a key role in budgeted training regimes, i.e. when limiting the number of training iterations. These approaches aim at dynamically estimating the importance of each sample to focus on the most relevant ones and speed up convergence. This work explores this paradigm and how a budget constraint interacts with importance sampling approaches and data augmentation techniques. We show that under budget restrictions, importance sampling approaches do not provide a consistent improvement over uniform sampling. We suggest that, given a specific budget, the best course of action is to disregard the importance and introduce adequate data augmentation; e.g. when reducing the budget to 30% of a full training, adequate data augmentation maintains accuracy, while importance sampling does not. We conclude from our work that DNNs under budget restrictions benefit greatly from variety in the training set and that finding the right samples to train on is not the most effective strategy when balancing high performance with low computational requirements. Source code available at https://git.io/JKHa3 .


1 Introduction

The availability of vast amounts of labeled data is crucial when training deep neural networks (DNNs) [2018_ECCV_limits_weakly, 2020_CVPR_improveImagenet]. Despite prompting considerable advances in many computer vision tasks [2018_ECCV_captioning, 2019_CVPR_pose_Estimation], this dependence poses two challenges: the generation of the datasets and the large computation requirements that arise as a result. Research addressing the former has experienced great progress in recent years via novel techniques that reduce the strong supervision required to achieve top results [2019_ICML_efficientnet, 2019_NeurIPS_fixing] by, e.g., improving semi-supervised learning [2019_NIPS_MixMatch, 2020_IJCNN_pseudolab], self-supervised learning [CVPR_2020_moco, CVPR_2020_InvRepresentations], or training with noisy web labels [2019_ICML_BynamicBootstrapping, 2020_ICLR_DivideMix]. The latter challenge has also experienced many advances from the side of network efficiency via DNN compression [2018_ICML_compressing, 2019_CVPR_compressGAN], neural architecture search [2019_ICML_efficientnet, 2019_ICLR_proxylessnas], or parameter quantization [2016_ECCV_xnor, 2018_CVPR_quantization]. All these approaches are designed with a common constraint: a large dataset is needed to achieve top results [2020_CVPR_improveImagenet]. This conditions the success of the training process on the available computational resources. Conversely, a smart reduction of the amount of samples used during training can alleviate this constraint [2018_ICML_notAllSamples, 2020_ICML_coreSet].

The selection of samples plays an important role in the optimization of DNN parameters during training, where Stochastic Gradient Descent (SGD) [2012_NeurIPS_sgd, 2018_SIAM_optim] is often used. SGD guides the parameter updates using the estimation of model error gradients over sets of samples (mini-batches) that are uniformly randomly selected in an iterative fashion. This strategy assumes equal importance across samples, whereas other works suggest that alternative strategies for revisiting samples are more effective in achieving better performance [2017_NeurIPS_activeBias, 2020_AISTATS_osgd] and faster convergence [2018_ICML_notAllSamples, 2019_Arxiv_selective]. Similarly, the selection of a unique and informative subset of samples (core-set) [2018_ICLR_forget, 2020_ICLR_proxySelection] can reduce the computation requirements during training, while limiting the performance drop with respect to training on all data. However, although removing data samples speeds up training, precise sample selection often requires a pretraining stage that works against the intended reduction in computation [2020_ICML_coreSet, 2018_ICLR_activelearning].

A possible solution to this limitation might be to dynamically change the important subset during training, as is done by importance sampling methods [2017_CEMNLP_repeat, 2019_NeurIPS_autoAssist], which select the samples based on a sampling distribution that evolves with the model and often depends on the loss or network logits [2015_ICLRw_online, 2018_NeurIPS_RAIS]. An up-to-date sample importance estimation is key for current methods to succeed but, in practice, is infeasible to compute [2018_ICML_notAllSamples]. The importance of a sample changes after each iteration and estimations become outdated, yielding considerable performance drops [2017_NeurIPS_activeBias, 2019_NeurIPS_autoAssist]. Importance sampling methods focus on training with the most relevant samples and achieve a convergence speed-up as a side effect. They do not, however, strictly study the benefits on DNN training when restricting the number of training iterations, i.e. the budget.

Budgeted training [2017_NeurIPS_adaptive_budget, 2019_ICML_opportunistic_budget, 2020_ICLR_budget] imposes an additional constraint on the optimization of a DNN: a maximum number of iterations. Defining this budget provides a concise notion of the limited training resources. Li et al. [2020_ICLR_budget] propose to address the budget limitation using specific learning rate schedules that better suit this scenario. Despite the standardized scenario that budgeted training poses for evaluating methods that reduce computation requirements, there are few works to date in this direction [2020_ICLR_budget, 2018_ICML_notAllSamples]. As mentioned, importance sampling methods are closely related, but the lack of exploration of different budget restrictions makes these approaches less applicable: the sensitivity to hyperparameters that they often exhibit limits their generalization [2017_NeurIPS_activeBias, 2015_ICLRw_online].

In this paper, we overcome the limitations outlined above by analyzing the effectiveness of importance sampling methods when a budget restriction is imposed [2020_ICLR_budget]. Given a budget restriction, we study synergies among importance sampling and data augmentation [2018_ACML_RICAP, 2020_CVPRw_randAugm, 2018_ICLR_mixup]. We find that the improvements of importance sampling approaches over uniform random sampling are not always consistent across budgets and datasets. We argue and experimentally confirm (see Section 4.4) that when using certain data augmentation strategies [2018_ACML_RICAP, 2020_CVPRw_randAugm, 2018_ICLR_mixup], existing importance sampling techniques do not provide further benefits, making data augmentation the most effective strategy to exploit a given budget.

2 Related work

Few works exploit a budgeted training paradigm [2020_ICLR_budget]. Instead, many aim to speed up convergence to a given performance using a better sampling strategy or carefully organizing the samples to allow the model to learn faster and generalize better. Others explore how to improve model performance by labeling the most important samples from an unlabeled set [2019_CVPR_learning_loss_for_active_learning, 2020_ICLR_deepactivelearning, 2020_ArXiv_active_learning_survey] or how to better train DNNs when limited samples per class are available [2020_CVPR_fewshot_base, 2021_IJCNN_relab, 2019_ICLR_closer_look_to_activeLearning]. None of these works, however, explore the efficiency of these approaches when trained under constraints in the number of iterations allowed, i.e. budgeted training. This section reviews relevant works that aim to improve the computational efficiency of training DNNs.

Curriculum learning (CL) aims to improve model performance by ordering the samples from easy to difficult [2018_ICML_CLTransfLearn, 2009_ICML_CL, 2019_ICML_PowerOfCL, 2019_CVPR_L2G]. Like importance sampling approaches, CL leverages different samples at different stages of training. However, while CL prioritizes easy samples at the beginning of training and includes all of them at the end, importance sampling prioritizes the most difficult subset of samples at each stage of the training. The main drawback of CL is that, in most cases, the order of the samples (curriculum) has to be defined before training, which is already a costly task that requires manually assessing the sample difficulty or transferring knowledge from a pre-trained model. Some approaches remedy this with a simple curriculum [2017_ICCV_focal_loss] or by learning it during training [2018_ICML_mentorNet]; these methods, however, do not aim to speed up training by ordering the samples, but to improve convergence by weighting the sample contribution to the loss.

Core-set selection approaches aim to find the subset of samples that is most useful [2018_ICLR_forget, 2020_ICLR_proxySelection, 2020_ICML_coreSet] and maintain accuracy despite training on a fraction of the data. The ability of these methods to reduce training cost relies on using smaller training sets, but the benefit is limited since they require a pre-training stage with the full dataset. They do, however, demonstrate that DNNs can achieve peak performance with a fraction of the full dataset. Some approaches select the samples most often forgotten by the network [2018_ICLR_forget], the samples nearest to cluster centroids built from model features [2020_ICML_coreSet], or use a smaller pretrained model to pick the most informative samples [2020_ICLR_proxySelection].

Importance sampling approaches lie in the middle ground between the previous two: they aim to speed up training convergence by leveraging the most useful samples at each training stage [2018_ICML_notAllSamples, 2019_Arxiv_selective, 2019_NeurIPS_autoAssist], which correspond to those with the highest loss gradient magnitude [2014_NeurIPS_sgdweight, 2015_ICML_stochasticIS, 2016_ICLR_varianceRed]. Johnson and Guestrin [2018_NeurIPS_RAIS] have shown that the last layer gradients are a good approximation and are easier to obtain in deep learning frameworks. Alternative importance measures include the loss [2019_Arxiv_selective], the probability predicted for the true class [2017_NeurIPS_activeBias], or the rank order of these probabilities [2015_ICLRw_online].

The approximation of the optimal distribution by importance sampling approaches avoids the cost of computing the importance of each sample at every iteration. However, this distribution changes very rapidly between iterations, leading to outdated estimations. Initial attempts at addressing this included using several hyper-parameters to smooth the estimated distribution [2017_NeurIPS_activeBias], more frequent distribution updates via additional forward passes [2015_ICLRw_online], or alternative measures to estimate the sampling distribution [2017_CEMNLP_repeat]. Several works added complex support techniques to the training to estimate a better distribution: using robust optimization [2018_NeurIPS_RAIS], introducing repulsive point techniques [2019_AAAI_repulsive], or adding a second network [2019_NeurIPS_autoAssist].

More recent methods leverage the random-then-greedy technique [2018_Arxiv_randGradBoost], where the probabilities of an initial random batch of samples are computed and then used to select a batch for training. Within this scheme, [2018_ICML_notAllSamples] define a theoretical bound for the magnitude of the gradients that allows for faster computation of the sampling probabilities, and [2019_Arxiv_selective, 2019_ICAAI_biasedSampling] use the loss as a measure of sample importance to keep the sampling distribution updated through the training. Finally, Kawaguchi and Lu [2020_AISTATS_osgd] introduce the top-$k$ loss [2017_NeurIPS_topKLoss] to perform the back-propagation step using only the samples with the highest losses. Note that these methods do a full forward pass every epoch to update the sampling probabilities.

Learning rate schedules have proven to be useful alternatives for faster convergence. In particular, [2019_ISOP_superConv, 2017_WACV_cyclical] propose a cyclic learning rate schedule to reach faster convergence by using larger learning rates at intermediate training stages and very low rates at final stages. Similarly, Li et al. [2020_ICLR_budget] explore budgeted training and propose a linearly decaying learning rate schedule that approaches zero at the end of the training, which, without additional hyper-parameters, provides better convergence than standard learning rate schedules. These approaches, however, do not explore sample selection techniques to further increase convergence speed.

3 Budgeted training

This section formally introduces budgeted training and the different importance sampling methods used through the paper to explore the efficiency of these approaches under budget restrictions. The standard way of training DNNs is by gradient-based minimization of the cross-entropy loss

$\ell(\theta) = -\frac{1}{N}\sum_{i=1}^{N} y_i^{\top}\log\big(h_{\theta}(x_i)\big)$,   (1)

where $N$ is the number of samples in the dataset, $y_i \in \{0,1\}^{C}$ is the one-hot encoded ground-truth label for sample $x_i$, $C$ is the number of classes, $h_{\theta}(x_i)$ is the predicted posterior probability of the DNN model given $x_i$ (i.e. the prediction after softmax normalization), and $\theta$ are the parameters of the model. Convergence to a reasonable level of performance usually determines the end of the training, whereas in budgeted training there is a fixed iteration budget. We adopt the setting defined by [2020_ICLR_budget], where the budget is defined as a percentage of the full training setup. Formally, we define the budget $B \in [0,1]$ as the fraction of forward and backward passes used for training the model with respect to a standard full training of $E$ epochs. As we aim at analyzing importance sampling, the budget restriction will mainly be applied to the amount of data seen every epoch. However, a reduction of the number of epochs to $B \cdot E$ (where an epoch is a pass over all $N$ samples) is also considered, as a truncated-training form of budgeted training.
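As a concrete illustration of this budget bookkeeping, the short sketch below (our own illustration, not the authors' released code) computes the two ways of spending a budget $B$ described above: fewer samples per epoch, or fewer epochs.

```python
# Minimal sketch of the budget definition above (not the authors' code).
# B is the fraction of forward/backward passes relative to a full training
# of E epochs over N samples.

def budgeted_schedule(N: int, E: int, B: float):
    """Return (samples_per_epoch, truncated_epochs) for the two ways of spending
    a budget B: subsample B*N samples per epoch and keep E epochs, or keep all
    N samples per epoch and truncate training to B*E epochs."""
    samples_per_epoch = int(round(B * N))   # budget spent on data seen per epoch
    truncated_epochs = int(round(B * E))    # budget spent on number of epochs
    return samples_per_epoch, truncated_epochs

if __name__ == "__main__":
    # Example with CIFAR-10-like sizes: 50K samples, 200 epochs, 30% budget.
    print(budgeted_schedule(N=50_000, E=200, B=0.3))  # -> (15000, 60)
```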

Truncated training is the simplest approach to budgeted training: keep the standard SGD optimization and reduce the number of epochs trained by the model to $B \cdot E$. We call this strategy, where the model sees all the samples every epoch, scan-SGD. While seeing all the samples is common practice, we remove this constraint and draw the samples from a uniform probability distribution at every iteration; we call this strategy unif-SGD. In this approach the budget is defined by randomly selecting $B \cdot N$ samples every epoch (and still training for $E$ epochs).
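A minimal PyTorch-style sketch of unif-SGD follows (our own illustration, not the released implementation): each epoch, a fresh subset of $B \cdot N$ indices is drawn uniformly at random and only those samples are used.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Sketch of unif-SGD sampling (illustrative, not the authors' released code):
# every epoch a new random subset of size B*N is drawn with uniform probability.

def unif_sgd_loader(dataset, budget: float, batch_size: int = 128) -> DataLoader:
    n = len(dataset)
    n_subset = int(budget * n)
    # Uniform sampling without replacement; a different subset is drawn each epoch.
    idx = np.random.choice(n, size=n_subset, replace=False)
    return DataLoader(Subset(dataset, idx.tolist()), batch_size=batch_size, shuffle=True)

if __name__ == "__main__":
    # Toy dataset standing in for CIFAR-10 tensors.
    data = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
    for epoch in range(3):                      # still training for the full E epochs
        loader = unif_sgd_loader(data, budget=0.3)
        for x, y in loader:
            pass                                # forward/backward passes would go here
```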

Importance sampling aims to accelerate the convergence of SGD by sampling the most difficult samples more often, drawing $B \cdot N < N$ samples (the number of samples selected given a certain budget). Loshchilov and Hutter [2015_ICLRw_online] proposed a simple approach for importance sampling that uses the loss of every sample as a measure of the sample importance. Chang et al. [2017_NeurIPS_activeBias] adapt this approach to avoid additional forward passes by using as importance:

$p_i = 1 - \bar{h}_{y_i}(x_i) + \epsilon$, with $\bar{h}_{y_i}(x_i) = \frac{1}{T}\sum_{t=1}^{T} h^{t}_{y_i}(x_i)$,   (2)

where $h^{t}_{y_i}(x_i)$ is the prediction of the model for the ground-truth class of sample $x_i$ in epoch $t$, and $T$ is the current epoch. Therefore, the average predicted probability across previous epochs associated to the ground-truth class of each sample defines the importance $p_i$ of sample $x_i$. The smoothing constant $\epsilon$ is defined as the mean per-sample importance up to the current epoch: $\epsilon = \frac{1}{N}\sum_{i=1}^{N}\big(1 - \bar{h}_{y_i}(x_i)\big)$. The sampling distribution at a particular epoch is then given by:

$P(i) = \frac{p_i}{\sum_{j=1}^{N} p_j}$.   (3)

By drawing samples from the distribution $P$ this approach biases the training towards the most difficult samples, i.e. it selects samples with the highest loss values; we name this method p-SGD. Similarly, Chang et al. [2017_NeurIPS_activeBias] propose to select those samples that are closer to the decision boundaries and favor samples with higher uncertainty by defining the importance measure as $p_i = \bar{h}_{y_i}(x_i)\big(1 - \bar{h}_{y_i}(x_i)\big) + \epsilon$; we name this approach c-SGD.
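The numpy sketch below illustrates how such an importance-based sampling distribution can be built from the running average of true-class probabilities observed during training. It is our reading of the p-SGD/c-SGD scheme, not the released code; the exact smoothing used there may differ.

```python
import numpy as np

# Illustrative sketch of p-SGD / c-SGD style sampling (our reading of the method;
# the exact smoothing in the original implementation may differ).
class ImportanceSampler:
    def __init__(self, num_samples: int):
        self.sum_true_prob = np.zeros(num_samples)  # running sum of p(y_i | x_i)
        self.counts = np.zeros(num_samples)         # epochs in which each sample was seen

    def update(self, indices: np.ndarray, true_class_probs: np.ndarray) -> None:
        # Accumulate the softmax probability of the ground-truth class observed
        # during the training forward passes (no extra forward passes needed).
        self.sum_true_prob[indices] += true_class_probs
        self.counts[indices] += 1

    def distribution(self, mode: str = "p") -> np.ndarray:
        avg = self.sum_true_prob / np.maximum(self.counts, 1)   # average true-class prob
        if mode == "p":          # p-SGD: favour samples with low true-class probability
            importance = 1.0 - avg
        else:                    # c-SGD: favour uncertain samples near the boundary
            importance = avg * (1.0 - avg)
        importance = importance + importance.mean()             # smoothing constant
        return importance / importance.sum()                    # sampling distribution

if __name__ == "__main__":
    sampler = ImportanceSampler(num_samples=1000)
    sampler.update(np.arange(1000), np.random.rand(1000))
    probs = sampler.distribution("p")
    subset = np.random.choice(1000, size=300, replace=False, p=probs)  # B*N samples
```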

Both p-SGD and c-SGD are very computationally efficient, as the importance estimation only requires information available during training. Conversely, Jiang et al. [2019_Arxiv_selective] propose to perform forward passes on all the samples to determine the most important ones and later reduce the amount of backward passes; they name this method selective backpropagation (SB). At every forward pass, SB stores the sample $x_i$ with probability:

$P(x_i) = \big[F_R\big(\ell(x_i)\big)\big]^{\beta}$,   (4)

where $F_R$ is the cumulative distribution function computed from a history of the loss values of the last $R$ samples seen by the model, $\ell(x_i)$ is the loss of sample $x_i$, and $\beta$ is a constant that determines the selectivity of the method, i.e. the budget used during the training. In practice, SB does as many forward passes as needed until it has enough samples to form a full mini-batch. It then performs the training forward and backward passes with the selected samples to update the model.
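A compact sketch of this selection rule is given below: the keep probability is the empirical CDF of recent losses raised to a selectivity power. It is an illustration of the idea described above; parameter names such as history_size are ours, and the official SB code should be consulted for details.

```python
import collections
import numpy as np

# Sketch of the selective-backpropagation (SB) selection rule described above.
# Parameter names (history_size, beta) are ours; see the official SB code for details.
class SelectiveBackprop:
    def __init__(self, history_size: int = 1024, beta: float = 2.0):
        self.loss_history = collections.deque(maxlen=history_size)
        self.beta = beta  # higher beta -> more selective -> smaller effective budget

    def keep_probability(self, loss: float) -> float:
        if not self.loss_history:
            return 1.0
        hist = np.asarray(self.loss_history)
        percentile = (hist < loss).mean()        # empirical CDF of the recent losses
        return percentile ** self.beta

    def select(self, loss: float) -> bool:
        keep = np.random.rand() < self.keep_probability(loss)
        self.loss_history.append(loss)
        return keep

if __name__ == "__main__":
    sb = SelectiveBackprop(beta=2.0)
    losses = np.random.exponential(size=5000)
    kept = [l for l in losses if sb.select(l)]   # samples buffered for the training batch
    print(f"kept {len(kept)} of {len(losses)} samples")
```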

Finally, as an alternative training paradigm that prioritizes the most important samples, Kawaguchi and Lu [2020_AISTATS_osgd] propose to use only the $q$ samples with the highest loss from each mini-batch in the backward pass. As the training accuracy increases, $q$ decreases until only a small fraction of the images in the mini-batch is used in the backward pass. The authors name this approach ordered SGD (OSGD) and provide a default setting for the adaptive values of $q$.
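A minimal sketch of the ordered-SGD idea follows (back-propagating only through the $q$ largest per-sample losses in a mini-batch); the adaptive schedule that shrinks $q$ during training is omitted, and the exact loss normalization may differ from the original method.

```python
import torch
import torch.nn.functional as F

# Sketch of an OSGD-style loss: only the q highest per-sample losses in a mini-batch
# contribute to the backward pass. The adaptive schedule for q is omitted here.
def ordered_sgd_loss(logits: torch.Tensor, targets: torch.Tensor, q: int) -> torch.Tensor:
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    topq, _ = torch.topk(per_sample, k=min(q, per_sample.numel()))
    return topq.mean()

if __name__ == "__main__":
    logits = torch.randn(128, 10, requires_grad=True)
    targets = torch.randint(0, 10, (128,))
    loss = ordered_sgd_loss(logits, targets, q=32)  # backward uses only the 32 hardest samples
    loss.backward()
```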

Importance sampling methods under budgeted training give a precise notion of the training budget. For unif-SGD, p-SGD, and c-SGD the adaptation needed consists of selecting a fixed number of samples $B \cdot N$ per epoch based on the corresponding sampling probability distribution, while still training for the full $E$ epochs. For SB, the parameter $\beta$ determines the selectivity of the algorithm: higher values reject more samples. Note that this method requires additional forward passes that we exclude from the budget, as they do not induce the backward passes used for training. Assuming that each DNN backward pass is twice as computationally expensive as a forward pass [2018_ICML_notAllSamples], these selection forward passes over the full dataset add roughly one third of a full training on top of the nominal budget, so the budget effectively used by SB is higher than the one reported. We adapt OSGD by truncating the training as in scan-SGD: all the parameters are kept constant but the total number of epochs is reduced to $B \cdot E$. We also consider the wall-clock time with respect to a full-budget training as a metric to evaluate the approaches.

4 Experiments and Results

4.1 Experimental framework

Datasets.  We experiment on image classification tasks using the CIFAR-10/100 [2009_CIFAR], SVHN [2011_NeurIPS_SVHN], and mini-ImageNet [2016_NIPS_MiniImageNet] datasets. CIFAR-10/100 consist of 50K samples for training and 10K for testing, each divided into 10 (100) classes for CIFAR-10 (100). The samples are images extracted from ImageNet [2009_CVPR_ImageNet] and down-sampled to 32×32. SVHN contains 32×32 RGB images of real-world house numbers divided into 10 classes, with 73257 for training and 26032 for testing. Mini-ImageNet is a subset of ImageNet with 50K samples for training and 10K for testing, divided into 100 classes and down-sampled to 84×84. Unless otherwise stated, all the experiments use standard data augmentation: random cropping with a padding of 4 pixels per side and random horizontal flip (except in SVHN, where the horizontal flip is omitted).

Training details.  We train a ResNet-18 architecture [2016_CVPR_ResNet] for 200 epochs with SGD with momentum of 0.9 and a batch size of 128. We use two learning rate schedules: step-wise and linear decay. For both schedules we adopt the budget-aware version proposed by Li et al. [2020_ICLR_budget] and use an initial learning rate of 0.1. In the step-wise case, the learning rate is divided by 10 at 1/3 (epoch 66) and 2/3 (epoch 133) of the training. The linear schedule decreases the learning rate at every iteration, linearly from the initial value to approximately zero at the end of the training. We always report the average accuracy and standard deviation of the model across 3 independent runs trained on a GeForce GTX 1080Ti GPU using the PyTorch library. For each budget, we report the best results in bold and the best results in each section (data augmentation or learning rate schedule) in blue (baseline SGD is excluded).
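For reference, a sketch of a budget-aware linearly decaying schedule of this kind (following Li et al. [2020_ICLR_budget]) implemented with a standard PyTorch LambdaLR; the constants in the usage example are ours.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Sketch of a budget-aware linear learning-rate decay (as in Li et al. [2020_ICLR_budget]):
# the learning rate goes from its initial value to approximately zero at the last
# iteration of the *budgeted* training.
def linear_budget_schedule(optimizer, total_iterations: int) -> LambdaLR:
    return LambdaLR(optimizer, lr_lambda=lambda it: max(0.0, 1.0 - it / total_iterations))

if __name__ == "__main__":
    model = torch.nn.Linear(10, 10)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    iters_per_epoch, budget, epochs = 391, 0.3, 200   # e.g. CIFAR with batch size 128
    total = int(budget * epochs * iters_per_epoch)
    sched = linear_budget_schedule(opt, total)
    for _ in range(total):
        opt.step()        # the actual training step would go here
        sched.step()      # decay the learning rate every iteration
```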

4.2 Budget-free training for importance sampling

             |            CIFAR-10            |            CIFAR-100
Method       |  A              T      S       |  A              T      S
SGD          |  94.58 ± 0.33   141    0.0     |  74.56 ± 0.06   141    0.0
p-SGD        |  94.41 ± 0.19   113    19.9    |  74.44 ± 0.06   127    9.9
c-SGD        |  94.17 ± 0.11   100    29.1    |  74.40 ± 0.06   127    9.9
SB (*)       |  93.90 ± 0.16   85     39.7    |  73.39 ± 0.37   119    15.6
OSGD (*)     |  94.34 ± 0.07   139    0.1     |  74.22 ± 0.21   141    0.0
Table 1: Test accuracy (%, A), wall-clock time (min, T), and speed-up (%, S) with respect to SGD under budget-free training. * denotes that we have used the official code.

Current importance sampling methods from the state-of-the-art are optimized with no restriction on the number of training iterations. While this allows the methods to better exploit the training process, it makes it difficult to evaluate their computational benefit. Therefore, Table 1 presents the performance, wall-clock time, and speed-up relative to a full training for the methods presented in Section 3. All methods train with a step-wise learning rate schedule. SGD corresponds to the standard training described in Subsection 4.1. p-SGD and c-SGD correspond to the methods introduced by Chang et al. [2017_NeurIPS_activeBias] and described in Section 3; for the experiments in Table 1 they train for 200 epochs, where the first 70 epochs consist of a warm-up stage with a uniform sampling strategy, as done in the original paper. For CIFAR-10 we use a budget of 0.8 for p-SGD and 0.7 for c-SGD, and for CIFAR-100 a budget of 0.9 for both approaches (the budgets retaining most accuracy were selected). Finally, SB and OSGD follow the setups described in the corresponding papers, [2019_Arxiv_selective] and [2020_AISTATS_osgd], and run on the official code.

While the simpler approaches to importance sampling, p-SGD and c-SGD, achieve performance similar to SGD and reduce computation by up to 29.08% (9.93%) in CIFAR-10 (CIFAR-100), SB reduces the training time by 39.72% (15.60%) in CIFAR-10 (CIFAR-100) with very small drops in accuracy. This supports the observations in the importance sampling literature that particular configurations can effectively reduce computational requirements while maintaining accuracy.

4.3 Budgeted training for importance sampling

                                CIFAR-10                                    CIFAR-100
Budget:        0.2           0.3           0.5              0.2           0.3           0.5
SGD - SLR                94.58 ± 0.33 (full)                          74.56 ± 0.06 (full)
SGD - LLR                94.80 ± 0.08 (full)                          75.44 ± 0.16 (full)
Step-wise decay of the learning rate (SLR)
scan-SGD       92.03 ± 0.24  93.06 ± 0.15  93.80 ± 0.15     70.89 ± 0.23  72.31 ± 0.22  73.49 ± 0.20
unif-SGD       91.82 ± 0.05  92.69 ± 0.07  93.71 ± 0.07     70.36 ± 0.30  72.03 ± 0.47  73.36 ± 0.20
p-SGD          92.28 ± 0.05  92.91 ± 0.18  93.85 ± 0.07     70.24 ± 0.28  72.11 ± 0.39  72.94 ± 0.36
c-SGD          91.70 ± 0.25  92.83 ± 0.30  93.71 ± 0.15     69.86 ± 0.36  71.56 ± 0.27  73.02 ± 0.34
SB             93.37 ± 0.11  93.86 ± 0.27  94.21 ± 0.13     70.94 ± 0.38  72.25 ± 0.68  73.39 ± 0.37
OSGD           90.61 ± 0.31  91.78 ± 0.30  93.45 ± 0.10     70.09 ± 0.25  72.18 ± 0.35  73.39 ± 0.22
Linear decay of the learning rate (LLR)
scan-SGD       92.95 ± 0.07  93.55 ± 0.21  94.22 ± 0.16     72.04 ± 0.42  72.97 ± 0.07  73.90 ± 0.43
unif-SGD       92.83 ± 0.14  93.48 ± 0.05  93.98 ± 0.11     72.02 ± 0.24  72.74 ± 0.57  73.93 ± 0.16
p-SGD          93.23 ± 0.14  93.63 ± 0.04  94.14 ± 0.11     71.72 ± 0.37  72.94 ± 0.37  74.06 ± 0.10
c-SGD          92.95 ± 0.17  93.54 ± 0.07  94.11 ± 0.24     71.37 ± 0.49  72.33 ± 0.18  73.93 ± 0.35
SB             93.78 ± 0.11  94.06 ± 0.37  94.57 ± 0.18     71.96 ± 0.67  73.11 ± 0.42  74.35 ± 0.34
OSGD           91.87 ± 0.36  93.00 ± 0.08  93.93 ± 0.22     71.25 ± 0.11  72.56 ± 0.36  73.40 ± 0.14
Table 2: Test accuracy (%) with a step-wise (SLR) and a linear (LLR) learning rate decay under different budgets. SGD - SLR and SGD - LLR are full (budget-free) trainings. Note that SB requires additional computation (forward passes).

We adapt importance sampling approaches as described in Section 3 and configure each method to constrain its computation to the given budget. Table 2 shows the performance of the analyzed methods under the same budgets for a step-wise learning rate decay (SLR) and the linear decay (LLR) proposed by Li et al. [2020_ICLR_budget] for budgeted training (described in Section 4.1). Surprisingly, most methods achieve very similar performance given a predefined budget, i.e. we do not observe faster convergence when using importance sampling. Both p-SGD and c-SGD provide marginal or no improvements: p-SGD marginally improves over unif-SGD in CIFAR-10, but fails to do so in CIFAR-100, and similar behaviour is observed for c-SGD. Conversely, SB surpasses the other approaches consistently for SLR and in most cases in the LLR setup. However, SB introduces additional forward passes that are not counted in the budget, while the other methods do not (see Section 3 for an estimation of the budget used by SB).

We consider scan-SGD and unif-SGD as two naive baselines for budgeted training. Despite their similar results (scan-SGD seems marginally better than unif-SGD), we use unif-SGD for further experimentation in the following subsections, as it adopts a uniform random sampling distribution, which allows a direct contrast with the importance sampling methods. Additionally, Table 2 confirms the effectiveness of the linear learning rate schedule proposed in [2020_ICLR_budget]: all methods consistently improve with this schedule and, in most cases, unif-SGD with LLR performs on par with SB with SLR and surpasses all the other methods that use SLR.

This failure of the sampling strategies to consistently outperform unif-SGD could be explained by importance sampling breaking the assumption that samples are i.i.d.: SGD assumes that a set of randomly selected samples represents the whole dataset and provides an unbiased estimate of the gradients. Importance sampling explicitly breaks this assumption and biases the gradient estimates. While this might produce gradient estimates that have a bigger impact on the loss, breaking the i.i.d. assumption leads SGD to biased solutions [2015_ICLRw_online, 2017_NeurIPS_activeBias, 2019_NeurIPS_autoAssist], which offsets the possible benefits of training with the most relevant samples. As a result, importance sampling does not bring a consistent speed-up in training. Note that approaches that weight the contribution of each sample by the inverse sampling probability to generate an unbiased gradient estimate obtain similar results [2016_ICLR_varianceRed, 2016_ICML_adaptive, 2017_NeurIPS_activeBias, 2018_NeurIPS_RAIS, 2019_NeurIPS_autoAssist].

4.4 Data variability importance during training

Figure 1: Importance of data variability in CIFAR-10 ((a) and (c)) and CIFAR-100 ((b) and (d)). (a) and (b) compare different training set selection strategies: randomly selecting samples at every epoch (unif-SGD) outperforms fixed core-set or random subsets. (c) and (d) compare the data variability of different training strategies: the entropy of sample counts during training (0.3 budget) shows that importance sampling, the linear learning rate, and data augmentation all contribute to higher data variability (entropy).

Core-set selection approaches [2018_ICLR_forget, 2020_ICLR_proxySelection] aim to find the most representative samples in the dataset to make training more efficient, while keeping accuracy as high as possible. Figure 1 (a) and (b) shows how both core-set selection and a randomly chosen fixed subset (Random) under-perform uniform random sampling of a different subset each epoch (unif-SGD), which approaches standard training performance (black dashed line). Randomly selecting a different subset every epoch is equally computationally efficient, yet achieves substantially better accuracy. This result supports the widely adopted assumption that data variability is key and suggests that it might be more important than sample quality.

We also find that data variability plays an important role within importance sampling. Figure 1 (c) and (d) shows data variability measured as the entropy $H(c)$ of the number of times each sample is seen by the network during training, with $c$ being the $N$-dimensional distribution of sample counts. These results show how increases in variability (higher entropy) accompany the accuracy improvements of p-SGD when introducing the LLR schedule, the smoothing constant in the sampling distribution, the averaging of predictions across epochs, and data augmentation.
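This variability measure can be computed as the entropy of the normalized sample-count histogram; a minimal sketch (our own illustration):

```python
import numpy as np

# Sketch of the data-variability measure: entropy of the distribution of how many
# times each training sample was seen during the budgeted training.
def count_entropy(sample_counts: np.ndarray) -> float:
    p = sample_counts / sample_counts.sum()       # normalize counts to a distribution
    p = p[p > 0]                                  # ignore never-seen samples (0 log 0 = 0)
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    uniform = np.full(50_000, 60)                 # every sample seen the same number of times
    skewed = np.random.zipf(2.0, size=50_000)     # a few samples dominate the training
    print(count_entropy(uniform), count_entropy(skewed))  # uniform counts give higher entropy
```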

4.5 Data augmentation for importance sampling

                                CIFAR-10                                    CIFAR-100
Budget:        0.2           0.3           0.5              0.2           0.3           0.5
Standard data augmentation
SGD (B = 1)              94.80 ± 0.08                                 75.44 ± 0.16
unif-SGD       92.83 ± 0.14  93.48 ± 0.05  93.98 ± 0.11     72.02 ± 0.24  72.74 ± 0.57  73.93 ± 0.16
p-SGD          93.23 ± 0.14  93.63 ± 0.04  94.14 ± 0.11     71.72 ± 0.37  72.94 ± 0.37  74.06 ± 0.10
SB             93.78 ± 0.11  94.06 ± 0.37  94.57 ± 0.18     71.96 ± 0.67  73.11 ± 0.42  74.35 ± 0.34
RandAugment data augmentation (N = 2, M = 4)
SGD (B = 1)              95.56 ± 0.12                                 75.52 ± 0.17
unif-SGD       92.76 ± 0.16  93.78 ± 0.11  94.64 ± 0.08     71.44 ± 0.37  73.23 ± 0.29  74.78 ± 0.45
p-SGD          92.95 ± 0.31  93.99 ± 0.28  94.91 ± 0.18     71.63 ± 0.27  72.91 ± 0.13  74.30 ± 0.04
SB             93.27 ± 0.38  94.64 ± 0.07  95.27 ± 0.26     66.84 ± 1.15  73.79 ± 0.40  74.87 ± 0.18
mixup data augmentation (α = 0.3)
SGD (B = 1)              95.82 ± 0.17                                 77.62 ± 0.40
unif-SGD       93.64 ± 0.27  94.49 ± 0.04  95.18 ± 0.05     73.28 ± 0.51  75.13 ± 0.52  75.80 ± 0.34
p-SGD          93.78 ± 0.04  94.41 ± 0.16  95.26 ± 0.06     73.35 ± 0.29  75.05 ± 0.15  75.87 ± 0.15
SB             93.62 ± 0.36  93.92 ± 0.08  94.51 ± 0.17     73.38 ± 0.13  74.88 ± 0.31  75.57 ± 0.23
RICAP data augmentation (β = 0.3)
SGD (B = 1)              96.17 ± 0.09                                 78.91 ± 0.07
unif-SGD       93.85 ± 0.10  94.93 ± 0.29  95.47 ± 0.18     74.87 ± 0.28  76.27 ± 0.32  77.83 ± 0.15
p-SGD          94.02 ± 0.18  94.79 ± 0.18  95.63 ± 0.15     74.59 ± 0.15  76.50 ± 0.22  77.58 ± 0.49
SB             89.93 ± 0.84  93.64 ± 0.42  94.76 ± 0.02     56.66 ± 0.65  72.24 ± 0.58  76.26 ± 0.22
Table 3: Test accuracy (%) when combining data augmentation with budgeted importance sampling in CIFAR-10/100. N and M are the number and strength of the RandAugment augmentations, and α (mixup) and β (RICAP) control the interpolation. Note that SGD corresponds to the full training (B = 1).

Importance sampling approaches usually do not explore the interaction of sampling strategies with data augmentation techniques [2015_ICLRw_online, 2018_ICML_notAllSamples, 2019_Arxiv_selective]. To better understand this interaction, we explore interpolation-based augmentations via RICAP [2018_ACML_RICAP] and mixup [2018_ICLR_mixup], and non-interpolation augmentations using RandAugment [2020_CVPRw_randAugm]. We implemented these data augmentation policies as reported in the original papers (see Table 3 for the hyperparameters used in our experiments). Note that for mixup and RICAP we combine 2 and 4 images, respectively, within each mini-batch, which results in the same number of samples being shown to the network (the budget is unaffected).
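For reference, a minimal in-batch mixup sketch follows (two images combined per mini-batch element, with the mixing coefficient drawn from Beta(α, α) and α = 0.3 as in the setup above); RICAP analogously patches together crops of four images. This is an illustrative sketch rather than the exact implementation used here.

```python
import numpy as np
import torch

# Minimal in-batch mixup sketch: each mini-batch element is a convex combination of
# two images; the loss is the same convex combination of the two cross-entropies.
def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.3):
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    return mixed_x, y, y[perm], lam    # use: lam * CE(out, y) + (1 - lam) * CE(out, y[perm])

if __name__ == "__main__":
    x = torch.randn(128, 3, 32, 32)
    y = torch.randint(0, 10, (128,))
    mixed_x, y_a, y_b, lam = mixup_batch(x, y, alpha=0.3)
```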

                                  SVHN                                    mini-ImageNet
Budget:        0.2           0.3           0.5              0.2           0.3           0.5
Standard data augmentation
SGD (B = 1)              97.02 ± 0.05                                 75.19 ± 0.16
unif-SGD       96.56 ± 0.12  96.78 ± 0.13  96.95 ± 0.07     70.87 ± 0.56  72.19 ± 0.43  73.88 ± 0.42
p-SGD          96.52 ± 0.03  96.75 ± 0.03  96.84 ± 0.06     71.05 ± 0.29  72.39 ± 0.45  73.66 ± 0.39
SB             96.93 ± 0.07  96.85 ± 0.01  96.97 ± 0.06     69.68 ± 0.09  71.46 ± 0.15  73.51 ± 0.30
RandAugment data augmentation (N = 2, M = 4)
SGD (B = 1)              97.59 ± 0.14                                 74.15 ± 0.22
unif-SGD       97.38 ± 0.05  97.50 ± 0.07  97.60 ± 0.05     71.29 ± 0.25  73.04 ± 0.34  73.21 ± 0.52
p-SGD          97.25 ± 0.03  97.44 ± 0.02  97.52 ± 0.03     71.43 ± 0.25  72.36 ± 0.15  73.21 ± 0.38
SB             97.42 ± 0.09  97.43 ± 0.19  97.56 ± 0.05     67.17 ± 2.51  71.69 ± 0.31  73.28 ± 0.03
mixup data augmentation (α = 0.3)
SGD (B = 1)              97.24 ± 0.03                                 76.28 ± 0.28
unif-SGD       96.99 ± 0.09  97.04 ± 0.08  97.24 ± 0.07     72.50 ± 0.51  73.76 ± 0.26  75.05 ± 0.29
p-SGD          96.92 ± 0.08  97.34 ± 0.49  97.37 ± 0.49     72.21 ± 0.81  73.63 ± 0.13  74.54 ± 0.53
SB             96.80 ± 0.09  96.92 ± 0.09  96.96 ± 0.09     70.12 ± 0.51  72.01 ± 0.72  73.76 ± 0.36
RICAP data augmentation (β = 0.3)
SGD (B = 1)              97.61 ± 0.06                                 78.75 ± 0.40
unif-SGD       97.47 ± 0.04  97.62 ± 0.16  97.55 ± 0.04     73.56 ± 0.24  75.15 ± 0.45  77.20 ± 0.33
p-SGD          97.48 ± 0.08  97.45 ± 0.06  97.57 ± 0.05     73.67 ± 0.60  75.46 ± 0.27  77.25 ± 0.47
SB             97.34 ± 0.03  97.40 ± 0.06  97.45 ± 0.01     53.26 ± 0.71  71.75 ± 0.67  75.65 ± 0.40
Table 4: Test accuracy (%) when combining data augmentation with budgeted importance sampling in SVHN and mini-ImageNet. N and M are the number and strength of the RandAugment augmentations, and α (mixup) and β (RICAP) control the interpolation.

Tables 3 and 4 show that data augmentation is beneficial in a budgeted training scenario: in most cases all strategies increase performance compared to standard data augmentation. The main exception is SB at the lowest budget, where data augmentation sometimes hurts performance. In particular, with RICAP and mixup, the improvements from importance sampling approaches are marginal and the naive unif-SGD provides results close to a full training with standard augmentation. In some cases unif-SGD even surpasses full training with standard augmentation, e.g. RICAP with 0.3 and 0.5 budgets and mixup with 0.5 budget in CIFAR-10/100. This is even more evident in SVHN, where unif-SGD with RICAP surpasses full training (SGD) with standard augmentation at every budget in Table 4.

Data augmentation              unif-SGD   p-SGD   SB
Standard data augmentation        47        48    91
RandAugment                       48        48    93
mixup                             48        48    93
RICAP                             49        49    95
Table 5: Wall-clock time (minutes) in CIFAR-100 for a training with a 0.3 budget.

Given that the cost of the data augmentation policies used is negligible (see Table 5 for the wall-clock times at a 0.3 budget), our results show that adequate data augmentation alone can reduce training time at no accuracy cost, and in some cases with a considerable increase in accuracy. For example, a 70% reduction in training time (0.3 budget) corresponds to an increase in accuracy from 75.44% to 76.27% in CIFAR-100 and from 94.80% to 94.93% in CIFAR-10, while a 50% reduction (0.5 budget) corresponds to an increase from 75.44% to 77.83% in CIFAR-100 and from 94.80% to 95.47% in CIFAR-10.

                       CIFAR-10                     CIFAR-100                  mini-ImageNet
Budget:        0.05          0.1            0.05          0.1            0.05          0.1
Standard data augmentation
unif-SGD       87.90 ± 0.40  91.46 ± 0.08   62.66 ± 0.65  69.34 ± 0.68   56.38 ± 0.11  67.61 ± 0.52
p-SGD          88.86 ± 0.17  91.66 ± 0.11   62.20 ± 0.56  69.32 ± 0.17   56.95 ± 0.43  67.67 ± 0.41
SB             79.45 ± 4.31  92.66 ± 0.14   50.53 ± 2.27  68.29 ± 0.68   11.19 ± 3.46  61.25 ± 1.76
RandAugment data augmentation (N = 2, M = 4)
unif-SGD       83.24 ± 0.06  88.95 ± 0.22   47.64 ± 3.34  64.48 ± 0.10   42.35 ± 1.54  64.98 ± 0.47
p-SGD          83.94 ± 0.26  89.77 ± 0.38   48.78 ± 1.48  65.05 ± 0.37   41.72 ± 0.77  65.88 ± 0.15
SB             32.21 ± 4.14  33.86 ± 5.02    5.05 ± 0.64   5.05 ± 0.64    5.61 ± 0.66   5.94 ± 0.13
mixup data augmentation (α = 0.3)
unif-SGD       87.33 ± 0.42  91.74 ± 0.04   59.90 ± 0.71  70.43 ± 0.45   53.13 ± 0.83  68.54 ± 0.98
p-SGD          87.56 ± 0.67  91.59 ± 0.17   59.68 ± 0.71  70.31 ± 0.10   54.20 ± 0.95  68.39 ± 0.46
SB             77.72 ± 5.31  92.56 ± 0.15   43.27 ± 7.37  69.64 ± 0.24   12.10 ± 0.27  61.01 ± 0.64
RICAP data augmentation (β = 0.3)
unif-SGD       85.61 ± 0.24  91.32 ± 0.28   55.85 ± 0.51  69.43 ± 0.33   48.95 ± 0.65  67.26 ± 0.63
p-SGD          85.57 ± 0.70  90.94 ± 0.16   56.09 ± 0.71  70.05 ± 0.07   49.35 ± 0.60  67.27 ± 0.85
SB             44.93 ± 2.67  54.76 ± 4.31   10.75 ± 0.72  13.33 ± 0.39    8.71 ± 0.45  10.84 ± 0.86
Table 6: Test accuracy (%) for CIFAR-10/100 and mini-ImageNet under extreme budgets.

We also experimented with extremely low budgets (see Table 6) and found that importance sampling approaches (p-SGD and SB) still bring little improvement over uniform random sampling (unif-SGD). Here, additional data augmentation does not bring a significant improvement in accuracy and, in the most challenging cases, hinders convergence. For example, when introducing RICAP at a 0.05 budget, the accuracy drops approximately 2 points in CIFAR-10, 5 points in CIFAR-100, and 7 points in mini-ImageNet with respect to the 87.90%, 62.66%, and 56.38% obtained by unif-SGD with standard data augmentation.

5 Conclusion

This paper studied DNN training for image classification when the number of training iterations is fixed (i.e. budgeted training) and explored the interaction of importance sampling techniques and data augmentation in this setup. We empirically showed that, in budgeted training, DNNs benefit more from variability than from the selection of important samples: adequate data augmentation surpasses state-of-the-art importance sampling methods and allows for up to a 70% reduction of the training time (budget) with no loss (and sometimes an increase) in accuracy. In future work, we plan to explore the limitations found at extreme budgets and to extend the study to large-scale datasets, where training DNNs becomes a longer process. We also find it particularly interesting to study how the conclusions presented in this paper generalize to different tasks, types of data, and model architectures. Finally, we encourage the use of data augmentation techniques rather than importance sampling approaches in scenarios where the iteration budget is restricted, and we motivate further research on these scenarios to better exploit computational resources.

Acknowledgment

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283 and SFI/12/RC/2289_P2.

References