1 Introduction
Many machine learning models can be formulated as the following empirical risk minimization problem:
$$\min_{\mathbf{w}} F(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w}) \qquad (1)$$

where $\mathbf{w}$ is the parameter to learn, $f_i(\mathbf{w})$ corresponds to the loss on the $i$th training sample, and $n$ is the total number of training samples.
With the rapid growth of data in real applications, stochastic optimization methods have become more popular than batch methods for solving the problem in (1). The most popular stochastic optimization method is stochastic gradient descent (SGD) (Zhang, 2004; Xiao, 2009; Bottou, 2010; Duchi et al., 2010). One practical way to adopt SGD for learning is the so-called EpochSGD (Hazan and Kale, 2014) in Algorithm 1
, which has been widely used by mainstream machine learning platforms like PyTorch and TensorFlow. In each outer iteration (also called an epoch) of Algorithm 1, EpochSGD first samples a sequence of indices from $\{1, 2, \ldots, n\}$ according to a distribution defined on the full training set. A typical choice is the uniform distribution. We can also set the sequence to be a permutation of $\{1, 2, \ldots, n\}$ (Tseng, 1998). In the inner iteration, the stochastic gradients computed on the sampled sequence are used to update the parameter. The mini-batch size is one in the inner iteration of Algorithm 1; in real applications, a larger mini-batch size can also be used. After the inner iteration is completed, EpochSGD adjusts the step size to guarantee that the step-size sequence is non-increasing. Although many theoretical results suggest returning the average of the inner iterates, we usually take the last one as the initialization of the next outer iteration.

To further accelerate EpochSGD in Algorithm 1, three main categories of methods have recently been proposed. The first category adopts momentum, Adam or Nesterov's acceleration (Nesterov, 2007; Leen and Orr, 1993; Tseng, 1998; Lan, 2012; Kingma and Ba, 2014; Ghadimi and Lan, 2016; Allen-Zhu, 2018) to modify the update rule of SGD in Line 6 of Algorithm 1. This category of methods has a faster convergence rate than SGD when the variance of the stochastic gradients is small, and empirical results show that these methods are more stable than SGD. However, due to the variance of the stochastic gradients, the convergence rate of these methods is the same as that of SGD when the variance is large.

The second category designs new stochastic gradients to replace the vanilla stochastic gradient in the inner iteration of Algorithm 1 such that the variance of the stochastic gradients can be reduced (Johnson and Zhang, 2013; Shalev-Shwartz and Zhang, 2013; Nitanda, 2014; Shalev-Shwartz and Zhang, 2014; Defazio et al., 2014; Schmidt et al., 2017). Representative methods include SAG (Schmidt et al., 2017) and SVRG (Johnson and Zhang, 2013). These methods can achieve a faster convergence rate than vanilla SGD in most cases. However, the faster convergence of these methods is typically based on a smoothness assumption on the objective function, which might not be satisfied in real problems. Another disadvantage of these methods is that they usually need extra memory and computation cost to obtain the stochastic gradients.
The third category comprises the importance sampling based methods, which try to design a better sampling distribution (Zhao and Zhang, 2015; Csiba et al., 2015; Namkoong et al., 2017; Katharopoulos and Fleuret, 2018; Borsos et al., 2018). With a properly designed distribution, these methods can also reduce the variance of the stochastic gradients and hence achieve a faster convergence rate than SGD. Zhao and Zhang (2015) design a distribution according to the global Lipschitz or smoothness constants. The distribution is first calculated based on the training set and then fixed during the whole training process. Csiba et al. (2015) and Namkoong et al. (2017) propose adaptive distributions which change in each epoch. Borsos et al. (2018) adopt online optimization to obtain an adaptive distribution. There also exist some other heuristic importance sampling methods (Shrivastava et al., 2016; Lin et al., 2017), which mainly focus on training samples with large loss (hard examples) and set the weight of samples with small loss to be small or zero.

One shortcoming of SGD and its variants, including the accelerated variants introduced above, is that the sample size in each iteration (epoch) of training is the same as the size of the full training set. This can also be observed in Algorithm 1, where a sequence of indices must be sampled from $\{1, 2, \ldots, n\}$. Even for the importance sampling based methods, each sample in the full training set has a nonzero probability of being sampled in each outer iteration (epoch), and hence no samples can be discarded during training.
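As a concrete reference, the EpochSGD procedure of Algorithm 1 can be sketched as follows on a toy least-squares problem; the function names, the halving step-size schedule, and the norm-proportional sampling distribution are illustrative assumptions rather than the paper's exact choices. Passing a non-uniform distribution `p` (with the usual $1/(np_i)$ reweighting) turns the same loop into importance-sampled SGD, the third category above.

```python
import numpy as np

def epoch_sgd(grad_fn, n, w0, eta0, epochs=5, p=None, seed=0):
    """Minimal sketch of EpochSGD (Algorithm 1).

    grad_fn(i, w): stochastic gradient of the i-th sample's loss.
    p: sampling distribution over {0, ..., n-1}; uniform if None.
       With p given, gradients are reweighted by 1/(n*p[i]) so that the
       update remains an unbiased estimate of the full gradient.
    """
    rng = np.random.default_rng(seed)
    w, eta = np.asarray(w0, dtype=float), eta0
    for _ in range(epochs):                   # outer iteration (epoch)
        idx = rng.choice(n, size=n, p=p)      # sample an index sequence
        for i in idx:                         # inner iteration, batch size 1
            scale = 1.0 if p is None else 1.0 / (n * p[i])
            w = w - eta * scale * grad_fn(i, w)
        eta *= 0.5                            # non-increasing step sizes
    return w                                  # last iterate, not the average

# Toy problem: noiseless least squares, f_i(w) = (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true
grad = lambda i, w: 2.0 * (X[i] @ w - y[i]) * X[i]

w_hat = epoch_sgd(grad, n=200, w0=np.zeros(5), eta0=0.02, epochs=20)

# Importance sampling: a distribution proportional to feature norms
# (an illustrative choice), with the 1/(n*p_i) correction applied above.
norms = np.linalg.norm(X, axis=1)
w_is = epoch_sgd(grad, n=200, w0=np.zeros(5), eta0=0.02, epochs=20,
                 p=norms / norms.sum())
```

Both variants recover the true parameter on this noiseless toy problem; the non-uniform run only changes which indices appear more often, not the expectation of the update.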
In this paper, we propose a new method, called adaptive sample selection (ADASS), to address the above shortcoming of existing SGD and its variants. The contributions of ADASS are outlined as follows:

During different epochs of training, ADASS only needs to visit different training subsets which are adaptively selected from the full training set according to the Lipschitz constants of the loss functions on the samples. This means that in ADASS the sample size in each epoch of training can be smaller than the size of the full training set, because some samples are discarded.

ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration.

Theoretical results show that the learning accuracy of ADASS is comparable to that of its counterparts trained with the full training set.

Empirical results on both shallow models and deep models also show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.
Notation: We use boldface lowercase letters like $\mathbf{w}$ to denote vectors, and boldface uppercase letters like $\mathbf{X}$ to denote matrices. $\mathbf{w}^*$ denotes the optimum of (1) and $\|\cdot\|$ denotes the $L_2$ norm. $[n]$ denotes the set $\{1, 2, \ldots, n\}$, and $\mathbf{e}_i$ denotes the vector with the $i$th element being $1$ and the others being $0$. One epoch means that the algorithm passes through the selected training samples once.

2 A Simple Case: Least Squares
We first adopt least squares to gain some hints for designing effective sample selection strategies, because least squares is a simple model with a closed-form solution.
Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, least squares tries to optimize the following objective function:

$$\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i^\top \mathbf{w} - y_i)^2. \qquad (2)$$

For convenience, let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$. Furthermore, we assume $\mathbf{X}^\top \mathbf{X}$ is invertible, which is generally satisfied when $n > d$. Then, the optimal parameter of (2) can be directly computed as follows:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \qquad (3)$$
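The closed-form solution in (3) can be checked numerically; the following is a small sketch with synthetic data, where the dimensions and noise level are arbitrary choices for illustration:

```python
import numpy as np

# Closed-form least-squares solution w* = (X^T X)^{-1} X^T y, as in (3).
rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))                  # n > d, so X^T X is invertible (a.s.)
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)   # linear map plus small Gaussian noise

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
# w_star recovers w_true up to the noise level, and the residual
# X @ w_star - y is orthogonal to the columns of X.
```

Using `np.linalg.solve` on the normal equations avoids forming an explicit matrix inverse, which is both cheaper and numerically safer.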
Let $(\pi(1), \ldots, \pi(n))$ be a permutation of $(1, \ldots, n)$, and let $S = \{\pi(1), \ldots, \pi(m)\}$ with $m \leq n$. Then $\mathbf{X}_S$ and $\mathbf{y}_S$ denote the features and supervised information of the selected samples indexed by $S$. For simplicity, we assume $\mathbf{X}_S^\top \mathbf{X}_S$ is invertible. Then it is easy to get that

$$\mathbf{w}_S^* = (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{y}_S. \qquad (4)$$
We are interested in the difference between $\mathbf{w}_S^*$ and $\mathbf{w}^*$. If the difference is very small, it means that we can use fewer training samples to estimate $\mathbf{w}^*$. We have the following lemma about the relationship between $\mathbf{w}_S^*$ and $\mathbf{w}^*$: letting $S^c$ denote the discarded indices, the two solutions satisfy

$$\mathbf{w}_S^* - \mathbf{w}^* = (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_{S^c}^\top (\mathbf{X}_{S^c}\mathbf{w}^* - \mathbf{y}_{S^c}).$$

Based on Lemma 2, we can get the following corollary. Assume $\mathbf{X}_S^\top \mathbf{X}_S$ is invertible with smallest eigenvalue $\lambda_{\min} > 0$. Then $\|\mathbf{w}_S^* - \mathbf{w}^*\|$ can be upper bounded, up to a factor of $1/\lambda_{\min}$, either by a term involving the losses of the discarded samples (the first inequality) or by a term involving the Lipschitz constants of the loss functions on the discarded samples (the second inequality). Here, the Lipschitz constant of a sample's loss around $\mathbf{w}^*$ measures how fast that loss can change locally. We call the bound of the first inequality in Corollary 2 the loss bound because it is related to the losses on the samples, and we call the bound of the second inequality the Lipschitz bound because it is related to the Lipschitz constants of the loss functions on the samples. In both the loss bound and the Lipschitz bound, the terms on the right-hand side correspond to the discarded (unselected) training samples indexed by $S^c$. Corollary 2 gives us a hint that in least squares, to make the gap between $\mathbf{w}_S^*$ and $\mathbf{w}^*$ as small as possible, we should discard the training samples with the smallest losses or smallest Lipschitz constants. That means we should select the training samples with the largest losses or largest Lipschitz constants.
We design an experiment to further illustrate the results in Corollary 2. The features are drawn from three different distributions, including the uniform distribution. The corresponding labels are obtained by a linear transformation of the features plus a small Gaussian noise. We compare three sample selection criteria: the Lipschitz criterion based on the Lipschitz bound, the loss criterion based on the loss bound, and the random criterion, with which samples are randomly selected. The result is shown in Figure 1, in which the y-axis denotes $\|\mathbf{w}_S^* - \mathbf{w}^*\|$ and the x-axis denotes the sampling ratio $m/n$. We can find that both the Lipschitz criterion and the loss criterion achieve better performance than the random criterion for estimating $\mathbf{w}^*$ with a subset of samples.

3 Deep Analysis of Sample Selection Criteria
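The comparison in Figure 1 can be reproduced in spirit with a few lines of NumPy. The sketch below uses synthetic data with arbitrary sizes, implements only the loss and random criteria (keeping the samples with the largest losses at $\mathbf{w}^*$), and measures the distance between the subset solution (4) and the full solution (3):

```python
import numpy as np

def subset_solution(X, y, idx):
    """Least-squares solution (4) computed only on the samples in idx."""
    Xs, ys = X[idx], y[idx]
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

rng = np.random.default_rng(0)
n, d, m = 500, 10, 450                       # keep m of n samples, discard 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # linear map plus Gaussian noise

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # full-data optimum (3)
losses = (X @ w_star - y) ** 2               # per-sample losses at w*

# Loss criterion: keep the m samples with the largest losses.
keep_loss = np.argsort(losses)[-m:]
# Random criterion: keep m samples uniformly at random.
keep_rand = rng.choice(n, size=m, replace=False)

err_loss = np.linalg.norm(subset_solution(X, y, keep_loss) - w_star)
err_rand = np.linalg.norm(subset_solution(X, y, keep_rand) - w_star)
# Discarding the smallest-loss samples perturbs the solution far less
# than discarding random samples, as the loss bound predicts.
```

Since the lemma expresses $\mathbf{w}_S^* - \mathbf{w}^*$ in terms of the residuals of the discarded samples, discarding the low-loss (low-residual) samples keeps the gap small, which is exactly what this toy run exhibits.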
Based on the results of Corollary 2 for least squares, it seems that both the loss and the Lipschitz constant can be adopted as criteria for sample selection. In this section, we give a deeper analysis of these two criteria and find that, in general cases, the Lipschitz constant can be used for sample selection but the loss cannot.
3.1 Loss-based Sample Selection
Based on the loss criterion, in each iteration the algorithm selects the samples with the largest losses at the current parameter and learns with these selected samples to update the parameter. Intuitively, if the loss is large, it means the model has not fit the $i$th sample well enough and this sample needs to be trained again.

Unfortunately, loss-based sample selection cannot theoretically guarantee the convergence of the learning procedure, and a simple two-sample counterexample exists: starting from a suitable initialization and repeatedly minimizing only the currently worst-loss sample produces a divergent parameter sequence. In fact, even if the parameter minimizes the loss on the selected samples with the largest losses at the $t$th iteration, it can also make the loss functions of the other unselected samples increase.
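To make the non-convergence concrete, here is a hypothetical two-sample instance (a stand-in with the same flavor as the example above, not its exact functions): the full-data optimum is $w = 0$, yet always selecting and exactly minimizing the single sample with the largest current loss makes the iterate oscillate forever.

```python
# Hypothetical two-sample instance: f1(w) = (w - 1)^2, f2(w) = (w + 1)^2.
# The full-data optimum is w = 0, but always selecting the sample with
# the largest current loss and minimizing it exactly never settles there.
def loss_selection_step(w):
    f1, f2 = (w - 1.0) ** 2, (w + 1.0) ** 2
    # Jump to the minimizer of whichever loss is currently larger.
    return 1.0 if f1 >= f2 else -1.0

w = 0.5
trajectory = [w]
for _ in range(6):
    w = loss_selection_step(w)
    trajectory.append(w)
# The trajectory oscillates between -1 and 1 and never approaches w = 0.
```

Each step drives the selected sample's loss to zero while pushing the other sample's loss up to 4, which is exactly the failure mode described above.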
The loss criterion has another disadvantage. Consider perturbing each label by an unknown noise term. Minimizing the perturbed objective is equivalent to minimizing the original one up to a constant, yet the noise can seriously disrupt which samples have the largest losses. In Figure 1, it is possible to design suitable noise to make the blue line coincide with the green line. Hence, the loss criterion is also not robust for sample selection.
3.2 Lipschitz Constant-based Sample Selection

In this subsection, we theoretically prove that the Lipschitz constant is a good criterion for sample selection.

Consider decomposing the objective into a part to retain and a part to discard, together with a local region around the current parameter. We say the discarded part is insignificant on the local region if
(5) 
This definition can be explained as follows. Inspired by the empirical result in Section 2 and the negative example in Section 3.1, we realize that the selected samples can only be used on a local region, so the definition above focuses on a local region around the current parameter. The right-hand side of (5) denotes the decrease in the retained part, and the left-hand side is the increase in the discarded part. Mathematically, if the discarded part is insignificant on the local region, we can obtain
(6) 
The above equation implies that, starting from the current parameter as initialization, we can make the full objective decrease by minimizing only the retained part on the local region. So when we optimize on a local region, the retained part plays the leading role, while the discarded part has an insignificant effect and subsequently can be discarded.

One trivial decomposition is to retain everything and discard nothing; then the discarded part is trivially insignificant on any local region. In the following, we will design a nontrivial decomposition of the objective to facilitate sample selection. First, we give the following assumptions.
Assumption 1
(Local Lipschitz continuity) For each training sample and each local region, there exists a constant such that
(7) 
Assumption 2
For any fixed selection, there exists a constant such that, on the corresponding local region,
(8) 
where the reference point is defined as the minimizer over the corresponding local region.
For most loss functions used in machine learning, the gradients are bounded on a bounded closed domain, which guarantees the local Lipschitz continuity property. Hence, Assumption 1 is satisfied by most machine learning models. The constant in Assumption 1 is the local Lipschitz constant, which is determined by the specific sample and the size of the neighborhood. Hence, we set different Lipschitz constants for different samples.
Assumption 2 can also be satisfied by most machine learning models, which is explained as follows. If each per-sample loss is strongly convex, then (8) is satisfied with a constant determined by the strong convexity parameter. Lemma 3.2 implies that Assumption 2 can be easily satisfied for strongly convex objective functions. For convex objective functions, it is easy to transform them into strongly convex ones by adding a small $\ell_2$-norm term.
For nonconvex objective functions, it is difficult to validate (8). However, we can verify it empirically. We randomly choose a parameter and fix it. Then, with that initialization, we train ResNet20 on a subset of CIFAR-10. We run momentum SGD to estimate the local minimum around the initialization. For each setting, we repeat the experiment 10 times. The result is shown in Figure 2. We can find that the two sides of (8) are almost proportional. Hence, Assumption 2 is also reasonable for nonconvex objective functions.
With the above two assumptions, we can obtain the following theorem. On a local region, we define two functions, corresponding to the retained and discarded parts, as follows:

Let the constants satisfy Assumption 2. Then the discarded part is insignificant on the local region. For any point in the local region, according to Assumption 1, we have
(9) 
Since the constants satisfy Assumption 2, according to the definition of the discarded part, we have

Hence the discarded part is insignificant on the local region.

According to the definitions in Theorem 3.2, we first fix the local region, and then select the samples with the largest local Lipschitz constants to construct the retained part, so that the contribution of the discarded part stays below the required threshold. This is reasonable because a small local Lipschitz constant implies that the landscape of the loss function is flat and will not change much on a local region.
The decomposition defined in Theorem 3.2 has another important property, which is stated in the following theorem. Assume the local region and the constants are fixed, and that the retained and discarded parts are defined as in Theorem 3.2. Then minimizing the retained part is comparable to minimizing the full objective on the local region with the same initialization, which means that
where
(10)  
(11) 
Specifically, the gap between the two minimum values is controlled by the quantities in (10) and (11).
Since the discarded part is insignificant on the local region, we have
which means
On the other hand, according to Assumption 2, we have
Then we have
According to Theorem 3.2, we can directly get that
(12) 
Moreover, we have the following corollary. With the decomposition defined in Theorem 3.2, we have
(13) 
Moreover, under the conditions above, we have

Here, the quantities are defined in (10) and (11). This implies that, on the local region, we should select the samples with the largest Lipschitz constants so that the gap between the subset solution and the full solution is guaranteed to be small. Since the subset solution is obtained from the selected subset, we can use fewer training samples for optimization. This is also consistent with the second inequality in Corollary 2. Please note that Corollary 2 is only for least squares, while the result in Corollary 3.2 applies to general machine learning cases.
4 ADASS
From the empirical results in Figure 1, we can find that if we discard some training samples permanently, which means that some samples are never used during the whole training process, it is difficult to get the optimal solution of (1). Hence, different training subsets need to be adaptively selected from the full training set in different iterations (epochs). Because the sample selection of our method is adaptive to different training states, we name our method adaptive sample selection (ADASS). ADASS can be seamlessly integrated with any optimization method, such as SGD or momentum SGD, to accelerate the training of these methods.
ADASS adopts the Lipschitz constant as the criterion for sample selection. To reduce the sample selection cost, we periodically perform sample selection after every $k$ epochs in our implementation. Furthermore, we use the change in the loss values between two selection points to estimate the local Lipschitz constant. Then, the criterion for sample selection is to choose a subset of the samples with the largest estimated constants such that

Here, $c$ is a hyperparameter denoting the threshold, the selected subset refers to the samples chosen in the current selection round, and a common term appearing in both the denominator and the numerator has been omitted.
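The selection rule can be sketched as follows. Since parts of the exact formula are not shown above, the score definition and the prefix-mass stopping rule below are a reconstruction under stated assumptions: rank samples by the change in per-sample loss, then keep the top-ranked samples until a fraction $c$ of the total score is covered.

```python
import numpy as np

def adass_select(loss_now, loss_ref, c=0.999):
    """Sketch of the ADASS selection rule (a reconstruction).

    Rank samples by |l_i(w_t) - l_i(w_ref)|, an estimate of the local
    Lipschitz constant, and keep the smallest set of top-ranked samples
    whose scores account for at least a fraction c of the total score.
    """
    scores = np.abs(loss_now - loss_ref)
    order = np.argsort(scores)[::-1]          # largest scores first
    cum = np.cumsum(scores[order])
    k = int(np.searchsorted(cum, c * cum[-1])) + 1
    return order[:k]                          # indices of the selected samples

loss_now = np.array([0.9, 0.1, 0.5, 0.05, 0.3])
loss_ref = np.array([0.2, 0.1, 0.1, 0.05, 0.25])
selected = adass_select(loss_now, loss_ref, c=0.9)   # keeps samples 0 and 2
```

Samples whose losses barely changed contribute nothing to the score mass and are discarded, while a threshold of $c = 1$ degenerates to keeping every sample with a nonzero score.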
The training procedure with ADASS is briefly listed in Algorithm 2. First, we train some iterations with the full training set to get an initial parameter. Then, after every $k$ epochs, ADASS reselects training samples adaptively. Here, we adopt SGD and momentum SGD in Algorithm 2 for optimization. It is easy to see that when $c = 1$, ADASS is equal to vanilla SGD or momentum SGD with the full training set.
(14) 
4.1 Time Cost
In ADASS, we need to calculate the loss values for sample selection. In large-scale machine learning applications, this step brings extra computation cost which cannot be ignored. Here, we use deep neural network training on a GPU as an example for demonstration. Let

$t_l$: the time of loading one training sample;

$t_f$: the time of calculating the loss value of one training sample (forward);

$t_b$: the time of calculating the gradient of one training sample (backward).
When $c = 1$, ADASS is equal to SGD or momentum SGD, so we do not need to calculate loss values in Line 5 of Algorithm 2 for sample selection. When $c < 1$, we need to pass through the full training set once after every $k$ epochs to compute loss values, while each training epoch only passes through the selected samples. Comparing the total time costs of the two cases, it is easy to get that
When the sampling ratio is small enough, the time cost of ADASS is smaller than those of SGD and momentum SGD; in particular, when the backward computation dominates, ADASS can accelerate the training process. However, we do not recommend using a large selection period $k$, since the estimation error of the Lipschitz constant might become larger with larger $k$.
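The comparison can be made concrete with a back-of-the-envelope time model; the symbol names and constants below are illustrative reconstructions, not the paper's exact expressions.

```python
# Time model (illustrative reconstruction):
#   t_l: load one sample, t_f: forward pass, t_b: backward pass.
# Plain (momentum) SGD for E epochs over n samples:
#   T_sgd   = E * n * (t_l + t_f + t_b)
# ADASS re-selects every k epochs (one extra full forward pass over the
# data) and then trains on only a fraction r of the samples:
#   T_adass = (E / k) * n * (t_l + t_f) + E * r * n * (t_l + t_f + t_b)

def speedup(E, n, r, k, t_l, t_f, t_b):
    t_full = t_l + t_f + t_b
    t_sgd = E * n * t_full
    t_adass = (E / k) * n * (t_l + t_f) + E * r * n * t_full
    return t_sgd / t_adass

# Example: backward ~2x the forward cost, selection every k=2 epochs,
# half of the samples kept on average.
s = speedup(E=100, n=50_000, r=0.5, k=2, t_l=1.0, t_f=1.0, t_b=2.0)
# s = 4/3: ADASS is about 33% faster under these illustrative constants.
```

The model makes the trade-off explicit: larger $k$ amortizes the selection pass over more epochs, while smaller $r$ shrinks the dominant training term.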
Figure caption: Training logistic regression on mnist. The sampling ratio is defined as the fraction of selected training samples. For FS, each epoch contains a full pass of gradient computations; for ADASS and RS, each epoch contains gradient computations only for the selected samples.

5 Experiments
We conduct experiments to evaluate ADASS with Option II in Algorithm 2. All experiments are conducted on the PyTorch platform with a Titan XP GPU.
We compare ADASS with two baselines:

FS: Full training set, without discarding any samples. This corresponds to the vanilla stochastic methods.

RS: Random selection, in which a random subset of the same size as the one selected by ADASS is used in each epoch.

For ADASS, the first 10 epochs are trained (initialized) with FS. Hence, in all experiments, ADASS is the same as FS in the first 10 epochs. For all three methods, we use momentum SGD as the update rule.
5.1 Shallow Model (Convex)
The first experiment evaluates logistic regression (LR) with $\ell_2$-norm regularization, which is a shallow and convex model, on the mnist dataset (http://yann.lecun.com/exdb/mnist/). We fix the regularization coefficient, set the threshold $c$ for ADASS, and run the three methods for 100 epochs.
The result is shown in Figure 3. The left plot shows the average training loss on the full training set (i.e., the empirical risk). The middle plot shows the test accuracy. We can find that ADASS (red line) and FS (blue line) achieve almost the same accuracy, while RS achieves worse results on both training loss and test accuracy. This implies that the training samples selected by ADASS are effective. The right plot shows the sampling ratio of ADASS, defined as the fraction of selected samples, during the whole training process; ADASS only needs to select about half of the training set. This means that ADASS can accelerate the training process of momentum SGD without sacrificing accuracy.
5.2 Deep Model (Non-Convex)

The second experiment is on deep models, which are nonconvex. First, we evaluate ResNet20 on CIFAR-10. The parameters for batch size, learning rate, and momentum are the same as those in (He et al., 2016). We run the three methods for 160 epochs. The result is shown in Figure 4. We can find that ADASS (red line) and FS (blue line) achieve almost the same accuracy, while RS achieves worse accuracy than FS, especially in training loss. More detailed results on ResNet20 are listed in Table 1. We can find that ADASS with $c = 0.999$ achieves the best result on both training loss and test accuracy, while reducing about 30% of the time cost compared to FS (standard momentum SGD).
Table 1: Detailed results of ResNet20 on CIFAR-10.

method  c  time (second)  sampling ratio  training loss  test accuracy
FS  1  984.4  100%  0.016  91.4% 
RS    463.4  20.6%  0.09  90.5% 
RS    593.0  33.2%  0.059  90.8% 
ADASS  0.99  511.0  20.6%  0.019  90.8% 
ADASS  0.999  637.8  33.2%  0.016  91.4% 
We also conduct experiments on the large ImageNet dataset using two models: ResNet18 and ResNet50. The parameters for batch size, learning rate, and momentum are the same as those in (He et al., 2016). The result is shown in Figure 5. We can find that, with fewer training samples, ADASS achieves almost the same accuracy as FS. The experiments on deep models also show that ADASS can accelerate the training process of momentum SGD without sacrificing accuracy.


6 Conclusion
In this paper, we propose a new method, called ADASS, for training acceleration. In ADASS, the sample size in each epoch of training can be smaller than the size of the full training set, by adaptively discarding some samples. ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration. Empirical results show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.
References
 Allen-Zhu [2018] Zeyuan Allen-Zhu. Katyusha X: practical momentum method for stochastic sum-of-nonconvex optimization. In Proceedings of the 35th International Conference on Machine Learning, pages 179–185, 2018.
 Borsos et al. [2018] Zalan Borsos, Andreas Krause, and Kfir Y. Levy. Online variance reduction for stochastic optimization. In Conference On Learning Theory, pages 324–357, 2018.

 Bottou [2010] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics, 2010.

 Csiba et al. [2015] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. In Proceedings of the 32nd International Conference on Machine Learning, pages 674–683, 2015.

 Defazio et al. [2014] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 Duchi et al. [2010] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 257–269, 2010.
 Ghadimi and Lan [2016] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59–99, 2016.
 Hazan and Kale [2014] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

 Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

 Katharopoulos and Fleuret [2018] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, pages 2530–2539, 2018.

 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 Lan [2012] Guanghui Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397, 2012.
 Leen and Orr [1993] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In Advances in Neural Information Processing Systems, pages 477–484, 1993.
 Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In International Conference on Computer Vision, pages 2999–3007, 2017.
 Namkoong et al. [2017] Hongseok Namkoong, Aman Sinha, Steve Yadlowsky, and John C. Duchi. Adaptive sampling probabilities for nonsmooth optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 2574–2583, 2017.
 Nesterov [2007] Yu. Nesterov. Gradient methods for minimizing composite objective function, 2007.
 Nitanda [2014] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
 Schmidt et al. [2017] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.
 Shalev-Shwartz and Zhang [2013] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
 Shalev-Shwartz and Zhang [2014] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, pages 64–72, 2014.
 Shrivastava et al. [2016] Abhinav Shrivastava, Abhinav Gupta, and Ross B. Girshick. Training region-based object detectors with online hard example mining. In Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
 Tseng [1998] Paul Tseng. An incremental gradient(projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
 Xiao [2009] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124, 2009.
 Zhang [2004] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Machine Learning, Proceedings of the Twenty-first International Conference, 2004.
 Zhao and Zhang [2015] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1–9, 2015.