Accelerating Mini-batch SARAH by Step Size Rules

06/20/2019 · Zhuang Yang et al. · NetEase, Inc. · Sun Yat-sen University

The StochAstic Recursive grAdient algoritHm (SARAH), originally proposed for convex optimization and also proven effective for general nonconvex optimization, has received great attention due to its simple recursive framework for updating stochastic gradient estimates. The performance of SARAH depends significantly on the choice of the step size sequence. However, SARAH and its variants often employ a best-tuned step size chosen by hand, which is time consuming in practice. Motivated by this gap, we propose a variant of the Barzilai-Borwein (BB) method, referred to as the Random Barzilai-Borwein (RBB) method, to compute the step size for SARAH in the mini-batch setting, leading to a new SARAH method: MB-SARAH-RBB. We prove that MB-SARAH-RBB converges linearly in expectation for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that it improves on that of the original mini-batch SARAH. Numerical experiments on standard data sets indicate that MB-SARAH-RBB outperforms or matches state-of-the-art algorithms.

I Introduction

Stochastic gradient descent (SGD) type methods have become a core methodology for large scale problems in machine learning and related areas [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. The classical SGD method only requires a single random example per iteration to approximate the full gradient, which keeps the per-iteration computational cost low. While SGD makes rapid progress early on, its convergence rate is significantly deteriorated by the intrinsic variance of its stochastic gradient estimator. Even for strongly convex and smooth problems, SGD only converges sub-linearly [13].
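For concreteness, a minimal sketch of the classical SGD update on a finite-sum objective is given below; the function and variable names (grad_i, num_iters, etc.) are illustrative choices, not part of the original paper.

```python
import numpy as np

def sgd(grad_i, w0, n, step_size=0.01, num_iters=1000, seed=0):
    """Classical SGD: one randomly sampled component gradient per iteration.

    grad_i(w, i) should return the gradient of the i-th component f_i at w.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(n)               # a single random example per iteration
        w = w - step_size * grad_i(w, i)  # noisy step; the estimator's variance never vanishes
    return w
```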

Traditionally, there are three common ways to decrease the variance caused by the stochastic estimator. The first is to take a decreasing step size sequence [14, 15]. However, this further reduces the convergence rate. Moreover, it is known that the practical convergence of SGD is very sensitive to the choice of the step size sequence, which needs to be hand-picked. A second approach is to use a mini-batching technique [16, 17], which obviously requires more computation per iteration. The last is to use an importance sampling strategy [18, 19]. Although effective, this technique is not always practical, as the cost of computing the sampling distribution depends on the dimensionality of the model parameters [20]. In summary, none of these variance reduction techniques comes for free.

In recent years, a number of advanced stochastic variance-reduced algorithms have emerged, which exploit the specific finite-sum form of the objective function and combine deterministic and stochastic computations to reduce the variance of the steps. Popular examples are the stochastic average gradient (SAG) method [21], the SAGA method [22], the stochastic dual coordinate ascent (SDCA) method [23], the stochastic variance reduced gradient (SVRG) method [24], the accelerated mini-batch Prox-SVRG (Acc-Prox-SVRG) method [25], the mini-batch semi-stochastic gradient descent (mS2GD) method [17], the StochAstic Recursive grAdient algoritHm (SARAH) [26] and the Stochastic Path-Integrated Differential EstimatoR (SPIDER) method [27], all of which enjoy faster convergence rates than SGD. These methods typically work with a fixed step size, which is usually hand-tuned and therefore time consuming in practice.

More recently, SARAH, originally proposed for convex optimization, has gained tremendous popularity because it only requires a simple recursive framework for updating stochastic gradient estimates [28]. Moreover, SARAH has been proven to be effective for general nonconvex optimization [29, 30, 31, 32, 33, 34]. SARAH and SVRG [24] are structurally similar: both perform a deterministic outer-loop step, in which the full gradient of the objective function is computed, followed by a stochastic inner loop. The only difference between SVRG and SARAH is the iterative scheme used in the inner loop. In addition, SARAH is a recursive method like SAGA [22], but it does not need to store past gradients as SAGA does. Unlike SVRG and other methods (e.g., SAG, SDCA, mS2GD), SARAH does not use an estimator that is unbiased at every step; instead, it is unbiased over a long history of the method. A notable advantage of SARAH is that the iterative scheme of the inner loop itself converges sub-linearly [26].
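To make the inner-loop difference concrete, the standard single-sample estimators used by the two methods can be written as follows (with $\tilde{w}$ the outer-loop snapshot and $i_t$ sampled uniformly at random); this is the textbook form rather than a quotation from the original papers.

```latex
% SVRG inner-loop estimator: unbiased at every inner step.
v_t^{\mathrm{SVRG}} = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \nabla F(\tilde{w}).
% SARAH inner-loop estimator: recursive, built on the full gradient v_0 = \nabla F(w_0),
% biased at each inner step but unbiased over the whole history of the inner loop.
v_t^{\mathrm{SARAH}} = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}.
```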

Although Nguyen et al. [26] pointed out that SARAH admits a larger constant step size than SVRG, that step size is still hand-tuned, and the variants of SARAH likewise employ a constant step size [30, 34]. In addition, Pham et al. [33] proposed proximal SARAH (ProxSARAH) for stochastic composite nonconvex optimization and showed that ProxSARAH works with new constant and adaptive step sizes, where the constant step size is much larger than in existing methods, including proximal SVRG (ProxSVRG) schemes [35] in the single sample case, and the adaptive step sizes increase along the inner iterations rather than diminishing as in stochastic proximal gradient descent methods. However, computing the adaptive step size for ProxSARAH is complicated, and ProxSARAH needs to control two step size sequences, which makes it difficult to use in practice.

To deal with this demerit of SARAH, we propose using the random Barzilai-Borwein (RBB) method to automatically calculate the step size for the mini-batch version of SARAH (MB-SARAH), proposed by Nguyen et al. [29] for nonconvex optimization, thereby obtaining a new SARAH method named MB-SARAH-RBB. The RBB method, a variant of the Barzilai-Borwein (BB) method [36], was proposed by Yang et al. [37] to calculate step sizes for mini-batch algorithms. However, they only discussed the choice of step size for SVRG-type algorithms, i.e., mS2GD and Acc-Prox-SVRG.

The key contributions of this work are as follows:

  • 1) We propose to use the RBB method to compute the step size for MB-SARAH and obtain a new SARAH method named MB-SARAH-RBB. Unlike the work in [37], when using the RBB method to calculate the step size, we multiply it by a constant parameter, which is pivotal for ensuring the convergence of MB-SARAH-RBB.

  • 2) We prove the convergence of our MB-SARAH-RBB method and show that its complexity is better than SARAH in the mini-batch setting.

  • 3) We conduct experiments with MB-SARAH-RBB on regularized regression problems. Experimental results on three benchmark data sets show that the proposed method outperforms or matches state-of-the-art algorithms.

The rest of this paper is organized as follows. Section II discusses related works that are relevant to this paper. Section III presents problem formulation and background. Section IV proposes our MB-SARAH-RBB method. Section V presents the convergence analysis of MB-SARAH-RBB for strongly convex objective functions and discusses its complexity. Section VI conducts some empirical comparisons over some state-of-the-art approaches. Section VII concludes the paper.

Notations: Throughout this paper, we view vectors as columns and use $w^{\top}$ to denote the transpose of a vector $w$. We use $\|\cdot\|$ to denote the Euclidean vector norm, i.e., $\|w\| = \sqrt{w^{\top}w}$. We use $\mathbb{E}[\cdot]$ to denote the expectation of a random variable.

II Related Work

Early works that compute step sizes adaptively for SGD are based on (i) a function of the errors in the predictions or estimates, or (ii) a function of the gradient of the error measure. For example, Kesten [38] pointed out that when consecutive errors in the estimate of a parameter obtained by the Robbins-Monro procedure [39] are of opposite signs, the estimate is in the vicinity of the true value and the step size should accordingly be reduced. An alternative version of the gradient adaptive step size algorithm within a stochastic approximation formulation was presented by Benveniste et al. [40]. In addition, RMSprop, proposed by Tieleman et al. [41], adapts a step size per weight based on the observed sign changes in the gradients. For more related methods, we refer readers to [42, 43] and references therein.

Recently, due to its simplicity and numerical efficiency, many researchers have tried to incorporate the BB method and its variants into SGD. For instance, Sopyła et al. [44] presented several variants of the BB method for SGD to train linear SVMs. Tan et al. [45] used the BB method to calculate the step size for SGD and SVRG, putting forward two new approaches: SGD-BB and SVRG-BB. Moreover, they showed that SVRG-BB has linear convergence for strongly convex objective functions. To further accelerate SVRG-BB, mS2GD-BB, which incorporates the BB method into mS2GD (a variant of SVRG), was proposed by Yang et al. [46]; they showed that mS2GD-BB converges linearly in expectation for nonsmooth strongly convex objective functions. In addition, Yang et al. [47] introduced the BB method into accelerated stochastic gradient (ASGD) methods and obtained a series of new ASGD methods, provided a convergence analysis for them, and pointed out that their complexity matches that of the best known stochastic gradient methods. Further, considering a "big batch" strategy for SGD, De et al. [48] introduced backtracking line search and BB methods into SGD to calculate the step size, and observed that SGD with an adaptive BB-based step size outperforms backtracking line search on a range of convex problems. To obtain an online step size, Yang et al. [37] put forward the RBB method and incorporated it into mS2GD and Acc-Prox-SVRG, generating two new approaches: mS2GD-RBB and Acc-Prox-SVRG-RBB. To prevent the denominator from being close to zero when using the BB or RBB method, the stabilized Barzilai-Borwein (SBB) step size was proposed by Ma et al. [49], who introduced it into SVRG and obtained SVRG-SBB for the ordinal embedding problem; they also established a convergence rate for SVRG-SBB in terms of the total number of iterations.
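For reference, the deterministic two-point (BB) step size of [36] that these works adapt to the stochastic setting is

```latex
\eta_k^{\mathrm{BB}} \;=\; \frac{s_{k-1}^{\top} s_{k-1}}{s_{k-1}^{\top} y_{k-1}},
\qquad
s_{k-1} = w_k - w_{k-1},
\quad
y_{k-1} = \nabla F(w_k) - \nabla F(w_{k-1}),
```

which mimics the secant condition of quasi-Newton methods without storing a Hessian approximation.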

In addition to the above-mentioned methods, other strategies for choosing the step size have been used in SGD. For instance, two adaptive step size schemes, referred to as the recursive step size stochastic approximation (RSA) scheme and the cascading step size stochastic approximation (CSA) scheme, were put forward by Yousefian et al. [50], together with convergence analyses of the two iteration schemes for strongly convex differentiable stochastic optimization problems. In addition, Mahsereci et al. [51] suggested performing a line search on an estimated function, computed by a Gaussian process from random samples. An online step size can also be obtained by using hypergradient descent, as described in [52]. To greatly reduce the dependence of the algorithm on initial parameters when using hypergradients, Yang et al. [43] introduced the online step size (OSS) into the mini-batch nonconvex stochastic variance reduced gradient (MSVRG) method [53] and obtained the MSVRG-OSS method. Moreover, they showed that MSVRG-OSS converges linearly for strongly convex objective functions, and pointed out that MSVRG-OSS can also handle nonconvex problems. For other ways of choosing the step size for SGD, we refer readers to [54, 42, 55, 56] and references therein.

III Problem Formulation and Background

We focus on the following problem

$$\min_{w \in \mathbb{R}^d} \; F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $n$ is the sample size and each $f_i: \mathbb{R}^d \to \mathbb{R}$ is a cost function estimating how well the parameter $w$ fits the data of the $i$-th sample. Throughout this work, we assume that each $f_i$ has Lipschitz continuous derivatives. Also, we assume that each $f_i$ and $F$ are strongly convex.

Many problems in applications can be formulated as Problem (1). For example, setting $f_i(w) = \frac{1}{2}(x_i^{\top}w - y_i)^2 + \frac{\lambda}{2}\|w\|^2$, where $\lambda$ is a regularization parameter, makes Problem (1) a regularized least squares problem, while setting $f_i(w) = \log\big(1 + \exp(-y_i x_i^{\top}w)\big) + \frac{\lambda}{2}\|w\|^2$ makes it $\ell_2$-regularized logistic regression. Some other prevalent models, e.g., SVM [57], sparse dictionary learning [1], low-rank matrix completion [58] and deep learning [59], can also be written in the form of (1).
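As a concrete sketch of this finite-sum structure, the $\ell_2$-regularized logistic regression instance above can be coded as follows; the data $X$, labels $y \in \{-1,+1\}^n$ and the name lam for $\lambda$ are illustrative assumptions.

```python
import numpy as np

def make_logistic_finite_sum(X, y, lam):
    """Build f_i and grad f_i for the l2-regularized logistic loss so that
    F(w) = (1/n) * sum_i f_i(w) has exactly the form of Problem (1)."""
    def f_i(w, i):
        margin = y[i] * X[i].dot(w)
        return np.log1p(np.exp(-margin)) + 0.5 * lam * w.dot(w)

    def grad_f_i(w, i):
        margin = y[i] * X[i].dot(w)
        sigma = 1.0 / (1.0 + np.exp(margin))   # logistic weight on this example
        return -sigma * y[i] * X[i] + lam * w

    def full_grad(w):
        # deterministic gradient of F, used in the outer loop of variance-reduced methods
        return np.mean([grad_f_i(w, i) for i in range(X.shape[0])], axis=0)

    return f_i, grad_f_i, full_grad
```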

To proceed with the analysis of the proposed algorithm, we make the following common assumptions.

Assumption 1.

Each convex function $f_i$ in (1) is $L$-Lipschitz smooth, i.e., there exists $L > 0$ such that for all $w$ and $w'$ in $\mathbb{R}^d$,

$$\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|. \qquad (2)$$

Note that this assumption implies that the objective function $F$ is also $L$-Lipschitz smooth. Moreover, by the property of $L$-Lipschitz smooth functions (see [60]), we have

$$F(w') \le F(w) + \nabla F(w)^{\top}(w' - w) + \frac{L}{2}\|w' - w\|^2. \qquad (3)$$
Assumption 2.

$F$ is $\mu$-strongly convex, i.e., there exists $\mu > 0$ such that for all $w, w' \in \mathbb{R}^d$,

$$F(w) \ge F(w') + \nabla F(w')^{\top}(w - w') + \frac{\mu}{2}\|w - w'\|^2, \qquad (4)$$

or equivalently

$$\big(\nabla F(w) - \nabla F(w')\big)^{\top}(w - w') \ge \mu \|w - w'\|^2. \qquad (5)$$

When setting $w' = w^* = \arg\min_{w} F(w)$, it is known (see [15]) that the strong convexity of $F$ implies

$$2\mu\big[F(w) - F(w^*)\big] \le \|\nabla F(w)\|^2. \qquad (6)$$

In this paper, the complexity analysis aims to bound the number of iterations (or the total number of stochastic gradient evaluations) needed to guarantee $\mathbb{E}\big[\|\nabla F(w)\|^2\big] \le \epsilon$. In this case, we say that $w$ is an $\epsilon$-accurate solution.

IV The Algorithm

In the following, we begin with the introduction of the RBB step size, and then we put forward our MB-SARAH-RBB method, which incorporates the RBB step size into MB-SARAH.

IV-A Random Barzilai-Borwein Step Size

To solve Problem (1), Yang et al. [61] proposed to use the RBB method to calculate the step size for mS2GD, thereby obtaining mS2GD-RBB. In the inner loop of mS2GD-RBB, the solution sequence is updated as

$$w_{t+1} = w_t - \eta_t v_t, \qquad (7)$$

where $\eta_t$ is the step size, computed on a randomly chosen mini-batch $B_t$ as

$$\eta_t = \frac{\|w_t - w_{t-1}\|^2}{(w_t - w_{t-1})^{\top}\big(\nabla f_{B_t}(w_t) - \nabla f_{B_t}(w_{t-1})\big)}, \qquad (8)$$

and $v_t$ is the stochastic estimate of $\nabla F(w_t)$, defined as

$$v_t = \nabla f_{A_t}(w_t) - \nabla f_{A_t}(\tilde{w}) + \nabla F(\tilde{w}), \qquad (9)$$

where $\nabla f_{S}(w) = \frac{1}{|S|}\sum_{i \in S}\nabla f_i(w)$ for a subset $S \subseteq \{1, \dots, n\}$, $A_t$ has size $b_1$, $B_t$ has size $b_2$, and $\tilde{w}$ is a snapshot vector for which the gradient $\nabla F(\tilde{w})$ has already been calculated in the deterministic step.

Actually, the RBB method satisfies the so-called quasi-Newton property in the setting of stochastic optimization. Specifically, the RBB method can be viewed as a variant of stochastic quasi-Newton methods, in which (approximate) second-order information is exploited. In recent years, a growing body of work has shown that stochastic quasi-Newton methods iterate almost as fast as first-order stochastic gradient methods while needing fewer iterations to achieve the same accuracy [62, 63, 64, 65, 66].
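A minimal sketch of one RBB step-size evaluation on a random subset is given below; grad_batch, theta and eps are illustrative names, and the scaling by theta follows (11) in the next subsection (theta = 1 corresponds to (8)).

```python
import numpy as np

def rbb_step_size(grad_batch, w_curr, w_prev, batch_idx, theta=1.0, eps=1e-10):
    """Random Barzilai-Borwein step size evaluated on one random mini-batch.

    grad_batch(w, idx) should return the averaged gradient of the components in idx at w.
    """
    s = w_curr - w_prev                                                  # parameter difference
    yv = grad_batch(w_curr, batch_idx) - grad_batch(w_prev, batch_idx)   # gradient difference on the same batch
    denom = s.dot(yv)
    if abs(denom) < eps:       # guard against a near-zero curvature estimate
        return None            # the caller may fall back to the previous step size
    return theta * s.dot(s) / denom
```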

IV-B The Proposed Method

The MB-SARAH method, proposed by Nguyen et al. [29], can be viewed as a variant of mS2GD. The pivotal difference between mS2GD and MB-SARAH is that the latter uses a new kind of stochastic estimate of $\nabla F(w_t)$, i.e.,

$$v_t = \nabla f_{A_t}(w_t) - \nabla f_{A_t}(w_{t-1}) + v_{t-1}. \qquad (10)$$

For comparison, the stochastic estimate of mS2GD-RBB is written as in (9). Note that for mS2GD-RBB, $v_t$ is an unbiased estimator of the gradient, i.e., from (9) we have $\mathbb{E}[v_t \mid w_t] = \nabla F(w_t)$; this is not true for MB-SARAH.

We introduce the RBB method into MB-SARAH and obtain a new SARAH method referred to as MB-SARAH-RBB. Different from mS2GD-RBB, when computing the random step size in MB-SARAH, we multiply (8) by a parameter $\theta > 0$, i.e.,

$$\eta_t = \theta \, \frac{\|w_t - w_{t-1}\|^2}{(w_t - w_{t-1})^{\top}\big(\nabla f_{B_t}(w_t) - \nabla f_{B_t}(w_{t-1})\big)}, \qquad (11)$$

where the parameter $\theta$ is important for controlling the convergence of MB-SARAH-RBB.

Now we are ready to describe our MB-SARAH-RBB method (Algorithm 1).

Parameters: update frequency $m$, mini-batch sizes $b_1$ and $b_2$, initial point $\tilde{w}_0$, initial step size $\eta_0$, a positive constant $\theta$.
for $k = 1, 2, \dots$ do
     $w_0 = \tilde{w}_{k-1}$, $v_0 = \nabla F(w_0)$, $w_1 = w_0 - \eta_0 v_0$
     for $t = 1$ to $m-1$ do
          Randomly pick a subset $A_t$ of size $b_1$,
          $v_t = \nabla f_{A_t}(w_t) - \nabla f_{A_t}(w_{t-1}) + v_{t-1}$
          Randomly pick a subset $B_t$ of size $b_2$, compute the RBB step size $\eta_t$ by (11)
          $w_{t+1} = w_t - \eta_t v_t$
     end for
     $\tilde{w}_k = w_m$
end for
Algorithm 1 MB-SARAH-RBB

Remark: At the beginning of MB-SARAH-RBB, an initial step size $\eta_0$ needs to be specified. However, we observed from the numerical experiments that the performance of MB-SARAH-RBB is not sensitive to the choice of $\eta_0$. It can also be seen from Algorithm 1 that, if the step size is kept fixed at $\eta_0$ instead of being updated by (11), MB-SARAH-RBB reduces to the original MB-SARAH method.
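The following Python sketch assembles Algorithm 1 from the pieces above; it is an illustrative implementation under the conventions of the earlier snippets, not the authors' reference code, and the default parameter values are arbitrary.

```python
import numpy as np

def mb_sarah_rbb(grad_batch, full_grad, w0, n, m=50, b1=16, b2=16,
                 eta0=0.1, theta=0.4, num_outer=20, seed=0):
    """Sketch of MB-SARAH-RBB: the recursive SARAH estimator (10) driven by the RBB step size (11)."""
    rng = np.random.default_rng(seed)
    w_tilde = w0.copy()
    for _ in range(num_outer):
        w_prev = w_tilde.copy()
        v = full_grad(w_prev)                      # full gradient at the snapshot (deterministic outer step)
        eta = eta0                                 # initial step size, used before the first RBB update
        w = w_prev - eta * v
        for _ in range(1, m):
            A = rng.choice(n, size=b1, replace=False)            # mini-batch for the gradient estimate
            v = grad_batch(w, A) - grad_batch(w_prev, A) + v      # recursive estimator (10)
            B = rng.choice(n, size=b2, replace=False)            # independent mini-batch for the step size
            s = w - w_prev
            y = grad_batch(w, B) - grad_batch(w_prev, B)
            if abs(s.dot(y)) > 1e-10:
                eta = theta * s.dot(s) / s.dot(y)                 # RBB step size (11)
            w_prev, w = w, w - eta * v
        w_tilde = w                               # take the last inner iterate as the next snapshot
    return w_tilde
```

Here grad_batch(w, idx) is assumed to average the component gradients over the index set idx, e.g., np.mean([grad_f_i(w, i) for i in idx], axis=0) with grad_f_i from the earlier sketch.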

V Convergence Analysis

In this section, we establish the convergence of MB-SARAH-RBB and discuss its complexity. We first provide the following lemmas.

Lemma 1.

Under Assumption 1, consider MB-SARAH-RBB within one single outer loop in Algorithm 1, then we obtain

Proof.

Available in Appendix A-A

With minor modification of Lemma 3 in [29], we obtain the following lemma showing the upper bound for .

Lemma 2.

Under Assumption 1, consider $v_t$ defined by (10) in MB-SARAH-RBB; then, for any $t \ge 1$,

Using the above lemmas, we obtain the following convergence rate for MB-SARAH-RBB with one outer loop.

Theorem 1.

Under Assumptions 1 and 2 and Lemmas 1 and 2, let $\eta_t$ be given by (11) and choose $A_t$ with size $b_1$ and $B_t$ with size $b_2$ at random, respectively. Consider MB-SARAH-RBB (within one outer loop in Algorithm 1) with

(12)

then we have

Proof.

Available in Appendix A-B

This result shows that the inner loop of MB-SARAH-RBB with a single outer loop converges sublinearly. To reach the desired accuracy, it suffices to choose the inner-loop length $m$ sufficiently large; the total complexity to achieve an $\epsilon$-accurate solution then follows, and we obtain the following complexity bound.

Corollary 1.

Under Assumption 1, consider MB-SARAH-RBB with one outer loop, then has sublinear convergence in expectation with a rate of , and the total complexity to achieve an -accurate solution is .

Compared with Corollary 1 in [29], the complexity of our MB-SARAH-RBB method is better than that of MB-SARAH when an appropriate mini-batch size $b_1$ is chosen. Note that this comparison depends on the choice of the parameter $\theta$ in MB-SARAH-RBB.

We now turn to the convergence of MB-SARAH-RBB with multiple outer loops.

Theorem 2.

Under Assumptions 1 and 2 and Lemmas 1 and 2, let $\eta_t$ be given by (11) and choose $A_t$ with size $b_1$ and $B_t$ with size $b_2$ at random, respectively. Consider MB-SARAH-RBB with

then we have

where .

Proof.

Available in Appendix A-C

To obtain , it is sufficient to set . Therefore, we have the following conclusion for the total complexity of the proposed method.

Corollary 2.

Suppose Assumption 1 holds; then the total complexity of MB-SARAH-RBB to achieve an $\epsilon$-accurate solution is .

Compared with the complexity of MB-SARAH, Corollary 2 indicates that MB-SARAH-RBB has lower complexity when an appropriate mini-batch size $b_1$ is chosen.

VI Experiments

In this section, the effectiveness of our MB-SARAH-RBB method is verified experimentally. In particular, our experiments were performed on the standard problem of training ridge regression, i.e.,

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{2}(x_i^{\top}w - y_i)^2 + \frac{\lambda}{2}\|w\|^2\Big), \qquad (13)$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ is a collection of training examples and $\lambda > 0$ is a regularization parameter.

We tested our MB-SARAH-RBB method on three publicly available data sets (a8a, w8a and ijcnn1), which can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Detailed information about the data sets is listed in Table I.


Dataset    Training size    Features
a8a        22,696           123
w8a        49,749           300
ijcnn1     49,990           22

TABLE I: Data information of the experiments
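These data sets are distributed in LIBSVM format; one convenient (though not paper-specified) way to load them in Python is via scikit-learn, as in the sketch below, where the local file name is illustrative.

```python
from sklearn.datasets import load_svmlight_file

# a8a, w8a and ijcnn1 are available in LIBSVM format at the URL above.
X, y = load_svmlight_file("a8a")   # X: sparse 22,696 x 123 feature matrix, y: labels in {-1, +1}
X = X.toarray()                    # dense features keep the earlier NumPy sketches simple
```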

VI-A Properties of MB-SARAH-RBB

In this subsection, we examine the properties of MB-SARAH-RBB on the data sets listed in Table I. To show these properties clearly, we compare MB-SARAH-RBB with MB-SARAH using the best-tuned step size. For ease of analysis, MB-SARAH-RBB uses the same mini-batch size $b_1$ as MB-SARAH to compute the solution sequence on the different data sets, so that we can isolate the effect of the mini-batch size $b_2$ used to obtain the step size sequence.

In addition, for MB-SARAH-RBB, we chose the parameter $\theta$ as 0.1 when the step-size batch size $b_2$ is small; otherwise, we take a slightly larger value. Moreover, the initial step size $\eta_0$ was fixed for MB-SARAH-RBB.

Figs. 1, 2 and 3 compare MB-SARAH-RBB with MB-SARAH. In all sub-figures, the horizontal axis represents the number of effective passes over the data, where each effective pass evaluates $n$ component gradients. The vertical axis is the sub-optimality $F(w) - F(w^*)$, where $F(w^*)$ is obtained by running MB-SARAH with the best-tuned step size. The dashed lines represent MB-SARAH with different fixed step sizes and the solid lines represent MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$. The detailed parameter settings are given in the legends of the sub-figures.

Fig. 1: Comparison of MB-SARAH-RBB and MB-SARAH with fixed step sizes on a8a. The dashed lines stand for MB-SARAH with different fixed step sizes. The solid lines correspond to MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$.
Fig. 2: Comparison of MB-SARAH-RBB and MB-SARAH with fixed step sizes on w8a. The dashed lines stand for MB-SARAH with different fixed step sizes. The solid lines correspond to MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$.
Fig. 3: Comparison of MB-SARAH-RBB and MB-SARAH with fixed step sizes on ijcnn1. The dashed lines stand for MB-SARAH with different fixed step sizes. The solid lines correspond to MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$.

Figs. 1, 2 and 3 show that MB-SARAH-RBB is comparable to or better than MB-SARAH with the best-tuned step size. They also indicate that, for a fixed mini-batch size $b_1$, there is no need to use a large mini-batch size $b_2$ to obtain the step size sequence; however, a very small $b_2$ can make MB-SARAH-RBB diverge.

In the remark after Algorithm 1, we pointed out that MB-SARAH-RBB is not sensitive to the choice of the initial step size $\eta_0$. To illustrate this, we ran MB-SARAH-RBB with three different initial step sizes on two of the data sets, with the remaining parameters fixed for each data set; the results are presented in Fig. 4.

Fig. 4: Different initial step sizes for MB-SARAH-RBB on two data sets (left and right).

It can be seen from Fig. 4 that the performance of MB-SARAH-RBB is not influenced by the choice of $\eta_0$.

VI-B Comparison with mS2GD-RBB

mS2GD-RBB, proposed by Yang et al. [61], uses a similar strategy to ours to compute the step size for mS2GD. One of the key differences between mS2GD-RBB and MB-SARAH-RBB is that the latter multiplies (8) by a positive constant $\theta$, as in (11). To further show the efficacy of our MB-SARAH-RBB method, we compare the two methods. All parameters of mS2GD-RBB are set as suggested in [61]. The dashed lines represent mS2GD-RBB and the solid lines represent MB-SARAH-RBB.

Fig. 5: Comparison of MB-SARAH-RBB and mS2GD-RBB on a8a. The dashed lines stand for mS2GD-RBB with different mini-batch sizes $b_1$ and $b_2$. The solid lines correspond to MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$.
Fig. 6: Comparison of MB-SARAH-RBB and mS2GD-RBB on ijcnn1. The dashed lines stand for mS2GD-RBB with different mini-batch sizes $b_1$ and $b_2$. The solid lines correspond to MB-SARAH-RBB with different mini-batch sizes $b_1$ and $b_2$.

Figs. 5 and 6 show that our MB-SARAH-RBB method performs better than or is comparable to mS2GD-RBB. This also indicates that the performance of the original MB-SARAH method can be improved by introducing the modified RBB step size.

VI-C Comparison with Other Related Methods

In this section, we compare our MB-SARAH-RBB method with the following methods:
1) SAG-LS: the stochastic average gradient method with line search [67].
2) SAG-BB: the stochastic average gradient method with BB step size [45].
3) SVRG: the stochastic variance reduced gradient method [24]; the best constant step size was employed.
4) SVRG-BB: the stochastic variance reduced gradient method with BB step size [45].
5) mS2GD-BB: a mini-batch version of SVRG-BB proposed in [46]; all parameters were set as suggested in [46].
6) SDCA: the stochastic dual coordinate ascent method [23]; the parameters were chosen as suggested in [23] and the best constant step size was employed.
7) Acc-Prox-SVRG: an accelerated stochastic gradient method from [25]; the parameters were chosen as suggested in [25] and the best constant step size was employed.
8) Acc-Prox-SVRG-BB: a variant of Acc-Prox-SVRG with the BB step size from [47]; the parameters were set as suggested in [47].
9) Acc-Prox-SVRG-RBB: a variant of Acc-Prox-SVRG with the RBB step size from [61]; the best mini-batch sizes $b_1$ and $b_2$ were set for the different data sets.
10) MSVRG-OSS: the MSVRG method with an online step size [43]; the parameters were set as suggested in [43].

Fig. 7: Comparison of different methods on the three data sets listed in Table I (left, middle, right).

As can be seen from Fig. 7, our MB-SARAH-RBB method outperforms or matches state-of-the-art algorithms.

VII Conclusion

This paper is motivated by a shortcoming of SARAH concerning the choice of step size: common implementations provide little guidance for specifying the step size parameters that prove crucial for practical performance. Accordingly, we propose using the RBB method to automatically compute the step size for MB-SARAH, obtaining MB-SARAH-RBB. We prove that our MB-SARAH-RBB method converges at a linear rate for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that the complexity of the original MB-SARAH method is improved by incorporating the RBB method. Numerical results show that our MB-SARAH-RBB method outperforms or matches state-of-the-art algorithms.

Appendix A Proofs

A-A Proof of Lemma 1

According to (3) and the update rule $w_{t+1} = w_t - \eta_t v_t$, we have

Employing the strong convexity of each $f_i$, we obtain the following upper bound for the RBB step size in Algorithm 1.
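Indeed, with the step size given by (11), since each $f_i$ is $\mu$-strongly convex, so is the mini-batch average $f_{B_t}$; writing $s_t = w_t - w_{t-1}$ and $y_t = \nabla f_{B_t}(w_t) - \nabla f_{B_t}(w_{t-1})$, this bounds the BB quotient:

```latex
s_t^{\top} y_t \;\ge\; \mu \|s_t\|^2
\quad\Longrightarrow\quad
\eta_t \;=\; \theta\,\frac{\|s_t\|^2}{s_t^{\top} y_t} \;\le\; \frac{\theta}{\mu}.
```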

Therefore, we ascertain that

where the last equality is according to the fact that .

By summing over , we have

Further, we have

where the last inequality follows since .

A-B Proof of Theorem 1

From Lemma 2, we have

Summing over $t$, we obtain

Further, we have

(14)

Therefore, by Lemma 1, we have

By the definition of the iterates in Algorithm 1, we have that

A-C Proof of Theorem 2

Note that and , . From Theorem 1, we obtain

Hence, taking expectation, we obtain

References

  • [1] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th International Conference on Machine Learning, ACM, 2009, pp. 689–696.
  • [2] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning, ACM, 2004, p. 116.
  • [3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE transactions on pattern analysis and machine intelligence 32 (9) (2009) 1627–1645.
  • [4] V. Blanz, T. Vetter, Face recognition based on fitting a 3d morphable model, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9) (2003) 1063–1074.

  • [5] M. Hoffman, F. R. Bach, D. M. Blei, Online learning for latent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864.
  • [6] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
  • [7] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, T. Huang, Large-scale image classification: fast feature extraction and svm training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696.

  • [8] Z. Zhang, P. Luo, C. C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in: European Conference on Computer Vision, Springer, 2014, pp. 94–108.

  • [9] Q. Tao, Q.-K. Gao, D.-J. Chu, G.-W. Wu, Stochastic learning via optimizing the variational inequalities, IEEE Transactions on Neural Networks and Learning Systems 25 (10) (2014) 1769–1778.

  • [10] M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, P. M. Djuric, Adaptive importance sampling: the past, the present, and the future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79.
  • [11] C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Networks and Learning Systems 29 (7) (2017) 3084–3096.
  • [12] X.-L. Li, Preconditioned stochastic gradient descent, IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1466.
  • [13] E. Moulines, F. R. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459.

  • [14] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on optimization 19 (4) (2009) 1574–1609.
  • [15] L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311.
  • [16] A. Cotter, O. Shamir, N. Srebro, K. Sridharan, Better mini-batch algorithms via accelerated gradient methods, in: Advances in Neural Information Processing Systems, 2011, pp. 1647–1655.
  • [17] J. Konečnỳ, J. Liu, P. Richtárik, M. Takáč, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Selected Topics in Signal Processing 10 (2) (2016) 242–255.
  • [18] D. Needell, R. Ward, N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm, in: Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
  • [19] D. Csiba, P. Richtárik, Importance sampling for minibatches, The Journal of Machine Learning Research 19 (1) (2018) 962–982.
  • [20] T. Fu, Z. Zhang, Cpsg-mcmc: Clustering-based preprocessing method for stochastic gradient mcmc, in: Artificial Intelligence and Statistics, 2017, pp. 841–850.

  • [21] N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in: Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
  • [22] A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in: Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
  • [23] S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research 14 (Feb) (2013) 567–599.
  • [24] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323.
  • [25] A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582.
  • [26] L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Sarah: A novel method for machine learning problems using stochastic recursive gradient, in: International Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 2613–2621.
  • [27] C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, in: Advances in Neural Information Processing Systems, 2018, pp. 689–699.
  • [28] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, J. R. Kalagnanam, Finite-sum smooth optimization with SARAH, 2019.
  • [29] L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Stochastic recursive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261.
  • [30] L. M. Nguyen, K. Scheinberg, M. Takáč, Inexact sarah algorithm for stochastic optimization, arXiv preprint arXiv:1811.10105.
  • [31] S. Horváth, P. Richtárik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146.
  • [32] D. Zhou, Q. Gu, Stochastic recursive variance-reduced cubic regularization methods, arXiv preprint arXiv:1901.11518.
  • [33] N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization, arXiv preprint arXiv:1902.05679.
  • [34] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, J. R. Kalagnanam, Optimal finite-sum smooth non-convex optimization with sarah, arXiv preprint arXiv:1901.07648.
  • [35] L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM Journal on Optimization 24 (4) (2014) 2057–2075.
  • [36] J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148.
  • [37] Z. Yang, C. Wang, Z. Zhang, J. Li, Random Barzilai-Borwein step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124 – 135.
  • [38] H. Kesten, Accelerated stochastic approximation, Annals of Mathematical Statistics 29 (1) (1958) 41–59.
  • [39] H. Robbins, S. Monro, A stochastic approximation method, Annals of Mathematical Statistics (1951) 400–407.
  • [40] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer Berlin Heidelberg, 1990.
  • [41] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning.
  • [42] A. P. George, W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning 65 (1) (2006) 167–198.
  • [43] Z. Yang, C. Wang, Z. Zhang, J. Li, Mini-batch algorithms with online step size, Knowledge-Based Systems 165 (2019) 228–240.
  • [44] K. Sopyła, P. Drozda, Stochastic gradient descent with barzilai–borwein update step for svm, Information Sciences 316 (2015) 218–233.
  • [45] C. Tan, S. Ma, Y. H. Dai, Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2016, pp. 685–693.
  • [46] Z. Yang, C. Wang, Y. Zang, J. Li, Mini-batch algorithms with Barzilai–Borwein update step, Neurocomputing 314 (2018) 177–185.
  • [47] Z. Yang, C. Wang, Z. Zhang, J. Li, Accelerated stochastic gradient descent with step size selection rules, Signal Processing 159 (2019) 171–186.
  • [48] S. De, A. Yadav, D. Jacobs, T. Goldstein, Automated inference with adaptive batches, in: International Conference on Artificial Intelligence and Statistics, 2017.
  • [49] K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y. Yao, Stochastic Non-convex Ordinal Embedding with Stabilized Barzilai-Borwein step size, in: AAAI Conference on Artificial Intelligence, 2018.
  • [50] F. Yousefian, A. Nedić, U. V. Shanbhag, On stochastic gradient and subgradient methods with adaptive steplength sequences, Automatica 48 (1) (2012) 56–67.
  • [51] M. Mahsereci, P. Hennig, Probabilistic line searches for stochastic optimization, Journal of Machine Learning Research 18 (119) (2017) 1–59.
  • [52] A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient descent, in: International Conference on Learning Representations, 2018.
  • [53] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in: International Conference on Machine Learning, 2016, pp. 314–323.
  • [54] L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Parameter adaptation in stochastic optimization, in: On-line learning in neural networks, Cambridge University Press, 1999, pp. 111–134.
  • [55] T. Schaul, S. Zhang, Y. Lecun, No more pesky learning rates, in: International Conference on Machine Learning, 2013, pp. 343–351.
  • [56] T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning, Technical Report.
  • [57] Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale svm training, Journal of Machine Learning Research 13 (Oct) (2012) 3103–3131.
  • [58] S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in: Advances in Neural Information Processing Systems, 2016, pp. 3873–3881.
  • [59] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: International Conference on Machine Learning, 2013, pp. 1139–1147.
  • [60] Y. Nesterov, Introductory lectures on convex optimization : basic course, Kluwer Academic, 2004.
  • [61] Z. Yang, C. Wang, Z. Zhang, J. Li, Random Barzilai–Borwein step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124–135.
  • [62] A. Bordes, L. Bottou, P. Gallinari, SGD-QN: Careful quasi-newton stochastic gradient descent, Journal of Machine Learning Research 10 (Jul) (2009) 1737–1754.
  • [63] R. H. Byrd, G. M. Chin, J. Nocedal, Y. Wu, Sample size selection in optimization methods for machine learning, Mathematical programming 134 (1) (2012) 127–155.
  • [64] R. H. Byrd, S. L. Hansen, J. Nocedal, Y. Singer, A stochastic quasi-newton method for large-scale optimization, SIAM Journal on Optimization 26 (2) (2016) 1008–1031.
  • [65] N. Agarwal, B. Bullins, E. Hazan, Second-order stochastic optimization for machine learning in linear time, The Journal of Machine Learning Research 18 (1) (2017) 4148–4187.
  • [66] N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in: Advances in Neural Information Processing Systems, 2018, pp. 2899–2908.
  • [67] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random fields., in: AISTATS, 2015.