I Introduction
Stochastic gradient descent (SGD) type methods have been a core methodology in applications to large-scale problems in machine learning and related areas [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. The classical SGD method only requires a single random example per iteration to approximate the full gradient. This strategy usually gives SGD a low computational cost per iteration. While SGD makes rapid progress early on, its convergence rate is significantly degraded by the intrinsic variance of its stochastic estimator. Even for strongly convex and smooth problems, SGD only converges sublinearly [13].
Traditionally, there are three common ways to decrease the variance caused by the stochastic estimate. The first is to take a decreasing step-size sequence [14, 15]. However, this further reduces the convergence rate. Moreover, it is known that the practical convergence of SGD is very sensitive to the choice of the step-size sequence, which needs to be hand-picked. A second approach is to use a mini-batching technique [16, 17]. Obviously, this requires more computation. The last method is to use an importance sampling strategy [18, 19]. Although effective, this technique is not always practical, as the computation of the sampling mechanism depends on the dimensionality of the model parameters [20]. In summary, none of these variance reduction techniques comes for free.
In recent years, advanced stochastic variance-reduced algorithms have emerged, which exploit the specific form of the objective function and combine deterministic and stochastic steps to reduce the variance. Popular examples of these methods are the stochastic average gradient (SAG) method [21], the SAGA method [22], the stochastic dual coordinate ascent (SDCA) method [23], the stochastic variance reduced gradient (SVRG) method [24], the accelerated mini-batch proximal SVRG (Acc-Prox-SVRG) method [25], the mini-batch semi-stochastic gradient descent (mS2GD) method [17], the StochAstic Recursive grAdient algoritHm (SARAH) [26] and the Stochastic Path-Integrated Differential EstimatoR (SPIDER) method [27], all of which have faster convergence rates than SGD. Specifically, these methods work with a fixed step size. However, the step size is typically hand-tuned, which is time-consuming in practice.
More recently, SARAH, originally proposed for convex optimization, has gained tremendous popularity because it only requires a simple framework for updating stochastic gradient estimates [28]. Moreover, SARAH has been proven to be effective for general nonconvex optimization [29, 30, 31, 32, 33, 34]. SARAH and SVRG [24] are two similar methods: both perform a deterministic step, often called the outer loop, in which the full gradient of the objective function is calculated, followed by an inner loop of stochastic steps. The only difference between SVRG and SARAH is how the iterative scheme of the inner loop is performed. In addition, SARAH is a recursive method like SAGA [22], but it does not store past gradients as SAGA does. In particular, unlike SVRG and other methods (e.g., SAG, SDCA, mS2GD), SARAH does not use an estimator that is unbiased at each step; instead, it is unbiased over a long history of the method. A further advantage of SARAH is that the iterative scheme of the inner loop itself converges sublinearly [26].
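To make the contrast concrete, the following is a minimal sketch of the two inner-loop estimators, assuming a helper grad_fi(w, i) that returns the i-th component gradient; the function and variable names are illustrative, not taken from the cited papers.

```python
# Minimal sketch contrasting the SVRG and SARAH inner-loop gradient estimators.
# grad_fi(w, i) is assumed to return the i-th component gradient; w_snapshot and
# g_snapshot denote the outer-loop snapshot point and its full gradient.

def svrg_estimator(grad_fi, w_t, w_snapshot, g_snapshot, i):
    """SVRG: v_t = grad f_i(w_t) - grad f_i(w~) + grad P(w~); unbiased at every step."""
    return grad_fi(w_t, i) - grad_fi(w_snapshot, i) + g_snapshot

def sarah_estimator(grad_fi, w_t, w_prev, v_prev, i):
    """SARAH: v_t = grad f_i(w_t) - grad f_i(w_{t-1}) + v_{t-1}; recursive, biased per step."""
    return grad_fi(w_t, i) - grad_fi(w_prev, i) + v_prev
```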
Although Nguyen et al. [26] pointed out that SARAH can use a larger constant step size than SVRG, the step size is still hand-tuned, and the variants of SARAH also employ a constant step size [30, 34]. In addition, Pham et al. [33] proposed proximal SARAH (ProxSARAH) for stochastic composite nonconvex optimization and showed that ProxSARAH works with new constant and adaptive step sizes, where the constant step size is much larger than in existing methods, including proximal SVRG (ProxSVRG) schemes [35] in the single-sample case, and the adaptive step sizes increase along the inner iterations rather than diminishing as in stochastic proximal gradient descent methods. However, the adaptive step sizes of ProxSARAH are complicated to compute. In particular, ProxSARAH needs to control two step-size sequences, which makes it difficult to use in practice.
To address this shortcoming of SARAH, we propose using the random Barzilai-Borwein (RBB) method to automatically calculate the step size for the mini-batch version of SARAH (MB-SARAH), proposed by Nguyen et al. [29] for nonconvex optimization, thereby obtaining a new SARAH method named MB-SARAH-RBB. The RBB method, a variant of the Barzilai-Borwein (BB) method [36], was proposed by Yang et al. [37] to calculate step sizes for mini-batch algorithms. However, they only discussed the choice of step size for SVRG-type algorithms, i.e., mS2GD and Acc-Prox-SVRG.
The key contributions of this work are as follows:

1) We propose to use the RBB method to compute the step size for MB-SARAH and obtain a new SARAH method named MB-SARAH-RBB. Unlike the work in [37], when using the RBB method to calculate the step size, we multiply it by a constant parameter, which is pivotal to ensuring the convergence of MB-SARAH-RBB.

2) We prove the convergence of our MB-SARAH-RBB method and show that its complexity is better than that of SARAH in the mini-batch setting.

3) We conduct experiments with MB-SARAH-RBB on solving the logistic regression problem. Experimental results on three benchmark data sets show that the proposed method outperforms or matches state-of-the-art algorithms.
The rest of this paper is organized as follows. Section II discusses related work relevant to this paper. Section III presents the problem formulation and background. Section IV proposes our MB-SARAH-RBB method. Section V presents the convergence analysis of MB-SARAH-RBB for strongly convex objective functions and discusses its complexity. Section VI presents empirical comparisons with state-of-the-art approaches. Section VII concludes the paper.
Notations: Throughout this paper, we view vectors as columns and use $w^{\top}$ to denote the transpose of a vector $w$. We use the symbol $\|\cdot\|$ to denote the Euclidean vector norm, i.e., $\|w\| = \sqrt{w^{\top}w}$. We use $\mathbb{E}[\cdot]$ to denote the expectation of a random variable.

II Related Work
Early works that compute step sizes adaptively for SGD are based on (i) a function of the errors in the predictions or estimates, or (ii) a function of the gradient of the error measure. For example, Kesten [38] pointed out that when consecutive errors in the estimate of the value of a parameter obtained by the Robbins-Monro procedure [39] are of opposite signs, the estimate is in the vicinity of the true value and accordingly the step size ought to be reduced. Further, an alternative version of the gradient adaptive step-size algorithm within a stochastic approximation formulation was presented by Benveniste et al. [40]. In addition, RMSProp, propounded by Tieleman et al. [41], adapts a step size per weight based on the observed sign changes in the gradients. For more related methods, we refer readers to [42, 43] and references therein.

Recently, due to its simplicity and numerical efficiency, many researchers have tried to incorporate the BB method and its variants into SGD. For instance, Sopyła et al. [44] presented several variants of the BB method for SGD to train the linear SVM. Tan et al. [45] used the BB method to calculate the step size for SGD and SVRG, thereby putting forward two new approaches: SGD-BB and SVRG-BB. Moreover, they showed that SVRG-BB has linear convergence for strongly convex objective functions. To further accelerate the convergence of SVRG-BB, mS2GD-BB, which incorporates the BB method into mS2GD (a variant of SVRG), was proposed by Yang et al. [46]. They showed that mS2GD-BB has linear convergence in expectation for nonsmooth strongly convex objective functions. In addition, Yang et al. [47] introduced the BB method into accelerated stochastic gradient (ASGD) methods and obtained a series of new ASGD methods. Moreover, they provided convergence analyses for their proposed methods and pointed out that the complexity of these methods reaches the same level as the best known stochastic gradient methods. Further, when considering a "big batch" for SGD, De et al. [48] introduced the backtracking line search and BB methods into SGD to calculate the step size. Moreover, they pointed out that SGD using an adaptive step size based on the BB method performs better than SGD using backtracking line search on a range of convex problems. To obtain an online step size, Yang et al. [37] put forward the RBB method and incorporated it into mS2GD and Acc-Prox-SVRG, generating two new approaches: mS2GD-RBB and Acc-Prox-SVRG-RBB. To prevent the denominator from being close to zero when using the BB or RBB methods, the stabilized Barzilai-Borwein (SBB) step size was proposed by Ma et al. [49]. In particular, they introduced it into SVRG and obtained SVRG-SBB for dealing with the ordinal embedding problem. Moreover, they showed that the SVRG-SBB method converges at a guaranteed rate in terms of the total number of iterations.
In addition to the above-mentioned methods, other strategies for choosing the step size have been used in SGD. For instance, two adaptive step-size schemes, referred to as a recursive step size stochastic approximation (RSA) scheme and a cascading step size stochastic approximation (CSA) scheme, were put forward by Yousefian et al. [50], who also provided convergence analyses of the two new iteration schemes for strongly convex differentiable stochastic optimization problems. In addition, Mahsereci et al. [51] suggested performing line search on an estimated function, which is computed by a Gaussian process from random samples. An online step size can also be obtained by using hypergradient descent, as described in [52]. To greatly reduce the dependence of the algorithm on initial parameters when using hypergradients, Yang et al. [43] introduced the online step size (OSS) into the mini-batch nonconvex stochastic variance reduced gradient (MSVRG) method [53] and obtained the MSVRG-OSS method. Moreover, they showed that MSVRG-OSS converges linearly for strongly convex objective functions. In particular, they pointed out that the MSVRG-OSS method can also be used to deal with nonconvex problems. For other ways of choosing the step size for SGD, we refer readers to [54, 42, 55, 56] and references therein.
III Problem formulation and background
We focus on the following problem:

$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $n$ is the sample size and each $f_i(w)$ is a cost function estimating how well the parameter $w$ fits the data of the $i$-th sample. Throughout this work, we assume that each $f_i$ has Lipschitz continuous derivatives. Also, we assume that each $f_i$ and $P$ are strongly convex.
Many problems in applications are often formulated as Problem (1). For example, setting $f_i(w) = (x_i^{\top}w - y_i)^2 + \frac{\lambda}{2}\|w\|^2$, where $\lambda > 0$ is a regularization parameter, Problem (1) becomes (regularized) least squares, whereas setting $f_i(w) = \log\big(1 + \exp(-y_i x_i^{\top}w)\big) + \frac{\lambda}{2}\|w\|^2$, Problem (1) becomes ($\ell_2$-regularized) logistic regression. Some other prevalent models, e.g., SVM [57], sparse dictionary learning [1], low-rank matrix completion [58] and deep learning [59], can also be written in the form of (1).
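As a concrete illustration, the following is a minimal sketch of one component function and its gradient for the logistic-regression instance above; the variable names x_i, y_i and lam are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of one component f_i of Problem (1) for l2-regularized
# logistic regression, together with its gradient.

def f_i(w, x_i, y_i, lam):
    """f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lam / 2) * ||w||^2."""
    return np.log1p(np.exp(-y_i * (x_i @ w))) + 0.5 * lam * np.dot(w, w)

def grad_f_i(w, x_i, y_i, lam):
    """Gradient of f_i; averaging these over i gives grad P(w)."""
    sigma = 1.0 / (1.0 + np.exp(y_i * (x_i @ w)))  # = sigmoid(-y_i * x_i^T w)
    return -y_i * sigma * x_i + lam * w
```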
To proceed with the analysis of the proposed algorithm, we make the following common assumptions.

Assumption 1.
Each convex function $f_i$ in (1) is Lipschitz smooth, i.e., there exists a constant $L > 0$ such that for all $w$ and $w'$ in $\mathbb{R}^d$,

$$\|\nabla f_i(w) - \nabla f_i(w')\| \le L\,\|w - w'\|. \qquad (2)$$
Note that this assumption implies that the objective function $P$ is also Lipschitz smooth. Moreover, by the property of Lipschitz smooth functions (see [60]), we have

$$f_i(w) \le f_i(w') + \nabla f_i(w')^{\top}(w - w') + \frac{L}{2}\|w - w'\|^2, \quad \forall\, w, w' \in \mathbb{R}^d. \qquad (3)$$
Assumption 2.
The objective function $P$ is strongly convex, i.e., there exists a constant $\mu > 0$ such that for all $w, w' \in \mathbb{R}^d$,

$$P(w) \ge P(w') + \nabla P(w')^{\top}(w - w') + \frac{\mu}{2}\|w - w'\|^2, \qquad (4)$$

or, equivalently,

$$\big(\nabla P(w) - \nabla P(w')\big)^{\top}(w - w') \ge \mu\,\|w - w'\|^2. \qquad (5)$$
When setting $w' = w_*$, where $w_* = \arg\min_{w} P(w)$, it is known from [15] that the strong convexity of $P$ implies that

$$2\mu\,\big(P(w) - P(w_*)\big) \le \|\nabla P(w)\|^2, \quad \forall\, w \in \mathbb{R}^d. \qquad (6)$$
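For completeness, a short derivation sketch of (6) from (4): for any $u, w \in \mathbb{R}^d$, (4) gives $P(u) \ge P(w) + \nabla P(w)^{\top}(u - w) + \frac{\mu}{2}\|u - w\|^2$; minimizing both sides over $u$ (the right-hand side is minimized at $u = w - \frac{1}{\mu}\nabla P(w)$) yields

$$P(w_*) \ge P(w) - \frac{1}{2\mu}\|\nabla P(w)\|^2,$$

which rearranges to (6).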
In this paper, the complexity analysis aims to bound the number of iterations (or the total number of stochastic gradient evaluations) required to guarantee $\mathbb{E}\big[\|\nabla P(w_t)\|^2\big] \le \epsilon$. In this case, we say that $w_t$ is an $\epsilon$-accurate solution.
IV The Algorithm
In the following, we begin with an introduction of the RBB step size, and then we put forward our MB-SARAH-RBB method, which incorporates the RBB step size into MB-SARAH.
IV-A Random Barzilai-Borwein Step Size
To solve Problem (1), Yang et al. [61] proposed to use the RBB method to calculate the step size for mS2GD, thereby obtaining mS2GD-RBB. In the inner loop of mS2GD-RBB, the solution sequence is updated as

$$w_{t+1} = w_t - \eta_s v_t, \qquad (7)$$

where $\eta_s$ is the step size used throughout the $s$-th outer loop, defined by the RBB rule

$$\eta_s = \frac{1}{m}\cdot\frac{\|\tilde{w}_s - \tilde{w}_{s-1}\|^2}{(\tilde{w}_s - \tilde{w}_{s-1})^{\top}\big(\nabla f_{B_s}(\tilde{w}_s) - \nabla f_{B_s}(\tilde{w}_{s-1})\big)}, \qquad (8)$$

and $v_t$ is the stochastic estimate of $\nabla P(w_t)$, defined as

$$v_t = \frac{1}{b}\sum_{i \in I_t}\big(\nabla f_i(w_t) - \nabla f_i(\tilde{w})\big) + \nabla P(\tilde{w}), \qquad (9)$$

where $I_t \subseteq \{1, \ldots, n\}$, with size $b$, is the mini-batch used in the inner loop, $B_s \subseteq \{1, \ldots, n\}$, with size $b_s$, is the mini-batch used for the RBB step size, $\nabla f_{B_s}(\cdot) = \frac{1}{b_s}\sum_{i \in B_s}\nabla f_i(\cdot)$, $m$ is the number of inner iterations, $\tilde{w}_s$ denotes the $s$-th outer iterate, and $\tilde{w}$ is the current snapshot vector for which the gradient $\nabla P(\tilde{w})$ has already been calculated in the deterministic step.
Actually, the RBB method satisfies the so-called quasi-Newton property in the setting of stochastic optimization. Specifically, the RBB method can be viewed as a variant of a stochastic quasi-Newton method, in which second-order information is exploited. In recent years, a growing body of work has shown that stochastic quasi-Newton methods iterate almost as fast as first-order stochastic gradient methods while needing fewer iterations to achieve the same accuracy [62, 63, 64, 65, 66].
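To make the rule concrete, the following is a minimal sketch of the RBB step size (8), assuming a helper grad_fi(w, i) that returns the i-th component gradient; the function and variable names are illustrative, not the authors'.

```python
import numpy as np

# Minimal sketch of the RBB step-size rule (8): the classical BB ratio, with the
# gradient difference replaced by a mini-batch stochastic gradient difference and
# the result scaled by 1/m (m = number of inner iterations).

def rbb_step_size(grad_fi, w_curr, w_prev, batch, m):
    s = w_curr - w_prev                                           # difference of outer iterates
    g_curr = np.mean([grad_fi(w_curr, i) for i in batch], axis=0)
    g_prev = np.mean([grad_fi(w_prev, i) for i in batch], axis=0)
    y = g_curr - g_prev                                           # mini-batch gradient difference
    return np.dot(s, s) / (m * np.dot(s, y))
```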
IV-B The proposed method
The MB-SARAH method, proposed by Nguyen et al. [29], can be viewed as a variant of mS2GD. However, the pivotal difference between mS2GD and MB-SARAH is that the latter uses a new kind of stochastic estimate of $\nabla P(w_t)$, i.e., the recursive estimator

$$v_t = \frac{1}{b}\sum_{i \in I_t}\big(\nabla f_i(w_t) - \nabla f_i(w_{t-1})\big) + v_{t-1}. \qquad (10)$$
For comparison, the stochastic estimate of mS2GD-RBB is written as in (9). Note that for mS2GD-RBB, $v_t$ is an unbiased estimator of the gradient, i.e., from (9), we have $\mathbb{E}_{I_t}[v_t] = \nabla P(w_t)$. However, this is not true for MB-SARAH.
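For intuition, a short sketch of the two estimators' conditional expectations over the mini-batch $I_t$ (given the history up to iteration $t$), using (9) and (10):

\begin{align*}
\text{mS2GD-RBB (9):}\quad & \mathbb{E}_{I_t}[v_t] = \nabla P(w_t) - \nabla P(\tilde{w}) + \nabla P(\tilde{w}) = \nabla P(w_t),\\
\text{MB-SARAH (10):}\quad & \mathbb{E}_{I_t}[v_t] = \nabla P(w_t) - \nabla P(w_{t-1}) + v_{t-1} \neq \nabla P(w_t) \ \text{in general}.
\end{align*}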
We introduce the RBB method into MB-SARAH and obtain a new SARAH method referred to as MB-SARAH-RBB. Different from mS2GD-RBB, when computing the random step size for MB-SARAH, we multiply (8) by a parameter $\theta > 0$, i.e.,

$$\eta_s = \theta\cdot\frac{1}{m}\cdot\frac{\|\tilde{w}_s - \tilde{w}_{s-1}\|^2}{(\tilde{w}_s - \tilde{w}_{s-1})^{\top}\big(\nabla f_{B_s}(\tilde{w}_s) - \nabla f_{B_s}(\tilde{w}_{s-1})\big)}, \qquad (11)$$

where the parameter $\theta$ is important for controlling the convergence of MB-SARAH-RBB.
Now we are ready to describe our MB-SARAH-RBB method (Algorithm 1).
Remark: At the beginning of MB-SARAH-RBB, an initial step size $\eta_0$ needs to be specified. However, we observed from the numerical experiments that the performance of MB-SARAH-RBB is not sensitive to the choice of $\eta_0$. It can also be seen from Algorithm 1 that, if we always set $\eta_s = \eta_0$, then MB-SARAH-RBB reduces to the original MB-SARAH method.
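To summarize the method, the following is a minimal sketch of one plausible reading of Algorithm 1, combining the SARAH recursion (10) with the scaled RBB step size (11). The loop structure, helper names, and the use of the last inner iterate as the next snapshot are our assumptions, not the authors' exact pseudocode; any finite-sum problem of the form (1) can be plugged in through grad_fi(w, i).

```python
import numpy as np

# Minimal sketch of one plausible reading of MB-SARAH-RBB: an outer loop with a
# full gradient at the snapshot, a SARAH inner loop (10), and the step size
# updated by the scaled RBB rule (11).

def mb_sarah_rbb(grad_fi, n, w0, eta0, theta, m, b, b_s, n_outer, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    grad_batch = lambda w, idx: np.mean([grad_fi(w, i) for i in idx], axis=0)
    w_tilde, w_tilde_prev, eta = w0.copy(), None, eta0
    for _ in range(n_outer):
        if w_tilde_prev is not None:                      # scaled RBB step size (11)
            B = rng.choice(n, size=b_s, replace=False)
            s_vec = w_tilde - w_tilde_prev
            y_vec = grad_batch(w_tilde, B) - grad_batch(w_tilde_prev, B)
            eta = theta * np.dot(s_vec, s_vec) / (m * np.dot(s_vec, y_vec))
        v = grad_batch(w_tilde, np.arange(n))             # full gradient at the snapshot
        w_prev, w = w_tilde.copy(), w_tilde - eta * v
        for _t in range(m - 1):                           # SARAH recursion (10)
            I = rng.choice(n, size=b, replace=False)
            v = grad_batch(w, I) - grad_batch(w_prev, I) + v
            w_prev, w = w, w - eta * v
        w_tilde_prev, w_tilde = w_tilde, w                # snapshot for the next outer loop
    return w_tilde
```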
V Convergence Analysis
In this section, we present the convergence analysis of MB-SARAH-RBB and discuss its complexity. We first provide the following lemmas.
Lemma 1.
Proof.
Available in Appendix A-A. ∎
With a minor modification of Lemma 3 in [29], we obtain the following lemma, which provides the upper bound used in our analysis.
Using the above lemmas, we obtain the following convergence rate for MB-SARAH-RBB with one outer loop.
Theorem 1.
Proof.
Available in Appendix A-B. ∎
This result shows that the inner loop of MB-SARAH-RBB with a single outer loop converges sublinearly. To obtain an $\epsilon$-accurate solution, it is sufficient to choose the number of inner iterations sufficiently large; the total complexity required for an $\epsilon$-accurate solution then follows. Therefore, we obtain the following complexity bound.
Corollary 1.
Under Assumption 1, consider MB-SARAH-RBB with one outer loop; then the method has sublinear convergence in expectation, and the corresponding total complexity to achieve an $\epsilon$-accurate solution follows.
Compared with Corollary 1 in [29], the complexity of our MB-SARAH-RBB method is better than that of MB-SARAH when an appropriate mini-batch size is chosen. Note that, in our MB-SARAH-RBB method, the parameter $\theta$ is required to be greater than a certain threshold.
Now, we present the convergence of MB-SARAH-RBB with multiple outer loops.
Theorem 2.
Proof.
Available in Appendix A-C. ∎
To obtain an $\epsilon$-accurate solution, it is sufficient to choose the numbers of outer and inner iterations appropriately. Therefore, we have the following conclusion for the total complexity of the proposed method.
Corollary 2.
Suppose Assumption 1 holds; then the total complexity of MB-SARAH-RBB to achieve an $\epsilon$-accurate solution follows from Theorem 2.
Compared with the complexity of MB-SARAH, Corollary 2 indicates that MB-SARAH-RBB has lower complexity when an appropriate mini-batch size is chosen.
VI Experiments
In this section, the effectiveness of our MB-SARAH-RBB method is verified with experiments. In particular, our experiments were performed on the well-worn problem of training ridge regression, i.e.,

$$\min_{w \in \mathbb{R}^d} \; P(w) = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{2}\big(x_i^{\top}w - y_i\big)^2 + \frac{\lambda}{2}\|w\|^2\Big), \qquad (13)$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ is a collection of training examples.
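For reference, the following is a minimal sketch of the component gradient of (13), i.e., the kind of grad_fi(w, i) that the MB-SARAH-RBB sketch in Section IV-B consumes; the names X, y and lam are illustrative.

```python
import numpy as np

# Minimal sketch of the component gradient for the ridge-regression objective (13).

def make_grad_fi(X, y, lam):
    """Return grad_fi(w, i) for f_i(w) = 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2."""
    def grad_fi(w, i):
        return (X[i] @ w - y[i]) * X[i] + lam * w
    return grad_fi
```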
We tested our MB-SARAH-RBB method on three publicly available data sets (a8a, w8a and ijcnn1), which can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Detailed information on the data sets is listed in Table I.
TABLE I: Information of the data sets.

Dataset    Training size    Features
a8a        22,696           123
w8a        49,749           300
ijcnn1     49,990           22
VI-A Properties of MB-SARAH-RBB
In this subsection, we study the properties of MB-SARAH-RBB on the data sets listed in Table I. To clearly show the properties of our MB-SARAH-RBB method, we present comparison results between MB-SARAH-RBB and MB-SARAH with the best-tuned step size. For ease of analysis, MB-SARAH-RBB uses the same mini-batch size $b$ as MB-SARAH to update the solution sequence on the different data sets. Therefore, we can focus on the effect of the batch size $b_s$ used to obtain the step-size sequence.
In addition, for MB-SARAH-RBB, we chose the parameter $\theta$ as 0.1 when the batch size $b_s$ is small; otherwise, we take a slightly larger value. Moreover, the remaining parameters of MB-SARAH-RBB were kept fixed across all runs.
Figs. 1, 2 and 3 compare MB-SARAH-RBB with MB-SARAH. In all subfigures, the horizontal axis represents the number of effective passes over the data, where each effective pass evaluates $n$ component gradients. The vertical axis is the suboptimality $P(w) - P(w_*)$, where $P(w_*)$ is obtained by running MB-SARAH with the best-tuned step size. Moreover, the dashed lines represent MB-SARAH with different fixed step sizes, and the solid lines represent MB-SARAH-RBB with different batch sizes $b_s$ and values of $\theta$. Detailed information on the parameters is given in the legends of the subfigures.
Figs. 1, 2 and 3 show that MB-SARAH-RBB is comparable to, or performs better than, MB-SARAH with the best-tuned step size. They also indicate that, for a fixed mini-batch size $b$, there is no need to use a large batch size $b_s$ to obtain the step-size sequence. However, a batch size $b_s$ that is too small makes MB-SARAH-RBB diverge.
In the remark on Algorithm 1, we pointed out that MB-SARAH-RBB is not sensitive to the choice of the initial step size $\eta_0$. To illustrate this, we ran MB-SARAH-RBB with three different initial step sizes on two of the data sets; the results are presented in Fig. 4, with the remaining parameters set separately for each data set.
It can be seen from Fig. 4 that the performance of MB-SARAH-RBB is not influenced by the choice of $\eta_0$.
VI-B Comparison with mS2GD-RBB
mS2GD-RBB, proposed by Yang et al. [61], uses a strategy similar to ours to compute the step size for mS2GD. One of the key differences between mS2GD-RBB and MB-SARAH-RBB is that the latter multiplies (8) by a positive constant $\theta$, as in (11). To further show the efficacy of our MB-SARAH-RBB method, we compare these two methods. All parameters of mS2GD-RBB are set as suggested in [61]. Also, we use dashed lines to represent mS2GD-RBB and solid lines to represent MB-SARAH-RBB.
VI-C Comparison with other related methods
In this section, we compare our MB-SARAH-RBB method with the following methods:
 1) SAG-LS: stochastic average gradient method with line search [67].
 2) SAG-BB: stochastic average gradient method with BB step size [45].
 3) SVRG: stochastic variance reduced gradient method [24]. For SVRG, the best constant step size was employed.
 4) SVRG-BB: stochastic variance reduced gradient method with BB step size [45].
 5) mS2GD-BB: a mini-batch version of SVRG-BB proposed in [46]. For mS2GD-BB, all parameters were set as suggested in [46].
 6) SDCA: stochastic dual coordinate ascent method [23]. We chose the parameters as suggested in [23]. Also, the best constant step size was employed.
 7) Acc-Prox-SVRG: an accelerated proximal stochastic gradient method from [25]. We chose the parameters as suggested in [25]. Also, the best constant step size was employed.
 8) Acc-Prox-SVRG-BB: a variant of Acc-Prox-SVRG with the BB step size from [47]. We set the parameters of Acc-Prox-SVRG-BB as suggested in [47].
 9) Acc-Prox-SVRG-RBB: a variant of Acc-Prox-SVRG with the RBB step size from [61]. For Acc-Prox-SVRG-RBB, we used the best batch sizes for the different data sets.
 10) MSVRG-OSS: the MSVRG method with an online step size [43]. The parameters were set as suggested in [43].
As can be seen from Fig. 7, our MB-SARAH-RBB method outperforms or matches state-of-the-art algorithms.
VII Conclusion
This paper is motivated by a defect of SARAH related to the choice of step size. Specifically, common implementations of such schemes provide little guidance on specifying the step-size parameters that prove crucial for practical performance. Accordingly, we propose using the RBB method to automatically compute the step size for MB-SARAH, obtaining MB-SARAH-RBB. We prove that our MB-SARAH-RBB method converges at a linear rate for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that the complexity of the original MB-SARAH method is improved by incorporating the RBB method. Numerical results show that our MB-SARAH-RBB method outperforms or matches state-of-the-art algorithms.
Appendix A Proofs
A-A Proof of Lemma 1
According to (3) and , we have
Employing strong convexity, we obtain the following upper bound for the RBB step size from Algorithm 1.
Therefore, we obtain

where the last equality follows from the fact stated above.
By summing over $t$, we have
Further, we have

where the last inequality follows from the preceding bound.
A-B Proof of Theorem 1
From Lemma 2, we have
Hence, by summing over $t$, we obtain
Further, we have
(14) 
Therefore, by Lemma 1, we have
By the definitions in Algorithm 1, we have that
A-C Proof of Theorem 2
Note that and , . From Theorem 1, we obtain
Hence, taking expectation, we obtain
References
 [1] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th International Conference on Machine Learning, ACM, 2009, pp. 689–696.
 [2] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning, ACM, 2004, p. 116.
 [3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9) (2009) 1627–1645.
 [4] V. Blanz, T. Vetter, Face recognition based on fitting a 3D morphable model, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9) (2003) 1063–1074.
 [5] M. Hoffman, F. R. Bach, D. M. Blei, Online learning for latent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864.
 [6] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
 [7] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, T. Huang, Large-scale image classification: fast feature extraction and SVM training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696.
 [8] Z. Zhang, P. Luo, C. C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in: European Conference on Computer Vision, Springer, 2014, pp. 94–108.
 [9] Q. Tao, Q.-K. Gao, D.-J. Chu, G.-W. Wu, Stochastic learning via optimizing the variational inequalities, IEEE Transactions on Neural Networks and Learning Systems 25 (10) (2014) 1769–1778.
 [10] M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, P. M. Djuric, Adaptive importance sampling: the past, the present, and the future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79.
 [11] C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Networks and Learning Systems 29 (7) (2017) 3084–3096.
 [12] X.L. Li, Preconditioned stochastic gradient descent, IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1466.

 [13] E. Moulines, F. R. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459.
 [14] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on optimization 19 (4) (2009) 1574–1609.
 [15] L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for largescale machine learning, SIAM Review 60 (2) (2018) 223–311.
 [16] A. Cotter, O. Shamir, N. Srebro, K. Sridharan, Better minibatch algorithms via accelerated gradient methods, in: Advances in Neural Information Processing Systems, 2011, pp. 1647–1655.
 [17] J. Konečnỳ, J. Liu, P. Richtárik, M. Takáč, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Selected Topics in Signal Processing 10 (2) (2016) 242–255.
 [18] D. Needell, R. Ward, N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, in: Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
 [19] D. Csiba, P. Richtárik, Importance sampling for minibatches, The Journal of Machine Learning Research 19 (1) (2018) 962–982.

 [20] T. Fu, Z. Zhang, CPSG-MCMC: Clustering-based preprocessing method for stochastic gradient MCMC, in: Artificial Intelligence and Statistics, 2017, pp. 841–850.
 [21] N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in: Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
 [22] A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in: Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
 [23] S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research 14 (Feb) (2013) 567–599.
 [24] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323.
 [25] A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582.
 [26] L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Sarah: A novel method for machine learning problems using stochastic recursive gradient, in: International Conference on Machine LearningVolume 70, JMLR. org, 2017, pp. 2613–2621.
 [27] C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, in: Advances in Neural Information Processing Systems, 2018, pp. 689–699.
 [28] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, J. R. Kalagnanam, Finite-sum smooth optimization with SARAH, 2019.
 [29] L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Stochastic recursive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261.
 [30] L. M. Nguyen, K. Scheinberg, M. Takáč, Inexact sarah algorithm for stochastic optimization, arXiv preprint arXiv:1811.10105.
 [31] S. Horváth, P. Richtárik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146.
 [32] D. Zhou, Q. Gu, Stochastic recursive variancereduced cubic regularization methods, arXiv preprint arXiv:1901.11518.
 [33] N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization, arXiv preprint arXiv:1902.05679.
 [34] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, J. R. Kalagnanam, Optimal finite-sum smooth nonconvex optimization with SARAH, arXiv preprint arXiv:1901.07648.
 [35] L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM Journal on Optimization 24 (4) (2014) 2057–2075.
 [36] J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148.
 [37] Z. Yang, C. Wang, Z. Zhang, J. Li, Random Barzilai-Borwein step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124–135.
 [38] H. Kesten, Accelerated stochastic approximation, Annals of Mathematical Statistics 29 (1) (1958) 41–59.
 [39] H. Robbins, S. Monro, A stochastic approximation method, Annals of Mathematical Statistics (1951) 400–407.
 [40] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer Berlin Heidelberg, 1990.
 [41] T. Tieleman, G. Hinton, Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning.
 [42] A. P. George, W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning 65 (1) (2006) 167–198.
 [43] Z. Yang, C. Wang, Z. Zhang, J. Li, Minibatch algorithms with online step size, KnowledgeBased Systems 165 (2019) 228–240.
 [44] K. Sopyła, P. Drozda, Stochastic gradient descent with Barzilai–Borwein update step for SVM, Information Sciences 316 (2015) 218–233.
 [45] C. Tan, S. Ma, Y. H. Dai, Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2016, pp. 685–693.
 [46] Z. Yang, C. Wang, Y. Zang, J. Li, Mini-batch algorithms with Barzilai–Borwein update step, Neurocomputing 314 (2018) 177–185.
 [47] Z. Yang, C. Wang, Z. Zhang, J. Li, Accelerated stochastic gradient descent with step size selection rules, Signal Processing 159 (2019) 171–186.
 [48] S. De, A. Yadav, D. Jacobs, T. Goldstein, Automated inference with adaptive batches, in: International Conference on Artificial Intelligence and Statistics, 2017.
 [49] K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y. Yao, Stochastic non-convex ordinal embedding with stabilized Barzilai-Borwein step size, in: AAAI Conference on Artificial Intelligence, 2018.
 [50] F. Yousefian, A. Nedić, U. V. Shanbhag, On stochastic gradient and subgradient methods with adaptive steplength sequences, Automatica 48 (1) (2012) 56–67.
 [51] M. Mahsereci, P. Hennig, Probabilistic line searches for stochastic optimization, Journal of Machine Learning Research 18 (119) (2017) 1–59.
 [52] A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient descent, in: International Conference on Learning Representations, 2018.
 [53] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in: International Conference on Machine Learning, 2016, pp. 314–323.
 [54] L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Parameter adaptation in stochastic optimization, in: Online learning in neural networks, Cambridge University Press, 1999, pp. 111–134.
 [55] T. Schaul, S. Zhang, Y. Lecun, No more pesky learning rates, in: International Conference on Machine Learning, 2013, pp. 343–351.
 [56] T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning, Technical Report.
 [57] Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training, Journal of Machine Learning Research 13 (Oct) (2012) 3103–3131.
 [58] S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in: Advances in Neural Information Processing Systems, 2016, pp. 3873–3881.
 [59] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: International Conference on Machine Learning, 2013, pp. 1139–1147.
 [60] Y. Nesterov, Introductory lectures on convex optimization : basic course, Kluwer Academic, 2004.
 [61] Z. Yang, C. Wang, Z. Zhang, J. Li, Random Barzilai–Borwein step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124–135.
 [62] A. Bordes, L. Bottou, P. Gallinari, SGD-QN: Careful quasi-Newton stochastic gradient descent, Journal of Machine Learning Research 10 (Jul) (2009) 1737–1754.
 [63] R. H. Byrd, G. M. Chin, J. Nocedal, Y. Wu, Sample size selection in optimization methods for machine learning, Mathematical programming 134 (1) (2012) 127–155.
 [64] R. H. Byrd, S. L. Hansen, J. Nocedal, Y. Singer, A stochastic quasi-Newton method for large-scale optimization, SIAM Journal on Optimization 26 (2) (2016) 1008–1031.
 [65] N. Agarwal, B. Bullins, E. Hazan, Second-order stochastic optimization for machine learning in linear time, The Journal of Machine Learning Research 18 (1) (2017) 4148–4187.
 [66] N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in: Advances in Neural Information Processing Systems, 2018, pp. 2899–2908.
 [67] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random fields, in: AISTATS, 2015.