1 Introduction
Consider the empirical risk minimization (ERM) problem:
(1) min_{w ∈ ℝ^d} f(w) := (1/n) ∑_{i=1}^n f_i(w),
where each f_i is smooth and f is strongly convex. Each f_i represents a regularized loss over a sampled data point. Solving the ERM problem is often time consuming for a large number of samples n, so much so that algorithms scanning through all the data points at each iteration are not competitive. Gradient descent (GD) falls into this category, and in practice its stochastic version is preferred.
Stochastic gradient descent (SGD), on the other hand, allows one to solve the ERM problem incrementally by computing at each iteration an unbiased estimate of the full gradient, for an index sampled at random in {1, …, n} (Robbins & Monro, 1951). On the downside, for SGD to converge one needs to tune a sequence of asymptotically vanishing step sizes, a cumbersome and time-consuming task for the user. Recent works have taken advantage of the sum structure in Eq. 1 to design stochastic variance reduced gradient algorithms (Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Defazio et al., 2014; Schmidt et al., 2017). In the strongly convex setting, these methods enjoy fast linear convergence instead of the slow rate of SGD. Moreover, they only require a constant step size, informed by theory, instead of a sequence of decreasing step sizes.
In practice, most variance reduced methods rely on a minibatching strategy for better performance. Yet most convergence analyses (with the Katyusha algorithm of Allen-Zhu (2017) being an exception) indicate that a minibatch size of 1 gives the best overall complexity, disagreeing with practical findings, where a larger minibatch often gives better results. Here, we show both theoretically and numerically that 1 is not the optimal minibatch size for the SAGA algorithm (Defazio et al., 2014).
Our analysis leverages recent results in (Gower et al., 2018), where the authors prove that the iteration complexity and the step size of SAGA, and of a larger family of methods called the JacSketch methods, depend on an expected smoothness constant. This constant governs the tradeoff between the increased cost of an iteration as the minibatch size grows and the decreased total complexity. Thus, if this expected smoothness constant could be calculated a priori, then we could set the optimal minibatch size and step size. We provide simple formulas for computing the expected smoothness constant when sampling minibatches without replacement, and use them to calculate optimal minibatch sizes and significantly larger step sizes for SAGA.
In particular, we provide two bounds on the expected smoothness constant, each resulting in a particular step size formula. We first derive the simple bound and then develop a matrix concentration inequality to obtain the refined Bernstein bound. We also provide substantial theoretical motivation and numerical evidence for a practical estimate of the expected smoothness constant. For illustration, we plot in Figure 1 the evolution of each resulting step size as the minibatch size grows on a classification problem (Section 5 has more details on our experimental settings).
Figure 1: Step sizes as a function of the minibatch size for a logistic regression problem applied to the feature-scaled covtype.binary dataset from LIBSVM.
Furthermore, our bounds provide new insight into the total complexity of SAGA, denoted K_total hereafter. For example, when using our simple bound we show for regularized generalized linear models (GLM), with f as in Eq. 10, that K_total is piecewise linear in the minibatch size b:
with constants made explicit in Section 3.3, and where ε is the desired precision.
This complexity bound, and others presented in Section 3.3 show that SAGA enjoys a linear speedup as we increase the minibatch size until an optimal one (as illustrated in Figure 2). After this point, the total complexity increases. We use this observation to develop optimal and practical minibatch sizes and step sizes.
The rest of the paper is structured as follows. In Section 2 we first introduce variance reduction techniques after presenting our main assumption, the expected smoothness assumption. We highlight how this assumption is necessary to capture the improvement in iteration complexity, and conclude the section by showing that calculating the expected smoothness constant requires evaluating an intractable expectation. This brings us to Section 3, where we directly address this issue and provide several tractable upper bounds on the expected smoothness constant. We then calculate optimal minibatch sizes and step sizes using our new bounds. Finally, we give numerical experiments in Section 5 that verify our theory on artificial and real datasets. We also show how these new settings for the minibatch size and step size lead to practical performance gains.
2 Background
2.1 Controlled stochastic reformulation and JacSketch
We can introduce variance reduced versions of SGD in a principled manner by using a sampling vector.
Definition 1.
We say that a random vector v ∈ ℝ^n with distribution 𝒟 is a sampling vector if E_𝒟[v_i] = 1 for all i = 1, …, n.
With a sampling vector we can rewrite (1) through the following stochastic reformulation
(2) min_{w ∈ ℝ^d} E_𝒟[f_v(w)], where f_v(w) := (1/n) ∑_{i=1}^n v_i f_i(w),
where f_v is called a subsampled function. The stochastic Problem (2) and our original Problem (1) are equivalent: E_𝒟[f_v(w)] = f(w) for all w ∈ ℝ^d.
Consequently the gradient ∇f_v(w) is an unbiased estimate of ∇f(w), and we could use the SGD method to solve (2). To tackle the variance of these stochastic gradients we can further modify (2) by introducing control variates, which leads to the following controlled stochastic reformulation:
(3) min_{w ∈ ℝ^d} E_𝒟[ f_v(w) − z_v(w) + E_𝒟[z_v(w)] ],
where the z_v are the control variates. Clearly (3) is also equivalent to (1) since z_v − E_𝒟[z_v] has zero expectation. Thus, we can solve (3) using an SGD algorithm where the stochastic gradients are given by
(4) g_v(w) := ∇f_v(w) − ∇z_v(w) + E_𝒟[∇z_v(w)].
That is, starting from a vector w^0 ∈ ℝ^d, given a positive step size γ > 0, we can iterate the steps
(5) w^{k+1} = w^k − γ g_{v^k}(w^k),
where the v^k ∼ 𝒟 are i.i.d. samples drawn at each iteration.
The JacSketch algorithm introduced by Gower et al. (2018) fits this format (5) and uses a linear control variate z_v(w) = (1/n) ∑_i v_i ⟨J_{:i}, w⟩, where J ∈ ℝ^{d×n} is a matrix of parameters. This matrix is updated at each iteration so as to increase the correlation between the control variates and the subsampled gradients, and thus decrease the variance of the resulting stochastic gradients. Carefully updating the control variates through J results in a method whose stochastic gradients have decreasing variance, which is why JacSketch is a stochastic variance reduced algorithm. This is also why the user can set a single constant step size a priori instead of tuning a sequence of decreasing ones. The SAGA algorithm, and all of its minibatching variants, are instances of the JacSketch method.
2.2 The expected smoothness constant
In order to analyze stochastic variance reduced methods, some form of smoothness assumption needs to be made. The most common assumption is
(6) ||∇f_i(w) − ∇f_i(y)|| ≤ L_max ||w − y|| for all w, y ∈ ℝ^d,
for each i = 1, …, n. That is, each f_i is uniformly smooth with smoothness constant L_max, as is assumed in (Defazio et al., 2014; Hofmann et al., 2015; Raj & Stich, 2018) for variants of SAGA. The same assumption is made in the proofs of SVRG (Johnson & Zhang, 2013), S2GD (Konečný & Richtárik, 2017) and the SARAH algorithm (Nguyen et al., 2017). In the analyses of these papers it was shown that the iteration complexity of SAGA is proportional to L_max and that the step size is inversely proportional to L_max.
But as was shown in (Gower et al., 2018), we can set a much larger step size by making use of the smoothness of the subsampled functions. For this, Gower et al. (2018) introduced the notion of expected smoothness, which we extend here to all sampling vectors and control variates.
Definition 2 (Expected smoothness constant).
Consider a sampling vector v with distribution 𝒟. We say that the expected smoothness assumption holds with constant ℒ if for every w ∈ ℝ^d we have that
(7) E_𝒟[ ||∇f_v(w) − ∇f_v(w^*)||² ] ≤ 2 ℒ (f(w) − f(w^*)),
where w^* denotes the minimizer of f.
Remark 1.
Note that we refer to any positive constant ℒ that satisfies (7) as an expected smoothness constant. Indeed, ℒ = +∞ is a valid constant in the extended reals but, as we will see, the smaller ℒ, the better for our complexity results.
Gower et al. (2018) show that the expected smoothness constant ℒ plays the same role that L_max does in the previously existing analysis of SAGA, namely that the step size is inversely proportional to ℒ and the iteration complexity is proportional to ℒ (see details in Theorem 1). Furthermore, by assuming that f is L-smooth, the expected smoothness constant is bounded:
(8) L ≤ ℒ ≤ L_max,
as was proven in Theorem 4.17 in (Gower et al., 2018). Also, the bounds ℒ = L_max and ℒ = L are attained when using a uniform single element sampling and a full batch, respectively. And as we will show, the constants L and L_max can be orders of magnitude apart on high-dimensional problems. Thus we could set much larger step sizes for larger minibatch sizes if we could calculate ℒ. Though calculating ℒ is not easy, as we see in the next lemma.
Lemma 1.
Let v be an unbiased sampling vector with distribution 𝒟. Suppose that f_v is L_v-smooth and that each f_i is convex, for i = 1, …, n. It follows that the expected smoothness assumption holds with ℒ = max_{i=1,…,n} E_𝒟[L_v v_i].
Proof.
The proof is given in Section A.1. ∎
Unfortunately, if the sampling has a combinatorially large number of possible realizations, as is the case when sampling minibatches without replacement, then this expectation becomes intractable to calculate. This observation motivates the development of functional upper bounds of the expected smoothness constant that can be efficiently evaluated.
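To make the intractability concrete, the exact constant of Eq. 17 can be computed by brute force on tiny problems. The Python sketch below assumes the ridge-regression instance of the GLM (10) (so the second derivatives φ''_i equal 1) and enumerates all binom(n, b) minibatches, which is exactly what becomes infeasible for realistic n; the function names are ours:

```python
import numpy as np
from itertools import combinations
from math import comb

def subsample_smoothness(X, C, lam):
    """L_C for the ridge subsampled loss: largest eigenvalue of the
    subsampled Hessian (1/|C|) sum_{i in C} x_i x_i^T + lam*I."""
    XC = X[list(C)]
    return np.linalg.eigvalsh(XC.T @ XC / len(C)).max() + lam

def expected_smoothness(X, b, lam):
    """Exact expected smoothness for b-nice sampling, by enumerating every
    minibatch of size b: max over i of the average of L_C over batches C
    containing i. Only feasible for tiny n."""
    n = X.shape[0]
    sums = np.zeros(n)
    for C in combinations(range(n), b):
        L_C = subsample_smoothness(X, C, lam)
        for i in C:
            sums[i] += L_C
    return sums.max() / comb(n - 1, b - 1)
```

For n = 6 this enumeration touches at most 20 subsets per batch size; for n in the tens of thousands and moderate b, the number of subsets is astronomical, which is why the upper bounds of Section 3 are needed.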
2.3 Minibatch without replacement: b-nice sampling
Now we choose a distribution of the sampling vector based on minibatch sampling without replacement. We denote a minibatch as C ⊆ {1, …, n} and its size as b = |C|.
Definition 3 (b-nice sampling).
We say that S is a b-nice sampling if S is a set-valued random variable sampled uniformly at random among all subsets of {1, …, n} of size b.
We can construct a sampling vector based on a b-nice sampling S by setting v = v(S) := (n/b) ∑_{i∈S} e_i, where e_1, …, e_n is the canonical basis of ℝ^n. Indeed, v is a sampling vector according to Definition 1 since for every i we have
(9) v_i = (n/b) 1_{i∈S},
where 1_{i∈S} denotes the indicator function of the random set S. Now taking expectation in (9) gives E[v_i] = (n/b) P(i ∈ S) = 1, using P(i ∈ S) = b/n.
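As a quick check of Definition 1, the identity E[v_i] = 1 can be verified by averaging the sampling vector over all equally likely minibatches; a small Python sketch (the helper name is ours):

```python
import numpy as np
from itertools import combinations

def sampling_vector(S, n, b):
    """Sampling vector of the b-nice construction: v = (n/b) * sum_{i in S} e_i."""
    v = np.zeros(n)
    v[list(S)] = n / b
    return v

# Definition 1 requires E[v_i] = 1. For b-nice sampling, averaging v over
# all binom(n, b) equally likely minibatches must give the all-ones vector.
n, b = 5, 2
batches = list(combinations(range(n), b))
mean_v = sum(sampling_vector(S, n, b) for S in batches) / len(batches)
```

Each index belongs to binom(n−1, b−1) of the binom(n, b) minibatches, so each coordinate averages to (n/b)·(b/n) = 1, as required.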
Here we are interested in the minibatch SAGA algorithm with b-nice sampling, which we refer to as b-nice SAGA. In particular, b-nice SAGA is the result of using b-nice sampling together with a linear model for the control variates. Different choices of the control variates also recover popular algorithms such as gradient descent, SGD or the standard SAGA method (see Table 1 for some examples).
A naive implementation of b-nice SAGA based on the JacSketch algorithm is given in Algorithm 1. We also provide a more efficient implementation, which we used for our experiments, in Algorithm 2 in the appendix.
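Since Algorithm 1 is not reproduced here, the following Python sketch shows the shape of a b-nice SAGA step for ridge regression. The gradient table J plays the role of the Jacobian estimate in JacSketch; the step size used below is only a safe conservative default, not the tuned value derived later in the paper:

```python
import numpy as np

def minibatch_saga(X, y, lam, b, gamma, n_iter, rng):
    """Sketch of b-nice SAGA on the ridge problem
    f(w) = 1/(2n)||Xw - y||^2 + (lam/2)||w||^2.
    At each step a minibatch S of size b is drawn uniformly without
    replacement and the table rows J[i], i in S, are refreshed."""
    n, d = X.shape
    w = np.zeros(d)
    J = np.zeros((n, d))      # J[i] = last stored gradient of f_i
    J_avg = np.zeros(d)       # maintained equal to J.mean(axis=0)
    for _ in range(n_iter):
        S = rng.choice(n, size=b, replace=False)         # b-nice sampling
        G = (X[S] @ w - y[S])[:, None] * X[S] + lam * w  # rows: grad f_i(w), i in S
        g = G.mean(axis=0) - J[S].mean(axis=0) + J_avg   # variance reduced estimate
        J_avg += (G - J[S]).sum(axis=0) / n              # keep the average in sync
        J[S] = G                                         # refresh the table
        w -= gamma * g
    return w
```

On a small ridge problem this converges to the exact minimizer, which can be computed in closed form for comparison.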
3 Upper bounds on the expected smoothness
To determine an optimal minibatch size for b-nice SAGA, we first state our assumptions and provide bounds on the smoothness of the subsampled functions. We then define the optimal minibatch size as the one that minimizes the total complexity of the considered algorithm, that is, the total number of stochastic gradients computed. Finally, we provide upper bounds on the expected smoothness constant ℒ, through which we can deduce optimal minibatch sizes. Many proofs are deferred to the supplementary material.
Table 1: Choices of the sampling vector and control variates in iteration (5) that recover gradient descent (GD), SGD, SAGA and b-nice SAGA as special cases.
3.1 Assumptions and notation
We consider that the objective function is a GLM with quadratic regularization controlled by a parameter λ > 0:
(10) f(w) = (1/n) ∑_{i=1}^n φ_i(x_i^⊤ w) + (λ/2) ||w||²,
where ||·|| is the Euclidean norm, the φ_i are convex functions and x_1, …, x_n is a sequence of observations in ℝ^d. This framework covers regularized logistic regression, by setting φ_i(t) = log(1 + exp(−y_i t)) for binary labels y_i in {−1, 1}; ridge regression, by setting φ_i(t) = (t − y_i)²/2 for real observations y_i; and conditional random fields, for which the y_i's are structured outputs. We assume that the second derivative of each φ_i is uniformly bounded, which holds for our aforementioned examples.
Assumption 1 (Bounded second derivatives).
There exists φ_max > 0 such that φ_i''(t) ≤ φ_max for all t ∈ ℝ and all i = 1, …, n.
For a batch C ⊆ {1, …, n}, we rewrite the subsampled function as
f_C(w) := (1/|C|) ∑_{i∈C} φ_i(x_i^⊤ w) + (λ/2) ||w||²,
and its second derivative is thus given by
(11) ∇²f_C(w) = (1/|C|) ∑_{i∈C} φ_i''(x_i^⊤ w) x_i x_i^⊤ + λ I_d,
where I_d denotes the identity matrix of size d. For a symmetric matrix M, we write λ_max(M) (resp. λ_min(M)) for its largest (resp. smallest) eigenvalue. Assumption 1 directly implies the following.
Lemma 2 (Subsample smoothness constant).
Let C ⊆ {1, …, n}, and let X_C ∈ ℝ^{d×|C|} denote the column concatenation of the vectors x_i with i ∈ C. The smoothness constant of the subsampled loss function f_C is given by
(12) L_C = (φ_max / |C|) λ_max(X_C X_C^⊤) + λ.
Another key quantity in our analysis is the strong convexity parameter.
Definition 4.
The strong convexity parameter μ of f is the largest constant such that f − (μ/2)||·||² remains convex.
Since we have an explicit regularization term with λ > 0, f is strongly convex and μ ≥ λ.
We additionally define L_i, resp. L, as the smoothness constant of the individual function f_i, resp. of the whole function f. We also recall the definitions of the maximum of the individual smoothness constants, L_max := max_{i=1,…,n} L_i, and of their average, L̄ := (1/n) ∑_{i=1}^n L_i. The three constants satisfy
(13) L ≤ L̄ ≤ L_max.
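Under the GLM model (10) and Assumption 1, all of these constants can be computed directly from the data matrix via Lemma 2. A small Python sketch for the ridge case, where φ_max = 1 (the function name is ours):

```python
import numpy as np

def smoothness_constants(X, lam, phi_max=1.0):
    """Smoothness constants of the regularized GLM, following Lemma 2:
    L_i = phi_max * ||x_i||^2 + lam for each row x_i of X, and
    L   = phi_max * eigmax(X^T X) / n + lam for the full objective.
    phi_max = 1 corresponds to ridge regression (phi_i'' = 1)."""
    n = X.shape[0]
    L_i = phi_max * (X ** 2).sum(axis=1) + lam
    L = phi_max * np.linalg.eigvalsh(X.T @ X).max() / n + lam
    return L, L_i.mean(), L_i.max()   # L, L_bar, L_max
```

The ordering (13) then holds because the largest eigenvalue is subadditive over the sum of the rank-one matrices x_i x_i^⊤.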
3.2 Path to the optimal minibatch size
Our starting point is the following theorem, obtained by combining Theorem 3.6 and Eq. (103) in (Gower et al., 2018). Note that λ has been added to every smoothness constant since the analysis in Gower et al. (2018) depends on the smoothness of f and of the subsampled functions f_C.
Theorem 1.
Through Theorem 1 we can now explicitly see how the expected smoothness constant ℒ controls both the step size and the resulting iteration complexity. This is why we need bounds on ℒ so that we can set the step size. In particular, we will show that the expected smoothness constant is a function of the minibatch size b. Consequently, so are the step size, the iteration complexity and the total complexity. We denote by K_total(b) the total complexity, defined as the number of stochastic gradients computed; hence, with (15),
(16)
Once we have determined ℒ as a function of b, we will calculate the minibatch size b* that optimizes the total complexity K_total(b).
As we have shown in Lemma 1, computing a precise value of ℒ can be computationally intractable. This is why we focus on finding upper bounds on ℒ that can be computed, but that are also tight enough to be useful. To verify that our bounds are sufficiently tight, we will always keep in mind the bounds given in (8). In particular, after expressing our bounds on ℒ as a function of the minibatch size b, we would like the bounds (8) to be attained for b = 1 and b = n.
3.3 Expected smoothness
All the bounds we develop on ℒ are based on the following result, which is a specialization of Lemma 1 to b-nice sampling.
Proposition 1 (Expected smoothness constant).
For the b-nice sampling S with b ∈ {1, …, n}, the expected smoothness constant is given by
(17) ℒ(b) = max_{i=1,…,n} (n−1 choose b−1)^{−1} ∑_{C ∋ i, |C|=b} L_C.
Proof.
Let S be the b-nice sampling as defined in Definition 3 and let v = v(S) be its corresponding sampling vector. Note that each subset C of size b has probability (n choose b)^{−1}, so that for any i,
E[L_S v_i] = (n/b) (n choose b)^{−1} ∑_{C ∋ i, |C|=b} L_C = (n−1 choose b−1)^{−1} ∑_{C ∋ i, |C|=b} L_C.
Finally, from Lemma 1, we have that ℒ(b) = max_i E[L_S v_i]. Taking the maximum over all i gives the result. ∎
The first bound we present is technically the simplest to derive, which is why we refer to it as the simple bound.
Theorem 2 (Simple bound).
For a b-nice sampling S, for any b ∈ {1, …, n}, we have that
(18) ℒ(b) ≤ ℒ_simple(b) := (n(b−1))/(b(n−1)) L̄ + (n−b)/(b(n−1)) L_max.
Proof.
The proof, given in Section A.2, starts by using the fact that L_C ≤ (1/|C|) ∑_{i∈C} L_i for all subsets C, which follows from repeatedly applying Lemma 8 in the appendix. The remainder of the proof follows by straightforward counting arguments. ∎
The previous bound interpolates, respectively for b = 1 and b = n, between L_max and L̄. On the one hand, ℒ_simple is a good bound for ℒ when b is small, since ℒ_simple(1) = L_max = ℒ(1). On the other hand, ℒ_simple may not be a good bound for large b, since ℒ_simple(n) = L̄ ≥ L = ℒ(n), thanks to (13). Thus ℒ_simple does not achieve the left-hand side of (8). Indeed, L̄ can be far from L. For instance, if f is a quadratic function, then L is the largest eigenvalue of its Hessian while L̄ grows with its trace. Thus if the eigenvalues of the Hessian are all equal, then L̄ can exceed L by a factor of the dimension, whereas if one eigenvalue is significantly larger than the rest, then L̄ is close to L. We numerically explore such extreme settings in Section 5.
Due to this shortcoming of ℒ_simple, we now derive the Bernstein bound. This bound explicitly depends on L instead of L̄, and is developed through a specialized variant of a matrix Bernstein inequality (Tropp, 2012; 2015) for sampling without replacement in Appendix C.
Theorem 3 (Bernstein bound).
The expected smoothness constant is upper bounded by
(19) 
Checking again the bounds (8), we have on the one hand a little bit of slack for b small, since the Bernstein bound does not reduce to L_max at b = 1. On the other hand, using L ≥ L_max/n (see Lemma 10 in the appendix), we see that for b = n the Bernstein bound exceeds L only by a term that depends logarithmically on the dimension. Thus we expect the Bernstein bound to be more useful in the large b and large n regimes, as compared to the simple bound. We confirm this numerically in Section 5.1.
Remark 2.
The simple bound is relatively tight for b small, while the Bernstein bound is better for b and n large. Fortunately, we can obtain a more refined bound by taking the minimum of the simple and the Bernstein bounds. This is highlighted numerically in Section 5.
Next we propose a practical estimate of that is tight for both small and large minibatch sizes.
Definition 5 (Practical estimate).
(20) ℒ_practical(b) := (n(b−1))/(b(n−1)) L + (n−b)/(b(n−1)) L_max.
Indeed, ℒ_practical(1) = L_max and ℒ_practical(n) = L, achieving both limits of (8). The downside to ℒ_practical is that it is not an upper bound of ℒ. Rather, we are able to show that ℒ_practical is very close to a valid expected smoothness constant, but it can be slightly smaller. Our theoretical justification for using ℒ_practical comes from a mid step in the proof of the Bernstein bound, which is captured in the next lemma.
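The endpoint claims for the practical estimate can be checked against the exact constant of Eq. 17 on a tiny ridge problem, computed by brute-force enumeration. The interpolation coefficients below are our reconstruction of (20), and the function names are ours:

```python
import numpy as np
from itertools import combinations
from math import comb

def exact_L(X, b, lam):
    """Exact expected smoothness for ridge, by enumerating all minibatches."""
    n = X.shape[0]
    sums = np.zeros(n)
    for C in combinations(range(n), b):
        XC = X[list(C)]
        L_C = np.linalg.eigvalsh(XC.T @ XC).max() / b + lam
        for i in C:
            sums[i] += L_C
    return sums.max() / comb(n - 1, b - 1)

def practical_estimate(b, n, L, L_max):
    """Reconstructed practical estimate: L_max at b = 1, L at b = n."""
    return n * (b - 1) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max

rng = np.random.default_rng(1)
n, d, lam = 6, 3, 0.1
X = rng.standard_normal((n, d))
L = np.linalg.eigvalsh(X.T @ X).max() / n + lam   # smoothness of f
L_max = ((X ** 2).sum(axis=1) + lam).max()        # max individual smoothness
```

At the extreme batch sizes the estimate matches the exact constant, consistent with the fact that both limits of (8) are achieved.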
Lemma 3.
Let S be a b-nice sampling over {1, …, n}, for every b ∈ {1, …, n}. It follows that
(21) 
with .
Proof.
The proof is given in Section A.3. ∎
Lemma 3 shows that the expected smoothness constant is upper bounded by ℒ_practical plus an additional term. In this additional term we have the largest eigenvalue of a random matrix. This matrix is zero in expectation, and we also find that its eigenvalues oscillate around zero. Indeed, we provide extensive experiments in Section 5 confirming that ℒ_practical is very close to ℒ given in (17).
4 Optimal minibatch sizes
Now that we have established the simple and the Bernstein bounds, we can minimize the total complexity (16) with respect to the minibatch size.
Remark 3.
The right-hand side term is common to all our bounds since it does not depend on ℒ. It decreases linearly as b grows from 1 to n.
We note that the left-hand side term is a linearly increasing function of b, because L ≥ L_max/n (as proven in Lemma 10). One can easily verify that the two terms cross, as presented in Figure 2, by comparing their initial values at b = 1 and their final values at b = n.
Consequently, solving for the crossing point in b gives the optimal minibatch size
(22) 
For the Bernstein bound, plugging (19) into (16) leads to
(23) 
where
The function is also linearly increasing in b, and its initial and final values, at b = 1 and b = n, can be compared in the same way as for the simple bound.
Yet, it is unclear whether one term is dominated by the other. This is why we need to distinguish two cases to minimize the total complexity, which leads to the following solution.
In the first case, the problem is well-conditioned and the two terms do cross at a minibatch size between 1 and n. In the second case, the total complexity is governed by a single term for all minibatch sizes, which directly determines the resulting optimal minibatch size.
5 Numerical study
All the experiments were run in Julia and the code is freely available on github.com/gowerrobert/StochOpt.jl.
5.1 Upper bounds of the expected smoothness constant
First we experimentally verify that our upper bounds hold, and measure how much slack there is between them and ℒ given in Equation 17. For artificially generated small data sets, we compute Equation 17 and compare it to our simple and Bernstein bounds, and to our practical estimate. Our data are artificially generated matrices.
In Figure 4 we see that ℒ_practical is arbitrarily close to ℒ, making it hard to distinguish the two line plots. This was the case in many other experiments, which we defer to Section E.1. For this reason, we use ℒ_practical in our experiments with the SAGA method.
Furthermore, in accordance with our discussion in Section 3.3, we have that the simple and Bernstein bounds are close to ℒ when b is small and large, respectively. In Section E.2 we show, using publicly available datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) and the UCI repository (https://archive.ics.uci.edu/ml/datasets/), that the simple bound performs better than the Bernstein bound for smaller problems, and conversely for significantly larger ones or when scaling the data.
5.2 Related step size estimation
Different bounds on ℒ also give different step sizes through (14). Plugging our estimates ℒ_simple, ℒ_Bernstein and ℒ_practical into (14) gives the step sizes γ_simple, γ_Bernstein and γ_practical, respectively. We compare our resulting step sizes to the one obtained from ℒ given by Eq. 17, and to the step size given by Hofmann et al. (2015). We can see in Figure 4 that for small b, all the step sizes are approximately the same, with the exception of the Bernstein step size. For larger b, all of our step sizes are larger than that of Hofmann et al. (2015); in particular, γ_practical is significantly larger. These observations are verified on other artificial and real data examples in Sections E.4 and E.3.
5.3 Comparison with previous SAGA settings
Here we compare the performance of SAGA when using the minibatch size and step size given in (Defazio et al., 2014) and in (Hofmann et al., 2015) to our new practical minibatch size and step size. Our goal is to verify how much our parameter setting can improve practical performance. We also compare with a step size obtained by grid search. These methods are run until they reach a fixed relative error. We find in Figure 5 that our parameter settings significantly outperform the previously suggested parameters, and are even comparable to grid search. In Section E.5, we show that previously suggested settings can lead to very poor performance compared to ours. We also show that our settings perform very well both in terms of epochs and time.
5.4 Optimality of our minibatch size
In the last experiment, detailed in Section E.6, we show that our estimate of the optimal minibatch size is close to the best one found through a grid search. We build a grid of minibatch sizes and, as in Section 5.3, compute the empirical complexity required to achieve a fixed relative error.
In Figure 6 we can see that the empirical complexity of the optimal minibatch size calculated through grid search is very close to the empirical complexity resulting from using our practical minibatch size. What is even more interesting is that our estimate seems to predict a regime change, beyond which using a larger minibatch size results in a much larger empirical complexity.
6 Conclusions
We have explained the crucial role of the expected smoothness constant in the convergence of a family of stochastic variance-reduced descent algorithms. We have developed functional upper bounds of this constant and used them to build larger step sizes and closed-form optimal minibatch values for the b-nice SAGA algorithm. Our experiments on artificial and real datasets showed the validity of our upper bounds and the improvement in total complexity when using our step and optimal minibatch sizes. Our results suggest a new parameter setting for minibatch SAGA, which significantly outperforms previously suggested ones, and is even comparable with a grid-search approach, without the computational burden of the latter.
Acknowledgements
This work was supported by grants from DIM Math Innov Région Île-de-France (ED574 - FMJH) and by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with the Gaspard Monge Program for optimization, operations research and their interactions with data sciences.
References
 Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In STOC, 2017.
 Bach (2012) Bach, F. Sharp analysis of lowrank kernel matrix approximations. ArXiv eprints, August 2012.

Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pp. 1646–1654. 2014.
 Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017.
 Gower et al. (2018) Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasigradient methods: Variance reduction via jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.
 Gross & Nesme (2010) Gross, D. and Nesme, V. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.

Hoeffding (1963) Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Hofmann et al. (2015) Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pp. 2305–2313, 2015.
 Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pp. 315–323. Curran Associates, Inc., 2013.
 Konečný & Richtárik (2017) Konečný, J. and Richtárik, P. Semistochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.
 Nesterov (2014) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014.
 Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2613–2621. PMLR, Aug 2017.
 Raj & Stich (2018) Raj, A. and Stich, S. U. SVRG meets SAGA: k-SVRG, a tale of limited memory. arXiv:1805.00982, 2018.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
 Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, Mar 2017.
 Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, February 2013.
 Tropp (2011) Tropp, J. A. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115–126, 2011.
 Tropp (2012) Tropp, J. A. Userfriendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012. doi: 10.1007/s102080119099z.
 Tropp (2015) Tropp, J. A. An Introduction to Matrix Concentration Inequalities. ArXiv eprints, January 2015.
Appendix A Proofs of the upper bounds of ℒ
A.1 Master lemma
Proof of Lemma 1.
Since the ’s are convex, each realization of is convex, and it follows from equation 2.1.7 in (Nesterov, 2014) that
(24) 
Taking expectation over the sampling gives
where in the last equality the full gradient vanishes because it is computed at optimality. The result now follows by comparing the above with the definition of expected smoothness in (7). ∎
A.2 Proof of the simple bound
Proof of Theorem 2.
To derive this bound on ℒ(b) we use that
(25) L_C ≤ (1/|C|) ∑_{i∈C} L_i for all subsets C ⊆ {1, …, n},
which follows from repeatedly applying Lemma 8. For b ≥ 2, it follows from Equation 17 and Equation 25 that
(26) ℒ(b) ≤ max_{i} (n−1 choose b−1)^{−1} ∑_{C ∋ i, |C|=b} (1/b) ∑_{j∈C} L_j.
Using a double counting argument we can show that
(27) ∑_{C ∋ i, |C|=b} ∑_{j∈C} L_j = (n−2 choose b−2) ∑_{j≠i} L_j + (n−1 choose b−1) L_i.
Inserting this into Equation 26 gives
(28) ℒ(b) ≤ (n(b−1))/(b(n−1)) L̄ + (n−b)/(b(n−1)) L_max.
We also verify that this bound is valid for b = 1. Indeed, we already have that in this case ℒ(1) = L_max. ∎
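The key inequality (25), that the subsample smoothness constant is at most the average of the individual ones, can be verified numerically for the ridge model of Lemma 2. It holds because the largest eigenvalue is subadditive, λ_max(∑ A_i) ≤ ∑ λ_max(A_i) for symmetric matrices A_i:

```python
import numpy as np
from itertools import combinations

# Check L_C <= (1/|C|) sum_{i in C} L_i on random ridge data, where
# L_C = eigmax(X_C X_C^T)/|C| + lam and L_i = ||x_i||^2 + lam.
rng = np.random.default_rng(0)
n, d, lam = 7, 3, 0.1
X = rng.standard_normal((n, d))
L_ind = (X ** 2).sum(axis=1) + lam
ok = True
for b in range(1, n + 1):
    for C in combinations(range(n), b):
        XC = X[list(C)]
        L_C = np.linalg.eigvalsh(XC @ XC.T).max() / b + lam
        ok &= L_C <= L_ind[list(C)].mean() + 1e-10
```

Here the inequality is checked exhaustively over all subsets, which is feasible only because n is tiny.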
A.3 Proof of the Bernstein bound
To start the proof of Theorem 3, we rewrite the expected smoothness constant as the maximum over an expectation. Let S be a b-nice sampling over {1, …, n}. We can write
(29) ℒ(b) = max_{i=1,…,n} E[ L_S | i ∈ S ].
One can go back to the definition of the subsample smoothness constant in Equation 12 and interpret the previous expression as an expectation of the largest eigenvalue of a sum of matrices. This insight allows us to apply a matrix Bernstein inequality, see Theorem 7, to bound ℒ(b).
For the proof of Theorem 3, we first need the two following results.
Lemma 4.
Let C ⊆ {1, …, n}, and let S be a b-nice sampling over the set {1, …, n}. It follows that