SGD: General Analysis and Improved Rates

01/27/2019 · Robert Mansel Gower et al. · King Abdullah University of Science and Technology

We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm. Our theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. This is the first time such an analysis is performed, and most of our variants of SGD were never explicitly considered in the literature before. Our analysis relies on the recently introduced notion of expected smoothness and does not rely on a uniform bound on the variance of the stochastic gradients. By specializing our theorem to different mini-batching strategies, such as sampling with replacement and independent sampling, we derive exact expressions for the stepsize as a function of the mini-batch size. With this we can also determine the mini-batch size that optimizes the total complexity, and show explicitly that as the variance of the stochastic gradient evaluated at the minimum grows, so does the optimal mini-batch size. For zero variance, the optimal mini-batch size is one. Moreover, we prove insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime.


1 Introduction

We consider the optimization problem

min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),   (1)

where each f_i is smooth (but not necessarily convex). Further, we assume that f has a unique global minimizer x* (this assumption can be relaxed, but for simplicity of exposition we enforce it) and is μ–strongly quasi-convex (Karimi et al., 2016; Necoara et al., 2018):

f(x*) ≥ f(x) + ⟨∇f(x), x* − x⟩ + (μ/2) ‖x* − x‖²   (2)

for all x ∈ ℝ^d.

1.1 Background and contributions

Stochastic gradient descent (SGD) (Robbins & Monro, 1951; Nemirovski & Yudin, 1978, 1983; Shalev-Shwartz et al., 2007; Nemirovski et al., 2009; Hardt et al., 2016) has become the workhorse for training supervised machine learning problems which have the generic form (1).

Linear convergence of SGD. One of the first non-asymptotic analyses of SGD was given by Moulines & Bach (2011), who provided, among other results, a linear convergence analysis for strongly convex f up to a certain noise level. Needell et al. (2016) improved upon these results by removing the quadratic dependency on the condition number in the iteration complexity, and considered importance sampling. The analysis of Needell et al. (2016) was later extended to a mini-batch variant where the mini-batches are formed by partitioning the data (Needell & Ward, 2017). These works are the main starting point for ours.

Contributions: We further tighten and generalize these results to virtually all forms of sampling. We introduce an expected smoothness assumption (Assumption 2.1), first proposed in Gower et al. (2018) in the context of a certain class of variance-reduced methods. This assumption is a joint property of f and the sampling scheme 𝒟 utilized by an SGD method, and allows us to prove a generic complexity result (Theorem 3.1) that holds for arbitrary sampling schemes. Our work is the first to analyse SGD under this assumption. We obtain linear convergence rates without strong convexity; in particular, we assume only strong quasi-convexity (a class that includes some non-convex functions as well). Furthermore, we do not require the individual functions f_i to be convex.

Gradient noise assumptions. Shamir & Zhang (2013) extended the analysis of SGD to convex non-smooth optimization (including the strongly convex case). However, their proofs still rely on the assumption that the variance of the stochastic gradient is bounded for all iterates of the algorithm: there exists c > 0 such that E‖∇f_{v^k}(x^k)‖² ≤ c for all k. The same assumption was used in the analysis of several recent papers (Recht et al., 2011; Hazan & Kale, 2014; Rakhlin et al., 2012). Bottou et al. (2018) establish linear convergence of SGD under a somewhat less restrictive condition, namely a bound of the form c₁ + c₂‖∇f(x^k)‖² on this variance for all k. Recently, Nguyen et al. (2018) turned this assumption into a theorem by establishing formulas for c₁ and c₂ under some reasonable conditions, and provided further insights into the workings of SGD and its parallel asynchronous cousin, Hogwild!. Similar conditions have also been proved and used in the analysis of decentralized variants of SGD (Lian et al., 2017; Assran et al., 2018). Based on a strong growth condition on the stochastic gradients, Cevher & Vu (2017) give sufficient and necessary conditions for the linear convergence of SGD. This strong growth condition holds when using SGD to solve a consistent linear system, but it fails for a wide range of problems.

Contributions: Our analysis does not directly assume a growth condition. Instead, we make use of the remarkably weak expected smoothness assumption.

Optimal mini-batch size. Recently it was experimentally shown by Goyal et al. (2017) that using larger mini-batch sizes is key to efficient training of large-scale non-convex problems, enabling the training of ImageNet in under 1 hour. The authors conjectured that the stepsize should grow linearly with the mini-batch size.

Contributions: We prove (see Section 4) that this is the case, up to a certain optimal mini-batch size, and provide exact formulas for the dependence of the stepsize on the mini-batch size.

Learning schedules. Chee & Toulis (2018) develop techniques for detecting the convergence of SGD within a region around the solution.

Contributions: We provide a closed-form formula for when SGD should switch from a constant to a decreasing stepsize (see Theorem 3.2). Further, we clearly show how the optimal stepsize (learning rate) increases and the iteration complexity decreases as the mini-batch size increases, for both independent sampling and sampling with replacement. We also recover the well-known convergence rate of gradient descent (GD) when the mini-batch size is n; this is the first time a generic SGD analysis recovers the correct rate of GD.

Over-parameterized models. There has been some recent work in analysing SGD in the setting where the underlying model being trained has more parameters than there is data available. In this zero–noise setting, Ma et al. (2018) showed that SGD converges linearly.

Contributions: In the case of over-parametrized models, we extend the findings of Ma et al. (2018) (recently extended to the accelerated case by Vaswani et al. (2018); we do not study accelerated methods in this work) to independent sampling and sampling with replacement, by showing that the optimal mini-batch size is one. Moreover, we provide results in the more general setting where the model is not necessarily over-parametrized.

Practical performance. We corroborate our theoretical results with extensive experimental testing.

1.2 Stochastic reformulation

In this work we provide a single theorem through which we can analyse all importance sampling and mini-batch variants of SGD. To do this, we need to introduce a

sampling vector

which we will use to re-write our problem (1).

Definition 1.1 (Sampling vector).

We say that a random vector v = (v₁, …, vₙ) ∈ ℝⁿ drawn from some distribution 𝒟 is a sampling vector if its mean is the vector of all ones:

E_𝒟[v_i] = 1  for all i = 1, …, n.   (3)

With each distribution 𝒟 we now introduce a stochastic reformulation of (1) as follows:

min_{x ∈ ℝ^d} E_𝒟[f_v(x)],  where  f_v(x) := (1/n) Σ_{i=1}^n v_i f_i(x).   (4)

By the definition of the sampling vector, f_v(x) and ∇f_v(x) are unbiased estimators of f(x) and ∇f(x), respectively, and hence problem (4) is indeed equivalent to (i.e., a reformulation of) the original problem (1). In the case of the gradient, for instance, we get

E_𝒟[∇f_v(x)] = (1/n) Σ_{i=1}^n E_𝒟[v_i] ∇f_i(x) = ∇f(x).   (5)

Similar but different stochastic reformulations were recently proposed by Richtárik & Takáč (2017) and further used in (Loizou & Richtárik, 2017) for the more special problem of solving linear systems, and by Gower et al. (2018) in the context of variance-reduced methods. Reformulation (4) can be solved using SGD in a natural way:

x^{k+1} = x^k − γ^k ∇f_{v^k}(x^k),   (6)

where v^k ∼ 𝒟 is sampled i.i.d. at each iteration and γ^k > 0 is a stepsize. However, for different distributions 𝒟, (6) has a different interpretation as an SGD method for solving the original problem (1). In our main result we will analyse (6) for any 𝒟 satisfying (3). By substituting specific choices of 𝒟, we obtain specific variants of SGD for solving (1).
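For concreteness, the following minimal Python sketch implements iteration (6) for a user-supplied sampling distribution; the function and variable names are ours, and the only structural requirement on sample_v is the unbiasedness condition (3).

```python
import numpy as np

def sgd_reformulation(grad_fi, n, d, sample_v, gamma, num_iters, x0=None):
    """Run SGD (6) on the stochastic reformulation (4).

    grad_fi(i, x) -- gradient of the i-th component f_i at x
    sample_v()    -- draws a sampling vector v in R^n with E[v_i] = 1, see (3)
    gamma         -- constant stepsize
    """
    x = np.zeros(d) if x0 is None else x0.copy()
    for _ in range(num_iters):
        v = sample_v()
        g = np.zeros(d)
        for i in np.flatnonzero(v):   # only non-zero entries of v contribute
            g += v[i] * grad_fi(i, x)
        x -= gamma * g / n            # x^{k+1} = x^k - gamma * grad f_{v^k}(x^k)
    return x
```

For instance, uniform single element sampling corresponds to sample_v returning n·e_i for a uniformly random index i, in which case ∇f_v(x) = ∇f_i(x).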

2 Expected Smoothness and Gradient Noise

In our analysis of SGD (6) applied to the stochastic reformulation (4) we rely on a generic and remarkably weak assumption of expected smoothness, which we now define and relate to existing growth conditions.

2.1 Expected smoothness

Expected smoothness (Gower et al., 2018) is an assumption that combines the properties of the distribution 𝒟 and the smoothness properties of the function f.

Assumption 2.1 (Expected Smoothness).

We say that f is ℒ–smooth in expectation with respect to a distribution 𝒟 if there exists ℒ > 0 such that

E_𝒟[‖∇f_v(x) − ∇f_v(x*)‖²] ≤ 2ℒ (f(x) − f(x*))   (7)

for all x ∈ ℝ^d. For simplicity, we will write (f, 𝒟) ∼ ES(ℒ) to say that (7) holds. When 𝒟 is clear from the context, we will often omit mentioning it, and simply state that the expected smoothness constant is ℒ.

In Section 3.3 we show how convexity and smoothness of the functions f_i imply expected smoothness. However, the opposite implication does not hold. Indeed, the expected smoothness assumption can hold even when the f_i's and f are not convex, as we show in the next example.

Example 2.2 (Non-convexity and expected smoothness).

Let f_i(x) = φ(x) for i = 1, …, n, where φ is an L–smooth and non-convex function which has a global minimizer x* (such functions exist: there are invex functions that satisfy these conditions (Karimi et al., 2016); as an example, φ(x) = x² + 3 sin²(x) is smooth, non-convex, and has a unique global minimizer). Consequently f = φ, and ∇f_v(x) = ((1/n) Σ_i v_i) ∇φ(x) for any sampling vector v. Letting β := E[((1/n) Σ_i v_i)²], we have

E‖∇f_v(x) − ∇f_v(x*)‖² = β ‖∇φ(x)‖² ≤ 2βL (f(x) − f(x*)),

where the last inequality follows from Proposition A.1. So, (f, 𝒟) ∼ ES(ℒ) for ℒ = βL.
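As a numerical illustration of this example, the sketch below uses the invex function φ(x) = x² + 3 sin²(x) mentioned above and checks the key inequality ‖∇φ(x)‖² ≤ 2L(φ(x) − φ(x*)) of Proposition A.1 on a grid (the grid and tolerance are our own choices); here L = 8, since φ″(x) = 2 + 6 cos(2x) lies in [−4, 8].

```python
import numpy as np

# phi(x) = x^2 + 3 sin^2(x): smooth, non-convex, unique global minimizer x* = 0
phi  = lambda x: x**2 + 3 * np.sin(x)**2
dphi = lambda x: 2 * x + 3 * np.sin(2 * x)   # phi'(x)
L_phi = 8.0                                  # sup |phi''(x)| = sup |2 + 6 cos(2x)| = 8

xs = np.linspace(-10, 10, 200001)
xs = xs[np.abs(phi(xs)) > 1e-12]             # avoid dividing by ~0 near x* = 0
ratios = dphi(xs)**2 / (2 * (phi(xs) - phi(0.0)))
print(f"max ratio = {ratios.max():.4f} <= L_phi = {L_phi}")
```

The printed maximum approaches 8 near the minimizer but never exceeds it, consistent with Proposition A.1.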

2.2 Gradient noise

Our second key assumption is finiteness of gradient noise, defined next:

Assumption 2.3 (Finite Gradient Noise).

The gradient noise σ², defined by

σ² := E_𝒟[‖∇f_v(x*)‖²],   (8)

is finite.

This is a very weak assumption, and should intuitively be seen as an assumption on 𝒟 rather than on f. For instance, if the sampling vector v is non-negative with probability one and ‖∇f_i(x*)‖ is finite for all i, then σ² is finite. When (1) is the training problem of an over-parametrized model, which often occurs in deep neural networks, each individual loss function f_i attains its minimum at x*, and thus ∇f_i(x*) = 0. It follows that σ² = 0.
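When x* (or a good approximation of it) is available, σ² can be estimated by Monte Carlo. A minimal sketch, reusing the hypothetical grad_fi and sample_v interfaces from the earlier listing:

```python
import numpy as np

def estimate_sigma_sq(grad_fi, n, x_star, sample_v, num_samples=10_000):
    """Monte Carlo estimate of the gradient noise sigma^2 = E ||grad f_v(x*)||^2, see (8)."""
    total = 0.0
    for _ in range(num_samples):
        v = sample_v()
        g = np.zeros_like(x_star, dtype=float)
        for i in np.flatnonzero(v):
            g += v[i] * grad_fi(i, x_star)
        g /= n
        total += np.dot(g, g)
    return total / num_samples
```

In the over-parametrized regime each ∇f_i(x*) = 0, so every sample vanishes and the estimate correctly returns σ² = 0, matching the discussion above.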

2.3 Key lemma and connection to the weak growth condition

A common assumption used to prove the convergence of SGD is uniform boundedness of the stochastic gradients (alternatively, boundedness is assumed for all iterates; but this too has issues, since it implicitly assumes that the iterates remain within a compact set, and yet it is used to prove convergence to within a compact set, raising the issue of a circular argument): there exists c > 0 such that E‖∇f_{v^k}(x^k)‖² ≤ c for all k. However, this assumption often does not hold, such as in the case when f is strongly convex (Bottou et al., 2018; Nguyen et al., 2018). We do not assume such a bound. Instead, we use the following direct consequence of expected smoothness to bound the expected norm of the stochastic gradients.

Lemma 2.4.

If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟‖∇f_v(x)‖² ≤ 4ℒ (f(x) − f(x*)) + 2σ².   (9)

When the gradient noise is zero (σ² = 0), inequality (9) is known as the weak growth condition (Vaswani et al., 2018). We have the following corollary:

Corollary 2.5.

If (f, 𝒟) ∼ ES(ℒ) and if σ² = 0, then f satisfies the weak growth condition

E_𝒟‖∇f_v(x)‖² ≤ 4ℒ (f(x) − f(x*)).

This corollary should be contrasted with Proposition 2 in Vaswani et al. (2018) and Lemma 1 in Nguyen et al. (2018), where it is shown, by assuming the functions f_i to be smooth and convex, that the weak growth condition holds with a constant governed by L_max := max_i L_i. However, as we will show in Lemma C.1, ℒ ≤ L_max, and hence our bound is often tighter.

3 Convergence Analysis

3.1 Main results

We now present our main theorem, and include its proof to highlight how we make use of expected smoothness and gradient noise.

Theorem 3.1.

Assume f is μ–quasi-strongly convex and that (f, 𝒟) ∼ ES(ℒ). Choose γ^k = γ ∈ (0, 1/(2ℒ)] for all k. Then the iterates of SGD given by (6) satisfy:

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖² + 2γσ²/μ.   (10)

Hence, given any ε > 0, choosing the stepsize

γ = min{ 1/(2ℒ), εμ/(4σ²) }   (11)

and

k ≥ max{ 2ℒ/μ, 4σ²/(εμ²) } log( 2‖x^0 − x*‖²/ε )   (12)

implies E‖x^k − x*‖² ≤ ε.

Proof.

Let r^k := x^k − x*. From (6), we have

‖r^{k+1}‖² = ‖x^k − γ∇f_{v^k}(x^k) − x*‖² = ‖r^k‖² − 2γ⟨r^k, ∇f_{v^k}(x^k)⟩ + γ²‖∇f_{v^k}(x^k)‖².

Taking expectation conditioned on x^k, and using the unbiasedness (5), we obtain:

E[‖r^{k+1}‖² | x^k] = ‖r^k‖² − 2γ⟨r^k, ∇f(x^k)⟩ + γ² E[‖∇f_{v^k}(x^k)‖² | x^k].

Taking expectations again and using quasi-strong convexity (2) together with Lemma 2.4:

E‖r^{k+1}‖² ≤ (1 − γμ) E‖r^k‖² + 2γ(2γℒ − 1) E[f(x^k) − f(x*)] + 2γ²σ² ≤ (1 − γμ) E‖r^k‖² + 2γ²σ²,

where we used in the last inequality that 2γℒ ≤ 1 since γ ≤ 1/(2ℒ). Recursively applying the above and summing up the resulting geometric series gives

E‖r^k‖² ≤ (1 − γμ)^k ‖r^0‖² + 2γ²σ² Σ_{j=0}^{k−1} (1 − γμ)^j ≤ (1 − γμ)^k ‖r^0‖² + 2γσ²/μ.   (13)

To obtain an iteration complexity result from the above, we use standard techniques as shown in Section A.1. ∎

Note that we do not assume f nor the f_i to be convex. Theorem 3.1 states that SGD converges linearly up to the additive constant 2γσ²/μ, which depends on the gradient noise σ² and on the stepsize γ. We obtain a more accurate solution with a smaller stepsize, but then the convergence rate slows down. Since we control 𝒟, we also control ℒ and σ² (we compute these parameters for several distributions in Section 3.3).
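Theorem 3.1 is easy to probe numerically. The self-contained sketch below (our own toy instance, not from the paper's experiments) runs SGD with uniform single element sampling on a small ridge-type problem; we use L_max as a safe stand-in for ℒ (valid since ℒ ≤ L_max), so γ = 1/(2L_max) is an admissible stepsize. Here σ² can be computed exactly at x*, and the error plateau should sit below the 2γσ²/μ neighborhood, up to Monte Carlo fluctuation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 50, 10, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# f_i(x) = 0.5*(a_i^T x - b_i)^2 + (mu/2)*||x||^2, so f is mu-strongly convex
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i] + mu * x
L_max = max(np.linalg.norm(A[i]) ** 2 + mu for i in range(n))
gamma = 1 / (2 * L_max)

x_star = np.linalg.solve(A.T @ A / n + mu * np.eye(d), A.T @ b / n)
# uniform single element sampling: sigma^2 = (1/n) * sum_i ||grad f_i(x*)||^2
sigma_sq = np.mean([np.linalg.norm(grad_fi(i, x_star)) ** 2 for i in range(n)])

x, errs = np.zeros(d), []
for k in range(50_000):
    i = rng.integers(n)                  # v = n*e_i, so grad f_v(x) = grad f_i(x)
    x -= gamma * grad_fi(i, x)
    errs.append(np.linalg.norm(x - x_star) ** 2)

print("plateau (mean of last 10k):", np.mean(errs[-10_000:]))
print("theoretical neighborhood  :", 2 * gamma * sigma_sq / mu)
```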

Furthermore, we can control this additive constant by carefully choosing the stepsize, as shown in the next result.

Theorem 3.2 (Decreasing stepsizes).

Assume f is μ–quasi-strongly convex and that (f, 𝒟) ∼ ES(ℒ). Let K := 2ℒ/μ and

γ^k = 1/(2ℒ)  for k ≤ 4⌈K⌉,   γ^k = (2k + 1)/((k + 1)² μ)  for k > 4⌈K⌉.   (14)

If k ≥ 4⌈K⌉, then SGD iterates given by (6) satisfy:

E‖x^k − x*‖² ≤ (8σ²)/(μ² k) + (16⌈K⌉²)/(e² k²) ‖x^0 − x*‖².   (15)
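The switching rule (14) is straightforward to implement. The sketch below simply restates the schedule as we have written it above (the constants come from our statement of (14)):

```python
import math

def stepsize(k, L_es, mu):
    """Stepsize switching rule (14) of Theorem 3.2, as stated above.

    Constant stepsize 1/(2*L_es) until k* = 4*ceil(K), with K = 2*L_es/mu,
    then the O(1/k) decreasing schedule gamma_k = (2k+1)/((k+1)^2 * mu)."""
    k_star = 4 * math.ceil(2 * L_es / mu)
    if k <= k_star:
        return 1.0 / (2.0 * L_es)
    return (2.0 * k + 1.0) / ((k + 1.0) ** 2 * mu)
```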

3.2 Choosing the distribution 𝒟

For (6) to be efficient, the sampling vector v should be sparse. For this reason we will construct v so that only a (small and random) subset of its entries are non-zero.

Before we formally define v, let us first establish some random set terminology. Let [n] := {1, 2, …, n} and, for S ⊆ [n], let e_S := Σ_{i∈S} e_i, where e₁, …, eₙ are the standard basis vectors in ℝⁿ. These subsets will be selected using a random set-valued map S, referred to in the literature as a sampling (Richtárik & Takáč, 2016; Qu & Richtárik, 2016). A sampling is uniquely characterized by choosing subset probabilities

P[S = C] = p_C ≥ 0  for all subsets C ⊆ [n],   (16)

where Σ_{C⊆[n]} p_C = 1. We will only consider proper samplings. A sampling is called proper if p_i := P[i ∈ S] is positive for all i.

The first analysis of a randomized optimization method with an arbitrary (proper) sampling was performed by Richtárik & Takáč (2016) in the context of randomized coordinate descent for strongly convex functions. This arbitrary sampling paradigm was later adopted in many other settings, including accelerated coordinate descent for strongly convex functions (Hanzely & Richtárik, 2018), coordinate and accelerated coordinate descent for convex functions (Qu & Richtárik, 2016), primal-dual methods (Qu et al., 2015; Chambolle et al., 2018), and variance-reduced methods with convex (Csiba & Richtárik, 2015) and non-convex (Horváth & Richtárik, 2018) objectives. Arbitrary sampling arises as a special case of our more general analysis by specializing the sampling vector v to one dependent on a sampling S. We now define a practical sampling vector as follows:

Lemma 3.3.

Let S be a proper sampling, and let p_i := P[i ∈ S] > 0. Then the random vector v with entries

v_i = 𝟙(i ∈ S)/p_i,  for i = 1, …, n,   (17)

is a sampling vector.

Proof.

Note that v_i = 𝟙(i ∈ S)/p_i, where 𝟙(i ∈ S) is the indicator function of the event i ∈ S. It follows that E[v_i] = E[𝟙(i ∈ S)]/p_i = P[i ∈ S]/p_i = 1. ∎

We can further specialize and define the following commonly used samplings. Each sampling gives rise to a particular sampling vector (i.e., a distribution 𝒟), which in turn gives rise to a particular stochastic reformulation (4) and SGD variant (6).

Independent sampling. The sampling S includes every i ∈ [n], independently, with probability p_i > 0. This type of sampling was considered in different contexts in Horváth & Richtárik (2018); Hanzely & Richtárik (2018).

Partition sampling. A partition of [n] is a set 𝒢 consisting of subsets of [n] such that ∪_{C∈𝒢} C = [n] and C ∩ C′ = ∅ for any distinct C, C′ ∈ 𝒢. A partition sampling S picks C ∈ 𝒢 with probability p_C > 0, where Σ_{C∈𝒢} p_C = 1.

Single element sampling. Only the singleton sets {i} for i ∈ [n] have a non-zero probability of being sampled; that is, p_i = P[S = {i}] > 0. We have Σ_{i=1}^n p_i = 1.

τ–nice sampling. We say that S is τ–nice if it samples uniformly at random from all subsets of [n] of cardinality τ. In this case we have p_i = τ/n for all i. So, P[S = C] = 1/(n choose τ) for all subsets C ⊆ [n] with τ elements.
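Each of these samplings yields a sampling vector via Lemma 3.3. A minimal sketch of three of them (function names are ours); the sanity check at the end verifies the unbiasedness condition (3) empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def independent_sampling(p):
    """v_i = 1{i in S}/p_i with each i included independently w.p. p_i, see (17)."""
    mask = rng.random(len(p)) < p
    return mask / p

def single_element_sampling(p):
    """S = {i} with probability p_i; v = e_i / p_i."""
    n = len(p)
    i = rng.choice(n, p=p)
    v = np.zeros(n)
    v[i] = 1 / p[i]
    return v

def tau_nice_sampling(n, tau):
    """S is a uniformly random subset of size tau; p_i = tau/n for all i."""
    v = np.zeros(n)
    v[rng.choice(n, size=tau, replace=False)] = n / tau
    return v

# sanity check: E[v] should be the vector of all ones, per (3)
n, tau = 10, 3
print(np.mean([tau_nice_sampling(n, tau) for _ in range(100_000)], axis=0))
```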

3.3 Bounding ℒ and σ²

By assuming that the functions f_i are convex and smooth, we can calculate closed-form expressions for the expected smoothness constant ℒ and the gradient noise σ². In particular, we make the following smoothness assumption:

Assumption 3.4.

For each i there exists a symmetric positive semi-definite matrix M_i such that

f_i(x + h) ≤ f_i(x) + ⟨∇f_i(x), h⟩ + (1/2) ⟨M_i h, h⟩   (18)

for all x, h ∈ ℝ^d, and we write M := (1/n) Σ_{i=1}^n M_i. In this case we say that f_i is M_i–smooth. Furthermore, we assume that each f_i is convex.

To better relate the above assumption to the standard smoothness assumptions we make the following remark.

Remark 3.5.

As a consequence of Assumption 3.4, each f_i is L_i–smooth with L_i := λ_max(M_i), and f is L–smooth with L := λ_max(M). Let L_max := max_i L_i.

Using Assumption 3.4 and a sampling S, we establish the following bounds on ℒ.

Theorem 3.6.

Let S be a proper sampling, and let v be defined by (17). Let each f_i be M_i–smooth and convex. Then (f, 𝒟) ∼ ES(ℒ), where ℒ is given by

(19)

Moreover, under a further condition on the sampling, this simplifies to the bound

(20)

By applying the above result to specific samplings, we obtain the following practical bounds on ℒ:

Proposition 3.7.

(i) For single element sampling S, we have

(21)

(ii) For independent sampling S, we have

(22)

(iii) For τ–nice sampling S, we have

(23)
(24)

(iv) For partition sampling S with partition 𝒢, we have

(25)

For these samplings, formulas for the gradient noise σ² are provided in the next result:

Theorem 3.8.

Let v be defined by (17). Then

(26)

Specializing the above theorem to specific samplings gives the following formulas for σ²:

Proposition 3.9.

(i) For single element sampling S, we have

(27)

(ii) For independent sampling S with probabilities p_i > 0, we have

(28)

(iii) For τ–nice sampling S, we have

(29)

(iv) For partition sampling S with partition 𝒢, we have

(30)

Generally, we do not know the values of ∇f_i(x*). But if we have prior knowledge that x* belongs to some set X, we can obtain upper bounds on σ² for these samplings from Proposition 3.9 in a straightforward way.

4 Optimal Mini-Batch Size

Here we develop the iteration complexity of SGD under different samplings by plugging the bounds on ℒ and σ² given in Section 3.3 into Theorem 3.1. To keep the notation brief, in this section we drop the logarithmic term from the iteration complexity results. Furthermore, for brevity and to better compare our results to others in the literature, we will use the constants L, L_max and L̄ := (1/n) Σ_{i=1}^n L_i (see Remark 3.5). Finally, let σ̄² := (1/n) Σ_{i=1}^n ‖∇f_i(x*)‖² for brevity.

Gradient descent. As a first sanity check, we consider the case where S = [n] with probability one. That is, each iteration (6) uses the full-batch gradient. Thus σ² = 0, and it is not hard to see that for τ = n in (23) we have ℒ = L. Consequently, the resulting iteration complexity (12) is now 2L/μ. This is exactly the rate of gradient descent, which is precisely what we would expect, since the resulting method is gradient descent. Though an obvious sanity check, we believe this is the first convergence theorem for SGD that includes gradient descent as a special case. Clearly, this is a necessary prerequisite if we are to hope to understand the complexity of mini-batching.

4.1 Nonzero gradient noise

To better appreciate how our iteration complexity evolves with increasing mini-batch size, we now consider independent sampling and τ–nice sampling.

Independent sampling. Inserting the bounds on ℒ (22) and σ² (28) into (12) gives the following iteration complexity:

(31)

This is a completely new mini-batch complexity result, which opens up the possibility of optimizing the mini-batch size and the sampling probabilities. For instance, if we fix uniform probabilities p_i ≡ τ/n, so that the expected mini-batch size is τ, then (31) becomes k(τ), where

(32)

This complexity result corresponds to using the stepsize

(33)

if the noise-dependent term attains the minimum; otherwise only the left-hand-side term in the minimization remains. The stepsize (33) is increasing in τ, since both ℒ(τ) and σ²(τ) decrease as τ increases.

With such a simple expression for the iteration complexity, we can choose a mini-batch size that optimizes the total complexity. Defining the total complexity as the number of iterations times the number of gradient evaluations (τ) per iteration gives

T(τ) := τ · k(τ).   (34)

Minimizing T(τ) in τ is easy because k(τ) is a max of a linearly increasing term and a linearly decreasing term in τ. Consequently, if the decreasing (noise) term still dominates at τ = n, then τ* = n; otherwise the optimizer is the crossover point of the two terms,

(35)

Since the decreasing term is proportional to the noise σ̄² and the increasing term is proportional to the smoothness constants, the condition for τ* = n holds when there is comparatively a lot of noise or the required precision ε is high. As we will see in Section 4.2, this logic extends to the case where the noise is zero, where the optimal mini-batch size is τ* = 1.

τ–nice sampling. Inserting the bounds on ℒ (24) and σ² (29) into (12) gives the iteration complexity k(τ), where

(36)
(37)

which holds for the stepsize

(38)

Again, this is an increasing function of τ.

We are now again able to calculate the mini-batch size that optimizes the total complexity, given by T(τ) := τ · k(τ). Once again, k(τ) is a max of a linearly increasing term and a linearly decreasing term in τ. Consequently, if the decreasing term still dominates at τ = n, then τ* = n; otherwise

(39)
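As an illustration, the minimizer of the total complexity can also be located numerically. The sketch below assumes the τ–nice constants take the interpolation forms ℒ(τ) = (n(τ−1))/(τ(n−1)) L + ((n−τ)/(τ(n−1))) L_max and σ²(τ) = ((n−τ)/(τ(n−1))) σ̄² — consistent with the endpoints used in the text (ℒ(1) = L_max, ℒ(n) = L, σ²(n) = 0) — together with the iteration count (12); the numeric inputs are arbitrary.

```python
import numpy as np

def total_complexity(tau, n, L, L_max, mu, sigma_bar_sq, eps):
    """Total complexity tau * k(tau), with k(tau) from (12) up to the log factor.

    Assumed forms of L(tau) and sigma^2(tau) for tau-nice sampling (see lead-in)."""
    L_tau  = (n * (tau - 1)) / (tau * (n - 1)) * L \
             + (n - tau) / (tau * (n - 1)) * L_max
    sig_sq = (n - tau) / (tau * (n - 1)) * sigma_bar_sq
    k_tau  = max(2 * L_tau / mu, 4 * sig_sq / (eps * mu ** 2))
    return tau * k_tau

n, L, L_max, mu, sigma_bar_sq, eps = 1000, 1.0, 50.0, 0.1, 5.0, 1e-3
taus = np.arange(1, n + 1)
costs = [total_complexity(t, n, L, L_max, mu, sigma_bar_sq, eps) for t in taus]
print("optimal mini-batch size:", taus[int(np.argmin(costs))])
```

Increasing sigma_bar_sq (or decreasing eps) pushes the printed optimum up, in line with the abstract's claim that the optimal mini-batch size grows with the gradient noise at the minimum.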

4.2 Zero gradient noise

Consider the case where the gradient noise is zero (σ² = 0). According to Theorem 3.1, the resulting complexity of SGD with the constant stepsize γ = 1/(2ℒ) is given by the very simple expression

2ℒ/μ,   (40)

where we have dropped the logarithmic term. In this setting, due to Corollary 2.5, we know that f satisfies the weak growth condition. Thus our results are directly comparable to those developed in Ma et al. (2018) and in Vaswani et al. (2018).

In particular, Theorem 1 in Ma et al. (2018) states that when running SGD with mini-batches formed by sampling with replacement, the resulting iteration complexity is

(41)

again dropping the logarithmic term. Now gaining insight into the complexity (40) is a matter of studying the expected smoothness constant ℒ for different sampling strategies.

Independent sampling. Setting σ̄² = 0 (and thus σ² = 0) and using uniform probabilities p_i ≡ τ/n in (31) gives

(42)

τ–nice sampling. If we use τ–nice sampling and σ² = 0, then the resulting iteration complexity is given by

(43)

Iteration complexities (41), (42) and (43) tell essentially the same story: the complexity improves as τ increases to n, but this improvement is not enough when considering the total complexity (multiplying by τ). Indeed, for total complexity, these results all say that τ = 1 is optimal.

5 Importance Sampling

In this section we propose importance sampling for single element sampling and for independent sampling, respectively. Due to lack of space, the details of this section are in the appendix, Section I. Again we drop the log term in (12) and adopt the notation of Remark 3.5.

5.1 Single element sampling

For single element sampling, plugging (21) and (27) into (12) gives an iteration complexity that depends on the probabilities p₁, …, pₙ through both ℒ and σ². In order to optimize this iteration complexity over the p_i, we need to solve an n–dimensional linearly constrained non-smooth convex minimization problem, which could be harder than the original problem (1). So instead, we focus on minimizing ℒ and σ² over the p_i separately. We then use these two resulting (sub)optimal probabilities to construct a sampling.

In particular, for single element sampling we can recover the partially biased sampling developed in Needell et al. (2016). First, from (21) it is easy to see that the probabilities that minimize ℒ are p_i = L_i / Σ_j L_j for all i. Using these suboptimal probabilities we can construct a partially biased sampling by letting p_i = (1/2) L_i / Σ_j L_j + (1/2)(1/n). Plugging this sampling into (21) gives ℒ ≤ 2L̄, and from (27) we have σ² ≤ 2σ̄². This sampling is the same as the partially biased sampling in Needell et al. (2016). From (12) in Theorem 3.1, we get that the total complexity is now given by

(44)

For uniform sampling, ℒ = L_max and σ² = σ̄². Hence, compared to uniform sampling, the iteration complexity of partially biased sampling is at most two times larger, but can be much smaller in the extreme case where L_max ≫ L̄.
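The partially biased probabilities above are trivial to compute; a minimal sketch (function name ours):

```python
import numpy as np

def partially_biased_probabilities(L_is):
    """p_i = 0.5 * L_i / sum_j L_j + 0.5 * 1/n, as in Needell et al. (2016)."""
    L_is = np.asarray(L_is, dtype=float)
    return 0.5 * L_is / L_is.sum() + 0.5 / len(L_is)

p = partially_biased_probabilities([1.0, 2.0, 7.0])
print(p, p.sum())   # probabilities sum to 1, and every p_i >= 1/(2n)
```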

5.2 Minibatches

Importance sampling for mini-batches was first considered in Csiba & Richtárik (2018), but not in the context of SGD. Here we propose the first importance sampling for mini-batch SGD. In Section I.2 of the appendix we introduce the use of partially biased sampling together with independent sampling, and show that we can achieve a total complexity of (by Proposition I.3)

(45)

which not only eliminates the dependence on L_max, but also improves as the mini-batch size increases.

6 Experiments

In this section, we empirically validate our theoretical results. We perform three experiments in each of which we highlight a different aspect of our contributions.

In the first two experiments we focus on ridge regression and regularized logistic regression problems (problems with a strongly convex objective f and convex components f_i), and we evaluate the performance of SGD on both synthetic and real data. In particular, in the first experiment (Section 6.1) we numerically verify the performance of SGD (in the case of uniform single element sampling) as predicted by Theorems 3.1 and 3.2, for both constant and decreasing stepsizes. In the second experiment (Section 6.2) we compare the convergence of SGD for several choices of the distribution 𝒟 (different sampling strategies) as described in the previous sections. In the last experiment (Section 6.3) we focus on the problem of principal component analysis (PCA), which by construction can be seen as a problem with a strongly convex objective f but with non-convex functions f_i (Allen-Zhu & Yuan, 2016; Garber & Hazan, 2015; Shalev-Shwartz, 2016).

In all experiments, to evaluate SGD we use the relative error measure ‖x^k − x*‖²/‖x^0 − x*‖². For all implementations, the starting point x^0 is standard Gaussian. We run each method until a pre-specified relative error is reached or until a pre-specified maximum number of epochs is achieved. For the horizontal axis we always use the number of epochs. The code for the experiments is written in Python 3. For more experiments we refer the interested reader to Section J of the appendix.

Regularized regression problems: In the case of the ridge regression problem we solve

min_x (1/(2n)) ‖Ax − b‖² + (λ/2) ‖x‖²,

while for the L2-regularized logistic regression problem we solve

min_x (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨a_i, x⟩)) + (λ/2) ‖x‖².

In both problems, the rows a_i of the matrix A ∈ ℝ^{n×d} and the targets (b for ridge regression, labels y_i ∈ {−1, 1} for logistic regression) are the given data, and λ > 0 is the regularization parameter. For the generation of the synthetic data in both problems, the rows of the matrix A were sampled from the standard Gaussian distribution. For the synthetic data in the case of ridge regression we choose the vector b to be Gaussian, while in the case of logistic regression we generate labels y_i ∈ {−1, 1}. The regularization parameter λ varies depending on the experiment. For our experiments on real data we choose several LIBSVM (Chang & Lin, 2011) datasets.
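As one concrete instantiation of this setup, the following minimal harness (our own simplified version of the experiment, not the original code) generates synthetic ridge-regression data as described above and runs SGD with uniform single element sampling, reporting the relative error per epoch:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, lam = 200, 20, 1e-2
A = rng.standard_normal((n, d))       # rows sampled from the standard Gaussian
b = rng.standard_normal(n)            # Gaussian target vector (ridge case)

# f_i(x) = 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i] + lam * x
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)

L_max = max(np.linalg.norm(A[i]) ** 2 + lam for i in range(n))
gamma = 1 / (2 * L_max)               # valid stepsize, since the ES constant <= L_max
x = rng.standard_normal(d)            # standard Gaussian starting point
r0 = np.linalg.norm(x - x_star) ** 2
for epoch in range(30):
    for _ in range(n):
        x -= gamma * grad_fi(rng.integers(n), x)
    print(epoch, np.linalg.norm(x - x_star) ** 2 / r0)   # relative error
```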

Figure 1: Comparison between constant and decreasing stepsize regimes of SGD. Ridge regression problem (first row): left, synthetic data; right, real dataset abalone from LIBSVM. Logistic regression problem (second row): left, synthetic data; right, real dataset a1a from LIBSVM.

6.1 Constant vs decreasing step size

We now compare the performance of SGD in the constant and decreasing stepsize regimes considered in Theorems 3.1 (see (11)) and 3.2 (see (14)), respectively. As expected from the theory, we see in Figure 1 that the decreasing stepsize regime is vastly superior at reaching higher precision than the constant stepsize variant. In our plots, the vertical red line denotes the switching moment predicted by Theorem 3.2, and highlights the point where SGD needs to change its update rule from a constant to a decreasing stepsize.

Figure 2: Performance of SGD with several mini-batch strategies for logistic regression. Top: the real dataset w3a from LIBSVM. Bottom: synthetic data.

6.2 Minibatches

In Figures 2 and 5 we compare single element sampling (uniform and importance), independent sampling (uniform, uniform with optimal batch size, and importance) and τ–nice sampling (with fixed τ and with the optimal τ*). The probabilities for importance sampling in the single element and independent samplings are calculated by formulas (65) and (73) (see the appendix). Formulas for the optimal mini-batch size for independent and τ–nice sampling are given in (35) and (39), respectively. Observe that mini-batching with the optimal τ* gives the best convergence. In addition, note that for a constant stepsize, the importance sampling variants depend on the accuracy ε. It is clear that before the error approaches the required accuracy, importance sampling is comparable to or better than uniform sampling.

6.3 Sum-of-non-convex functions

In Figure 3, our goal is to illustrate that Theorem 3.1 holds even if the functions f_i are non-convex. The scheme of the experiment is similar to the one from Allen-Zhu & Yuan (2016). In particular, we first generate a random Gaussian vector b. Then we consider the minimization problem

min_x (1/n) Σ_{i=1}^n f_i(x),  f_i(x) := (1/2) x^⊤ A_i x − b^⊤ x,

where the A_i are diagonal matrices whose average is the identity. In particular, to guarantee this, we randomly select half of the matrices and assign their j-th diagonal value to one fixed constant; for the other half we assign the complementary constant so that the average is 1. We repeat this for all diagonal values. Note that under this construction, each f_i is a non-convex function, while the average f is strongly convex. Once again, in the first plot we observe that while both variants are equally fast in the beginning, the decreasing stepsize variant is better at reaching higher accuracy than the fixed stepsize variant. In the second plot we see, as expected, that all four mini-batch versions of SGD outperform single element SGD. However, while the τ–nice and independent samplings with fixed τ lead to only a slight improvement, the theoretically optimal choice τ* leads to a vast improvement.
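A sketch of one such construction follows. The quadratic form of f_i and the specific diagonal values (3 and −1, averaging to 1) are our own concrete choices for illustration; any pair with mean 1 and a negative member makes each f_i non-convex while keeping f strongly convex.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
b = rng.standard_normal(d)

# Diagonal A_i with (1/n) * sum_i A_i = I: for each coordinate j, half of the
# matrices get diagonal value 3 and the other half -1 (average 1). Each f_i is
# then non-convex, while f(x) = 0.5*||x||^2 - b^T x is 1-strongly convex.
diag = np.full((n, d), -1.0)
for j in range(d):
    diag[rng.choice(n, size=n // 2, replace=False), j] = 3.0
assert np.allclose(diag.mean(axis=0), 1.0)

grad_fi = lambda i, x: diag[i] * x - b   # grad of f_i(x) = 0.5 x^T A_i x - b^T x
x_star = b                               # minimizer of f

gamma, x = 0.01, np.zeros(d)             # gamma <= 1/(2*L_max) with L_max = 3
for k in range(20_000):
    x -= gamma * grad_fi(rng.integers(n), x)
# linear convergence to a neighborhood of x* (sigma^2 > 0 in this construction)
print("squared error:", np.linalg.norm(x - x_star) ** 2)
```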

Figure 3: Top: Comparison between constant and decreasing step size regimes of SGD for PCA. Bottom: comparison of different sampling strategies of SGD for PCA.

References

Appendix A Elementary Results

In this section we collect some elementary results; some of them we use repeatedly.

Proposition A.1.

Let φ be L–smooth, and assume it has a minimizer x* on ℝ^d. Then

‖∇φ(x)‖² ≤ 2L (φ(x) − φ(x*)).

Proof.

Lipschitz continuity of the gradient implies that φ(y) ≤ φ(x) + ⟨∇φ(x), y − x⟩ + (L/2) ‖y − x‖² for all x, y ∈ ℝ^d. Minimizing the left-hand side over y gives φ(x*), while minimizing the right-hand side over y (attained at y = x − (1/L)∇φ(x)) gives φ(x) − (1/(2L)) ‖∇φ(x)‖². Rearranging yields the claim. ∎