A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

June 26, 2018 · Yossi Arjevani et al. · Weizmann Institute of Science

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from τ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as that of non-delayed gradient descent run on only 1/τ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.


1 Introduction

Gradient-based optimization methods are widely used in machine learning and other large-scale applications, due to their simplicity and scalability. However, in their standard formulation, they are also strongly synchronous and iterative in nature: In each iteration, the update step is based on the gradient at the current iterate, and we need to wait for this computation to finish before moving to the next iterate. For example, to minimize some function $F$ on $\mathbb{R}^d$, plain stochastic gradient descent initializes at some point $x_1$, and computes iterates of the form

$$x_{t+1} = x_t - \eta\left(\nabla F(x_t) + \xi_t\right), \qquad (1)$$

where $\nabla F(x_t)$ is the gradient of $F$ at $x_t$, $\eta$ is the step size, and $\xi_t$ are independent zero-mean noise terms. Unfortunately, in several important applications, a direct implementation of this is too costly. For example, consider a setting where we wish to optimize a function $F$ using a distributed platform, consisting of several machines with shared memory. We can certainly implement gradient descent, by letting one of the machines compute the gradient at each iteration, but this is clearly wasteful, since just one machine is non-idle at any given time. Thus, it is highly desirable to use methods which parallelize the computation. One approach is to employ mini-batch gradient methods, which parallelize the computation of the stochastic gradient, and their analysis is relatively well understood (e.g. [6, 5, 19, 24]). However, these methods are still generally iterative and synchronous in nature, and hence can suffer from problems such as having to wait for the slowest machine at each iteration.

A second and popular approach is to utilize asynchronous gradient methods. With these methods, each update step is not necessarily based just on the gradient of the current iterate, but possibly on the gradients of earlier iterates (often called stale updates). For example, when optimizing a function using several machines, each machine might read the current iterate from a shared parameter server, compute the gradient at that iterate, and then update the parameters, even though other machines might have performed other updates to the parameters in the meantime. Although such asynchronous methods often work well in practice, analyzing them is much trickier than synchronous methods.

In our work, we focus on arguably the simplest possible variant of these methods, where we perform plain stochastic gradient descent on a convex function $F$ on $\mathbb{R}^d$, with a fixed delay of $\tau$ in the gradient computation:

$$x_{t+1} = x_t - \eta\left(\nabla F(x_{t-\tau}) + \xi_{t-\tau}\right), \qquad (2)$$

where we assume that the iterates remain at the initial point $x_1$ until the first delayed gradient becomes available. Compared to Eq. (1), we see that the gradient is computed with respect to $x_{t-\tau}$ rather than $x_t$. Already in this simple formulation, the precise effect of the delay on the convergence rate is not completely clear. For example, for a given number of iterations $T$, how large can $\tau$ be before we might expect a significant deterioration in the accuracy? And under what conditions? Although there exist some prior results in this direction (which we survey in the related work section below), these questions have remained largely open.
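To make the update rule concrete, here is a minimal NumPy sketch of delayed SGD on a quadratic. The quadratic, step size, noise level, and the convention that the iterates stay put until the first delayed gradient arrives are our own illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def delayed_sgd(A, b, x1, eta, tau, T, sigma=0.0, rng=None):
    """Run SGD where the gradient used at step t was evaluated tau rounds ago.

    F(x) = 0.5 * x^T A x + b^T x, so grad F(x) = A x + b.
    The first tau iterates stay at x1 (assumption: no delayed gradient has arrived yet).
    """
    rng = np.random.default_rng(rng)
    xs = [x1.copy()]
    for t in range(T):
        if t < tau:
            xs.append(xs[-1].copy())          # no gradient available yet
            continue
        x_old = xs[t - tau]                   # iterate from tau rounds ago
        noise = sigma * rng.standard_normal(len(x1))
        grad = A @ x_old + b + noise          # stochastic gradient at the stale point
        xs.append(xs[-1] - eta * grad)
    return xs

# Tiny usage example on a 2-d quadratic (all values chosen arbitrarily).
A = np.diag([1.0, 10.0])
b = np.zeros(2)
xs = delayed_sgd(A, b, x1=np.array([5.0, 5.0]), eta=0.004, tau=20, T=2000)
print(np.linalg.norm(xs[-1]))  # distance to the minimizer x* = 0
```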

In this paper, we aim to provide a tight, finite-time convergence analysis for stochastic gradient descent with delays, focusing on the simple case where $F$ is a convex quadratic function. Although a quadratic assumption is non-trivial, it arises naturally in problems such as least squares, and is an important case study since all smooth and convex functions are locally quadratic close to their minimum (hence, our results should still hold in a local sense). In future work, we hope to show that our results are also applicable more generally.

First, we consider the case of deterministic delayed gradient descent (DGD, defined in Eq. (2) with zero noise terms). Assuming the step size is chosen appropriately, we prove that

after $T$ iterations, over the class of $\lambda$-strongly convex, $\mu$-smooth quadratic functions with a minimum at $x^*$, and

over the class of $\mu$-smooth convex quadratic functions with a minimum at $x^*$. In terms of iteration complexity, the number of iterations required to achieve a fixed optimization error of at most $\epsilon$ in the strongly convex and the convex cases is therefore

(3)

respectively, where $\kappa = \mu/\lambda$ is the so-called condition number. (Following standard convention, we use the $\mathcal{O}$-notation to hide constants, and the $\tilde{\mathcal{O}}$-notation to hide constants and factors polylogarithmic in the problem parameters.) When $\tau$ is a bounded constant, these bounds match the known iteration complexity of standard gradient descent without delays [17]. However, as $\tau$ increases, both bounds deteriorate linearly with $\tau$. Notably, in our setting of delayed gradients, this implies that DGD is no better than a trivial algorithm, which performs a single gradient step, and then waits for $\tau$ rounds till the delayed gradient is received, before performing the next step (thus, the algorithm is equivalent to non-delayed gradient descent with only about $T/\tau$ gradient steps, resulting in the same linear deterioration of the iteration complexity with $\tau$).
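As a quick numerical illustration of this comparison (ours, not an experiment from the paper; all concrete numbers are arbitrary), the following sketch runs noiseless DGD with delay τ and a step size shrunk by roughly a factor of τ, next to plain gradient descent run on only T/τ gradients:

```python
import numpy as np

def run_dgd(A, eta, tau, T, x1):
    """Noiseless delayed GD on F(x) = 0.5 x^T A x (minimizer x* = 0)."""
    xs = [x1.copy()] * (tau + 1)              # no updates until the first gradient arrives
    for t in range(tau, T):
        xs.append(xs[-1] - eta * (A @ xs[t - tau]))
    return np.linalg.norm(xs[-1])

def run_gd(A, eta, steps, x1):
    """Plain (non-delayed) gradient descent, for comparison."""
    x = x1.copy()
    for _ in range(steps):
        x = x - eta * (A @ x)
    return np.linalg.norm(x)

A = np.diag([1.0, 10.0])                      # eigenvalues in [1, 10]
x1 = np.array([5.0, 5.0])
tau, T = 25, 5000

# DGD with T iterations (step size reduced by ~tau for stability) versus
# plain GD run on only T/tau gradients: both errors are tiny and of
# broadly similar magnitude, illustrating the factor-tau slowdown.
print(run_dgd(A, eta=0.1 / tau, tau=tau, T=T, x1=x1))
print(run_gd(A, eta=0.1, steps=T // tau, x1=x1))
```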

Despite these seemingly weak guarantees, we show that they are in fact tight in terms of $\tau$, by proving that this linear dependence on $\tau$ is unavoidable with standard gradient-based methods (including gradient descent). The dependence on the other problem parameters in our lower bounds is a bit weaker than our upper bounds, but can be matched by an accelerated gradient descent procedure (see Sec. 3 for more details).

In the second part of our paper, we consider the case of stochastic delayed gradient descent (SDGD, defined in (2)). Assuming the noise terms satisfy $\mathbb{E}\left[\|\xi_t\|^2\right] \le \sigma^2$ and that the step size is appropriately tuned, we prove that

(4)

for $\lambda$-strongly convex, $\mu$-smooth quadratic functions with a minimum at $x^*$, and

(5)

for $\mu$-smooth convex quadratic functions. In terms of iteration complexity, these correspond to

(6)

in the strongly convex and convex cases respectively, where again $\kappa = \mu/\lambda$. As in the deterministic case, when $\tau$ is a bounded constant, these bounds match the known iteration complexity bounds for standard gradient descent without delays [3, 20]. Moreover, these bounds match the bounds for the deterministic case in Eq. (3) when $\sigma = 0$ (i.e. zero noise), as they should. However, in sharp contrast to the deterministic case, the dependence on $\tau$ in Eq. (6) is quite different: The delay only appears in second-order terms, and its influence becomes negligible when $\epsilon$ is small enough. The same effect can be seen in Eq. (4) and Eq. (5): Once the number of iterations $T$ is large enough, the first term in both bounds dominates, and $\tau$ no longer plays a role. More specifically:


  • In the strongly convex case, the effect of the delay becomes negligible once the target accuracy $\epsilon$ is sufficiently small, or equivalently, once the number of iterations $T$ is sufficiently large relative to $\tau$. In other words, assuming the condition number $\kappa$ is bounded, we can have the delay $\tau$ nearly as large as the total number of iterations $T$ (up to log-factors), without significant deterioration in the convergence rate. Note that this is a mild requirement, since if $\tau \ge T$, the algorithm receives no gradients and makes no updates.

  • In the convex case, the effect of the delay becomes negligible once the target accuracy $\epsilon$ is sufficiently small, or once the number of iterations $T$ is sufficiently large relative to $\tau^2$. Compared to the strongly convex case, here the regime is the same in terms of $\epsilon$, but the regime in terms of $T$ is more restrictive: We need $T$ to scale quadratically (rather than linearly) with $\tau$. Thus, the maximal delay with no performance deterioration is of order $\sqrt{T}$.

Finally, it is interesting to compare our bounds to those of mini-batch stochastic gradient descent (SGD), which can be seen as a synchronous gradient-based method to cope with delays, especially in distributed optimization and learning problems [6, 5, 1]. In mini-batch SGD, each update step is performed only after accumulating and averaging a mini-batch of $b$ stochastic gradients, all with respect to the same point:

$$x_{t+1} = x_t - \frac{\eta}{b}\sum_{i=1}^{b}\left(\nabla F(x_t) + \xi_{t,i}\right).$$

Although the algorithm makes an update only every $b$ stochastic gradient computations, the averaging reduces the stochastic noise, and helps speed up convergence. Moreover, this can be seen as a particular type of algorithm with delayed updates (with the delay corresponding to the mini-batch size $b$), since each update uses stochastic gradients that were all evaluated at a point computed $b$ gradient computations earlier. The important difference is that it is an inherently synchronous method, which waits for all $b$ stochastic gradients to be computed before performing an update step. Remarkably, the bounds we proved above for delayed SGD are essentially identical to those known for mini-batch SGD, with the delay $\tau$ replaced by the mini-batch size $b$ (at least in the convex case, where mini-batch SGD has been more thoroughly analyzed). This indicates that an asynchronous method like delayed SGD can potentially match the performance of synchronous methods like mini-batch SGD, even without requiring synchronization – an important practical advantage.
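For comparison with the delayed-update sketch above, here is a minimal sketch of the mini-batch SGD update just described (batch size $b$, all stochastic gradients taken at the same point); the quadratic and the Gaussian noise model are illustrative assumptions of ours:

```python
import numpy as np

def minibatch_sgd(A, b_vec, x1, eta, batch, T, sigma, rng=None):
    """Mini-batch SGD on F(x) = 0.5 x^T A x + b^T x: each update averages
    `batch` stochastic gradients, all evaluated at the current iterate."""
    rng = np.random.default_rng(rng)
    x = x1.copy()
    for _ in range(T // batch):                # one update per `batch` gradient computations
        noise = sigma * rng.standard_normal((batch, len(x1))).mean(axis=0)
        x = x - eta * (A @ x + b_vec + noise)  # averaging cuts the noise variance by `batch`
    return x

A = np.diag([1.0, 10.0])
x = minibatch_sgd(A, np.zeros(2), x1=np.array([5.0, 5.0]),
                  eta=0.1, batch=20, T=2000, sigma=1.0, rng=0)
print(np.linalg.norm(x))
```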

Analyzing gradient descent with delays is notoriously tricky, due to the dependence of the updates on iterates produced many iterations ago. The technique we introduce for deriving our upper bounds is primarily based on generating functions, and might be useful for studying other optimization algorithms. We discuss this approach more thoroughly in Section 2. The rest of the paper is devoted mostly to presenting the formal theorems and an explanation of how they are derived (with technical details relegated to the supplementary material).

Related Work

There is a huge literature on asynchronous versions of gradient-based methods (see for example the seminal book [2]), including treating the effect of delay. However, most of these do not consider the setting we study here. For example, there has been much recent interest in asynchronous algorithms in a model where there is a delay in updating individual coordinates in a shared parameter vector (e.g., the Hogwild! algorithm of [18], or more recently [14, 13]). Of course, this is a different model than ours, where the updates use a full gradient vector. Other works (such as [21]) focus on a setting where different agents in a network can perform local communication, which is again a different model than ours. Yet other works focus on sharp but asymptotic results, and do not provide guarantees after a fixed number of iterations (e.g., [4]).

Moving closer to our setting, [15] showed convergence for delayed gradient descent, with the result implying a convergence rate for convex functions that scales with the delay. A similar bound on average regret has been shown in an adversarial online learning setting, for general convex functions, and this bound is known to be optimal [10]. These results differ from our setting, in that they consider possibly non-smooth functions, in which the dependence on $T$ is no better than $1/\sqrt{T}$ even without delays and no noise, and where the delay always plays a significant role. In contrast, we focus here on smooth functions, where rates better than $1/\sqrt{T}$ are possible, and where the effect of $\tau$ is more subtle. In [7], the authors study a setting very similar to ours in the deterministic case, and manage to prove a linear convergence rate, but for a less standard algorithm, different from the one we study here.

Perhaps the works closest to ours are [1, 8], which study stochastic gradient descent with delayed gradients. Moreover, they consider a setting more general than ours, where the delay at each iteration can be any integer up to $\tau$ (rather than a fixed $\tau$), and the functions are not necessarily quadratic. On the flip side, their bounds are significantly weaker. For example, for smooth convex functions and an appropriate step size, [1, Corollary 1] show a bound of

in terms of $T$ and $\tau$. Note that this bound is vacuous in the deterministic or near-deterministic case (where $\sigma \approx 0$), and is weaker than our bounds. With a different choice of the step size, it is possible to get a non-vacuous bound even if $\sigma = 0$, but the dependence on $\tau$ becomes even stronger. [8] improve the bound to

in the convex and strongly convex cases respectively. Even if $\sigma = 0$, the resulting iteration complexities imply a quadratic dependence on $\tau$ (whereas in our bounds the scaling is linear). When $\sigma$ is positive, the effect of the delay on the bound is negligible only for much smaller delays (in contrast to delays of order $\sqrt{T}$, or even nearly $T$, in our bounds). We note that there are several other works which study a similar setting (such as [22]), but do not result in bounds which improve on the above.

Finally, we note that [12] attempt to show that for stochastic gradient descent with delayed updates, the dependence on the delay is negligible after sufficiently many iterations. Unfortunately, as pointed out in [1], the analysis contains a bug which makes the results invalid.

2 Framework and the Generating Functions Approach

Throughout, we will assume that $F$ is a convex quadratic function specified by

$$F(x) = \frac{1}{2}\,x^\top A x + b^\top x + c, \qquad (7)$$

where $A$ is a positive semi-definite matrix whose eigenvalues are in $[0, \mu]$ (where $\mu$ is the smoothness parameter), $b \in \mathbb{R}^d$ and $c \in \mathbb{R}$. To make the optimization problem meaningful, we further assume that $F$ is bounded from below, which implies that it has some minimizer $x^*$ at which the gradient vanishes (for completeness, we provide a proof in Lemma 3 in the supplementary material). Letting $w_t := x_t - x^*$, it is easily verified that

$$F(x_t) - F(x^*) = \frac{1}{2}\,w_t^\top A\, w_t, \qquad (8)$$

so our goal will be to analyze the dynamics of the error terms $w_t$.

To explain our technique, consider the iterates of DGD on the function $F$, which can be written as $x_{t+1} = x_t - \eta\nabla F(x_{t-\tau})$. Since $\nabla F(x) = A(x - x^*)$, we have $x_{t+1} = x_t - \eta A(x_{t-\tau} - x^*)$, by which it follows that the error term $w_t = x_t - x^*$ satisfies the recursion $w_{t+1} = w_t - \eta A\, w_{t-\tau}$, and (by definition of the algorithm) $w_t = w_1$ for all $t \le \tau$. By some simple arguments, our analysis then boils down to bounding the elements of the scalar-valued version of this sequence, namely

$$u_t = u_{t-1} - \gamma\, u_{t-\tau-1}, \qquad (9)$$

for some integer $\tau \ge 1$ and non-negative real number $\gamma$ (which plays the role of $\eta$ times an eigenvalue of $A$). To analyze this sequence, we rely on tools from the area of generating functions, which have proven very effective in studying growth rates of sequences in many areas of mathematics. We now turn to briefly describe these functions and our approach (for general surveys on generating functions, see [25, 9, 23], to name a few).

Generally speaking, generating functions are formal power series associated with infinite sequences of numbers $(a_n)_{n \ge 0}$. Concretely, given a sequence of numbers $(a_n)_{n \ge 0}$ in a ring $R$, we define the corresponding generating function as a formal power series in an indeterminate $z$, defined as $A(z) = \sum_{n \ge 0} a_n z^n$. The set of all formal power series in $z$ over $R$ is denoted by $R[[z]]$. Moreover, given two power series defined by sequences $(a_n)$ and $(b_n)$, we can define their addition as the power series corresponding to $(a_n + b_n)$, and their multiplication via the Cauchy product of the power series, namely the series whose $n$'th coefficient is $\sum_{k=0}^{n} a_k b_{n-k}$. In particular, over the reals, $\mathbb{R}[[z]]$ endowed with addition and multiplication is a commutative ring, and the set of $d \times d$ matrices with elements in $\mathbb{R}[[z]]$ (with the standard addition and multiplication operations) forms a matrix algebra, denoted by $\mathcal{M}_d(\mathbb{R}[[z]])$. We will often use the fact that any matrix whose entries are power series with scalar coefficients can also be written as a power series with matrix-valued coefficients: More formally, $\mathcal{M}_d(\mathbb{R}[[z]])$ is naturally identified with the ring of formal power series with real matrix coefficients, $\mathcal{M}_d(\mathbb{R})[[z]]$. To extract the coefficients of a given power series, we shall use the conventional bracket notation $[z^n]$, defined to be the matrix whose entries are the $n$'th coefficients of the respective formal power series.
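The ring operations just described are easy to mirror on truncated coefficient sequences. The following small sketch (ours, for intuition only) implements addition, the Cauchy product, and inversion of a series with invertible constant term, which is exactly the fact used in the derivation below:

```python
import numpy as np

def ps_add(a, b):
    """Add two truncated power series given as coefficient arrays."""
    n = max(len(a), len(b))
    out = np.zeros(n)
    out[:len(a)] += a
    out[:len(b)] += b
    return out

def ps_mul(a, b, n_terms):
    """Cauchy product: c_n = sum_k a_k * b_{n-k}, truncated to n_terms coefficients."""
    c = np.zeros(n_terms)
    for n in range(n_terms):
        for k in range(min(n + 1, len(a))):
            if n - k < len(b):
                c[n] += a[k] * b[n - k]
    return c

def ps_inverse(a, n_terms):
    """Invert a power series with invertible constant term a_0 != 0, using
    b_0 = 1/a_0 and b_n = -(1/a_0) * sum_{k>=1} a_k b_{n-k}."""
    b = np.zeros(n_terms)
    b[0] = 1.0 / a[0]
    for n in range(1, n_terms):
        s = sum(a[k] * b[n - k] for k in range(1, min(n + 1, len(a))))
        b[n] = -s / a[0]
    return b

# Sanity check: (1 - z) * (1 + z + z^2 + ...) = 1.
one_minus_z = np.array([1.0, -1.0])
geom = ps_inverse(one_minus_z, 8)             # coefficients 1, 1, 1, ...
print(ps_mul(one_minus_z, geom, 8))           # ~ [1, 0, 0, ...]
```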

Returning to Eq. (9), we write the sequence $(u_t)$ as a formal power series denoted by $U(z)$, and proceed as follows,

(10)

Denoting the resulting polynomial factor by $\chi(z)$ and rearranging terms gives

(11)

(by a well-known fact, $\chi(z)$ is invertible in $\mathbb{R}[[z]]$, as its constant term 1 is trivially invertible in $\mathbb{R}$ – see the surveys mentioned above). We now see that the problem of bounding the coefficients $u_t$ is reduced to that of estimating the coefficients of the rational function $1/\chi(z)$, written as a power series. Note that for the analogous problem where the elements of the sequence are vectors and the scalar factor $\gamma$ is replaced by $\eta A$ for some square matrix $A$, the same derivation as above applies (likewise, the resulting power series is invertible in $\mathcal{M}_d(\mathbb{R})[[z]]$, as its constant term, the identity matrix, is invertible in $\mathcal{M}_d(\mathbb{R})$).

To estimate the coefficients of $1/\chi(z)$, we form its corresponding partial fraction decomposition. First, we note that $\chi$, as a polynomial of degree $\tau+1$, has $\tau+1$ roots (possibly complex-valued, and all non-zero, since $\chi(0) = 1$). Assuming the step size is chosen so that all the roots are distinct, we have by a standard derivation

Thus,

(12)

To bound the magnitudes of the roots and of the coefficients in this decomposition, we invoke the following lemma, whose proof (in the supplementary material) relies on some tools from complex analysis:

Lemma 1.

Let , and assume , then

  1. is a real scalar satisfying , and for , .

  2. , for any .

With this lemma at hand, we have

where the last inequality is due to Lemma 5 (provided in the supplementary material). Moreover, one can use elementary arguments to show that for any , as long as (see Lemma 2 in the supplementary material). Overall, for any , we have

(13)

which, using Eq. (11), gives the desired bounds on the elements defined in Eq. (9).
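As a numerical illustration of this derivation (ours, under the assumption that the recursion of Eq. (9) leads to the denominator $\chi(z) = 1 - z + \gamma z^{\tau+1}$, as in the sketch of the recursion above), the coefficients of $1/\chi(z)$ can be computed both directly from the recursion they satisfy and through the roots of $\chi$, mirroring the partial-fraction argument:

```python
import numpy as np

def series_coeffs_by_recursion(gamma, tau, n_terms):
    """Coefficients of 1/chi(z), chi(z) = 1 - z + gamma*z^(tau+1),
    obtained from the linear recursion they satisfy."""
    u = np.zeros(n_terms)
    u[0] = 1.0
    for n in range(1, n_terms):
        u[n] = u[n - 1] - (gamma * u[n - tau - 1] if n >= tau + 1 else 0.0)
    return u

def series_coeffs_by_roots(gamma, tau, n_terms):
    """Same coefficients via the partial fraction decomposition over the
    (assumed distinct) roots of chi."""
    # numpy.roots expects coefficients from the highest degree down:
    # gamma*z^(tau+1) + 0*z^tau + ... + 0*z^2 - z + 1
    poly = np.zeros(tau + 2)
    poly[0] = gamma
    poly[-2] = -1.0
    poly[-1] = 1.0
    roots = np.roots(poly)
    dchi = lambda z: -1.0 + gamma * (tau + 1) * z**tau   # chi'(z)
    n = np.arange(n_terms)
    coeffs = sum(-1.0 / (dchi(z) * z ** (n + 1)) for z in roots)
    return coeffs.real

gamma, tau = 0.05, 5
print(series_coeffs_by_recursion(gamma, tau, 10))
print(series_coeffs_by_roots(gamma, tau, 10))   # should agree up to rounding
```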

3 Deterministic Delayed Gradient Descent

We start by analyzing the convergence of DGD for $\lambda$-strongly convex and $\mu$-smooth quadratic functions, where the eigenvalues of $A$ are assumed to lie in $[\lambda, \mu]$ for some $\lambda > 0$.

Following the same line of derivation as in Eq. (10), we obtain the matrix-valued analogue of Eq. (11). It then follows that for any $t$,

(14)

where the first step follows by linearity, the second by the eigendecomposition of $A$ (which reveals that the spectral norm of a matrix polynomial equals the absolute value of the same polynomial evaluated at one of its eigenvalues), the third by Ineq. (13), and the last by the fact that the eigenvalues of $A$ lie in $[\lambda, \mu]$. Moreover, by Eq. (8) and the fact that all eigenvalues of $A$ are at most $\mu$, we arrive at the following bound:

Theorem 1.

For any delay and , running DGD with step size on a -smooth, -strongly convex quadratic function yields

In particular, setting , we get that

Note that the assumption that $T > \tau$ is very mild, since if $T \le \tau$ then the algorithm trivially makes no updates after $T$ rounds.

We now turn to analyze the case of $\mu$-smooth convex quadratic functions, where the eigenvalues of the matrix $A$ are assumed to lie in $[0, \mu]$. Following the same derivation as in Ineq. (14) and using Eq. (8), we have for any $t$,

(15)

where $e$ is Euler's number, the second step is by the fact that the spectral norm of a matrix polynomial equals the absolute value of the same polynomial evaluated at one of its eigenvalues, and the last step is by a fact proven in Lemma 7 in the supplementary material. We have thus arrived at the following bound for the convex case:

Theorem 2.

For any delay and , running DGD with step size on a -smooth convex quadratic function yields

In particular, if we set , we get that

As discussed in the introduction, the theorems above imply that a delay of $\tau$ increases the iteration complexity by a factor of $\tau$. We now show lower bounds which imply that this linear dependence on $\tau$ is unavoidable, for a large family of gradient-based algorithms (of which gradient descent is just a special case). Specifically, we will consider any iterative algorithm producing iterates $x_1, x_2, \ldots$ which satisfy the following:

$$x_{t+1} \in x_1 + \mathrm{span}\left\{\nabla F(x_1), \nabla F(x_2), \ldots, \nabla F(x_{t-\tau})\right\}. \qquad (16)$$

This is a standard assumption in proving optimization lower bounds (see [17]), and is satisfied by most standard gradient-based methods, and in particular our DGD algorithm. We also note that this algorithmic assumption can be relaxed at the cost of a more involved proof, similar to [16, 26] in the non-delayed case.

Theorem 3.

Consider any algorithm satisfying Eq. (16). Then the following holds for any $\tau$, $T$ and sufficiently large dimensionality $d$:


  • There exists a $\mu$-smooth, $\lambda$-strongly convex function over $\mathbb{R}^d$, such that

  • There exists a $\mu$-smooth, convex quadratic function over $\mathbb{R}^d$, such that

The proof of the theorem is very similar to standard optimization lower bounds for gradient-based methods without delays (e.g. [17, 11]), and is presented in the supplementary material. In fact, our main contribution is to recognize that the proof technique easily extends to incorporate delays.

In terms of iteration complexity, these bounds correspond to lower bounds that scale linearly with $\tau$ in the strongly convex case and in the convex case, which shows that the linear dependence on $\tau$ is inevitable. The dependence on the other problem parameters is somewhat better than in our upper bounds, but this is not just an artifact of the analysis: In our delayed setting, the lower bounds can be matched by running accelerated gradient descent (AGD) [17], where each time we perform an accelerated gradient descent step, and then stay idle for $\tau$ iterations till we get the gradient of the current point. Overall, we perform roughly $T/\tau$ accelerated gradient steps, and can apply the standard analysis of AGD to get an iteration complexity which is $\tau$ times the iteration complexity of AGD without delays. These match the lower bounds above up to constants. We believe it is possible to prove a similar upper bound for AGD performing an update with a delayed gradient at every iteration (like our DGD procedure), but the analysis is more challenging than for plain gradient descent, and we leave it to future work.
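As a rough illustration of the idling procedure just described (our sketch; the step size and momentum are the textbook choices for the strongly convex case, not the tuned parameters from the paper), AGD can simply wait $\tau$ rounds after each update until the gradient of the new point arrives:

```python
import numpy as np

def agd_with_idling(A, x1, tau, T):
    """Nesterov AGD on F(x) = 0.5 x^T A x, performing one accelerated step
    and then idling for tau rounds until the gradient of the new point arrives."""
    eigvals = np.linalg.eigvalsh(A)
    lam, mu = eigvals[0], eigvals[-1]          # strong convexity / smoothness
    kappa = mu / lam
    eta = 1.0 / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x_prev = x = x1.copy()
    for t in range(T):
        if t % (tau + 1) != 0:
            continue                           # idle: the delayed gradient has not arrived yet
        y = x + beta * (x - x_prev)            # momentum (extrapolation) step
        x_prev, x = x, y - eta * (A @ y)       # gradient step at the extrapolated point
    return x

A = np.diag([1.0, 100.0])
x = agd_with_idling(A, x1=np.array([5.0, 5.0]), tau=10, T=3000)
print(np.linalg.norm(x))                       # only about T/(tau+1) accelerated steps are performed
```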

4 Stochastic Delayed Gradient Descent

In this section, we study the case of noisy (stochastic) gradient updates, and the SDGD algorithm, in which the influence of the delay is quite different than in the noiseless case. Instantiating SDGD for a quadratic $F$ (defined in (7)) results in the following update rule

$$x_{t+1} = x_t - \eta\left(A(x_{t-\tau} - x^*) + \xi_{t-\tau}\right), \qquad (17)$$

where $\xi_t$ are independent zero-mean noise terms satisfying $\mathbb{E}\left[\|\xi_t\|^2\right] \le \sigma^2$. As before, in terms of the error term $w_t = x_t - x^*$, Eq. (17) reads as $w_{t+1} = w_t - \eta\left(A w_{t-\tau} + \xi_{t-\tau}\right)$. Given a realization of the noise sequence, we denote its associated formal power series by $\Xi(z)$. By an analysis similar to before, we get that the formal power series of the error terms satisfies

We can now bound the error terms by extracting the corresponding coefficients of this power series. In particular, we have for any $t$

(18)

where the first step follows by the linearity of the bracket operation and the assumption that $\mathbb{E}[\xi_t] = 0$ for all $t$ (hence the cross terms vanish in expectation), the second follows by the Cauchy product for formal power series, and the last by the hypothesis that the $\xi_t$ are independent and satisfy $\mathbb{E}\left[\|\xi_t\|^2\right] \le \sigma^2$ for all $t$. We then upper bound both terms, building on Ineq. (13) (see the supplementary material for a full derivation), resulting in the following theorem:

Theorem 4.

Assuming the step size is chosen appropriately, the following holds for SDGD:


  • For $\lambda$-strongly convex, $\mu$-smooth quadratic functions, the expected optimization error is at most

    In particular, by tuning the step size appropriately,

  • For $\mu$-smooth convex quadratic functions, the expected optimization error is at most

    In particular, by tuning the step size appropriately,

As discussed in detail in the introduction, the theorem implies that the effect of the delay $\tau$ is negligible once $T$ is sufficiently large.
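To see the phenomenon qualitatively, one can run SDGD with several delays on a noisy quadratic and observe that, for a long enough horizon, the final errors are nearly indistinguishable. This is a quick simulation of ours (not the paper's experiments), with all constants chosen arbitrarily and the step size kept small enough that every delay considered remains stable:

```python
import numpy as np

def sdgd_error(A, eta, tau, T, x1, sigma, rng):
    """Stochastic delayed GD on F(x) = 0.5 x^T A x; returns ||x_T - x*||."""
    xs = [x1.copy()] * (tau + 1)               # no updates until the first gradient arrives
    for t in range(tau, T):
        noise = sigma * rng.standard_normal(len(x1))
        xs.append(xs[-1] - eta * (A @ xs[t - tau] + noise))
    return np.linalg.norm(xs[-1])

A = np.diag([1.0, 10.0])
x1 = np.array([5.0, 5.0])
T, sigma = 50_000, 1.0
rng = np.random.default_rng(0)

for tau in [1, 10, 50]:
    errs = [sdgd_error(A, eta=1e-3, tau=tau, T=T, x1=x1, sigma=sigma, rng=rng)
            for _ in range(5)]                 # average a few runs
    print(tau, np.mean(errs))                  # similar errors once T is large enough
```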

References

  • [1] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • [2] Dimitri P Bertsekas and John N Tsitsiklis. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
  • [3] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • [4] Sorathan Chaturapruek, John C Duchi, and Christopher Ré. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems, pages 1531–1539, 2015.
  • [5] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems, pages 1647–1655, 2011.
  • [6] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
  • [7] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. A delayed proximal gradient method with linear convergence rate. In Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on, pages 1–6. IEEE, 2014.
  • [8] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.
  • [9] Philippe Flajolet and Robert Sedgewick. Analytic combinatorics. Cambridge University Press, 2009.
  • [10] Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1453–1461, 2013.
  • [11] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
  • [12] John Langford, Alex J Smola, and Martin Zinkevich. Slow learners are fast. Advances in Neural Information Processing Systems, 22:2331–2339, 2009.
  • [13] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv preprint arXiv:1801.03749, 2018.
  • [14] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
  • [15] A Nedić, Dimitri P Bertsekas, and Vivek S Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8(C):381–407, 2001.
  • [16] AS Nemirovsky and DB Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, New York, 1983.
  • [17] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
  • [18] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
  • [19] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850–857. IEEE, 2014.
  • [20] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
  • [21] Benjamin Sirb and Xiaojing Ye. Decentralized consensus algorithm with delayed and stochastic gradients. arXiv preprint arXiv:1604.05649, 2016.
  • [22] Suvrit Sra, Adams Wei Yu, Mu Li, and Alexander J Smola. Adadelay: Delay adaptive distributed stochastic convex optimization. arXiv preprint arXiv:1508.05003, 2015.
  • [23] Richard P Stanley. Enumerative combinatorics, Vol. I. The Wadsworth & Brooks/Cole Mathematics Series, Wadsworth & Brooks, 1986.
  • [24] Martin Takác, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for svms. In ICML (3), pages 1022–1030, 2013.
  • [25] Herbert S Wilf. generatingfunctionology. AK Peters/CRC Press, 2005.
  • [26] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in neural information processing systems, pages 3639–3647, 2016.

Appendix A Proof of Lemma 1

Recall that , and its roots, denoted by , are ordered such that . In order to bound from above the magnitude of , we analyze a related polynomial which takes the following explicit form

The roots of are precisely (note that, , hence ). Thus, bounding from above (below) the magnitude of the roots of gives an upper (lower) bound for .

We first establish that for any , has a real-valued root in . Indeed, for any such , we have on the one hand,

and on the other hand (using the fact that for all ),

(19)

so by continuity of , we get that a real-valued root exists in .

Next, we show that non-dominant roots of are of absolute value smaller than . To this end, we invoke Rouché’s theorem, which states that for any two holomorphic functions in some region with closed contour , if for any , then and have the same number of zeros (counted with multiplicity) inside . In particular, choosing , and , it follows that if for all such that , then (which equals our polynomial ) has the same number of zeros as inside (namely, exactly ). However, since is a degree polynomial, it has exactly roots, so the only root of absolute value larger than is the real-valued one we found earlier. It remains to verify the condition for all such that . For that, it is sufficient to show that for all such , or equivalently, .

By the inequality (see Lemma 4 below), we have

It is straightforward to verify that

implying that

where in the last inequality we used the assumption that . As mentioned earlier, the roots of are exactly the reciprocals of the roots of , therefore we conclude

(20)

We now turn to bound from above. By definition, any root of satisfies . Thus, (note that as mentioned in the first part of the proof, ). This, in turn, gives