# Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods, obtained by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty of our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between straggling and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions.

## Authors

Sanghamitra Dutta, et al.


## 1 Introduction

Stochastic gradient descent (SGD) is the backbone of most state-of-the-art machine learning algorithms. Thus, improving the stability and convergence rate of SGD algorithms is critical for making machine learning algorithms fast and efficient.

Traditionally, SGD is run serially at a single node. However, for massive datasets, running SGD serially at a single server can be prohibitively slow. A solution that has proved successful in recent years is to parallelize the training across many learners (processing units). This method was first used at large scale in Google’s DistBelief [dean2012large], which used a central parameter server (PS) to aggregate gradients computed by learner nodes. While parallelism dramatically speeds up training, distributed machine learning frameworks face several challenges, such as:

Straggling Learners. In synchronous SGD, the PS waits for all learners to push gradients before it updates the model parameters. Random delays in computation (referred to as straggling) are common in today’s distributed systems [dean2013tail]. Waiting for slow and straggling learners can diminish the speed-up offered by parallelizing the training.

Gradient Staleness. To alleviate the problem of stragglers, SGD can be run in an asynchronous manner, where the central parameters are updated without waiting for all learners. However, learners may return stale gradients that were evaluated at an older version of the model, and this can make the algorithm unstable.

The key contributions of this work are:

1. Most SGD algorithms optimize the trade-off between training error and the number of iterations or epochs. However, the wallclock time per iteration is a random variable that depends on the gradient aggregation algorithm. We present a rigorous analysis of the trade-off between error and the actual runtime (instead of iterations), modelling runtimes as random variables with a general distribution. This analysis is then used to compare different SGD variants such as $K$-sync SGD, $K$-async SGD and $K$-batch-async SGD, as illustrated in Figure 1.

2. We present a new convergence analysis of asynchronous SGD and some of its variants, where we relax several commonly made assumptions such as bounded delays and gradients, exponential service times, and independence of the staleness process.

3. We propose a novel learning rate schedule that compensates for gradient staleness and improves the stability and convergence of asynchronous SGD, while preserving its fast runtime.

### 1.1 Related Works

Single Node SGD: Analysis of gradient descent dates back to classical works [boyd2004convex] in the optimization community. The problem of interest is the minimization of empirical risk of the form:

$$\min_{w} \Big\{ F(w) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N} f(w, \xi_n) \Big\}. \tag{1}$$

Here, $\xi_n$ denotes the $n$-th data point and its label, for $n = 1, \ldots, N$, and $f$ denotes the composite loss function. Gradient descent iteratively minimizes this objective by updating the parameter $w$ in the direction opposite to the gradient of $F$ at every iteration, as given by:

$$w_{j+1} = w_j - \eta \nabla F(w_j) = w_j - \frac{\eta}{N}\sum_{n=1}^{N} \nabla f(w_j, \xi_n).$$

The computation of $\nabla F$ over the entire dataset is expensive. Thus, stochastic gradient descent [robbins1951stochastic] with mini-batching is generally used in practice, where the gradient is evaluated over small, randomly chosen subsets of the data. Smaller mini-batches result in higher variance of the gradients, which affects convergence and the error floor [dekel2012optimal, li2014efficient, bottou2016optimization]. Algorithms such as AdaGrad [duchi2011adaptive] and Adam [kingma2015adam] gradually reduce the learning rate to achieve a lower error floor. Another class of algorithms includes stochastic variance reduction techniques such as SVRG [johnson2013accelerating], SAGA [roux2012stochastic] and their variants listed in [nguyen2017sarah]. For a detailed survey of different SGD variants, refer to [ruder2016overview].
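
As a concrete illustration, the following sketch runs serial mini-batch SGD on a toy one-dimensional least-squares objective, whose minimizer is the data mean. All names and constants here are illustrative, not from the paper.

```python
import random

def sgd(grad, w0, data, eta=0.1, batch=4, iters=500, seed=0):
    """Serial mini-batch SGD: at each step, average the gradient over a
    small random subset of the data and step against it."""
    rng = random.Random(seed)
    w = w0
    for _ in range(iters):
        mini = rng.sample(data, batch)
        g = sum(grad(w, x) for x in mini) / batch  # stochastic gradient estimate
        w -= eta * g
    return w

# Toy objective: F(w) = (1/N) sum_n (w - x_n)^2 / 2, minimized at the data mean.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w_hat = sgd(lambda w, x: w - x, 0.0, data)
```

Smaller `batch` values make `w_hat` noisier around the minimizer, illustrating the variance and error-floor discussion above.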

Synchronous SGD and Stragglers: To process large datasets, SGD is parallelized across multiple learners with a central PS. Each learner processes one mini-batch, and the PS aggregates all the gradients. The convergence of synchronous SGD is the same as that of mini-batch SGD with a $P$-fold larger mini-batch, where $P$ is the number of learners. However, the time per iteration grows with the number of learners, because of straggling learners that slow down randomly [dean2013tail]. Thus, it is important to juxtapose the error reduction per iteration with the runtime per iteration to understand the true convergence speed of distributed SGD.

To deal with stragglers and speed up machine learning, system designers have proposed several straggler mitigation techniques such as [harlap2016addressing] that try to detect and avoid stragglers. An alternate direction of work is to use redundancy techniques, e.g., replication or erasure codes, as proposed in [joshi2014delay, wang2015using, joshi2015queues, joshi2017efficient, lee2017speeding, tandon2017gradient, dutta2016short, halbawi2017improving, yang2017coded, yang2016fault, yu2017polynomial, karakus2017encoded, karakus2017straggler, charles2017approximate, li2017terasort, fahim2017optimal, ye2018communication, li2018fundamental, NewsletterPaper, DNNPaperISIT, mallick2018rateless] to deal with the stragglers, as also discussed in Remark 1.

Asynchronous SGD and Staleness: A complementary approach to deal with the issue of straggling is to use asynchronous SGD. In asynchronous SGD, any learner can evaluate the gradient and update the central PS without waiting for the other learners. Asynchronous variants of existing SGD algorithms have also been proposed and implemented in systems [dean2012large, gupta2016model, cipar2013solving, cui2014exploiting, ho2013more].

In general, analyzing the convergence of asynchronous SGD with the number of iterations is difficult in itself because of the randomness of gradient staleness. There are only a few pioneering works such as [tsitsiklis1986distributed, lian2015asynchronous, mitliagkas2016asynchrony, recht2011hogwild, agarwal2011distributed, mania2017perturbed, chaturapruek2015asynchronous, zhang2016staleness, peng2016arock, hannah2017more, hannah2016unbounded, sun2017asynchronous, leblond2017asaga] in this direction. In [tsitsiklis1986distributed], a fully decentralized analysis was proposed that considers no central PS. In [recht2011hogwild], a new asynchronous algorithm called Hogwild was proposed and analyzed under bounded gradient and bounded delay assumptions. This direction of research has been followed by several interesting works such as [lian2015asynchronous], which proposed a novel theoretical analysis under the bounded delay assumption for other asynchronous SGD variants. In [peng2016arock, hannah2017more, hannah2016unbounded, sun2017asynchronous], the framework of ARock was proposed for parallel coordinate descent and analyzed using Lyapunov functions, relaxing several existing assumptions such as the bounded delay assumption and the independence of the delays from the index of the blocks being updated. In algorithms such as Hogwild and ARock, every learner updates only a part of the central parameter vector $w$ at every iteration; these methods are thus essentially different in spirit from conventional asynchronous SGD settings [lian2015asynchronous, agarwal2011distributed] where every learner updates the entire vector $w$. In an alternate direction of work [mania2017perturbed], asynchrony is modelled as a perturbation.

### 1.2 Our Contributions

Existing machine learning algorithms mostly try to optimize the trade-off of error with the number of iterations, epochs or “work complexity” [bottou2016optimization]. Time to complete a task has traditionally been calculated in terms of work complexity measures [sedgewick2011algorithms], where the time taken to complete a task is a deterministic function of its size (number of operations). However, due to straggling and synchronization bottlenecks in the system, the same task can take different amounts of time across different learners or iterations. We bring a statistical perspective to traditional work complexity analysis that incorporates the randomness introduced by straggling. In this paper, we provide a systematic approach to analyze the expected error versus runtime for both synchronous and asynchronous SGD, and variants such as $K$-sync, $K$-batch-sync, $K$-async and $K$-batch-async SGD, by modelling the runtimes at each learner as i.i.d. random variables with a general distribution.

We also propose a new error convergence analysis for async and $K$-async SGD that holds for strongly convex objectives and can also be extended to non-convex formulations. In this analysis we relax the bounded delay assumption in [lian2015asynchronous] and the bounded gradient assumption in [recht2011hogwild]. We also remove the assumptions of exponential computation time and of the staleness process being independent of the parameter values [mitliagkas2016asynchrony], as we elaborate in Section 3.2. Interestingly, our analysis also brings out the regimes where asynchrony can be better or worse than synchrony in terms of speed of convergence. Further, we propose a new learning rate schedule to compensate for staleness and stabilize asynchronous SGD, which is related to but different from the momentum tuning in [mitliagkas2016asynchrony, zhang2017yellowfin], as we clarify in Remark 2.

The rest of the paper is organized as follows. Section 2 describes our problem formulation, introducing the system model and assumptions. Section 3 provides the main results of the paper: an analytical characterization of the expected runtime, a new convergence analysis for async and $K$-async SGD, and the proposed learning rate schedule to compensate for staleness. The analysis of expected runtime is elaborated further in Section 4. Proofs and detailed discussions are presented in the Appendix.

## 2 Problem Formulation

Our objective is to minimize the risk function of the parameter vector $w$ given in (1), using $N$ training samples. The total training set is a collection of $N$ data points with their corresponding labels or values. We use the notation $\xi$ to denote a random seed, which consists of either a single data point and its label or a single mini-batch ($m$ samples) of data and their labels.

### 2.1 System Model

We assume that there is a central parameter server (PS) with $P$ parallel learners, as shown in Figure 2. The learners fetch the current parameter vector $w$ from the PS as and when instructed in the algorithm. Then they compute gradients using one mini-batch and push their gradients back to the PS as and when instructed in the algorithm. At each iteration, the PS aggregates the gradients computed by the learners and updates the parameter $w$. Based on how these gradients are fetched and aggregated, we obtain different variants of synchronous or asynchronous SGD.

The time taken by learner $i$ to compute the gradient of one mini-batch is denoted by the random variable $X_i$ for $i = 1, \ldots, P$. We assume that the $X_i$'s are i.i.d. across mini-batches and learners.

### 2.2 Performance Metrics

There are two metrics of interest: Expected Runtime and Error.

###### Definition 1 (Expected Runtime per iteration).

The expected runtime per iteration is the expected (average) time taken to perform each iteration, i.e., the expected time between two consecutive updates of the parameter $w$ at the central PS.

###### Definition 2 (Expected Error).

The expected error after $J$ iterations is defined as $\mathbb{E}[F(w_J)] - F^*$, the expected gap of the risk function from its optimal value.

Our aim is to determine the trade-off between the expected error (which measures the accuracy of the algorithm) and the expected runtime after a total of $J$ iterations for the different SGD variants.

### 2.3 Variants of SGD

We now describe the SGD variants considered in this paper. Please refer to Figure 3 and Figure 4 for a pictorial illustration.

$K$-sync SGD: This is a generalized form of synchronous SGD, also suggested in [gupta2016model, chen2016revisiting], which offers some resilience to straggling since the PS does not wait for all the learners to finish. The PS waits only for the first $K$ out of $P$ learners to push their gradients. Once it receives $K$ gradients, it updates $w$ and cancels the remaining learners. The updated parameter vector is sent to all learners for the next iteration. The update rule is given by:

$$w_{j+1} = w_j - \frac{\eta}{K}\sum_{l=1}^{K} g(w_j, \xi_{l,j}). \tag{2}$$

Here $l$ denotes the indices of the $K$ learners that finish first, $\xi_{l,j}$ denotes the mini-batch of $m$ samples used by the $l$-th learner at the $j$-th iteration, and $g(w_j, \xi_{l,j})$ denotes the average gradient of the loss function evaluated over the mini-batch of size $m$. For $K = P$, the algorithm is exactly equivalent to fully synchronous SGD with $P$ learners.
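
To make the update rule (2) concrete, the sketch below performs $K$-sync steps on the toy objective $F(w) = w^2/2$: the PS keeps the $K$ earliest-finishing gradients and discards the stragglers. The exponential finish times and all constants are illustrative assumptions, not from the paper.

```python
import random

def k_sync_step(w, reports, K, eta):
    """One K-sync SGD update (cf. (2)): sort learner reports by finish
    time, average the K earliest gradients, discard the stragglers."""
    fastest = sorted(reports, key=lambda r: r[0])[:K]
    g_avg = sum(g for _, g in fastest) / K
    return w - eta * g_avg

rng = random.Random(1)
w = 10.0
for _ in range(100):
    # P = 4 learners all evaluate grad F(w) = w at the same w; random finish times.
    reports = [(rng.expovariate(1.0), w) for _ in range(4)]
    w = k_sync_step(w, reports, K=2, eta=0.1)
```

Since every learner here computes the same exact gradient, each step contracts $w$ by a factor $(1-\eta)$ regardless of which two learners win the race.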

$K$-batch-sync: In $K$-batch-sync, all $P$ learners start computing gradients with the same $w$. Whenever any learner finishes, it pushes its update to the PS and evaluates the gradient on the next mini-batch at the same $w$. The PS updates $w$ using the first $K$ mini-batches that finish and cancels the remaining learners. Theoretically, the update rule is still the same as (2), but $l$ now denotes the index of the mini-batch (out of the $K$ mini-batches that finished first) instead of the learner. However, $K$-batch-sync offers an advantage over $K$-sync in runtime per iteration, as no learner is idle.

$K$-async SGD: This is a generalized version of asynchronous SGD, also suggested in [gupta2016model]. In $K$-async SGD, all $P$ learners compute their respective gradients, each on a single mini-batch. The PS waits for the first $K$ out of $P$ learners that finish, but it does not cancel the remaining learners. As a result, for every update, the gradients returned by each learner might be computed at a stale or older value of the parameter $w$. The update rule is thus given by:

$$w_{j+1} = w_j - \frac{\eta}{K}\sum_{l=1}^{K} g(w_{\tau(l,j)}, \xi_{l,j}). \tag{3}$$

Here $l$ denotes the indices of the $K$ learners that contribute to the update at the corresponding iteration, $\xi_{l,j}$ is one mini-batch of $m$ samples used by the $l$-th learner at the $j$-th iteration, and $\tau(l,j)$ denotes the iteration index when the $l$-th learner last read $w$ from the central PS, where $\tau(l,j) \le j$. Also, $g(w_{\tau(l,j)}, \xi_{l,j})$ is the average gradient of the loss function evaluated over the mini-batch $\xi_{l,j}$ based on the stale value of the parameter $w_{\tau(l,j)}$. For $K = 1$, the algorithm is exactly equivalent to fully asynchronous SGD, and the update rule simplifies to:

$$w_{j+1} = w_j - \eta\, g(w_{\tau(j)}, \xi_j). \tag{4}$$

Here $\xi_j$ denotes the mini-batch of $m$ samples used by the learner that updates $w$ at the $j$-th iteration, and $\tau(j)$ denotes the iteration index when that particular learner last read $w$ from the central PS. Note that $\tau(j) \le j$.

$K$-batch-async: Observe in Figure 4 that $K$-async also suffers from some learners being idle while others are still working on their gradients, until any $K$ finish. In $K$-batch-async (proposed in [lian2015asynchronous]), the PS waits for $K$ mini-batches before updating itself, irrespective of which learners they come from. So whenever any learner finishes, it pushes its gradient to the PS, fetches the current parameter at the PS, and starts computing the gradient on the next mini-batch based on that current value. Interestingly, the update rule is again the same as (3), except that $l$ now denotes the indices of the $K$ mini-batches that finish first instead of the learners, and $\tau(l,j)$ denotes the version of the parameter that the learner computing the $l$-th mini-batch last read from the PS. While the error convergence of $K$-batch-async is similar to that of $K$-async, it reduces the runtime per iteration as no learner is idle.
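
The scheduling logic above can be sketched as a small discrete-event simulation of $K$-batch-async: each learner computes on the parameter version it last read, and the PS updates after every $K$ pushes, so gradients may be stale. The toy objective $F(w) = w^2/2$, the exponential service times, and all constants are illustrative assumptions.

```python
import heapq
import random

def k_batch_async(P=4, K=2, iters=50, eta=0.05, seed=0):
    """Event-driven sketch of K-batch-async SGD on F(w) = w^2/2."""
    rng = random.Random(seed)
    w, version = 10.0, 0
    # Pending computations: (finish_time, learner_id, w_read, version_read).
    events = [(rng.expovariate(1.0), i, w, 0) for i in range(P)]
    heapq.heapify(events)
    t, batch, staleness = 0.0, [], []
    while version < iters:
        t, i, w_read, v_read = heapq.heappop(events)
        batch.append(w_read)                # grad of w^2/2 at w_read is w_read
        staleness.append(version - v_read)  # how many updates this gradient missed
        if len(batch) == K:                 # K pushes form one PS update
            w -= eta * sum(batch) / K
            version += 1
            batch = []
        # The learner immediately reads the current w and starts its next mini-batch.
        heapq.heappush(events, (t + rng.expovariate(1.0), i, w, version))
    return w, t, max(staleness)

w_final, runtime, max_stale = k_batch_async()
```

No learner is ever idle in this scheme, which is exactly the runtime advantage over $K$-async discussed above.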

###### Remark 1.

Recent works such as [tandon2017gradient] propose erasure coding techniques to overcome straggling learners. Instead, the SGD variants considered in this paper, such as $K$-sync and $K$-batch-sync SGD, exploit the inherent redundancy in the data itself and ignore the gradients returned by straggling learners. If the data is well-shuffled such that it can be assumed to be i.i.d. across learners, then for the same effective batch size, ignoring straggling gradients gives equivalent error scaling as coded strategies, at a lower computing cost. However, coding strategies may be useful in the non-i.i.d. case, when the gradients supplied by each learner provide diverse information that is important to capture in the trained model.

### 2.4 Assumptions

Closely following [bottou2016optimization], we also make the following assumptions:

1. $F(w)$ is an $L$-smooth function. Thus,

$$\|\nabla F(w_1) - \nabla F(w_2)\|_2 \le L\, \|w_1 - w_2\|_2. \tag{5}$$

2. $F(w)$ is strongly convex with parameter $c$. Thus,

$$2c\,\big(F(w) - F^*\big) \le \|\nabla F(w)\|_2^2 \quad \forall\, w. \tag{6}$$

Refer to Appendix A for a discussion of strong convexity. Our results also extend to non-convex objectives, as discussed in Section 3.

3. The stochastic gradient is an unbiased estimate of the true gradient:

$$\mathbb{E}_{\xi_j \mid w_k}\big[g(w_k, \xi_j)\big] = \nabla F(w_k) \quad \forall\, k \le j. \tag{7}$$

Observe that this is slightly different from the common assumption that $\mathbb{E}_{\xi_j}[g(w_j, \xi_j)] = \nabla F(w_j)$ for all $j$. The parameter $w_k$ for $k > j$ is actually not independent of the data $\xi_j$. We thus make the assumption more rigorous by conditioning on $w_k$ for $k \le j$. Our requirement means that $w_k$ is the value of the parameter at the PS before the data $\xi_j$ was accessed, and can thus be assumed to be independent of the data $\xi_j$.

4. Similar to the previous assumption, we also assume that the variance of the stochastic update, given the value of $w_k$ at an iteration $k \le j$ before the data point $\xi_j$ was accessed, is bounded as follows:

$$\mathbb{E}_{\xi_j \mid w_k}\big[\|g(w_k, \xi_j) - \nabla F(w_k)\|_2^2\big] \le \frac{\sigma^2}{m} + \frac{M_G}{m}\, \|\nabla F(w_k)\|_2^2 \quad \forall\, k \le j. \tag{8}$$

Table 1 lists the notation used in this paper for reference.

## 3 Main Results

### 3.1 Runtime Analysis

We compare the theoretical wall clock runtime of the different SGD variants to illustrate the speed-up offered by different asynchronous and batch variants. A detailed discussion is provided in Section 4.

###### Theorem 1.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. random variables $X_1, X_2, \ldots, X_P$. Then the ratio of the expected runtimes per iteration for synchronous and asynchronous SGD is

$$\frac{\mathbb{E}[T_{\text{Sync}}]}{\mathbb{E}[T_{\text{Async}}]} = \frac{P\, \mathbb{E}[X_{P:P}]}{\mathbb{E}[X]}$$

where $X_{P:P}$ is the $P$-th order statistic (the maximum) of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

This result analytically characterizes the speed-up offered by asynchronous SGD for any general distribution on the wall clock time of each learner. To prove this result, we use ideas from renewal theory, as we discuss in Section 4. In the following corollary, we highlight this speed-up for the special case of exponential computation time.

###### Corollary 1.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. exponential random variables $X_i \sim \exp(\mu)$. Then the ratio of the expected runtimes per iteration for synchronous and asynchronous SGD is approximately $P \log P$.

Thus, the speed-up scales as $P \log P$ and can diverge to infinity for large $P$. We illustrate the speed-up for different distributions in Figure 5. It might be noted that a speed-up similar to Corollary 1 has also been obtained in a recent work [hannah2017more] under exponential assumptions.
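
Theorem 1 can also be checked numerically. The sketch below estimates $P\,\mathbb{E}[X_{P:P}]/\mathbb{E}[X]$ by Monte Carlo for exponential service times and compares it against the exact value $P \cdot H_P$, where $H_P = \sum_{i=1}^{P} 1/i \approx \log P$ is the harmonic number, matching Corollary 1; the parameters are illustrative.

```python
import random

def speedup_sync_vs_async(P, trials=20000, seed=0):
    """Monte Carlo estimate of P * E[X_{P:P}] / E[X] for X ~ exp(1)."""
    rng = random.Random(seed)
    max_mean = sum(max(rng.expovariate(1.0) for _ in range(P))
                   for _ in range(trials)) / trials
    return P * max_mean / 1.0  # E[X] = 1 for rate-1 exponentials

est = speedup_sync_vs_async(P=10)
exact = 10 * sum(1.0 / i for i in range(1, 11))  # P * H_P
```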

The next result illustrates the advantages offered by -batch-sync and async over their corresponding counterparts -sync and -async respectively.

###### Theorem 2.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. exponential random variables $X_i \sim \exp(\mu)$. Then the ratio of the expected runtimes per iteration for $K$-async (or $K$-sync) SGD and $K$-batch-async (or $K$-batch-sync) SGD is

$$\frac{\mathbb{E}[T_{K\text{-async}}]}{\mathbb{E}[T_{K\text{-batch-async}}]} = \frac{P\, \mathbb{E}[X_{K:P}]}{K\, \mathbb{E}[X]} \approx \frac{P \log\big(\frac{P}{P-K}\big)}{K}$$

where $X_{K:P}$ is the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

To prove this, we derive an exact expression (see Lemma 5 in Section 4) for the expected runtime of $K$-batch-async SGD, for any given i.i.d. distribution of the $X_i$'s, not necessarily exponential. The expected runtime per iteration is obtained as $\frac{K\,\mathbb{E}[X]}{P}$, using ideas from renewal theory. The full proof of Theorem 2 is also provided in Section 4.

Theorem 2 shows that as $K$ increases, the speed-up from using $K$-batch-async over $K$-async increases. For non-exponential distributions, we simulate the behaviour of the expected runtime in Figure 6 for $K$-sync, $K$-async and $K$-batch-async, for the Pareto and shifted-exponential distributions respectively.
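
Similarly, the ratio in Theorem 2 can be verified by Monte Carlo for exponential service times; for exponentials, $\mathbb{E}[X_{K:P}]$ equals the partial harmonic sum $\frac{1}{\mu}\sum_{i=P-K+1}^{P} 1/i$, and the logarithmic expression is its approximation. The parameters below are illustrative.

```python
import math
import random

def runtime_ratio(P, K, trials=20000, seed=1):
    """Monte Carlo estimate of P * E[X_{K:P}] / (K * E[X]) for X ~ exp(1)."""
    rng = random.Random(seed)
    kth_mean = sum(sorted(rng.expovariate(1.0) for _ in range(P))[K - 1]
                   for _ in range(trials)) / trials
    return P * kth_mean / K

est = runtime_ratio(P=10, K=5)
exact = 10 * sum(1.0 / i for i in range(6, 11)) / 5   # P * E[X_{K:P}] / K, exact
approx = 10 * math.log(10 / (10 - 5)) / 5             # P * log(P / (P - K)) / K
```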

### 3.2 Error Analysis Under Fixed Learning Rate

Theorem 3 below gives a convergence analysis of $K$-async SGD for a fixed learning rate $\eta$, relaxing the following assumptions made in the existing literature.


• In several prior works such as [mitliagkas2016asynchrony, lee2017speeding, dutta2016short, hannah2017more], it is often assumed, for ease of analysis, that runtimes are exponentially distributed. In this paper, we extend our analysis to any general service time distribution $X$.

• In [mitliagkas2016asynchrony], it is also assumed that the staleness process is independent of $w$. While this assumption simplifies the analysis greatly, it is not true in practice. For instance, in a two-learner case, the parameter after two iterations depends on whether the update in the first iteration was based on a stale gradient or on the current gradient, depending on which learner finished first. In this work, we remove this independence assumption.

• Instead of the bounded delay assumption in [lian2015asynchronous], we use a general staleness bound

$$\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(l,j)})\|_2^2\big] \le \gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]$$

which allows for large, but rare, delays.

• In [recht2011hogwild], the norm of the gradient is assumed to be bounded. However, if we assume that $\|g(w, \xi)\|_2 \le M$ for some constant $M$, then using (6) we obtain $2c\,(F(w) - F^*) \le \|\nabla F(w)\|_2^2 \le M^2$, implying that $F(w) - F^*$ is itself bounded, which is a very strong and restrictive assumption that we relax in this result.

Some of these assumptions have been addressed in the context of alternative asynchronous SGD variants in the recent works of [hannah2017more, hannah2016unbounded, sun2017asynchronous, leblond2017asaga].

###### Theorem 3.

Suppose the objective $F(w)$ is $c$-strongly convex and the learning rate $\eta$ is fixed. Also assume that, for some constant $\gamma$,

$$\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(l,j)})\|_2^2\big] \le \gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big].$$

Then, the error of $K$-async SGD after $J$ iterations is

$$\mathbb{E}[F(w_J)] - F^* \le \frac{\eta L \sigma^2}{2c\gamma' K m} + (1 - \eta c \gamma')^J \left(\mathbb{E}[F(w_0)] - F^* - \frac{\eta L \sigma^2}{2c\gamma' K m}\right) \tag{9}$$

where $\gamma'$ is a constant depending on $\gamma$ and $p_0$, and $p_0$ is a lower bound on the conditional probability that a gradient is not stale, i.e., that $\tau(l,j) = j$, given all the past delays and parameters.

Here, $\gamma$ is a measure of the staleness of the gradients returned by the learners; a smaller $\gamma$ indicates less staleness.

The full proof is provided in Appendix C. We first prove the result for $K = 1$ in Section C.1 for ease of understanding, and then provide the more general proof for any $K$ in Section C.2. We use Lemma 1 below to prove Theorem 3.

###### Lemma 1.

Suppose that $p_0^{(l,j)}$ is the conditional probability that $\tau(l,j) = j$, given all the past delays and all the previous values of $w$, and suppose $p_0^{(l,j)} \ge p_0$ for all $l$ and $j$. Then,

$$\mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2\big] \ge p_0\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]. \tag{10}$$
###### Proof.

By the law of total expectation,

$$\mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2\big] = p_0^{(l,j)}\, \mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2 \,\big|\, \tau(l,j) = j\big] + \big(1 - p_0^{(l,j)}\big)\, \mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2 \,\big|\, \tau(l,j) \ne j\big] \ge p_0\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]. \qquad \blacksquare$$

For the exponential distribution, $p_0$ is equal to $\frac{K}{P}$, as we discuss in Lemma 2. For non-exponential distributions, it is a constant in $(0, 1)$. For some special classes of distributions, such as new-longer-than-used (respectively new-shorter-than-used) distributions as defined in Definition 3, we can formally show that $p_0$ lies in $(0, \frac{K}{P}]$ (respectively $[\frac{K}{P}, 1)$). The following Lemma 2 provides bounds on $p_0$.

###### Lemma 2 (Bounds on p0).

Define $p_0 = \inf_{l,j} p_0^{(l,j)}$, i.e., the largest constant such that $p_0^{(l,j)} \ge p_0$ for all $l$ and $j$.

• For exponential computation times, $p_0^{(l,j)} = \frac{K}{P}$ for all $l, j$, and $p_0 = \frac{K}{P}$ is thus invariant of $l$ and $j$.

• For new-longer-than-used (see Definition 3) computation times, $p_0^{(l,j)} \le \frac{K}{P}$ and thus $p_0 \in (0, \frac{K}{P}]$.

• For new-shorter-than-used computation times, $p_0^{(l,j)} \ge \frac{K}{P}$ and thus $p_0 \in [\frac{K}{P}, 1)$.

The proof is provided in Section C.1.1.

For $K$-batch-async, the update rule is the same as for $K$-async, except that the index $l$ denotes the index of the mini-batch. Thus, the error analysis is exactly analogous. Our analysis can also be extended to non-convex objectives, as we show in Section C.2.1.

Now let us compare with $K$-sync SGD. We observe that the analysis of $K$-sync SGD is the same as that of serial SGD with mini-batch size $Km$. Thus,

###### Lemma 3 (Error of K-sync).

[bottou2016optimization] Suppose that the objective is $c$-strongly convex and the learning rate $\eta$ is fixed. Then, the error after $J$ iterations of $K$-sync SGD is

$$\mathbb{E}[F(w_J)] - F^* \le \frac{\eta L \sigma^2}{2cKm} + (1 - \eta c)^J \left(\mathbb{E}[F(w_0)] - F^* - \frac{\eta L \sigma^2}{2cKm}\right).$$

Can stale gradients win the race? For the same $\eta$, observe that the error given by Theorem 3 decays at the rate $(1 - \eta c \gamma')$ per iteration for $K$-async or $K$-batch-async SGD, while for $K$-sync SGD the decay rate with the number of iterations is $(1 - \eta c)$. Thus, depending on the values of $\gamma$ and $p_0$, the decay rate of $K$-async or $K$-batch-async SGD can be faster or slower than that of $K$-sync SGD; it is faster if $\gamma' > 1$. As an example, one might consider an exponential or new-shorter-than-used service time, where $p_0 \ge \frac{K}{P}$ and the staleness $\gamma$ can be made smaller by increasing $K$. It might be noted that asynchronous SGD can still be faster than synchronous SGD with respect to wall clock time even if its decay rate with respect to the number of iterations is lower, as every iteration is much faster in asynchronous SGD (roughly $P \log P$ times faster for fully asynchronous SGD with exponential service times, by Corollary 1).
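
The per-iteration comparison above can be evaluated numerically from the bound (9). The constants below ($\eta$, $c$, $L$, $\sigma^2$, $K$, $m$, and the initial gap) are illustrative assumptions, with the synchronous bound treated as the $\gamma' = 1$ case; the point is only that a smaller $\gamma'$ slows the per-iteration decay and raises the error floor.

```python
def error_bound(J, gamma_p, eta=0.01, c=1.0, L=10.0, sigma2=1.0, K=4, m=8, gap0=1.0):
    """Right-hand side of (9): an error floor plus geometric decay of the
    initial optimality gap at rate (1 - eta * c * gamma')."""
    floor = eta * L * sigma2 / (2 * c * gamma_p * K * m)
    return floor + (1 - eta * c * gamma_p) ** J * (gap0 - floor)

# gamma' = 1 recovers the synchronous decay rate; gamma' < 1 models staleness.
sync_bound = error_bound(J=1000, gamma_p=1.0)
async_bound = error_bound(J=1000, gamma_p=0.5)
```

Per iteration, the smaller $\gamma'$ gives the worse bound; the asynchronous variant can still win in wallclock time because its iterations are much faster.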

The maximum allowable learning rate for synchronous SGD can be much higher than that for asynchronous SGD. Similarly, the error floor for synchronous SGD is $\frac{\eta L \sigma^2}{2cKm}$, as compared to the asynchronous error floor of $\frac{\eta L \sigma^2}{2c\gamma' K m}$.

In Figure 7, we compare the theoretical trade-offs between synchronous SGD (Lemma 3) and asynchronous SGD (Theorem 3). Async-SGD converges very quickly, but to a higher error floor. Figure 8 shows the same comparison on the MNIST dataset, along with $K$-batch-async SGD.

### 3.3 Variable Learning Rate for Staleness Compensation

The staleness of the gradient is random and can vary across iterations. Intuitively, if the gradient is less stale, we want to weigh it more while updating the parameter $w$, and if it is more stale we want to scale down its contribution to the update. With this motivation, we propose the following condition on the learning rate $\eta_j$ at different iterations:

$$\eta_j\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le C \tag{11}$$

for a constant $C$. This condition is also inspired by our error analysis in Theorem 3, because it helps remove the bounded-staleness assumption made there. Using (11), we obtain the following convergence result.

###### Theorem 4.

Suppose the learning rate in the $j$-th iteration is $\eta_j$, and

$$\eta_j\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le C$$

for some constant $C$. Then, we have

$$\mathbb{E}[F(w_J)] - F^* \le \Delta + \big(\mathbb{E}[F(w_0)] - F^*\big) \prod_{j=1}^{J} (1 - \rho_j)$$

where the per-iteration decay factors $\rho_j$ and the error floor $\Delta$ are constants determined by the learning rates $\eta_j$, the constant $C$, and the problem parameters.

The proof is provided in Section C.3. In our analysis of asynchronous SGD, we observe that the staleness term $\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(j)})\|_2^2\big]$ is the most difficult to bound. For a fixed learning rate, we had assumed that it is bounded by $\gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]$. However, if we impose the condition (11) on $\eta_j$, we do not require this assumption. Our proposed condition actually provides a bound on the staleness term as follows:

$$\frac{\eta_j}{2}\, \mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(j)})\|_2^2\big] \le \frac{\eta_j L^2}{2}\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le \frac{C L^2}{2}. \tag{12}$$

Proposed Algorithmic Modification. Inspired by this analysis, we propose the learning rate schedule

$$\eta_j = \min\left\{ \frac{C}{\|w_j - w_{\tau(j)}\|_2^2},\ \eta_{\max} \right\} \tag{13}$$

where $\eta_{\max}$ is a suitably large ceiling on the learning rate. It ensures stability when the first term in (13) becomes large because the staleness $\|w_j - w_{\tau(j)}\|_2^2$ is small. The constant $C$ is chosen to be of the same order as the desired error floor. To implement this schedule, the PS needs to store the last-read model parameters for every learner. In Figure 9 we illustrate how this schedule can stabilize asynchronous SGD. We also show simulation results that characterize the performance of this algorithm in comparison with naive asynchronous SGD with a fixed learning rate.
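
A minimal sketch of the schedule (13), assuming the PS stores each learner's last-read copy of the parameter vector; the constants `C` and `eta_max` are illustrative.

```python
def staleness_lr(w_current, w_last_read, C=1e-3, eta_max=0.1):
    """Learning rate schedule (13): cap the step at eta_max, and scale it
    down inversely with the squared distance between the current parameter
    and the (possibly stale) copy the learner last read."""
    stale2 = sum((a - b) ** 2 for a, b in zip(w_current, w_last_read))
    return eta_max if stale2 == 0 else min(C / stale2, eta_max)

fresh = staleness_lr([1.0, 2.0], [1.0, 2.0])       # no staleness -> eta_max
very_stale = staleness_lr([1.0, 2.0], [4.0, 6.0])  # ||diff||^2 = 25 -> C / 25
```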

###### Remark 2.

The idea of a variable learning rate is related to the idea of momentum tuning in [mitliagkas2016asynchrony, zhang2017yellowfin] and may have a similar effect of stabilizing the convergence of asynchronous SGD. However, learning rate tuning is arguably more general, since asynchrony results in a momentum term in the gradient update (as shown in [mitliagkas2016asynchrony, zhang2017yellowfin]) only under the assumption that the staleness process is geometric and independent of $w$.

## 4 Runtime Analysis

In this section, we provide our analysis of the expected runtime of different variants of SGD. These lemmas are then used in the proofs of Theorem 1 and Theorem 2.

### 4.1 Runtime of K-Sync SGD

###### Lemma 4 (Runtime of K-sync SGD).

The expected runtime per iteration for $K$-sync SGD is

$$\mathbb{E}[T] = \mathbb{E}[X_{K:P}] \tag{14}$$

where $X_{K:P}$ is the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

###### Proof of Lemma 4.

We assume that the learners have i.i.d. computation times. When all $P$ learners start together and we wait for the first $K$ out of $P$ i.i.d. random variables to finish, the expected computation time for that iteration is $\mathbb{E}[X_{K:P}]$, where $X_{K:P}$ denotes the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$. ∎

Thus, for a total of $J$ iterations, the expected runtime is given by $J\, \mathbb{E}[X_{K:P}]$.

###### Remark 3.

For $X_i \sim \exp(\mu)$, the expected runtime per iteration is given by

$$\mathbb{E}[T] = \frac{1}{\mu} \sum_{i=P-K+1}^{P} \frac{1}{i} \approx \frac{1}{\mu} \log\Big(\frac{P}{P-K}\Big)$$

where the last step uses an approximation from [sheldon2002first]. For justification, the reader is referred to Section B.1.
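
The harmonic sum above and its logarithmic approximation can be checked directly:

```python
import math

def exact_runtime(P, K, mu=1.0):
    """Exact E[X_{K:P}] for exp(mu) service times: (1/mu) * sum_{i=P-K+1}^{P} 1/i."""
    return sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu

def approx_runtime(P, K, mu=1.0):
    """Remark 3 approximation: (1/mu) * log(P / (P - K))."""
    return math.log(P / (P - K)) / mu

exact = exact_runtime(P=100, K=50)
approx = approx_runtime(P=100, K=50)
```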

### 4.2 Runtime of K-Batch-Sync SGD

The expected runtime of $K$-batch-sync SGD is not analytically tractable in general, but for $X_i \sim \exp(\mu)$, the runtime per iteration is distributed as the sum of $K$ i.i.d. $\exp(P\mu)$ random variables. Refer to Section B.2 for an explanation. Thus, for $K$-batch-sync SGD, the expected time per iteration is given by

$$\mathbb{E}[T] = \frac{K}{P\mu}.$$

### 4.3 Runtime of K-Batch-Async SGD

###### Lemma 5 (Runtime of K-batch-async SGD).

The expected runtime per iteration for $K$-batch-async SGD, in the limit of a large number of iterations, is given by:

$$\mathbb{E}[T] = \frac{K\, \mathbb{E}[X]}{P}. \tag{15}$$

Unlike the results for the synchronous variants, this result on the average runtime per iteration holds only in the limit of a large number of iterations. To prove the result we use ideas from renewal theory. For a brief background on renewal theory, the reader is referred to Section B.3.

###### Proof of Lemma 5.

For the i-th learner, let N_i(t) be the number of times the i-th learner pushes its gradient to the PS in time t. The time between two successive pushes is an independent realization of X_i. Thus, the inter-arrival times are i.i.d. with mean inter-arrival time E[X_i]. Using the elementary renewal theorem [gallager2013stochastic, Chapter 5], we have

 lim_{t→∞} E[N_i(t)]/t = 1/E[X_i]. (16)

Thus, the rate of gradient pushes by the i-th learner is 1/E[X_i]. As there are P learners, we have a superposition of P renewal processes, and thus the average rate of gradient pushes to the PS is

 lim_{t→∞} ∑_{i=1}^{P} E[N_i(t)]/t = ∑_{i=1}^{P} 1/E[X_i] = P/E[X]. (17)

Every K pushes constitute one iteration. Thus, the expected runtime per iteration, or effectively the expected time for K pushes, is given by E[T] = K E[X]/P. ∎

Thus, for a large total number of iterations, the average runtime can be approximated by multiplying K E[X]/P by the number of iterations. Note that fully-synchronous SGD is actually K-sync SGD with K = P, i.e., waiting for all the P learners to finish. On the other hand, fully-asynchronous SGD is actually K-batch-async SGD with K = 1. Now, we provide the proofs of Theorem 1 and Corollary 1 respectively, which compare these two variants.
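To illustrate Lemma 5, here is a small renewal-style simulation (our own sketch, not from the original text; all parameter values are arbitrary). We deliberately use a shifted-exponential computation time to emphasize that the result depends only on E[X], not on X being exponential:

```python
import random

def kbatch_async_time_per_iter(P, K, sample_x, horizon, seed=0):
    """Simulate P independent learners, each repeatedly drawing an i.i.d.
    computation time X and pushing one gradient per completion; every K
    pushes form one iteration. Returns average wallclock time per iteration."""
    rng = random.Random(seed)
    pushes = 0
    for _ in range(P):
        t = 0.0
        while True:
            t += sample_x(rng)
            if t > horizon:
                break
            pushes += 1
    return horizon / (pushes / K)

# Shifted-exponential computation time X = delta + exp(mu), so E[X] = delta + 1/mu.
delta, mu, P, K = 0.5, 2.0, 8, 4
expected = K * (delta + 1.0 / mu) / P          # Lemma 5: K * E[X] / P
sim = kbatch_async_time_per_iter(P, K, lambda r: delta + r.expovariate(mu), horizon=50_000)
print(f"simulated {sim:.4f}  predicted {expected:.4f}")
```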

###### Proof of Theorem 1.

By taking the ratio of the expected runtimes per iteration in Lemma 4 with K = P and Lemma 5 with K = 1, we get the result in Theorem 1. ∎

###### Proof of Corollary 1.

The expectation of the maximum of P i.i.d. exp(μ) random variables is approximately (log P)/μ [sheldon2002first]. This can be substituted into Theorem 1 to obtain Corollary 1. ∎

### 4.4 Runtime of K-Async SGD

The expected runtime per iteration of K-async SGD is not analytically tractable for non-exponential X_i, but we obtain an upper bound on it for a class of distributions called the "new-longer-than-used" distributions, defined below.

###### Definition 3 (New-longer-than-used).

A random variable U is said to have a new-longer-than-used distribution if the following holds for all t, u ≥ 0:

 Pr(U > u + t | U > t) ≤ Pr(U > u).

Most of the continuous distributions we encounter, such as the normal, exponential, gamma, and beta distributions, are new-longer-than-used. Conversely, the hyperexponential distribution is new-shorter-than-used, and it satisfies Pr(U > u + t | U > t) ≥ Pr(U > u) for all t, u ≥ 0.
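The two cases can be checked numerically. In the sketch below (an illustration we add; the mixture parameters are arbitrary), the hyperexponential's conditional survival probability exceeds its unconditional one, while the memoryless exponential attains equality:

```python
import math

def survival_hyperexp(u, p=0.5, lam1=0.5, lam2=5.0):
    """Survival function Pr(U > u) of a two-phase hyperexponential:
    exp(lam1) with probability p, exp(lam2) with probability 1 - p."""
    return p * math.exp(-lam1 * u) + (1 - p) * math.exp(-lam2 * u)

def survival_exp(u, lam=1.0):
    """Survival function of an exponential with rate lam."""
    return math.exp(-lam * u)

def cond_survival(S, u, t):
    """Pr(U > u + t | U > t) for a survival function S."""
    return S(u + t) / S(t)

u, t = 1.0, 2.0
# Hyperexponential is new-shorter-than-used: having already survived t,
# it is MORE likely to survive a further u than a fresh draw is.
assert cond_survival(survival_hyperexp, u, t) >= survival_hyperexp(u)
# The memoryless exponential sits exactly at the boundary (equality).
assert abs(cond_survival(survival_exp, u, t) - survival_exp(u)) < 1e-12
print("hyperexponential: new-shorter-than-used; exponential: memoryless")
```

Intuitively, a straggler with a hyperexponential computation time that has already run for a long while is likely to be in its slow phase, so its remaining time looks longer than a fresh start.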

###### Lemma 6 (Runtime of K-async SGD).

Suppose that each X_i has a new-longer-than-used distribution. Then, the expected runtime per iteration for K-async SGD is upper-bounded as

 E[T] ≤ E[X_{K:P}] (18)

where X_{K:P} is the K-th order statistic of P i.i.d. random variables X_1, X_2, …, X_P.

The proof of this lemma is provided in Section B.4.

We provided a comparison of the expected runtimes of the K-async and K-batch-async SGD variants in Theorem 2 for the special case of exponential computation times. Here, we provide the proof of Theorem 2.

###### Proof of Theorem 2.

For exponential X_i, equality holds in (18) in Lemma 6, as we justify in Section B.4.1. The expectation can be derived as E[T] = E[X_{K:P}] = (H_P − H_{P−K})/μ, where H_P denotes the P-th harmonic number. For exponential X_i, the expected runtime per iteration for K-batch-async SGD is K/(Pμ) from Lemma 5. ∎
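As a numeric illustration of this equality (our own sketch, with arbitrary `P`, `K`, `mu`), the following event-driven simulation of K-async SGD tracks the residual work of each learner and compares the average iteration time against E[X_{K:P}]:

```python
import random

def kasync_mean_iter_time(P, K, mu, iters=100_000, seed=1):
    """Event-driven sketch of K-async SGD with exp(mu) computation times:
    an iteration ends when K of the P in-flight computations finish; those
    K learners draw fresh work, the others keep their leftover (residual) work."""
    rng = random.Random(seed)
    residual = [rng.expovariate(mu) for _ in range(P)]
    total = 0.0
    for _ in range(iters):
        order = sorted(range(P), key=residual.__getitem__)
        tau = residual[order[K - 1]]      # wallclock length of this iteration
        total += tau
        done = set(order[:K])
        for i in range(P):
            if i in done:
                residual[i] = rng.expovariate(mu)  # finished learner restarts
            else:
                residual[i] -= tau                 # still computing
    return total / iters

P, K, mu = 8, 3, 1.0
sim = kasync_mean_iter_time(P, K, mu)
# For exponential times, memorylessness gives equality in (18):
exact = sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu  # E[X_{K:P}]
print(f"simulated {sim:.4f}  E[X_{{K:P}}] = {exact:.4f}")
```

For new-longer-than-used but non-exponential distributions, the simulated mean would fall below `exact`, consistent with the upper bound in Lemma 6.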

In Figure 10, we pictorially illustrate the expected error-runtime trade-offs of K-async and K-batch-async SGD.

## 5 Conclusions

The speed of distributed SGD depends on the error reduction per iteration as well as the runtime per iteration. This paper presents a novel runtime analysis of synchronous and asynchronous SGD and their variants, for any general distribution on the wall-clock computation time of each learner. When juxtaposed with the error analysis, this yields error-runtime trade-offs that can be used to compare different SGD algorithms. We also give a new analysis of asynchronous SGD by relaxing some commonly made assumptions, and we propose a novel learning-rate schedule to compensate for gradient staleness.

In the future, we plan to explore methods that gradually increase synchrony, so that we can achieve fast convergence as well as a low error floor. We are also looking into the use of local updates to reduce the frequency of communication between the PS and the learners, which is closely related to [zhang2016parallel, yin2017gradient, zhou2017convergence, zhang2015deep].

### Acknowledgements

The authors thank Mark Wegman, Pulkit Grover and Jianyu Wang for their suggestions and feedback.

## Appendix A Strong Convexity Discussion

###### Definition 4 (Strong-Convexity).

A function h(u) is defined to be c-strongly convex if the following holds for all u and u′ in the domain:

 h(u′) ≥ h(u) + ∇h(u)^T (u′ − u) + (c/2) ||u′ − u||_2^2.

For strongly convex functions, the following result holds for all u in the domain of h:

 2c (h(u) − h*) ≤ ||∇h(u)||_2^2 (19)

The proof is derived in [bottou2016optimization]. For completeness, we give the sketch here.

###### Proof.

Given a particular u, let us define the quadratic function q(u′) as follows:

 q(u′) = h(u) + ∇h(u)^T (u′ − u) + (c/2) ||u′ − u||_2^2.

Now, q(u′) is minimized at u′ = u − (1/c) ∇h(u), and its minimum value is h(u) − (1/(2c)) ||∇h(u)||_2^2. Since strong convexity gives h(u′) ≥ q(u′) for all u′, we now have

 h* = min_{u′} h(u′) ≥ min_{u′} q(u′) = h(u) − (1/(2c)) ||∇h(u)||_2^2.

Rearranging yields (19). ∎
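The inequality can be sanity-checked numerically. The sketch below (our illustration; the curvature values are arbitrary) uses the separable quadratic h(u) = (1/2) Σ_i a_i u_i^2, which is c-strongly convex with c = min_i a_i and minimized at u = 0:

```python
import random

# Numeric check of 2c(h(u) - h*) <= ||grad h(u)||_2^2, i.e., inequality (19),
# on the separable quadratic h(u) = 0.5 * sum_i a_i * u_i^2.
a = [1.0, 3.0, 10.0]       # arbitrary curvatures for this illustration
c = min(a)                 # strong-convexity constant
h_star = 0.0               # h is minimized at u = 0

def h(u):
    return 0.5 * sum(ai * ui * ui for ai, ui in zip(a, u))

def grad_sq_norm(u):
    # ||grad h(u)||_2^2 with (grad h(u))_i = a_i * u_i
    return sum((ai * ui) ** 2 for ai, ui in zip(a, u))

rng = random.Random(0)
for _ in range(1000):
    u = [rng.uniform(-5.0, 5.0) for _ in a]
    assert 2 * c * (h(u) - h_star) <= grad_sq_norm(u) + 1e-9
print("inequality (19) held at all 1000 sampled points")
```

For this quadratic the check reduces to c · a_i ≤ a_i^2 coordinate-wise, which holds since c = min_i a_i.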

## Appendix B Runtime Analysis Proofs

Here we provide all the remaining proofs and supplementary information for the results in Section 4.

### B.1 Runtime of K-sync SGD

K-th order statistic of exponential random variables: Here we give a sketch of why the K-th order statistic of P i.i.d. exponentials scales as log(P/(P−K)). A detailed derivation can be obtained in [sheldon2002first]. Consider P i.i.d. exponential random variables with parameter μ. The minimum of P independent exponential random variables with parameter μ is exponential with parameter Pμ. Conditional on the minimum, the second smallest value is distributed as the sum of the minimum and an independent exponential random variable with parameter (P−1)μ. And so on, until the K-th smallest value, which is distributed as the sum of the (K−1)-th smallest value and an independent exponential random variable with parameter (P−K+1)μ. Thus,

 X_{K:P} = Y_P + Y_{P−1} + ⋯ + Y_{P−K+1}

where the random variables Y_i are independent and Y_i is exponential with parameter iμ. Thus,

 E[X_{K:P}] = ∑_{i=P−K+1}^{P} 1/(iμ) = (H_P − H_{P−K})/μ ≈ (1/μ) log( P/(P−K) ).

Here H_P and H_{P−K} denote the P-th and (P−K)-th harmonic numbers respectively.

For the case where K = P, the expectation is given by

 E[X_{P:P}] = (1/μ) ∑_{i=1}^{P} 1/i = H_P/μ ≈ (1/μ) log P.
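The quality of the logarithmic approximation to the harmonic-number difference can be tabulated directly (a small illustration we add here, with μ = 1 and arbitrary (P, K) pairs):

```python
import math

def expected_kth_order_stat(P, K, mu=1.0):
    """E[X_{K:P}] for P i.i.d. exp(mu) variables: (H_P - H_{P-K}) / mu."""
    return sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu

for P, K in [(10, 5), (100, 50), (100, 99), (1000, 999)]:
    exact = expected_kth_order_stat(P, K)
    approx = math.log(P / (P - K))
    print(f"P={P:5d} K={K:4d}  harmonic sum {exact:.4f}  log(P/(P-K)) {approx:.4f}")
```

The approximation tightens as P grows for fixed K/P, and is loosest when K is very close to P.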

### B.2 Runtime of K-batch-sync SGD

In general, the expected runtime per iteration of K-batch-sync SGD is not tractable, but for the special case of exponentials it follows the Erlang(K, Pμ) distribution. This is obtained from the memoryless property of exponentials.

All the P learners start their computation together. The time taken for the first mini-batch to be completed is the minimum of P i.i.d. exponential random variables, which is another exponential random variable distributed as exp(Pμ). At the time when the first mini-batch is complete, by the memoryless property of exponentials, the remaining computation times may be viewed as P i.i.d. exponential random variables starting afresh. Thus, the time to complete each mini-batch is distributed as exp(Pμ), and an iteration, being the sum of the times to complete K such mini-batches, has the Erlang(K, Pμ) distribution.

### B.3 Runtime of K-batch-async SGD

Here we include a discussion of renewal processes for completeness, to provide background for the proof of Lemma 5, which gives the expected runtime of K-batch-async SGD. The familiar reader can merely skim this section and refer to the proof provided in Section 4 of the main paper.

###### Definition 5 (Renewal Process).

A renewal process is an arrival process where the inter-arrival intervals are positive, independent and identically distributed random variables.

###### Lemma 7 (Elementary Renewal Theorem).

[gallager2013stochastic, Chapter 5] Let {N(t), t ≥ 0} be a renewal counting process denoting the number of renewals in time t, and let E[Z] be the mean inter-arrival time. Then,

 lim_{t→∞} E[N(t)]/t = 1/E[Z]. (20)

Observe that for asynchronous SGD or K-batch-async SGD, the gradient pushes by a learner to the PS can be thought of as an arrival process. The time between two consecutive pushes by the i-th learner follows the distribution of X_i and is independent across pushes, as the computation time has been assumed to be independent across learners and mini-batches. Thus the inter-arrival intervals are positive, independent, and identically distributed, and hence the gradient pushes form a renewal process.

### B.4 Runtime of K-async SGD

###### Proof of Lemma 6.

For new-longer-than-used distributions, observe that the following holds:

 Pr(X_i > u + t | X_i > t) ≤ Pr(X_i > u). (21)

Thus, the residual time (X_i − t | X_i > t) is stochastically dominated by X_i. Now suppose we want to compute the expected computation time of one iteration of K-async SGD starting at time t_0. Suppose also that the P learners last read their parameter values at times t_1, t_2, …, t_P respectively, where K of these are equal to t_0 (since K out of the P learners were updated at time t_0) and the remaining P − K are strictly less than t_0. Let Y_1, Y_2, …, Y_P be the random variables denoting the residual computation times of the P learners starting from time t_0. Thus,

 Y_i = ( X_i − (t_0 − t_i) | X_i > (t_0 − t_i) )  ∀ i = 1, 2, …, P. (22)

Now each of the Y_i's is independent and stochastically dominated by the corresponding X_i:

 Pr(Y_i > u) ≤ Pr(X_i > u)  ∀ i = 1, 2, …, P. (23)

The expectation of the K-th order statistic of Y_1, Y_2, …, Y_P is the expected runtime of the iteration. Let h_K(x_1, x_2, …, x_P) denote the K-th order statistic of the P numbers x_1, x_2, …, x_P. And let g_{K,s}(x) denote the K-th order statistic of P numbers where P − 1 of them are fixed as s = (s^(1), s^(2), …, s^(P−1)) and x is the P-th number. Thus,

 g_{K,s}(x) = h_K(x, s^(1), s^(2), …, s^(P−1)).

First observe that g_{K,s}(x) is an increasing function of x, since given the other P − 1 values, the K-th order statistic either stays the same or increases with x. Now we use the property that if Y_1 is stochastically dominated by X_1, then for any increasing function g, we have

 E_{Y_1}[g(Y_1)] ≤ E_{X_1}[g(X_1)].

This result is derived in [kreps1990course].

This implies that for a given s,

 E_{Y_1}[g_{K,s}(Y_1)] ≤ E_{X_1}[g_{K,s}(X_1)].