Local SGD With a Communication Overhead Depending Only on the Number of Workers

06/03/2020 ∙ by Artin Spiridonoff, et al. ∙ Boston University

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among n workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs Ω(√T) communications for T local gradient steps in order for the error to scale proportionately to 1/(nT), this has been successively improved in a string of papers, with the state of the art requiring Ω(n · poly(log T)) communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as 1/(nT) with only a fixed number of communications independent of T: specifically, only Ω(n) communications are required.


1 Introduction

Stochastic Gradient Descent (SGD) is a widely used algorithm for minimizing a convex or non-convex function $f$, in which the model parameters are updated iteratively as follows:

$$\mathbf{x}^{t+1} = \mathbf{x}^t - \eta_t \hat{\mathbf{g}}^t,$$

where $\hat{\mathbf{g}}^t$ is a stochastic gradient of $f$ at $\mathbf{x}^t$ and $\eta_t$ is the learning rate. This algorithm can be naively parallelized by adding more workers, each of which independently computes a gradient; the gradients are then averaged at each step to reduce the variance of the estimate of the true gradient dekel2012optimal . This method requires the workers to share their computed gradients with each other at every iteration.
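The variance reduction obtained by averaging independent gradients can be illustrated with a small simulation. The sketch below is only an illustration under assumed choices that do not come from the text: the toy objective f(x) = x²/2, unit-variance Gaussian noise, and the helper names `noisy_gradient` and `parallel_sgd_step` are all hypothetical.

```python
import random

def noisy_gradient(x, rng, sigma=1.0):
    """Stochastic gradient of the toy objective f(x) = x**2 / 2 (assumed here):
    the true gradient x plus zero-mean Gaussian noise."""
    return x + rng.gauss(0.0, sigma)

def parallel_sgd_step(x, eta, n_workers, rng, sigma=1.0):
    """One fully synchronous step: average n_workers independent
    stochastic gradients evaluated at the same point, then update."""
    grads = [noisy_gradient(x, rng, sigma) for _ in range(n_workers)]
    avg_grad = sum(grads) / n_workers
    return x - eta * avg_grad

def empirical_variance(samples):
    """Plain (biased) sample variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
x, trials, n = 1.0, 20000, 8
# Variance of a single worker's gradient estimate.
var_single = empirical_variance([noisy_gradient(x, rng) for _ in range(trials)])
# Variance of the average of n independent estimates (roughly var_single / n).
var_averaged = empirical_variance(
    [sum(noisy_gradient(x, rng) for _ in range(n)) / n for _ in range(trials)]
)
print(var_single, var_averaged)
```

The price of this reduction is one communication round per iteration, which is the cost that Local SGD tries to avoid.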

However, it is widely acknowledged that communication is a major bottleneck of this method for large scale optimization applications mcmahan2016communication ; konevcny2016federated ; lin2017deep . Often, mini-batch parallel SGD is suggested to address this issue by increasing the computation to communication ratio. Nonetheless, too large a mini-batch size might degrade the performance lin2018don . Along the same lines of increasing the computation to communication ratio, Local SGD has been proposed to reduce communications mcmahan2016communication ; dieuleveut2019communication . In this method, workers compute (stochastic) gradients and update their parameters locally, and communicate only once in a while to obtain the average of their parameters. Local SGD improves the communication efficiency not only by reducing the number of communication rounds, but also by alleviating the synchronization delay caused by waiting for slow workers and by evening out the variations in workers’ computing time wang2018cooperative .

On the other hand, since individual gradients of each worker are calculated at different points, this method introduces residual error as opposed to fully synchronous SGD. Therefore, there is a trade-off between having fewer communication rounds and introducing additional errors to the gradient estimates.

The idea of making local updates is not new and has been used in practice for a while konevcny2016federated . However, until recently, there have been few successful efforts to analyze Local SGD theoretically, and therefore it is not fully understood yet. The paper zhang2016parallel shows that for quadratic functions, when the variance of the noise is higher far from the optimum, frequent averaging leads to faster convergence. One of the main questions we want to ask is: how many communication rounds are needed for Local SGD to have the same convergence rate as synchronized parallel SGD while achieving performance that improves linearly in the number of workers?

stich2018local was among the earlier works that tried to answer this question for general strongly convex and smooth functions and showed that the number of communication rounds can be reduced by up to a factor of $H = \mathcal{O}(\sqrt{T/n})$, without affecting the asymptotic convergence rate (up to constant factors), where $T$ is the total number of iterations and $n$ is the number of parallel workers.

Focusing on smooth and possibly non-convex functions which satisfy a Polyak-Lojasiewicz condition, haddadpour2019local demonstrates that only $\Omega((nT)^{1/3})$ communication rounds are sufficient to achieve asymptotic performance that scales proportionately to $1/n$.

More recently, khaled2019tighter and stich2019error improve upon the previous works by showing linear speed-up for Local SGD with only $\Omega(n \,\mathrm{poly}\log(T))$ communication rounds when data is identically distributed among workers and $f$ is strongly convex. Their works also consider the case when $f$ is not necessarily strongly convex, as well as the case of data being heterogeneously distributed among workers in khaled2019tighter .

[b] noise model a convergent communication rounds, convergence rate, Reference uniform no b stich2018local uniform with strong-growth c no haddadpour2019local uniform with strong-growth no d stich2019error uniform yes khaled2019tighter uniform with strong-growth yes This Paper

  • (a) $H$ is the length of inter-communication intervals.

  • (b) $G$ is the uniform upper bound assumed for the norm of gradients in the corresponding work.

  • (c) This noise model is defined in Assumption 2.

  • (d) $\tilde{\mathcal{O}}(\cdot)$ ignores the poly-logarithmic and constant factors.

Table 1: Comparison of Similar Works

In this work, we focus on smooth and strongly-convex functions with a very general noise model. The main contribution of this paper is to propose a communication strategy which requires only $\Omega(n)$ communication rounds to achieve performance that scales as $1/n$ in the number of workers. To the best of the authors’ knowledge, this is the only work to show this result (without additional poly-logarithmic terms and constants). Our analysis can also recover some of the best known rates for special cases, e.g., when $H$ is constant, where $H$ is defined as the length of inter-communication intervals. A summary of our results compared to the available literature can be found in Table 1.

The rest of this paper is organized as follows. In the following subsection we outline the related literature and ongoing works. In Section 2 we define the main problem and state our assumptions. We present our theoretical findings in Section 3 and a sketch of the proofs in Section 4, followed by numerical experiments in Section 5 and concluding remarks in Section 6.

1.1 Related Works

There has been a lot of effort in the recent research to take into account the communication delays and training time in designing faster algorithms mcdonald2010distributed ; zhang2015deep ; bijral2016data ; kairouz2019advances . See tang2020communication for a comprehensive survey of communication efficient distributed training algorithms considering both system-level and algorithm-level optimizations.

Many works study the communication complexity of distributed methods for convex optimization arjevani2015communication ; woodworth2020local and statistical estimation zhang2013information . woodworth2020local presents a rigorous comparison of Local SGD with $K$ local steps and mini-batch SGD with a $K$ times larger mini-batch size and the same number of communication rounds (we will refer to such a method as large mini-batch SGD) and shows regimes in which each algorithm performs better: Local SGD is strictly better than large mini-batch SGD when the functions are quadratic. Moreover, they prove a lower bound on the worst case of Local SGD that is higher than the worst-case error of large mini-batch SGD in a certain regime. zhang2013information studies the minimum amount of communication required to achieve centralized minimax-optimal rates by establishing lower bounds on minimax risks for distributed statistical estimation under a communication budget.

A parallel line of work studies the convergence of Local SGD with non-convex functions zhou2017convergence . yu2019parallel was among the first works to present provable guarantees of Local SGD with linear speed up. wang2018cooperative and koloskova2020unified present unified frameworks for analyzing decentralized SGD with local updates, elastic averaging or changing topology. The follow-up work wang2018adaptive presents ADACOMM, an adaptive communication strategy that starts with infrequent averaging and then increases the communication frequency in order to achieve a low error floor. They analyze the error-runtime trade-off of Local SGD with nonconvex functions and propose communication times to achieve faster runtime.

In One-Shot Averaging (OSA), workers perform local updates with no communication during the optimization until the end, when they average their parameters. This method can be seen as an extreme case of Local SGD with $H = T$, at the opposite end from synchronous SGD mcdonald2009efficient ; zinkevich2010parallelized ; zhang2013communication ; rosenblatt2016optimality ; godichon2017rates . dieuleveut2019communication provides a non-asymptotic analysis of mini-batch SGD and one-shot averaging, as well as regimes in which mini-batch SGD could outperform one-shot averaging.

Another line of work reduces the communication by compressing the gradients and hence limiting the number of bits transmitted in every message between workers lin2017deep ; alistarh2017qsgd ; wangni2018gradient ; stich2018sparsified ; stich2019error .

Asynchronous methods have been studied widely due to their advantages over synchronous methods, which suffer from synchronization delays caused by slower workers olshevsky2018robust . wang2019matcha studies the error-runtime trade-off in decentralized optimization and proposes MATCHA, an algorithm which parallelizes inter-node communication by decomposing the topology into matchings. hendrikx2019accelerated provides an accelerated stochastic algorithm for decentralized optimization of finite-sum objective functions that, by carefully balancing the ratio between communications and computations, matches the rates of the best known sequential algorithms while having the network scaling of optimal batch algorithms. However, these methods are relatively more involved and often require full knowledge of the network, solving a semi-definite program and/or calculating communication probabilities (schedules).

1.2 Notation

For a positive integer $s$, we define $[s] := \{1, \dots, s\}$. We use bold letters to represent vectors. We denote vectors of all $0$s and $1$s by $\mathbf{0}$ and $\mathbf{1}$, respectively. We use $\|\cdot\|$ for the Euclidean norm.

2 Problem Formulation

Suppose there are $n$ workers, $i = 1, \dots, n$, trying to minimize $f : \mathbb{R}^d \to \mathbb{R}$ in parallel. We assume all workers have access to $f$ through noisy gradients. In Local SGD, workers perform local gradient steps and occasionally calculate the average of all workers’ iterates.

Having access to the same objective function is of special interest if the data is stored in one place accessible to all machines or is distributed identically among workers with no memory constraints. We hope that results presented here can be extended to applications with heterogeneous data distributions khaled2019tighter .

We will make the following additional assumptions.

Assumption 1.

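Function $f$ is differentiable, $\mu$-strongly convex and $L$-smooth for $0 < \mu \leq L$. In particular,

$$\frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2 \;\leq\; f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \;\leq\; \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|^2, \qquad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d.$$

We define $\kappa := L/\mu$ to be the condition number of $f$.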

We make the following assumption on the noise of the stochastic gradients.

Assumption 2.

Each worker $i$ has access to a gradient oracle which returns an unbiased estimate of the true gradient in the form $\hat{\mathbf{g}}_i(\mathbf{x}) = \nabla f(\mathbf{x}) + \boldsymbol{\epsilon}_i$, such that $\boldsymbol{\epsilon}_i$ is a zero-mean conditionally independent random noise with its expected squared norm error bounded as

$$\mathbb{E}\left[\|\boldsymbol{\epsilon}_i\|^2 \mid \mathbf{x}\right] \leq c\,\|\nabla f(\mathbf{x})\|^2 + \sigma^2,$$

where $c, \sigma^2 \geq 0$ are constants.

To save space, we define $\hat{\mathbf{g}}_i^t := \hat{\mathbf{g}}_i(\mathbf{x}_i^t)$ as the stochastic gradient of node $i$ at iteration $t$, and $\mathbf{g}_i^t := \nabla f(\mathbf{x}_i^t)$ as the true gradient at the same point.

The noise model of Assumption 2 is very general and includes the common case of uniformly bounded squared norm error, obtained when $c = 0$. As noted by zhang2016parallel , the advantage of periodic averaging over one-shot averaging only appears when $c$ is large. Therefore, to study Local SGD, it is important to consider a noise model as in Assumption 2 to capture the effects of frequent averaging. Among the related works mentioned in Table 1, only stich2019error and haddadpour2019local analyze this noise model, while the rest study the special case with $c = 0$. SGD under this noise model with $c > 0$ and $\sigma^2 = 0$ was first studied in schmidt2013fast under the name strong-growth condition. Therefore we refer to the noise model considered in this work as uniform with strong-growth.
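A gradient oracle of this kind can be simulated directly. The sketch below assumes the noise bound has the form E‖ε‖² ≤ c‖∇f(x)‖² + σ² and makes the bound tight with isotropic Gaussian noise; the quadratic objective and the names `grad_f`, `noisy_oracle` and `mean_sq_noise` are hypothetical choices made only for illustration.

```python
import math
import random

def grad_f(x, mu=1.0):
    """True gradient of the illustrative objective f(x) = (mu / 2) * ||x||^2."""
    return [mu * xi for xi in x]

def noisy_oracle(x, c, sigma2, rng):
    """Return grad_f(x) + eps, where eps is zero-mean Gaussian noise with
    E||eps||^2 = c * ||grad_f(x)||^2 + sigma2 (assumed noise model)."""
    g = grad_f(x)
    d = len(g)
    grad_sq = sum(gi * gi for gi in g)
    # Spread the total noise budget evenly over the d coordinates.
    per_coord_std = math.sqrt((c * grad_sq + sigma2) / d)
    return [gi + rng.gauss(0.0, per_coord_std) for gi in g]

def mean_sq_noise(x, c, sigma2, rng, trials=20000):
    """Monte Carlo estimate of E||eps||^2 at the point x."""
    g = grad_f(x)
    total = 0.0
    for _ in range(trials):
        est = noisy_oracle(x, c, sigma2, rng)
        total += sum((e - gi) ** 2 for e, gi in zip(est, g))
    return total / trials

rng = random.Random(1)
# With c > 0 the noise grows with the gradient norm, i.e. far from the optimum.
far = mean_sq_noise([3.0, 4.0], c=0.5, sigma2=2.0, rng=rng)
near = mean_sq_noise([0.3, 0.4], c=0.5, sigma2=2.0, rng=rng)
print(far, near)
```

Setting `c=0` in this sketch gives the uniformly bounded case, where the noise level no longer depends on the distance to the optimum.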

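In Local SGD, each worker $i \in [n]$ holds a local parameter $\mathbf{x}_i^t$ at iteration $t$ and a set $\mathcal{I} \subseteq [T]$ of communication times, and performs the following update:

$$\mathbf{x}_i^{t+1} = \begin{cases} \frac{1}{n}\sum_{j=1}^{n} \left(\mathbf{x}_j^t - \eta_t \hat{\mathbf{g}}_j^t\right), & \text{if } t+1 \in \mathcal{I}, \\ \mathbf{x}_i^t - \eta_t \hat{\mathbf{g}}_i^t, & \text{otherwise.} \end{cases} \qquad (1)$$

When $\mathcal{I} = [T]$, we recover the fully synchronized parallel SGD, while $\mathcal{I} = \{T\}$ recovers one-shot averaging. The pseudo code for Local SGD is provided as Algorithm 1.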

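Input: $\mathbf{x}_i^0 = \mathbf{x}^0$ for $i \in [n]$, total number of iterations $T$, the step-size sequence $\{\eta_t\}_{t=0}^{T-1}$ and $\mathcal{I} \subseteq [T]$
for $t = 0, \dots, T-1$ do
   for $i = 1, \dots, n$ do
      evaluate a stochastic gradient $\hat{\mathbf{g}}_i^t$
      if $t + 1 \in \mathcal{I}$ then
         $\mathbf{x}_i^{t+1} = \frac{1}{n}\sum_{j=1}^{n} (\mathbf{x}_j^t - \eta_t \hat{\mathbf{g}}_j^t)$
      else
         $\mathbf{x}_i^{t+1} = \mathbf{x}_i^t - \eta_t \hat{\mathbf{g}}_i^t$
      end if
   end for
end for
Algorithm 1 Local SGD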
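Algorithm 1 can be sketched in a few lines of Python. This is a minimal scalar illustration, not the authors' implementation: the function name `local_sgd`, the callback signatures, the toy objective f(x) = x²/2 and the step size 1/(t+2) are all assumptions introduced here.

```python
import random

def local_sgd(grad_oracle, x0, n_workers, T, step_size, comm_times):
    """Run Local SGD on a scalar problem and return the final worker iterates.

    grad_oracle(x, worker) -> stochastic gradient at x (hypothetical callback)
    step_size(t)           -> learning rate eta_t
    comm_times             -> set of indices in 1..T at which workers average
    """
    xs = [x0] * n_workers
    for t in range(T):
        eta = step_size(t)
        # Every worker takes one local stochastic gradient step.
        xs = [x - eta * grad_oracle(x, i) for i, x in enumerate(xs)]
        if (t + 1) in comm_times:
            # Communication round: replace every iterate by the average.
            avg = sum(xs) / n_workers
            xs = [avg] * n_workers
    return xs

rng = random.Random(0)
oracle = lambda x, i: x + rng.gauss(0.0, 1.0)   # gradient of x**2 / 2 plus noise
step = lambda t: 1.0 / (t + 2)                  # assumed decaying step size
T = 200
synchronous = local_sgd(oracle, 5.0, 4, T, step, set(range(1, T + 1)))
one_shot = local_sgd(oracle, 5.0, 4, T, step, {T})
print(synchronous[0], one_shot[0])
```

Passing every index in 1..T as a communication time gives fully synchronized parallel SGD, while passing only T gives one-shot averaging; any set in between trades communication for consensus error.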

The main goal of this paper is to study the effect of communication times on the convergence of the Local SGD and provide better theoretical guarantees. In what follows, we claim that by carefully choosing the step size, linear speed-up of parallel SGD can be attained with only a small number of communication instances.

3 Convergence Results

In this section we present our convergence results for Local SGD. In the following theorem, we show an upper bound for the sub-optimality error, in the sense of function value, for any choice of communication times $\mathcal{I}$.

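Before proceeding with our results, let us introduce some notation. Let $0 = \tau_0 < \tau_1 < \dots < \tau_R = T$ be the communication times. Define $H_i := \tau_{i+1} - \tau_i$ as the length of the $(i+1)$-th inter-communication interval, for $i = 0, \dots, R-1$. Moreover, define $\bar{\mathbf{x}}^t := \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i^t$ as the average of the iterates of all workers. Notice that $\bar{\mathbf{x}}^t = \mathbf{x}_i^t$ for every worker $i$ whenever $t \in \mathcal{I}$.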

The main results of this paper will be obtained by specializing the following bound.

Theorem 1.

Suppose Assumptions 1 and 2 hold. Choose and communication times such that it holds

(2)

Set . Then, using Algorithm 1, we have

(3)

where and is the most recent communication time.

The last term in Equation (3) is due to the disagreement between workers (consensus error), introduced by local computations without any communication. As the inter-communication intervals become larger, this term grows as well and increases the overall optimization error. This term explains the trade-off between communication efficiency and the optimization error.

Note that condition (2) is mild. For instance, it suffices to set . Moreover, the bound in (3) is for the last iterate , and does not require keeping track of a weighted average of all the iterates.

Theorem 1 not only bounds the optimization error, but also provides a methodological approach to selecting the communication times to achieve smaller errors. In scenarios where the user can afford a certain number of communications, they can select the communication times to minimize the last term in (3).

We next discuss the implications of Theorem 1 under various conditions.

One-Shot Averaging.

Plugging in Theorem 1, we obtain a convergence rate of without any linear speed-up. Among previous works, only khaled2019tighter shows a similar result.

3.1 Fixed-Length Intervals

A simple way to select the communication times $\mathcal{I}$ is to split the whole training time into intervals of length at most $H$. Then we can use the following bound in Equation (3),

We state this result formally in the following corollary.

Corollary 1.

Suppose the assumptions of Theorem 1 hold and, in addition, workers communicate at least once every $H$ iterations. Then,

(4)
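A fixed-length schedule is straightforward to generate. The sketch below uses a hypothetical helper `fixed_length_schedule`, and it assumes that a final averaging round at iteration T is always wanted, which is a convention adopted here for illustration.

```python
def fixed_length_schedule(T, H):
    """Communication times H, 2H, ... up to T, with a final round at T.

    Every inter-communication interval has length at most H, so the
    schedule uses ceil(T / H) communication rounds in total.
    """
    if T < 1 or H < 1:
        raise ValueError("T and H must be positive integers")
    times = list(range(H, T + 1, H))
    if not times or times[-1] != T:
        # Assumed convention: always average once more at the last iteration.
        times.append(T)
    return times

schedule = fixed_length_schedule(T=10, H=3)
print(schedule)
```

With `H=1` this yields a round at every iteration (synchronized SGD), and with `H >= T` it yields the single round of one-shot averaging.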

Linear Speed-Up.

Setting we achieve linear-speed up in the number of workers, which is equivalent to a communication complexity of . To the best of the authors’ knowledge, this is the tightest communication complexity that is shown to achieve linear speed-up. khaled2019tighter and stich2019error have shown a similar communication complexity, however with slightly higher degrees of dependence on , e.g., in khaled2019tighter .

Recovering Synchronized SGD.

When $H = 1$, the last term in (4) disappears and we recover the convergence rate of parallel SGD, albeit with a worse dependence on $\kappa$.

3.2 Varying Intervals

In the previous subsection, we observed that with our current analysis, having fixed-length inter-communication intervals, linear speed-up can be achieved with only rounds of communications. A natural question that might arise is whether we can improve the result above even further.

Let us allow the consecutive inter-communication intervals, i.e., $H_i = \tau_{i+1} - \tau_i$, to grow linearly, where $\tau_0 < \tau_1 < \dots$ are the communication times. The following theorem presents a performance guarantee for this choice of communication times.

Theorem 2.

Suppose Assumptions 1 and 2 hold. Choose the maximum number of communications and set , and for . Choose and set . Then using Algorithm 1 we have,

(5)

The choice of communication times in Theorem 2 aligns with the intuition that workers need to communicate more frequently at the beginning of the optimization. As the step-sizes become smaller and workers’ local parameters get closer to the global minimum, they diverge more slowly from each other and, hence, less communication is required to re-align them.
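A schedule with linearly growing intervals can be generated as follows. The exact constants of Theorem 2 are not reproduced here: the sketch assumes interval lengths H_i = a(i+1) with scale a = ⌈2T/R²⌉, which is one choice that guarantees at most R communication rounds, and the helper name `growing_schedule` is hypothetical.

```python
def growing_schedule(T, R):
    """Communication times whose gaps grow linearly: the i-th gap is a * (i + 1).

    The scale a = ceil(2T / R^2) is an assumed choice. It ensures the first R
    gaps sum to a * R * (R + 1) / 2 >= T, so at most R rounds are used.
    """
    if T < 1 or R < 1:
        raise ValueError("T and R must be positive integers")
    a = -(-2 * T // (R * R))  # integer ceiling of 2T / R^2, always >= 1
    times, tau, i = [], 0, 0
    while tau < T:
        tau = min(tau + a * (i + 1), T)  # cap the last interval at T
        times.append(tau)
        i += 1
    return times

schedule = growing_schedule(T=100, R=5)
gaps = [b - a for a, b in zip([0] + schedule[:-1], schedule)]
print(schedule, gaps)
```

Early rounds are close together, when step sizes are large and workers drift apart quickly, and later rounds are spread out.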

Linear Speed-Up.

Choosing $R = \Omega(n)$ communication rounds, we achieve an error that scales as $\mathcal{O}(1/(nT))$ in the number of workers when $T = \Omega(n^2)$. This is the main result of this paper: it shows that we can get a linear speed-up in the number of workers by simply increasing the number of iterations while keeping the total number of communications bounded.

4 Sketch of Proof

Here we give an outline of the proofs of the results presented in this paper. The proofs of the following lemmas are deferred to the Appendix.

Perturbed Iterates.

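A common approach in analyzing parallel algorithms such as Local SGD is to study the evolution of the sequence $\{\bar{\mathbf{x}}^t\}_{t \geq 0}$. We have,

$$\bar{\mathbf{x}}^{t+1} = \bar{\mathbf{x}}^t - \eta_t \hat{\mathbf{g}}^t, \qquad (6)$$

where $\hat{\mathbf{g}}^t := \frac{1}{n}\sum_{i=1}^{n} \hat{\mathbf{g}}_i^t$ is the average of the stochastic gradient estimates of all workers.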
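The recursion for the averaged iterate holds whether or not a communication round takes place, which can be checked on both branches of the Local SGD update. The derivation below is a sketch that assumes the notation $\bar{\mathbf{x}}^t = \frac{1}{n}\sum_i \mathbf{x}_i^t$ for the average iterate, $\hat{\mathbf{g}}^t = \frac{1}{n}\sum_i \hat{\mathbf{g}}_i^t$ for the average stochastic gradient, and $\mathcal{I}$ for the set of communication times.

```latex
\begin{align*}
\text{local step } (t+1 \notin \mathcal{I}):\quad
\bar{\mathbf{x}}^{t+1}
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\mathbf{x}_i^{t} - \eta_t \hat{\mathbf{g}}_i^{t}\bigr)
   = \bar{\mathbf{x}}^{t} - \eta_t \hat{\mathbf{g}}^{t}, \\
\text{averaging step } (t+1 \in \mathcal{I}):\quad
\bar{\mathbf{x}}^{t+1}
  &= \frac{1}{n}\sum_{i=1}^{n}\Bigl[\frac{1}{n}\sum_{j=1}^{n}\bigl(\mathbf{x}_j^{t} - \eta_t \hat{\mathbf{g}}_j^{t}\bigr)\Bigr]
   = \bar{\mathbf{x}}^{t} - \eta_t \hat{\mathbf{g}}^{t}.
\end{align*}
```

Averaging leaves the mean unchanged, so communication affects the analysis only through the points at which the individual gradients are evaluated.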

Let us define to be the optimality error. The following lemma, which is similar to a part of the proof found in haddadpour2019local , bounds the optimality error at each iteration recursively.

Lemma 1.

Let Assumptions 1 and 2 hold. Then,

Equipped with Lemma 1, we can bound the consensus error () as well as the term in the following lemmas.

Consensus Error.

In the following lemmas, we utilize the structure of the problem to bound the consensus error recursively.

Lemma 2.

Let Assumptions 1 and 2 hold. Then,

(7)

This lemma bounds how much the consensus error grows at each iteration. Of course, when workers communicate, this error resets to zero, and thus we can calculate an upper bound for the consensus error given the step-size sequence and the last iteration at which communication occurred. The following lemma takes care of that. Before stating the following lemma, let us define .

Lemma 3.

Let assumptions of Theorem 1 hold. Then,

(8)

Variance.

Our next lemma bounds .

Lemma 4.

Under Assumption 2 we have,

The proofs of Theorems 1 and 2 follow from these lemmas. Due to space constraints, these proofs are given in the supplementary information.

5 Numerical Experiments

To verify our findings and compare different communication strategies in Local SGD, we performed the following numerical experiments.

5.1 Quadratic Function With Strong-Growth Condition

As discussed in zhang2016parallel ; dieuleveut2019communication , under uniformly bounded variance, one-shot averaging performs asymptotically as well as mini-batch SGD. Therefore, to fully capture the importance of the choice of communication times , we design a hard problem, where noise variance is uniform with strong-growth condition, defined in Assumption 2. Let us define where,

(9)

, where and ,

are random variables with normal distributions. We assume at each iteration

, each worker samples a and uses as a stochastic estimate of . It is easy to verify that is -strongly convex, and , where and .

We use Local SGD to minimize using different communication strategies. We select , machines and iterations and the step-size sequence with . We start each simulation from the initial point of and repeat each simulation times. The average of the results are reported in Figures 1(a) and 1(b). Moreover, average performance of Local SGD with different number of workers and the communication strategy proposed in this paper with is shown in Figure 1(c) along with the respective convergence rate of .

(a) Error over iterations.
(b) Error over communications.
(c) Speed-up in the network size.
Figure 1: Local SGD with different communication strategies with defined in (9), . Figures (a) and (b) show the error of different communication methods over iteration and communication round, respectively, with a fixed network size of . Figure (c) shows the convergence of Local SGD with the communication method proposed in this paper ( communication rounds) for different network sizes. The dashed lines are showing .

Figure 1(a) shows that the method with increasing communication intervals () proposed in this paper performs better than the other communication strategies in the transient time as well as in the final error, while requiring far fewer communication rounds. In particular, the method with the same number of communications but fixed intervals () has both higher transient error and higher final error. This affirms the advantages of having more frequent communication at the beginning of the optimization. Indeed, observe that in Figure 1(a), the only method which outperforms the method we propose is the one that communicates at every step.

Figure 1(b) reveals the effectiveness of each communication round in different methods. We observe that there’s an initial spike in the initial communications in methods and . This is mainly because these two methods have more frequent communications at the beginning of the training, where the step-sizes are larger. Other methods experience this increase as well, however since they communicate later, it’s not observed in this figure. Indeed, observe that the only method which makes better use of communication periods than our method in Figure 1(b) is one-shot averaging, which is not competitive in terms of its final error.

Figure 1(c) verifies that linear-speed up in the number of workers can be achieved with only communication rounds. Moreover, it shows that Local SGD achieves the optimal convergence rate of asymptotically.

5.2 Regularized Logistic Regression

We also performed additional numerical experiments with regularized logistic regression using two real data sets. Due to space constraints, the results are presented in supplementary information.

6 Conclusion

We have presented a new analysis of Local SGD and studied the effect of the choice of communication times on the final optimality error. We proposed a communication strategy which achieves linear speed-up in the number of workers with only $\Omega(n)$ communication rounds, independent of the total number of iterations $T$. Numerical experiments further confirmed our theoretical findings, and showed that our method achieves smaller error than previous methods using fewer communications.

Broader Impact

The results presented in this paper could help speed up training in many machine learning applications. The potential broader impacts are therefore somewhat generic for machine learning: this research could amplify all the benefits ML can bring by making it cheaper in terms of computational cost, while simultaneously amplifying all the ways ML could be misused.

References

Appendix A Missing Proofs

Let us define the following notations used in the proofs presented here.

Moreover, define .

Lemma (1).

Let Assumptions 1 and 2 hold. Then,

Proof of Lemma 1.

By Assumption 1 and (6) we have,

(10)

We bound the first term on the R.H.S of (10) by conditioning on as follows:

(11)

where we used in the second equation and as well as smoothness of in the last inequality. Taking full expectation of (11) and combining it with (10) concludes the lemma. ∎

We state an important identity in the following lemma.

Lemma 5.

Let be arbitrary vectors. Define . Then,

Proof.

We have
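Assuming the identity in question is the usual decomposition Σᵢ‖uᵢ − ū‖² = Σᵢ‖uᵢ‖² − n‖ū‖², where ū is the mean of the vectors, it can be checked numerically. The sketch below is only a sanity check under that assumption, and the helper names are hypothetical.

```python
import random

def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(d)]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vk * vk for vk in v)

def deviation_identity_sides(vectors):
    """Return (lhs, rhs) of the assumed identity
    sum_i ||u_i - u_bar||^2 = sum_i ||u_i||^2 - n * ||u_bar||^2."""
    n = len(vectors)
    u_bar = mean_vector(vectors)
    lhs = sum(sq_norm([a - b for a, b in zip(u, u_bar)]) for u in vectors)
    rhs = sum(sq_norm(u) for u in vectors) - n * sq_norm(u_bar)
    return lhs, rhs

rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(7)]
lhs, rhs = deviation_identity_sides(vectors)
print(lhs, rhs)
```

The identity follows from expanding the square and using Σᵢ uᵢ = n ū, so the cross term contributes −2n‖ū‖² and the last term contributes n‖ū‖².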

Lemma (2).

Let Assumptions 1 and 2 hold. Then,

Proof of Lemma 2.

We have,

(12)

Let us consider the first term on the right hand side of (12). Taking conditional expectation of both sides of (6) implies,

(13)

By -smoothness of ,

(14)

Moreover, by -strong convexity of ,

(15)

where we used in the inequality. Combining (13)-(15) we obtain,

Now, consider the second term on the right hand side of (12). We have,

where are defined at the beginning of this section and and we used Lemma 5 in the third equation and the conditional independence of to use in the last equality. Taking full expectation of the two relations above with respect to and combining them with (12) completes the proof. ∎

Lemma (3).

Let assumptions of Theorem 1 hold. Then,

Before proving this lemma, let us state and prove the following lemma.

Lemma 6.

Let be integers. Define . We then have

Proof.

Indeed,

where we used the inequality as well as the standard technique of viewing as a Riemann sum for and observing that the Riemann sum overstates the integral. Exponentiating both sides now implies the lemma. ∎

Proof of Lemma 3.

Define and for . By Lemma 2,