# Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models


## 1 Introduction

As deep learning models and data sets grow in size, it becomes increasingly important to parallelize their training over a distributed computational environment. These models lie at the core of many modern machine-learning-based systems for image recognition [2], speech recognition [3], natural language processing [4], and more. This paper focuses on the parallelization of the data, not the model, and considers the collective communication scheme [5] that is most commonly used nowadays. A typical approach to data parallelization in deep learning uses multiple workers that run variants of SGD [6, 7, 8] on different data batches; the effective batch size is therefore increased by the number of workers. Communication ensures that all models are synchronized and critically relies on a scheme where each worker broadcasts its parameter gradients to all the remaining workers. This is the case for DOWNPOUR [9] (a decentralized extension with no central parameter server, based on the ring topology, can be found in [10]) and Horovod [11]. These techniques require frequent communication (after processing each batch) to avoid instability or divergence, and hence are communication-expensive. Moreover, training with a large batch size usually hurts generalization [12, 13, 14] and convergence speed [15, 16].

Another approach, called Elastic Averaging (Stochastic) Gradient Descent, EA(S)GD [1], introduces elastic forces linking the parameters of the local workers with central parameters computed as a moving average over time and space (i.e. over the parameters computed by the local workers). This method allows less frequent communication, as by design the workers need not have the same parameters but are instead periodically pulled towards each other. The objective function of EASGD, however, has stationary points which are not stationary points of the underlying objective function (see Proposition 8 in the Supplement), thus optimizing it may lead to sub-optimal solutions for the original problem. Further, EASGD can be viewed as a parallel extension of the averaging SGD scheme [17] and as such it inherits the downsides of the averaging policy. On non-convex problems, when the iterates are converging to different local minima (that may potentially be globally optimal), the averaging term can drag the iterates in the wrong directions and significantly hurt the convergence speed of both the local workers and the master. In symmetric regions of the optimization landscape, the elastic forces associated with different workers may cancel each other out, causing the master to be permanently stuck in between or at the maximum between different minima, and the local workers to be stuck at the local minima or on the slopes above them. This can result in arbitrarily bad generalization error. We refer to this phenomenon as the “curse of symmetry”. Landscape symmetries are common in a plethora of non-convex problems [18, 19, 20, 21, 22], including deep learning [23, 24, 25, 26].

Finally, our L(S)GD approach, similarly to EA(S)GD, tends to explore wide valleys in the optimization landscape when the pulling force between workers and leaders is set to be small. This property often leads to improved generalization performance of the optimizer [27, 28].

The paper is organized as follows: Section 2 introduces the L(S)GD approach, Section 3 provides theoretical analysis, Section 4 contains empirical evaluation, and finally Section 5 concludes the paper. Theoretical proofs and additional theoretical and empirical results are contained in the Supplement.

## 2 Leader (Stochastic) Gradient Descent

### 2.1 Motivating example

Figure 1 illustrates how elastic averaging can impair convergence. To obtain the figure, we applied EAGD (Elastic Averaging Gradient Descent) and LGD to a low-rank matrix completion problem. This problem is non-convex but is known to have the property that all local minimizers are global minimizers [18]. For four choices of the rank, we generated random instances of the matrix completion problem and solved each with EAGD and LGD, initialized from the same starting points and using multiple workers. For each algorithm, we report the progress of the best objective value at each iteration, over all workers. Figure 1 shows the results across random experiments for each rank.

It is clear that EAGD slows down significantly as it approaches a minimizer. Typically, the center of EAGD is close to the average of the workers, which is a poor solution for the matrix completion problem when the workers are approaching different local minimizers, even though all local minimizers are globally optimal. This induces a pull on each node away from the minimizers, which makes it extremely difficult for EAGD to attain a solution of high accuracy. In comparison, LGD does not have this issue. Further details of this experiment, and other illustrative examples of the difference between EAGD and LGD, can be found in the Supplement.
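The “curse of symmetry” can be reproduced in one dimension. The sketch below is an illustrative toy (not the matrix completion experiment of Figure 1, and it uses a simplified synchronous elastic-averaging update as an assumption): two workers start symmetrically about the local maximum of the double well f(x) = (x² − 1)², and the averaging pull is compared against the leader pull.

```python
import numpy as np

def f(x):  # double well: global minima at x = -1 and x = +1, local maximum at 0
    return (x**2 - 1)**2

def grad(x):
    return 4 * x * (x**2 - 1)

eta, lam, steps = 0.05, 0.5, 2000
x_ea = np.array([-0.9, 0.9])   # elastic-averaging workers, symmetric start
center = x_ea.mean()           # center variable (moving average of the workers)
x_lgd = np.array([-0.9, 0.9])  # leader-pull workers, same start

for _ in range(steps):
    # Simplified synchronous elastic averaging: workers pulled toward the
    # center, center pulled toward the workers.
    pull = x_ea - center
    center = center + eta * lam * pull.sum()
    x_ea = x_ea - eta * (grad(x_ea) + lam * pull)
    # Leader pull: workers pulled toward the current best worker instead.
    leader = x_lgd[np.argmin(f(x_lgd))]
    x_lgd = x_lgd - eta * (grad(x_lgd) + lam * (x_lgd - leader))

# Symmetric elastic forces cancel: the center stays at the maximum x = 0, and
# the workers hover on the slopes above the minima (|x| < 1).
print(center, f(center), np.abs(x_ea))
# The leader pull breaks the symmetry: the best worker reaches a global minimum.
print(f(x_lgd).min())
```

The leader itself experiences no pull (its elastic term vanishes), so it runs plain gradient descent into a global minimizer, while the averaged center remains pinned at the symmetric maximum.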

Next we explain the basic update of the L(S)GD algorithm. Consider first the single-leader setting and the problem of minimizing a loss function $f$ in a parallel computing environment with $l$ workers. The optimization problem is given as

$$\min_{x_1,x_2,\dots,x_l} L(x_1,x_2,\dots,x_l) \coloneqq \sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x_i;\xi_i)\big]+\frac{\lambda}{2}\lVert x_i-\tilde{x}\rVert^2\Big), \tag{1}$$

where $l$ is the number of workers, $x_1,\dots,x_l$ are the parameters of the workers, and $\tilde{x}$ are the parameters of the leader, i.e. the best-performing worker. The $\xi_i$s are data samples drawn from some probability distribution $P$, and $\lambda$ is the hyperparameter that denotes the strength of the force pulling the workers to the leader. In the theoretical section we will refer to $\mathbb{E}[f(x_i;\xi_i)]$ simply as $f(x_i)$. This formulation can be further extended to the multi-leader setting. The optimization problem is modified to the following form

$$\min_{x_{1,1},\dots,x_{n,l}} L(x_{1,1},\dots,x_{n,l}) \coloneqq \sum_{j=1}^{n}\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x_{j,i};\xi_{j,i})\big]+\frac{\lambda}{2}\lVert x_{j,i}-\tilde{x}_j\rVert^2+\frac{\lambda_G}{2}\lVert x_{j,i}-\tilde{x}\rVert^2\Big), \tag{2}$$

where $n$ is the number of groups, $l$ is the number of workers in each group, $\tilde{x}_j$ is the local leader of the $j$-th group (i.e. the best worker in that group), $\tilde{x}$ is the global leader (the best worker among the local leaders), $x_{j,1},\dots,x_{j,l}$ are the parameters of the workers in the $j$-th group, and the $\xi_{j,i}$s are data samples drawn from $P$. $\lambda$ and $\lambda_G$ are the hyperparameters that denote the strength of the forces pulling the workers to their local and global leaders, respectively.

The updates of the LSGD algorithm are captured below, where $t$ denotes the iteration. The update shown in Equation 3 is obtained by taking a gradient descent step on the objective in Equation 2 with respect to the variables $x_{j,i}$. The stochastic gradient of $f$ with respect to $x_{j,i}$ is denoted as $g^t_{j,i}$ (in the case of LGD the gradient is computed over all training examples) and $\eta$ is the learning rate:

$$x^{t+1}_{j,i}=x^t_{j,i}-\eta\, g^t_{j,i}(x^t_{j,i})-\lambda\big(x^t_{j,i}-\tilde{x}^t_j\big)-\lambda_G\big(x^t_{j,i}-\tilde{x}^t\big), \tag{3}$$

where $\tilde{x}^t_j$ and $\tilde{x}^t$ are the local and global leaders defined above.

Equation 3 describes the update of any given worker and comprises the regular gradient step and two corrective forces (in the single-leader setting the third term disappears, since there is then no separate global leader). These forces constitute the communication mechanism among the workers and pull all the workers towards the currently best local and global solutions to ensure fast convergence. As opposed to EASGD, the updates performed by the workers in LSGD break the curse of symmetry and avoid the convergence decelerations that result from workers being pulled towards an average that is inherently influenced by poorly performing workers. In this paper, instead of pulling workers to their averaged parameters, we propose the mechanism of pulling the workers towards the leaders. The flavor of the update resembles a particle swarm optimization approach [29], which is not typically used in the context of stochastic gradient optimization for deep learning. Our method may therefore be viewed as a dedicated particle swarm optimization approach for training deep learning models in the stochastic setting and parallel computing environment.

Next we describe the LSGD algorithm in more detail. We rely on the collective communication scheme. In order to reduce the amount of communication between the workers, it is desirable to pull them towards the leaders less often than every iteration. Also, in practice each worker can run at a different speed. To prevent waiting for the slower workers and to achieve communication efficiency, we implement the algorithm in the asynchronous operation mode. In this case, the communication period is determined based on the total number of iterations computed across all workers, and communication is performed every $\tau$ or $\tau_G$ iterations, where $\tau$ and $\tau_G$ denote the local and global communication periods, respectively. In practice we use $\tau < \tau_G$, since communication between workers lying in different groups is more expensive than between workers within one group, as explained above. When communication occurs, all workers are updated at the same time (i.e. pulled towards the leaders) in order to take advantage of the collective communication scheme. Between communications, workers run their own local SGD optimizers. The resulting LSGD method is very simple and is depicted in Algorithm 1.
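The loop structure can be sketched as follows. This is a minimal synchronous toy (a hypothetical quadratic objective and hyperparameter values chosen for illustration); the actual Algorithm 1 runs asynchronously with stochastic gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective shared by all workers: f(x) = 0.5 * ||x - a||^2.
a = np.array([1.0, -2.0, 3.0])
f = lambda x: 0.5 * np.sum((x - a) ** 2)
grad = lambda x: x - a

l, eta, lam, tau = 4, 0.1, 0.3, 10   # workers, step size, pull strength, comm. period
workers = [rng.normal(size=3) * 5 for _ in range(l)]

for t in range(1, 301):
    # Between communications each worker runs its own (S)GD optimizer.
    workers = [x - eta * grad(x) for x in workers]
    if t % tau == 0:
        # Communication: every worker is pulled toward the current leader,
        # i.e. the worker with the lowest objective value.
        leader = min(workers, key=f).copy()
        workers = [x - lam * (x - leader) for x in workers]

print([f(x) for x in workers])  # all workers close to the minimum
```

The pull step is a convex combination of each worker and the leader, so on this convex toy it can only help; between pulls the workers are fully independent, which is what makes the infrequent-communication regime cheap.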

The next section provides a theoretical description of the single-leader batch (LGD) and stochastic (LSGD) variants of our approach.

## 3 Theoretical Analysis

We assume without loss of generality that there is a single leader. The objective function with multiple leaders, whose pulling term is $\lambda(x-\tilde{x}_j)+\lambda_G(x-\tilde{x})$, is equivalent to the single-leader objective with pulling coefficient $\lambda+\lambda_G$ and guiding point $z=(\lambda\tilde{x}_j+\lambda_G\tilde{x})/(\lambda+\lambda_G)$. Proofs for this section are deferred to the Supplement.

### 3.1 Convergence Rates for Stochastic Strongly Convex Optimization

We first show that LSGD obtains the same convergence rate as SGD for stochastic strongly convex problems [30]. In Section 3.3 we discuss how and when LGD can obtain better search directions than gradient descent. We discuss non-convex optimization in Section 3.2. Throughout Section 3.1, $f$ will typically satisfy:

Assumption 1. $f$ is $M$-Lipschitz-differentiable and $m$-strongly convex, which is to say, the gradient satisfies $\lVert\nabla f(x)-\nabla f(y)\rVert\le M\lVert x-y\rVert$, and $f$ satisfies $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{m}{2}\lVert y-x\rVert^2$. We write $x^*$ for the unique minimizer of $f$, and $\kappa=M/m$ for the condition number of $f$.

#### 3.1.1 Convergence Rates

The key technical result is that LSGD satisfies a similar one-step descent in expectation as SGD, with an additional term corresponding to the pull of the leader. To provide a unified analysis of ‘pure’ LSGD as well as more practical variants where the leader is updated infrequently or with errors, we consider a general iteration $x^+=x-\eta(\tilde{g}(x)+\lambda(x-z))$, where $z$ is an arbitrary guiding point; that is, $z$ may not be the minimizer of $f$ among the workers, nor even satisfy $f(z)\le f(x)$. Since the nodes operate independently except when updating $z$, we may analyze LSGD steps for each node individually, and we write $x$ for the current node's iterate for brevity.

###### Theorem 1.

Let $f$ satisfy Assumption 1. Let $\tilde{g}$ be an unbiased estimator for $\nabla f$ with $\mathbb{E}\lVert\tilde{g}(x)-\nabla f(x)\rVert^2\le\nu\lVert\nabla f(x)\rVert^2+\sigma^2$, and let $z$ be any point. Suppose that $\eta$ and $\lambda$ are sufficiently small. Then the LSGD step satisfies

$$\mathbb{E}f(x^+)-f(x^*)\le(1-m\eta)\big(f(x)-f(x^*)\big)-\eta\lambda\big(f(x)-f(z)\big)+\frac{\eta^2 M}{2}\sigma^2.$$

Note the presence of the new term $-\eta\lambda(f(x)-f(z))$, which speeds up convergence when $f(z)<f(x)$, i.e. the leader is better than $x$. If the leader is always chosen so that $f(z)\le f(x)$ at every step, then with fixed $\eta$ and $\lambda$ the expected optimality gap converges linearly to a noise-dominated level of order $\eta\sigma^2$. If $\eta$ decreases at the rate $O(1/t)$, then $\mathbb{E}f(x_t)-f(x^*)=O(1/t)$.
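The one-step bound of Theorem 1 can be checked numerically in the noiseless case ($\sigma = 0$, $\nu = 0$). The sketch below uses a hypothetical quadratic and a leader that is strictly better than the current iterate:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A = diag(1, 4): m = 1 (strong convexity), M = 4 (smoothness).
A = np.diag([1.0, 4.0])
m = 1.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
f_star = 0.0                  # minimizer x* = 0

eta, lam = 0.05, 0.1
x = np.array([1.0, 1.0])      # current iterate
z = np.array([0.1, 0.1])      # leader: a better point, f(z) < f(x)

x_plus = x - eta * (grad(x) + lam * (x - z))   # deterministic LSGD step

lhs = f(x_plus) - f_star
# One-step bound with sigma = 0; the -eta*lam*(f(x) - f(z)) term tightens it.
rhs = (1 - m * eta) * (f(x) - f_star) - eta * lam * (f(x) - f(z))
print(lhs, rhs)
```

Setting `lam = 0` recovers the plain gradient descent bound; the gap between `lhs` and `rhs` shrinks as the leader advantage $f(x) - f(z)$ grows.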

#### 3.1.2 Communication Periods

In practice, communication between distributed machines is costly. The LSGD algorithm has a communication period $\tau$ for which the leader is only updated every $\tau$ iterations, so each node can run independently during that period. This period is allowed to differ between nodes, and over time, which captures the asynchronous and multi-leader variants of LSGD. We write $x_{(k,j)}$ for the $j$-th step during the $k$-th period. It may occur that $f(x_{(k,j)})<f(z_k)$ for some $j$, that is, the current solution is now better than the last selected leader $z_k$. In this case, the leader term may no longer be beneficial, and instead simply pulls $x$ toward $z_k$. There is no general way to determine how many steps are taken before this event. However, we can show that if $f(x)\le f(z)$, then $\mathbb{E}f(x^+)\le f(z)+\frac{\eta^2 M}{2}\sigma^2$, so the solution will not become worse than a stale leader (up to gradient noise). As $t$ goes to infinity, LSGD converges to the minimizer of $F(x)=f(x)+\frac{\lambda}{2}\lVert x-z\rVert^2$, which is quantifiably better than $z$ as captured in Theorem 2. Together, these facts show that LSGD is safe to use with long communication periods as long as the original leader is good.

###### Theorem 2.

Let $f$ be $m$-strongly convex, and let $x^*$ be the minimizer of $f$. For fixed $z$ and $\lambda$, define $F(x)=f(x)+\frac{\lambda}{2}\lVert x-z\rVert^2$. The minimizer $x_F$ of $F$ satisfies $\lVert x_F-x^*\rVert\le\frac{\lambda}{m+\lambda}\lVert z-x^*\rVert$.

In our experiments, we employ another method to avoid this issue. To ensure that the leader is good, we perform an LSGD step only on the first step after a leader update, and then take standard SGD steps for the remainder of the communication period.
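The effect of pulling toward a stale but good leader can be illustrated on a quadratic, for which the minimizer of $F(x)=f(x)+\frac{\lambda}{2}\lVert x-z\rVert^2$ has a closed form. The contraction factor $\lambda/(m+\lambda)$ asserted below follows from strong convexity and is stated as an illustration, not as the paper's exact constant:

```python
import numpy as np

A = np.diag([1.0, 10.0])       # f(x) = 0.5 x^T A x, m = 1, minimizer x* = 0
m, lam = 1.0, 1.0
f = lambda x: 0.5 * x @ A @ x
z = np.array([1.0, 1.0])       # stale leader

# Closed-form minimizer of F(x) = f(x) + (lam/2)||x - z||^2 for this quadratic:
# grad F = A x + lam (x - z) = 0  =>  x_F = (A + lam I)^{-1} lam z
x_F = np.linalg.solve(A + lam * np.eye(2), lam * z)

print(f(x_F), f(z))                                # x_F strictly improves on z
print(np.linalg.norm(x_F), lam / (m + lam) * np.linalg.norm(z))
```

The minimizer `x_F` sits between the stale leader and the true optimum, which is why pulling toward a fixed good leader for a long communication period cannot drag the iterates past it.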

Next, we consider the impact of selecting the leader with errors. In practice, it is often costly to evaluate $f$ exactly, as in deep learning. Instead, we estimate the values $f(x_1),\dots,f(x_l)$, and then select as leader the variable having the smallest estimate. Formally, suppose that we have an unbiased estimator $\tilde{f}(x_i)$ of each $f(x_i)$, with uniformly bounded variance $\sigma_f^2$. At each step, a single sample is drawn from each estimator, and then $z=\arg\min_i\tilde{f}(x_i)$. We refer to this as stochastic leader selection. The stochastic leader satisfies $\mathbb{E}f(z)\le f(z^*)+O(\sigma_f)$, where $z^*$ is the true leader (see supplementary materials). Thus, the error introduced by the stochastic leader contributes an additive error of at most $O(\eta\lambda\sigma_f)$. Since this is of order $\eta$ rather than $\eta^2$, we cannot guarantee convergence with decreasing step sizes alone (for intuition, note that the harmonic series $\sum_t 1/t$ is divergent) unless $\sigma_f$ is also decreasing. We have the following result:

###### Theorem 3.

Let $f$ satisfy Assumption 1, and let $\tilde{g}$ be as in Theorem 1. Suppose we use stochastic leader selection with $\tilde{f}$ having variance bounded by $\sigma_f^2$. If $\eta$ and $\lambda$ are fixed and sufficiently small, then the expected optimality gap converges linearly to a level of order $\eta\sigma^2+\lambda\sigma_f$. If $\eta$ and $\sigma_f$ decrease at the rate $O(1/t)$, then $\mathbb{E}f(x_t)-f(x^*)=O(1/t)$.

The communication period and the accuracy of the stochastic leader selection are both methods of reducing the cost of updating the leader, and they can act as substitutes. When the communication period is long, it may be effective to estimate $f(x_1),\dots,f(x_l)$ to higher accuracy, since this can be done independently.
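Stochastic leader selection can be sketched as follows. The setup is hypothetical (synthetic worker losses and Gaussian estimation noise); it illustrates that averaging more minibatch estimates drives the estimator's standard deviation down, so the argmin reliably recovers the true leader:

```python
import numpy as np

rng = np.random.default_rng(42)

# f(x_1), ..., f(x_l): hypothetical true losses; worker 1 is the true leader.
true_losses = np.array([0.90, 0.75, 0.80, 1.10])
noise_std = 0.2                      # per-minibatch estimation noise

def stochastic_leader(num_samples):
    # Unbiased loss estimate per worker, averaged over `num_samples` minibatches;
    # the std of the average scales as noise_std / sqrt(num_samples).
    estimates = true_losses + rng.normal(
        0, noise_std / np.sqrt(num_samples), size=true_losses.size)
    return int(np.argmin(estimates))

picks = [stochastic_leader(num_samples=400) for _ in range(50)]
print(picks.count(1) / len(picks))   # fraction of trials picking the true leader
```

With a single noisy sample the 0.05 gap between the two best workers is swamped by the noise; averaging over 400 minibatches shrinks the noise well below the gap.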

### 3.2 Non-convex Optimization: Stationary Points

As mentioned above, EASGD has the flaw that the EASGD objective function can have stationary points at which none of $x_1,\dots,x_p,\tilde{x}$ is a stationary point of the underlying function $f$. LSGD does not have this issue.

###### Theorem 4.

Let $x_1,\dots,x_l$ be points such that the leader $\tilde{x}$ is the unique minimizer of $f$ among them. If $(x_1,\dots,x_l)$ is a stationary point of the LSGD objective function, then $\nabla f(\tilde{x})=0$.

Moreover, it can be shown that for the deterministic algorithm LGD, with any choice of communication periods, there will always be some variable $x_i$ whose gradients vanish along a subsequence.

###### Theorem 5.

Assume that $f$ is bounded below and $M$-Lipschitz-differentiable, and that the LGD step sizes are selected so that $\eta\le\frac{1}{M}$. Then for any choice of communication periods, it holds that for every $i$ such that $x_i$ is the leader infinitely often, $\liminf_{t\to\infty}\lVert\nabla f(x_i^t)\rVert=0$.

### 3.3 Search Direction Improvement from Leader Selection

In this section, we discuss how LGD can obtain better search directions than gradient descent. In general, it is difficult to determine when the LGD step will produce a lower objective value than the gradient descent step, since this depends on the precise combination of $\eta$, $\lambda$, and $z$; moreover, the maximum allowable value of $\eta$ differs between LGD and gradient descent. Instead, we measure the goodness of a search direction by the angle it forms with the Newton direction $d_N=-\nabla^2 f(x)^{-1}\nabla f(x)$. The Newton method is locally quadratically convergent around local minimizers with non-singular Hessian, and converges in a single step for quadratic functions if $\eta=1$. Hence, we consider it desirable to have search directions that are close to $d_N$. Let $\theta(u,v)$ denote the angle between $u$ and $v$. Let $d_L(z)=-(\nabla f(x)+\lambda(x-z))$ be the LGD direction with leader $z$, and $d_G=-\nabla f(x)$ the gradient descent direction. The angle improvement set is the set of leaders $I(x)=\{z:\theta(d_L(z),d_N)<\theta(d_G,d_N)\}$. The set of candidate leaders is $C(x)=\{z:f(z)<f(x)\}$. We aim to show that a large subset of leaders in $C(x)$ belong to $I(x)$.

In this section, we consider the positive definite quadratic $f(x)=\frac{1}{2}x^TAx$ with condition number $\kappa$ and minimizer $x^*=0$. The first result shows that as $\lambda$ becomes sufficiently small, at least half of $C(x)$ improves the angle.

###### Theorem 6.

Let $x$ be any point such that $d_G$ is not parallel to $d_N$, and let $\mu$ denote the uniform measure on $C(x)$. Then $\lim_{\lambda\to 0^+}\mu(I(x)\cap C(x))\ge\frac{1}{2}\mu(C(x))$. (Note that $d_L(z)\to d_G$ as $\lambda\to 0^+$, so the limit is well-defined.)

Next, we consider the case when $\kappa$ is large. We show that points with a large angle between $d_G$ and $d_N$ exist, and these are the points most suitable for improvement by LGD. For $\psi\in(0,\frac{\pi}{2})$, define $S_\psi=\{x:\theta(d_G,d_N)\ge\psi\}$. It can be shown that $S_\psi$ is nonempty for all sufficiently large $\kappa$. We show that for $x\in S_\psi$ with $\psi$ in a certain range, $I(x)$ contains at least half of $C(x)$ for any choice of $\lambda$.

###### Theorem 7.

Let $x\in S_\psi$. If $\psi$ lies in a suitable range determined by $\kappa$, then for any $\lambda>0$, $\mu(I(x)\cap C(x))\ge\frac{1}{2}\mu(C(x))$.
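The angle criterion can be checked directly on an ill-conditioned quadratic. In the sketch below the leader is idealized as the minimizer itself (an assumption for illustration); the LGD direction then forms a smaller angle with the Newton direction than plain gradient descent does:

```python
import numpy as np

A = np.diag([1.0, 10.0])          # f(x) = 0.5 x^T A x, condition number kappa = 10
x = np.array([1.0, 1.0])
z = np.zeros(2)                    # leader taken at the minimizer x* = 0 (idealized)
lam = 1.0

d_newton = -np.linalg.solve(A, A @ x)      # Newton direction: -H^{-1} grad = -x
d_gd = -(A @ x)                            # gradient descent direction
d_lgd = -(A @ x + lam * (x - z))           # LGD direction with leader z

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(d_gd, d_newton), cos(d_lgd, d_newton))
# The leader pull adds a multiple of (z - x) = -x, tilting the step toward
# the Newton direction; the tilt grows with lambda.
```

Increasing `lam` moves `cos(d_lgd, d_newton)` monotonically toward 1 in this idealized setting, which is the intuition behind the angle improvement set $I(x)$.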

## 4 Experimental Results

### 4.1 Experimental setup

In this section we compare the performance of LSGD with state-of-the-art methods for parallel training of deep networks, such as EASGD and DOWNPOUR (their pseudo-codes can be found in [1]), as well as the sequential technique SGD. We use a communication period equal to 1 for DOWNPOUR in all our experiments, as this is the typical setting used for this method that ensures stable convergence. The experiments were performed using the CIFAR-10 data set [31] on three benchmark architectures: the CNN used in the original EASGD paper (see Section 5.1 in [1]), which we refer to as CNN, VGG [32], and ResNet [33]; and the ImageNet (ILSVRC) data set [34] on ResNet.

During training, we select the leader for the LSGD method based on the average training loss computed over the most recent data batches (using a fixed window for CIFAR-10 and for ImageNet). At testing, we report the performance of the center variable for EASGD and LSGD, where for LSGD the center variable is computed as the average of the parameters of all workers. We use weight decay for all methods. In our experiments we use either a single group of workers (single-leader LSGD setting) or several groups of workers (multi-leader LSGD setting).

We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors, where each local worker corresponds to one GPU processor. We use CUDA Toolkit 10.0 and NCCL 2. We have developed a software package based on PyTorch for distributed training, which will be released (details are elaborated in Section 9.4).

Data processing and prefetching are discussed in the Supplement. A summary of the hyperparameters explored for each method is also provided in the Supplement. We use a constant learning rate for CNN, and a learning-rate drop (we divide the learning rate by a fixed factor when we observe saturation of the optimizer) for VGG and the ResNet models.

### 4.2 Experimental Results

In Figure 2 we report the results obtained with CNN on CIFAR-10. We run EASGD and LSGD with the same communication period, in both the single-leader and multi-leader LSGD settings. Our method consistently outperforms the competitors in terms of convergence speed and, with the larger number of workers, it also obtains a smaller error.

In Figure 3 we demonstrate the results for VGG on CIFAR-10. LSGD converges marginally faster than EASGD and recovers the same error. At the same time, it significantly outperforms DOWNPOUR in terms of convergence speed and obtains a slightly better solution.

The experimental results obtained using ResNet on CIFAR-10, for the same settings of the communication period and the number of workers as in the CNN case, are shown in Figure 6. With the smaller number of workers we converge comparably fast to EASGD but recover a better test error. For this experiment, Figure 6 also shows the switching pattern between the leaders, indicating that LSGD indeed takes advantage of all workers when exploring the landscape. With the larger number of workers we converge several times faster than EASGD and obtain a significantly smaller error. In this experiment and the CNN experiment, LSGD (as well as EASGD) is consistently better than DOWNPOUR and SGD, as expected.

Finally, in Figure 6 we report the empirical results for ResNet run on ImageNet. In this experiment our algorithm behaves comparably to EASGD but converges much faster than DOWNPOUR.

## 5 Conclusion

In this paper we propose a new algorithm called LSGD for distributed optimization in non-convex settings. Our approach relies on pulling workers to the current best performer among them, rather than their average, at each iteration. We justify replacing the average by the leader both theoretically and through empirical demonstrations. We provide a thorough theoretical analysis, including proof of convergence, of our algorithm. Finally, we apply our approach to the matrix completion problem and training deep learning models and demonstrate that it is well-suited to these learning settings.

## References

• [1] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.
• [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
• [3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.
• [4] J. Weston, S. Chopra, and K. Adams. #tagspace: Semantic embeddings from hashtags. In EMNLP, 2014.
• [5] U. Wickramasinghe and A. Lumsdaine. A survey of methods for collective communication optimization and tuning. CoRR, abs/1611.06334, 2016.
• [6] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
• [7] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. CoRR, abs/1802.09941, 2018.
• [8] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch, and domain parallelism in training neural networks. Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 77–86, 2018.
• [9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
• [10] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 2018.
• [11] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.
• [12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
• [13] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Finding flatter minima with SGD. In ICLR Workshop Track, 2018.
• [14] S. L. Smith and Q. V. Le. A bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.
• [15] S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, 2018.
• [16] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32k for imagenet training. In ICLR, 2018.
• [17] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
• [18] X. Li, J. Lu, R. Arora, J. Haupt, H. Liu, Z. Wang, and T. Zhao. Symmetry, saddle points, and global optimization landscape of nonconvex matrix factorization. IEEE Transactions on Information Theory, PP:1–1, 03 2019.
• [19] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML, 2017.
• [20] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
• [21] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853–884, 2017.
• [22] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, 2016.
• [23] V. Badrinarayanan, B. Mishra, and R. Cipolla. Understanding symmetries in deep networks. CoRR, abs/1511.01029, 2015.
• [24] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
• [25] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In ICML, 2018.
• [26] K. Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.
• [27] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
• [28] P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, and A. Talwalkar. Parle: parallelizing stochastic gradient descent. In SysML, 2018.
• [29] J. Kennedy and R. Eberhart. Particle swarm optimization. In ICNN, 1995.
• [30] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
• [31] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009.
• [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
• [33] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
• [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

## 6 LGD versus EAGD: Illustrative Example

We consider the following non-convex optimization problem:

$$\min_{x,y} L(x,y),\quad\text{where }L(x,y)=\frac{\sin\big(\pi\sqrt{x^2+y^2}\big)}{\pi\sqrt{x^2+y^2}}.$$

Both methods use the same workers, initialized from identical starting points. The communication period and the learning rate are the same for EAGD and LGD; each method uses its own value of the pulling strength $\lambda$.

Table 1 captures optima obtained by different methods.

Figure 7 captures the optimization trajectories of EAGD and LGD algorithms. Clearly, EAGD suffers from the averaging policy, whereas LGD is able to recover a solution close to the global optimum.
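The example above can be reproduced with a short simulation (a sketch with hypothetical initial points and hyperparameters, using numerical gradients and a leader pull applied at every iteration; it is not the exact configuration behind Table 1 and Figure 7):

```python
import numpy as np

# The landscape of this section: L(x, y) = sin(pi*r)/(pi*r), r = sqrt(x^2 + y^2).
def L(v):
    return np.sinc(np.sqrt(v @ v))   # np.sinc(r) = sin(pi*r)/(pi*r)

def grad(v, h=1e-6):
    # Central finite differences (sufficient for this 2-D illustration).
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (L(v + e) - L(v - e)) / (2 * h)
    return g

eta, lam, steps = 0.1, 0.05, 1000
# Hypothetical initial points: one worker near the global ring of minima
# (r around 1.43), the others further out on worse rings.
workers = [np.array([1.0, 1.0]), np.array([2.5, 2.5]), np.array([-3.0, 1.0])]

for _ in range(steps):
    leader = min(workers, key=L).copy()   # best worker; its own pull is zero
    workers = [v - eta * (grad(v) + lam * (v - leader)) for v in workers]

print(min(L(v) for v in workers))  # close to the global minimum, about -0.217
```

The leader descends unimpeded into the global ring of minima, while an averaging center for symmetric starting points would sit near the central maximum at the origin.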

## 7 Proofs of Theoretical Results

We provide omitted proofs from the main text.

### 7.1 Definitions and Notation

Recall that the objective function of Leader (Stochastic) Gradient Descent (L(S)GD) is defined as

$$\min_{x_1,\dots,x_p}L(x_1,\dots,x_p):=\sum_{i=1}^{p}\Big(f(x_i)+\frac{\lambda}{2}\lVert x_i-\tilde{x}\rVert^2\Big) \tag{4}$$

where $\tilde{x}=\arg\min_{x_i}f(x_i)$. An L(S)GD step is a (stochastic) gradient step applied to $L$. Writing $z=\tilde{x}$ at a particular iterate, the update in the variable $x_i$ is

$$x_i^{+}=x_i-\eta\big(\nabla f(x_i)+\lambda(x_i-z)\big).$$

Observe that this reduces to a (S)GD step for the variable $x_i$ that is the leader.

Practical variants of the algorithm do not communicate the updated leader at every iteration. Thus, in our analysis, we will generally take $z$ to be an arbitrary guiding point, which is not necessarily the minimizer of $f$ among $x_1,\dots,x_p$, nor need it even satisfy $f(z)\le f(x_i)$ for all $i$. The required properties of $z$ will be specified on a result-by-result basis.

When discussing the optimization landscape of LSGD, the term ‘LSGD objective function’ will refer to (4) with $\tilde{x}$ defined as the argmin.

Communication periods are sequences of steps during which the leader is not updated. We introduce the notation $x_{(k,j)}$ for the $j$-th step in the $k$-th period, where the leader is updated only at the beginning of each period. We write $T_k$ for the number of steps that a node takes during the $k$-th period. The standard LSGD defined above has $T_k=1$ for all $k$, in which case $x_{(k,1)}=x_k$. In addition, let $z_k$ denote the leader for the $k$-th period.

### 7.2 Stationary Points of EASGD

The EASGD EASGD objective function is defined as

$$\min_{x_1,\dots,x_p,\tilde{x}}L(x_1,\dots,x_p,\tilde{x}):=\sum_{i=1}^{p}\Big(f(x_i)+\frac{\lambda}{2}\lVert x_i-\tilde{x}\rVert^2\Big). \tag{5}$$

Observe that, unlike in LSGD, $\tilde{x}$ is a decision variable of EASGD. A stationary point of EASGD is a point $(x_1,\dots,x_p,\tilde{x})$ such that $\nabla L(x_1,\dots,x_p,\tilde{x})=0$.

###### Proposition 8.

There exists a Lipschitz-differentiable function $f$ such that for every $\lambda\in(0,1)$, there exists a point which is a stationary point of EASGD with parameter $\lambda$, but none of $x_1,\dots,x_p,\tilde{x}$ is a stationary point of $f$.

###### Proof.

Define by

$$f(x)=\begin{cases}e^{x+1}&\text{if }x<-1\\p(x)&\text{if }-1\le x\le 1\\e^{-x+1}&\text{if }x>1\end{cases}$$

where $p$ is a sixth-degree polynomial. For $f$ to be Lipschitz differentiable, we will select $p$ to make $f$ twice continuously differentiable, with bounded second derivative. To make $f$ twice continuously differentiable, we must have $p(\pm 1)=1$, $p'(\pm 1)=\mp 1$, and $p''(\pm 1)=1$. Since we aim to have $f'(0)\neq 0$, we also will require $p'(0)=1$. The existence of $p$ is equivalent to the solvability of a linear system in the seven coefficients of $p$, which is easily checked to be invertible. Thus, we deduce that such a function exists.

It remains to show that for any $\lambda\in(0,1)$, there exists a stationary point of EASGD. Take two workers, and set $x_2=-x_1$ and $\tilde{x}=0$. The first-order condition for $x_1$ yields $f'(x_1)+\lambda x_1=0$. Since we seek $x_1<-1$, we have $f'(x_1)=e^{x_1+1}$. For $x<-1$, $e^{x+1}$ is an increasing function, so $g(x)=e^{x+1}+\lambda x$ is increasing, and since $g(x)\to-\infty$ as $x\to-\infty$ while $g(-1)=1-\lambda>0$, we deduce that there exists a solution with $x_1<-1$. By symmetry, $x_2=-x_1$ satisfies $f'(x_2)+\lambda x_2=0$, since $f'(-x)=-f'(x)$ for $|x|>1$. The condition $\nabla_{\tilde{x}}L=\lambda\sum_i(\tilde{x}-x_i)=0$ holds as well, since $x_1+x_2=0$. Hence, $(x_1,x_2,\tilde{x})$ is a stationary point of EASGD, but none of $x_1,x_2,\tilde{x}$ are stationary points of $f$. ∎

### 7.3 Technical Preliminaries

Recall the statement of Assumption 1:

Assumption 1. $f$ is $M$-Lipschitz-differentiable and $m$-strongly convex, which is to say, the gradient satisfies $\lVert\nabla f(x)-\nabla f(y)\rVert\le M\lVert x-y\rVert$, and $f$ satisfies

$$f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{m}{2}\lVert y-x\rVert^2.$$

We write $x^*$ for the unique minimizer of $f$, and $\kappa=M/m$ for the condition number of $f$.

We will frequently use the following standard result.

###### Lemma 9.

If $f$ is $M$-Lipschitz-differentiable, then

$$f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\lVert y-x\rVert^2.$$
###### Proof.

See eq. (4.3) in [30]. ∎

###### Lemma 10.

Let $f$ be $m$-strongly convex, and let $x^*$ be the minimizer of $f$. Then

$$f(w)-f(x^*)\le\frac{1}{2m}\lVert\nabla f(w)\rVert^2 \tag{6}$$

and

$$f(w)-f(x^*)\ge\frac{m}{2}\lVert w-x^*\rVert^2. \tag{7}$$
###### Proof.

Equation 6 is the well-known Polyak-Łojasiewicz inequality. Equation 7 follows from the definition of strong convexity and $\nabla f(x^*)=0$. ∎

###### Lemma 11.

Let $f$ be $M$-Lipschitz-differentiable. If the gradient descent step size satisfies $\eta\le\frac{1}{M}$, then $f(x^+)\le f(x)-\frac{\eta}{2}\lVert\nabla f(x)\rVert^2$, where $x^+=x-\eta\nabla f(x)$.

###### Proof.

By Lemma 9,

$$f(x^+)\le f(x)-\eta\lVert\nabla f(x)\rVert^2+\frac{\eta^2 M}{2}\lVert\nabla f(x)\rVert^2=f(x)-\frac{\eta}{2}(2-\eta M)\lVert\nabla f(x)\rVert^2.$$

Rearranging yields the desired result. ∎

### 7.4 Proofs from Section 3.1.1

###### Lemma 12 (One-Step Descent).

Let $f$ satisfy Assumption 1. Let $\tilde{g}$ be an unbiased estimator for $\nabla f$ with $\mathbb{E}\lVert\tilde{g}(x)-\nabla f(x)\rVert^2\le\nu\lVert\nabla f(x)\rVert^2+\sigma^2$. Let $x$ be the current iterate, and let $z$ be another point, with $\delta=x-z$. The LSGD step $x^+=x-\eta(\tilde{g}(x)+\lambda\delta)$ satisfies:

$$\begin{aligned}\mathbb{E}f(x^+)\le{}&f(x)-\frac{\eta}{2}\big(1-\eta M(\nu+1)\big)\lVert\nabla f(x)\rVert^2-\frac{\eta}{4}\lambda(m-2\eta M\lambda)\lVert\delta\rVert^2\\&-\frac{\eta\sqrt{\lambda}}{\sqrt{2}}\big(\sqrt{m}-\eta M\sqrt{2\lambda}\big)\lVert\nabla f(x)\rVert\,\lVert\delta\rVert-\eta\lambda\big(f(x)-f(z)\big)+\frac{\eta^2}{2}M\sigma^2\end{aligned} \tag{8}$$

where the expectation is with respect to $\tilde{g}$, conditioned on the current point $x$. Hence, for sufficiently small $\eta$ and $\lambda$ with $2\eta M\lambda\le m$ and $\eta M\sqrt{2\lambda}\le\sqrt{m}$,

$$\mathbb{E}f(x^+)-f(x^*)\le(1-m\eta)\big(f(x)-f(x^*)\big)-\eta\lambda\big(f(x)-f(z)\big)+\frac{\eta^2 M}{2}\sigma^2 \tag{9}$$
###### Proof.

The proof is similar to the convergence analysis of SGD. We apply Lemma 9 to obtain

$$f(x^+)\le f(x)-\eta\nabla f(x)^T\big(\tilde{g}(x)+\lambda\delta\big)+\frac{\eta^2}{2}M\lVert\tilde{g}(x)+\lambda\delta\rVert^2.$$

Taking the expectation and using $\mathbb{E}\tilde{g}(x)=\nabla f(x)$,

$$\mathbb{E}f(x^+)\le f(x)-\eta\lVert\nabla f(x)\rVert^2-\eta\lambda\nabla f(x)^T\delta+\frac{\eta^2\lambda^2}{2}M\lVert\delta\rVert^2+\eta^2\lambda M\nabla f(x)^T\delta+\frac{\eta^2}{2}M\,\mathbb{E}\big[\tilde{g}(x)^T\tilde{g}(x)\big].$$

Using the definition of $m$-strong convexity, we have $f(z)\ge f(x)-\nabla f(x)^T\delta+\frac{m}{2}\lVert\delta\rVert^2$, from which we deduce that $-\eta\lambda\nabla f(x)^T\delta\le-\eta\lambda\big(f(x)-f(z)\big)-\frac{\eta\lambda m}{2}\lVert\delta\rVert^2$. Substituting this above, and splitting both the $-\eta\lVert\nabla f(x)\rVert^2$ term and the $-\frac{\eta\lambda m}{2}\lVert\delta\rVert^2$ term in half, we obtain

$$\begin{aligned}\mathbb{E}f(x^+)\le{}&f(x)-\frac{\eta}{2}\lVert\nabla f(x)\rVert^2+\frac{\eta^2}{2}M\,\mathbb{E}\big[\tilde{g}(x)^T\tilde{g}(x)\big]\\&-\frac{\eta}{4}m\lambda\lVert\delta\rVert^2+\frac{\eta^2}{2}\lambda^2 M\lVert\delta\rVert^2\\&-\frac{\eta}{2}\lVert\nabla f(x)\rVert^2-\frac{\eta}{4}m\lambda\lVert\delta\rVert^2+\eta^2\lambda M\nabla f(x)^T\delta\\&-\eta\lambda\big(f(x)-f(z)\big)\end{aligned}$$

We proceed to bound each line. For the first line, the standard bias-variance decomposition yields

$$\mathbb{E}\big[\tilde{g}(x)^T\tilde{g}(x)\big]\le(\nu+1)\lVert\nabla f(x)\rVert^2+\sigma^2$$

and so we have

$$-\frac{\eta}{2}\lVert\nabla f(x)\rVert^2+\frac{\eta^2}{2}M\,\mathbb{E}\big[\tilde{g}(x)^T\tilde{g}(x)\big]\le-\frac{\eta}{2}\big(1-\eta M(\nu+1)\big)\lVert\nabla f(x)\rVert^2+\frac{\eta^2}{2}M\sigma^2.$$

For the second line, we obtain

$$-\frac{\eta}{4}m\lambda\lVert\delta\rVert^2+\frac{\eta^2}{2}\lambda^2 M\lVert\delta\rVert^2\le-\frac{\eta}{4}\lambda(m-2\eta M\lambda)\lVert\delta\rVert^2.$$

For the third line, we apply the inequality $a+b\ge 2\sqrt{ab}$ to obtain

$$\frac{\eta}{2}\lVert\nabla f(x)\rVert^2+\frac{\eta}{4}m\lambda\lVert\delta\rVert^2\ge\frac{\eta}{\sqrt{2}}\sqrt{m\lambda}\,\lVert\nabla f(x)\rVert\,\lVert\delta\rVert.$$

Using the Cauchy-Schwarz inequality, we then obtain

$$-\frac{\eta}{2}\lVert\nabla f(x)\rVert^2-\frac{\eta}{4}m\lambda\lVert\delta\rVert^2+\eta^2\lambda M\nabla f(x)^T\delta\le-\frac{\eta\sqrt{\lambda}}{\sqrt{2}}\big(\sqrt{m}-\eta M\sqrt{2\lambda}\big)\lVert\nabla f(x)\rVert\,\lVert\delta\rVert.$$

Combining these inequalities yields the desired result. ∎

###### Theorem 13.

Let $f$ satisfy Assumption 1. Suppose that the leader is always chosen so that $f(z)\le\min_i f(x_i)$. If $\eta$ and $\lambda$ are fixed and sufficiently small, then the expected optimality gap converges linearly to a level of order $\eta\sigma^2$. If $\eta$ decreases at the rate $O(1/t)$, then $\mathbb{E}f(x_t)-f(x^*)=O(1/t)$.

###### Proof.

This result follows from (9) and Theorems 4.6 and 4.7 of [30]. ∎

### 7.5 Proofs from Section 3.1.2

###### Theorem 14.

Let satisfy Assumption 1. Suppose that are small enough that and . If , then