1 Introduction
As deep learning models and data sets grow in size, it becomes increasingly helpful to parallelize their training over a distributed computational environment. These models lie at the core of many modern machine-learning-based systems for image recognition [2], speech recognition [3], natural language processing [4], and more. This paper focuses on the parallelization of the data, not the model, and considers the collective communication scheme [5] that is most commonly used nowadays. A typical approach to data parallelization in deep learning uses multiple workers that run variants of SGD [6, 7, 8] on different data batches. Therefore, the effective batch size is increased by the number of workers. Communication ensures that all models are synchronized and critically relies on a scheme where each worker broadcasts its parameter gradients to all the remaining workers. This is the case for DOWNPOUR [9] (its decentralized extension, with no central parameter server, based on the ring topology can be found in [10]) and for Horovod [11]. These techniques require frequent communication (after processing each batch) to avoid instability/divergence, and hence are communication-expensive. Moreover, training with a large batch size usually hurts generalization [12, 13, 14] and convergence speed [15, 16].

Another approach, called Elastic Averaging (Stochastic) Gradient Descent, EA(S)GD [1], introduces elastic forces linking the parameters of the local workers with central parameters computed as a moving average over time and space (i.e., over the parameters computed by the local workers). This method allows less frequent communication, since workers by design do not need to have the same parameters but are instead periodically pulled towards each other. The objective function of EASGD, however, has stationary points which are not stationary points of the underlying objective function (see Proposition 8 in the Supplement), thus optimizing it may lead to suboptimal solutions for the original problem. Further, EASGD can be viewed as a parallel extension of the averaging SGD scheme [17] and as such it inherits the downsides of the averaging policy. On non-convex problems, when the iterates are converging to different local minima (that may potentially be globally optimal), the averaging term can drag the iterates in the wrong directions and significantly hurt the convergence speed of both the local workers and the master. In symmetric regions of the optimization landscape, the elastic forces associated with different workers may cancel each other out, causing the master to be permanently stuck in between or at the maximum between different minima, and the local workers to be stuck at the local minima or on the slopes above them. This can result in arbitrarily bad generalization error. We refer to this phenomenon as the "curse of symmetry". Landscape symmetries are common in a plethora of non-convex problems [18, 19, 20, 21, 22], including deep learning [23, 24, 25, 26].
This paper revisits the EASGD update and modifies it in a simple, yet powerful way which overcomes the above-mentioned shortcomings of the original technique. We propose to replace the elastic force relying on the average of the parameters of local workers with an attractive force linking the local workers and the current best performer among them (the leader). Our approach reduces the communication overhead related to broadcasting the parameters of all workers to each other, and instead requires broadcasting only the leader parameters. The proposed approach easily adapts to a typical hardware architecture comprising multiple compute nodes where each node contains a group of workers and local communication, within a node, is significantly faster than communication between the nodes. We propose a multi-leader extension of our approach that adapts well to this hardware architecture and relies on forming groups of workers (one per compute node) which are attracted both to their local and global leader. To reduce the communication overhead, the corrective force related to the global leader is applied less frequently than the one related to the local leader.
Finally, our L(S)GD approach, similarly to EA(S)GD, tends to explore wide valleys in the optimization landscape when the pulling force between workers and leaders is set to be small. This property often leads to improved generalization performance of the optimizer [27, 28].
The paper is organized as follows: Section 2 introduces the L(S)GD approach, Section 3 provides theoretical analysis, Section 4 contains empirical evaluation, and finally Section 5 concludes the paper. Theoretical proofs and additional theoretical and empirical results are contained in the Supplement.
2 Leader (Stochastic) Gradient Descent “L(S)GD” Algorithm
2.1 Motivating example
Figure 1 illustrates how elastic averaging can impair convergence. To obtain the figure we applied EAGD (Elastic Averaging Gradient Descent) and LGD to a low-rank matrix completion problem. This problem is non-convex but is known to have the property that all local minimizers are global minimizers [18]. For four choices of the rank, we generated random instances of the matrix completion problem, and solved each with EAGD and LGD, initialized from the same starting points and using the same number of workers for both methods. For each algorithm, we report the progress of the best objective value at each iteration, over all workers. Figure 1 shows the results across the random experiments for each rank.
It is clear that EAGD slows down significantly as it approaches a minimizer. Typically, the center of EAGD is close to the average of the workers, which is a poor solution for the matrix completion problem when the workers are approaching different local minimizers, even though all local minimizers are globally optimal. This induces a pull on each node away from the minimizers, which makes it extremely difficult for EAGD to attain a solution of high accuracy. In comparison, LGD does not have this issue. Further details of this experiment, and other illustrative examples of the difference between EAGD and LGD, can be found in the Supplement.
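For intuition, the following is a minimal NumPy sketch of this kind of comparison. It is not the paper's exact experimental setup: the objective here is a fully observed symmetric factorization $f(U) = \frac{1}{4}\|A - UU^{\top}\|_F^2$ (chosen because, like matrix completion, its local minima are global), EAGD's center variable is approximated by the instantaneous average of the workers, and all constants are illustrative assumptions.

```python
# Toy comparison (assumed setup, not the paper's): p workers minimize
# f(U) = 0.25 * ||A - U U^T||_F^2, pulled either towards the workers' average
# (EAGD-style) or towards the current best worker (LGD-style).
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 30, 3, 4                              # matrix size, rank, number of workers
U_star = rng.normal(size=(n, r))
A = U_star @ U_star.T                           # ground-truth low-rank matrix

def loss(U):
    return 0.25 * np.linalg.norm(A - U @ U.T) ** 2

def grad(U):
    return (U @ U.T - A) @ U

def run(method, steps=2000, eta=1e-3, lam=0.1, seed=1):
    rng_w = np.random.default_rng(seed)         # same seed -> same starting points
    workers = [rng_w.normal(size=(n, r)) for _ in range(p)]
    for _ in range(steps):
        losses = [loss(U) for U in workers]
        if method == "EAGD":                    # pull towards the average of the workers
            anchor = sum(workers) / p
        else:                                   # LGD: pull towards the current best worker
            anchor = workers[int(np.argmin(losses))]
        workers = [U - eta * grad(U) - lam * (U - anchor) for U in workers]
    return min(loss(U) for U in workers)

for method in ("EAGD", "LGD"):
    print(method, "best final objective:", run(method))
```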
2.2 Symmetry-breaking updates
Next we explain the basic update of the L(S)GD algorithm. Consider first the single-leader setting and the problem of minimizing a loss function $f$ in a parallel computing environment. The optimization problem is given as

$$\min_{x^1,\dots,x^p} \; \sum_{i=1}^{p}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big] + \frac{\lambda}{2}\,\|x^i - \tilde{x}\|^2\Big), \qquad (1)$$

where $p$ is the number of workers, $x^1,\dots,x^p$ are the parameters of the workers, $\tilde{x}$ are the parameters of the leader, i.e., the best-performing worker ($\tilde{x} = \arg\min_{x \in \{x^1,\dots,x^p\}} f(x)$), and the $\xi^i$s are data samples drawn from some probability distribution $P$. $\lambda$ is the hyperparameter that denotes the strength of the force pulling the workers to the leader. In the theoretical section we will refer to the effective pulling coefficient of the resulting update simply as $\lambda$.
This formulation can be further extended to the multi-leader setting. The optimization problem is modified to the following form:

$$\min_{\{x^{i,j}\}} \; \sum_{j=1}^{n}\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x^{i,j};\xi^{i,j})\big] + \frac{\lambda_L}{2}\,\|x^{i,j} - \tilde{x}^{j}\|^2 + \frac{\lambda_G}{2}\,\|x^{i,j} - \tilde{x}\|^2\Big), \qquad (2)$$

where $n$ is the number of groups, $l$ is the number of workers in each group, $\tilde{x}^{j}$ is the local leader of the $j$-th group (i.e., $\tilde{x}^{j} = \arg\min_{x \in \{x^{1,j},\dots,x^{l,j}\}} f(x)$), $\tilde{x}$ is the global leader (the best worker among the local leaders, i.e., $\tilde{x} = \arg\min_{x \in \{\tilde{x}^{1},\dots,\tilde{x}^{n}\}} f(x)$), $x^{i,j}$ are the parameters of the $i$-th worker in the $j$-th group, and the $\xi^{i,j}$s are the data samples drawn from $P$. $\lambda_L$ and $\lambda_G$ are the hyperparameters that denote the strength of the forces pulling the workers to their local and global leader, respectively.
The updates of the LSGD algorithm are captured below, where $t$ denotes the iteration. The update shown in Equation 3 is obtained by taking a gradient descent step on the objective in Equation 2 with respect to the variables $x^{i,j}$. The stochastic gradient of $f$ with respect to $x^{i,j}$ is denoted as $g^{i,j}_t$ (in the case of LGD the gradient is computed over all training examples) and $\eta$ is the learning rate:

$$x^{i,j}_{t+1} = x^{i,j}_t - \eta\Big(g^{i,j}_t(x^{i,j}_t) + \lambda_L\big(x^{i,j}_t - \tilde{x}^{j}_t\big) + \lambda_G\big(x^{i,j}_t - \tilde{x}_t\big)\Big), \qquad (3)$$

where $\tilde{x}^{j}_t$ and $\tilde{x}_t$ are the local and global leaders defined above.

Equation 3 describes the update of any given worker and is comprised of the regular gradient step and two corrective forces (in the single-leader setting the third term disappears as $\lambda_G = 0$ then). These forces constitute the communication mechanism among the workers and pull all the workers towards the currently best local and global solution to ensure fast convergence. As opposed to EASGD, the updates performed by workers in LSGD break the curse of symmetry and avoid the convergence decelerations that result from workers being pulled towards the average, which is inherently influenced by poorly performing workers. In this paper, instead of pulling workers to their averaged parameters, we propose the mechanism of pulling the workers towards the leaders. The flavor of the update resembles a particle swarm optimization approach
[29], which is not typically used in the context of stochastic gradient optimization for deep learning. Our method may therefore be viewed as a dedicated particle swarm optimization approach for training deep learning models in the stochastic setting and parallel computing environment.

Next we describe the LSGD algorithm in more detail. We rely on the collective communication scheme. In order to reduce the amount of communication between the workers, it is desirable to pull them towards the leaders less often than every iteration. Also, in practice each worker can have a different speed. To prevent waiting for the slower workers and achieve communication efficiency, we implement the algorithm in the asynchronous operation mode. In this case, the communication period is determined based on the total number of iterations computed across all workers and the communication is performed every $\tau_L$ or $\tau_G$ iterations, where $\tau_L$ and $\tau_G$ denote the local and global communication periods, respectively. In practice, we use $\tau_G > \tau_L$ since communication between workers lying in different groups is more expensive than between workers within one group, as explained above. When communication occurs, all workers are updated at the same time (i.e., pulled towards the leaders) in order to take advantage of the collective communication scheme. Between communications, workers run their own local SGD optimizers. The resulting LSGD method is very simple, and is depicted in Algorithm 1.
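To make the update and the communication periods concrete, the following is a minimal single-process sketch of the scheme described above (a simulation, not the distributed Algorithm 1 itself): each worker takes local SGD steps, is pulled towards its local leader every $\tau_L$ iterations, and towards the global leader every $\tau_G$ iterations. The toy objective, the noise model, and all constants are illustrative assumptions.

```python
# Single-process simulation of the LSGD update (cf. Eq. 3) with local/global
# leaders and communication periods tau_L < tau_G. A real implementation would
# distribute the workers and broadcast only the leaders' parameters.
import numpy as np

rng = np.random.default_rng(1)
dim, n_groups, group_size = 10, 2, 4
eta, lam_L, lam_G = 0.05, 0.1, 0.05
tau_L, tau_G = 4, 16                                 # local / global communication periods

def f(x):                                            # toy objective standing in for the loss
    return 0.5 * float(np.sum(x ** 2))

def stoch_grad(x):                                   # noisy gradient (mini-batch surrogate)
    return x + 0.1 * rng.normal(size=x.shape)

x = rng.normal(size=(n_groups, group_size, dim))     # x[j, i] = worker i in group j

for t in range(1, 201):
    for j in range(n_groups):                        # local SGD step for every worker
        for i in range(group_size):
            x[j, i] -= eta * stoch_grad(x[j, i])
    if t % tau_L == 0:                               # pull towards local leaders
        for j in range(n_groups):
            leader = x[j, np.argmin([f(x[j, i]) for i in range(group_size)])].copy()
            x[j] -= lam_L * (x[j] - leader)
    if t % tau_G == 0:                               # pull towards the global leader
        flat = x.reshape(-1, dim)
        global_leader = flat[np.argmin([f(v) for v in flat])].copy()
        x -= lam_G * (x - global_leader)

print("best final loss:", min(f(v) for v in x.reshape(-1, dim)))
```

In the asynchronous distributed setting described above, the pulls would instead be triggered by an iteration counter aggregated across all workers, so that slower workers are never waited on.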
The next section provides a theoretical description of the single-leader batch (LGD) and stochastic (LSGD) variants of our approach.
3 Theoretical Analysis
We assume without loss of generality that there is a single leader. The objective function with multiple leaders contains the penalty $\frac{\lambda_L}{2}\|x - \tilde{x}^{j}\|^2 + \frac{\lambda_G}{2}\|x - \tilde{x}\|^2$, which is equivalent, up to an additive constant, to $\frac{\lambda}{2}\|x - z\|^2$ for $\lambda = \lambda_L + \lambda_G$ and $z = \frac{\lambda_L \tilde{x}^{j} + \lambda_G \tilde{x}}{\lambda_L + \lambda_G}$. Proofs for this section are deferred to the Supplement.
3.1 Convergence Rates for Stochastic Strongly Convex Optimization
We first show that LSGD obtains the same convergence rate as SGD for stochastic strongly convex problems [30]. In Section 3.3 we discuss how and when LGD can obtain better search directions than gradient descent. We discuss non-convex optimization in Section 3.2. Throughout Section 3.1, $f$ will typically satisfy:
Assumption 1. $f$ is $L$-Lipschitz-differentiable and $\mu$-strongly convex, which is to say, the gradient satisfies $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, and $f$ satisfies $f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2$. We write $x^{*}$ for the unique minimizer of $f$, and $\kappa = L/\mu$ for the condition number of $f$.
3.1.1 Convergence Rates
The key technical result is that LSGD satisfies a similar one-step descent in expectation as SGD, with an additional term corresponding to the pull of the leader. To provide a unified analysis of 'pure' LSGD as well as more practical variants where the leader is updated infrequently or with errors, we consider a general iteration $x_{t+1} = x_t - \eta\,\hat{g}(x_t) - \lambda(x_t - z)$, where $z$ is an arbitrary guiding point; that is, $z$ may not be the minimizer over the workers, nor even satisfy $f(z) \le f(x_t)$. Since the nodes operate independently except when updating $z$, we may analyze LSGD steps for each node individually, and we drop the worker index, writing $x$ for brevity.
Theorem 1.
Let $f$ satisfy Assumption 1. Let $\hat{g}$ be an unbiased estimator for $\nabla f$ with variance bounded by $\sigma^2$, and let $z$ be any point. Suppose that $\eta$ and $\lambda$ are sufficiently small (the precise conditions are given in Lemma 12 in the Supplement). Then the LSGD step satisfies the one-step descent bound of Lemma 12: the expected optimality gap contracts as in the standard SGD analysis, with an additional negative term proportional to $\lambda\,(f(x_t) - f(z))$.

Note the presence of the new term, which speeds up convergence when $f(z) < f(x_t)$, i.e., the leader is better than the current iterate. If the leader is always chosen so that $f(z_t) \le \min_i f(x^i_t)$ at every step $t$ and the step sizes are fixed, then $\mathbb{E}[f(x_t)] - f(x^{*})$ converges to a noise-dominated neighborhood of the optimum of size $O(\eta\sigma^2)$. If $\eta_t$ decreases at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)] - f(x^{*}) = O(1/t)$.
3.1.2 Communication Periods
In practice, communication between distributed machines is costly. The LSGD algorithm has a communication period $\tau$ for which the leader is only updated every $\tau$ iterations, so each node can run independently during that period. This period is allowed to differ between nodes, and over time, which captures the asynchronous and multi-leader variants of LSGD. We write $x_{t,k}$ for the $k$-th step during the $t$-th period. It may occur that $f(x_{t,k}) < f(z_t)$ for some $k$, that is, the current solution is now better than the last selected leader. In this case, the leader term may no longer be beneficial, and instead simply pulls $x_{t,k}$ toward $z_t$. There is no general way to determine how many steps are taken before this event. However, we can show that if the step sizes are chosen as in Theorem 1, then the expected objective value does not rise above that of the stale leader (up to gradient noise), so the solution will not become worse than a stale leader. Moreover, as the number of steps taken within the period goes to infinity, LSGD converges to the minimizer of the penalized objective $f(x) + \frac{\lambda}{2}\|x - z_t\|^2$, which is quantifiably better than $z_t$ as captured in Theorem 2. Together, these facts show that LSGD is safe to use with long communication periods as long as the original leader is good.
Theorem 2.
Let $f$ be $\mu$-strongly convex, and let $x^{*}$ be the minimizer of $f$. For a fixed leader $z$ and penalty $\lambda$, define $F_z(x) = f(x) + \frac{\lambda}{2}\|x - z\|^2$. The minimizer $\hat{x}$ of $F_z$ satisfies $\|\hat{x} - x^{*}\| \le \frac{\lambda}{\lambda + \mu}\,\|z - x^{*}\|$.
In our experiments, we employ another method to avoid this issue. To ensure that the leader is good, we perform an LSGD step only on the first step after a leader update, and then take standard SGD steps for the remainder of the communication period.
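A tiny sketch of this scheme follows (function and parameter names are illustrative, not from the paper's released code): within each communication period, the leader pull is applied only on the first step after the leader has been refreshed, and plain SGD is used for the remaining steps.

```python
# One communication period of length tau: the leader z is assumed fresh at the
# start of the period, so the pull towards z is applied only at step k = 0.
def run_period(x, z, tau, eta, lam, stoch_grad):
    for k in range(tau):
        x = x - eta * stoch_grad(x)
        if k == 0:                    # leader is guaranteed good only right after selection
            x = x - lam * (x - z)
    return x
```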
3.1.3 Stochastic Leader Selection
Next, we consider the impact of selecting the leader with errors. In practice, it is often costly to evaluate $f(x^1),\dots,f(x^p)$ exactly, as in deep learning. Instead, we estimate the values $f(x^i)$, and then select as the leader the variable having the smallest estimate. Formally, suppose that we have an unbiased estimator $\hat{f}(x)$ of $f(x)$ with uniformly bounded variance $\sigma_f^2$. At each step, a single sample $\hat{f}(x^i)$ is drawn from each estimator, and then $\hat{z} = \arg\min_i \hat{f}(x^i)$. We refer to this as stochastic leader selection. The stochastic leader satisfies $\mathbb{E}[f(\hat{z})] \le f(z) + O(\sigma_f)$, where $z$ is the true leader (see the supplementary materials). Thus, the error introduced by the stochastic leader contributes an additive error of at most $O(\lambda\sigma_f)$ per step. Since this error is of first order in $\lambda$ rather than second order, we cannot guarantee exact convergence with a constant estimator accuracy unless $\lambda_t$ is also decreasing (for intuition, note that the series $\sum_t \frac{1}{t}$ is divergent). We have the following result:

Theorem 3.
Let $f$ satisfy Assumption 1, and let $\hat{g}$ be as in Theorem 1. Suppose we use stochastic leader selection with estimates $\hat{f}_t$ having variance $\sigma_{f,t}^2$. If $\eta$, $\lambda$, and $\sigma_{f,t}$ are fixed and sufficiently small, then $\mathbb{E}[f(x_t)] - f(x^{*})$ converges to a neighborhood of the optimum whose size is controlled by $\eta\sigma^2$ and $\lambda\sigma_f$. If $\eta_t$, $\lambda_t$, and $\sigma_{f,t}$ decrease at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)] - f(x^{*}) = O(1/t)$.
The communication period and the accuracy of stochastic leader selection are both methods of reducing the cost of updating the leader, and they can act as substitutes. When the communication period is long, it may be effective to estimate $f(x^1),\dots,f(x^p)$ to higher accuracy, since this can be done independently.
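As a concrete illustration of stochastic leader selection, the sketch below estimates each worker's loss on a single mini-batch and picks the worker with the smallest estimate; the function and variable names are illustrative assumptions, not taken from the paper's released code.

```python
# Stochastic leader selection: pick the worker whose loss estimate on one
# mini-batch is smallest. Averaging over several recent batches (as in the
# experiments) simply replaces `estimates` with a running mean per worker.
import torch

@torch.no_grad()
def select_leader(models, batch, loss_fn):
    inputs, targets = batch
    estimates = [loss_fn(model(inputs), targets).item() for model in models]
    return int(torch.tensor(estimates).argmin())

# Example usage with tiny models and random data:
if __name__ == "__main__":
    models = [torch.nn.Linear(4, 2) for _ in range(4)]
    batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
    leader_idx = select_leader(models, batch, torch.nn.CrossEntropyLoss())
    print("leader:", leader_idx)
```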
3.2 Nonconvex Optimization: Stationary Points
As mentioned above, EASGD has the flaw that the EASGD objective function can have stationary points at which none of the worker variables $x^1,\dots,x^p$ is a stationary point of the underlying function $f$. LSGD does not have this issue.
Theorem 4.
Let $(x^1,\dots,x^p)$ be a point such that the leader $\tilde{x}$ is the unique minimizer of $f$ among $x^1,\dots,x^p$. If $(x^1,\dots,x^p)$ is a stationary point of the LSGD objective function, then $\nabla f(\tilde{x}) = 0$.
Moreover, it can be shown that for the deterministic algorithm LGD, with any choice of communication periods, there will always be some variable $x^i$ whose gradient vanishes in the limit.
Theorem 5.
Assume that $f$ is bounded below and Lipschitz-differentiable, and that the LGD step sizes are selected sufficiently small (in accordance with Lemma 11). Then for any choice of communication periods, it holds that for every $i$ such that $x^i$ is the leader infinitely often, $\liminf_{t \to \infty} \|\nabla f(x^i_t)\| = 0$.
3.3 Search Direction Improvement from Leader Selection
In this section, we discuss how LGD can obtain better search directions than gradient descent. In general, it is difficult to determine when the LGD step will produce a lower function value than the gradient descent step, since this depends on the precise combination of $\eta$ and $\lambda$, and moreover, the maximum allowable value of $\eta$ is different for LGD and gradient descent. Instead, we measure the goodness of a search direction by the angle it forms with the Newton direction $d_N = -\nabla^2 f(x)^{-1}\nabla f(x)$. The Newton method is locally quadratically convergent around local minimizers with non-singular Hessian, and converges in a single step for quadratic functions if the step size is $1$. Hence, we consider it desirable to have search directions that are close to $d_N$. Let $\theta(u, v)$ denote the angle between $u$ and $v$. Let $d_z = -\nabla f(x) - \lambda(x - z)$ be the LGD direction with leader $z$, and let $d_{GD} = -\nabla f(x)$ be the gradient descent direction. The angle improvement set $I(x)$ is the set of leaders $z$ for which $\theta(d_z, d_N) < \theta(d_{GD}, d_N)$. The set of candidate leaders is $C(x) = \{z : f(z) \le f(x)\}$. We aim to show that a large subset of the leaders in $C(x)$ belong to $I(x)$.
In this section, we consider the positive definite quadratic $f(x) = \frac{1}{2}x^{\top}Ax$ with condition number $\kappa$ and minimizer $x^{*} = 0$. The first result shows that as $\lambda$ becomes sufficiently small, at least half of the candidate set $C(x)$ improves the angle.
Theorem 6.
Let $x$ be any point with $\nabla f(x) \ne 0$. Then, in the limit as $\lambda \to 0$, at least half of the candidate set $C(x)$ belongs to the angle improvement set $I(x)$.
Next, we consider the case when $\kappa$ is large. We show that points at which the gradient direction forms a large angle with the Newton direction exist, and these are the points most suitable for improvement by LGD. For such points, and for a certain range of $\lambda$, we show that $I(x)$ contains at least half of $C(x)$.
Theorem 7.
Let $x$ be a point at which the angle between the gradient direction and the Newton direction is sufficiently large. Then, for $\lambda$ in a suitable range, at least half of the candidate leaders in $C(x)$ belong to $I(x)$.
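The following short numeric check illustrates the angle comparison on an ill-conditioned two-dimensional quadratic; the specific matrix, iterate, leader, and value of $\lambda$ are illustrative choices, not the constants appearing in Theorems 6 and 7.

```python
# Angle of the GD and LGD directions with the Newton direction for
# f(x) = 0.5 * x^T A x. With a leader z that has f(z) < f(x), the LGD
# direction typically makes a smaller angle with the Newton direction.
import numpy as np

A = np.diag([100.0, 1.0])                 # ill-conditioned quadratic, kappa = 100
f = lambda v: 0.5 * v @ A @ v
x = np.array([0.1, 1.0])                  # current iterate, f(x) = 1.0
z = np.array([0.05, 0.2])                 # candidate leader, f(z) ~ 0.145 < f(x)
lam = 0.5

def angle_deg(u, v):
    return np.degrees(np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))

d_newton = -np.linalg.solve(A, A @ x)     # Newton direction (equals -x here)
d_gd = -A @ x                             # gradient descent direction
d_lgd = -A @ x - lam * (x - z)            # LGD direction with leader z

print("angle(GD,  Newton):", round(angle_deg(d_gd, d_newton), 1))   # ~78.6 degrees
print("angle(LGD, Newton):", round(angle_deg(d_lgd, d_newton), 1))  # ~76.3 degrees
```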
4 Experimental Results
4.1 Experimental setup
In this section we compare the performance of LSGD with state-of-the-art methods for parallel training of deep networks, such as EASGD and DOWNPOUR (their pseudo-codes can be found in [1]), as well as the sequential technique SGD. For DOWNPOUR we use the communication period typically used for this method in all our experiments, as this setting ensures stable convergence. The experiments were performed using the CIFAR data set [31] on three benchmark architectures: the CNN used in the original EASGD paper (see Section 5.1 in [1]), which we refer to as CNN, VGG [32], and ResNet [33]; and the ImageNet (ILSVRC) data set [34] on ResNet.

During training, we select the leader for the LSGD method based on the average of the training loss computed over the most recent data batches (using different window lengths for CIFAR and ImageNet). At testing, we report the performance of the center variable for EASGD and LSGD, where for LSGD the center variable is computed as the average of the parameters of all workers. We use weight decay with the same decay coefficient for all methods. In our experiments we use either a smaller number of workers (single-leader LSGD setting) or a larger number of workers partitioned into groups (multi-leader LSGD setting).
We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors, where each local worker corresponds to one GPU processor. We use CUDA Toolkit 10.0 (https://developer.nvidia.com/cuda-zone) and NCCL 2 (https://developer.nvidia.com/nccl).
We have developed a software package based on PyTorch for distributed training, which will be released (details are elaborated in Section 9.4). Data processing and prefetching are discussed in the Supplement. The summary of the hyperparameters explored for each method is also provided in the Supplement. We use a constant learning rate for CNN and a learning rate drop (we divide the learning rate by a constant factor when we observe saturation of the optimizer) for VGG and the ResNet models.
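For reference, the following is a minimal sketch of how the LSGD center variable reported at test time could be formed, i.e., by averaging the parameters of all workers; the helper name is an illustrative assumption and the sketch ignores buffers such as batch-norm running statistics.

```python
# Build an evaluation model whose parameters are the element-wise average of
# all workers' parameters (the LSGD "center variable" used at test time).
import copy
import torch

def average_workers(models):
    center = copy.deepcopy(models[0])
    with torch.no_grad():
        worker_params = [list(m.parameters()) for m in models]
        for k, p_center in enumerate(center.parameters()):
            stacked = torch.stack([params[k].detach() for params in worker_params])
            p_center.copy_(stacked.mean(dim=0))
    return center
```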
4.2 Experimental Results
In Figure 2 we report results obtained with CNN on CIFAR. We run EASGD and LSGD with the same communication period, using an additional global communication period for the multi-leader LSGD case. Our method consistently outperforms the competitors in terms of convergence speed, converging noticeably faster than EASGD, and in the setting with more workers it also obtains a smaller error.
In Figure 3 we report results for VGG on CIFAR. LSGD converges marginally faster than EASGD and recovers the same error. At the same time it significantly outperforms DOWNPOUR in terms of convergence speed and obtains a slightly better solution.
The experimental results obtained using ResNet on CIFAR, for the same settings of the communication period and the number of workers as in the CNN experiment, are shown in Figure 6. With the smaller number of workers we converge comparably fast to EASGD but recover a better test error. For this experiment, Figure 6 also shows the switching pattern between the leaders, indicating that LSGD indeed takes advantage of all workers when exploring the landscape. With the larger number of workers we converge faster than EASGD and obtain a significantly smaller error. In this experiment and the CNN experiment, LSGD (as well as EASGD) is consistently better than DOWNPOUR and SGD, as expected.
Finally, in Figure 6 we report the empirical results for ResNet run on ImageNet. In this experiment our algorithm behaves comparably to EASGD but converges much faster than DOWNPOUR.
5 Conclusion
In this paper we propose a new algorithm called LSGD for distributed optimization in non-convex settings. Our approach relies on pulling workers to the current best performer among them, rather than to their average, at each iteration. We justify replacing the average by the leader both theoretically and through empirical demonstrations. We provide a thorough theoretical analysis of our algorithm, including a proof of convergence. Finally, we apply our approach to the matrix completion problem and to the training of deep learning models, and demonstrate that it is well-suited to these learning settings.
References
 [1] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.
 [4] J. Weston, S. Chopra, and K. Adams. #tagspace: Semantic embeddings from hashtags. In EMNLP, 2014.
 [5] U. Wickramasinghe and A. Lumsdaine. A survey of methods for collective communication optimization and tuning. CoRR, abs/1611.06334, 2016.
 [6] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
 [7] T. BenNun and T. Hoefler. Demystifying parallel and distributed deep learning: An indepth concurrency analysis. CoRR, abs/1802.09941, 2018.
 [8] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 77–86, 2018.
 [9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
 [10] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 2018.
 [11] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.
 [12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
 [13] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Finding flatter minima with SGD. In ICLR Workshop Track, 2018.
 [14] S. L. Smith and Q. V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.

 [15] S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern overparametrized learning. In ICML, 2018.
 [16] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32K for ImageNet training. In ICLR, 2018.
 [17] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 [18] X. Li, J. Lu, R. Arora, J. Haupt, H. Liu, Z. Wang, and T. Zhao. Symmetry, saddle points, and global optimization landscape of nonconvex matrix factorization. IEEE Transactions on Information Theory, 2019.
 [19] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML, 2017.
 [20] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
 [21] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853–884, 2017.
 [22] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, 2016.
 [23] V. Badrinarayanan, B. Mishra, and R. Cipolla. Understanding symmetries in deep networks. CoRR, abs/1511.01029, 2015.
 [24] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
 [25] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In ICML, 2018.
 [26] K. Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.
 [27] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. EntropySGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
 [28] P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, and A. Talwalkar. Parle: parallelizing stochastic gradient descent. In SysML, 2018.
 [29] J. Kennedy and R. Eberhart. Particle swarm optimization. In ICNN, 1995.
 [30] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 [31] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009.
 [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [33] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
 [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
6 LGD versus EAGD: Illustrative Example
We consider a simple non-convex optimization problem with multiple local minima. Both methods use the same set of workers with identical initial points. The communication period and the learning rate are the same for EAGD and LGD; each method additionally uses its own pulling-strength hyperparameter.
Table 1 captures optima obtained by different methods.
Optimizer    Optimum obtained
EAGD         0.0912
LGD          0.2172
Figure 7 captures the optimization trajectories of EAGD and LGD algorithms. Clearly, EAGD suffers from the averaging policy, whereas LGD is able to recover a solution close to the global optimum.
7 Proofs of Theoretical Results
We provide omitted proofs from the main text.
7.1 Definitions and Notation
Recall that the objective function of Leader (Stochastic) Gradient Descent (L(S)GD) is defined as

$$\mathcal{L}(x^1,\dots,x^p) = \sum_{i=1}^{p}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big] + \frac{\lambda}{2}\,\|x^i - \tilde{x}\|^2\Big), \qquad (4)$$

where $\tilde{x} = \arg\min_{x \in \{x^1,\dots,x^p\}} f(x)$. An L(S)GD step is a (stochastic) gradient step applied to $\mathcal{L}$. Writing $z = \tilde{x}$ at a particular iterate, the update in the variable $x^i$ is

$$x^i_{+} = x^i - \eta\,\hat{g}(x^i) - \lambda\,(x^i - z),$$

where, as in the main text, the learning rate has been absorbed into the pulling coefficient $\lambda$.
Observe that this reduces to a (S)GD step for the variable which is the leader.
Practical variants of the algorithm do not communicate the updated leader at every iteration. Thus, in our analysis, we will generally take $z$ to be an arbitrary guiding point, which is not necessarily the minimizer over $x^1,\dots,x^p$, nor need it satisfy $f(z) \le f(x^i)$ for all $i$. The required properties of $z$ will be specified on a result-by-result basis.
When discussing the optimization landscape of LSGD, the term 'LSGD objective function' will refer to (4) with $\tilde{x}$ defined as the argmin.
Communication periods are sequences of steps during which the leader is not updated. We introduce the notation $x_{t,k}$ for the $k$-th step in the $t$-th period, where the leader is updated only at the beginning of each period. We write $m_t$ for the number of steps that a worker takes during the $t$-th period. The standard LSGD defined above has $m_t = 1$ for all $t$, in which case the notation reduces to the plain iteration index. In addition, let $z_t$ denote the leader for the $t$-th period.
7.2 Stationary Points of EASGD
The EASGD [1] objective function is defined as

$$F_{\mathrm{EASGD}}(x^1,\dots,x^p,\tilde{x}) = \sum_{i=1}^{p}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big] + \frac{\rho}{2}\,\|x^i - \tilde{x}\|^2\Big), \qquad (5)$$

where $\tilde{x}$ is the center variable. Observe that, unlike in LSGD, $\tilde{x}$ is a decision variable of EASGD. A stationary point of EASGD is a point $(x^1,\dots,x^p,\tilde{x})$ such that $\nabla F_{\mathrm{EASGD}}(x^1,\dots,x^p,\tilde{x}) = 0$.
Proposition 8.
There exists a Lipschitz-differentiable function $f$ such that for every $\rho > 0$, there exists a point which is a stationary point of EASGD with parameter $\rho$, but none of the worker variables $x^1,\dots,x^p$ is a stationary point of $f$.
Proof.
Define $f$ piecewise, where the middle piece $q$ is a sixth-degree polynomial. For $f$ to be Lipschitz-differentiable, we select $q$ to make $f$ twice continuously differentiable with bounded second derivative. Making $f$ twice continuously differentiable requires matching the values and the first and second derivatives of $q$ with those of the adjacent pieces at the endpoints of its interval; since we also aim to have $f$ symmetric, we impose one further condition on $q$. The existence of such a $q$ is equivalent to the solvability of a linear system, which is easily checked to be invertible. Thus, we deduce that such a function exists.

It remains to show that for any $\rho > 0$, there exists a stationary point of EASGD. Set $\tilde{x} = 0$. The first-order condition for a worker variable $x$ reads $f'(x) + \rho\,x = 0$. On the relevant interval $f'$ is increasing, so $f'(x) + \rho x$ is increasing, and we deduce that there exists a solution $x_\rho$ with $f'(x_\rho) \ne 0$. By symmetry, $-x_\rho$ satisfies the analogous condition, since $f(-x) = f(x)$ for all $x$. Hence, the point with the workers placed at $\pm x_\rho$ and the center at $\tilde{x} = 0$ is a stationary point of EASGD, but neither $x_\rho$ nor $-x_\rho$ is a stationary point of $f$. ∎
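As a sanity check, the snippet below exhibits the same symmetric-cancellation phenomenon numerically on the double-well $f(x) = (x^2 - 1)^2$; this simple choice is an illustrative assumption, not the Lipschitz-differentiable construction used in the proof. Two workers at $\pm x_\rho$ with the center at $0$ make the EASGD objective stationary while neither worker is a stationary point of $f$.

```python
# Numeric illustration of the "curse of symmetry" for EASGD on f(x) = (x^2 - 1)^2:
# the EASGD objective is stationary although f'(x) != 0 at both workers.
import numpy as np

f_prime = lambda x: 4.0 * x * (x ** 2 - 1.0)

rho = 1.0
x_rho = np.sqrt(1.0 - rho / 4.0)              # solves f'(x) + rho * x = 0 (for rho < 4)

workers = np.array([x_rho, -x_rho])
center = 0.0
grad_workers = f_prime(workers) + rho * (workers - center)   # d/dx^i of the EASGD objective
grad_center = rho * np.sum(center - workers)                  # d/dx~ of the EASGD objective

print("EASGD gradient w.r.t. workers:", grad_workers)   # ~[0, 0]
print("EASGD gradient w.r.t. center :", grad_center)    # 0
print("f'(workers):", f_prime(workers))                 # nonzero at both workers
```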
7.3 Technical Preliminaries
Recall the statement of Assumption 1:

Assumption 1. $f$ is $L$-Lipschitz-differentiable and $\mu$-strongly convex, which is to say, the gradient satisfies $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, and $f$ satisfies $f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2$. We write $x^{*}$ for the unique minimizer of $f$, and $\kappa = L/\mu$ for the condition number of $f$.
We will frequently use the following standard result.
Lemma 9.
If $f$ is $L$-Lipschitz-differentiable, then $f(y) \le f(x) + \nabla f(x)^{\top}(y - x) + \frac{L}{2}\|y - x\|^2$.
Proof.
See [30], eq. (4.3). ∎
Lemma 10.
Let $f$ be $\mu$-strongly convex, and let $x^{*}$ be the minimizer of $f$. Then

$$\|\nabla f(x)\|^2 \ge 2\mu\,\big(f(x) - f(x^{*})\big), \qquad (6)$$

and

$$f(x) - f(x^{*}) \ge \frac{\mu}{2}\,\|x - x^{*}\|^2. \qquad (7)$$

Proof.
Equation 6 is the well-known Polyak-Łojasiewicz inequality. Equation 7 follows from the definition of strong convexity and $\nabla f(x^{*}) = 0$. ∎
Lemma 11.
Let $f$ be $L$-Lipschitz-differentiable. If the gradient descent step size satisfies $\eta \le \frac{1}{L}$, then $f(x_{+}) \le f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2$, where $x_{+} = x - \eta\nabla f(x)$.
Proof.
Apply Lemma 9 with $y = x_{+}$ and use $\eta L \le 1$. ∎
7.4 Proofs from Section 3.1.1
Lemma 12 (One-Step Descent).
Let $f$ satisfy Assumption 1. Let $\hat{g}$ be an unbiased estimator for $\nabla f$ with variance bounded by $\sigma^2$. Let $x$ be the current iterate, and let $z$ be another point, with $x_{+} = x - \eta\,\hat{g}(x) - \lambda\,(x - z)$. Then the LSGD step satisfies a one-step descent inequality (8), where the expectation is taken with respect to $\hat{g}$ and conditioned on the current point $x$: the expected optimality gap contracts as in the standard SGD analysis, with an additional negative term proportional to $\lambda\,(f(x) - f(z))$ coming from the pull of the leader and an additive noise term of order $\eta^2\sigma^2$. Hence, for $\eta$ and $\lambda$ sufficiently small, the inequality simplifies to the bound (9) used in Theorem 1 and Theorem 13.
Proof.
The proof is similar to the convergence analysis of SGD. We apply Lemma 9 to the LSGD step, take the expectation, and use the unbiasedness of $\hat{g}$. Using the definition of strong convexity, we bound the inner product between $\nabla f(x)$ and the leader term $x - z$ in terms of $f(x) - f(z)$. Substituting this bound and splitting the resulting terms, we bound each in turn: the first via the standard bias-variance decomposition of $\hat{g}$, the second directly, and the third via Young's inequality followed by the Cauchy-Schwarz inequality. Combining these inequalities yields the desired result. ∎
Theorem 13.
Let $f$ satisfy Assumption 1. Suppose that the leader is always chosen so that $f(z_t) \le \min_i f(x^i_t)$ at every step. If $\eta$ and $\lambda$ are fixed and sufficiently small, then $\mathbb{E}[f(x_t)] - f(x^{*})$ converges to a noise-dominated neighborhood of the optimum of size $O(\eta\sigma^2)$. If $\eta_t$ decreases at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)] - f(x^{*}) = O(1/t)$.
Proof.
This result follows from (9) and Theorems 4.6 and 4.7 of [30]. ∎
7.5 Proofs from Section 3.1.2
Theorem 14.
Let $f$ satisfy Assumption 1. Suppose that $\eta$ and $\lambda$ are small enough that the conditions of Lemma 12 hold. If $f(z_t) \le f(x_{t,k})$, then