Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.


1 Introduction

As deep learning models and data sets grow in size, it becomes increasingly helpful to parallelize their training over a distributed computational environment. These models lie at the core of many modern machine-learning-based systems for image recognition NIPS2012_4824 , speech recognition DBLP:conf/icassp/Abdel-HamidMJP12 , natural language processing DBLP:conf/emnlp/WestonCA14 , and more. This paper focuses on the parallelization of the data, not the model, and considers the collective communication scheme DBLP:journals/corr/WickramasingheL16 that is most commonly used nowadays. A typical approach to data parallelization in deep learning uses multiple workers that run variants of SGD bottou-98x ; B-NH2018arxiv ; GAJKB2018SPAA on different data batches; the effective batch size is therefore multiplied by the number of workers. Communication ensures that all models are synchronized and critically relies on a scheme where each worker broadcasts its parameter gradients to all the remaining workers. This is the case for DOWNPOUR DOWNPOUR (its decentralized extension, with no central parameter server, based on the ring topology can be found in pmlr-v80-lian18a ) and Horovod sergeev2018horovod . These techniques require frequent communication (after processing each batch) to avoid instability/divergence, and hence are communication-expensive. Moreover, training with a large batch size usually hurts generalization KMNST2017ICLR ; JKABFBS2018ICLRW ; SL2018ICLR and convergence speed MBB2018ICML ; DBLP:journals/corr/abs-1708-03888 .
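
For concreteness, a minimal sketch of one such synchronous data-parallel step is shown below (PyTorch-style; the model, loss function, and process-group initialization are assumed to exist, and the snippet illustrates the gradient all-reduce pattern rather than the internals of DOWNPOUR or Horovod):

import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, lr, world_size):
    """One synchronous data-parallel SGD step: each worker computes the
    gradient on its own batch; gradients are then averaged across all
    workers with an all-reduce before the local parameter update."""
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over workers
            p.grad /= world_size                           # average the gradients
            p -= lr * p.grad                               # plain SGD update
    return float(loss)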

Another approach, called Elastic Averaging (Stochastic) Gradient Descent, EA(S)GD EASGD , introduces elastic forces linking the parameters of the local workers with central parameters computed as a moving average over time and space (i.e. over the parameters computed by local workers). This method allows less frequent communication, as workers by design do not need to have the same parameters but are instead periodically pulled towards each other. The objective function of EASGD, however, has stationary points which are not stationary points of the underlying objective function (see Proposition 8 in the Supplement); thus optimizing it may lead to sub-optimal solutions for the original problem. Further, EASGD can be viewed as a parallel extension of the averaging SGD scheme polyak1992acceleration and as such it inherits the downsides of the averaging policy. On non-convex problems, when the iterates are converging to different local minima (that may potentially be globally optimal), the averaging term can drag the iterates in the wrong directions and significantly hurt the convergence speed of both the local workers and the master. In symmetric regions of the optimization landscape, the elastic forces related to different workers may cancel each other out, causing the master to be permanently stuck in between or at the maximum between different minima, and the local workers to be stuck at the local minima or on the slopes above them. This can result in arbitrarily bad generalization error. We refer to this phenomenon as the “curse of symmetry”. Landscape symmetries are common in a plethora of non-convex problems LLAHLWZ2019IEEETRINFTH ; ge2017no ; DBLP:journals/focm/SunQW18 ; DBLP:journals/tit/SunQW17 ; ge2016matrix , including deep learning DBLP:journals/corr/BadrinarayananM15 ; DBLP:conf/aistats/ChoromanskaHMAL15 ; pmlr-v80-liang18a ; K2016NIPS .
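
As a hedged one-dimensional illustration of this curse (our toy construction, not an example from the paper), consider the double well f(x) = (x^2 - 1)^2 with two workers started in opposite valleys: the elastic forces cancel by symmetry, the center remains stuck at the local maximum x = 0, and the workers equilibrate slightly above their minima:

import numpy as np

def grad_f(x):
    return 4.0 * x * (x * x - 1.0)   # f(x) = (x^2 - 1)^2: minima at +/-1, maximum at 0

eta, rho = 0.05, 0.1                 # learning rate and elastic coefficient
workers = np.array([1.0, -1.0])      # each worker sits in a different valley
center = workers.mean()              # elastic-averaging center variable

for _ in range(1000):
    # Elastic-averaging update: gradient step plus pull toward the center.
    workers = workers - eta * grad_f(workers) - eta * rho * (workers - center)
    # The center moves toward the average of the workers; by symmetry the
    # two pulls cancel and the center never leaves the local maximum at 0.
    center = center + eta * rho * np.sum(workers - center)

print(workers, center)               # workers hover near +/-1; center stays at 0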

Figure 1: Low-rank matrix completion problems solved with EAGD and LGD. A fixed dimension and four ranks are used. The reported value for each algorithm is the value of the best worker (over all workers used in total) at each step.

This paper revisits the EASGD update and modifies it in a simple, yet powerful way which overcomes the above-mentioned shortcomings of the original technique. We propose to replace the elastic force relying on the average of the parameters of local workers by an attractive force linking the local workers and the current best performer among them (the leader). Our approach reduces the communication overhead related to broadcasting the parameters of all workers to each other, and instead requires broadcasting only the leader parameters. The proposed approach easily adapts to a typical hardware architecture comprising multiple compute nodes, where each node contains a group of workers and local communication within a node is significantly faster than communication between the nodes. We propose a multi-leader extension of our approach that adapts well to this hardware architecture and relies on forming groups of workers (one per compute node) which are attracted both to their local and global leader. To reduce the communication overhead, the corrective force related to the global leader is applied less frequently than the one related to the local leader.

Finally, our L(S)GD approach, similarly to EA(S)GD, tends to explore wide valleys in the optimization landscape when the pulling force between workers and leaders is set to be small. This property often leads to improved generalization performance of the optimizer DBLP:journals/corr/ChaudhariCSL16 ; DBLP:journals/corr/ChaudhariBZST17 .

The paper is organized as follows: Section 2 introduces the L(S)GD approach, Section 3 provides theoretical analysis, Section 4 contains empirical evaluation, and finally Section 5 concludes the paper. Theoretical proofs and additional theoretical and empirical results are contained in the Supplement.

2 Leader (Stochastic) Gradient Descent “L(S)GD” Algorithm

2.1 Motivating example

Figure 1 illustrates how elastic averaging can impair convergence. To obtain the figure we applied EAGD (Elastic Averaging Gradient Descent) and LGD to a low-rank matrix completion problem. This problem is non-convex but is known to have the property that all local minimizers are global minimizers LLAHLWZ2019IEEETRINFTH . For four choices of the rank, we generated random instances of the matrix completion problem and solved each with EAGD and LGD, initialized from the same starting points (the same number of workers is used for both methods). For each algorithm, we report the progress of the best objective value at each iteration, over all workers. Figure 1 shows the results across the random experiments for each rank.

It is clear that EAGD slows down significantly as it approaches a minimizer. Typically, the center of EAGD is close to the average of the workers, which is a poor solution for the matrix completion problem when the workers are approaching different local minimizers, even though all local minimizers are globally optimal. This induces a pull on each node away from the minimizers, which makes it extremely difficult for EAGD to attain a solution of high accuracy. In comparison, LGD does not have this issue. Further details of this experiment, and other illustrative examples of the difference between EAGD and LGD, can be found in the Supplement.
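
To make the setup concrete, below is a minimal sketch of LGD on a toy instance of this problem; the fully observed factorization objective f(X) = 0.5 ||X X^T - M||_F^2, the problem sizes, and all hyperparameter values are illustrative assumptions rather than the exact experimental configuration:

import numpy as np

rng = np.random.default_rng(0)
d, r, p = 20, 2, 8                        # dimension, rank, number of workers
G = rng.standard_normal((d, r))
M = G @ G.T                               # ground-truth low-rank matrix

def f(X):                                 # f(X) = 0.5 * ||X X^T - M||_F^2
    R = X @ X.T - M
    return 0.5 * np.sum(R * R)

def grad(X):                              # gradient 2 (X X^T - M) X (residual is symmetric)
    return 2.0 * (X @ X.T - M) @ X

eta, lam, tau = 1e-3, 0.05, 10            # step size, pulling strength, period
X = [0.1 * rng.standard_normal((d, r)) for _ in range(p)]
leader = min(X, key=f).copy()

for t in range(3000):
    if t % tau == 0:                      # periodically refresh the leader
        leader = min(X, key=f).copy()
    # Gradient step plus a pull toward the current leader (the LGD update).
    X = [Xi - eta * grad(Xi) - lam * (Xi - leader) for Xi in X]

print("best objective:", min(f(Xi) for Xi in X))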

2.2 Symmetry-breaking updates

Next we explain the basic update of the L(S)GD algorithm. Consider first the single-leader setting and the problem of minimizing a loss function $f$ in a parallel computing environment. The optimization problem is given as

$$\min_{x^1,\dots,x^l}\;\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big]+\frac{\lambda}{2}\big\|x^i-\tilde{x}\big\|^2\Big), \tag{1}$$

where $l$ is the number of workers, $x^1,\dots,x^l$ are the parameters of the workers, and $\tilde{x}$ are the parameters of the leader, i.e. the best-performing worker: $\tilde{x}=\arg\min_{x\in\{x^1,\dots,x^l\}}f(x)$. The $\xi^i$s are data samples drawn from some probability distribution $\mathcal{P}$. $\lambda$ is the hyperparameter that denotes the strength of the force pulling the workers to the leader. In the theoretical section we will refer to the leader $\tilde{x}$ as simply $z$. This formulation can be further extended to the multi-leader setting. The optimization problem is modified to the following form

$$\min_{\{x^{i,j}\}}\;\sum_{j=1}^{n}\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x^{i,j};\xi^{i,j})\big]+\frac{\lambda}{2}\big\|x^{i,j}-\tilde{x}^{j}\big\|^2+\frac{\lambda_G}{2}\big\|x^{i,j}-\tilde{x}^{G}\big\|^2\Big), \tag{2}$$

where $n$ is the number of groups, $l$ is the number of workers in each group, $\tilde{x}^{j}$ is the local leader of the $j$-th group (i.e. $\tilde{x}^{j}=\arg\min_{x\in\{x^{1,j},\dots,x^{l,j}\}}f(x)$), $\tilde{x}^{G}$ is the global leader (the best worker among local leaders, i.e. $\tilde{x}^{G}=\arg\min_{x\in\{\tilde{x}^{1},\dots,\tilde{x}^{n}\}}f(x)$), $x^{1,j},\dots,x^{l,j}$ are the parameters of the workers in the $j$-th group, and the $\xi^{i,j}$s are the data samples drawn from $\mathcal{P}$. $\lambda$ and $\lambda_G$ are the hyperparameters that denote the strength of the forces pulling the workers to their local and global leader, respectively.

The updates of the LSGD algorithm are captured below, where $t$ denotes the iteration. The first update shown in Equation 3 is obtained by taking the gradient descent step on the objective in Equation 2 with respect to the variables $x^{i,j}$. The stochastic gradient of $f$ with respect to $x^{i,j}$ is denoted as $\nabla f(x^{i,j};\xi^{i,j})$ (in case of LGD the gradient is computed over all training examples) and $\eta$ is the learning rate.

$$x^{i,j}_{t+1}=x^{i,j}_{t}-\eta\,\nabla f(x^{i,j}_{t};\xi^{i,j}_{t})-\lambda\,(x^{i,j}_{t}-\tilde{x}^{j}_{t})-\lambda_G\,(x^{i,j}_{t}-\tilde{x}^{G}_{t}), \tag{3}$$

where $\tilde{x}^{j}_{t}$ and $\tilde{x}^{G}_{t}$ are the local and global leaders defined above.

Input: pulling coefficients λ, λ_G, learning rate η, local/global communication periods τ, τ_G
Initialize:
         Randomly initialize x^{i,j} for all i, j
         Set iteration counters t^{i,j} = 0
         Set t = 0
repeat
     for all j ∈ {1, …, n}, i ∈ {1, …, l} do ▷ Do in parallel for each worker
         Draw random sample ξ^{i,j}_{t^{i,j}}
         x^{i,j} ← x^{i,j} − η ∇f(x^{i,j}; ξ^{i,j}_{t^{i,j}})
         t^{i,j} ← t^{i,j} + 1; t ← t + 1
         if τ divides t then
              x̃^j ← argmin_{x ∈ {x^{1,j}, …, x^{l,j}}} f(x) ▷ Determine the local best workers
              x^{i,j} ← x^{i,j} − λ (x^{i,j} − x̃^j) ▷ Pull to the local best workers
         end if
         if τ_G divides t then
              x̃^G ← argmin_{x ∈ {x̃^1, …, x̃^n}} f(x) ▷ Determine the global best worker
              x^{i,j} ← x^{i,j} − λ_G (x^{i,j} − x̃^G) ▷ Pull to the global best worker
         end if
     end for
until termination
Algorithm 1 LSGD Algorithm (Asynchronous)

Equation 3 describes the update of any given worker and is comprised of the regular gradient step and two corrective forces (in the single-leader setting the third term disappears, as the local and the global leader then coincide). These forces constitute the communication mechanism among the workers and pull all the workers towards the currently best local and global solution to ensure fast convergence. As opposed to EASGD, the updates performed by workers in LSGD break the curse of symmetry and avoid convergence decelerations that result from workers being pulled towards the average, which is inherently influenced by poorly performing workers. In this paper, instead of pulling workers to their averaged parameters, we propose the mechanism of pulling the workers towards the leaders. The flavor of the update resembles a particle swarm optimization approach 488968 , which is not typically used in the context of stochastic gradient optimization for deep learning. Our method may therefore be viewed as a dedicated particle swarm optimization approach for training deep learning models in the stochastic setting and parallel computing environment.

Next we describe the LSGD algorithm in more detail. We rely on the collective communication scheme. In order to reduce the amount of communication between the workers, it is desirable to pull them towards the leaders less often than every iteration. Also, in practice each worker can have a different speed. To prevent waiting for the slower workers and achieve communication efficiency, we implement the algorithm in the asynchronous operation mode. In this case, the communication period is determined based on the total number of iterations computed across all workers, and the communication is performed every $\tau$ or $\tau_G$ iterations, where $\tau$ and $\tau_G$ denote the local and global communication periods, respectively. In practice, we use $\tau_G \geq \tau$ since communication between workers lying in different groups is more expensive than between workers within one group, as explained above. When communication occurs, all workers are updated at the same time (i.e. pulled towards the leaders) in order to take advantage of the collective communication scheme. Between communications, workers run their own local SGD optimizers. The resulting LSGD method is very simple, and is depicted in Algorithm 1.
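
For concreteness, below is a minimal single-process simulation of Algorithm 1 on a toy quadratic; the synchronous loop, the objective, and all hyperparameter values are illustrative assumptions (a real implementation runs workers asynchronously on GPUs and communicates via collective operations):

import numpy as np

rng = np.random.default_rng(1)
dim, n_groups, n_workers = 10, 2, 4      # problem size, n groups of l workers
x_star = rng.standard_normal(dim)        # minimizer of the toy objective

def loss(v):                             # f(v) = 0.5 * ||v - x_star||^2
    return 0.5 * np.sum((v - x_star) ** 2)

def sgrad(v, noise):                     # stochastic gradient with minibatch noise
    return (v - x_star) + noise

eta, lam, lam_g = 0.1, 0.2, 0.1          # learning rate, local/global pull strengths
tau, tau_g = 4, 16                       # local/global communication periods

x = rng.standard_normal((n_groups, n_workers, dim))
for t in range(1, 201):
    for j in range(n_groups):            # "parallel" workers, simulated serially
        for i in range(n_workers):
            x[j, i] -= eta * sgrad(x[j, i], 0.01 * rng.standard_normal(dim))
    if t % tau == 0:                     # pull every worker to its local leader
        for j in range(n_groups):
            lead = min(range(n_workers), key=lambda i: loss(x[j, i]))
            x[j] -= lam * (x[j] - x[j, lead])
    if t % tau_g == 0:                   # pull every worker to the global leader
        flat = x.reshape(-1, dim)
        g = min(range(len(flat)), key=lambda i: loss(flat[i]))
        x -= lam_g * (x - flat[g])

print("best loss:", min(loss(v) for v in x.reshape(-1, dim)))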

The next section provides a theoretical description of the single-leader batch (LGD) and stochastic (LSGD) variants of our approach.

3 Theoretical Analysis

We assume without loss of generality that there is a single leader. The objective function with multiple leaders contains the sum of two quadratic pulling terms, $\frac{\lambda}{2}\|x-\tilde{x}\|^2+\frac{\lambda_G}{2}\|x-\tilde{x}^G\|^2$, which is equivalent (up to an additive constant) to a single pulling term $\frac{\bar{\lambda}}{2}\|x-z\|^2$ for $\bar{\lambda}=\lambda+\lambda_G$ and $z=\frac{\lambda\tilde{x}+\lambda_G\tilde{x}^G}{\lambda+\lambda_G}$. Proofs for this section are deferred to the Supplement.
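
This reduction is standard completing-the-square algebra; for concreteness (our derivation, with $C$ a constant independent of $x$, so that the gradient with respect to $x$ is unchanged):

$$\frac{\lambda}{2}\|x-\tilde{x}\|^2+\frac{\lambda_G}{2}\|x-\tilde{x}^G\|^2=\frac{\lambda+\lambda_G}{2}\left\|x-\frac{\lambda\tilde{x}+\lambda_G\tilde{x}^G}{\lambda+\lambda_G}\right\|^2+C.$$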

3.1 Convergence Rates for Stochastic Strongly Convex Optimization

We first show that LSGD obtains the same convergence rate as SGD for stochastic strongly convex problems BCN2018SIAMREV . In Section 3.3 we discuss how and when LGD can obtain better search directions than gradient descent. We discuss non-convex optimization in Section 3.2. Throughout Section 3.1, $f$ will typically satisfy:

Assumption 1. $f$ is $L$-Lipschitz-differentiable and $\mu$-strongly convex, which is to say, the gradient satisfies $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$, and $f$ satisfies $f(y)\geq f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\|y-x\|^2$. We write $x^*$ for the unique minimizer of $f$, and $\kappa=L/\mu$ for the condition number of $f$.

3.1.1 Convergence Rates

The key technical result is that LSGD satisfies a similar one-step descent in expectation as SGD, with an additional term corresponding to the pull of the leader. To provide a unified analysis of ‘pure’ LSGD as well as more practical variants where the leader is updated infrequently or with errors, we consider a general iteration $x_{t+1}=x_t-\eta\,\hat{\nabla}f(x_t)-\lambda\,(x_t-z_t)$, where $z_t$ is an arbitrary guiding point; that is, $z_t$ may not be the minimizer of $f$ over the workers, nor even satisfy $f(z_t)\leq f(x_t)$. Since the nodes operate independently except when updating $z_t$, we may analyze LSGD steps for each node individually, and we write $x_t$ (dropping the worker index) for brevity.

Theorem 1.

Let $f$ satisfy Assumption 1. Let $\hat{\nabla}f$ be an unbiased estimator for $\nabla f$ with $\mathrm{Var}(\hat{\nabla}f(x_t))\leq\sigma^2$, and let $z_t$ be any point. Suppose that $\eta,\lambda$ satisfy $\eta\leq\frac{1}{2L}$ and $\lambda\leq\frac{\mu}{2L}$. Then the LSGD step satisfies

$$\mathbb{E}[f(x_{t+1})]-f(x^*)\leq\big(1-2\mu\eta(1-\eta L)\big)\big(f(x_t)-f(x^*)\big)-\lambda\big(f(x_t)-f(z_t)\big)+\eta^2 L\sigma^2.$$

Note the presence of the new term $-\lambda\,(f(x_t)-f(z_t))$, which speeds up convergence when $f(z_t)<f(x_t)$, i.e. the leader is better than the current iterate. If the leader is always chosen so that $f(z_t)\leq f(x_t)$ at every step $t$, then with fixed $\eta$ and $\lambda$ we have $\limsup_{t\to\infty}\mathbb{E}[f(x_t)]-f(x^*)=O(\eta\sigma^2)$. If $\eta_t$ decreases at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)]-f(x^*)=O(1/t)$.

3.1.2 Communication Periods

In practice, communication between distributed machines is costly. The LSGD algorithm has a communication period $\tau$ for which the leader is only updated every $\tau$ iterations, so each node can run independently during that period. This period is allowed to differ between nodes, and over time, which captures the asynchronous and multi-leader variants of LSGD. We write $x_{t,k}$ for the $k$-th step during the $t$-th period. It may occur that $f(x_{t,k})<f(z_t)$ for some $k$, that is, the current solution is now better than the last selected leader. In this case, the leader term may no longer be beneficial, and instead simply pulls $x_{t,k}$ toward $z_t$. There is no general way to determine how many steps are taken before this event. However, we can show that if $f(x_{t,k})\leq f(z_t)$, then $\mathbb{E}[f(x_{t,k+1})]\leq f(z_t)+O(\eta^2\sigma^2)$, so the solution will not become worse than a stale leader (up to gradient noise). As $k$ goes to infinity, LSGD converges to the minimizer of $F_{z_t}(x)=f(x)+\frac{\lambda}{2}\|x-z_t\|^2$, which is quantifiably better than $z_t$, as captured in Theorem 2. Together, these facts show that LSGD is safe to use with long communication periods as long as the original leader is good.

Theorem 2.

Let $f$ be $\mu$-strongly convex, and let $x^*$ be the minimizer of $f$. For fixed $z$, define $F_z(x)=f(x)+\frac{\lambda}{2}\|x-z\|^2$. The minimizer $\hat{x}$ of $F_z$ satisfies $\|\hat{x}-x^*\|\leq\frac{\lambda}{\lambda+\mu}\|z-x^*\|$.

In our experiments, we employ another method to avoid this issue. To ensure that the leader is good, we perform an LSGD step only on the first step after a leader update, and then take standard SGD steps for the remainder of the communication period.
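
A sketch of this schedule for a single worker is given below; run_period and sample_grad are hypothetical names, and the sketch only illustrates that the pull toward the leader is applied once, while it is fresh:

def run_period(x, leader, eta, lam, tau, sample_grad):
    """One communication period: LSGD pull on the first step only,
    followed by plain SGD steps while the leader may be stale."""
    x = x - eta * sample_grad(x) - lam * (x - leader)  # leader just refreshed
    for _ in range(tau - 1):
        x = x - eta * sample_grad(x)                   # no pull: avoid a stale leader
    return x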

3.1.3 Stochastic Leader Selection

Next, we consider the impact of selecting the leader with errors. In practice, it is often costly to evaluate $f(x^1),\dots,f(x^l)$ exactly, as in deep learning. Instead, we estimate the values $f(x^i)$, and then select as the leader the variable having the smallest estimate. Formally, suppose that we have an unbiased estimator $\hat{f}(x)$ of $f(x)$, with uniformly bounded variance $\sigma_f^2$. At each step, a single sample $\hat{f}(x^i)$ is drawn from each estimator, and then $\hat{z}=\arg\min_{x\in\{x^1,\dots,x^l\}}\hat{f}(x)$. We refer to this as stochastic leader selection. The stochastic leader satisfies $\mathbb{E}[f(\hat{z})]\leq f(\bar{z})+O(\sqrt{l}\,\sigma_f)$, where $\bar{z}$ is the true leader (see supplementary materials). Thus, the error introduced by the stochastic leader contributes an additive error of at most $O(\lambda\sqrt{l}\,\sigma_f)$ per step. Since this is of order $\lambda$ rather than $\eta\lambda$, we cannot guarantee convergence with decreasing step sizes $\eta_t$ alone (for intuition, note that $\sum_t\lambda$ is divergent for fixed $\lambda$) unless $\lambda_t$ is also decreasing. We have the following result:

Theorem 3.

Let $f$ satisfy Assumption 1, and let $\hat{\nabla}f$ be as in Theorem 1. Suppose we use stochastic leader selection with $\hat{f}$ having variance at most $\sigma_f^2$. If $\eta,\lambda$ are fixed so that $\eta\leq\frac{1}{2L}$ and $\lambda\leq\frac{\mu}{2L}$, then $\limsup_{t\to\infty}\mathbb{E}[f(x_t)]-f(x^*)=O(\eta\sigma^2+\lambda\sigma_f)$. If $\eta_t,\lambda_t$ decrease at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)]-f(x^*)=O(1/t)$.

The communication period and the accuracy of stochastic leader selection are both methods of reducing the cost of updating the leader, and can be substitutes. When the communication period is long, it may be effective to estimate $f(x^1),\dots,f(x^l)$ to higher accuracy, since this can be done independently.
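
As an illustration of the mechanism, stochastic leader selection amounts to drawing one noisy loss estimate per worker and taking the argmin; the estimator estimate_loss (e.g. the loss on a single held-out minibatch) and all names are assumptions of this sketch:

import numpy as np

def stochastic_leader(workers, estimate_loss, rng):
    """Draw one unbiased loss estimate per worker and pick the argmin.
    The selected leader is close to the true leader in expectation, up to
    an error controlled by the estimator's standard deviation."""
    estimates = [estimate_loss(w, rng) for w in workers]
    return workers[int(np.argmin(estimates))]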

3.2 Non-convex Optimization: Stationary Points

As mentioned above, EASGD has the flaw that the EASGD objective function can have stationary points $(x^1,\dots,x^l,\tilde{x})$ such that none of $x^1,\dots,x^l,\tilde{x}$ is a stationary point of the underlying function $f$. LSGD does not have this issue.

Theorem 4.

Let $x^1,\dots,x^l$ be points where $x^i$ is the unique minimizer of $f$ among $\{x^1,\dots,x^l\}$ (i.e. $x^i$ is the leader). If $(x^1,\dots,x^l)$ is a stationary point of the LSGD objective function, then $\nabla f(x^i)=0$.

Moreover, it can be shown that for the deterministic algorithm LGD with any choice of communication periods, there will always be some variable $x^i$ such that $\liminf_{t\to\infty}\|\nabla f(x^i_t)\|=0$.

Theorem 5.

Assume that $f$ is bounded below and $L$-Lipschitz-differentiable, and that the LGD step sizes are selected so that $0<\inf_t\eta_t\leq\sup_t\eta_t<\frac{1}{L}$. Then for any choice of communication periods, it holds that for every $i$ such that $x^i$ is the leader infinitely often, $\liminf_{t\to\infty}\|\nabla f(x^i_t)\|=0$.

3.3 Search Direction Improvement from Leader Selection

In this section, we discuss how LGD can obtain better search directions than gradient descent. In general, it is difficult to determine when the LGD step will produce a lower objective value than the gradient descent step, since this depends on the precise combination of $x$, $z$, $\eta$, and $\lambda$, and moreover, the maximum allowable value of $\eta$ is different for LGD and gradient descent. Instead, we measure the goodness of a search direction by the angle it forms with the Newton direction $d_N=-(\nabla^2 f(x))^{-1}\nabla f(x)$. The Newton method is locally quadratically convergent around local minimizers with non-singular Hessian, and converges in a single step for quadratic functions if $\eta=1$. Hence, we consider it desirable to have search directions that are close to $d_N$. Let $\theta(d_1,d_2)$ denote the angle between $d_1$ and $d_2$. Let $d_L(z)=-\nabla f(x)-\lambda(x-z)$ be the LGD direction with leader $z$, and $d_G=-\nabla f(x)$ the gradient direction. The angle improvement set $I(x)$ is the set of leaders $z$ with $\theta(d_L(z),d_N)<\theta(d_G,d_N)$. The set of candidate leaders is $C(x)=\{z:f(z)<f(x)\}$. We aim to show that a large subset of leaders in $C(x)$ belong to $I(x)$.

In this section, we consider the positive definite quadratic $f(x)=\frac{1}{2}x^\top Ax$ with condition number $\kappa$ and minimizer $x^*=0$. The first result shows that as $\lambda$ becomes sufficiently small, at least half of $C(x)$ improves the angle.

Theorem 6.

Let $x$ be any point such that $d_G$ is not parallel to $d_N$. Then $\lim_{\lambda\to 0^+}\frac{\mathrm{vol}(I(x)\cap C(x))}{\mathrm{vol}(C(x))}\geq\frac{1}{2}$. (Note that $d_L(z)\to d_G$ as $\lambda\to 0^+$ for every $z$, so the limit is well-defined.)
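
A numerical illustration of this statement is straightforward (the quadratic, the sampling of candidate leaders, and all constants below are our illustrative assumptions, not the construction from the proof):

import numpy as np

rng = np.random.default_rng(2)
A = np.diag(np.linspace(1.0, 50.0, 5))       # positive definite quadratic, kappa = 50

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

x = rng.standard_normal(5)
d_newton = -x                                 # Newton direction for f(x) = 0.5 x^T A x
d_grad = -A @ x                               # gradient direction
lam, improved, trials = 0.1, 0, 1000
for _ in range(trials):
    z = rng.standard_normal(5)
    # Rescale z so that f(z) < f(x), i.e. z is a valid candidate leader.
    z *= rng.uniform(0.0, 1.0) * np.sqrt((x @ A @ x) / (z @ A @ z))
    d_lgd = d_grad - lam * (x - z)            # LGD direction with leader z
    improved += angle(d_lgd, d_newton) < angle(d_grad, d_newton)
print(improved / trials)                      # fraction of leaders improving the angle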

Next, we consider when $\theta(d_G,d_N)$ is large. We show that points with a large angle between $d_G$ and $d_N$ exist, which are most suitable for improvement by LGD. For $\theta_0>0$, define $S(\theta_0)=\{x:\theta(d_G(x),d_N(x))\geq\theta_0\}$. It can be shown that $S(\theta_0)$ is nonempty for all $\theta_0$ up to a maximum angle determined by the condition number $\kappa$. We show that for $x\in S(\theta_0)$ for a certain range of $\theta_0$, $I(x)$ is at least half of $C(x)$ for any choice of $\lambda$.

Theorem 7.

Let $x\in S(\theta_0)$. If $\theta_0$ is sufficiently large relative to the condition number $\kappa$, then for any choice of $\lambda>0$, $\frac{\mathrm{vol}(I(x)\cap C(x))}{\mathrm{vol}(C(x))}\geq\frac{1}{2}$.

4 Experimental Results

4.1 Experimental setup

Figure 2: CNN on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 9 in the Supplement.

In this section we compare the performance of LSGD with state-of-the-art methods for parallel training of deep networks, such as EASGD and DOWNPOUR (their pseudo-codes can be found in EASGD ), as well as the sequential technique SGD. We use a communication period of 1 for DOWNPOUR in all our experiments, as this is the typical setting used for this method, ensuring stable convergence. The experiments were performed using the CIFAR-10 data set CIFAR on three benchmark architectures: the CNN used in the original EASGD paper (see Section 5.1 in EASGD ) that we refer to as CNN, VGG Simonyan15 , and ResNet He2016DeepRL ; and the ImageNet (ILSVRC) data set imagenet_cvpr09 on ResNet.

Figure 3: VGG on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 11 in the Supplement.

During training, we select the leader for the LSGD method based on the average of the training loss computed over the most recent data batches (the window length is set separately for CIFAR-10 and ImageNet). At testing, we report the performance of the center variable for EASGD and LSGD, where for LSGD the center variable is computed as the average of the parameters of all workers. We use weight decay with the same decay coefficient for all methods. In our experiments we use either a single group of workers (single-leader LSGD setting) or multiple groups of workers (multi-leader LSGD setting).
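
The bookkeeping behind this leader ranking can be as simple as the sketch below (a hypothetical helper; the window length and the plain-average reduction are assumptions):

from collections import deque

class RollingLoss:
    """Average of the training loss over the last `window` batches;
    the worker with the smallest value is selected as the leader."""
    def __init__(self, window):
        self.losses = deque(maxlen=window)
    def update(self, batch_loss):
        self.losses.append(float(batch_loss))
    def value(self):
        return sum(self.losses) / max(len(self.losses), 1)

# e.g.: leader = min(range(n_workers), key=lambda i: trackers[i].value())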

We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors, where each local worker corresponds to one GPU processor. We use CUDA Toolkit 10.0 (https://developer.nvidia.com/cuda-zone) and NCCL 2 (https://developer.nvidia.com/nccl). We have developed a software package based on PyTorch for distributed training, which will be released (details are elaborated in Section 9.4).

Data processing and prefetching are discussed in the Supplement. A summary of the hyperparameters explored for each method is also provided in the Supplement. We use a constant learning rate for CNN and a learning-rate drop (we divide the learning rate by a constant factor when we observe saturation of the optimizer) for VGG and the ResNet models.

4.2 Experimental Results

In Figure 2 we report results obtained with CNN on CIFAR-10. We run EASGD and LSGD with the same communication period; in the multi-leader LSGD case we additionally use a (longer) global communication period. We experiment with both the single-leader and the multi-leader numbers of workers. Our method consistently outperforms the competitors in terms of convergence speed (it is markedly faster than EASGD in the smaller configuration), and in the larger configuration it also obtains a smaller error.

Figure 4: ResNet on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 10 in the Supplement.
Figure 5: ResNet on CIFAR-10. The identity of the worker that is recognized as the leader (i.e. its rank) versus iterations (on the left) and the number of times each worker was the leader (on the right).
Figure 6: ResNet on ImageNet. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 12 in the Supplement.

In Figure 3 we demonstrate results for VGG on CIFAR-10 (the communication period and the number of workers are reported with the hyperparameters in the Supplement). LSGD converges marginally faster than EASGD and recovers the same error. At the same time, it significantly outperforms DOWNPOUR in terms of convergence speed and obtains a slightly better solution.

The experimental results obtained using ResNet on CIFAR-10, for the same setting of communication period and number of workers as in the case of CNN, are shown in Figure 4. In the smaller configuration we converge comparably fast to EASGD but recover a better test error. For this experiment, in Figure 5 we show the switching pattern between the leaders, indicating that LSGD indeed takes advantage of all workers when exploring the landscape. In the larger configuration we converge considerably faster than EASGD and obtain a significantly smaller error. In this and the CNN experiment, LSGD (as well as EASGD) is consistently better than DOWNPOUR and SGD, as expected.

Finally, in Figure 6 we report the empirical results for ResNet run on ImageNet (the number of workers and the communication period are reported with the hyperparameters in the Supplement). In this experiment our algorithm behaves comparably to EASGD but converges much faster than DOWNPOUR.

5 Conclusion

In this paper we propose a new algorithm called LSGD for distributed optimization in non-convex settings. Our approach relies on pulling workers to the current best performer among them, rather than their average, at each iteration. We justify replacing the average by the leader both theoretically and through empirical demonstrations. We provide a thorough theoretical analysis, including proof of convergence, of our algorithm. Finally, we apply our approach to the matrix completion problem and training deep learning models and demonstrate that it is well-suited to these learning settings.

References

  • [1] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.
  • [4] J. Weston, S. Chopra, and K. Adams. #tagspace: Semantic embeddings from hashtags. In EMNLP, 2014.
  • [5] U. Wickramasinghe and A. Lumsdaine. A survey of methods for collective communication optimization and tuning. CoRR, abs/1611.06334, 2016.
  • [6] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
  • [7] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. CoRR, abs/1802.09941, 2018.
  • [8] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch, and domain parallelism in training neural networks. Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 77–86, 2018.
  • [9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
  • [10] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 2018.
  • [11] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.
  • [12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • [13] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Finding flatter minima with SGD. In ICLR Workshop Track, 2018.
  • [14] S. L. Smith and Q. V. Le. A bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.
  • [15] S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, 2018.
  • [16] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32k for imagenet training. In ICLR, 2018.
  • [17] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • [18] X. Li, J. Lu, R. Arora, J. Haupt, H. Liu, Z. Wang, and T. Zhao. Symmetry, saddle points, and global optimization landscape of nonconvex matrix factorization. IEEE Transactions on Information Theory, PP:1–1, 03 2019.
  • [19] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML, 2017.
  • [20] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
  • [21] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853–884, 2017.
  • [22] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, 2016.
  • [23] V. Badrinarayanan, B. Mishra, and R. Cipolla. Understanding symmetries in deep networks. CoRR, abs/1511.01029, 2015.
  • [24] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
  • [25] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In ICML, 2018.
  • [26] K. Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.
  • [27] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • [28] P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, and A. Talwalkar. Parle: parallelizing stochastic gradient descent. In SysML, 2018.
  • [29] J. Kennedy and R. Eberhart. Particle swarm optimization. In ICNN, 1995.
  • [30] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • [31] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009.
  • [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

6 LGD versus EAGD: Illustrative Example

Figure 7: Left: Trajectories of the workers during optimization. The dashed lines represent the local minima. The red and blue circles are the start and end points of each trajectory, respectively. Right: The value of the objective function for each worker during training.

We consider a one-dimensional non-convex optimization problem with several local minima (depicted by the dashed lines in Figure 7). Both methods use three workers started from the same three initial points. The communication period and the learning rate are identical for EAGD and LGD; each method additionally uses its own pulling coefficient.

Table 1 captures the optima obtained by the different methods.

Optimizer    Best objective value
EAGD         -0.0912
LGD          -0.2172
Table 1: Optimum recovered by EAGD and LGD.

Figure 7 captures the optimization trajectories of EAGD and LGD algorithms. Clearly, EAGD suffers from the averaging policy, whereas LGD is able to recover a solution close to the global optimum.

7 Proofs of Theoretical Results

We provide omitted proofs from the main text.

7.1 Definitions and Notation

Recall that the objective function of Leader (Stochastic) Gradient Descent (L(S)GD) is defined as

$$\min_{x^1,\dots,x^l}\;\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big]+\frac{\lambda}{2}\big\|x^i-\tilde{x}\big\|^2\Big), \tag{4}$$

where $\tilde{x}=\arg\min_{x\in\{x^1,\dots,x^l\}}f(x)$. An L(S)GD step is a (stochastic) gradient step applied to (4). Writing $\tilde{x}_t$ for the leader at a particular iteration $t$, the update in the variable $x^i$ is

$$x^i_{t+1}=x^i_t-\eta\,\hat{\nabla}f(x^i_t)-\lambda\,(x^i_t-\tilde{x}_t).$$

Observe that this reduces to a (S)GD step for the variable $x^i$ which is the leader.

Practical variants of the algorithm do not communicate the updated leader at every iteration. Thus, in our analysis, we will generally take $z$ to be an arbitrary guiding point, which is not necessarily the minimizer of $f$ over the workers, nor even satisfies $f(z)\leq f(x^i)$ for all $i$. The required properties of $z$ will be specified on a result-by-result basis.

When discussing the optimization landscape of LSGD, the term ‘LSGD objective function’ will refer to (4) with $\tilde{x}$ defined as the argmin.

Communication periods are sequences of steps where the leader is not updated. We introduce the notation $x_{t,k}$ for the $k$-th step in the $t$-th period, where the leader is updated only at the beginning of each period. We write $m_t$ for the number of steps taken during the $t$-th period. The standard LSGD defined above has $m_t=1$ for all $t$, in which case $x_{t,k}$ reduces to $x_t$. In addition, let $z_t=\tilde{x}_{t,0}$, the leader for the $t$-th period.

7.2 Stationary Points of EASGD

The EASGD EASGD objective function is defined as

$$\min_{x^1,\dots,x^l,\tilde{x}}\;\sum_{i=1}^{l}\Big(\mathbb{E}\big[f(x^i;\xi^i)\big]+\frac{\rho}{2}\big\|x^i-\tilde{x}\big\|^2\Big). \tag{5}$$

Observe that unlike LSGD, $\tilde{x}$ is a decision variable of EASGD. A stationary point of EASGD is a point $(x^1,\dots,x^l,\tilde{x})$ such that $\nabla F_{\mathrm{EASGD}}(x^1,\dots,x^l,\tilde{x})=0$.

Proposition 8.

There exists a Lipschitz-differentiable function $f$ such that for every $\rho>0$, there exists a point which is a stationary point of EASGD with parameter $\rho$, but none of $x^1,\dots,x^l,\tilde{x}$ is a stationary point of $f$.

Proof.

Define $f$ piecewise, linear outside a bounded interval and given by a sixth-degree polynomial $p$ on the interval in between. For $f$ to be Lipschitz differentiable, we will select $p$ to make $f$ twice continuously differentiable, with bounded second derivative. To make $f$ twice continuously differentiable, we must match the values, first derivatives, and second derivatives of the pieces at the interval endpoints. Since we aim to have $f'\neq 0$ everywhere, we also will require $p'\neq 0$ on the interval. The existence of such a $p$ is equivalent to the solvability of a linear system, which is easily checked to be invertible. Thus, we deduce that such a function exists.

It remains to show that for any $\rho>0$, there exists a stationary point of EASGD. Set $l=2$ and $\tilde{x}=0$. The first-order condition for $x^1$ yields $f'(x^1)+\rho x^1=0$. Since $f'$ is bounded, we have $f'(x)+\rho x\to\pm\infty$ as $x\to\pm\infty$. For large $x$, $f'(x)+\rho x$ is an increasing function, so we deduce that there exists a solution $x^1$ with $f'(x^1)\neq 0$. By symmetry, $x^2=-x^1$ satisfies $f'(x^2)+\rho x^2=0$, since $f'(-x)=-f'(x)$ for the constructed $f$. The first-order condition for $\tilde{x}$, namely $\rho\big((\tilde{x}-x^1)+(\tilde{x}-x^2)\big)=0$, holds as well. Hence, $(x^1,x^2,\tilde{x})$ is a stationary point of EASGD, but none of $x^1,x^2,\tilde{x}$ are stationary points of $f$. ∎

7.3 Technical Preliminaries

Recall the statement of Assumption 1:

Assumption 1

$f$ is $L$-Lipschitz-differentiable and $\mu$-strongly convex, which is to say, the gradient satisfies $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$ for all $x,y$, and $f$ satisfies

$$f(y)\geq f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\|y-x\|^2.$$

We write $x^*$ for the unique minimizer of $f$, and $\kappa=L/\mu$ for the condition number of $f$.

We will frequently use the following standard result.

Lemma 9.

If $f$ is $L$-Lipschitz-differentiable, then for all $x,y$: $f(y)\leq f(x)+\nabla f(x)^\top(y-x)+\frac{L}{2}\|y-x\|^2$.

Proof.

See (BCN2018SIAMREV, eq. (4.3)). ∎

Lemma 10.

Let $f$ be $\mu$-strongly convex, and let $x^*$ be the minimizer of $f$. Then

$$\|\nabla f(x)\|^2\geq 2\mu\big(f(x)-f(x^*)\big) \tag{6}$$

and

$$f(x)-f(x^*)\geq\frac{\mu}{2}\|x-x^*\|^2. \tag{7}$$
Proof.

Equation 6 is the well-known Polyak-Łojasiewicz inequality. Equation 7 follows from the definition of strong convexity, and $\nabla f(x^*)=0$. ∎

Lemma 11.

Let $f$ be $L$-Lipschitz-differentiable. If the gradient descent step size satisfies $\eta\leq\frac{1}{L}$, then $f(x^+)\leq f(x)-\frac{\eta}{2}\|\nabla f(x)\|^2$, where $x^+=x-\eta\nabla f(x)$.

Proof.

By Lemma 9, $f(x^+)\leq f(x)-\eta\|\nabla f(x)\|^2+\frac{L\eta^2}{2}\|\nabla f(x)\|^2$. Since $\eta\leq\frac{1}{L}$, we have $-\eta+\frac{L\eta^2}{2}\leq-\frac{\eta}{2}$. Rearranging yields the desired result. ∎

7.4 Proofs from Section 3.1.1

Lemma 12 (One-Step Descent).

Let $f$ satisfy Assumption 1. Let $\hat{\nabla}f$ be an unbiased estimator for $\nabla f$ with $\mathrm{Var}(\hat{\nabla}f(x))\leq\sigma^2$. Let $x$ be the current iterate, and let $z$ be another point, with $x^+=x-\eta\hat{\nabla}f(x)-\lambda(x-z)$. The LSGD step satisfies:

$$\mathbb{E}[f(x^+)]\leq f(x)-\eta(1-\eta L)\|\nabla f(x)\|^2-\lambda\big(f(x)-f(z)\big)-\Big(\frac{\lambda\mu}{2}-L\lambda^2\Big)\|x-z\|^2+\eta^2 L\sigma^2, \tag{8}$$

where the expectation is with respect to $\hat{\nabla}f$, and conditioned on the current point $x$. Hence, for sufficiently small $\eta,\lambda$ with $\eta\leq\frac{1}{2L}$ and $\lambda\leq\frac{\mu}{2L}$,

$$\mathbb{E}[f(x^+)]-f(x^*)\leq\big(1-2\mu\eta(1-\eta L)\big)\big(f(x)-f(x^*)\big)-\lambda\big(f(x)-f(z)\big)+\eta^2 L\sigma^2. \tag{9}$$
Proof.

The proof is similar to the convergence analysis of SGD. We apply Lemma 9 to obtain

$$f(x^+)\leq f(x)+\nabla f(x)^\top(x^+-x)+\frac{L}{2}\|x^+-x\|^2.$$

Taking the expectation and using $\mathbb{E}[\hat{\nabla}f(x)]=\nabla f(x)$,

$$\mathbb{E}[f(x^+)]\leq f(x)-\eta\|\nabla f(x)\|^2-\lambda\nabla f(x)^\top(x-z)+\frac{L}{2}\mathbb{E}\big\|\eta\hat{\nabla}f(x)+\lambda(x-z)\big\|^2.$$

Using the definition of $\mu$-strong convexity, we have $f(z)\geq f(x)+\nabla f(x)^\top(z-x)+\frac{\mu}{2}\|z-x\|^2$, from which we deduce that $-\lambda\nabla f(x)^\top(x-z)\leq-\lambda\big(f(x)-f(z)\big)-\frac{\lambda\mu}{2}\|x-z\|^2$. For the remaining term, the standard bias-variance decomposition yields $\mathbb{E}\|\hat{\nabla}f(x)\|^2\leq\|\nabla f(x)\|^2+\sigma^2$, and applying the inequality $\|a+b\|^2\leq 2\|a\|^2+2\|b\|^2$ we obtain

$$\frac{L}{2}\mathbb{E}\big\|\eta\hat{\nabla}f(x)+\lambda(x-z)\big\|^2\leq L\eta^2\big(\|\nabla f(x)\|^2+\sigma^2\big)+L\lambda^2\|x-z\|^2.$$

Combining these inequalities yields (8). To obtain (9), note that $\eta\leq\frac{1}{2L}$ and the Polyak-Łojasiewicz inequality (Lemma 10) give $-\eta(1-\eta L)\|\nabla f(x)\|^2\leq-2\mu\eta(1-\eta L)\big(f(x)-f(x^*)\big)$, while $\lambda\leq\frac{\mu}{2L}$ makes the coefficient of $\|x-z\|^2$ non-positive. ∎

Theorem 13.

Let $f$ satisfy Assumption 1. Suppose that the leader is always chosen so that $f(z_t)\leq f(x_t)$. If $\eta,\lambda$ are fixed so that $\eta\leq\frac{1}{2L}$ and $\lambda\leq\frac{\mu}{2L}$, then $\limsup_{t\to\infty}\mathbb{E}[f(x_t)]-f(x^*)=O(\eta\sigma^2)$. If $\eta_t$ decreases at the rate $O(1/t)$, then $\mathbb{E}[f(x_t)]-f(x^*)=O(1/t)$.

Proof.

This result follows from (9) and Theorems 4.6 and 4.7 of BCN2018SIAMREV . ∎

7.5 Proofs from Section 3.1.2

Theorem 14.

Let $f$ satisfy Assumption 1. Suppose that $\eta,\lambda$ are small enough that $\eta\leq\frac{1}{2L}$ and $\lambda\leq\frac{\mu}{2L}$. If $f(x_{t,k})\leq f(z_t)$, then