Introduction
Consider the distributed training of deep neural networks over multiple workers [Dean et al. 2012], where the workers can access all or part of the training data and aim to find a common model that yields the minimum training loss. Such a scenario can be modeled as the following distributed parallel nonconvex optimization problem:
(1) \min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x), \quad \text{with } f_i(x) := \mathbb{E}_{\zeta_i \sim \mathcal{D}_i}[F_i(x; \zeta_i)]
where N is the number of nodes/workers and each f_i is a smooth nonconvex function that can possibly differ across workers i. Following the standard stochastic optimization setting, this paper assumes each worker i can locally observe unbiased, independent stochastic gradients (around the last iterate x_t^i) given by G_t^i = \nabla F_i(x_t^i; \zeta_t^i) with \mathbb{E}[G_t^i \mid \mathcal{F}_t] = \nabla f_i(x_t^i), where \mathcal{F}_t denotes all the randomness up to iteration t.
One classical parallel method for solving problem (1) is to sample each worker's local stochastic gradient in parallel, aggregate all gradients at a single server to obtain the average, and update each worker's local solution using the averaged gradient in its SGD step [Dekel et al. 2012] [Li et al. 2014].^{1} ^{1}Equivalently, we can let the server update its solution using the averaged gradient and broadcast this solution to all local workers. Another equivalent implementation is to let each worker take a single SGD step using its own gradient and send the updated local solution to the server; the server then averages all workers' updated solutions and refreshes each worker's local solution with the averaged version. Such a classical method, called parallel minibatch SGD in this paper, is conceptually equivalent to single-node Stochastic Gradient Descent (SGD) with a batch size N times larger, and achieves O(1/\sqrt{NT}) convergence with a linear speedup with respect to (w.r.t.) the number of workers [Dekel et al. 2012]. Since every iteration of parallel minibatch SGD requires an exchange of local gradient information among all workers, the corresponding communication cost is quite heavy and often becomes the performance bottleneck.

There have been many attempts to reduce the communication overhead of parallel minibatch SGD. One notable method, called decentralized parallel SGD (D-PSGD), is studied in [Lian et al. 2017] [Jiang et al. 2017] [Lian et al. 2018]. Remarkably, D-PSGD can achieve the same convergence rate as parallel minibatch SGD, i.e., the linear speedup w.r.t. the number of workers is preserved, without requiring a single server to collect stochastic gradient information from local workers. However, since D-PSGD requires each worker to exchange local solutions/gradients with its neighbors at every iteration, its total number of communication rounds is the same as that of parallel minibatch SGD. Another notable way to reduce the communication overhead of parallel minibatch SGD is to let each worker communicate compressed gradients rather than raw gradients. For example, quantized SGD, studied in [Seide et al. 2014] [Alistarh et al. 2017] [Wen et al. 2017], and sparsified SGD, studied in [Strom 2015] [Dryden et al. 2016] [Aji and Heafield 2017], allow each worker to pass low-bit quantized or sparsified gradients to the server at every iteration at the cost of a mild sacrifice in convergence. Similarly to D-PSGD, such gradient-compression-based methods require message passing at every iteration, so their total number of communication rounds is still the same as that of parallel minibatch SGD.
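As a concrete illustration, the parallel minibatch SGD procedure described above can be sketched on a toy problem, with N simulated workers that each hold a local quadratic objective and a server that averages their stochastic gradients at every iteration. The quadratic setup and all names below are illustrative assumptions of ours, not from the paper:

```python
import numpy as np

# Toy simulation of parallel minibatch SGD: worker i holds the local
# objective f_i(x) = 0.5 * ||x - c_i||^2, so the global minimizer of
# f = (1/N) * sum_i f_i is the mean of the centers c_i.
def parallel_minibatch_sgd(centers, x0, lr=0.1, steps=500, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Each worker samples an unbiased stochastic gradient at the common
        # iterate x; the server averages them (one communication round per
        # iteration) and every worker applies the same SGD step.
        grads = [(x - c) + noise * rng.standard_normal(x.shape) for c in centers]
        x -= lr * np.mean(grads, axis=0)
    return x

centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 3.0])]
x_final = parallel_minibatch_sgd(centers, x0=np.zeros(2))
# x_final approaches the mean of the centers, here roughly (1.0, 4/3).
```

Note that every one of the `steps` iterations costs a full communication round; this is exactly the cost that model averaging reduces.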
Recall that parallel minibatch SGD can be equivalently interpreted as a procedure where, at each iteration, each local worker first takes a single SGD step and then replaces its own solution with the average of the individual solutions. Motivated by reducing the number of inter-node communication rounds, many works suggest reducing the frequency of averaging the individual solutions in parallel minibatch SGD. This method is known as model averaging and has been widely used in the practical training of deep neural networks. Model averaging dates back at least to [Zinkevich et al. 2010] [McDonald, Hall, and Mann 2010], where individual models are averaged only at the last iteration, before which all workers simply run SGD in parallel. The method in [Zinkevich et al. 2010] [McDonald, Hall, and Mann 2010], referred to as one-shot averaging, uses only a single communication step at the end and is numerically shown to have good solution quality in many applications. However, it is unclear whether one-shot averaging can preserve the linear speedup w.r.t. the number of workers. In fact, [Zhang et al. 2016] shows that one-shot averaging can yield inaccurate solutions for certain nonconvex optimization problems. As a remedy, [Zhang et al. 2016] suggests that more frequent averaging should be used to improve performance. However, the understanding of how the averaging frequency affects the performance of parallel SGD is quite limited in the current literature. Work [Zhou and Cong 2017] proves that, by averaging local worker solutions only every I iterations, parallel SGD still converges for nonconvex optimization, but at a rate that degrades with I.^{2} That is, the convergence slows down by a multiplicative factor in exchange for saving a factor of I in inter-node communication. ^{2}In this paper, we shall show that if I is chosen as O(T^{1/4}/N^{3/4}), parallel SGD for nonconvex optimization does not lose any factor in its convergence rate.
A recent exciting result reported in [Stich 2018] proves that, for strongly convex minimization, model averaging can achieve a linear speedup w.r.t. N as long as the averaging (communication) step is performed at least once every O(\sqrt{T/N}) iterations. Work [Stich 2018] provides the first theoretical analysis demonstrating the possibility of achieving the same linear speedup attained by parallel minibatch SGD with strictly less communication for strongly convex stochastic optimization. However, it remains an open question in [Stich 2018] whether it is possible to achieve O(1/\sqrt{NT}) convergence with less communication for nonconvex optimization, which is the case of deep learning.
On the other hand, many experimental works [Povey, Zhang, and Khudanpur 2015] [Chen and Huo 2016] [McMahan et al. 2017] [Su, Chen, and Xu 2018] [Kamp et al. 2018] [Lin, Stich, and Jaggi 2018] observe that model averaging can achieve superior performance in various deep learning applications. One may be curious whether these positive experimental results are merely coincidences on special examples or can be attained universally. In this paper, we shall show that model averaging can indeed achieve O(1/\sqrt{NT}) convergence for nonconvex optimization while averaging only every I = O(T^{1/4}/N^{3/4}) iterations. That is, the same convergence rate is preserved for nonconvex optimization while the communication overhead is reduced by a factor of O(T^{1/4}/N^{3/4}). To our knowledge, this paper is the first^{3} to present provable convergence rate guarantees (with the linear speedup w.r.t. the number of workers and less communication) of model averaging for nonconvex optimization such as deep learning, and to provide guidelines on how often averaging is needed without losing the linear speedup. ^{3}After the preprint [Yu, Yang, and Zhu 2018] of this paper was posted on arXiv in July 2018, another work [Wang and Joshi 2018] subsequently analyzed the convergence rate of model averaging for nonconvex optimization. Their independent analysis relaxes our bounded second moment assumption but further assumes that all f_i in formulation (1) are identical, i.e., all workers must access a common training set when training deep neural networks.

Besides reducing the communication cost, model averaging also has the advantage of reducing privacy and security risks in the federated learning scenario recently proposed by Google in [McMahan et al. 2017]. This is because model averaging only passes deep learning models, which are shown to preserve good differential privacy, and does not pass raw data or gradients owned by each individual worker.
Parallel Restarted SGD and Its Performance Analysis
Throughout this paper, we assume problem (1) satisfies the following assumption.
Assumption 1.
Smoothness: Each function f_i is smooth with modulus L.
Bounded variances and second moments: there exist constants \sigma > 0 and G > 0 such that
(2) \mathbb{E}_{\zeta_i}\|\nabla F_i(x; \zeta_i) - \nabla f_i(x)\|^2 \le \sigma^2, \quad \forall x, \ \forall i,
(3) \mathbb{E}_{\zeta_i}\|\nabla F_i(x; \zeta_i)\|^2 \le G^2, \quad \forall x, \ \forall i.
Consider the simple parallel SGD described in Algorithm 1. If we divide the iteration indices into epochs of length I, then in each epoch all workers run SGD in parallel from the same initial point, which is the average of the final individual solutions of the previous epoch. This is why we call Algorithm 1 "Parallel Restarted SGD". The "model averaging" technique used as a common practice for training deep neural networks can be viewed as a special case, since Algorithm 1 computes the model average every I iterations and performs local SGD steps at each worker otherwise. Such an algorithm is different from elastic averaging SGD (EASGD) proposed in [Zhang, Choromanska, and LeCun 2015], which periodically drags each local solution towards the average using a controlled weight. Note that synchronization (of iterations) across workers is not necessary inside each epoch of Algorithm 1. Furthermore, inter-node communication is only needed to compute the initial point at the beginning of each epoch and is no longer needed inside each epoch. As a consequence, Algorithm 1 with I > 1 reduces the number of communication rounds by a factor of I compared with the classical parallel minibatch SGD. The linear speedup property (w.r.t. the number of workers) with I > 1 has recently been proven only for strongly convex optimization in [Stich 2018]. However, there is no theoretical guarantee on whether the linear speedup with I > 1 can be preserved for nonconvex optimization, which is the case of deep neural networks.

Fixing an iteration index t, we define
(4) \bar{x}_t := \frac{1}{N} \sum_{i=1}^{N} x_t^i
as the average of the local solutions x_t^i over all N nodes. It is immediate that
(5) \bar{x}_{t+1} = \bar{x}_t - \gamma \frac{1}{N} \sum_{i=1}^{N} G_t^i
Inspired by earlier works on distributed stochastic optimization [Zhang, Wainwright, and Duchi 2012] [Lian et al. 2017] [Mania et al. 2017] [Stich 2018], where the convergence analysis is performed for an aggregated version of the individual solutions, this paper focuses on the convergence rate analysis of \bar{x}_t defined in (4). An interesting observation from (5) is that the workers in Algorithm 1 run their local SGD independently for most iterations, yet they still jointly update their node average \bar{x}_t using a dynamic similar to SGD. The main issue in (5) is that an "inaccurate" stochastic gradient, namely a simple average of individual stochastic gradients evaluated at points different from \bar{x}_t, is used. Since each worker in Algorithm 1 periodically restarts its SGD from the same initial point, the deviation between each local solution x_t^i and \bar{x}_t can be controlled by selecting a proper synchronization interval I. The following useful lemma relates the quantity \mathbb{E}\|\bar{x}_t - x_t^i\|^2 to the algorithm parameter I. A similar lemma is proven in [Stich 2018].
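To make the epoch structure concrete, the following is a minimal single-process simulation of parallel restarted SGD on toy quadratics, where worker i minimizes 0.5 * ||x - c_i||^2 from noisy gradients; the setup and names are illustrative assumptions of ours, not the paper's experiments. Communication happens only once per epoch, when the node average is formed:

```python
import numpy as np

# Sketch of parallel restarted SGD (model averaging every I iterations):
# each epoch restarts all workers from the current node average, runs I
# local SGD steps with no communication, then averages the local models.
def parallel_restarted_sgd(centers, x0, lr=0.05, epochs=100, I=10,
                           noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    N = len(centers)
    x_bar = np.array(x0, dtype=float)
    for _ in range(epochs):
        locals_ = [x_bar.copy() for _ in range(N)]  # restart from average
        for _ in range(I):  # I local SGD steps, no communication
            for i, c in enumerate(centers):
                g = (locals_[i] - c) + noise * rng.standard_normal(x_bar.shape)
                locals_[i] -= lr * g
        x_bar = np.mean(locals_, axis=0)  # one communication round per epoch
    return x_bar

centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 3.0])]
x_bar = parallel_restarted_sgd(centers, x0=np.zeros(2))
```

With `epochs * I` total iterations, this scheme uses `epochs` communication rounds instead of `epochs * I`, while the node average still drifts toward the global minimizer (the mean of the centers).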
Lemma 1.
Proof.
Theorem 1.
Proof.
Fix t \ge 1. By the smoothness of f, we have
(7) f(\bar{x}_{t+1}) \le f(\bar{x}_t) + \langle \nabla f(\bar{x}_t), \bar{x}_{t+1} - \bar{x}_t \rangle + \frac{L}{2}\|\bar{x}_{t+1} - \bar{x}_t\|^2
Note that
(8) 
where (a) follows from (5); (b) follows by noting that each G_t^i has conditional mean \nabla f_i(x_t^i) and applying the basic identity \mathbb{E}\|Z\|^2 = \mathbb{E}\|Z - \mathbb{E}[Z]\|^2 + \|\mathbb{E}[Z]\|^2, which holds for any random vector Z; (c) follows because each G_t^i - \nabla f_i(x_t^i) has zero mean and is independent across nodes; and (d) follows from Assumption 1.
We further note that
(9) 
where (a) follows from (5); (b) follows because
where the first equality follows from the iterated law of expectations, the second equality follows because x_t^i is determined by \mathcal{F}_t, and the third equality follows from \mathbb{E}[G_t^i \mid \mathcal{F}_t] = \nabla f_i(x_t^i); and (c) follows from the basic identity \langle u, v \rangle = \frac{1}{2}(\|u\|^2 + \|v\|^2 - \|u - v\|^2), which holds for any two vectors u, v of the same length.
The next corollary follows by substituting suitable values into Theorem 1.
Corollary 1.
Remark 1.
For nonconvex optimization, it is generally impossible to develop a convergence rate for objective values. In Theorem 1 and Corollary 1, we follow the convention in the literature [Ghadimi and Lan 2013] [Lian et al. 2017] [Alistarh et al. 2017] and use the (average) expected squared gradient norm \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 to characterize the convergence rate. Note that this average can be attained in expectation by outputting \bar{x}_t with t chosen from \{1, \ldots, T\} with equal probability 1/T.
Linear Speedup: By part (1) of Corollary 1, Algorithm 1 with any fixed constant I has convergence rate O(1/\sqrt{NT}) + O(N/T). If T is large enough, i.e., T \ge N^3, then the O(N/T) term is dominated by the O(1/\sqrt{NT}) term and hence Algorithm 1 has convergence rate O(1/\sqrt{NT}). That is, our algorithm achieves a linear speedup with respect to the number of workers. Such a linear speedup for stochastic nonconvex optimization was previously attained by decentralized parallel stochastic gradient descent (D-PSGD) in [Lian et al. 2017], which requires a more stringent condition on T; see, e.g., Corollary 2 in [Lian et al. 2017].^{4} ^{4}In fact, for the ring network considered in Theorem 3 in [Lian et al. 2017], D-PSGD requires an even larger T since its implementation depends on the network topology. In contrast, the linear speedup of our algorithm is irrelevant to the network topology.

Communication Reduction: Note that Algorithm 1 requires inter-node communication only at iterations that are multiples of I. By Corollary 1, it suffices to choose any I up to O(T^{1/4}/N^{3/4}) to ensure the O(1/\sqrt{NT}) convergence of our algorithm. That is, compared with parallel minibatch SGD or the D-PSGD in [Lian et al. 2017], the number of communication rounds in our algorithm can be reduced by a factor of O(T^{1/4}/N^{3/4}). Although Algorithm 1 does not describe how the node average is obtained at each node, in practice the simplest way is to introduce a parameter server that collects all local solutions and broadcasts their average, as in parallel minibatch SGD [Li et al. 2014]. Alternatively, we can perform an all-reduce operation on the local models (without introducing a server) such that all nodes obtain \bar{x}_t independently and simultaneously. (Using an all-reduce operation among all nodes to obtain averaged gradients has been previously suggested in [Goyal et al. 2017] for distributed training of deep learning.)
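A server-free way to form the node average is an all-reduce. The following toy single-process simulation of a ring-style all-reduce (a naive variant that circulates each node's original contribution around the ring, rather than the bandwidth-optimal reduce-scatter form) shows that every node independently ends up with the same average; in a real deployment this would be an MPI or NCCL collective:

```python
import numpy as np

def ring_allreduce_mean(models):
    """Simulate a ring all-reduce: after N-1 shift-and-add steps,
    every node holds the average of all models, with no central server."""
    N = len(models)
    acc = [m.copy() for m in models]  # each node's running sum
    for step in range(N - 1):
        # At each step, node i receives one more original contribution
        # from around the ring and adds it to its running sum.
        for i in range(N):
            acc[i] = acc[i] + models[(i - step - 1) % N]
    return [a / N for a in acc]  # identical result at every node

models = [np.full(3, float(i)) for i in range(4)]
averaged = ring_allreduce_mean(models)
# every simulated node obtains [1.5, 1.5, 1.5]
```

The design point is that the averaging step of Algorithm 1 only requires this collective once per epoch, not once per iteration.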
Extensions
Using TimeVarying Learning Rates
Note that Corollary 1 assumes the time horizon T is known and uses a constant learning rate in Algorithm 1. In this subsection, we consider the scenario where the time horizon is not known beforehand and develop a variant of Algorithm 1 with time-varying learning rates that achieves the same computation and communication complexity. Compared with Algorithm 1, Algorithm 2 has the advantage that its accuracy improves automatically as it runs longer.
(13) 
Although Algorithm 2 introduces the concept of epochs for the convenience of description, we note that it is nothing but a parallel restarted SGD where each worker restarts itself every epoch using the node average of the last epoch's final solutions as the initial point. If we sequentially re-index the iterations across epochs, then Algorithm 2 is mathematically equivalent to Algorithm 1, except that time-varying learning rates are used in different epochs. Similarly to (4), we can define the node average \bar{x}_t and obtain
(14) \bar{x}_{t+1} = \bar{x}_t - \gamma_s \frac{1}{N} \sum_{i=1}^{N} G_t^i
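The epoch-wise restart with a per-epoch learning rate can be sketched as follows, on toy quadratics where worker i minimizes 0.5 * ||x - c_i||^2 from noisy gradients; the setup is an illustrative assumption of ours, and the 1/sqrt(s) decay is a plausible choice, not necessarily the exact schedule analyzed in Theorem 2:

```python
import numpy as np

# Parallel restarted SGD with an epoch-wise decaying learning rate:
# epoch s uses gamma_s = gamma0 / sqrt(s), held constant within the epoch,
# so no time horizon T needs to be fixed in advance.
def restarted_sgd_decaying_lr(centers, x0, epochs=200, I=5, gamma0=0.2,
                              noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x_bar = np.array(x0, dtype=float)
    for s in range(1, epochs + 1):
        gamma = gamma0 / np.sqrt(s)       # per-epoch rate
        locals_ = [x_bar.copy() for _ in centers]
        for _ in range(I):                # I local steps per epoch
            for i, c in enumerate(centers):
                g = (locals_[i] - c) + noise * rng.standard_normal(x_bar.shape)
                locals_[i] -= gamma * g
        x_bar = np.mean(locals_, axis=0)  # restart point for the next epoch
    return x_bar

centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 3.0])]
x_bar = restarted_sgd_decaying_lr(centers, x0=np.zeros(2))
```

Because the rate shrinks over epochs, the accuracy of the node average keeps improving the longer the loop runs, mirroring the stated advantage of Algorithm 2.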
Theorem 2.
Proof.
See Supplement. ∎
Asynchronous Implementations in Heterogeneous Networks
Algorithm 1 requires all workers to compute the average of the individual solutions every I iterations, and synchronization among the local workers is not needed before averaging. However, the fastest worker still needs to wait until all other workers finish their I iterations of SGD, even if it finishes its own I SGD iterations much earlier. (See Figure 1 for a two-worker example where one worker is significantly faster than the other. Note that the orange "syn" rectangles represent the procedure of computing the node average.) As a consequence, the computation capability of faster workers is wasted. Such an issue can arise quite often in heterogeneous networks where nodes are equipped with different hardware.
Intuitively, if one worker finishes its I iterations of local SGD earlier, then to avoid wasting its computation capability we might want to let it continue running local SGD until all other workers finish their I iterations of local SGD. However, such a method can drag the node average too far towards the local solution of the fastest worker. Note that if the functions f_i in (1) are significantly different from each other, so that the minimizer of the fastest worker's local function deviates too much from the true minimizer of problem (1), then dragging the node average towards the fastest worker's local solution is undesirable. In this subsection, we further assume that problem (1) satisfies the following assumption:
Assumption 2.
The distributions \mathcal{D}_i in the definition of the f_i in (1) are identical.
Note that Assumption 2 is satisfied if all local workers can access a common training data set, or if each local training data set is obtained by uniform sampling from the global training set. Consider the restarted local SGD for heterogeneous networks described in Algorithm 3. Note that if I_i = I for all workers i and some fixed constant I, then Algorithm 3 degrades to Algorithm 1.
(15) 
In practice, if the hardware configurations or measurements (from previous experiments) of each local worker are known, we can predetermine the value of each I_i; e.g., if worker i is two times faster than worker j, then I_i = 2 I_j. Alternatively, under a more practical implementation, we can set a fixed time duration for each epoch and let each local worker keep running its local SGD until the given time elapses. By doing so, within the same time duration, the faster a worker is, the more SGD iterations it runs. In contrast, if we applied Algorithm 1 in this setting, all local workers would have to run the same number of SGD iterations as can be run by the slowest worker within the given time interval. This subsection shows that, under Assumption 2, Algorithm 3 can achieve better performance than Algorithm 1 in heterogeneous networks where some workers are much faster than others.
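The heterogeneous scheme above can be sketched as follows under Assumption 2: every worker samples gradients of the same toy objective 0.5 * ||x - c||^2, faster workers run more local steps per epoch, and all solutions are averaged at the epoch boundary. The speeds, step counts, and names are illustrative assumptions of ours:

```python
import numpy as np

# Sketch of restarted local SGD for heterogeneous workers: worker i runs
# I_i local SGD steps per epoch (proportional to its speed), then all
# local solutions are averaged. All workers share one objective, matching
# the identical-distribution assumption.
def heterogeneous_restarted_sgd(c, iters_per_worker, x0, epochs=100,
                                lr=0.05, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x_bar = np.array(x0, dtype=float)
    for _ in range(epochs):
        locals_ = []
        for I_i in iters_per_worker:   # e.g. [20, 10, 5]: decreasing speed
            x = x_bar.copy()
            for _ in range(I_i):       # faster workers take more steps
                x -= lr * ((x - c) + noise * rng.standard_normal(x.shape))
            locals_.append(x)
        x_bar = np.mean(locals_, axis=0)  # epoch-boundary averaging
    return x_bar

c = np.array([1.0, 2.0])
x_bar = heterogeneous_restarted_sgd(c, iters_per_worker=[20, 10, 5],
                                    x0=np.zeros(2))
```

Since every worker's local objective shares the minimizer c, letting fast workers take extra steps pulls the average toward the solution rather than biasing it, which is exactly why Assumption 2 is needed here.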
Without loss of generality, this subsection always indexes the local workers in decreasing order of speed. That is, worker 1 is the fastest and worker N is the slowest. If we run Algorithm 3 by specifying a fixed wall-clock time duration for each epoch, during which each local worker keeps running its local SGD, then we have I_1 \ge I_2 \ge \cdots \ge I_N. Fix an epoch index; note that for each worker i, the local variables at iterations beyond its own I_i are never used. However, for the convenience of analysis, we define
Conceptually, the above equation can be interpreted as assuming that worker i, which is slower than worker 1, runs extra iterations of SGD using 0 as an imaginary stochastic gradient (with no computation cost). See Figure 2 for a two-worker example. Using this definition, we have
Theorem 3.
Consider problem (1) under Assumptions 1 and 2. Suppose all workers are indexed in decreasing order of speed, i.e., worker 1 is the fastest and worker N is the slowest. Under a suitable choice of the learning rate in Algorithm 3, for all T \ge 1,
(16) 
where, for each given t, N_t is the largest integer in \{1, \ldots, N\} such that all workers i \le N_t are still performing true stochastic gradient updates at iteration t (that is, for each fixed t, N_t is the number of workers that are still using sampled true stochastic gradients to update their local solutions at iteration t); and f^* is the minimum value of problem (1).
Proof.
See Supplement. ∎
The next corollary shows that Algorithm 3 in heterogeneous networks can ensure convergence and preserve the same convergence rate with the same communication reduction.
Corollary 2.
Remark 2.
Note that once the values I_i are known, the quantities N_t in Theorem 3 and Corollary 2 are also available by definition. To appreciate the implication of Theorem 3, recall that Algorithm 1 can be interpreted as a special case of Algorithm 3 with I_i = I_N for all i, i.e., all workers run only the same number (determined by the slowest worker) of SGD iterations in each epoch. From this perspective, Theorem 1 (with I = I_N) implies that the performance of Algorithm 1 is given by