1 Introduction
Stochastic gradient descent (SGD) is the backbone of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications. Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved via accelerated SGD methods. Due to the massive training datasets and neural network architectures used today, it has become imperative to design distributed SGD implementations, where gradient computation and aggregation are parallelized across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdowns and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability.
Need to Optimize Convergence in terms of Error versus Wall-clock Time. The convergence speed of distributed SGD is a product of two factors: 1) the error in the trained model versus the number of iterations, and 2) the number of iterations completed per second. Traditional single-node SGD analysis focuses on optimizing the first factor, because the second factor is generally a constant when SGD is run on a single dedicated server. In distributed SGD, which is often run on shared cloud infrastructure, the second factor depends on several aspects such as the number of worker nodes, their local computation and communication delays, and the protocol (synchronous, asynchronous, or periodic) used to aggregate their gradients. Hence, in order to achieve the fastest convergence speed we need: 1) optimization techniques (e.g., variable learning rate) to maximize the error-convergence rate with respect to iterations, and 2) scheduling techniques (e.g., straggler mitigation, infrequent communication) to maximize the number of iterations completed per second. These directions are interdependent and need to be explored together rather than in isolation. While many works have advanced the first direction, the second is less explored from a theoretical point of view, and the juxtaposition of both is an unexplored problem.
Local-Update SGD to Reduce Communication Delays. A popular distributed SGD implementation is the parameter server framework Dean et al. (2012); Cui et al. (2014); Li et al. (2014); Gupta et al. (2016); Mitliagkas et al. (2016), where in each iteration, worker nodes compute gradients on one mini-batch of data and a central parameter server aggregates these gradients (synchronously or asynchronously) and updates the parameter vector. The constant communication between the parameter server and worker nodes in each iteration can be expensive and slow in bandwidth-limited computing environments. Recently proposed distributed SGD frameworks such as Elastic Averaging Zhang et al. (2015); Chaudhari et al. (2017), Federated Learning McMahan et al. (2016); Smith et al. (2017b) and decentralized SGD Lian et al. (2017); Jiang et al. (2017) save this communication cost by allowing worker nodes to perform local updates to the parameters instead of just computing gradients. The resulting locally trained models (which differ due to variability in training data across nodes) are periodically averaged through a central server, or via direct inter-worker communication. Periodic averaging has been shown to offer significant speedup in deep neural network training Moritz et al. (2015); Zhang et al. (2016); Su & Chen (2015); Zhou & Cong (2017); Lin et al. (2018).

Error-Runtime Tradeoffs in Local-Update SGD. While local updates reduce the communication delay incurred per iteration, discrepancies between the models can result in inferior error convergence. For example, consider the case of periodic-averaging SGD where each of $m$ worker nodes makes local updates, and the resulting models are averaged after every $\tau$ iterations. A larger value of $\tau$ leads to slower convergence with respect to the number of iterations, as illustrated in Figure 1. However, if we look at the true convergence with respect to the wall-clock time, then a larger $\tau$, that is, less frequent averaging, saves communication delay and reduces the runtime per iteration. While some recent theoretical works Zhou & Cong (2017); Yu et al. (2018); Wang & Joshi (2018); Stich (2018) study the dependence of the error convergence on the number of iterations as $\tau$ varies, achieving a provably optimal speedup in the true convergence with respect to wall-clock time is an open problem that we aim to address in this work.
Need for Adaptive Communication Strategies. In the error-runtime tradeoff in Figure 1, we observe a tradeoff between the convergence speed and the error floor as the number of local updates $\tau$ is varied. A larger $\tau$ gives a faster initial drop in the training loss but results in a higher error floor. This calls for adaptive communication strategies that start with a larger $\tau$ and gradually decrease it as the model approaches convergence. Such an adaptive strategy offers a win-win in the error-runtime tradeoff by achieving fast convergence as well as a low error floor. To the best of our knowledge, this is the first work to propose an adaptive communication frequency strategy.
Main Contributions. In this paper we consider periodic-averaging distributed SGD (PASGD), where each worker node performs $\tau$ local updates by processing one mini-batch of data per iteration. A fusion node takes a simple average of these local models, and the worker nodes then start from the averaged model and perform the next $\tau$ local updates. The main contributions are as follows:

We provide the first runtime analysis of local-update SGD algorithms by modeling local computing times and communication delays as random variables, and quantifying the runtime speedup in comparison with fully synchronous SGD. A novel insight from this analysis is that the periodic-averaging strategy not only reduces the communication delay but also mitigates stragglers.

Combining the runtime analysis with previous error-convergence analyses of PASGD, we obtain the error-runtime tradeoff for different values of $\tau$. Using this combined error-runtime tradeoff, we derive an expression for the optimal communication period, which can serve as a useful guideline in practice.

We present a convergence analysis for PASGD with variable communication period $\tau$ and variable learning rate $\eta$, generalizing previous works Zhou & Cong (2017); Wang & Joshi (2018). This analysis shows that decaying $\tau$ provides convergence benefits similar to those of decaying the learning rate, the difference being that varying $\tau$ improves the true convergence with respect to wall-clock time. Adaptive communication can also be used in conjunction with existing learning rate schedules.

Based on the observations from the runtime and convergence analyses, we develop an adaptive communication scheme: AdaComm. Experiments on training the VGG-16 and ResNet-50 deep neural networks in different settings (with/without momentum, fixed/decaying learning rate) show that AdaComm gives a runtime speedup while still reaching the same low training loss as fully synchronous SGD.
Although we focus on periodic simple-averaging of local models, the insights on error-runtime tradeoffs and adaptive communication strategies extend directly to other communication-efficient SGD algorithms including Federated Learning McMahan et al. (2016), Elastic Averaging Zhang et al. (2015) and decentralized averaging Jiang et al. (2017); Lian et al. (2017), as well as synchronous/asynchronous distributed SGD with a central parameter server Dean et al. (2012); Cui et al. (2014); Dutta et al. (2018).
2 Problem Framework
Empirical Risk Minimization via Mini-batch SGD. Our objective is to minimize the empirical risk function $F(\mathbf{x})$ with respect to the model parameters, denoted by $\mathbf{x}$. The training dataset is denoted by $\mathcal{S} = \{s_1, s_2, \dots, s_N\}$, where $s_i$ represents the $i$-th labeled data point. The objective function can be expressed as the empirical risk calculated using the training data and is given by

$$F(\mathbf{x}) := \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{x}; s_i), \qquad (1)$$

where $f(\mathbf{x}; s_i)$ is the composite loss function at the $i$-th data point. In classic mini-batch stochastic gradient descent (SGD) Dekel et al. (2012), updates to the parameter vector are performed as follows. If $\xi_k \subset \mathcal{S}$ represents a randomly sampled mini-batch, then the update rule is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\, g(\mathbf{x}_k; \xi_k), \qquad (2)$$

where $\eta$ denotes the learning rate and the stochastic gradient is defined as $g(\mathbf{x}_k; \xi_k) := \frac{1}{|\xi_k|} \sum_{s_i \in \xi_k} \nabla f(\mathbf{x}_k; s_i)$. For simplicity, we will use $g(\mathbf{x}_k)$ instead of $g(\mathbf{x}_k; \xi_k)$ in the rest of the paper. A complete review of convergence properties of serial SGD can be found in Bottou et al. (2018).
Periodic-Averaging SGD (PASGD). We consider a distributed SGD framework with $m$ worker nodes, where all workers can communicate with the others via a central server or via direct inter-worker communication. In periodic-averaging SGD, all workers start at the same initial point $\mathbf{x}_0$. Each worker performs $\tau$ local mini-batch SGD updates according to (2), and the local models are then averaged by a fusion node or by performing an all-node broadcast. The workers then update their local models with the averaged model, as illustrated in Figure 2. Thus, the overall update rule at the $i$-th worker is given by

$$\mathbf{x}_{k+1}^{(i)} = \begin{cases} \frac{1}{m} \sum_{j=1}^{m} \left[ \mathbf{x}_k^{(j)} - \eta\, g(\mathbf{x}_k^{(j)}) \right], & (k+1) \bmod \tau = 0, \\ \mathbf{x}_k^{(i)} - \eta\, g(\mathbf{x}_k^{(i)}), & \text{otherwise}, \end{cases} \qquad (3)$$

where $\mathbf{x}_k^{(i)}$ denotes the model parameters at the $i$-th worker after $k$ iterations, and $\tau$ is defined as the communication period. Note that the iteration index $k$ corresponds to the local iterations, not the number of averaging steps.
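As a concrete illustration of update rule (3), the sketch below simulates PASGD on a toy quadratic objective with noisy gradients. The objective, dimensions, and hyperparameter values are illustrative assumptions for this example, not the paper's experimental setup.

```python
import numpy as np

def pasgd(m=4, tau=10, eta=0.05, rounds=50, dim=5, noise=0.1, seed=0):
    """Periodic-averaging SGD on a toy objective F(x) = 0.5*||x||^2,
    where the stochastic gradient is the true gradient x plus noise."""
    rng = np.random.default_rng(seed)
    x = np.ones((m, dim))          # all m workers start at the same point x_0
    for _ in range(rounds):
        for _ in range(tau):       # tau local mini-batch SGD steps per worker
            grads = x + noise * rng.standard_normal((m, dim))
            x -= eta * grads
        x[:] = x.mean(axis=0)      # all-node averaging every tau iterations
    return x[0]

x_final = pasgd()
print(np.linalg.norm(x_final))    # gradient norm shrinks toward the noise floor
```

Setting `tau=1` in this sketch recovers fully synchronous SGD, while larger `tau` trades averaging frequency for (in a real system) lower communication cost.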
Special Case ($\tau = 1$): Fully Synchronous SGD. When $\tau = 1$, that is, when the local models are synchronized after every iteration, periodic-averaging SGD is equivalent to fully synchronous SGD, which has the update rule

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{m} \sum_{i=1}^{m} g(\mathbf{x}_k; \xi_k^{(i)}). \qquad (4)$$

The analysis of fully synchronous SGD is identical to that of serial SGD with an $m$-fold larger mini-batch size.
Local Computation Times and Communication Delay. In order to analyze the effect of $\tau$ on the expected runtime per iteration, we consider the following delay model. The time taken by the $i$-th worker to compute a mini-batch gradient at the $k$-th local step is modeled as a random variable $Y_{i,k} \sim F_Y$, assumed to be i.i.d. across workers and mini-batches. The communication delay is a random variable $D$ for each all-node broadcast, as illustrated in Figure 3. The value of the random variable $D$ can depend on the number of workers $m$ as follows:

$$D = \delta \cdot \phi(m), \qquad (5)$$

where $\delta$ represents the time taken for each inter-node communication, and $\phi(m)$ describes how the delay scales with the number of workers, which depends on the implementation and system characteristics. For example, in the parameter server framework, the communication delay can be made proportional to $\log m$ by exploiting a reduction tree structure Iandola et al. (2016). We assume that $\phi(m)$ is known beforehand for the communication-efficient distributed SGD framework under consideration.
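The delay model (5) can be coded with a user-supplied scaling function; the logarithmic default below corresponds to the reduction-tree example, and the function and parameter names are illustrative.

```python
import math

def comm_delay(m, delta=1.0, scaling=None):
    """Communication delay D = delta * phi(m) for one all-node broadcast.

    delta: time for one inter-node communication.
    scaling: phi(m); defaults to the reduction-tree case phi(m) = log2(m).
    """
    phi = scaling or (lambda n: math.log2(n))
    return delta * phi(m)

print(comm_delay(8))                            # log2 scaling: 3.0
print(comm_delay(8, scaling=lambda n: n - 1))   # linear scaling: 7.0
```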
Convergence Criteria. In the error-convergence analysis, since the objective function is non-convex, we use the expected gradient norm as an indicator of convergence, following Ghadimi & Lan (2013); Bottou et al. (2018). We say the algorithm achieves an $\epsilon$-suboptimal solution if

$$\mathbb{E}\left[ \min_{k \in \{1, \dots, K\}} \left\| \nabla F(\mathbf{x}_k) \right\|^2 \right] \leq \epsilon. \qquad (6)$$

When $\epsilon$ is arbitrarily small, this condition guarantees that the algorithm converges to a stationary point.
3 Jointly Analyzing Runtime and ErrorConvergence
3.1 Runtime Analysis
We now present a comparison of the runtime per iteration of periodic-averaging SGD with that of fully synchronous SGD, to illustrate how increasing $\tau$ can lead to a large runtime speedup. Another interesting effect of performing more local updates is that it mitigates the slowdown due to straggling worker nodes.
Runtime Per Iteration of Fully Synchronous SGD. Fully synchronous SGD is equivalent to periodic-averaging SGD with $\tau = 1$. Each of the $m$ workers computes the gradient of one mini-batch and updates the parameter vector, which takes time $Y_i$ at the $i$-th worker. (Instead of local updates, typical implementations of fully synchronous SGD have a central server that performs the update; here we compare PASGD with fully synchronous SGD without a central parameter server.) After all workers finish their local updates, an all-node broadcast is performed to synchronize and average the models. Thus, the total time to complete each iteration is given by

$$T_{\text{sync}} = \max(Y_1, Y_2, \dots, Y_m) + D, \qquad (7)$$
$$\mathbb{E}[T_{\text{sync}}] = \mathbb{E}[Y_{m:m}] + \mathbb{E}[D], \qquad (8)$$

where $Y_1, Y_2, \dots, Y_m$ are i.i.d. random variables with probability distribution $F_Y$, and $D$ is the communication delay. The term $Y_{m:m} = \max(Y_1, \dots, Y_m)$ denotes the highest order statistic of $m$ i.i.d. random variables David & Nagaraja (2003).

Runtime Per Iteration of Periodic-Averaging SGD (PASGD). In periodic-averaging SGD, each worker performs $\tau$ local updates before communicating with other workers. Let us denote the average local computation time at the $i$-th worker by

$$\bar{Y}_i = \frac{1}{\tau} \sum_{k=1}^{\tau} Y_{i,k}. \qquad (9)$$
Since the communication delay $D$ is amortized over $\tau$ iterations, the average computation time per iteration is

$$T_{\text{PASGD}} = \frac{1}{\tau} \left[ \max_{i} \sum_{k=1}^{\tau} Y_{i,k} + D \right], \qquad (10)$$
$$\mathbb{E}[T_{\text{PASGD}}] = \mathbb{E}[\bar{Y}_{m:m}] + \frac{\mathbb{E}[D]}{\tau}, \qquad (11)$$

where $\bar{Y}_{m:m} = \max(\bar{Y}_1, \dots, \bar{Y}_m)$.
The value of the first term $\mathbb{E}[\bar{Y}_{m:m}]$, and how it compares with $\mathbb{E}[Y_{m:m}]$, depends on the probability distribution of $Y$. We can obtain the following distribution-independent bound on the runtime of PASGD that depends only on the mean and the variance of $Y$.

[Upper Bound on the Runtime per Iteration] Suppose each worker takes time $Y_{i,k}$ to compute gradients and perform a local update, i.i.d. across workers and mini-batches, with mean $\mathbb{E}[Y] = \mu$ and variance $\operatorname{Var}[Y] = \sigma^2$. Then,

$$\mathbb{E}[T_{\text{PASGD}}] \leq \mu + \frac{\sigma (m-1)}{\sqrt{\tau (2m-1)}} + \frac{\mathbb{E}[D]}{\tau}. \qquad (12)$$
The proof follows from the bound on expected order statistics given by Arnold & Groeneveld (1979). Observe that as we increase $\tau$, the runtime bound (12) decreases in two ways: 1) a $\tau$-fold reduction in the communication delay, and 2) a reduction in the variance of the maximum of the local computation times, that is, a reduction in the additional delay due to slow or straggling workers.
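The distribution-independent bound (12) can be checked numerically. The sketch below compares a Monte Carlo estimate of the PASGD runtime per iteration against the bound, assuming (for illustration only) exponential local computation times and a constant communication delay.

```python
import numpy as np

def runtime_bound(mu, sigma, m, tau, d):
    """Upper bound (12): mu + sigma*(m-1)/sqrt(tau*(2m-1)) + E[D]/tau."""
    return mu + sigma * (m - 1) / np.sqrt(tau * (2 * m - 1)) + d / tau

def runtime_mc(m, tau, d, mu=1.0, trials=20000, seed=0):
    """Monte Carlo E[T_PASGD] with exponential Y (mean mu) and constant D."""
    rng = np.random.default_rng(seed)
    y = rng.exponential(mu, size=(trials, m, tau))
    per_worker_avg = y.mean(axis=2)            # bar{Y}_i averaged over tau steps
    return per_worker_avg.max(axis=1).mean() + d / tau

m, tau, d = 8, 10, 2.0
emp = runtime_mc(m, tau, d)
bnd = runtime_bound(mu=1.0, sigma=1.0, m=m, tau=tau, d=d)
print(emp, bnd)   # the empirical mean runtime stays below the bound
```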
Speedup over Fully Synchronous SGD. We now evaluate the speedup of periodic-averaging SGD over fully synchronous SGD for different $m$ and $\tau$, to demonstrate how the relative value of computation versus communication delays affects the speedup. Consider the simplest case where $Y$ and $D$ are constants, and let $\alpha = D / Y$ denote the communication/computation ratio. Besides systems aspects such as network bandwidth and computing capacity, for deep neural network training this ratio also depends on the size of the neural network model and the mini-batch size. See Figure 8 for a comparison of the communication/computation delays of common deep neural network architectures. In this case, $Y_{m:m}$ and $\bar{Y}_{m:m}$ are both equal to $Y$, and the ratio of $\mathbb{E}[T_{\text{sync}}]$ to $\mathbb{E}[T_{\text{PASGD}}]$ is given by

$$\frac{\mathbb{E}[T_{\text{sync}}]}{\mathbb{E}[T_{\text{PASGD}}]} = \frac{Y + D}{Y + D/\tau} = \frac{1 + \alpha}{1 + \alpha/\tau}. \qquad (13)$$

Figure 4 shows the speedup for different values of $\alpha$ and $\tau$. When $D$ is comparable with $Y$ ($\alpha \approx 1$), periodic-averaging SGD (PASGD) can be almost twice as fast as fully synchronous SGD.
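Equation (13) is easy to tabulate; the snippet below reproduces the limiting behavior noted above, namely a speedup approaching 2 as $\tau$ grows when $\alpha = 1$.

```python
def speedup(alpha, tau):
    """Speedup of PASGD over fully synchronous SGD for constant Y and D,
    with alpha = D / Y (communication/computation ratio), per (13)."""
    return (1 + alpha) / (1 + alpha / tau)

for tau in (1, 2, 10, 100):
    print(tau, speedup(alpha=1.0, tau=tau))
# as tau grows, speedup(1.0, tau) approaches (1 + alpha) = 2
```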
Straggler Mitigation due to Local Updates. Suppose that
is exponentially distributed with mean
and variance . For fully synchronous SGD, the term in (8) is equal to , which is approximately equal to . Thus, the expected runtime per iteration of fully synchronous SGD (8) increases logarithmically with the number of workers . Let us compare this with the scaling of the runtime of periodicaveraging SGD (11). Here, (9) is an Erlang random variable with mean and variable . Since the variance is times smaller than that of , the maximum order statistic is smaller than . Figure 5 shows the probability distribution of and for exponentially distributed . Observe that has a much lighter tail. This is because the effect of the variability in on is reduced due to the in (8) being replaced by (which has lower variance) in (11).3.2 Joint Analysis with Errorconvergence
In this subsection, we combine the runtime analysis with a previous error-convergence analysis of PASGD Wang & Joshi (2018). Due to space limitations, we state the necessary theoretical assumptions in the Appendix; the assumptions are similar to previous works Zhou & Cong (2017); Wang & Joshi (2018) on the convergence of local-update SGD algorithms.

[Error-Runtime Convergence of PASGD] For PASGD, under certain assumptions (stated in the Appendix), if the learning rate satisfies $\eta L + \eta^2 L^2 \tau(\tau - 1) \leq 1$ and all workers are initialized at the same point $\mathbf{x}_0$, then after a total wall-clock time $T$, the minimal expected squared gradient norm within the time interval will be bounded by

$$\mathbb{E}\left[ \min_{k} \left\| \nabla F(\mathbf{x}_k) \right\|^2 \right] \leq \frac{2\left[ F(\mathbf{x}_0) - F_{\text{inf}} \right]}{\eta T} \left( Y + \frac{D}{\tau} \right) + \frac{\eta L \sigma^2}{m} + \eta^2 L^2 \sigma^2 (\tau - 1), \qquad (14)$$

where $L$ is the Lipschitz constant of the gradient of the objective function and $\sigma^2$ is the variance bound of the mini-batch stochastic gradients. The proof is presented in the Appendix. From the optimization error upper bound (14), one can readily observe the error-runtime tradeoff for different communication periods. While a larger $\tau$ reduces the runtime per iteration and makes the first term in (14) smaller, it also adds noise and increases the last term. In Figure 6, we plot the theoretical bounds for both fully synchronous SGD ($\tau = 1$) and PASGD. Although PASGD with $\tau > 1$ starts with a rapid drop, it eventually converges to a higher error floor. This theoretical result is also corroborated by the experiments in Section 5. Another direct outcome of this theorem is the determination of the best communication period that balances the first and last terms in (14). We discuss the selection of the communication period in Section 4.1.
4 AdaComm: Proposed Adaptive Communication Strategy
Inspired by the clear tradeoff in the learning curves in Figure 6, it would be better to have an adaptive communication strategy that starts with infrequent communication to improve convergence speed, and then increases the frequency to achieve a low error floor. In this section, we develop such an adaptive communication scheme.
The basic idea is to choose the communication period that minimizes the optimization error at each wall-clock time. One way to achieve this is to switch between the learning curves at their intersections. However, without prior knowledge of the various curves, it would be difficult to determine the switching points.
Instead, we divide the whole training procedure into uniform wall-clock time intervals of the same length $T_0$. At the beginning of each time interval, we select the value of $\tau$ that gives the fastest decay rate over the next $T_0$ of wall-clock time. If the interval length $T_0$ is small enough and the best choice of communication period for each interval can be precisely estimated, then this adaptive scheme should achieve a win-win in the error-runtime tradeoff, as illustrated in Figure 7.

After setting the interval length, the next question is how to estimate the best communication period for each time interval. In Section 4.1 we use the error-runtime analysis of Section 3.2 to find the best $\tau$ for each interval.
4.1 Determining the Best Communication Period for Each Time Interval
From the error-runtime bound in Section 3.2, it can be observed that there is an optimal value $\tau^*$ that minimizes the optimization error bound at a given wall-clock time. In particular, consider the simplest setting where $Y$ and $D$ are constants. Then, minimizing the upper bound (14) over $\tau$, we obtain the following. For PASGD, under the same assumptions as in Section 3.2, the optimization error upper bound (14) at time $T$ is minimized when the communication period is

$$\tau^* = \sqrt{\frac{2\left[ F(\mathbf{x}_0) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T}}. \qquad (15)$$

The proof is straightforward by setting the derivative of (14) with respect to $\tau$ to zero; we present the details in the Appendix. Suppose all workers start from the same initial point $\mathbf{x}_{t_0}$, where the subscript denotes the wall-clock time. Directly applying this result to the first time interval, the best choice of communication period is

$$\tau_0 = \sqrt{\frac{2\left[ F(\mathbf{x}_{t_0}) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T_0}}. \qquad (16)$$

Similarly, for the $l$-th time interval, the workers can be viewed as restarting training from a new initial point $\mathbf{x}_{t_l}$. Applying the same result again, we have

$$\tau_l = \sqrt{\frac{2\left[ F(\mathbf{x}_{t_l}) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T_0}}. \qquad (17)$$
Comparing (17) and (16), it is easy to see that the generated communication period sequence decreases along with the objective value $F(\mathbf{x}_{t_l})$. This result is consistent with the intuition that the tradeoff between error convergence and communication efficiency varies over time. Compared to the initial phase of training, the benefit of using a large communication period diminishes as the model approaches convergence. At this later stage, a lower error floor is preferable to a faster runtime.
Practical SGD implementations generally decay the learning rate or increase the mini-batch size Smith et al. (2017a); Goyal et al. (2017) in order to reduce the variance of the gradient updates. As we saw from the convergence analysis in Section 3.2, performing local updates adds noise to the stochastic gradients, resulting in a higher error floor at the end of training. Decaying the communication period gradually reduces this variance and yields a similar improvement in convergence. Thus, adaptive communication strategies are similar in spirit to decaying the learning rate or increasing the mini-batch size. The key difference is that here we are optimizing the true error convergence with respect to wall-clock time rather than the number of iterations.
4.2 Practical Considerations
Although (17) and (16) provide useful insights about how to adapt $\tau$ over time, it is difficult to use them directly in practice because the Lipschitz constant $L$ and the gradient variance bound $\sigma^2$ are unknown. For deep neural networks, estimating these constants can be difficult and unreliable due to the highly non-convex and high-dimensional loss surface. As an alternative, we propose a simpler rule where we approximate $F_{\text{inf}}$ by $0$, and divide (17) by (16) to obtain the basic communication period update rule:

$$\tau_l = \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil, \qquad (18)$$

where $\lceil \cdot \rceil$ is the ceiling function that rounds up to the nearest integer. Since the objective function values (i.e., the training loss) $F(\mathbf{x}_{t_0})$ and $F(\mathbf{x}_{t_l})$ can be easily obtained during training, the only remaining step is to determine the initial communication period $\tau_0$. We obtain a heuristic estimate of $\tau_0$ by a simple grid search over different $\tau$, running each candidate for one or two epochs.
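The basic update rule (18) then reduces to a one-liner. In the sketch below, `loss0` is the training loss at the start of training and `tau0` the grid-searched initial period; the names and example values are illustrative.

```python
import math

def next_tau(loss, loss0, tau0):
    """Basic AdaComm rule (18): tau_l = ceil(sqrt(F(x_{t_l}) / F(x_{t_0})) * tau0)."""
    return math.ceil(math.sqrt(loss / loss0) * tau0)

# as the training loss decreases, the communication period shrinks
print([next_tau(f, loss0=2.0, tau0=16) for f in (2.0, 1.0, 0.5, 0.1)])
```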
4.3 Refinements to the Proposed Adaptive Strategy
4.3.1 Faster Decay When Training Saturates
The communication period update rule (18) tends to give a decreasing sequence $\{\tau_l\}$. Nonetheless, it is possible that the best value of $\tau$ for the next time interval is larger than the current one due to random noise in the training process. Moreover, when the training loss gets stuck on a plateau and decreases very slowly, (18) will cause $\tau$ to saturate at the same value for a long time. To address this issue, we borrow an idea used in classic SGD, where the learning rate is decayed by a factor when the training loss saturates for several epochs Goyal et al. (2017). Similarly, in our scheme, the communication period is multiplied by a decay factor $\gamma < 1$ whenever the value given by (18) is not strictly less than the current period. To be specific, the communication period for the $l$-th time interval is determined as follows:

$$\tau_l = \begin{cases} \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil, & \text{if } \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil < \tau_{l-1}, \\ \lceil \gamma\, \tau_{l-1} \rceil, & \text{otherwise}. \end{cases} \qquad (19)$$

In the experiments, a fixed value of $\gamma$ turns out to be a good choice. One can obtain a more aggressive decay in $\tau$ by either reducing the value of $\gamma$ or introducing a slack variable in the condition.
4.3.2 Incorporating Adaptive Learning Rate
So far we have considered a fixed learning rate for the local SGD updates at the workers. We now present an adaptive communication strategy that adjusts $\tau$ for a given variable learning rate schedule, in order to obtain the best error-runtime tradeoff. Suppose $\eta_l$ denotes the learning rate for the $l$-th time interval. Then, combining (17) and (16) again, we have

$$\tau_l = \left\lceil \sqrt{\frac{\eta_0^3\, F(\mathbf{x}_{t_l})}{\eta_l^3\, F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil. \qquad (20)$$

Observe that when the learning rate becomes smaller, the communication period increases. This corresponds to the intuition that a small learning rate reduces the discrepancy between the local models, and hence is more tolerant of large communication periods.
Equation (20) states that the communication period should be proportional to $\eta_l^{-3/2}$. However, in practice it is common to decay the learning rate by a large factor after a given number of epochs. This dramatic change in the learning rate may push the communication period to an unreasonably large value. In our experiments with momentum SGD, we observed that when applying (20), the communication period can grow so large that the training loss diverges.
To avoid this issue, we propose the adaptive strategy given by (21) below. This strategy can also be justified by theoretical analysis. Suppose that in the $l$-th time interval, the objective function has a local Lipschitz smoothness $L_l$. Then, using the approximation $\eta_l L_l \approx \eta_0 L_0$, which is common in the SGD literature Balles et al. (2016), we derive the following adaptive strategy:

$$\tau_l = \left\lceil \sqrt{\frac{\eta_0\, F(\mathbf{x}_{t_l})}{\eta_l\, F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil. \qquad (21)$$
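Rules (19) and (21) can be combined into a small controller invoked once per wall-clock interval. The following is a sketch under the paper's approximations, not the authors' reference implementation; the value of `gamma` here is an illustrative assumption.

```python
import math

class AdaCommSchedule:
    """Adaptive communication period: rule (21) proposes a new tau from the
    current loss and learning rate; rule (19) accepts it only if it strictly
    decreases, otherwise decays tau multiplicatively by gamma."""

    def __init__(self, tau0, loss0, eta0, gamma=0.9):
        self.tau0, self.loss0, self.eta0 = tau0, loss0, eta0
        self.gamma = gamma     # multiplicative decay factor (illustrative value)
        self.tau = tau0

    def update(self, loss, eta):
        # rule (21): tau_l = ceil(sqrt(eta0 * F(x_{t_l}) / (eta_l * F(x_{t_0}))) * tau0)
        proposed = math.ceil(
            math.sqrt((self.eta0 * loss) / (eta * self.loss0)) * self.tau0)
        if proposed < self.tau:            # rule (19): accept strict decrease only
            self.tau = proposed
        else:                              # otherwise decay multiplicatively
            self.tau = max(1, math.ceil(self.gamma * self.tau))
        return self.tau

sched = AdaCommSchedule(tau0=16, loss0=2.0, eta0=0.1)
print(sched.update(loss=1.0, eta=0.1))   # loss halved -> smaller tau
print(sched.update(loss=1.0, eta=0.1))   # loss flat -> gamma decay kicks in
```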
Apart from coupling the communication period with the learning rate, when to decay the learning rate is another key design factor. In order to eliminate the noise introduced by local updates, we choose to first gradually decay the communication period to $1$ and then decay the learning rate as usual. For example, if the learning rate is scheduled to be decayed at a certain epoch but the communication period is still larger than $1$ at that time, then we continue to use the current learning rate until $\tau$ reaches $1$.
4.4 Theoretical Guarantees for the Convergence of AdaComm
In this subsection, we provide a convergence guarantee for the proposed adaptive communication scheme by extending the error analysis of PASGD. Without loss of generality, we analyze an arbitrary communication period sequence $\{\tau_0, \tau_1, \dots, \tau_{J-1}\}$, where $J$ represents the total number of communication rounds. (Note that in the error analysis, the subscripts of the communication period and learning rate represent the index of the local update period rather than the index of the length-$T_0$ wall-clock time intervals considered in Sections 4.1-4.3.) It will be shown that a decreasing sequence of $\tau_j$ is beneficial for guaranteeing convergence.

[Convergence of the adaptive communication scheme] For PASGD with adaptive communication period and adaptive learning rate, suppose the learning rate $\eta_j$ remains the same within each local update period. If the following conditions are satisfied as $J \to \infty$:

$$\sum_{j=0}^{J-1} \eta_j \tau_j \to \infty, \qquad \sum_{j=0}^{J-1} \eta_j^2 \tau_j < \infty, \qquad \sum_{j=0}^{J-1} \eta_j^3 \tau_j^2 < \infty, \qquad (22)$$

then the averaged model is guaranteed to converge to a stationary point:

$$\lim_{J \to \infty} \frac{\sum_{j=0}^{J-1} \eta_j \tau_j\, \mathbb{E}\left[ \| \nabla F(\mathbf{u}_j) \|^2 \right]}{\sum_{j=0}^{J-1} \eta_j \tau_j} = 0, \qquad (23)$$
where $\mathbf{u}_j = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_j^{(i)}$ denotes the averaged model at the $j$-th communication round. The proof details and a non-asymptotic result (similar to the bound in Section 3.2 but with variable $\tau$) are provided in the Appendix. In order to understand the meaning of condition (22), let us first consider the case where $\tau_j$ is a constant. In this case, the convergence condition is identical to that of mini-batch SGD Bottou et al. (2018):

$$\sum_{j=0}^{J-1} \eta_j \to \infty, \qquad \sum_{j=0}^{J-1} \eta_j^2 < \infty. \qquad (24)$$

As long as the communication period sequence is bounded, it is trivial to adapt a learning rate scheme satisfying the mini-batch SGD conditions (24) to satisfy (22). In particular, when the communication period sequence is decreasing, the last two terms in (22) become easier to satisfy and place fewer constraints on the learning rate sequence.
5 Experimental Results
5.1 Experimental Setting
Platform. The proposed adaptive communication scheme was implemented in PyTorch Paszke et al. (2017) with Mpi4Py Dalcín et al. (2005). All experiments were conducted on a local cluster where each worker node has an NVIDIA TitanX GPU and a 16-core Intel Xeon CPU.
Dataset. We evaluate our method on image classification tasks on the CIFAR-10 and CIFAR-100 datasets Krizhevsky (2009), which consist of 50,000 training images and 10,000 validation images in 10 and 100 classes, respectively. Each worker machine is assigned a partition of the dataset, which is randomly shuffled after every epoch.
Model. We train the deep neural networks VGG-16 Simonyan & Zisserman (2014) and ResNet-50 He et al. (2016) from scratch. These two neural networks have different architectures and parameter sizes, and thus yield different performance under periodic averaging. As shown in Figure 8, for VGG-16 the communication time is considerably higher than the computation time. Thus, compared to ResNet-50, it requires a larger $\tau$ in order to reduce the runtime per iteration and achieve fast convergence.
Moreover, unless otherwise stated, we use 4 worker nodes with a mini-batch size of 128 per worker; the total mini-batch size per iteration is therefore 512. The initial learning rates for VGG-16 and ResNet-50 are 0.2 and 0.4, respectively, and the weight decay for both networks is 0.0005. In the variable learning rate setting, we decay the learning rate by a fixed factor after a given number of epochs. We set the time interval length $T_0$ to a fixed number of seconds (about 10 epochs for the initial communication period).
Metrics. We compare the performance of the proposed adaptive communication scheme against the following methods with a fixed communication period: (1) baseline: fully synchronous SGD ($\tau = 1$); (2) an extreme high-throughput case with a large fixed $\tau$; (3) a manually tuned case where a moderate value of $\tau$ is selected after trial runs with different communication periods. Instead of training for a fixed number of epochs, we train all methods sufficiently long to converge, and compare the training loss and test accuracy, both of which are recorded every 100 iterations.
5.2 Adaptive Communication in PASGD
We first validate the effectiveness of AdaComm, which uses the communication period update rule (19) combined with (21), on the original PASGD without momentum.
Figure 9 presents the results for VGG-16 for both fixed and variable learning rates. A large communication period initially results in a rapid drop in the error, but the error finally converges to a higher floor. By adapting $\tau$, the proposed AdaComm scheme strikes the best error-runtime tradeoff in all settings. In both Figure 8(a) and Figure 8(b), AdaComm reaches the same training loss as fully synchronous SGD in substantially less wall-clock time.
However, for ResNet-50, the communication overhead is no longer the bottleneck. For a fixed communication period, the negative effect of performing local updates becomes more pronounced and cancels the benefit of the lower communication delay (see Figures 9(b) and 9(c)). It is not surprising that fully synchronous SGD is nearly the best among all fixed-$\tau$ methods in the error-runtime plot. Even in this extreme case, adaptive communication still achieves competitive performance. When combined with learning rate decay, the adaptive scheme is about 1.3 times faster than fully synchronous SGD (see Figure 9(a)).
Table 1 lists the test accuracies in the different settings; we report the best accuracy within a time budget for each setting. The results show that the adaptive communication method has better generalization than fully synchronous SGD. In the variable learning rate case, the adaptive method even gives better test accuracy than PASGD with the best fixed $\tau$.
Model      Method        Fixed lr   Variable lr
VGG-16     Fixed $\tau$  90.5       92.75
           Fixed $\tau$  92.25      92.5
           Fixed $\tau$  92.0       92.4
           AdaComm       91.1       92.85
ResNet-50  Fixed $\tau$  88.76      92.26
           Fixed $\tau$  90.42      92.26
           Fixed $\tau$  88.66      91.8
           AdaComm       89.57      92.42
5.3 Adaptive Communication in Momentum SGD
The adaptive communication scheme was derived from the joint error-runtime analysis of PASGD without momentum. However, it can also be extended to other SGD variants; in this subsection, we show that the proposed method works well for SGD with momentum.
5.3.1 Block Momentum in Periodic Averaging
Before presenting the empirical results, it is worth describing how to introduce momentum in PASGD. The most straightforward way is to apply momentum independently to each local model, where each worker maintains an independent momentum buffer, which is the latest change in its parameter vector. However, this does not account for the potentially dramatic change in the parameters at each averaging step. When the local models are synchronized, the local momentum buffer still contains the update steps from before averaging, resulting in a large momentum term in the first SGD step of the next local update period. When the communication period is large, this large momentum term can sidetrack the SGD descent direction, resulting in slower convergence.
To address this issue, a block momentum scheme was proposed in Chen & Huo (2016) and applied to speech recognition tasks. The basic idea is to treat the accumulated local updates in one period as one big gradient step between two synchronized models, and to introduce a global momentum for this big accumulated step. The update rule can be written as follows in terms of the global momentum buffer $\mathbf{v}_j$:

$$\mathbf{v}_{j+1} = \beta\, \mathbf{v}_j + G_j, \qquad (25)$$
$$\mathbf{x}_{(j+1)\tau} = \mathbf{x}_{j\tau} - \mathbf{v}_{j+1}, \qquad (26)$$

where $G_j = \mathbf{x}_{j\tau} - \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_{(j+1)\tau}^{(i)}$ represents the accumulated update in the $j$-th local update period and $\beta$ denotes the global momentum factor. Moreover, workers can also apply momentum SGD to their local models, but the local momentum buffers are cleared at the beginning of each local update period. That is, we restart momentum SGD on the local models after every averaging step. The same strategy was also suggested in Microsoft's CNTK framework Seide & Agarwal (2016). In our experiments, we set the global and local momentum factors following Lin et al. (2018). In the fully synchronous case, there is no need to introduce block momentum, and we simply follow the common practice for setting the momentum factor.
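A sketch of the block-momentum averaging step (25)-(26): the accumulated local update over one period is treated as a single gradient step with a global momentum buffer. The tensor shapes, worker drift values, and momentum factor below are illustrative assumptions.

```python
import numpy as np

def block_momentum_step(x_sync, x_locals, v, beta=0.5):
    """One averaging step of PASGD with block momentum (after Chen & Huo, 2016).

    x_sync:   last synchronized model x_{j*tau}
    x_locals: list of m local models after tau local updates
    v:        global momentum buffer
    beta:     global momentum factor (illustrative value)
    """
    g = x_sync - np.mean(x_locals, axis=0)   # accumulated update G_j of the period
    v = beta * v + g                         # (25): global momentum update
    x_next = x_sync - v                      # (26): big step from the synced model
    return x_next, v

x = np.zeros(3)
v = np.zeros(3)
locals_ = [x - 0.1, x - 0.2]                 # two workers drifted during the period
x, v = block_momentum_step(x, locals_, v)
print(x)                                     # moved by the averaged local progress
```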
5.3.2 AdaComm plus Block Momentum
We applied our adaptive communication strategy to PASGD with block momentum and observed significant performance gains on CIFAR-10/100 (see Figure 11). In particular, the adaptive communication scheme has the fastest convergence with respect to wall-clock time throughout training. While fully synchronous SGD gets stuck on a plateau before the first learning rate decay, the training loss of the adaptive method continuously decreases until convergence. For both VGG-16 (Figure 10(b)) and ResNet-50 (Figure 10(a)), AdaComm reaches the target training loss in substantially less wall-clock time than fully synchronous SGD.
6 Concluding Remarks
The design of fast, communication-efficient distributed SGD algorithms that are robust to system variability is vital to enable machine learning training to scale to resource-limited computing nodes. This paper is one of the first to analyze the convergence of error with respect to wall-clock time instead of the number of iterations, by accounting for the dependence of the runtime per iteration on system aspects such as computation and communication delays. We present a theoretical analysis of the error-runtime trade-off for periodic averaging SGD (PASGD), where each worker node performs local updates and the models are averaged after every τ iterations. Based on the joint error-runtime analysis, we design the first (to the best of our knowledge) adaptive communication strategy, called AdaComm, for distributed deep learning. Experimental results using VGGNet and ResNet show that the proposed method can achieve a significant improvement in runtime while achieving the same error floor as fully synchronous SGD. Going beyond periodic-averaging SGD, our idea of adapting the frequency of averaging distributed SGD updates can be easily extended to other SGD frameworks, including elastic averaging Zhang et al. (2015), decentralized SGD Lian et al. (2017), and parameter-server-based training Dean et al. (2012).
Acknowledgments
This work was partially supported by the CMU Dean’s fellowship and an IBM Faculty Award. The experiments were conducted on the ORCA cluster provided by the Parallel Data Lab at CMU, and on Amazon AWS (supported by an AWS credit grant).
References
 Arnold & Groeneveld (1979) Arnold, B. C. and Groeneveld, R. A. Bounds on expectations of linear systematic statistics based on dependent samples. The Annals of Statistics, 7(1):220–223, January 1979. doi: 10.1214/aos/1176344567. URL https://doi.org/10.1214/aos/1176344567.
 Balles et al. (2016) Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 Chaudhari et al. (2017) Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., and Oberman, A. Parle: parallelizing stochastic gradient descent. arXiv preprint arXiv:1707.00424, 2017.
 Chen & Huo (2016) Chen, K. and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5880–5884. IEEE, 2016.
 Cui et al. (2014) Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., et al. Exploiting bounded staleness to speed up big data analytics. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 37–48, 2014.
 Dalcín et al. (2005) Dalcín, L., Paz, R., and Storti, M. MPI for python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
 David & Nagaraja (2003) David, H. A. and Nagaraja, H. N. Order statistics. John Wiley, Hoboken, N.J., 2003.
 Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
 Dekel et al. (2012) Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
 Dutta et al. (2018) Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. arXiv preprint arXiv:1803.01113, 2018.
 Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large mini-batch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Gupta et al. (2016) Gupta, S., Zhang, W., and Wang, F. Model accuracy and runtime trade-off in distributed deep learning: A systematic study. In IEEE 16th International Conference on Data Mining (ICDM), pp. 171–180. IEEE, 2016.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Iandola et al. (2016) Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600, 2016.
 Jiang et al. (2017) Jiang, Z., Balu, A., Hegde, C., and Sarkar, S. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pp. 5906–5916, 2017.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. (2014) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014.
 Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5336–5346, 2017.
 Lin et al. (2018) Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Mitliagkas et al. (2016) Mitliagkas, I., Zhang, C., Hadjis, S., and Ré, C. Asynchrony begets momentum, with an application to deep learning. In 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 997–1004. IEEE, 2016.
 Moritz et al. (2015) Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. SparkNet: Training deep networks in spark. arXiv preprint arXiv:1511.06051, 2015.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPSW, 2017.
 Seide & Agarwal (2016) Seide, F. and Agarwal, A. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Smith et al. (2017a) Smith, S. L., Kindermans, P.J., and Le, Q. V. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017a.
 Smith et al. (2017b) Smith, V., Chiang, C.K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. 2017b.
 Stich (2018) Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 Su & Chen (2015) Su, H. and Chen, H. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.
 Wang & Joshi (2018) Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
 Yu et al. (2018) Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD for nonconvex optimization with faster convergence and less communication. arXiv preprint arXiv:1807.06629, 2018.
 Zhang et al. (2016) Zhang, J., De Sa, C., Mitliagkas, I., and Ré, C. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.
 Zhang et al. (2015) Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In NIPS’15 Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 685–693, 2015.
 Zhou & Cong (2017) Zhou, F. and Cong, G. On the convergence properties of a step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.
Appendix A Inefficient Local Updates
It is worth noting an interesting phenomenon in the convergence of periodic averaging SGD (PASGD). When the learning rate is fixed, PASGD with a fine-tuned communication period achieves better test accuracy than both fully synchronous SGD and the adaptive method, while its training loss remains higher than both (see Figure 9, Figure 10). In particular, on the CIFAR-100 dataset, we observe a noticeable improvement in test accuracy at the best fixed communication period τ. To investigate this phenomenon, we evaluate the test accuracy of PASGD at two frequencies: 1) every τ iterations, i.e., just after each averaging step; and 2) at a fixed iteration interval that is not divisible by τ. In the former case, the reported test accuracy always comes from the synchronized/averaged model. In the latter case, the test accuracy can come from either the synchronized/averaged model or from a local model.
From Figure 12, it is clear that the local models' accuracy is much lower than that of the synchronized model, even after the algorithm has converged. Thus, we conjecture that the improvement in test accuracy occurs only at the synchronized model. That is, after averaging, the test accuracy undergoes a rapid increase, but it decreases again during the following local steps due to the noise in stochastic gradients. Such behavior may depend on the geometric structure of the loss surface of the specific neural network. The observation also reveals that the local updates are inefficient, as they reduce the accuracy and make no progress. In this sense, it is necessary for PASGD to reduce the gradient variance by decaying either the learning rate or the communication period.
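The evaluation protocol above can be sketched in code. This is a hypothetical illustration (the function names and toy parameters are ours, not the paper's code): the averaged model is only materialized at synchronization steps, so an evaluation schedule that is not aligned with the communication period τ will sometimes score a local model instead.

```python
import numpy as np

def averaged_model(local_models):
    """Materialize the synchronized model by parameter averaging."""
    return np.mean(local_models, axis=0)

def evaluation_kinds(total_iters, tau, eval_every):
    """For each evaluation point, report whether the scored model is the
    averaged model (iteration divisible by tau) or a single local model."""
    kinds = []
    for it in range(eval_every, total_iters + 1, eval_every):
        kinds.append("averaged" if it % tau == 0 else "local")
    return kinds

# with tau = 4, evaluating every 4 iterations always hits the averaged
# model, while evaluating every 3 iterations mostly hits local models
kinds_aligned = evaluation_kinds(total_iters=12, tau=4, eval_every=4)
kinds_misaligned = evaluation_kinds(total_iters=12, tau=4, eval_every=3)

# averaging two toy "models" coordinate-wise
avg = averaged_model([np.ones(3), 3 * np.ones(3)])
```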
Appendix B Assumptions for Convergence Analysis
The convergence analysis is conducted under the following assumptions, which are similar to those made in previous work on the analysis of PASGD Zhou & Cong (2017); Yu et al. (2018); Wang & Joshi (2018); Stich (2018). In particular, we make no assumptions on the convexity of the objective function. We also remove the uniform bound assumption on the norm of stochastic gradients.
[Lipschitz smooth & lower bound on F] The objective function F(x) is differentiable and L-Lipschitz smooth, i.e., ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖. The function value is bounded below by a scalar F_inf.
[Unbiased estimation] The stochastic gradient g(x) evaluated on a mini-batch is an unbiased estimator of the full-batch gradient: E[g(x)] = ∇F(x).
[Bounded variance] The variance of the stochastic gradient evaluated on a mini-batch is bounded as
E‖g(x) − ∇F(x)‖² ≤ β‖∇F(x)‖² + σ²,
where β and σ² are non-negative constants, inversely proportional to the mini-batch size.
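As a quick numerical illustration of the unbiasedness and variance assumptions above (a toy least-squares problem of our own construction, not from the paper), mini-batch gradients match the full-batch gradient in expectation, and their variance shrinks in inverse proportion to the mini-batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 20000
A = rng.standard_normal((n, d))   # per-sample data
x = rng.standard_normal(d)

# gradient of F(x) = ||Ax||^2 / (2n)
full_grad = A.T @ (A @ x) / n

def minibatch_grad(batch_size):
    """Unbiased mini-batch estimator of full_grad (sampling with replacement)."""
    idx = rng.integers(0, n, size=batch_size)
    B = A[idx]
    return B.T @ (B @ x) / batch_size

def empirical_variance(batch_size, trials=2000):
    gs = np.stack([minibatch_grad(batch_size) for _ in range(trials)])
    return np.mean(np.sum((gs - full_grad) ** 2, axis=1))

v_small = empirical_variance(batch_size=8)
v_large = empirical_variance(batch_size=32)
ratio = v_small / v_large   # should be close to 32/8 = 4

# unbiasedness: the mean of many mini-batch gradients approaches full_grad
mean_gap = np.linalg.norm(
    np.mean(np.stack([minibatch_grad(16) for _ in range(4000)]), axis=0)
    - full_grad)
```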
Appendix C Proof of Theorem 2: Error-runtime Convergence of PASGD
Firstly, let us recall the error analysis of PASGD. We adapt the theorem from Wang & Joshi (2018).
[Error-convergence of PASGD, Wang & Joshi (2018)] For PASGD, under the assumptions in Appendix B, if the learning rate η satisfies ηL + η²L²τ(τ − 1) ≤ 1 and all workers are initialized at the same point x̄₁, then after K iterations, we have
(27) E[(1/K) Σ_{k=1}^{K} ‖∇F(x̄_k)‖²] ≤ 2[F(x̄₁) − F_inf]/(ηK) + ηLσ²/m + η²L²σ²(τ − 1),
where L is the Lipschitz constant of the objective function, σ² is the variance bound of mini-batch stochastic gradients, and x̄_k denotes the averaged model at the k-th iteration. From the runtime analysis in Section 2, we know that the expected runtime per iteration of PASGD is
(28) E[T_iter] = Y + D/τ,
where Y and D denote the expected computation time per iteration and the communication delay per synchronization round, respectively. Accordingly, the total wall-clock time of training for K iterations is
(29) E[T_total] = K(Y + D/τ).
Then, directly substituting K = E[T_total]/(Y + D/τ) into (27), we complete the proof.
Appendix D Proof of Theorem 3: the Best Communication Period
Taking the derivative of the upper bound (14) with respect to the communication period τ, we obtain
(30) −2[F(x̄₁) − F_inf]D/(ηTτ²) + η²L²σ²,
where D and T are the communication delay and the total training time from the runtime model in Section 2. Setting the derivative equal to zero, the communication period is
(31) τ* = sqrt(2[F(x̄₁) − F_inf]D/(η³L²σ²T)).
Since the second derivative of (14),
(32) 4[F(x̄₁) − F_inf]D/(ηTτ³),
is positive for all τ > 0, the bound is convex in τ, and the optimal value obtained in (31) must be a global minimum.
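Using made-up placeholder constants (all values below are ours, chosen only for illustration), one can cross-check a closed-form minimizer of an error-runtime bound of this shape against a brute-force grid search:

```python
import math

# placeholder constants for illustration only
F_gap  = 2.0    # F(x_1) - F_inf
eta    = 0.05   # learning rate
L      = 1.0    # Lipschitz constant
sigma2 = 1.0    # gradient-variance bound
m      = 8      # number of workers
Y, D   = 0.01, 0.2   # computation time per iteration, communication delay
T      = 100.0  # wall-clock time budget

def bound(tau):
    """Error bound at wall-clock time T as a function of the period tau:
    optimization error + variance floor + local-drift penalty."""
    return (2 * F_gap * (Y + D / tau) / (eta * T)
            + eta * L * sigma2 / m
            + eta**2 * L**2 * sigma2 * (tau - 1))

# closed-form minimizer, obtained by setting d bound / d tau = 0
tau_star = math.sqrt(2 * F_gap * D / (eta**3 * L**2 * sigma2 * T))

# brute-force check over a fine grid of periods
grid = [1 + 0.01 * i for i in range(20000)]
tau_num = min(grid, key=bound)
```

With these placeholders the closed form gives tau_star = 8, and the grid search lands on the same value, confirming that the stationary point is the global minimum of the convex-in-tau bound.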
Appendix E Proof of Theorem 4: Error-Convergence of Adaptive Communication Scheme
E.1 Notation
In order to facilitate the analysis, we first introduce some useful notation. Define matrices that concatenate all local models and gradients:
(33) X_k = [x_k^(1), x_k^(2), …, x_k^(m)],
(34) G_k = [g(x_k^(1)), g(x_k^(2)), …, g(x_k^(m))].
Besides, define the averaging matrix J = 11ᵀ/m, where 1 denotes the all-ones column vector. Unless otherwise stated, 1 is a column vector of size m, and the matrix J and the identity matrix I are of size m × m, where m is the number of workers.
E.2 Proof
Let us first focus on the j-th local update period, where j ∈ {0, 1, 2, …}. Without loss of generality, suppose the iteration index in this local update period starts from jτ + 1 and ends with (j + 1)τ. Then, for the k-th iteration in this period, we have the following lemma.
[Lemma 1 in Wang & Joshi (2018)] For PASGD, under the assumptions in Appendix B, at the k-th iteration, we have the following bound on the objective value:
(35) E[F(x̄_{k+1})] − E[F(x̄_k)] ≤ −(η/2) E‖∇F(x̄_k)‖² + (ηL²/(2m)) Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖² + η²Lσ²/(2m),
where x̄_k denotes the averaged model at the k-th iteration. Taking the total expectation and summing over all iterates in the j-th local update period, we can obtain
(36) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (ηL²/(2m)) Σ_{k=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖² + η²Lσ²τ/(2m).
Next, we provide an upper bound for the last term in (36). Note that
(37) Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖² = ‖X_k(I − J)‖²_F
(38) = ‖(X_{jτ+1} − η Σ_{s=jτ+1}^{k−1} G_s)(I − J)‖²_F
(39) = ‖X_{jτ+1}(I − J) − η Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F
(40) = η² ‖Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F,
where (40) follows from the fact that all workers start from the same point at the beginning of each local update period, i.e., X_{jτ+1}(I − J) = 0. Accordingly, we have
(41) E[Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖²] = η² E‖Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F
(42) ≤ η² E‖Σ_{s=jτ+1}^{k−1} G_s‖²_F,
where the inequality (42) is due to the operator norm of I − J being no larger than 1. Furthermore, using the fact E‖z‖² = ‖E[z]‖² + E‖z − E[z]‖², one can get
(43) E‖Σ_{s=jτ+1}^{k−1} G_s‖²_F = E‖Σ_{s=jτ+1}^{k−1} (G_s − ∇F_s)‖²_F + E‖Σ_{s=jτ+1}^{k−1} ∇F_s‖²_F
(44) = T₁ + T₂,
where ∇F_s = [∇F(x_s^(1)), …, ∇F(x_s^(m))] stacks the full-batch gradients at the local models.
For the first term T₁, since the stochastic gradients are unbiased, all cross terms are zero. Thus, combining with the bounded-variance assumption in Appendix B, we have
(45) T₁ = Σ_{s=jτ+1}^{k−1} E‖G_s − ∇F_s‖²_F
(46) = Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖g(x_s^(i)) − ∇F(x_s^(i))‖²
(47) ≤ β Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + mσ²(k − jτ − 1).
For the second term in (44), directly applying Jensen's inequality, we get
(48) T₂ = E‖Σ_{s=jτ+1}^{k−1} ∇F_s‖²_F
(49) ≤ (k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖².
Substituting the bounds of T₁ and T₂ into (44), we obtain
(50) E[Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖²] ≤ η²(β + k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + η² mσ²(k − jτ − 1).
Recalling the upper bound (36), we further derive the following bound:
(51) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L²/(2m)) Σ_{k=jτ+1}^{(j+1)τ} [(β + k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + mσ²(k − jτ − 1)] + η²Lσ²τ/(2m)
(52) ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L²(β + τ − 1)τ/(2m)) Σ_{s=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m)
(53) = Σ_{k=jτ+1}^{(j+1)τ} [−(η/2) E‖∇F(x̄_k)‖² + (η³L²(β + τ − 1)τ/(2m)) Σ_{i=1}^{m} E‖∇F(x_k^(i))‖²] + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m),
where (52) uses k − jτ − 1 ≤ τ − 1, extends the inner sum over s to the whole period, and evaluates Σ_{k}(k − jτ − 1) = τ(τ − 1)/2. Then, since ‖∇F(x_k^(i))‖² ≤ 2‖∇F(x̄_k)‖² + 2L²‖x̄_k − x_k^(i)‖² by Lipschitz smoothness, we have
(54) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2)(1 − 2η²L²(β + τ − 1)τ) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L⁴(β + τ − 1)τ/m) Σ_{k=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖²
(55) + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m).