# A continuous-time analysis of distributed stochastic gradient

Synchronization in distributed networks of nonlinear dynamical systems plays a critical role in improving robustness of the individual systems to independent stochastic perturbations. Through analogy with dynamical models of biological quorum sensing, where synchronization between systems is induced through interaction with a common signal, we analyze the effect of synchronization on distributed stochastic gradient algorithms. We demonstrate that synchronization can significantly reduce the magnitude of the noise felt by the individual distributed agents and by their spatial mean. This noise reduction property is connected with a reduction in smoothing of the loss function imposed by the stochastic gradient approximation. Using similar techniques, we provide a convergence analysis, and derive a bound on the expected deviation of the spatial mean of the agents from the global minimizer of a strictly convex function. By considering additional dynamics for the quorum variable, we derive an analogous bound, and obtain new convergence results for the elastic averaging SGD algorithm. We conclude with a local analysis around a minimum of a nonconvex loss function, and show that the distributed setting leads to lower expected loss values and wider minima.

## Authors

• 6 publications
• 15 publications
• ### Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Large-batch stochastic gradient descent (SGD) is widely used for trainin...
06/26/2019 ∙ by Kosuke Haruki, et al. ∙ 4

• ### Synchronization and Redundancy: Implications for Robustness of Neural Learning and Decision Making

Learning and decision making in the brain are key processes critical to ...
10/21/2010 ∙ by Jake Bouvrie, et al. ∙ 0

• ### Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent

We interpret the variational inference of the Stochastic Gradient Descen...
01/18/2019 ∙ by Wenqing Hu, et al. ∙ 0

• ### Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM

Understanding the bottlenecks in implementing stochastic gradient descen...
05/03/2019 ∙ by Vibhatha Abeykoon, et al. ∙ 0

• ### Stochastic Gradient Langevin Dynamics with Variance Reduction

Stochastic gradient Langevin dynamics (SGLD) has gained the attention of...
02/12/2021 ∙ by Zhishen Huang, et al. ∙ 0

• ### Global Synchronization of Clocks in Directed Rooted Acyclic Graphs: A Hybrid Systems Approach

In this paper, we study the problem of robust global synchronization of ...
04/07/2019 ∙ by Muhammad U. Javed, et al. ∙ 0

• ### Fully Distributed and Asynchronized Stochastic Gradient Descent for Networked Systems

This paper considers a general data-fitting problem over a networked sys...
04/13/2017 ∙ by Ying Zhang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Stochastic gradient descent (SGD) and its variants have become the de-facto algorithms for large-scale machine learning applications such as deep neural networks (Bottou, 2010; Goodfellow et al., 2016; LeCun et al., 2015; Mallat, 2016). SGD is used to optimize finite-sum loss functions, where a stochastic approximation to the gradient is computed using only a random selection of the input data points. Well-known results on almost-sure convergence rates to global minimizers for strictly convex functions and to stationary points for nonconvex functions exist under sufficient regularity conditions (Bottou, 1998; Robbins and Siegmund, 1971). Classic work on iterate averaging for SGD (Polyak and Juditsky, 1992) and other more recent extensions in (Defazio et al., 2014; Roux et al., 2012; Bach and Moulines, 2013; Schmidt et al., 2017) can improve convergence under a set of reasonable assumptions typically satisfied in the machine learning setting. Convergence proofs rely on a suitably chosen decreasing step size; for constant step sizes and strictly convex functions, the parameters ultimately converge to a distribution peaked around the optimum.

For large-scale machine learning applications, parallelization of SGD is a critical problem of significant modern research interest (Dean et al., 2012; Recht and Ré, 2013; Recht et al., 2011; Chaudhari et al., 2017). Recent work in this direction includes the elastic averaging SGD (EASGD) algorithm, in which

distributed agents coupled through a common signal optimize the same loss function. EASGD can be derived from a single SGD step on a global variable consensus objective with a quadratic penalty, and the common signal takes the form of an average over space and time of the individual agents’ parameter vectors

(Zhang et al., 2015; Boyd et al., 2010). At its core, the EASGD algorithm is a system of identical, coupled, discrete-time dynamical systems.

Surprisingly, mathematical models of quorum sensing in bacteria follow exactly the same structure as the EASGD algorithm, though the dynamics of the common (quorum) signal can be arbitrary, and analysis has typically been performed in continuous-time (Miller and Bassler, 2001; Waters and Bassler, 2005; Russo and Slotine, 2010; Dockery D. and James, 2000). Motivated by this immediate analogy, we present here a continuous-time analysis of distributed stochastic gradient algorithms, of which EASGD is a special case. A significant focus of this work is the interaction between the degree of synchronization of the individual agents, characterized rigorously by a bound on the expected distance between all agents and governed by the coupling strength, and the amount of noise induced by their stochastic gradient approximations.

The effect of coupling between identical continuous-time dynamical systems has a rich history. In particular, synchronization phenomena in such coupled systems have been the subject of much mathematical (Wang and Slotine, 2005), biological Russo and Slotine (2010), neuroscientific (Tabareau et al., 2010), and physical interest (Javaloyes et al., 2008). In nonlinear dynamical systems, synchronization has been shown to play a crucial role in protection of the individual systems from independent sources of noise (Tabareau et al., 2010)

. The interaction between synchronization and noise has also been posed as a possible source of regularization in biological learning, where quorum sensing-like mechanisms could be implemented between neurons through local field potentials

(Bouvrie and Slotine, 2013). Given the significance of stochastic gradient (Zhang et al., 2018b) and externally injected (Neelakantan et al., 2015) noise in regularization of large-scale machine learning models such as deep networks (Zhang et al., 2017), it is natural to expect that the interplay between synchronization of the individual agents and the noise from their stochastic gradient approximations is of central importance in distributed SGD algorithms.

Recently, there has been renewed interest in a continuous-time view of optimization algorithms (Wilson et al., 2016; Wibisono et al., 2016; Wibisono and Wilson, 2015; Betancourt et al., 2018). Nesterov’s accelerated gradient method (Nesterov, 1983) was fruitfully analyzed in continuous-time in (Su et al., 2014), and a unifying extension to other algorithms can be found in (Wibisono et al., 2016). Continuous-time analysis has also enabled discrete-time algorithm development through classical discretization techniques from numerical analysis Zhang et al. (2018a). Analysis in continuous-time optimization is limited, in part, by a difficulty in concluding corresponding discrete-time convergence rates: the dynamics can be arbitrarily sped up in continuous-time to achieve any convergence rate. An absolute time reference can be imposed, for example, by using a singular perturbation framework as in (Nguyen et al., 2018).

The paper is organized as follows. In Sec. 2, we provide some necessary mathematical preliminaries: a review of SGD in continuous-time, a continuous-time limit of the EASGD algorithm, a review of stochastic nonlinear contraction theory, and a statement of some needed assumptions. In Sec. 3, we demonstrate that the effect of synchronization of the distributed SGD agents is to protect each agent and their spatial mean from noise. We derive this for an algorithm where all-to-all coupling is implemented through communication with the spatial mean of the distributed parameters, and we refer to this algorithm as quorum SGD (QSGD). In the appendix, a similar derivation is presented with arbitrary dynamics for the quorum variable, of which EASGD is a special case. In Sec. 4, we connect this noise reduction property with a recent analysis (Kleinberg et al., 2018), which shows SGD can be interpreted as performing gradient descent on a smoothed loss in expectation. In Sec. 5, we provide convergence results for QSGD, as well as new, stronger convergence results for EASGD than were previously obtained in discrete-time (Zhang et al., 2015). In Sec. 6, we consider a local analysis around a minimum of a nonconvex function, and demonstrate that a larger number of distributed agents leads to lower expected loss function values and wider minima, generalizing recent work to the distributed setting (Jastrzȩbski et al., 2017). We close with some concluding remarks in Sec. 7.

## 2 Mathematical preliminaries

In this section, we provide a brief review of the necessary mathematical tools employed in this work.

### 2.1 Stochastic gradient descent in discrete-time

SGD has been essential for training large-scale machine learning models such as deep neural networks, where empirical risk minimization leads to finite-sum loss functions of the form

 f(x)=N∑i=1l(x,yi)

Above, is the input data example and the vector holds the model parameters. In the typical machine learning setting where is very large, the gradient of requires gradient computations of , which is prohibitively expensive.

Instead, a stochastic gradient is computed by taking a random selection of size , typically known as a minibatch. It is simple to see that the stochastic gradient

 ^g(x)=∑y∈B∇l(x,y)

is an unbiased estimator of the true gradient. The parameters are updated according to the iteration

 xt+1=xt−η^g(x)

By adding and subtracting the true gradient, the SGD iteration can be rewritten

 xt+1=xt−η∇f(x)−η√bζ (1)

where is a data-dependent noise term.

can be taken to be Gaussian under a central limit theorem argument, assuming that the size of the minibatch is large enough

(Jastrzȩbski et al., 2017; Mandt et al., ).

is then given by the variance of a single-element stochastic gradient

 Σ=E[(∇l(x,yi)−∇f(x))(∇l(x,yi)−∇f(x))T]

with the index chosen uniformly from , and where the expectation is taken over the selection of this index.

### 2.2 Stochastic gradient descent in continuous-time

A significant difficulty in a continuous-time analysis of SGD is formulating an accurate stochastic differential equation (SDE) model. Recent works have proved rigorously (Hu et al., 2017; Feng et al., 2018; Li et al., 2015, 2018) that the sequence of values generated by the SDE

 dx=(−∇f(x)−14η∇|∇f(x)|2)dt+√ηS(x)dW

approximates the SGD iteration with weak error , where is a standard Wiener process, and where . Dropping the small term proportional to reduces the weak error to Hu et al. (2017). This leads to the SDE

 dx=−∇f(x)dt+√ηbB(x)dW (2)

where . Equation (2) has appeared in a number of recent works (Mandt et al., 2017, 2016, ; Chaudhari and Soatto, 2018; Chaudhari et al., 2018; Jastrzȩbski et al., 2017), and is generally obtained by making the replacement and in (1), as a sort of reverse Euler-Maruyama discretization (Kloeden and Platen, 1992). In this work, we will need to convert a related sequence to continuous-time, where these simple rules will not apply.

A procedure common in the stochastic modeling community for converting a discrete stochastic model to an SDE is to consider

 dx=E[Δx]Δtdt+ ⎷E[(Δx−E[Δx])(Δx−E[Δx])T]ΔtdW (3)

where denotes the discrete update . It can be shown that the discrete stochastic model and the SDE in (3) approximately have the same Fokker-Planck equation (Allen, 2007)

. The probability distribution of solutions to the SDE in (

3) and the original discrete model will therefore approximately be the same, and it is expected that insight about the behavior of the discrete model can be gained from analyzing the continuous model. Performing this procedure on (1) immediately leads to

 E[Δx] =−η∇f(x) E[(Δx−E[Δx])(Δx−E[Δx])T] =η2bΣ

so that, considering as the timestep, the approximating SDE in the sense of (3) is given by

 dx=−∇f(x)dt+√ηbB(x)dW

which is precisely (2). Later in this work, we will use this procedure to convert a discrete-time sequence to a continuous-time limit.

### 2.3 EASGD in continuous-time

Following (Zhang et al., 2015), we provide a brief introduction to the EASGD algorithm, and convert the resulting sequences to continuous-time. We imagine a distributed optimization setting with agents and a single master. We are interested in solving a stochastic optimization problem

 minxF(x)=Eζ[f(x,ζ)]

where is the vector of parameters and

is a random variable representing the stochasticity in the objective. This is equivalent to the distributed optimization problem

(Boyd et al., 2010)

 minx1,…,xp,~xp∑i=1(Eζi[f(xi,ζi)]+k2∥xi−~x∥2) (4)

where each is a local vector of parameters and is the quorum variable. The quadratic penalty ensures that all local agents remain close to the quorum variable, and sets the coupling strength. Smaller values of allow for more exploration, while larger values ensure a greater degree of synchronization. Intuitively, the interaction between agents mediated by is expected to help individual trajectories escape local minima, unless they all fall into the same deep or wide minimum together. Later in this work, we will make this intuition more precise.

We assume the expectation in (2.3) is approximated by a sum over input data points, and that the stochastic gradient is computed by taking a batch of size . After taking an SGD step, the updates for each agent and the quorum variable become

 xit+1 =xit−η∇f(xit)+ηk(~xt−xit)−η√bζi ~xt+1 =~xt+ηpk(x∙t−~xt)

where and . Switching to the continuous-time limit, these equations become,

 dxit =(−∇f(xit)+k(~xt−xit))dt−√ηbBdWit (5) d~xt =kp(x∙t−~xt) (6)

with . Note that in (6), the dynamics of represent a simple low-pass filter of the center of mass (spatial mean) variable . In the limit of large , the dynamics of this filter will be much faster than the SGD dynamics, and the continuous-time EASGD system can be replaced by

 dxit=(−∇f(xit)+k(x∙t−xit))dt−√ηbBdWit (7)

We refer to (7) as quorum SGD (QSGD), and it will be the primary focus of this work. Analyses for EASGD outside of the convergence results in Sec. 5 can be found in the appendix; we find that additional dynamics for the quorum variable, no matter their form, make the analysis more technical but offer no discernible theoretical benefit.

### 2.4 Background on nonlinear contraction theory

The main mathematical tool used in this work is nonlinear contraction theory. Contraction is a form of incremental stability for nonlinear systems whose properties we briefly summarize here; further details can be found in (Lohmiller and Slotine, 1998). Intuitively, a nonlinear system is called contracting if the effects of initial conditions and perturbations are lost exponentially fast. Consider a nonlinear dynamical system

 ˙x=f(x,t) (8)

where is the state and is a smooth function. Let

denote a uniformly invertible matrix. Equation (

8) is said to be contracting in the metric with contraction rate if the symmetric part of the generalized Jacobian is uniformly negative definite,

 Js=[(˙Θ+Θ∂f∂x)Θ−1]s≤−λI (9)

for all and for all time . In (9), subscript s refers to the symmetric part of a matrix, . If (9) is satisfied for a nonlinear system, all trajectories globally exponentially converge to each other regardless of initial conditions (Lohmiller and Slotine, 1998). If a dynamical system is contracting, we will interchangeably refer to , the system, and as contracting. We will also refer to as the metric transformation. Two specific robustness results for contracting systems needed for the derivations in this work are summarized below.

###### Theorem 1

Consider the dynamical system in (8), and assume that it is contracting with metric transformation and contraction rate . Let denote the maximum condition number of . Consider the perturbed dynamical system

 ˙x=f(x,t)+ϵ(x,t) (10)

Then, for a solution of (8) and a solution of (10),

 ∥x1(t)−x2(t)∥≤χ∥x1(0)−x2(0)∥e−λt+χ∥ϵ∥λ
###### Proof of Theorem 1

See (Vecchio and Slotine, 2013).

###### Theorem 2

Consider the stochastic differential equation

 dx=f(x,t)dt+σ(x,t)dW (11)

with and where denotes an -dimensional Wiener process. Assume that there exists a uniformly positive definite metric such that with , and that is contracting in this metric. Further assume that is uniformly upper bounded by a constant C. Then, for two trajectories and with stochastic initial conditions given by a probability distribution ,

 E[∥a(t)−b(t)∥2] ≤1β(E[(∥a(0)−b(0)∥2M−Cλ)+]e−2λt+Cλ)

where denotes the unit ramp function and . The expectation on the left-hand side is over the noise for all , and the expectation on the right-hand side is over the distribution of initial conditions.

###### Proof of Theorem 2

See (Pham et al., 2009).

If the conditions of Thm. 2 are satisfied, the system in (11) is said to be stochastically contracting in the metric with bound and rate .

### 2.5 Assumptions

We require two main assumptions about the objective function , both of which have been employed in previous work analyzing synchronization and noise in nonlinear systems (Tabareau et al., 2010). The first is an assumption on the nonlinearity of the components of the gradient. Intuitively, this can be understood as a condition on the third derivative of the objective, or on its degree of non-quadraticity.

Assumption 1

Assume that the Hessian matrix of each component of the gradient has bounded maximum eigenvalue,

for all .

The second assumption is a condition on the resistance of the distributed, coupled gradient flows studied in this work to external stochastic perturbations.

Assumption 2 Consider two dynamical systems , and where is a continuous-time stochastic process. Then implies that .

## 3 Synchronization and noise

In this section, we analyze the interaction between synchronization of the distributed QSGD agents and the noise they experience. We begin with a derivation of a quantitative measure of synchronization that applies to any distributed SGD algorithm, and then present the section’s primary contribution.

### 3.1 A measure of synchronization

We begin with a simple theorem on synchronization in the deterministic setting, which will allow us to prove a bound on synchronization in the stochastic setting using Thm. 2.

###### Theorem 3

Consider the coupled gradient descent system

 ˙xi=−∇f(xi)+k(z−xi) (12)

where represents a common external signal. Let denote the maximum eigenvalue of . For , the individual systems globally exponentially synchronize with rate .

###### Proof of Theorem 3

Consider the virtual system

 ˙y=−∇f(y)+k(z−y)

where is an external input. The Jacobian of this system is given by

 J=−∇2f(y)−kI (13)

Equation (13) is symmetric and negative definite for for any external input . Because the individual are particular solutions of this virtual system with the same input, contraction for all implies that for all and , exponentially. The contraction rate is given by .

Thm. 3 gives a simple condition on the coupling for synchronization of the individual agents in (12). Because can represent any input, Thm. 3 applies to any dynamics of the quorum variable: with , it applies to the QSGD algorithm, and with , it applies to the EASGD algorithm. Under the assumption of a contracting deterministic system, we can use the stochastic contraction results in Thm. 2 to bound the expected distance between individual agents in the stochastic setting.

###### Lemma 0

Assume that uniformly and that . Then, after exponential transients,

 E[∑i∥xi−x∙∥2]≤(p−1)Cη2b(k−¯λ) (14)

where each is a solution of (7).

###### Proof of Lemma 1

We first apply Thm. 2 in the Euclidean metric to conclude that

 E[∥xi−xj∥2]≤Cηb(k−¯λ)

after exponential tranasients. Summing (1) over and leads to

 E[∑i

Finally, as in (Tabareau et al., 2010), we can rewrite

 ∑i

which proves the result.

We will refer to (14) as a synchronization condition.

### 3.2 The interaction between synchronization and noise

We now provide a mathematical characterization of how synchronization reduces the amount of noise felt by the individual QSGD agents. In addition to the assumptions stated in Sec. 2.5, we require that the SGD agents are stochastically contracting with rate and bound . This ensures that the synchronization condition in (14) derived in the previous subsection can be applied. This subsection follows the mathematical procedure first employed in (Tabareau et al., 2010) in the context of quorum sensing. Adding up the stochastic dynamics in (7), we find

We then define

 ϵ=−1p∑i∇f(xi)+∇f(x∙)

so that we can rewrite

 dx∙=[−∇f(x∙)+ϵ]dt+√η√bpB∑idWi (15)

Let denote the gradient of and let denote its Hessian. We apply the Taylor formula with integral remainder to ,

 (−∇f(xi))j +(∇f(x∙))j−FTj(x∙)(xi−x∙) (16)

Summing (16) over and applying the assumed bound leads to the inequality

 ∣∣ ∣∣∑i(−∇f(xi))j+(∇f(x∙))j∣∣ ∣∣≤Q2∑i∥xi−x∙∥2

The left-hand side of the above inequality is . Squaring both sides and summing over provides a bound on . Taking a square root of this bound, performing an expectation over the noise, and using the synchronization condition in (14), we conclude that

 E[∥ϵ∥]≤(p−1)√nQCη4p(k−¯λ)b (17)

The bound in (17) depends on the contraction rate of each agent , the dimensionality of space , the bound on the non-quadraticity of the objective , and the bound on the noise strength . In the limit of large , the dependence on becomes negligible. The expected effect of the disturbance term tends to zero as the coupling gain tends to infinity, corresponding to the fully synchronized limit. By Assumption in Sec. 2.5, as , the difference between trajectories of (15) and the system without the presence of tends to zero.

As a sum of independent Gaussian random variables with mean zero and covariance , the quantity can be rewritten as a single Gaussian random variable . Thus, for a given noise covariance and corresponding bound , the difference between the dynamics followed by and the noise-free gradient descent system

 ˙xnf =−∇f(xnf)

tends to zero as and . The limit eliminates the effect of on the dynamics of , while the limit removes the effect of the additive noise. More precisely, the impact of the noise on the center of mass trajectory is characterized by

 (p−1)QCη√n4p(k−¯λ)b+√Cηbp

From the triangle inequality, the trajectory of any individual agent and that of the noise-free system are similar.

### 3.3 Discussion

The results in this section demonstrate that for distributed SGD algorithms, the noise strength is set by the ratio parameter at the expense of a distortion term which tends to zero with synchronization. This derivation will be essential for the strictly convex convergence proofs in Sec. 5. However, in the context of current empirical evidence, it is difficult to predict if this noise reduction is positive or negative for nonconvex

optimization. Many studies have reported the importance of stochastic gradient noise in deep learning

(Poggio et al., 2017; Zhu et al., 2018; Chaudhari and Soatto, 2018; Zhang et al., 2017). Larger batches are known to cause issues with generalization, and this has been hypothesized to be due to a reduction in the noise magnitude (Keskar et al., 2016). However, recent results suggest that the generalization gap may be due to a reduction in training time with larger batches (Hoffer et al., 2017), and a recent algorithm demonstrates that low noise levels can be sufficient (Jin et al., 2018).

If large batch size training with properly tuned hyperparameters does not impact performance as suggested by

(Hoffer et al., 2017), this noise protection property could supply the “best of both worlds”: training time and I/O operations are unaffected, because standard batch sizes can be used, but the effective batch size is larger.

If a larger noise strength is desired with typical batch sizes, these results suggest scaling with the number of agents . Similar linear scaling rules between the learning rate and the batch size have been suggested for successful training of large-scale deep networks using large batches and batch parallelism (Goyal et al., 2017).

Note that the simultaneous limits and are undesirable in the context of nonconvex optimization, as the noise-free gradient descent system will simply converge to the nearest critical point.

Finally, our synchronization setting allows a similar result to be derived when each individual agent has a different learning rate , and thus a different expected smoothing. In this case, the synchronization condition in (14) is modified to

 E[∑i∥xi−x∙∥2]≤C2pb(k−¯λ)∑i

so that

 E[∥ϵ∥]≤√nQC4p2b(k−¯λ)∑i

The noise term becomes a sum of independent Gaussians each with covariance , and can be written as a single Gaussian random variable . This could allow e.g. for multiresolution optimization, where agents with larger learning rates may help avoid sharper local minima, while agents with finer learning rates may help converge to good local minima. Similarly, typical learning rate schedules could be parallelized by starting multiple agents at the same initial set of parameters with different learning rates.

## 4 An alternative view of distributed stochastic gradient descent

Using similar techniques as employed in the previous section, we provide an analysis of the effect of synchronization on the degree of smoothing of the loss function imposed by SGD, first derived rigorously in (Kleinberg et al., 2018). Defining an auxiliary sequence and comparing with (1) shows that , yielding

 yt+1=yt−η∇f(yt−η√bζ)−η√bζ

so that

 Eζ[yt+1]=yt−η∇Eζ[f(yt−η√bζ)]

This demonstrates that the sequence performs gradient descent on the loss function convolved with the -scaled noise in expectation111In (Kleinberg et al., 2018), the authors group the factor of with the covariance of the noise.. Using this argument, it is shown in (Kleinberg et al., 2018) that SGD can converge to minimizers for a much larger class of functions than just convex functions, though the convolution operation can disturb the locations of the minima.

### 4.1 The effect of synchronization on the convolution scaling

The analysis in Sec. 3 suggests that synchronization of the variables should reduce the convolution scaling for a variable related to the center of mass, and we now make this intuition more precise. We perform this calculation for the QSGD algorithm; an analogous, slightly more technical derivation for arbitrary dynamics of the quorum variable can be found in the appendix. We have that

 Δxit=−η∇f(xit)+ηk(x∙t−xit)−η√bζit

so that

 Δx∙t =−η∇f(x∙t)+ηϵt−η√bpζt

with as usual. Define the auxiliary variable , so that

 x∙t+1=y∙t+ηϵt−η√bpζt (19)

Equation (19) can then be used to state

 y∙t+1 =y∙t−η∇f(x∙t+1)+ηϵt−η√bpζt =y∙t−η∇f(y∙t−η√bpζt+ηϵt)+ηϵt−η√bpζt

Taylor expanding the gradient term, we find,

 ∇f(y∙t−η√bpζt+ηϵt)=∇f(y∙t−η√bpζt)+η∇2f(y∙t−η√bpζt)ϵt+O(η2)

which alters the discrete update to

 Δy∙t =−η∇f(y∙t−η√bpζt)+η(1−η∇2f(y∙t−η√bpζt))ϵt−η√bpζt+O(η3) (20)

Converting (20) to an SDE via the procedure outlined in Sec. 2.2 then leads to

 dy∙t =(−∇Eζt[f(y∙t−η√bpζt)]+(1−η∇2Eζt[f(y∙t−η√bpζt)])ϵt)dt−√ηbpdWt (21)

Equation (21) says that, in expectation, performs gradient descent on a convolved loss with scaling reduced by a factor of . The reduced scaling comes at the expense of the usual disturbance term , which decreases to zero with increasing synchronization in expectation over the noise for . The convolution operation will disturb the location of the minima with respect to the true loss function ; thus, minima found by QSGD will lie closer to minima of the true loss function than standard SGD.

## 5 Convergence analysis

We now provide contraction-based convergence proofs for QSGD and EASGD.

### 5.1 QSGD convergence analysis

We first present a simple lemma describing convergence of deterministic distributed gradient descent with arbitrary coupling.

###### Lemma 0

Consider the all-to-all coupled system of ordinary differential equations

 ˙xi=−∇f(xi)+∑j≠i(u(xj)−u(xi)) (22)

with for . Assume that is contracting in some metric, and that is contracting in some (not necessarily the same) metric. Then all globally exponentially converge to a critical point of .

###### Proof of Lemma 2

Consider the virtual system

 ˙y=−∇f(y)−pu(y)+p∑j=1u(xj)

This system is contracting by assumption, and each of the individual agents is a particular solution. The agents therefore globally exponentially synchronize. After exponential transients, the dynamics of each agent is described by the reduced-order virtual system

 ˙y=−∇f(y)

By assumption, this system is contracting in some metric, and has a particular solution at any critical point such that .

We now make a few remarks on Lemma 5.

• This simple lemma demonstrates that any form of coupling can be used so long as the quantity is contracting. A simple choice is where is the coupling gain, corresponding to balanced equal strength all-to-all coupling. Then (22) can be simplified to

 ˙xi =−∇f(xi)+k(x∙−xi) (23)

which is QSGD without noise. Note that all-to-all coupling can thus be implemented with connections by communicating with the center of mass variable.

• If is strictly convex, will be contracting in the identity metric.

• If is geodesically convex, will be contracting in the corresponding Riemannian metric (Rapcsák, 1991; Wensing and Slotine, 2018).

• If is locally strictly convex or locally geodesically convex, will be locally contracting in the identity or Riemannian metric respectively. For example, for a nonconvex objective with initializations in a convex or geodesically convex region of parameter space, we can conclude exponential convergence to a local minimizer for each agent.

If is strictly or geodesically convex, the coupling between agents provides no advantage in the deterministic setting, as they would individually contract towards the minimum regardless. For stochastic dynamics, however, we would expect coupling to improve convergence. We now demonstrate the ramifications of the results in Sec. 3 in the context of QSGD agents with the following theorem.

###### Theorem 6

Consider the QSGD algorithm

with for . Assume that there exists a uniform upper bound on the Hessian of the components of the negative gradient, . Assume that is bounded such that uniformly. Assume that is strictly convex, so that uniformly with . Then, after exponential transients, the expected difference between the center of mass trajectory and the global minimizer of is given by

 E[∥x∗−x∙∥]≤Q(p−1)C√nη4pbλ(λ+k)+√ηCbpλ (24)
###### Proof of Theorem 4

We first sum the dynamics of the individual agents to compute the dynamics of the center of mass variable. This leads to the SDE

 dx∙=(−∇f(x∙)+ϵ)dt+√ηbpBdWt

with defined exactly as in Sec. 3. Consider the hierarchy of virtual systems

 ˙y1 =−∇f(y1) ˙y2 =−∇f(y2)+ϵ dy3t =(−∇f(y3)+ϵ)dt+√ηbpBdWt

The system is contracting by assumption, and admits a particular solution . By Thm. 1, the difference between the and systems can be bounded by after exponential transients. We have already shown in Sec. 3 that 222In Sec. 3, the denominator contained the factor rather than . Convexity of was not assumed, so that the contraction rate of the coupled system was . In this proof, convexity of implies that the contraction rate of the coupled system is ., so that

 E[∥y2−x∗∥]≤E[∥ϵ∥]λ≤Q(p−1)Cη√n4p(λ+k)λb

The system is contracting for any input , and the system is identical with the addition of an additive noise term. By Thm. 2, after exponential transients,

 E[∥y3−y2∥2]≤ηCbpλ

By Jensen’s inequality, and noting that is a concave, increasing function,

 E[∥y3−y2∥]≤√E[∥y3−y2∥2]≤√ηCbpλ

Finally, note that is a particular solution of the virtual system. From these observations and an application of the triangle inequality, after exponential transients,

 E[∥x∙−x∗∥]≤Q(p−1)C√nη4pbλ(λ+k)+√ηCbpλ

This completes the proof.

As in Sec. 3, the bound in (24) consists of two terms. The first term originates from a lack of complete synchronization and can be decreased by increasing . The second term comes from the additive noise, and can be decreased by increasing the number of agents. The first term in the bound may be loose in the large-scale machine learning setting due to the appearance of , but its dependencies on , , and are insightful, and can be exploited.

For example, state- and time-dependent couplings of the form are also immediately applicable with the proof methodology above. Note that the state-dependence must be identical for each agent for the coupling term to cancel when computing the dynamics of . Increasing over time can significantly decrease the influence of the first term in (24), leaving only a bound essentially equivalent to linear noise averaging. For nonconvex objectives, this suggests choosing low values of in the early stages of training for exploration, and larger values near the end of training to reduce the variance of around a minimum. If accessible, local curvature information could also be used to determine when is near a local minimum, and therefore when to increase . Using state- and time-dependent couplings would change the duration of exponential transients, but the result in Thm. 6 would still hold.

### 5.2 EASGD convergence analysis

We now incorporate the additional dynamics present in the EASGD algorithm. First, we prove a lemma demonstrating convergence to the global minimum of a strongly convex function with full gradient computations.

###### Lemma 0

Consider the deterministic continuous-time EASGD algorithm

 ˙xi =−∇f(xi)+k(~x−xi) ˙~x =kp(x∙−~x)

with for . Assume is strictly convex, so that for some . Then all agents and the quorum variable globally exponentially converge to the unique global minimum .

###### Proof of Lemma 3

By Thm. 3 and convexity of , the individual trajectories globally exponentially synchronize with rate . On the synchronized subspace, the overall system can be described by the reduced-order virtual system

 ˙y =−∇f(y)+k(~x−y) ˙~x =kp(y−~x)

The system Jacobian is then given by

 J=(−∇2f(y)−kIkIkpI−kpI)

Choosing a metric transformation , the generalized Jacobian becomes

 ΘJΘ−1=(−∇2f(y)−kI√pkI√pkI−kpI)

which is clearly symmetric. A sufficient condition for negative definiteness of this matrix is that (Wang and Slotine, 2005; Horn and Johnson, 2012). Rearranging leads to the condition , which is satisfied by strict convexity of . The overall system is therefore contracting. Finally, note that where is the unique global minimum is a particular solution. Because the overall virtual system is contracting, all trajectories will globally exponentially converge to this minimum.

Just as in Thm. 6, we now turn to a convergence analysis for the EASGD algorithm, using the results of Lemma 7.

###### Theorem 8

Consider the continuous-time EASGD algorithm

 dxi d~x =kp(x∙−~x)dt

for . Assume that is strictly convex, so that . Assume that there exists a uniform upper bound on the Hessian of the components of the negative gradient, . Let denote the contraction rate of this system in the metric with . Further assume that with a positive constant. Then, after exponential transients,

 E[∥z−z∗∥]≤Q(p−1)C√nη4b√pγ(λ+k)+√ηCbpγ (25)

where and .

###### Proof of Theorem 5

Adding up the agent dynamics as in Thm. 6, the center of mass trajectory follows

 dx∙=(−∇f(x∙)+ϵ+k(~x−x∙))dt+√ηbpBdWt

with the usual definition of . Consider the hierarchy of virtual systems,

 ˙y1 =−∇f(y1)+k(~y1−y1) ˙~y1 =kp(y1−~y1) ˙y2 =−∇f(y2)+k(~y2−y2)+ϵ ˙~y2 =kp(y2−~y2) dy3 =(−∇f(y3)+k(~y3−y3))dt+ϵ+√ηbpBdWt d~y3 =kp(y3−~y3)dt

The first system is contracting towards the unique global minimum by the assumptions of the theorem. The second system is contracting for any external input , and we have already bounded in Sec. 3 (the bound is independent of dynamics for the quorum variable - see the appendix for details). Let and . By Thm. 1, and noting that the condition number of is ,

 E[∥∥z2−z∗∥∥]≤√pγE[∥ϵ∥]≤Q(p−1)Cη√n4√p(λ+k)γb

after exponential transients. Note that . Hence we can take in Thm. 2, and

 E[∥z3−z2∥2]≤ηCbpγ

after exponential transients. Then, by Jensen’s inequality,

 E[∥z3−z2∥]≤√ηCbpγ

Combining these results and using the triangle inequality, we find that after exponential transients

 E[∥z−z∗∥]≤Q(p−1)C√nη4b√pγ(λ+k)+√ηCbpγ

where . This completes the proof.

Thm. 8 demonstrates an explicit bound on the expected deviation of both the center of mass variable and the quorum variable from the global minimizer of a strictly convex function. Unlike in Thm. 6, the bound is now applied to the combined vector rather than the center of mass variable itself, and the contraction rate is used rather than . The specific metric transformation used adds a factor of to the first quantity in the bound, so this term scales like . This suggests that should be chosen to scale with faster than . This consideration is not necessary for the QSGD algorithm.

As in the discussion after Thm. 6, the results will still hold with state- and time-dependent couplings of the form , and the same ideas can be used to eliminate the effect of the first term in the bound.

### 5.3 Specialization to a multivariate quadratic objective

In the original discrete-time analysis of EASGD in (Zhang et al., 2015), it was proven that iterate averaging (Polyak and Juditsky, 1992) of leads to a Fisher-optimal variance around the minimum of a quadratic objective. We now derive an identical result in continuous-time for the QSGD algorithm, demonstrating that this optimality is independent of the additional dynamics in the EASGD algorithm.

For a multivariate quadratic with symmetric and positive definite, the stochastic dynamics of each agent can be written

 dxit=(−Axit+k(x∙−xit))dt+BdWit

with . To make the optimal result more clear, we group the factor of with , unlike in previous sections. Then satisfies

 dx∙t=−Ax∙tdt+1√pBdW

This is a multivariate Ornstein-Uhlenbeck process with solution

 x∙(t)=e−Atx∙(0)+1√p∫t0e−A(t−s)BdWs (26)

By assumption, is negative definite, so that the stationary expectation . The stationary variance is given by

 AV+VAT=1pΣ

(see, for example, (Gardiner, 2009), p.107). We now define

 z(t)=1t∫t0x∙(t′)dt′

and can immediately state the following lemma.

###### Lemma 0

The averaged variable

converges weakly to a normal distribution with mean zero and standard deviation