1 Introduction
Stochastic gradient descent (SGD) and its variants have become the defacto algorithms for largescale machine learning applications such as deep neural networks (Bottou, 2010; Goodfellow et al., 2016; LeCun et al., 2015; Mallat, 2016). SGD is used to optimize finitesum loss functions, where a stochastic approximation to the gradient is computed using only a random selection of the input data points. Wellknown results on almostsure convergence rates to global minimizers for strictly convex functions and to stationary points for nonconvex functions exist under sufficient regularity conditions (Bottou, 1998; Robbins and Siegmund, 1971). Classic work on iterate averaging for SGD (Polyak and Juditsky, 1992) and other more recent extensions in (Defazio et al., 2014; Roux et al., 2012; Bach and Moulines, 2013; Schmidt et al., 2017) can improve convergence under a set of reasonable assumptions typically satisfied in the machine learning setting. Convergence proofs rely on a suitably chosen decreasing step size; for constant step sizes and strictly convex functions, the parameters ultimately converge to a distribution peaked around the optimum.
For largescale machine learning applications, parallelization of SGD is a critical problem of significant modern research interest (Dean et al., 2012; Recht and Ré, 2013; Recht et al., 2011; Chaudhari et al., 2017). Recent work in this direction includes the elastic averaging SGD (EASGD) algorithm, in which
distributed agents coupled through a common signal optimize the same loss function. EASGD can be derived from a single SGD step on a global variable consensus objective with a quadratic penalty, and the common signal takes the form of an average over space and time of the individual agents’ parameter vectors
(Zhang et al., 2015; Boyd et al., 2010). At its core, the EASGD algorithm is a system of identical, coupled, discretetime dynamical systems.Surprisingly, mathematical models of quorum sensing in bacteria follow exactly the same structure as the EASGD algorithm, though the dynamics of the common (quorum) signal can be arbitrary, and analysis has typically been performed in continuoustime (Miller and Bassler, 2001; Waters and Bassler, 2005; Russo and Slotine, 2010; Dockery D. and James, 2000). Motivated by this immediate analogy, we present here a continuoustime analysis of distributed stochastic gradient algorithms, of which EASGD is a special case. A significant focus of this work is the interaction between the degree of synchronization of the individual agents, characterized rigorously by a bound on the expected distance between all agents and governed by the coupling strength, and the amount of noise induced by their stochastic gradient approximations.
The effect of coupling between identical continuoustime dynamical systems has a rich history. In particular, synchronization phenomena in such coupled systems have been the subject of much mathematical (Wang and Slotine, 2005), biological Russo and Slotine (2010), neuroscientific (Tabareau et al., 2010), and physical interest (Javaloyes et al., 2008). In nonlinear dynamical systems, synchronization has been shown to play a crucial role in protection of the individual systems from independent sources of noise (Tabareau et al., 2010)
. The interaction between synchronization and noise has also been posed as a possible source of regularization in biological learning, where quorum sensinglike mechanisms could be implemented between neurons through local field potentials
(Bouvrie and Slotine, 2013). Given the significance of stochastic gradient (Zhang et al., 2018b) and externally injected (Neelakantan et al., 2015) noise in regularization of largescale machine learning models such as deep networks (Zhang et al., 2017), it is natural to expect that the interplay between synchronization of the individual agents and the noise from their stochastic gradient approximations is of central importance in distributed SGD algorithms.Recently, there has been renewed interest in a continuoustime view of optimization algorithms (Wilson et al., 2016; Wibisono et al., 2016; Wibisono and Wilson, 2015; Betancourt et al., 2018). Nesterov’s accelerated gradient method (Nesterov, 1983) was fruitfully analyzed in continuoustime in (Su et al., 2014), and a unifying extension to other algorithms can be found in (Wibisono et al., 2016). Continuoustime analysis has also enabled discretetime algorithm development through classical discretization techniques from numerical analysis Zhang et al. (2018a). Analysis in continuoustime optimization is limited, in part, by a difficulty in concluding corresponding discretetime convergence rates: the dynamics can be arbitrarily sped up in continuoustime to achieve any convergence rate. An absolute time reference can be imposed, for example, by using a singular perturbation framework as in (Nguyen et al., 2018).
The paper is organized as follows. In Sec. 2, we provide some necessary mathematical preliminaries: a review of SGD in continuoustime, a continuoustime limit of the EASGD algorithm, a review of stochastic nonlinear contraction theory, and a statement of some needed assumptions. In Sec. 3, we demonstrate that the effect of synchronization of the distributed SGD agents is to protect each agent and their spatial mean from noise. We derive this for an algorithm where alltoall coupling is implemented through communication with the spatial mean of the distributed parameters, and we refer to this algorithm as quorum SGD (QSGD). In the appendix, a similar derivation is presented with arbitrary dynamics for the quorum variable, of which EASGD is a special case. In Sec. 4, we connect this noise reduction property with a recent analysis (Kleinberg et al., 2018), which shows SGD can be interpreted as performing gradient descent on a smoothed loss in expectation. In Sec. 5, we provide convergence results for QSGD, as well as new, stronger convergence results for EASGD than were previously obtained in discretetime (Zhang et al., 2015). In Sec. 6, we consider a local analysis around a minimum of a nonconvex function, and demonstrate that a larger number of distributed agents leads to lower expected loss function values and wider minima, generalizing recent work to the distributed setting (Jastrzȩbski et al., 2017). We close with some concluding remarks in Sec. 7.
2 Mathematical preliminaries
In this section, we provide a brief review of the necessary mathematical tools employed in this work.
2.1 Stochastic gradient descent in discretetime
SGD has been essential for training largescale machine learning models such as deep neural networks, where empirical risk minimization leads to finitesum loss functions of the form
Above, is the input data example and the vector holds the model parameters. In the typical machine learning setting where is very large, the gradient of requires gradient computations of , which is prohibitively expensive.
Instead, a stochastic gradient is computed by taking a random selection of size , typically known as a minibatch. It is simple to see that the stochastic gradient
is an unbiased estimator of the true gradient. The parameters are updated according to the iteration
By adding and subtracting the true gradient, the SGD iteration can be rewritten
(1) 
where is a datadependent noise term.
can be taken to be Gaussian under a central limit theorem argument, assuming that the size of the minibatch is large enough
(Jastrzȩbski et al., 2017; Mandt et al., ).is then given by the variance of a singleelement stochastic gradient
with the index chosen uniformly from , and where the expectation is taken over the selection of this index.
2.2 Stochastic gradient descent in continuoustime
A significant difficulty in a continuoustime analysis of SGD is formulating an accurate stochastic differential equation (SDE) model. Recent works have proved rigorously (Hu et al., 2017; Feng et al., 2018; Li et al., 2015, 2018) that the sequence of values generated by the SDE
approximates the SGD iteration with weak error , where is a standard Wiener process, and where . Dropping the small term proportional to reduces the weak error to Hu et al. (2017). This leads to the SDE
(2) 
where . Equation (2) has appeared in a number of recent works (Mandt et al., 2017, 2016, ; Chaudhari and Soatto, 2018; Chaudhari et al., 2018; Jastrzȩbski et al., 2017), and is generally obtained by making the replacement and in (1), as a sort of reverse EulerMaruyama discretization (Kloeden and Platen, 1992). In this work, we will need to convert a related sequence to continuoustime, where these simple rules will not apply.
A procedure common in the stochastic modeling community for converting a discrete stochastic model to an SDE is to consider
(3) 
where denotes the discrete update . It can be shown that the discrete stochastic model and the SDE in (3) approximately have the same FokkerPlanck equation (Allen, 2007)
. The probability distribution of solutions to the SDE in (
3) and the original discrete model will therefore approximately be the same, and it is expected that insight about the behavior of the discrete model can be gained from analyzing the continuous model. Performing this procedure on (1) immediately leads toso that, considering as the timestep, the approximating SDE in the sense of (3) is given by
which is precisely (2). Later in this work, we will use this procedure to convert a discretetime sequence to a continuoustime limit.
2.3 EASGD in continuoustime
Following (Zhang et al., 2015), we provide a brief introduction to the EASGD algorithm, and convert the resulting sequences to continuoustime. We imagine a distributed optimization setting with agents and a single master. We are interested in solving a stochastic optimization problem
where is the vector of parameters and
is a random variable representing the stochasticity in the objective. This is equivalent to the distributed optimization problem
(Boyd et al., 2010)(4) 
where each is a local vector of parameters and is the quorum variable. The quadratic penalty ensures that all local agents remain close to the quorum variable, and sets the coupling strength. Smaller values of allow for more exploration, while larger values ensure a greater degree of synchronization. Intuitively, the interaction between agents mediated by is expected to help individual trajectories escape local minima, unless they all fall into the same deep or wide minimum together. Later in this work, we will make this intuition more precise.
We assume the expectation in (2.3) is approximated by a sum over input data points, and that the stochastic gradient is computed by taking a batch of size . After taking an SGD step, the updates for each agent and the quorum variable become
where and . Switching to the continuoustime limit, these equations become,
(5)  
(6) 
with . Note that in (6), the dynamics of represent a simple lowpass filter of the center of mass (spatial mean) variable . In the limit of large , the dynamics of this filter will be much faster than the SGD dynamics, and the continuoustime EASGD system can be replaced by
(7) 
We refer to (7) as quorum SGD (QSGD), and it will be the primary focus of this work. Analyses for EASGD outside of the convergence results in Sec. 5 can be found in the appendix; we find that additional dynamics for the quorum variable, no matter their form, make the analysis more technical but offer no discernible theoretical benefit.
2.4 Background on nonlinear contraction theory
The main mathematical tool used in this work is nonlinear contraction theory. Contraction is a form of incremental stability for nonlinear systems whose properties we briefly summarize here; further details can be found in (Lohmiller and Slotine, 1998). Intuitively, a nonlinear system is called contracting if the effects of initial conditions and perturbations are lost exponentially fast. Consider a nonlinear dynamical system
(8) 
where is the state and is a smooth function. Let
denote a uniformly invertible matrix. Equation (
8) is said to be contracting in the metric with contraction rate if the symmetric part of the generalized Jacobian is uniformly negative definite,(9) 
for all and for all time . In (9), subscript s refers to the symmetric part of a matrix, . If (9) is satisfied for a nonlinear system, all trajectories globally exponentially converge to each other regardless of initial conditions (Lohmiller and Slotine, 1998). If a dynamical system is contracting, we will interchangeably refer to , the system, and as contracting. We will also refer to as the metric transformation. Two specific robustness results for contracting systems needed for the derivations in this work are summarized below.
Theorem 1
Proof of Theorem 1
See (Vecchio and Slotine, 2013).
Theorem 2
Consider the stochastic differential equation
(11) 
with and where denotes an dimensional Wiener process. Assume that there exists a uniformly positive definite metric such that with , and that is contracting in this metric. Further assume that is uniformly upper bounded by a constant C. Then, for two trajectories and with stochastic initial conditions given by a probability distribution ,
where denotes the unit ramp function and . The expectation on the lefthand side is over the noise for all , and the expectation on the righthand side is over the distribution of initial conditions.
Proof of Theorem 2
See (Pham et al., 2009).
2.5 Assumptions
We require two main assumptions about the objective function , both of which have been employed in previous work analyzing synchronization and noise in nonlinear systems (Tabareau et al., 2010). The first is an assumption on the nonlinearity of the components of the gradient. Intuitively, this can be understood as a condition on the third derivative of the objective, or on its degree of nonquadraticity.
Assumption 1
Assume that the Hessian matrix of each component of the gradient has bounded maximum eigenvalue,
for all .The second assumption is a condition on the resistance of the distributed, coupled gradient flows studied in this work to external stochastic perturbations.
Assumption 2 Consider two dynamical systems , and where is a continuoustime stochastic process. Then implies that .
3 Synchronization and noise
In this section, we analyze the interaction between synchronization of the distributed QSGD agents and the noise they experience. We begin with a derivation of a quantitative measure of synchronization that applies to any distributed SGD algorithm, and then present the section’s primary contribution.
3.1 A measure of synchronization
We begin with a simple theorem on synchronization in the deterministic setting, which will allow us to prove a bound on synchronization in the stochastic setting using Thm. 2.
Theorem 3
Consider the coupled gradient descent system
(12) 
where represents a common external signal. Let denote the maximum eigenvalue of . For , the individual systems globally exponentially synchronize with rate .
Proof of Theorem 3
Consider the virtual system
where is an external input. The Jacobian of this system is given by
(13) 
Equation (13) is symmetric and negative definite for for any external input . Because the individual are particular solutions of this virtual system with the same input, contraction for all implies that for all and , exponentially. The contraction rate is given by .
Thm. 3 gives a simple condition on the coupling for synchronization of the individual agents in (12). Because can represent any input, Thm. 3 applies to any dynamics of the quorum variable: with , it applies to the QSGD algorithm, and with , it applies to the EASGD algorithm. Under the assumption of a contracting deterministic system, we can use the stochastic contraction results in Thm. 2 to bound the expected distance between individual agents in the stochastic setting.
Lemma 0
Assume that uniformly and that . Then, after exponential transients,
(14) 
where each is a solution of (7).
Proof of Lemma 1
We will refer to (14) as a synchronization condition.
3.2 The interaction between synchronization and noise
We now provide a mathematical characterization of how synchronization reduces the amount of noise felt by the individual QSGD agents. In addition to the assumptions stated in Sec. 2.5, we require that the SGD agents are stochastically contracting with rate and bound . This ensures that the synchronization condition in (14) derived in the previous subsection can be applied. This subsection follows the mathematical procedure first employed in (Tabareau et al., 2010) in the context of quorum sensing. Adding up the stochastic dynamics in (7), we find
We then define
so that we can rewrite
(15) 
Let denote the gradient of and let denote its Hessian. We apply the Taylor formula with integral remainder to ,
(16) 
Summing (16) over and applying the assumed bound leads to the inequality
The lefthand side of the above inequality is . Squaring both sides and summing over provides a bound on . Taking a square root of this bound, performing an expectation over the noise, and using the synchronization condition in (14), we conclude that
(17) 
The bound in (17) depends on the contraction rate of each agent , the dimensionality of space , the bound on the nonquadraticity of the objective , and the bound on the noise strength . In the limit of large , the dependence on becomes negligible. The expected effect of the disturbance term tends to zero as the coupling gain tends to infinity, corresponding to the fully synchronized limit. By Assumption in Sec. 2.5, as , the difference between trajectories of (15) and the system without the presence of tends to zero.
As a sum of independent Gaussian random variables with mean zero and covariance , the quantity can be rewritten as a single Gaussian random variable . Thus, for a given noise covariance and corresponding bound , the difference between the dynamics followed by and the noisefree gradient descent system
tends to zero as and . The limit eliminates the effect of on the dynamics of , while the limit removes the effect of the additive noise. More precisely, the impact of the noise on the center of mass trajectory is characterized by
From the triangle inequality, the trajectory of any individual agent and that of the noisefree system are similar.
3.3 Discussion
The results in this section demonstrate that for distributed SGD algorithms, the noise strength is set by the ratio parameter at the expense of a distortion term which tends to zero with synchronization. This derivation will be essential for the strictly convex convergence proofs in Sec. 5. However, in the context of current empirical evidence, it is difficult to predict if this noise reduction is positive or negative for nonconvex
optimization. Many studies have reported the importance of stochastic gradient noise in deep learning
(Poggio et al., 2017; Zhu et al., 2018; Chaudhari and Soatto, 2018; Zhang et al., 2017). Larger batches are known to cause issues with generalization, and this has been hypothesized to be due to a reduction in the noise magnitude (Keskar et al., 2016). However, recent results suggest that the generalization gap may be due to a reduction in training time with larger batches (Hoffer et al., 2017), and a recent algorithm demonstrates that low noise levels can be sufficient (Jin et al., 2018).If large batch size training with properly tuned hyperparameters does not impact performance as suggested by
(Hoffer et al., 2017), this noise protection property could supply the “best of both worlds”: training time and I/O operations are unaffected, because standard batch sizes can be used, but the effective batch size is larger.If a larger noise strength is desired with typical batch sizes, these results suggest scaling with the number of agents . Similar linear scaling rules between the learning rate and the batch size have been suggested for successful training of largescale deep networks using large batches and batch parallelism (Goyal et al., 2017).
Note that the simultaneous limits and are undesirable in the context of nonconvex optimization, as the noisefree gradient descent system will simply converge to the nearest critical point.
Finally, our synchronization setting allows a similar result to be derived when each individual agent has a different learning rate , and thus a different expected smoothing. In this case, the synchronization condition in (14) is modified to
so that
(18) 
The noise term becomes a sum of independent Gaussians each with covariance , and can be written as a single Gaussian random variable . This could allow e.g. for multiresolution optimization, where agents with larger learning rates may help avoid sharper local minima, while agents with finer learning rates may help converge to good local minima. Similarly, typical learning rate schedules could be parallelized by starting multiple agents at the same initial set of parameters with different learning rates.
4 An alternative view of distributed stochastic gradient descent
Using similar techniques as employed in the previous section, we provide an analysis of the effect of synchronization on the degree of smoothing of the loss function imposed by SGD, first derived rigorously in (Kleinberg et al., 2018). Defining an auxiliary sequence and comparing with (1) shows that , yielding
so that
This demonstrates that the sequence performs gradient descent on the loss function convolved with the scaled noise in expectation^{1}^{1}1In (Kleinberg et al., 2018), the authors group the factor of with the covariance of the noise.. Using this argument, it is shown in (Kleinberg et al., 2018) that SGD can converge to minimizers for a much larger class of functions than just convex functions, though the convolution operation can disturb the locations of the minima.
4.1 The effect of synchronization on the convolution scaling
The analysis in Sec. 3 suggests that synchronization of the variables should reduce the convolution scaling for a variable related to the center of mass, and we now make this intuition more precise. We perform this calculation for the QSGD algorithm; an analogous, slightly more technical derivation for arbitrary dynamics of the quorum variable can be found in the appendix. We have that
so that
with as usual. Define the auxiliary variable , so that
(19) 
Equation (19) can then be used to state
Taylor expanding the gradient term, we find,
which alters the discrete update to
(20) 
Converting (20) to an SDE via the procedure outlined in Sec. 2.2 then leads to
(21) 
Equation (21) says that, in expectation, performs gradient descent on a convolved loss with scaling reduced by a factor of . The reduced scaling comes at the expense of the usual disturbance term , which decreases to zero with increasing synchronization in expectation over the noise for . The convolution operation will disturb the location of the minima with respect to the true loss function ; thus, minima found by QSGD will lie closer to minima of the true loss function than standard SGD.
5 Convergence analysis
We now provide contractionbased convergence proofs for QSGD and EASGD.
5.1 QSGD convergence analysis
We first present a simple lemma describing convergence of deterministic distributed gradient descent with arbitrary coupling.
Lemma 0
Consider the alltoall coupled system of ordinary differential equations
(22) 
with for . Assume that is contracting in some metric, and that is contracting in some (not necessarily the same) metric. Then all globally exponentially converge to a critical point of .
Proof of Lemma 2
Consider the virtual system
This system is contracting by assumption, and each of the individual agents is a particular solution. The agents therefore globally exponentially synchronize. After exponential transients, the dynamics of each agent is described by the reducedorder virtual system
By assumption, this system is contracting in some metric, and has a particular solution at any critical point such that .
We now make a few remarks on Lemma 5.

This simple lemma demonstrates that any form of coupling can be used so long as the quantity is contracting. A simple choice is where is the coupling gain, corresponding to balanced equal strength alltoall coupling. Then (22) can be simplified to
(23) which is QSGD without noise. Note that alltoall coupling can thus be implemented with connections by communicating with the center of mass variable.

If is strictly convex, will be contracting in the identity metric.

If is locally strictly convex or locally geodesically convex, will be locally contracting in the identity or Riemannian metric respectively. For example, for a nonconvex objective with initializations in a convex or geodesically convex region of parameter space, we can conclude exponential convergence to a local minimizer for each agent.
If is strictly or geodesically convex, the coupling between agents provides no advantage in the deterministic setting, as they would individually contract towards the minimum regardless. For stochastic dynamics, however, we would expect coupling to improve convergence. We now demonstrate the ramifications of the results in Sec. 3 in the context of QSGD agents with the following theorem.
Theorem 6
Consider the QSGD algorithm
with for . Assume that there exists a uniform upper bound on the Hessian of the components of the negative gradient, . Assume that is bounded such that uniformly. Assume that is strictly convex, so that uniformly with . Then, after exponential transients, the expected difference between the center of mass trajectory and the global minimizer of is given by
(24) 
Proof of Theorem 4
We first sum the dynamics of the individual agents to compute the dynamics of the center of mass variable. This leads to the SDE
with defined exactly as in Sec. 3. Consider the hierarchy of virtual systems
The system is contracting by assumption, and admits a particular solution . By Thm. 1, the difference between the and systems can be bounded by after exponential transients. We have already shown in Sec. 3 that ^{2}^{2}2In Sec. 3, the denominator contained the factor rather than . Convexity of was not assumed, so that the contraction rate of the coupled system was . In this proof, convexity of implies that the contraction rate of the coupled system is ., so that
The system is contracting for any input , and the system is identical with the addition of an additive noise term. By Thm. 2, after exponential transients,
By Jensen’s inequality, and noting that is a concave, increasing function,
Finally, note that is a particular solution of the virtual system. From these observations and an application of the triangle inequality, after exponential transients,
This completes the proof.
As in Sec. 3, the bound in (24) consists of two terms. The first term originates from a lack of complete synchronization and can be decreased by increasing . The second term comes from the additive noise, and can be decreased by increasing the number of agents. The first term in the bound may be loose in the largescale machine learning setting due to the appearance of , but its dependencies on , , and are insightful, and can be exploited.
For example, state and timedependent couplings of the form are also immediately applicable with the proof methodology above. Note that the statedependence must be identical for each agent for the coupling term to cancel when computing the dynamics of . Increasing over time can significantly decrease the influence of the first term in (24), leaving only a bound essentially equivalent to linear noise averaging. For nonconvex objectives, this suggests choosing low values of in the early stages of training for exploration, and larger values near the end of training to reduce the variance of around a minimum. If accessible, local curvature information could also be used to determine when is near a local minimum, and therefore when to increase . Using state and timedependent couplings would change the duration of exponential transients, but the result in Thm. 6 would still hold.
5.2 EASGD convergence analysis
We now incorporate the additional dynamics present in the EASGD algorithm. First, we prove a lemma demonstrating convergence to the global minimum of a strongly convex function with full gradient computations.
Lemma 0
Consider the deterministic continuoustime EASGD algorithm
with for . Assume is strictly convex, so that for some . Then all agents and the quorum variable globally exponentially converge to the unique global minimum .
Proof of Lemma 3
By Thm. 3 and convexity of , the individual trajectories globally exponentially synchronize with rate . On the synchronized subspace, the overall system can be described by the reducedorder virtual system
The system Jacobian is then given by
Choosing a metric transformation , the generalized Jacobian becomes
which is clearly symmetric. A sufficient condition for negative definiteness of this matrix is that (Wang and Slotine, 2005; Horn and Johnson, 2012). Rearranging leads to the condition , which is satisfied by strict convexity of . The overall system is therefore contracting. Finally, note that where is the unique global minimum is a particular solution. Because the overall virtual system is contracting, all trajectories will globally exponentially converge to this minimum.
Just as in Thm. 6, we now turn to a convergence analysis for the EASGD algorithm, using the results of Lemma 7.
Theorem 8
Consider the continuoustime EASGD algorithm
for . Assume that is strictly convex, so that . Assume that there exists a uniform upper bound on the Hessian of the components of the negative gradient, . Let denote the contraction rate of this system in the metric with . Further assume that with a positive constant. Then, after exponential transients,
(25) 
where and .
Proof of Theorem 5
Adding up the agent dynamics as in Thm. 6, the center of mass trajectory follows
with the usual definition of . Consider the hierarchy of virtual systems,
The first system is contracting towards the unique global minimum by the assumptions of the theorem. The second system is contracting for any external input , and we have already bounded in Sec. 3 (the bound is independent of dynamics for the quorum variable  see the appendix for details). Let and . By Thm. 1, and noting that the condition number of is ,
after exponential transients. Note that . Hence we can take in Thm. 2, and
after exponential transients. Then, by Jensen’s inequality,
Combining these results and using the triangle inequality, we find that after exponential transients
where . This completes the proof.
Thm. 8 demonstrates an explicit bound on the expected deviation of both the center of mass variable and the quorum variable from the global minimizer of a strictly convex function. Unlike in Thm. 6, the bound is now applied to the combined vector rather than the center of mass variable itself, and the contraction rate is used rather than . The specific metric transformation used adds a factor of to the first quantity in the bound, so this term scales like . This suggests that should be chosen to scale with faster than . This consideration is not necessary for the QSGD algorithm.
As in the discussion after Thm. 6, the results will still hold with state and timedependent couplings of the form , and the same ideas can be used to eliminate the effect of the first term in the bound.
5.3 Specialization to a multivariate quadratic objective
In the original discretetime analysis of EASGD in (Zhang et al., 2015), it was proven that iterate averaging (Polyak and Juditsky, 1992) of leads to a Fisheroptimal variance around the minimum of a quadratic objective. We now derive an identical result in continuoustime for the QSGD algorithm, demonstrating that this optimality is independent of the additional dynamics in the EASGD algorithm.
For a multivariate quadratic with symmetric and positive definite, the stochastic dynamics of each agent can be written
with . To make the optimal result more clear, we group the factor of with , unlike in previous sections. Then satisfies
This is a multivariate OrnsteinUhlenbeck process with solution
(26) 
By assumption, is negative definite, so that the stationary expectation . The stationary variance is given by
(see, for example, (Gardiner, 2009), p.107). We now define
and can immediately state the following lemma.
Lemma 0
The averaged variable
converges weakly to a normal distribution with mean zero and standard deviation
Comments
There are no comments yet.