1 Introduction
Training deep neural networks (DNNs) requires large amounts of computational resources, often using many devices operating over days and weeks. Furthermore, with ever-increasing model size and available data, the amount of compute used in training has been noted to increase exponentially over the past few years (Amodei & Hernandez, 2018). Fortunately, the operations performed in DNN training are highly suitable for parallelization over many devices. Most commonly, the parallelization is done via "data parallelism", in which the data is divided into separate batches that are distributed between different devices during training. By using many devices, with each device highly utilized, one can train on a large number of samples without paying in runtime. However, this training is done synchronously, with synchronization between the different devices on every iteration (synchronous SGD, or SSGD). As the number of devices participating in the training grows, this synchronization becomes a prominent bottleneck.
To overcome the synchronization bottleneck, asynchronous training (ASGD) has been proposed (Dean et al., 2012; Recht et al., 2011). The basic idea behind such training is that when a device finishes calculating its batch gradients, an update to the DNN parameters is immediately performed, without waiting for other devices. This approach is known as asynchronous centralized training with a parameter server (PS) that updates the parameters (Li, 2014). The main problem in such training is that, since the PS updates the parameters whenever a device communicates with it, the parameters being used in the other devices' calculations are no longer up to date. This phenomenon is called gradient staleness, or delay as we will refer to it, and it causes a deterioration in generalization performance when using ASGD.
The generalization deterioration can be seen in Fig. 1. Unless the learning rate is significantly decreased, we can see a generalization gap: a deterioration in the generalization of ASGD (with delay $\tau > 0$) from the baseline (SSGD, $\tau = 0$) value near the steady state, i.e., after we are near full convergence, following many training epochs (2000). Due to the generalization gap demonstrated in the figure, ASGD with a parameter server and large delays is not commonly used, despite being relatively simple to implement and despite its vast potential to accelerate training. Therefore, our main goal in this work is to shed some theoretical light on the generalization gap problem and use it to improve ASGD training.
Previous theoretical analysis of ASGD (Liu et al., 2018; Lian et al., 2016, 2015; Dai et al., 2018; Dutta et al., 2018; Arjevani et al., 2018) focused on analyzing the convergence rate, i.e., the time it takes to reach a steady state, and not on the properties of the obtained solution. In contrast, in this paper we focus on understanding how the delay affects the selection of the solution we converge to, and how changes in this selection can impact generalization.
We tackle these questions from the perspective of dynamical stability. The dynamical stability approach was used in (Nar & Sastry, 2018; Wu et al., 2018) to study which minima are accessible under a specific choice of optimization algorithm and hyperparameters. In (Nar & Sastry, 2018), the authors analyzed the gradient descent (GD) algorithm as a discrete-time nonlinear dynamical system. By analyzing the Lyapunov stability of this system for different minima, they showed a relation between the learning rate choice and the accessible minima set, i.e., the subset of the local optima that GD can converge to. In (Wu et al., 2018), the authors focused on stochastic gradient descent (SGD), defined a criterion to evaluate the stability of a specific minimum, and used this criterion to show how the learning rate and batch size play a role in SGD's minima selection process.
Here, we use dynamical stability to analyze the dynamics of ASGD. Using this approach, the main question we try to tackle is:
How do learning rate, delay and momentum interact and affect the minima selection process?
Contributions
We start by modelling ASGD as a dynamical system with delay. By analyzing the stability properties of that system, we find:

- There exists an inverse linear relation between the delay $\tau$ of ASGD and the threshold learning rate, i.e., the critical learning rate at which a minimum loses stability.

- This implies that, in order to preserve the stability of a specific minimum as we change the delay, we need to scale the learning rate inversely proportionally to the delay.

- Momentum has a crucial role in determining the stability properties of ASGD. Depending on the specific training algorithm, it can either hurt or improve stability. We show how to modify momentum to stabilize ASGD training.
With our theoretical findings, we derive closed-form rules for adjusting the hyperparameters when training asynchronously. We suggest scaling the learning rate inversely with the delay in order to reach better generalization performance. In section 3 we provide experiments on popular classification tasks to support our theoretical analysis. It is interesting to contrast this simple linear relation with the more complicated picture observed for the scaling of the learning rate with the batch size. Specifically, Shallue et al. (2018) observed that the optimal learning rate scales differently with the batch size across models and datasets, i.e., there is no general "rule of thumb" that applies in all cases.
A similar scaling rule for the learning rate was proposed before (Zhang et al., 2015); however, it lacked a theoretical basis and a demonstration of its effectiveness with large effective batch sizes (i.e., the sum of the mini-batch sizes of all the workers), as we provide here.
2 Theoretical Analysis
2.1 Preliminaries and Problem Setup
Consider the problem of minimizing the empirical loss
$$\mathcal{L}(\theta) \triangleq \frac{1}{N}\sum_{n=1}^{N} f_n(\theta)\,, \qquad (1)$$
where $N$ is the number of samples, and each $f_n : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function. The update rule for minimizing $\mathcal{L}(\theta)$ in eq. 1 using asynchronous stochastic gradient descent (ASGD) is given by
$$\theta_{t+1} = \theta_t - \eta\, \nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad (2)$$
where $\eta$ is the learning rate, $n(t)$ is a random selection process of a sample from $\{1,\dots,N\}$ at iteration $t$, and $\tau$ is some delay due to the asynchronous nature of our training. For simplicity, we focus on a fixed delay $\tau$. The extension to stochastic delay is discussed in section 2.2.1.
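To make the effect of the delay concrete, here is a minimal simulation sketch (our own illustration, not code from the paper) of the delayed update on a one-dimensional quadratic with sharpness $h$, i.e., $x_{t+1} = x_t - \eta h\, x_{t-\tau}$:

```python
import numpy as np

def run_delayed_gd(eta, h, tau, steps=800, x0=1.0):
    """Iterate x_{t+1} = x_t - eta*h*x_{t-tau} and return the final |x|."""
    x = [x0] * (tau + 1)          # constant history: x_{-tau} = ... = x_0
    for _ in range(steps):
        x.append(x[-1] - eta * h * x[-(tau + 1)])
    return abs(x[-1])

h, tau = 1.0, 5
# Threshold on eta*h for this delay (derived later in the paper).
thr = 2 * np.sin(np.pi / (2 * (2 * tau + 1)))
print(run_delayed_gd(0.5 * thr, h, tau))  # far below threshold: decays toward 0
print(run_delayed_gd(2.0 * thr, h, tau))  # above threshold: blows up
```

Even with a fixed quadratic, the same learning rate that is stable without delay can diverge once a delay is introduced, which is the phenomenon analyzed next.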
Our main goal is understanding how different factors, such as the gradient staleness and the learning rate, affect the minima selection process in neural networks. In other words, we would like to understand how the different hyperparameters interact to divide the minima of the loss into two sets: those we can converge to and those we cannot converge to.
To analyze this, we suppose we are in the vicinity of some minimum point $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$. Given some values of $\eta$ and $\tau$, we ask whether or not we can converge to $\theta^*$, using the theory of dynamical stability. As we shall see, if the learning rate or delay are too high, then this minimum loses stability, i.e., ASGD cannot converge to $\theta^*$, in expectation.
Specifically, we examine the expectation of the linearized dynamics around $\theta^*$. We show in appendix A that, to ensure stability, it is sufficient that the following one-dimensional equation is stable:
$$x_{t+1} = x_t - m\, x_{t-\tau}\,, \qquad (3)$$
where we defined $m \triangleq \eta h$ and the sharpness term $h$, the maximal singular value of the Hessian $\nabla^2\mathcal{L}(\theta^*)$. Additionally, we show in appendix A that the characteristic equation of eq. 3 is
$$z^{\tau+1} - z^{\tau} + m = 0\,. \qquad (4)$$
2.2 Stability Analysis
For given $\eta$ and $\tau$, we would like to determine if the minimum point $\theta^*$ is asymptotically stable (Oppenheim, 1999) in expectation. This will enable us to divide the minima into two sets: those we might converge to, since they are stable, and those we cannot possibly converge to (except from a measure-zero set of initializations), since they are unstable. We use the characteristic equation (eq. 4) to define the stability criterion:
Definition 1 (Stability criterion)
We would like to find conditions describing when and how we can change the hyperparameters while maintaining the same stability criterion. This, for example, will enable us to understand how to compensate for the delay introduced by asynchrony using the learning rate. In order to analyze stability we examine the characteristic equation in eq. 4 and ask: when does the maximal root of this polynomial have unit magnitude?
2.2.1 The Interaction Between Learning Rate and Delay
Recall the characteristic equation
$$z^{\tau+1} - z^{\tau} + m = 0\,. \qquad (5)$$
This polynomial has $\tau+1$ roots. In order to ensure stability we require that the root with the maximal magnitude, i.e., the maximal root, lies inside the unit circle. We want to find the threshold learning rate, i.e., the learning rate at which the maximal root is exactly on the unit circle, so we are at the threshold of stability. In appendix B, we show that this requires that
$$\eta h = 2\sin\left(\frac{\pi}{2(2\tau+1)}\right)\,. \qquad (6)$$
Using a Taylor approximation, we get
$$\eta h \approx \frac{\pi}{2\tau+1}\,. \qquad (7)$$
We observe numerically that the error between the exact solution (eq. 6) and the linear approximation (eq. 7) is small (a few percent or less) for $\tau \geq 1$. Additionally, in the appendix Fig. 6 we demonstrate the high accuracy of the analytic approximation of the relation between $\eta$ and $\tau$, compared to a numerical evaluation of the value of $m = \eta h$ for which the maximal root of eq. 5 is on the unit circle.
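The threshold and its linear approximation can be checked directly against the roots of the characteristic polynomial; a small numerical sketch (our own check, using `numpy.roots` on eq. 5):

```python
import numpy as np

def max_root_magnitude(m, tau):
    """Largest root magnitude of the characteristic polynomial z^(tau+1) - z^tau + m."""
    coeffs = [1.0, -1.0] + [0.0] * (tau - 1) + [m]
    return max(abs(r) for r in np.roots(coeffs))

for tau in [1, 5, 10, 25]:
    exact = 2 * np.sin(np.pi / (2 * (2 * tau + 1)))   # threshold of eq. 6
    approx = np.pi / (2 * tau + 1)                    # linear approximation of eq. 7
    assert max_root_magnitude(0.99 * exact, tau) < 1.0   # just below threshold: stable
    assert max_root_magnitude(1.01 * exact, tau) > 1.0   # just above threshold: unstable
    print(tau, exact, approx, abs(approx - exact) / exact)
```

The printed relative error of the approximation shrinks as the delay grows, consistent with the observation above.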
Implications: We can see that to maintain stability for a given minimum point (or a given sharpness value) the learning rate should be kept inversely proportional to the delay. This implies that given some minimum which is stable for $\tau = 0$ with a corresponding learning rate $\eta_0$, we can evaluate the learning rate which will ensure this minimum remains stable for larger delay values. To do this, we first estimate $h$ using eq. 6: $h = 2\sin(\pi/2)/\eta_0 = 2/\eta_0$. Note that we do not use eq. 7 to evaluate $h$ since we are interested in $\tau = 0$, while the relation in eq. 7 is only a good approximation for $\tau \geq 1$. Next, given some delay $\tau$, we substitute $h$ into eq. 7 in order to evaluate a learning rate value for which the stability of all minima is maintained, as in
$$\eta(\tau) = \frac{\pi}{(2\tau+1)\,h} = \frac{\pi\,\eta_0}{2(2\tau+1)}\,. \qquad (8)$$
Related empirical results: In section 3.1 we show empirically the existence of a stability threshold for different values of delay and compare it with our theoretical threshold.
Stochastic delay: In appendix C we extend our theoretical analysis to derive the characteristic equation for ASGD with stochastic delay, i.e., when $\tau$ is drawn from some discrete distribution. We analyze two cases:


- Discrete uniform distribution.
- Gaussian distribution discrete approximation. In Zhang et al. (2015), the authors empirically observed a delay distribution that resembles a Gaussian distribution.
For both cases, we show numerically that the inverse relation between the learning rate and the expected delay still applies.
2.2.2 The Effects of Momentum on Stability
The optimization step of ASGD with momentum is defined as follows:
$$v_{t+1} = \gamma v_t + (1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}\,, \qquad (9)$$
where $\theta_t$ are the weights, $v_t$ and $\nabla f_{n(t)}(\theta_{t-\tau})$ are the velocity and (delayed) gradients at time $t$, respectively, $\gamma$ is the momentum parameter, $\eta$ is the step size and $\tau$ is the delay. The constant term $(1-\gamma)$ which multiplies the gradients is called "dampening".
In appendix D we extend our theoretical analysis to derive the characteristic equation for ASGD with momentum. In addition, in section D.2.1 we show numerically that an inverse relation between the learning rate and the delay also applies when using momentum. Particularly,


- We need to keep the learning rate inversely proportional to the delay.

- For a given delay, larger momentum values require a smaller learning rate for stability.
Therefore, the range of stable learning rates shrinks as the momentum parameter $\gamma$ increases, which can make tuning the learning rate difficult, especially with high delays. These results suggest that, to ensure stability when training with high delay, it is recommended to work without momentum.
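This destabilizing effect can be illustrated numerically. The sketch below is our own illustration: it assumes the linearized delayed heavy-ball recursion with dampening, $x_{t+1} = x_t + \gamma(x_t - x_{t-1}) - \eta h (1-\gamma)\, x_{t-\tau}$ (our reading of eq. 9), and bisects for the stability threshold on $m = \eta h$:

```python
import numpy as np

def momentum_threshold(gamma, tau, hi=4.0, iters=60):
    """Bisect for the largest stable m = eta*h of the delayed heavy-ball recursion
    x_{t+1} = (1+gamma) x_t - gamma x_{t-1} - m*(1-gamma) x_{t-tau}.
    Assumes tau >= 2 and that the stable set is an interval starting at m = 0."""
    def stable(m):
        coeffs = [1.0, -(1.0 + gamma), gamma] + [0.0] * (tau - 2) + [m * (1.0 - gamma)]
        return max(abs(r) for r in np.roots(coeffs)) < 1.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stable(mid):
            lo = mid
        else:
            hi = mid
    return lo

tau = 5
print(momentum_threshold(0.0, tau))  # ~0.285, matches 2*sin(pi/(2*(2*tau+1)))
print(momentum_threshold(0.9, tau))  # smaller: with delay, momentum shrinks the stable range
```

At $\gamma = 0$ the bisection recovers the closed-form threshold of eq. 6, while at $\gamma = 0.9$ the stable range of $m$ is noticeably smaller, consistent with the bullet points above.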
Relation to previous results: The conclusion that it is recommended to turn off momentum when using ASGD with high delay values is consistent with the results of Mitliagkas et al. (2017) and Liu et al. (2018). In Liu et al. (2018), for streaming PCA, the authors showed it is necessary to reduce momentum in order to ensure convergence and acceleration through asynchrony. In Mitliagkas et al. (2017), the authors suggested that asynchrony adds implicit momentum to the training process and therefore it is necessary to reduce momentum when working asynchronously.
Next, we discuss a new method for introducing momentum into asynchronous training. This method enables asynchronous training where increasing the momentum parameter improves the stability.
Shifted momentum. As discussed, training asynchronously is less stable when momentum is used. We propose a new method, called shifted momentum, for training asynchronously, where the momentum benefits the training process and particularly its stability. We observe this method tends to improve the convergence stability compared to training without momentum (or with the original momentum), as demonstrated in appendix Fig. 11. We suggest that instead of exchanging the gradient terms, as in eq. 9, we exchange the entire velocity term. This way, each worker calculates the velocity term based on its own gradients and momentum buffer. The formulation of this method is given in the following equation:
$$v_{t+1} = \gamma v_{t-\tau} + (1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}\,,$$
or
$$\theta_{t+1} = \theta_t + \gamma\left(\theta_{t-\tau} - \theta_{t-\tau-1}\right) - \eta(1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,. \qquad (10)$$
In appendix D we show that the characteristic equation of eq. 10 is
$$z^{\tau+2} - z^{\tau+1} + \left(m(1-\gamma) - \gamma\right) z + \gamma = 0\,, \qquad (11)$$
where we use the sharpness term $m = \eta h$ as before.
We would like to characterize how the learning rate, delay and momentum affect minima stability. Thus, for each momentum value, we numerically evaluate the value of $m = \eta h$ for which the maximal root of eq. 11 is on the unit circle, i.e., when we are on the threshold of stability. In Fig. 4 we see that the inverse relation between the learning rate and the delay still applies when using shifted momentum. Moreover, we can see that, in contrast to regular momentum, with shifted momentum a larger momentum value enables us to work with a larger step size. Therefore, increasing the momentum improves the stability.
Relation to previous results: The analysis of delayed gradients and velocity might also shed light on the stability of recent asynchronous training approaches in which the workers communicate through weight averaging (Lian et al., 2017; Assran et al., 2019). These methods are similar to ours in the sense that each worker holds its own momentum buffer.
3 Experiments
In this section, we provide experiments to support our theoretical findings. In the following subsections we: (1) demonstrate the relationship between hyperparameters and minima selection, and (2) show our findings can help improve asynchronous training. Details about the experiments and the implementation can be found in appendix E.¹

¹Code will be available at https://github.com/papersubmissions/delay_stability
3.1 How Hyperparameters Affect Minima Selection
In this section, we demonstrate how the delay and learning rate affect the accessible minima set. We train with a fixed learning rate.
Minima stability. To demonstrate how delay and learning rate change the stability of a minimum, we start from a model that converged to some minimum and train with different values of delay and learning rate to see if the algorithm leaves that minimum, i.e., the minimum becomes unstable. We do so by training a VGG11 on CIFAR10 for 10,000 epochs, until we reach a steady state. This training is done without delay, momentum, or weight decay. Next, we introduce a delay $\tau$, change the learning rate, and continue to train the model. Fig. 2 shows the number of epochs it takes to leave the steady state for a given delay $\tau$ and learning rate $\eta$. We observe that for certain $(\eta, \tau)$ pairs the algorithm stays in the minimum (below the black circle), while for others it leaves that minimum after some number of epochs (above the black circle). Importantly, we can see in the right panel of Fig. 2 that, as predicted by the theory, the inverse learning rate $1/\eta_{\max}$, where $\eta_{\max}$ is the maximal learning rate at which we did not diverge, scales linearly with the delay $\tau$. Additional details are given in appendix F.
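Comparing such empirical thresholds against the theory requires the sharpness $h$ at the minimum. A minimal sketch of estimating it with power iteration (our own illustration, here with an explicit Hessian-vector product oracle on a toy quadratic; in practice the oracle would be computed with automatic differentiation):

```python
import numpy as np

def sharpness_power_iteration(hvp, dim, iters=100, seed=0):
    """Estimate the top Hessian eigenvalue given a Hessian-vector product oracle."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return float(v @ hvp(v))   # Rayleigh quotient at the converged direction

# Toy quadratic loss 0.5 * x^T H x with known top eigenvalue 3.0.
H = np.diag([3.0, 1.0, 0.5])
h_est = sharpness_power_iteration(lambda v: H @ v, dim=3)
print(h_est)  # ~3.0
```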
Generalization with delay. With delay, we need to adjust the hyperparameters in order to maintain the accessible minima set. However, even if a certain $(\eta, \tau)$ pair changes the accessible set, it is possible that there are still accessible minima that generalize well. We perform experiments to investigate what type of generalization we can expect from training with different $(\eta, \tau)$ pairs. We examine this by training a VGG11 on CIFAR10 with different $(\eta, \tau)$ pairs until we reach a steady state (around 6,000 epochs). We use plain SGD without momentum or weight decay. We compare the generalization performance achieved by each pair of learning rate and delay. We repeat the experiment for each pair four times and report mean and standard deviation values. The results are presented in Fig. 3. We can see a linear scaling between the delay values and the empirical learning-rate stability threshold, i.e., the learning rate at which the error diverges. In addition, we see that the smallest error (with small variance) is obtained at a learning rate which is somewhat smaller than the learning rate at the empirical threshold of stability, as expected
(LeCun et al., 2012).

Shifted momentum stability. In subsection 2.2.2 we introduced shifted momentum as a modification of the standard momentum update, in which increasing the momentum value ($\gamma$) improves stability. We validate the stability properties of this method by training a fully connected model on MNIST. The model has 3 layers, 1024 neurons in each layer, with ReLU activations. Again, training is done synchronously with a large batch and no weight decay until it reaches a steady state. Then, after reaching a minimum, we continued to train the model, each time with a different triplet $(\eta, \tau, \gamma)$. We found the maximum learning rate for a given $\tau$ and $\gamma$ for which training does not diverge, i.e., this maximum learning rate is at the edge of stability. Fig. 4 depicts the empirical results. The value of $h$ is estimated at the minimum point using power iteration. Two important results emerge from the graph. First, in order to maintain the stability threshold, the learning rate has to be inversely proportional to the delay, as described in subsection 2.2. Second, with increased momentum values, training becomes more stable; that is, we can use larger learning rates with larger momentum values. This stands in contrast to the analysis of the original momentum: as we show numerically in appendix D, increasing the momentum value in the standard momentum algorithm decreases stability, and hence we need to decrease the learning rate as the momentum value grows larger.

3.2 Improving Asynchronous Training
So far we focused on the steady-state behavior of ASGD, i.e., after many iterations. However, one of the prominent motivations for training asynchronously is to speed up training. In this subsection, we examine whether our findings are also relevant for improving the accuracy of models trained asynchronously with little to no excess budget of epochs, compared to the equivalent large-batch regime. In large-batch training, the learning rate is scaled as a function of the ratio between the large batch size and the small one. As Shallue et al. (2018) pointed out, we can expect a different learning rate scaling for different models and datasets. We start with a set of models and datasets that follow a square-root scaling of the learning rate with the batch size (Hoffer et al., 2017). Recall that our inversely linear scaling rule (i.e., $\eta \propto 1/\tau$) applies to the large-batch learning rate. We demonstrate the validity of our inversely linear learning rate scaling by comparing our method with one that does not account for the delay $\tau$. The results are presented in Table 1. We can see that using the small-batch regime with delay (ASGD) does not converge at all. With our inversely linear learning rate scaling (+LR), there is a small gap from the equivalent large batch. If we also increase the budget of epochs by 30%, our optimization regime generalizes better than the large batch (+30%). Because ASGD has better runtime performance compared to SSGD (Dutta et al., 2018; Assran et al., 2019), such an excess in epochs may still result in improved overall wall-clock time (this depends on the hardware details).
Network | Dataset | SB | LB | ASGD | +LR | +30%
Resnet44 (He et al., 2016) | Cifar10 | 92.87% | 90.42% | 10.0% | 88.57% | 91.65%
VGG11 (Simonyan & Zisserman, 2014) | Cifar10 | 89.8% | 84.55% | 10.0% | 61.41% | 84.74%
C3 (Keskar et al., 2016) | Cifar100 | 60.0% | 57.89% | 1.0% | 54.02% | 58.94%
WResnet164 (Zagoruyko & Komodakis, 2016) | Cifar100 | 76.78% | 72.85% | 17.8% | 70.35% | 73.81%
ImageNet.
We next experiment with ResNet50 and ImageNet, which is a popular benchmark for large-batch and asynchronous training because of its complexity and scale. Our focus is asynchronous centralized training with high delay. To the best of our knowledge, the generalization gap has not been closed yet in this setting. This is an important setting because of its simplicity and its scalability potential. Optimization details can be found in appendix E.1. In Fig. 5, we compare three ASGD optimizations trained on ImageNet. As can be seen in the figure, the generalization gap decreases substantially just by turning off momentum. This corresponds with our analysis and previous results about the relation between momentum and delay. Shifted momentum similarly improves generalization, as can be seen in the right panel. This validates our analysis of the beneficial role the momentum parameter has when using shifted momentum. The learning rate is first scaled linearly with the total batch size of all the workers, as in Goyal et al. (2017). Our inverse linear scaling with the delay cancels this increase, and we get a learning rate similar to regular small-batch training. In contrast, when we kept the linearly scaled learning rate without the inverse scaling with the delay, training did not converge at all.
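The learning-rate bookkeeping described above amounts to a one-line rule; the sketch below is our own illustration, and assumes the average delay in centralized ASGD is roughly the number of workers:

```python
def asgd_learning_rate(eta_small, workers):
    """Linear batch-size scaling (Goyal et al., 2017) followed by the inverse
    scaling with the delay; assumes the average delay with a parameter server
    is roughly the number of workers."""
    eta_large = eta_small * workers   # linear scaling with the total batch size
    tau = workers                     # approximate delay in centralized ASGD
    return eta_large / tau            # inverse linear scaling with the delay

print(asgd_learning_rate(0.1, 8))  # back to the small-batch rate, 0.1
```

When the delay matches the number of workers, the two scalings cancel and we recover the small-batch learning rate, as described in the text.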
References
 Amodei & Hernandez (2018) Dario Amodei and Danny Hernandez. Ai and compute. https://openai.com/blog/aiandcompute/, 2018. Accessed: 20190523.
 Arjevani et al. (2018) Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates. jun 2018.

 Assran et al. (2019) Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic Gradient Push for Distributed Deep Learning. 2019.
 Dai et al. (2018) Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric Xing. Toward Understanding the Impact of Staleness in Distributed Machine Learning, sep 2018.
 Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. Large scale distributed deep networks. Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Ieee, 2009.
 Dutta et al. (2018) Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and Stale Gradients Can Win the Race: ErrorRuntime Tradeoffs in Distributed SGD. mar 2018.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pp. 1731–1741, 2017.
 Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. sep 2016.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Lecun et al. (1998) Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 LeCun et al. (2012) Yann A LeCun, Léon Bottou, Genevieve B Orr, and KlausRobert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.

 Li (2014) Mu Li. Scaling Distributed Machine Learning with the Parameter Server. Proceedings of the 2014 International Conference on Big Data Science and Computing (BigDataScience '14), pp. 1–1, 2014.
 Lian et al. (2015) Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.
 Lian et al. (2016) Xiangru Lian, Huan Zhang, ChoJui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zerothorder to firstorder. In Advances in Neural Information Processing Systems, pp. 3054–3062, 2016.
 Lian et al. (2017) Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
 Liu et al. (2018) Tianyi Liu, Shiyang Li, Jianping Shi, Enlu Zhou, and Tuo Zhao. Towards understanding acceleration tradeoff between momentum and asynchrony in nonconvex stochastic optimization. In Advances in Neural Information Processing Systems, pp. 3686–3696, 2018.
 Mitliagkas et al. (2017) Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Re. Asynchrony begets momentum, with an application to deep learning. 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016, pp. 997–1004, 2017.
 Nar & Sastry (2018) Kamil Nar and S Shankar Sastry. Step Size Matters in Deep Learning. In NIPS, 2018.
 Oppenheim (1999) Alan V Oppenheim. Discretetime signal processing. Pearson Education India, 1999.
 Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.
 Shallue et al. (2018) Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha SohlDickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Wu et al. (2018) Lei Wu, Chao Ma, and Weinan E. How SGD Selects the Global Minima in Overparameterized Learning: A Dynamical Stability Perspective. In NIPS, 2018.
 Yan et al. (2018) Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning. arXiv preprint arXiv:1808.10396, 2018.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhang et al. (2015) Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Stalenessaware asyncsgd for distributed deep learning. arXiv preprint arXiv:1511.05950, 2015.
Appendix A Stability Analysis for Eq. 2
In this section, we would like to analyze the stability of eq. 2. In particular, we would like to examine if a given minimum point $\theta^*$ is stable. To analyze this we suppose we are in the vicinity of $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around $\theta^*$:
$$\theta_{t+1} = \theta_t - \eta\left[\nabla f_{n(t)}(\theta^*) + \nabla^2 f_{n(t)}(\theta^*)\left(\theta_{t-\tau} - \theta^*\right)\right]\,,$$
where we denoted by $\nabla^2 f_{n(t)}(\theta^*)$ the Hessian of the sampled loss at $\theta^*$ and changed the notation of the time index in $\theta(t)$ to the subscript, i.e., $\theta_t$.
Examining the expectation of this equation we obtain
$$\mathbb{E}\left[\theta_{t+1}\right] = \mathbb{E}\left[\theta_t\right] - \eta H\, \mathbb{E}\left[\theta_{t-\tau} - \theta^*\right]\,,$$
where we used $\mathbb{E}_n\left[\nabla f_n(\theta^*)\right] = \nabla\mathcal{L}(\theta^*)$, the fact that $\theta^*$ is a minimum point and thus $\nabla\mathcal{L}(\theta^*) = 0$, and denoted $H \triangleq \mathbb{E}_n\left[\nabla^2 f_n(\theta^*)\right] = \nabla^2\mathcal{L}(\theta^*)$.
Using the variable change $x_t \triangleq \mathbb{E}\left[\theta_t\right] - \theta^*$ we obtain
$$x_{t+1} = x_t - \eta H\, x_{t-\tau}\,. \qquad (12)$$
From the Spectral Factorization Theorem we can write $H = Q \Lambda Q^{\top}$, where $Q$ is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ is a diagonal matrix. Substituting this into eq. 12 we obtain:
$$x_{t+1} = x_t - \eta\, Q \Lambda Q^{\top} x_{t-\tau}\,.$$
Multiplying the last equation by $Q^{\top}$ from the left, denoting $y_t \triangleq Q^{\top} x_t$, and using the fact that $Q$ is an orthogonal matrix, i.e., $Q^{\top} Q = I$, we get:
$$y_{t+1} = y_t - \eta \Lambda\, y_{t-\tau}\,.$$
Recall that $\Lambda$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$$y_{t+1}^{(i)} = y_t^{(i)} - \eta \lambda_i\, y_{t-\tau}^{(i)}\,, \quad i = 1, \dots, d\,,$$
where $y_t^{(i)}$ denotes the $i$-th component of the vector $y_t$. In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that
$$y_{t+1}^{(1)} = y_t^{(1)} - \eta \lambda_1\, y_{t-\tau}^{(1)} \qquad (13)$$
is stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$. This is true since the dynamics for $\lambda_1$ are the first to lose stability.
To simplify notations we define $m \triangleq \eta h$ and the sharpness term $h \triangleq \lambda_1$, the maximal singular value of $\nabla^2\mathcal{L}(\theta^*)$. Using these notations, we can rewrite eq. 13 as:
$$x_{t+1} = x_t - m\, x_{t-\tau}\,. \qquad (14)$$
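The reduction above, from the $d$-dimensional delayed system to independent per-eigenvalue scalar systems, can be sanity-checked numerically; a small sketch (our own check) with a random symmetric $H$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, tau, eta, steps = 4, 3, 0.005, 200

# Random positive-definite "Hessian" H = Q diag(lam) Q^T.
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)
lam, Q = np.linalg.eigh(H)

# Full d-dimensional delayed dynamics: x_{t+1} = x_t - eta * H x_{t-tau}.
x = [rng.standard_normal(d)] * (tau + 1)
for _ in range(steps):
    x.append(x[-1] - eta * H @ x[-(tau + 1)])

# Decoupled scalar dynamics on y = Q^T x, one recursion per eigenvalue.
y = [Q.T @ x[0]] * (tau + 1)
for _ in range(steps):
    y.append(y[-1] - eta * lam * y[-(tau + 1)])

# Rotating back recovers the full trajectory, up to round-off.
print(np.max(np.abs(Q @ y[-1] - x[-1])))
```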
Appendix B Finding the Condition for the Maximal Root of the Characteristic Equation (Eq. 4) to be on the unit circle
To simplify the notation, we denote $m \triangleq \eta h$. Recall the characteristic equation (eq. 5)
$$z^{\tau+1} - z^{\tau} + m = 0\,.$$
Since we are looking for a solution on the unit circle, we substitute $z = e^{i\omega}$ into the equation, where $i$ is the unit imaginary number and $\omega$ is some angle. We obtain:
$$e^{i(\tau+1)\omega} - e^{i\tau\omega} + m = 0\,.$$
Using $e^{i\omega} = \cos\omega + i\sin\omega$ we get
$$\cos\left((\tau+1)\omega\right) - \cos\left(\tau\omega\right) + m = 0\,, \qquad \sin\left((\tau+1)\omega\right) - \sin\left(\tau\omega\right) = 0\,.$$
Using trigonometric product-sum identities we obtain
$$m = 2\sin\left(\frac{(2\tau+1)\omega}{2}\right)\sin\left(\frac{\omega}{2}\right)\,, \qquad 2\cos\left(\frac{(2\tau+1)\omega}{2}\right)\sin\left(\frac{\omega}{2}\right) = 0\,. \qquad (15)$$
If $\sin(\omega/2) = 0$ then we get from the first equation $m = 0$. This is a contradiction since we assume that $m > 0$. Thus, $\sin(\omega/2) \neq 0$. This implies from the second equation in eq. 15 that
$$\cos\left(\frac{(2\tau+1)\omega}{2}\right) = 0 \;\Rightarrow\; \omega = \frac{(2k+1)\pi}{2\tau+1}\,, \quad k \in \mathbb{Z}\,.$$
Substituting this result into the first equation in eq. 15 and using $m > 0$ we obtain that the possible solutions must satisfy
$$m = 2\sin\left(\frac{\omega}{2}\right)\,, \qquad \omega = \frac{(4k+1)\pi}{2\tau+1}\,.$$
Also, by symmetry we can eliminate symmetric solutions and restrict to $\omega \in (0, \pi]$. Note that for $\omega \in (0, \pi]$, $\sin(\omega/2)$ is monotonically increasing with $\omega$. Since we are interested in the smallest value of $m$ for which a root first reaches the unit circle, we are looking for the minimal $\omega$ value. This value corresponds to $k = 0$, i.e., $\omega = \pi/(2\tau+1)$. Thus, the maximal root satisfies:
$$m = \eta h = 2\sin\left(\frac{\pi}{2(2\tau+1)}\right)\,.$$
Appendix C Stability Analysis for Eq. 2 with Stochastic Delay
We assume that $\tau$ is drawn from some discrete distribution, independently at each iteration. In this section, we would like to analyze the stability of eq. 2 in this setting. In particular, we would like to examine if a given minimum point $\theta^*$ is stable. To analyze this we suppose we are in the vicinity of $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around $\theta^*$:
$$\theta_{t+1} = \theta_t - \eta\left[\nabla f_{n(t)}(\theta^*) + \nabla^2 f_{n(t)}(\theta^*)\left(\theta_{t-\tau} - \theta^*\right)\right]\,,$$
where we changed the notation of the time index in $\theta(t)$ to the subscript, i.e., $\theta_t$. Examining the expectation of this equation (over both the sampling and the delay) we obtain
$$\mathbb{E}\left[\theta_{t+1}\right] = \mathbb{E}\left[\theta_t\right] - \eta H \sum_{k} p_k\, \mathbb{E}\left[\theta_{t-k} - \theta^*\right]\,,$$
where $p_k \triangleq \Pr(\tau = k)$, we used $\mathbb{E}_n\left[\nabla f_n(\theta^*)\right] = \nabla\mathcal{L}(\theta^*)$ together with the independence of the delay, the fact that $\theta^*$ is a minimum point and thus $\nabla\mathcal{L}(\theta^*) = 0$, and denoted $H \triangleq \nabla^2\mathcal{L}(\theta^*)$.
Using the variable change $x_t \triangleq \mathbb{E}\left[\theta_t\right] - \theta^*$ we obtain
$$x_{t+1} = x_t - \eta H \sum_{k} p_k\, x_{t-k}\,. \qquad (16)$$
From the Spectral Factorization Theorem we can write $H = Q \Lambda Q^{\top}$, where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Substituting this into eq. 16, multiplying by $Q^{\top}$ from the left, denoting $y_t \triangleq Q^{\top} x_t$, and using the fact that $Q$ is an orthogonal matrix, i.e., $Q^{\top} Q = I$, we get:
$$y_{t+1} = y_t - \eta \Lambda \sum_{k} p_k\, y_{t-k}\,.$$
Recall that $\Lambda$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$$y_{t+1}^{(i)} = y_t^{(i)} - \eta \lambda_i \sum_{k} p_k\, y_{t-k}^{(i)}\,,$$
where $y_t^{(i)}$ denotes the $i$-th component of the vector $y_t$.
In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that
$$y_{t+1}^{(1)} = y_t^{(1)} - \eta \lambda_1 \sum_{k} p_k\, y_{t-k}^{(1)} \qquad (17)$$
is stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$. This is true since the dynamics for $\lambda_1$ are the first to lose stability. To simplify notations we define $m \triangleq \eta h$ and the sharpness term $h \triangleq \lambda_1$, the maximal singular value of $\nabla^2\mathcal{L}(\theta^*)$. Using these notations, we can rewrite eq. 17 as:
$$x_{t+1} = x_t - m \sum_{k=0}^{K} p_k\, x_{t-k}\,, \qquad (18)$$
where $K$ is the maximal delay. The characteristic equation of eq. 18 is
$$z^{K+1} - z^{K} + m \sum_{k=0}^{K} p_k\, z^{K-k} = 0\,. \qquad (19)$$
C.1 Stability Analysis
We would like to find conditions describing when and how we can change the hyperparameters while maintaining the same stability criterion (definition 1). In order to analyze stability we examine the characteristic equation in eq. 19 and ask: when does the maximal root of this polynomial have unit magnitude?

- Discrete uniform distribution, $p_k = \frac{1}{K+1}$ for $k = 0, \dots, K$:
$$z^{K+1} - z^{K} + \frac{m}{K+1} \sum_{k=0}^{K} z^{K-k} = 0\,. \qquad (20)$$
- Gaussian distribution discrete approximation: $p_k \propto \exp\left(-\frac{(k-\mu)^2}{2\sigma^2}\right)$, normalized so that $\sum_{k=0}^{K} p_k = 1$, where $\mu$ is the mean delay:
$$z^{K+1} - z^{K} + m \sum_{k=0}^{K} p_k\, z^{K-k} = 0\,.$$
Appendix D Theoretical Analysis with Momentum
In this section, we repeat the steps of the analysis in section 2 for ASGD with momentum.
d.1 Preliminaries and Problem Setup
Consider the problem of minimizing the empirical loss
$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(\theta; x_i)$ (21)
where $n$ is the number of samples, and $f(\cdot\,; x_i): \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function. The update rule for minimizing eq. 21 using asynchronous stochastic gradient descent with momentum (AMSGD) is given by
$u_{t+1} = \mu u_t - \eta \nabla f(\theta_{t-\tau}; x_{i_t}), \qquad \theta_{t+1} = \theta_t + u_{t+1},$
or
$\theta_{t+1} = \theta_t + \mu(\theta_t - \theta_{t-1}) - \eta \nabla f(\theta_{t-\tau}; x_{i_t}),$ (22)
where $\mu$ is the momentum, $\eta$ is the learning rate, $i_t$ is a random selection of a sample from $\{1, \ldots, n\}$ at iteration $t$, and $\tau$ is some delay due to the asynchronous nature of our training.
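To make the delayed update of eq. 22 concrete, the following toy example (an illustrative sketch using an arbitrary one-dimensional quadratic $f(\theta) = \tfrac{1}{2}\lambda\theta^2$, not an experiment from the paper) runs the AMSGD recursion with a stale gradient and checks whether it converges to the minimum at zero:

```python
def amsgd_quadratic(lr, mu, tau, lam=1.0, steps=3000):
    """Run theta_{t+1} = theta_t + mu*(theta_t - theta_{t-1}) - lr*grad(theta_{t-tau})
    for f(theta) = 0.5 * lam * theta^2 (so grad(theta) = lam * theta).
    Returns the final distance from the minimum (inf if diverged)."""
    th = [1.0] * (max(tau, 1) + 1)     # th[-1] = theta_t, th[-2] = theta_{t-1}
    for _ in range(steps):
        grad = lam * th[-1 - tau]      # stale gradient, evaluated tau steps back
        new = th[-1] + mu * (th[-1] - th[-2]) - lr * grad
        th.append(new)
        th.pop(0)
        if abs(new) > 1e12:            # clearly diverging; stop early
            return float('inf')
    return abs(th[-1])

print(amsgd_quadratic(lr=0.1, mu=0.5, tau=2) < 1e-6)           # True: converges
print(amsgd_quadratic(lr=1.0, mu=0.5, tau=2) == float('inf'))  # True: diverges
```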
Our main goal is understanding how different factors, such as the gradient staleness $\tau$, and hyperparameters, such as the momentum $\mu$ and the learning rate $\eta$, affect the minima selection process in neural networks. In other words, we would like to understand how the different hyperparameters interact to divide the global minima of our network into two sets: those we can converge to and those we cannot converge to.
To analyze this we suppose we are in the vicinity of a minimum point $\theta^*$, which implies $\nabla L(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around some minimum point $\theta^*$:
$\theta_{t+1} = \theta_t + \mu(\theta_t - \theta_{t-1}) - \eta H(\theta_{t-\tau} - \theta^*)$
Examining the first moment of this equation we obtain
$\mathbb{E}[\theta_{t+1}] = \mathbb{E}[\theta_t] + \mu\left(\mathbb{E}[\theta_t] - \mathbb{E}[\theta_{t-1}]\right) - \eta H\left(\mathbb{E}[\theta_{t-\tau}] - \theta^*\right),$
where $H = \nabla^2 L(\theta^*)$, since $\theta^*$ is a minimum point and $\nabla L(\theta^*) = 0$.
Using the variable change $v_t = \mathbb{E}[\theta_t] - \theta^*$ we obtain
$v_{t+1} = v_t + \mu(v_t - v_{t-1}) - \eta H v_{t-\tau}$ (23)
From the Spectral Factorization Theorem we can write $H = U \Lambda U^{\top}$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Substituting this into eq. 23 we obtain:
$v_{t+1} = v_t + \mu(v_t - v_{t-1}) - \eta U \Lambda U^{\top} v_{t-\tau}$
Multiplying the last equation with $U^{\top}$ from the left, denoting $w_t = U^{\top} v_t$, and using the fact that $U$ is an orthogonal matrix, i.e., $U^{\top} U = I$, we get:
$w_{t+1} = w_t + \mu(w_t - w_{t-1}) - \eta \Lambda w_{t-\tau}$
Recall that $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$w_{t+1}^{(i)} = w_t^{(i)} + \mu\left(w_t^{(i)} - w_{t-1}^{(i)}\right) - \eta \lambda_i w_{t-\tau}^{(i)}, \quad i = 1, \ldots, d,$
where $w^{(i)}$ denotes the $i$-th component of the vector $w$.
In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that the dynamics of
$w_{t+1}^{(1)} = w_t^{(1)} + \mu\left(w_t^{(1)} - w_{t-1}^{(1)}\right) - \eta \lambda_1 w_{t-\tau}^{(1)}$ (24)
are stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. This is true since the dynamics for $\lambda_1$ is the first one that loses stability.
To simplify notations we define $m \triangleq \eta \lambda_1$, with the sharpness term $\lambda_1$ being the maximal singular value of $H$. Using these notations and dropping the superscript for brevity, we can rewrite eq. 24 as:
$w_{t+1} = (1+\mu) w_t - \mu w_{t-1} - m w_{t-\tau}$ (25)
The characteristic equation of this difference equation is
$z^{\tau+1} - (1+\mu) z^{\tau} + \mu z^{\tau-1} + m = 0, \quad \tau \geq 1$ (26)
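As a consistency check (a sketch, not material from the paper, assuming the characteristic equation takes the form $z^{\tau+1} - (1+\mu)z^{\tau} + \mu z^{\tau-1} + m = 0$, which follows from the update rule above): setting $\mu = 0$ recovers the no-momentum equation $z^{\tau+1} - z^{\tau} + m = 0$, whose threshold root lies on the unit circle at angle $\pi/(2\tau+1)$ when $m = 2\cos(\tau\pi/(2\tau+1))$:

```python
import cmath
import math

def char_poly(z, m, mu, tau):
    """Left-hand side of the characteristic equation
    z^(tau+1) - (1+mu) z^tau + mu z^(tau-1) + m, for tau >= 1."""
    return z ** (tau + 1) - (1 + mu) * z ** tau + mu * z ** (tau - 1) + m

tau = 3
theta = math.pi / (2 * tau + 1)                       # angle of the threshold root
m_crit = 2 * math.cos(tau * math.pi / (2 * tau + 1))  # no-momentum threshold
z = cmath.exp(1j * theta)                             # candidate root on the unit circle
print(abs(char_poly(z, m_crit, 0.0, tau)))            # ~0: z is indeed a root
```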
d.2 Stability Analysis
We would like to find conditions on how we can change the hyperparameters while maintaining the same stability criterion (definition 1). In order to analyze stability we examine the characteristic equation in eq. 26 and ask: when does the maximal root of this polynomial have unit magnitude?
d.2.1 The Interaction Between Learning Rate, Delay, and Momentum
For the general case and different values of momentum, we numerically evaluate the value of $m$ for which the maximal root of the general characteristic equation (eq. 26) is on the unit circle, i.e., when we are on the threshold of stability. We show the results in Fig. 10. We observe that there is still an inverse relation between the learning rate and the delay when using momentum. Specifically, we observe that:

We need to decrease the learning rate as the delay increases.

The lines become steeper for larger values of momentum. This implies that, for a given delay, maintaining stability with a larger momentum requires a smaller learning rate. Therefore, it is easier to maintain stability with $\mu = 0$.
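A numerical evaluation of this kind can be sketched as follows (an illustrative reimplementation, not the paper's code): instead of computing polynomial roots, the recurrence $w_{t+1} = (1+\mu)w_t - \mu w_{t-1} - m w_{t-\tau}$ is simulated directly, and the critical $m$ is located by bisection, so the reported values are approximate:

```python
def stable(m, mu, tau, steps=10000):
    """Simulate w_{t+1} = (1+mu) w_t - mu w_{t-1} - m w_{t-tau}."""
    w = [1.0] * (max(tau, 1) + 1)
    for _ in range(steps):
        new = (1 + mu) * w[-1] - mu * w[-2] - m * w[-1 - tau]
        w.append(new)
        w.pop(0)
        if abs(new) > 1e12:          # clearly diverging; stop early
            return False
    return abs(w[-1]) < 1e-2         # decayed from the initial amplitude of 1

def critical_m(mu, tau):
    """Bisect for the largest m that the simulation still classifies as stable."""
    lo, hi = 0.0, 2.5
    for _ in range(30):
        mid = (lo + hi) / 2
        if stable(mid, mu, tau):
            lo = mid
        else:
            hi = mid
    return lo

for mu in (0.0, 0.5, 0.9):
    # the critical m shrinks as the delay grows, and faster for larger momentum
    print(mu, [round(critical_m(mu, tau), 3) for tau in (1, 2, 4)])
```

The printed rows reproduce the qualitative trend discussed above: for each momentum the admissible $m$ decreases with $\tau$, and the decrease is steeper for larger $\mu$.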
d.3 Shifted Momentum
In this section, we repeat the steps of the analysis in section D.1 for ASGD with shifted momentum. The update rule for minimizing eq. 21 using asynchronous stochastic gradient descent with shifted momentum is given by
or
Following the same steps as in section D.1, we obtain the following one-dimensional equation:
The characteristic equation of the last equation is:
Appendix E Experiments Details
We experimented with a set of popular classification tasks: MNIST (Lecun et al., 1998), CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). In order to incorporate delay in the experiments, we keep $\tau$ replicas of the model parameters and perform an SGD step with a different replica at a time, according to a round-robin schedule. The implementation is done with the PyTorch framework. Section C in the appendix shows that constant delay, i.e., round robin, acts similarly in terms of stability to several other delay distributions. In addition, Zhang et al. (2015) show empirically with a realistic distribution that a constant delay is a good model for the delay distribution. Fig. 9 shows empirically that round-robin scheduling works in a similar way to a discrete Gaussian distribution.
Although stochastic gradient descent with momentum was originally introduced with a dampening term, many drop the latter when using momentum SGD. This is true for most DNN frameworks, such as Caffe, PyTorch, and TensorFlow, where the dampening term is set to zero by default. As Shallue et al. (2018); Yan et al. (2018) pointed out, without dampening, the momentum scales the learning rate so that the effective learning rate is equal to $\eta/(1-\mu)$ after sufficiently many iterations. Because we alter the momentum value in our experiments, we use dampening and scale the learning rate accordingly to keep the effective learning rate the same as without dampening.
e.1 ImageNet Optimization Details
Our baseline is the large-batch training of Goyal et al. (2017). They trained ResNet50 for 90 epochs with a large batch of 8192 samples and reached the same 23.8% top-1 error as small-batch training. They used linear scaling of the learning rate with the batch size (scaling the learning rate by the large batch size divided by the baseline batch size) and a warmup of the learning rate during the first 5 epochs. We use 32 workers, each with a minibatch size of 256.
Appendix F Minima Stability Experiments
In this section, we provide additional details about the minima stability experiment presented in section 3. As discussed in section 3, we are interested in examining for which $(\eta, \tau)$ pairs the minima stability remains the same.
In Fig. 12 we show the validation error of such pairs as a function of epochs. We note that these graphs are from the same experiment as in Fig. 2. As can be seen, for larger learning rates it takes fewer epochs to leave the minimum. It is interesting to see that after leaving the minimum, the ASGD algorithm converges again to a minimum with generalization as good as the baseline (at $\tau = 0$). This suggests that the minima selection process of ASGD is affected by the whole optimization path. In other words, suppose we start the optimization from a minimum with good generalization (since it was selected using optimization with $\tau = 0$), and then it becomes unstable due to a change in the values of $(\eta, \tau)$, as in this experiment. The results in Fig. 12 suggest we typically converge to a stable minimum with similar generalization properties, possibly near the original minimum. In contrast, if we start to train from scratch using the same $(\eta, \tau)$ pair which lost stability in our experiment, we typically get a generalization gap (as observed in our experiments). This suggests the optimization might have followed a very different path from the start, leading to regions with worse generalization than the original minimum.