The most powerful supercomputer in the world is currently a cluster of over 27,000 GPUs at Oak Ridge National Labs (TOP500, 2018). Distributed algorithms designed for such large-scale systems typically involve both computation and communication: worker nodes compute intermediate results locally, before sharing them with their peers. When devising new machine learning algorithms for distribution over networks of thousands of workers, we posit the following desiderata:
D1. fast algorithmic convergence;
D2. good generalisation performance;
D3. communication efficiency;
D4. robustness to network faults.
When seeking an algorithm that satisfies all four desiderata D1–4, inevitably some tradeoff must be made. Stochastic gradient descent (SGD) naturally satisfies D1–2, and this has buoyed recent advances in deep learning. Yet when it comes to large neural network models with hundreds of millions of parameters, distributed SGD can suffer large communication overheads. To make matters worse, any faulty SGD worker can corrupt the entire model at any time by sending an infinite gradient, meaning that SGD without modification is not robust.
A simple algorithm with aspirations towards all desiderata D1–4 is as follows: workers send the sign of their gradient up to the parameter server, which aggregates the signs and sends back only the majority decision. We refer to this algorithm as signSGD with majority vote. All communication to and from the parameter server is compressed to one bit, so the algorithm certainly gives us D3. What’s more, in deep learning folklore sign based methods are known to perform well, indeed inspiring the popular RMSprop and Adam optimisers (Balles & Hennig, 2018), giving hope for D1. As far as robustness goes, aggregating gradients by a majority vote denies any individual worker too much power, suggesting it may be a natural way to achieve D4.
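The scheme just described can be sketched in a few lines. This is a minimal single-process illustration, not the distributed implementation: the function names, the quadratic toy objective, the Gaussian noise and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np

def majority_vote(sign_vectors):
    """Server side: elementwise majority of the workers' sign vectors."""
    votes = np.sum(sign_vectors, axis=0)   # positive if most workers sent +1
    return np.sign(votes)                  # 0 only on an exact tie

def signsgd_majority_vote_step(x, worker_grads, lr):
    """One step: each worker sends sign(gradient); the server returns the majority sign."""
    signs = np.sign(worker_grads)          # shape (num_workers, dim), one bit per entry
    return x - lr * majority_vote(signs)

# Toy problem (hypothetical): minimise f(x) = 0.5 * ||x||^2, whose true gradient is x.
rng = np.random.default_rng(0)
x = np.array([3.0, -2.0, 1.0])
num_workers = 5                            # odd, so the vote cannot tie
for _ in range(200):
    noise = rng.normal(scale=0.5, size=(num_workers, x.size))
    worker_grads = x + noise               # each worker's noisy gradient estimate
    x = signsgd_majority_vote_step(x, worker_grads, lr=0.05)
final_norm = float(np.linalg.norm(x))
```

Only sign vectors cross the (simulated) network in either direction, which is the property that gives one-bit communication.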
In this work, we make the above aspirations rigorous. Whilst D3 is immediate, we provide the first convergence guarantees for signSGD in the mini-batch setting, providing theoretical grounds for D1. We show theoretically how the behaviour of signSGD changes as gradients move from high to low signal-to-noise ratio. We also extend the theory of majority vote to show that it achieves Byzantine fault tolerance assuming that adversaries cannot cooperate. A distributed algorithm is Byzantine fault tolerant (Blanchard et al., 2017) if its convergence is robust when up to 50% of workers behave adversarially. This is a relatively strong property that often entails desirable weaker properties, such as robustness to a corrupted worker sending random bits, or an outdated worker sending stale gradients. This means that Byzantine fault tolerance is not just a property of security, but also confers robustness to a wide variety of plausible network faults, giving us D4. Assuming non-cooperative adversaries is an interesting failure model, though not the most general one.
Next, we embark on a large-scale empirical validation of our theory. We implement majority vote in the Pytorch deep learning framework, using CUDA kernels to bit pack sign tensors down to one bit. Our results provide experimental evidence for D1–D4. Comparing our framework to NCCL (the state of the art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines, albeit at a slight loss in generalisation.
Finally, in an interesting twist, the theoretical tools we develop may be brought to bear on a seemingly unrelated problem in the machine learning literature. Reddi et al. (2018) proved that the extremely popular Adam optimiser in general does not converge in the mini-batch setting. This result belies the success of the algorithm in a wide variety of practical applications. signSGD is equivalent to a special case of Adam, and we establish the convergence rate of mini-batch signSGD for a large class of practically realistic objectives. Therefore, we expect that these tools should carry over to help understand the success modes of Adam. Our insight is that gradient noise distributions in practical problems are often unimodal and symmetric because of the Central Limit Theorem, yet Reddi et al. (2018)’s construction relies on bimodal noise distributions.
2 Related Work
For decades, neural network researchers have adapted biologically inspired algorithms for efficient hardware implementation. Hopfield (1982), for example, considered taking the sign of the synaptic weights of his memory network for readier adaptation into integrated circuits. This past decade, neural network research has focused on training feedforward networks by gradient descent (LeCun et al., 2015). It is natural to ask what practical efficiency may accompany simply taking the sign of the backpropagated gradient. In this section, we explore related work pertaining to this question.
Deep learning: whilst stochastic gradient descent (SGD) is the workhorse of machine learning (Robbins & Monro, 1951), algorithms like RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) are also extremely popular neural net optimisers. These algorithms have their roots in the Rprop optimiser (Riedmiller & Braun, 1993), which is a sign-based method similar to signSGD except for a component-wise adaptive learning rate.
Non-convex optimisation: parallel to (and oftentimes in isolation from) advances in deep learning practice, a sophisticated optimisation literature has developed. Nesterov & Polyak (2006) proposed cubic regularisation as an algorithm that can escape saddle points and provide guaranteed convergence to local minima of non-convex functions. This has been followed up by more recent works such as Natasha (Allen-Zhu, 2017) that use other theoretical tricks to escape saddle points. It is still unclear how relevant these works are to deep learning, since it is not clear to what extent saddle points are an obstacle in practical problems. We avoid this issue altogether and satisfy ourselves with establishing convergence to critical points.
Gradient compression: prior work on gradient compression generally falls into two camps. In the first camp, algorithms like QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017) and Atomo (Wang et al., 2018) use stochastic quantisation schemes to ensure that the compressed stochastic gradient remains an unbiased approximation to the true gradient. These works are therefore able to bootstrap existing SGD convergence theory. In the second camp, more heuristic algorithms like 1BitSGD (Seide et al., 2014) and deep gradient compression (Lin et al., 2018) pay less attention to theoretical guarantees and focus more on practical performance. These algorithms track quantisation errors and feed them back into subsequent updates. The commonality between the two camps is an effort to, one way or another, correct for bias in the compression.
signSGD with majority vote takes a different approach to these two existing camps. In directly employing the sign of the stochastic gradient, the algorithm unabashedly uses a biased approximation of the stochastic gradient. Carlson et al. (2016) and Bernstein et al. (2018) provide theoretical and empirical evidence that signed gradient schemes can converge well in spite of their biased nature. Their theory only applies in the large batch setting, meaning the theoretical results are less relevant to deep learning practice. Still, Bernstein et al. (2018) showed promising experimental results in the small batch setting. An appealing feature of majority vote is that it naturally leads to compression in both directions of communication between workers and parameter server. As far as we are aware, all existing gradient compression schemes lose compression before scattering results back to workers.
Byzantine fault tolerant optimisation: the problem of modifying SGD to make it Byzantine fault tolerant has recently attracted interest in the literature. For example, Blanchard et al. (2017) proposed Krum, which operates by detecting and excluding outliers in the gradient aggregation. Alistarh et al. (2018) propose ByzantineSGD which instead focuses on detecting and eliminating adversaries. Clearly both these strategies incur overheads, and eliminating adversaries precludes the possibility that they might reform. Majority vote is a simple algorithm which avoids these problems.
We aim to develop an optimisation theory that is relevant for real problems in deep learning. For this reason, we are careful about the assumptions we make. For example, we do not assume convexity because neural network loss functions are typically not convex. Though we allow our objective function to be non-convex, we insist on a lower bound to enable meaningful convergence results.
Figure 2: Gradient noise distributions for resnet18 on Cifar-10 at mini-batch size 128. At the start of epochs 0, 1 and 5, we do a full pass over the data and collect the gradients for three randomly chosen weights (left, middle, right). In all cases the distribution is close to unimodal and symmetric.
Assumption 1 (Lower bound).
For all $x$ and some constant $f_*$, we have objective value $f(x) \geq f_*$.
Our next two assumptions of Lipschitz smoothness and bounded variance are standard in the stochastic optimisation literature (Allen-Zhu, 2017). That said, we give them in a component-wise form. This allows our convergence results to encode information not just about the total noise level and overall smoothness, but also about how these quantities are distributed across dimension.
Assumption 2 (Smooth).
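Let $g(x)$ denote the gradient of the objective $f(\cdot)$ evaluated at point $x$. Then for all $x, y$ we require that for some non-negative constant $\vec{L} := [L_1, \ldots, L_d]$
$$\left| f(y) - \left[ f(x) + g(x)^T (y - x) \right] \right| \leq \frac{1}{2} \sum_i L_i (y_i - x_i)^2.$$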
Assumption 3 (Variance bound).
Upon receiving query $x \in \mathbb{R}^d$, the stochastic gradient oracle gives us an independent, unbiased estimate $\tilde{g}$ that has coordinate bounded variance:
$$\mathbb{E}[\tilde{g}(x)] = g(x), \qquad \mathbb{E}\left[ (\tilde{g}(x)_i - g(x)_i)^2 \right] \leq \sigma_i^2$$
for a vector of non-negative constants $\vec{\sigma} := [\sigma_1, \ldots, \sigma_d]$.
Our final assumption is non-standard. We assume that the gradient noise is unimodal and symmetric. Clearly, Gaussian noise is a special case. Note that even for a moderate mini-batch size, we expect the central limit theorem to kick in rendering typical gradient noise distributions close to Gaussian. See Figure 2 for noise distributions measured whilst training resnet18 on Cifar-10.
Assumption 4 (Unimodal, symmetric gradient noise).
At any given point $x$, each component of the stochastic gradient vector $\tilde{g}(x)$ has a unimodal distribution that is also symmetric about the mean.
Showing how to work with this assumption is a key theoretical contribution of this work. Combining Assumption 4 with an old tail bound of Gauss (1823) yields Lemma 1, which will be crucial for guaranteeing mini-batch convergence of signSGD. As will be explained in Section 3.3, this result also constitutes a convergence proof for a parameter regime of Adam. This suggests that Assumption 4 may more generally be a theoretical fix for Reddi et al. (2018)’s non-convergence proof of mini-batch Adam, a fix which does not involve modifying the Adam algorithm itself.
3.2 Mini-Batch Convergence of signSGD
With our assumptions in place, we move on to presenting our theoretical results, which are all proved in Appendix A. Our first result establishes the mini-batch convergence behaviour of signSGD. We will first state the result and make some remarks. We provide intuition for the proof in Section 3.3.
Theorem 1 (Non-convex convergence rate of small-batch signSGD).
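Run the following algorithm for $K$ iterations under Assumptions 1 to 4: $x_{k+1} = x_k - \eta\,\mathrm{sign}(\tilde{g}_k)$. Set the learning rate, $\eta$, and mini-batch size, $n$, as
$$\eta = \sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1 K}}, \qquad n = 1,$$
where $f_0 := f(x_0)$ is the initial objective value.
Let $H_k$ be the set of gradient components at step $k$ with large signal-to-noise ratio $S_i := |g_{k,i}| / \sigma_i$, i.e. $H_k := \{ i \mid S_i > 2/\sqrt{3} \}$. We refer to $2/\sqrt{3}$ as the ‘critical SNR’. Then we have
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[ \sum_{i \in H_k} |g_{k,i}| + \sum_{i \notin H_k} \frac{g_{k,i}^2}{\sigma_i} \right] \leq 3 \sqrt{\frac{\|\vec{L}\|_1 (f_0 - f_*)}{N}},$$
where $N = K$ is the total number of stochastic gradient calls up to step $K$.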
Theorem 1 provides a bound on the average gradient norm. The right hand side of the bound decays like $O(1/\sqrt{N})$, establishing convergence to critical points of the objective.
Remark 1: mini-batch signSGD attains the same non-convex convergence rate as SGD.
Remark 2: the gradient appears as a mixed norm: an $\ell_1$ norm for high SNR components, and a weighted $\ell_2$ norm for low SNR components.
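Remark 3: we wish to understand the dimension dependence of our bound. We may simplify matters by assuming that, during the entire course of optimisation, every gradient component lies in the low SNR regime. Figure 3 shows that this is almost true when training a resnet18 model. In this limit, the bound becomes:
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[ \sum_{i=1}^{d} \frac{g_{k,i}^2}{\sigma_i} \right] \leq 3 \sqrt{\frac{\|\vec{L}\|_1 (f_0 - f_*)}{N}}.$$
Further assume that we are in a well-conditioned setting, meaning that the variance is distributed uniformly across dimension ($\sigma_i = \sigma_0$ for all $i$), and every weight has the same smoothness constant ($L_i = L_0$ for all $i$). $\sigma^2 := d \sigma_0^2$ is the total variance bound, and $L := L_0$ is the conventional Lipschitz smoothness. These are the quantities which appear in the standard analysis of SGD. Then we get
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\, \|g_k\|_2^2 \leq 3 \sigma \sqrt{\frac{L (f_0 - f_*)}{N}}.$$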
The factors of dimension have conveniently cancelled. This illustrates that there are problem geometries where mini-batch signSGD does not pick up an unfavourable dimension dependence.
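The cancellation can be checked by direct substitution. The following worked step is a sketch under the well-conditioned assumptions of Remark 3, writing $\sigma_0$ and $L_0$ for the shared per-coordinate constants and $d$ for the dimension, with $\sigma^2 = d\sigma_0^2$ and $L = L_0$; it starts from the low SNR form of the Theorem 1 bound.

```latex
\begin{aligned}
\sum_{i=1}^{d} \frac{g_{k,i}^2}{\sigma_i}
  &= \frac{\|g_k\|_2^2}{\sigma_0}
   = \frac{\sqrt{d}}{\sigma}\,\|g_k\|_2^2,
\qquad
\|\vec{L}\|_1 = d L_0 = d L, \\[4pt]
\frac{\sqrt{d}}{\sigma} \cdot \frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\|g_k\|_2^2
  &\leq 3\sqrt{\frac{d L (f_0 - f_*)}{N}}
\;\Longrightarrow\;
\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\|g_k\|_2^2
  \leq 3\sigma\sqrt{\frac{L (f_0 - f_*)}{N}}.
\end{aligned}
```

Both sides carry a factor of $\sqrt{d}$, which is why no dimension dependence survives in this geometry.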
3.3 The Subtleties of Mini-Batch Convergence
Intuitively, the convergence analysis of signSGD depends on the probability that a given bit of the sign stochastic gradient vector is incorrect, or $\mathbb{P}[\mathrm{sign}(\tilde{g}_i) \neq \mathrm{sign}(g_i)]$. Lemma 1 provides a bound on this quantity under Assumption 4 (unimodal symmetric gradient noise).
Lemma 1 (Bernstein et al. (2018)).
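Let $\tilde{g}_i$ be an unbiased stochastic approximation to gradient component $g_i$, with variance bounded by $\sigma_i^2$. Further assume that the noise distribution is unimodal and symmetric. Define signal-to-noise ratio $S_i := |g_i| / \sigma_i$. Then we have that
$$\mathbb{P}[\mathrm{sign}(\tilde{g}_i) \neq \mathrm{sign}(g_i)] \leq \begin{cases} \dfrac{2}{9} \dfrac{1}{S_i^2} & \text{if } S_i > \dfrac{2}{\sqrt{3}}, \\[6pt] \dfrac{1}{2} - \dfrac{S_i}{2\sqrt{3}} & \text{otherwise,} \end{cases}$$
which is in all cases less than $\frac{1}{2}$.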
The bound characterises how the failure probability of a sign bit depends on the signal-to-noise ratio (SNR) of that gradient component. Intuitively as the SNR decreases, the quality of the sign estimate should degrade. The bound is elegant since it tells us that, under conditions of unimodal symmetric gradient noise, even at extremely low SNR we still have that $\mathbb{P}[\mathrm{sign}(\tilde{g}_i) \neq \mathrm{sign}(g_i)] < \frac{1}{2}$. This means that even when the gradient is very small compared to the noise, the sign stochastic gradient still tells us, on average, useful information about the true gradient direction, allowing us to guarantee convergence as in Theorem 1.
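To make the bound concrete, the sketch below evaluates the two-branch expression of Lemma 1 and compares it with a Monte Carlo estimate of the sign failure rate. Gaussian noise is used only because it is one convenient unimodal symmetric distribution; the function names, the sample count and the chosen SNR values are assumptions of the example.

```python
import math
import numpy as np

CRITICAL_SNR = 2.0 / math.sqrt(3.0)

def sign_failure_bound(snr):
    """Lemma 1 upper bound on the probability that a sign bit is wrong."""
    if snr > CRITICAL_SNR:
        return 2.0 / (9.0 * snr ** 2)
    return 0.5 - snr / (2.0 * math.sqrt(3.0))

def empirical_failure_rate(snr, num_samples=200_000, seed=0):
    """Monte Carlo estimate under Gaussian noise (an assumed noise model)."""
    rng = np.random.default_rng(seed)
    g, sigma = snr, 1.0                       # true gradient component and noise scale
    samples = g + sigma * rng.normal(size=num_samples)
    return float(np.mean(np.sign(samples) != np.sign(g)))

for snr in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(snr, empirical_failure_rate(snr), sign_failure_bound(snr))
```

The two branches agree at the critical SNR, where both evaluate to 1/6, so the bound is continuous in the SNR.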
Without Assumption 4, the mini-batch algorithm can diverge. This can be seen by considering Cantelli’s inequality, which tells us that for a random variable $X$ with mean $\mu$ and variance $\sigma^2$: $\mathbb{P}[\mu - X \geq |\lambda|] \leq \frac{1}{1 + \lambda^2 / \sigma^2}$. From this, we obtain a reliability measure of the sign stochastic gradient:
$$\mathbb{P}[\mathrm{sign}(\tilde{g}_i) \neq \mathrm{sign}(g_i)] \leq \frac{1}{1 + S_i^2}. \tag{1}$$
There exist noise distributions (violating Assumption 4) for which Cantelli’s inequality is tight and so Inequality 1 becomes an equality. Close to a minimum where the SNR $S_i \to 0$, the failure probability of the sign bit for these distributions $\to 1$. Therefore signSGD cannot converge for these noise distributions, since the sign stochastic gradient will tend to point in the wrong direction close to a minimum.
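One such distribution can be written down explicitly. The two-point construction below is an illustration chosen here, not taken from the original text: for a positive gradient component $g_i$ with SNR $S = g_i / \sigma_i$, it has the required mean and variance yet fails to return a positive sign with probability exactly $1/(1+S^2)$.

```latex
\tilde{g}_i =
\begin{cases}
0 & \text{with probability } \dfrac{1}{1+S^2}, \\[8pt]
g_i\left(1 + \dfrac{1}{S^2}\right) & \text{with probability } \dfrac{S^2}{1+S^2},
\end{cases}
\qquad
\begin{aligned}
\mathbb{E}[\tilde{g}_i] &= g_i \cdot \frac{1+S^2}{S^2} \cdot \frac{S^2}{1+S^2} = g_i, \\
\mathrm{Var}[\tilde{g}_i] &= g_i^2 \cdot \frac{1+S^2}{S^2} - g_i^2 = \frac{g_i^2}{S^2} = \sigma_i^2 .
\end{aligned}
```

The lower atom sits at zero, which is the boundary case of the inequality; moving it to a slightly negative value (and adjusting the upper atom to preserve the mean) makes the sign strictly wrong at nearly the same probability. The distribution is bimodal, so it violates Assumption 4.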
Note that signSGD is a special case of the Adam algorithm (Balles & Hennig, 2018). To see this, set $\beta_1 = \beta_2 = \epsilon = 0$ in Adam, and the update becomes:
$$-\eta\, \frac{\tilde{g}_k}{\sqrt{\tilde{g}_k^2}} = -\eta\, \mathrm{sign}(\tilde{g}_k).$$
This correspondence suggests that Assumption 4 should be useful for obtaining mini-batch convergence guarantees for Adam. Note that when Reddi et al. (2018) construct toy divergence examples for Adam, they rely on bimodal noise distributions which violate Assumption 4.
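The reduction is easy to verify numerically. The sketch below writes out a textbook Adam step (the function name and argument layout are choices made for this example, not any library's API) and evaluates it with both decay rates and epsilon set to zero on a gradient with no zero entries.

```python
import numpy as np

def adam_update(grad, m, v, t, lr, beta1, beta2, eps):
    """One textbook Adam step; returns (parameter update, new m, new v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)         # bias correction (a no-op when beta1 = 0)
    v_hat = v / (1.0 - beta2 ** t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

grad = np.array([0.3, -1.7, 2.5e-4, -40.0])   # arbitrary non-zero gradient
zeros = np.zeros_like(grad)
update, _, _ = adam_update(grad, zeros, zeros, t=1, lr=0.01,
                           beta1=0.0, beta2=0.0, eps=0.0)
```

Every entry of `update` has magnitude `lr` regardless of the gradient scale, matching the signSGD step. A zero gradient entry would give 0/0 here, which is one practical reason Adam keeps a non-zero epsilon.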
We conclude this section by noting that without Assumption 4, signSGD can still be guaranteed to converge. The trick is to use a “large” batch size that grows with the number of iterations. This will ensure that the algorithm stays in the high SNR regime where the failure probability of the sign bit is low. This is the approach taken by both Carlson et al. (2016) and Bernstein et al. (2018).
Figure 4: Imagenet robustness experiments with resnet50 trained across 7 AWS p3.2xlarge machines. Adversaries invert their sign stochastic gradient. Left: all experiments are run at identical hyperparameter settings, with weight decay switched off for simplicity. The network still learns even at 43% adversarial. Right: at 43% adversarial, learning became slightly unstable. We decreased the learning rate for this setting, and learning stabilised.
3.4 Robustness of convergence
We wish to study signSGD’s robustness when it is distributed by majority vote. We model adversaries as machines that are able to compute a real stochastic gradient estimate and manipulate it however they like, though they cannot cooperate. In SGD any adversary can set the gradient to infinity and immediately corrupt the entire model. Our algorithm restricts adversaries to send sign vectors, therefore the worst they can do is send the negation of their sign gradient vector.
For ease of analysis, here we derive large batch results. We make sure to give results in terms of sample complexity $N$ (and not iteration number $K$) to enable fair comparison with other algorithms.
Theorem 2 (Non-convex convergence rate of majority vote with adversarial workers).
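Run algorithm 1 for $K$ iterations under Assumptions 1 to 4. Switch off momentum and weight decay ($\beta = \lambda = 0$). Set the learning rate, $\eta$, and mini-batch size, $n$, for each worker as
$$\eta = \sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1 K}}, \qquad n = K.$$
Assume that a fraction $\alpha < \frac{1}{2}$ of the $M$ workers behave adversarially by sending to the server the negation of their sign gradient estimate. Then majority vote converges at rate:
$$\left[ \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\, \|g_k\|_1 \right]^2 \leq \frac{4}{\sqrt{N}} \left[ \frac{1}{1 - 2\alpha} \frac{\|\vec{\sigma}\|_1}{\sqrt{M}} + \sqrt{\|\vec{L}\|_1 (f_0 - f_*)} \right]^2,$$
where $N = K^2$ is the total number of stochastic gradient calls per worker up to step $K$.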
The result is intuitive: provided there are more machines sending honest gradients than adversarial gradients, we expect that the majority vote should come out correct on average.
Remark 1: if we switch off adversaries by setting the proportion of adversaries $\alpha = 0$, this result reduces to Theorem 2 in (Bernstein et al., 2018). In this case, we note the nice $1/\sqrt{M}$ variance reduction that majority vote obtains by distributing over $M$ machines, similar to distributed SGD.
Remark 2: the convergence rate degrades as we ramp up $\alpha$ from $0$ to $\frac{1}{2}$. For $\alpha > \frac{1}{2}$, convergence can still be attained if the parameter server (realising it is under attack) inverts the sign of the vote.
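The effect of adversaries on a single sign bit can be simulated directly. The sketch below assumes Gaussian gradient noise, 7 workers, and adversaries that negate their own sign bit; it compares the empirical failure rate of the vote with the bound $1 / ((1 - 2\alpha)\sqrt{M} S)$ used in the proof of Theorem 2. All names and parameter values are illustrative.

```python
import numpy as np

def vote_failure_rate(snr, num_workers, num_adversaries, trials=20_000, seed=0):
    """Empirical probability that the majority vote gets one sign bit wrong."""
    rng = np.random.default_rng(seed)
    g, sigma = snr, 1.0                                   # true component is positive
    samples = g + sigma * rng.normal(size=(trials, num_workers))
    signs = np.sign(samples)
    signs[:, :num_adversaries] *= -1                      # adversaries negate their bit
    vote = np.sign(signs.sum(axis=1))
    return float(np.mean(vote != 1.0))

def vote_failure_bound(snr, num_workers, alpha):
    """Bound from the proof of Theorem 2: 1 / ((1 - 2*alpha) * sqrt(M) * S)."""
    return 1.0 / ((1.0 - 2.0 * alpha) * np.sqrt(num_workers) * snr)

M = 7
for bad in (0, 1, 2, 3):
    alpha = bad / M
    print(bad, vote_failure_rate(1.0, M, bad), vote_failure_bound(1.0, M, alpha))
```

In this toy setting the vote stays correct more often than not even with 3 of 7 workers adversarial, while the bound itself becomes loose (it exceeds 1) as $\alpha$ approaches $\frac{1}{2}$.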
Remark 3: from an optimisation theory perspective, the large batch size is an advantage. This is because when using a large batch size, fewer iterations and rounds of communication are theoretically needed to reach a desired accuracy, since only $\sqrt{N}$ iterations are needed to reach $N$ samples. But from a practical perspective, workers may be unable to handle such a large batch size in a timely manner. It should be possible to extend the result to the mini-batch setting by combining the techniques of Theorems 1 and 2, but we leave this for future work.
For our experiments, we distributed Signum (Algorithm 1) by majority vote. Signum is the momentum counterpart of signSGD, where each worker maintains a momentum and transmits the sign momentum to the parameter server at each step. The addition of momentum to signSGD is proposed and studied in (Balles & Hennig, 2018; Bernstein et al., 2018).
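The worker and server roles in Signum with majority vote can be sketched as follows. This is an illustration under assumptions rather than a transcription of Algorithm 1: the exponential-moving-average form of the momentum, the function names and the toy gradients are all chosen for the example, and the parameter update and weight decay are omitted.

```python
import numpy as np

def signum_worker_message(grad, momentum, beta):
    """Worker: update local momentum and return (sign to transmit, new momentum)."""
    momentum = beta * momentum + (1.0 - beta) * grad
    return np.sign(momentum), momentum

def signum_server_vote(sign_messages):
    """Server: elementwise majority vote over the workers' sign messages."""
    return np.sign(np.sum(sign_messages, axis=0))

# Three hypothetical workers, two parameters, two consecutive steps.
beta = 0.9
momenta = [np.zeros(2) for _ in range(3)]
grads_step1 = [np.array([1.0, -1.0]), np.array([2.0, -0.5]), np.array([-0.1, -3.0])]
grads_step2 = [np.array([-0.5, -1.0]), np.array([-0.5, -1.0]), np.array([-0.5, -1.0])]

votes = []
for grads in (grads_step1, grads_step2):
    messages = []
    for w, grad in enumerate(grads):
        sign, momenta[w] = signum_worker_message(grad, momenta[w], beta)
        messages.append(sign)
    votes.append(signum_server_vote(np.array(messages)))
```

In the second step every worker's fresh gradient is negative in the first coordinate, yet the vote for that coordinate stays positive because two workers' momenta have not yet changed sign. This is the smoothing effect that distinguishes Signum from plain signSGD.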
We built Signum with majority vote in the Pytorch deep learning framework (Paszke et al., 2017) using the Gloo (2018) communication library. Unfortunately Pytorch and Gloo do not natively support 1-bit tensors, therefore we wrote our own compression code to compress a sign tensor down to an efficient 1-bit representation. Looking under the hood, we use the GPU to efficiently bit-pack groups of 32 sign bits into single 32-bit floats for transmission. We obtained a performance boost by fusing together smaller tensors, which saved on compression and communication costs.
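The idea of bit-packing can be illustrated on the CPU. The sketch below is a stand-in, not the GPU code described above: it uses NumPy's `packbits` to store eight sign bits per byte rather than packing 32 bits into a 32-bit word, and the helper names are hypothetical.

```python
import numpy as np

def pack_signs(sign_tensor):
    """Pack a {-1, +1} array into bytes, one bit per entry (+1 -> 1, -1 -> 0)."""
    bits = (np.asarray(sign_tensor).ravel() > 0).astype(np.uint8)
    return np.packbits(bits), bits.size

def unpack_signs(packed, num_entries):
    """Invert pack_signs, dropping the zero padding added by np.packbits."""
    bits = np.unpackbits(packed)[:num_entries]
    return bits.astype(np.int8) * 2 - 1

signs = np.sign(np.random.default_rng(0).normal(size=1000)).astype(np.int8)
packed, n = pack_signs(signs)
restored = unpack_signs(packed, n)
```

For 1000 entries the packed message occupies 125 bytes, against 4000 bytes for the same tensor stored as 32-bit floats, which is the 32-fold reduction that one-bit communication implies.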
We benchmark majority vote against SGD distributed using the state of the art, closed source NCCL (2018) communication library. NCCL provides an efficient implementation of allreduce. Our framework often provides a greater than communication speedup compared to NCCL, as can be seen in Figure 5. This includes the cost of compression.
4.1 Communication Efficiency
We first benchmark majority vote on the Imagenet dataset. We train a resnet50 model and distribute learning over 7 to 15 AWS p3.2xlarge machines. These machines each contain one Nvidia Tesla V100 GPU, and AWS lists the connection speed between machines as “up to 10 Gbps”. Results are plotted in Figure 6. Per epoch, distributing by majority vote is able to attain a similar speedup to distributed SGD. But per hour majority vote is able to process more epochs than NCCL, meaning it can complete the 80 epoch training job roughly 25% faster. In terms of overall generalisation, majority vote reaches a slightly degraded test set accuracy. We hypothesise that this may be fixed by inventing a better regularisation scheme or tuning momentum, which we did not do.
As can be seen in Figure 5, this 25% speedup undersells the efficiency of our communication scheme. This is because resnet50 is a computation heavy model, meaning the cost of backpropagation is on par with the cost of communication. This is not the case for all deep learning models. We also see in Figure 5 that majority vote yields an almost overall speedup for training an epoch of the 151 million parameter QRNN model from (Merity et al., 2018).
In this section we test the robustness of Signum with majority vote to Byzantine faults. Again we run tests on the Imagenet dataset, training resnet50 across 7 AWS p3.2xlarge machines. Our adversarial workers take the sign of their stochastic gradient calculation, but send the negation to the parameter server. Our results are plotted in Figure 4. In the left hand plot, all experiments were carried out using hyperparameters tuned for the 0% adversarial case. Weight decay was not used in these experiments to simplify matters. We see that learning is tolerant of up to 43% (3 out of 7) machines behaving adversarially. The 43% adversarial case was slightly unstable (Figure 4, left), but re-tuning the learning rate for this specific case stabilised learning (Figure 4, right).
5 Discussion and Conclusion
Our implementation of majority vote can be further optimised. Our primary inefficiency is that we use a single parameter server located on one of the machines. This single parameter server becomes a communication bottleneck, and it also means one machine must handle taking the entire vote. Fragmenting the parameter server and distributing it across all machines should further increase our speedup relative to NCCL, and we will include this feature in our open source code release.
Though our framework speeds up Imagenet training, we still have a test set gap. We hypothesise that this gap may be closed by more extensive tuning of hyperparameters, or by inventing new regularisation schemes for signed updates. Results in Figure 5 (right) suggest that our framework might dramatically speed up training of large language models like QRNN on WikiText-103. Whilst our preliminary run did not obtain satisfactory perplexity, we plan to run more extensive experiments in the immediate future, and will update the paper when we have conclusive results.
To conclude, we have analysed the theoretical and empirical properties of a very simple algorithm for distributed, stochastic optimisation; workers send their sign gradient estimate to the server, and the server returns the majority vote to each worker. We have shown that this algorithm is theoretically robust and communication efficient. We have also shown that its empirical convergence rate is competitive with SGD for training large-scale convolutional neural nets on image datasets, whilst also conferring robustness and communication efficiency in practice.
Our work touches upon various active areas of machine learning research such as distributed systems, robust optimisation and adaptive gradient methods. We hope that our general philosophy of characterising and exploiting simple, realistic properties of neural network error landscapes can inspire future work in these directions.
- Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems (NIPS-17), 2017.
- Alistarh et al. (2018) Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine Stochastic Gradient Descent. arXiv:1803.08917, 2018.
- Allen-Zhu (2017) Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. arXiv:1708.08694, 2017.
- Balles & Hennig (2018) Lukas Balles and Philipp Hennig. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. In International Conference on Machine Learning (ICML-18), 2018.
- Bernstein et al. (2018) Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed Optimisation for Non-Convex Problems. In International Conference on Machine Learning (ICML-18), 2018.
- Blanchard et al. (2017) Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Advances in Neural Information Processing Systems (NIPS-17), 2017.
- Cantelli (1928) Francesco Paolo Cantelli. Sui confini della probabilità. Atti del Congresso Internazionale dei Matematici, 1928.
- Carlson et al. (2016) David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016.
- Gauss (1823) Carl Friedrich Gauss. Theoria combinationis observationum erroribus minimis obnoxiae, pars prior. Commentationes Societatis Regiae Scientiarum Gottingensis Recentiores, 1823.
- Gloo (2018) Gloo. Gloo Collective Communications Library, 2018. URL https://github.com/facebookincubator/gloo. Accessed on 9/27/18.
- Hopfield (1982) J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 1982.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR-15), 2015.
- LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Lin et al. (2018) Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (ICLR-18), 2018.
- Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales. arXiv:1803.08240, 2018.
- NCCL (2018) NCCL. Nvidia Collective Communications Library, 2018. URL https://developer.nvidia.com/nccl. Accessed on 9/27/18.
- Nesterov & Polyak (2006) Yurii Nesterov and B.T. Polyak. Cubic Regularization of Newton Method and its Global Performance. Mathematical Programming, 2006.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In Advances in Neural Information Processing Systems, Autodiff Workshop (NIPS-17), 2017.
- Pukelsheim (1994) Friedrich Pukelsheim. The Three Sigma Rule. The American Statistician, 1994.
- Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. In International Conference on Learning Representations (ICLR-18), 2018.
- Riedmiller & Braun (1993) M. Riedmiller and H. Braun. A Direct Adaptive Method for Faster Backpropagation Learning: the RPROP Algorithm. In International Conference on Neural Networks (ICNN-93), pp. 586–591. IEEE, 1993.
- Robbins & Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.
- Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs. In Conference of the International Speech Communication Association (INTERSPEECH-14), 2014.
- Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. RMSprop. Coursera: Neural Networks for Machine Learning, Lecture 6.5, 2012.
- TOP500 (2018) TOP500. IBM Summit Supercomputer, 2018. URL https://www.top500.org/system/179397. Accessed on 9/19/18.
- Wang et al. (2018) Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary B. Charles, Dimitris S. Papailiopoulos, and Stephen Wright. ATOMO: Communication-efficient Learning via Atomic Sparsification. In Advances in Neural Information Processing Systems (NIPS-18), 2018.
- Wen et al. (2017) Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Advances in Neural Information Processing Systems (NIPS-17), 2017.
Appendix A Proofs
Proof of Lemma 1.
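Recall Gauss’ inequality for a unimodal random variable $X$ with mode $\nu$ and expected squared deviation from the mode $\tau^2$ (Gauss, 1823; Pukelsheim, 1994):
$$\mathbb{P}[|X - \nu| > k] \leq \begin{cases} \dfrac{4\tau^2}{9k^2} & \text{if } \dfrac{k}{\tau} > \dfrac{2}{\sqrt{3}}, \\[6pt] 1 - \dfrac{k}{\sqrt{3}\,\tau} & \text{otherwise.} \end{cases}$$
By the symmetry assumption, the mode is equal to the mean, so we replace mean $\mu = \nu$ and variance $\sigma^2 = \tau^2$.
Without loss of generality assume that $g_i$ is negative. Then applying symmetry followed by Gauss, the failure probability for the sign bit satisfies:
$$\begin{aligned} \mathbb{P}[\mathrm{sign}(\tilde{g}_i) \neq \mathrm{sign}(g_i)] &= \mathbb{P}[\tilde{g}_i - g_i \geq |g_i|] = \frac{1}{2}\, \mathbb{P}[|\tilde{g}_i - g_i| \geq |g_i|] \\ &\leq \begin{cases} \dfrac{2}{9} \dfrac{\sigma_i^2}{g_i^2} & \text{if } \dfrac{|g_i|}{\sigma_i} > \dfrac{2}{\sqrt{3}}, \\[6pt] \dfrac{1}{2} - \dfrac{|g_i|}{2\sqrt{3}\,\sigma_i} & \text{otherwise} \end{cases} \\ &= \begin{cases} \dfrac{2}{9} \dfrac{1}{S_i^2} & \text{if } S_i > \dfrac{2}{\sqrt{3}}, \\[6pt] \dfrac{1}{2} - \dfrac{S_i}{2\sqrt{3}} & \text{otherwise.} \end{cases} \end{aligned}$$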
Proof of Theorem 1.
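First let’s bound the improvement of the objective during a single step of the algorithm for one instantiation of the noise. $\mathbb{I}[\cdot]$ is the indicator function, $g_{k,i}$ denotes the $i^{th}$ component of the true gradient $g(x_k)$ and $\tilde{g}_k$ is a stochastic sample obeying Assumption 3.
First take Assumption 2, plug in the algorithmic step, and decompose the improvement to expose the stochasticity-induced error:
$$\begin{aligned} f_{k+1} - f_k &\leq g_k^T (x_{k+1} - x_k) + \sum_{i=1}^{d} \frac{L_i}{2} (x_{k+1} - x_k)_i^2 \\ &= -\eta\, g_k^T \mathrm{sign}(\tilde{g}_k) + \frac{\eta^2}{2} \|\vec{L}\|_1 \\ &= -\eta \|g_k\|_1 + \frac{\eta^2}{2} \|\vec{L}\|_1 + 2\eta \sum_{i=1}^{d} |g_{k,i}|\, \mathbb{I}[\mathrm{sign}(\tilde{g}_{k,i}) \neq \mathrm{sign}(g_{k,i})]. \end{aligned}$$
Next we find the expected improvement at time $k+1$ conditioned on the previous iterate.
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \leq -\eta \|g_k\|_1 + \frac{\eta^2}{2} \|\vec{L}\|_1 + 2\eta \sum_{i=1}^{d} |g_{k,i}|\, \mathbb{P}[\mathrm{sign}(\tilde{g}_{k,i}) \neq \mathrm{sign}(g_{k,i})].$$
By Assumption 4 and Lemma 1, the failure probability of the sign bit is at most $\frac{1}{6}$ for $i \in H_k$ (since $\frac{2}{9 S_i^2} < \frac{1}{6}$ when $S_i > \frac{2}{\sqrt{3}}$) and at most $\frac{1}{2} - \frac{S_i}{2\sqrt{3}}$ otherwise.
Substituting this in, we get that
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \leq -\frac{2\eta}{3} \sum_{i \in H_k} |g_{k,i}| - \frac{\eta}{\sqrt{3}} \sum_{i \notin H_k} \frac{g_{k,i}^2}{\sigma_i} + \frac{\eta^2}{2} \|\vec{L}\|_1.$$
Interestingly a mixture between an $\ell_1$ and a variance weighted $\ell_2$ norm has appeared. Now substitute in the learning rate schedule, use that $\frac{2}{3} > \frac{1}{\sqrt{3}}$, and we get:
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \leq -\sqrt{\frac{f_0 - f_*}{3 \|\vec{L}\|_1 K}} \left[ \sum_{i \in H_k} |g_{k,i}| + \sum_{i \notin H_k} \frac{g_{k,i}^2}{\sigma_i} \right] + \frac{f_0 - f_*}{2K}.$$
Now extend the expectation over the randomness in the trajectory and telescope over the iterations:
$$f_0 - f_* \geq f_0 - \mathbb{E}[f_K] = \mathbb{E} \sum_{k=0}^{K-1} [f_k - f_{k+1}] \geq \sqrt{\frac{f_0 - f_*}{3 \|\vec{L}\|_1 K}}\; \mathbb{E} \sum_{k=0}^{K-1} \left[ \sum_{i \in H_k} |g_{k,i}| + \sum_{i \notin H_k} \frac{g_{k,i}^2}{\sigma_i} \right] - \frac{f_0 - f_*}{2}.$$
Finally, rearrange and substitute in $N = K$ to yield the bound
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E} \left[ \sum_{i \in H_k} |g_{k,i}| + \sum_{i \notin H_k} \frac{g_{k,i}^2}{\sigma_i} \right] \leq \frac{3\sqrt{3}}{2} \sqrt{\frac{\|\vec{L}\|_1 (f_0 - f_*)}{N}} \leq 3 \sqrt{\frac{\|\vec{L}\|_1 (f_0 - f_*)}{N}}.$$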
Proof of Theorem 2.
We begin by bounding the failure probability of the vote. We will then use this bound to derive a convergence rate.
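In our setting, we have $(1 - \alpha) M$ good workers and $\alpha M$ bad workers. The good workers each compute a stochastic gradient estimate, take its sign and transmit this to the server. The bad workers follow an identical procedure except they negate their sign bits prior to transmission to the server. Therefore we are assuming that all workers are equally powerful, meaning no worker has the extra computational resources needed to obtain a more accurate gradient estimate than their peers. Because of this, and the fact that there are more good workers than bad workers ($\alpha < \frac{1}{2}$), it is intuitive that the good workers will win the vote on average. To make this rigorous, we will need Lemma 1 and Cantelli’s inequality. Cantelli (1928) tells us that for a random variable $X$ with mean $\mu$ and variance $\sigma^2$:
$$\mathbb{P}[\mu - X \geq |\lambda|] \leq \frac{1}{1 + \lambda^2 / \sigma^2}.$$
For a given gradient component, let random variable $Z \in [0, M]$ denote the number of correct sign bits received by the parameter server. Let random variables $G$ and $B$ denote the number of good and bad workers (respectively) who (possibly inadvertently) sent the correct sign bit. Then, letting $p$ be the probability that a good worker computed the correct sign bit, $q := 1 - p$ and $\epsilon := p - \frac{1}{2}$, we can decompose $Z$ as follows:
$$\begin{aligned} Z &= G + B, \qquad G \sim \mathrm{binomial}[(1 - \alpha) M, p], \qquad B \sim \mathrm{binomial}[\alpha M, q], \\ \mathbb{E}[Z] &= (1 - \alpha) M p + \alpha M q, \\ \mathrm{Var}[Z] &= (1 - \alpha) M p q + \alpha M p q = M p q. \end{aligned}$$
The vote only fails if $Z \leq \frac{M}{2}$ which happens with probability
$$\begin{aligned} \mathbb{P}\left[ Z \leq \tfrac{M}{2} \right] &= \mathbb{P}\left[ \mathbb{E}[Z] - Z \geq \mathbb{E}[Z] - \tfrac{M}{2} \right] \\ &\leq \frac{1}{1 + \frac{(\mathbb{E}[Z] - M/2)^2}{\mathrm{Var}[Z]}} && \text{by Cantelli’s inequality} \\ &\leq \frac{1}{2} \sqrt{\frac{\mathrm{Var}[Z]}{(\mathbb{E}[Z] - M/2)^2}} && \text{since } 1 + x^2 \geq 2x \\ &= \frac{1}{2} \sqrt{\frac{M (\frac{1}{4} - \epsilon^2)}{(1 - 2\alpha)^2 M^2 \epsilon^2}} = \frac{1}{2 (1 - 2\alpha) \sqrt{M}} \sqrt{\frac{1}{4\epsilon^2} - 1}. \end{aligned}$$
From this we can derive that $\frac{1}{4\epsilon^2} - 1 < \frac{4}{S^2}$ as follows. First take the case $S \leq \frac{2}{\sqrt{3}}$. Then $\epsilon \geq \frac{S}{2\sqrt{3}}$ and $\frac{1}{4\epsilon^2} - 1 \leq \frac{3}{S^2} - 1 < \frac{4}{S^2}$. Now take the case $S > \frac{2}{\sqrt{3}}$. Then $\epsilon \geq \frac{1}{2} - \frac{2}{9 S^2}$ and we have $\frac{1}{4\epsilon^2} - 1 < \frac{4}{S^2}$ by the condition on $S$.
We have now completed the first part of the proof by showing the key statement that for the $i^{th}$ gradient component with signal to noise ratio $S = |g_i| / \sigma_i$, the failure probability of the majority vote is bounded by
$$\mathbb{P}[\text{vote fails for } i^{th} \text{ coordinate}] = \mathbb{P}\left[ Z \leq \tfrac{M}{2} \right] \leq \frac{1}{(1 - 2\alpha) \sqrt{M}\, S}.$$
The second stage of the proof will proceed by straightforwardly substituting this bound into the convergence analysis of signSGD from Bernstein et al. (2018).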
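First let’s bound the improvement of the objective during a single step of the algorithm for one instantiation of the noise. $\mathbb{I}[\cdot]$ is the indicator function, $g_{k,i}$ denotes the $i^{th}$ component of the true gradient $g(x_k)$ and $\mathrm{sign}(V_k)$ is the outcome of the vote at the $k^{th}$ iteration.
$$\begin{aligned} f_{k+1} - f_k &\leq g_k^T (x_{k+1} - x_k) + \sum_{i=1}^{d} \frac{L_i}{2} (x_{k+1} - x_k)_i^2 \\ &= -\eta\, g_k^T \mathrm{sign}(V_k) + \frac{\eta^2}{2} \|\vec{L}\|_1 \\ &= -\eta \|g_k\|_1 + \frac{\eta^2}{2} \|\vec{L}\|_1 + 2\eta \sum_{i=1}^{d} |g_{k,i}|\, \mathbb{I}[\mathrm{sign}(V_{k,i}) \neq \mathrm{sign}(g_{k,i})]. \end{aligned}$$
Next we find the expected improvement at time $k+1$ conditioned on the previous iterate.
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \leq -\eta \|g_k\|_1 + \frac{\eta^2}{2} \|\vec{L}\|_1 + 2\eta \sum_{i=1}^{d} |g_{k,i}|\, \mathbb{P}[\mathrm{sign}(V_{k,i}) \neq \mathrm{sign}(g_{k,i})].$$
From the first part of the proof, we have that the probability of the vote failing for the $i^{th}$ coordinate is bounded by
$$\mathbb{P}[\mathrm{sign}(V_{k,i}) \neq \mathrm{sign}(g_{k,i})] \leq \frac{1}{(1 - 2\alpha) \sqrt{M}} \frac{\sigma_{k,i}}{|g_{k,i}|},$$
where $\sigma_{k,i}^2$ refers to the variance of the $k^{th}$ stochastic gradient estimate, computed over a mini-batch of size $K$. Therefore, by Assumption 3, we have that $\sigma_{k,i} \leq \sigma_i / \sqrt{K}$.
We now substitute these results and our learning rate and mini-batch settings into the expected improvement:
$$\begin{aligned} \mathbb{E}[f_{k+1} - f_k \mid x_k] &\leq -\eta \|g_k\|_1 + \frac{2\eta}{(1 - 2\alpha) \sqrt{M}} \frac{\|\vec{\sigma}\|_1}{\sqrt{K}} + \frac{\eta^2}{2} \|\vec{L}\|_1 \\ &= -\sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1 K}}\, \|g_k\|_1 + \frac{2}{(1 - 2\alpha) \sqrt{M}} \sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1}} \frac{\|\vec{\sigma}\|_1}{K} + \frac{f_0 - f_*}{2K}. \end{aligned}$$
Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:
$$f_0 - f_* \geq f_0 - \mathbb{E}[f_K] = \mathbb{E} \sum_{k=0}^{K-1} [f_k - f_{k+1}] \geq \sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1 K}}\; \mathbb{E} \sum_{k=0}^{K-1} \|g_k\|_1 - \frac{2}{(1 - 2\alpha) \sqrt{M}} \sqrt{\frac{f_0 - f_*}{\|\vec{L}\|_1}}\, \|\vec{\sigma}\|_1 - \frac{f_0 - f_*}{2}.$$
We can rearrange this inequality to yield the rate:
$$\mathbb{E} \left[ \frac{1}{K} \sum_{k=0}^{K-1} \|g_k\|_1 \right] \leq \frac{1}{\sqrt{K}} \left[ \frac{2}{1 - 2\alpha} \frac{\|\vec{\sigma}\|_1}{\sqrt{M}} + \frac{3}{2} \sqrt{\|\vec{L}\|_1 (f_0 - f_*)} \right].$$
Since we are growing our mini-batch size, it will take $N = K^2$ gradient calls to reach step $K$. Substitute this in on the right hand side, square the result, use that $\frac{3}{2} < 2$, and we are done:
$$\left[ \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\, \|g_k\|_1 \right]^2 \leq \frac{4}{\sqrt{N}} \left[ \frac{1}{1 - 2\alpha} \frac{\|\vec{\sigma}\|_1}{\sqrt{M}} + \sqrt{\|\vec{L}\|_1 (f_0 - f_*)} \right]^2.$$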