Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is expected to converge at the same rate or faster than full-precision SGD. Measurements of the L1 versus L2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss, we prove that the non-convex convergence rate of majority vote matches that of distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve both communication efficiency and high accuracy.
Deep neural networks are now solving natural human tasks previously considered decades out of reach for machines (LeCun et al., 2015; Schmidhuber, 2015). Training these large-scale models can take days or even weeks. The learning process can be accelerated by distributing training over multiple processors—either GPUs linked within a single machine, or even multiple machines linked together. Communication between workers is typically handled using a parameter-server framework (Li et al., 2014), which involves repeatedly communicating the gradients of every parameter in the model. This can still be time-intensive for large-scale neural networks. The communication cost can be reduced if gradients are compressed before being transmitted. In this paper, we analyse the theory of robust schemes for gradient compression.
An elegant form of gradient compression is just to take the sign of each coordinate of the stochastic gradient vector, which we call
signSGD. The algorithm is as simple as throwing away the exponent and mantissa of a 32-bit floating point number. Sign-based methods have been studied at least since the days of Rprop (Riedmiller & Braun, 1993), an algorithm that inspired popular optimisers like RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). But researchers were interested in Rprop and its variants for their robust and fast convergence, not their potential for gradient compression. Until now there has been no rigorous theoretical explanation for the empirical success of sign-based stochastic gradient methods. The sign of the stochastic gradient is a biased approximation to the true gradient, making it more challenging to analyse than standard SGD. In this paper, we provide extensive theoretical analysis of sign-based methods for non-convex optimisation under transparent assumptions. We show that signSGD is especially efficient in problems with a particular geometry: when gradients are significantly denser than the stochasticity and curvature, signSGD can converge theoretically faster than SGD. We find empirically that both gradients and noise are dense in deep learning problems, consistent with the observation that signSGD converges at the same rate as SGD in practice.
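As a concrete illustration, here is a minimal sketch of a signSGD step on a toy quadratic; the objective, learning rate and step count are illustrative choices, not from the paper:

```python
import numpy as np

def signsgd_step(x, grad, lr):
    """One signSGD update: step against the sign of the (stochastic) gradient."""
    return x - lr * np.sign(grad)

# toy objective f(x) = 0.5 * ||x||^2 with exact gradient x
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = signsgd_step(x, grad=x, lr=0.01)

# every coordinate moves towards zero at the same fixed rate, regardless of its magnitude
assert np.allclose(x[:2], [2.0, -1.0])
assert abs(x[2]) < 0.05
```

Note the characteristic behaviour: unlike SGD, the per-coordinate step size is constant, so small coordinates reach zero quickly and then hover there.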
We then analyse signSGD in the distributed setting where the parameter server aggregates gradient signs of the workers by a majority vote. Thus we allow worker-server communication to be 1-bit compressed in both directions. We prove that the theoretical convergence rate matches that of distributed SGD, under natural assumptions that are validated by experiments.
We also extend our theoretical framework to the Signum optimiser, which takes the sign of the momentum. Our theory suggests that momentum may be useful for controlling a tradeoff between bias and variance in the estimate of the stochastic gradient. On the practical side, we show that Signum easily scales to large Imagenet models, and provided the learning rate and weight decay are tuned, all other hyperparameter settings, such as momentum, weight initialiser, learning rate schedule and data augmentation, may be lifted directly from an SGD implementation.

Distributed machine learning:
From the information theoretic angle, Suresh et al. (2017) study the communication limits of estimating the mean of a general quantity known only through samples collected from workers. In contrast, we focus exclusively on communication of gradients for optimisation, which allows us to exploit the fact that our theory does not care about incorrectly communicating small gradients. Still our work has connections with information theory: when the parameter server aggregates gradients by majority vote, it is effectively performing maximum likelihood decoding of a repetition encoding of the true gradient sign that is supplied by the M workers.

As for existing gradient compression schemes, Seide et al. (2014) and Strom (2015) demonstrated empirically that 1-bit quantisation can still give good performance whilst dramatically reducing gradient communication costs in distributed systems. Alistarh et al. (2017) and Wen et al. (2017) provide schemes with theoretical guarantees by using random number generators to ensure that the compressed gradient is still an unbiased approximation to the true gradient. Whilst unbiasedness allows these schemes to bootstrap SGD theory, it unfortunately comes at the cost of hugely inflated variance, and this variance explosion^{1} basically renders the SGD-style bounds vacuous in the face of the empirical success of these algorithms. The situation only gets worse when the parameter server must aggregate and send back the received gradients, and the existing literature largely sidesteps this issue. We compare the schemes in Table 1: notice how the existing schemes pick up log factors in the transmission from parameter server back to workers. Our proposed approach is different, in that we directly employ the sign gradient, which is biased.

^{1} For the version of QSGD with 1-bit compression, the variance explosion is by a factor of $\sqrt{d}$, where $d$ is the number of weights. It is common to have $d > 10^6$ in modern deep networks.
This avoids the randomisation needed for constructing an unbiased quantised estimate, avoids the problem of variance exploding in the theoretical bounds, and even enables 1-bit compression in both directions between parameter-server and workers, at no theoretical loss compared to distributed SGD.
| Algorithm | # bits per iteration |
|---|---|
| SGD (Robbins & Monro, 1951) | |
| QSGD (Alistarh et al., 2017) | |
| TernGrad (Wen et al., 2017) | |
| signSGD with majority vote | |
Deep learning: stochastic gradient descent (Robbins & Monro, 1951) is a simple and extremely effective optimiser for training neural networks. Still, Riedmiller & Braun (1993) noted the good practical performance of sign-based methods like Rprop for training deep nets, and since then variants such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) have become increasingly popular. Adam updates the weights according to the mean divided by the root mean square of recent gradients. Let $\mathbb{E}_\beta[\cdot]$ denote an exponential moving average with timescale $\beta$, and $\tilde g$ the stochastic gradient. Then
$$\text{Adam step} \;\propto\; \frac{\mathbb{E}_\beta[\tilde g]}{\sqrt{\mathbb{E}_\beta[\tilde g^2]}}.$$

Therefore taking the timescale of the exponential moving averages to zero, $\beta \to 0$, yields signSGD:

$$\frac{\tilde g}{\sqrt{\tilde g^2}} \;=\; \mathrm{sign}(\tilde g).$$
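This limit can be checked numerically with a toy exponential moving average; the decay parameter `beta` and tolerance below are illustrative choices, not from the paper:

```python
import numpy as np

def adam_like_step(grads, beta):
    """Ratio of EMAs of g and g^2, as in the Adam-style step above (bias correction omitted)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g ** 2
    return m / (np.sqrt(v) + 1e-12)

rng = np.random.default_rng(0)
grads = [rng.normal(size=5) for _ in range(100)]

# timescale -> 0: the EMAs collapse onto the latest gradient and the step becomes its sign
step0 = adam_like_step(grads, beta=0.0)
assert np.allclose(step0, np.sign(grads[-1]), atol=1e-3)
```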
To date there has been no convincing theory of the {Rprop, RMSprop, Adam} family of algorithms, known as ‘adaptive gradient methods’. Indeed Reddi et al. (2018) point out problems in the original convergence proof of Adam, even in the convex setting. Since signSGD belongs to this same family of algorithms, we expect our theoretical analysis to be relevant for the whole family. In a parallel work, Balles & Hennig (2018) explore the connection between signSGD and Adam in greater detail, though their theory is more restricted and lives in the convex world, and they do not analyse Signum as we do but employ it on heuristic grounds.
Optimisation: much of classic optimisation theory focuses on convex problems, where local information in the gradient tells you global information about the direction to the minimum. Whilst elegant, this theory is less relevant for modern problems in deep learning which are non-convex. In non-convex optimisation, finding the global minimum is intractable. Theorists usually settle for measuring some restricted notion of success, such as rate of convergence to stationary points (Ghadimi & Lan, 2013; Allen-Zhu, 2017a) or local minima (Nesterov & Polyak, 2006). Though Dauphin et al. (2014) suggest saddle points should abound in neural network error landscapes, practitioners report not finding this a problem in practice (Goodfellow et al., 2015) and therefore a theory of convergence to stationary points is useful and informative.
We begin our analysis of sign stochastic gradient descent in the non-convex setting. The standard assumptions of the stochastic optimisation literature are nicely summarised by Allen-Zhu (2017b). We will use more fine-grained assumptions, which reduce to the coarser standard assumptions as special cases. signSGD can exploit this additional structure, much as Adagrad (Duchi et al., 2011) exploits sparsity. We emphasise that these fine-grained assumptions do not lose anything over typical SGD assumptions, since they contain the standard assumptions as special cases.
Assumption 1 (lower bound): for all $x$ and some constant $f^*$, we have objective value $f(x) \ge f^*$.
This assumption is standard and necessary for guaranteed convergence to a stationary point.
The next two assumptions will naturally encode notions of heterogeneous curvature and gradient noise.
Assumption 2 (smoothness): let $g(x)$ denote the gradient of the objective $f$ evaluated at point $x$, and let $\eta$ be an upper bound on our learning rate. Then for all $x, y$ satisfying $\|y - x\|_\infty \le \eta$ we require that, for some non-negative constant vector $\vec L = (L_1, \dots, L_d)$,
$$\big| f(y) - f(x) - g(x)^\top (y - x) \big| \;\le\; \tfrac{1}{2} \sum_i L_i (y_i - x_i)^2.$$
For twice differentiable $f$, this implies that $|H(x)_{ii}| \le L_i$ for each coordinate $i$. This is related to the slightly weaker coordinate-wise Lipschitz condition used in the block coordinate descent literature (Richtárik & Takáč, 2014).
Lastly, we assume that we have access to the following stochastic gradient oracle:
Assumption 3 (variance bound): upon receiving query $x$, the stochastic gradient oracle gives us an independent, unbiased estimate $\tilde g(x)$ that has coordinate-wise bounded variance:
$$\mathbb{E}\big[\tilde g(x)\big] = g(x), \qquad \mathbb{E}\big[(\tilde g(x)_i - g(x)_i)^2\big] \le \sigma_i^2,$$
for a vector of non-negative constants $\vec\sigma = (\sigma_1, \dots, \sigma_d)$.
This oracle is realised merely by evaluating the gradient with respect to a data point chosen uniformly at random. In our theorem, we will be working with a mini-batch of size $n_k$ in the $k$th iteration, and the corresponding mini-batch stochastic gradient is modelled as the average of $n_k$ calls to the above oracle at $x_k$. This squashes the variance bound on the $i$th coordinate to $\sigma_i^2 / n_k$.
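A quick numerical check of this variance squashing, with hypothetical unit-variance noise (the gradient, noise scale and sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -0.5])

def oracle(n):
    """Mini-batch stochastic gradient: the average of n unbiased single-sample estimates."""
    return true_grad + rng.normal(size=(n, 2)).mean(axis=0)

samples_1 = np.stack([oracle(1) for _ in range(20000)])
samples_64 = np.stack([oracle(64) for _ in range(20000)])

# averaging n oracle calls divides the per-coordinate variance by n
ratio = samples_1.var(axis=0) / samples_64.var(axis=0)
assert np.allclose(ratio, 64, rtol=0.2)
```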
Assumptions 2 and 3 are different from the assumptions typically used for analysing the convergence properties of SGD (Nesterov, 2013; Ghadimi & Lan, 2013), but they are natural to the geometry induced by algorithms with signed updates such as signSGD and Signum.
Assumption 2 need only apply in a local neighbourhood, since all sign-based optimisers, such as signSGD, only take steps of bounded size. Assumption 2 is more fine-grained than the standard assumption, which is recovered by defining the Lipschitz constant $L := \|\vec L\|_\infty$. Then Assumption 2 implies that
$$\big| f(y) - f(x) - g(x)^\top (y - x) \big| \;\le\; \tfrac{L}{2} \|y - x\|_2^2,$$
which is the standard assumption of Lipschitz smoothness.
Assumption 3 is more fine-grained than the standard stochastic gradient oracle assumption used for SGD analysis. But again, the standard variance bound is recovered by defining $\sigma^2 := \|\vec\sigma\|_2^2$. Then Assumption 3 implies that
$$\mathbb{E}\big[ \|\tilde g(x) - g(x)\|_2^2 \big] \;\le\; \sigma^2,$$
which is the standard assumption of bounded total variance.
Under these assumptions, we can prove the following theorem:
Theorem 1 (non-convex convergence rate of signSGD): run Algorithm 1 for $K$ iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size (independently of step $k$) as
$$\eta = \frac{1}{\sqrt{\|\vec L\|_1 K}}, \qquad n = K.$$
Let $N$ be the cumulative number of stochastic gradient calls up to step $K$, i.e. $N = O(K^2)$. Then we have
$$\mathbb{E}\left[ \frac{1}{K} \sum_{k=0}^{K-1} \|g_k\|_1 \right]^2 \;\le\; \frac{1}{\sqrt{N}} \left[ \sqrt{\|\vec L\|_1} \left( f_0 - f^* + \tfrac{1}{2} \right) + 2 \|\vec\sigma\|_1 \right]^2.$$
The proof is given in Section B of the supplementary material. It follows the well known strategy of relating the norm of the gradient to the expected improvement made in a single algorithmic step, and comparing this with the total possible improvement under Assumption 1. A key technical challenge we overcome is in showing how to directly deal with a biased approximation to the true gradient. Here we will provide some intuition about the proof.
To pass the stochasticity through the non-linear sign operation in a controlled fashion, we need to prove the key statement that
$$\mathbb{P}\big[ \mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i) \big] \;\le\; \frac{\sigma_i}{\sqrt{n}\,|g_i|}.$$
This formalises the intuition that the probability of the sign of a component of the stochastic gradient being incorrect should be controlled by the signal-to-noise ratio of that component. When a component’s gradient is large, the probability of making a mistake is low, and one expects to make good progress. When the gradient is small compared to the noise, the probability of making mistakes can be high, but that doesn’t matter because going in the wrong direction is not costly when gradients are small. Since the algorithm can be as bad or worse than chance when the gradient is smaller than the noise, to converge precisely to a critical point we must gradually reduce the scale of the noise as time goes on by increasing the theoretical batch size.
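For Gaussian noise, this signal-to-noise intuition is easy to verify by simulation; the gradient and noise values below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_error_rate(g, sigma, trials=100_000):
    """Monte Carlo estimate of P[sign(g + noise) != sign(g)] for Gaussian noise."""
    noisy = g + rng.normal(scale=sigma, size=trials)
    return np.mean(np.sign(noisy) != np.sign(g))

# large signal-to-noise ratio: sign errors are rare
assert sign_error_rate(g=1.0, sigma=0.1) < 0.01
# small signal-to-noise ratio: the sign bit is close to a coin flip
assert abs(sign_error_rate(g=0.05, sigma=1.0) - 0.5) < 0.05
```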
With the intuition behind the proof out of the way, let’s understand what the theorem itself is saying. Understanding the statement revolves around understanding the geometry of signSGD. The convergence rate strikingly depends on the $\ell_1$-norm of the gradient, the stochasticity and the curvature. To understand this better, let’s define a notion of density of a high-dimensional vector $v \in \mathbb{R}^d$ as follows:
$$\phi(v) := \frac{\|v\|_1^2}{d\,\|v\|_2^2}. \tag{1}$$

To see that this is a natural definition of density, notice that $\phi(v) = 1$ for a fully dense vector, and $\phi(v) = 1/d$ for a fully sparse vector. We trivially have that $\|v\|_1^2 = \phi(v)\, d\, \|v\|_2^2$, so this notion of density provides an easy way to translate between the $\ell_1$ and $\ell_2$ norms.
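A sketch of this density measure, assuming the definition $\phi(v) = \|v\|_1^2 / (d\,\|v\|_2^2)$, with the two extreme cases checked:

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2), ranging from 1/d (sparse) to 1 (dense)."""
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v, 2) ** 2)

d = 100
dense = np.ones(d)                    # all coordinates equal
sparse = np.zeros(d); sparse[0] = 1   # a single non-zero coordinate

assert np.isclose(density(dense), 1.0)
assert np.isclose(density(sparse), 1 / d)
```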
Remember that under our assumptions, SGD-style assumptions hold with Lipschitz constant $L = \|\vec L\|_\infty$ and total variance bound $\sigma^2 = \|\vec\sigma\|_2^2$. Using our notion of density we can translate our constants into the language of SGD via the identities
$$\|\vec L\|_1 = \sqrt{\phi(\vec L)\, d}\;\|\vec L\|_2, \qquad \|\vec\sigma\|_1 = \sqrt{\phi(\vec\sigma)\, d}\;\sigma,$$
where we have assumed the gradient density to be lower bounded over the entire space. Using this and changing variables in the bound, we reach a result for signSGD whose form is very similar to a typical SGD bound, except for the appearance of ratios of the densities of the curvature and noise vectors to the density of the gradient.
So how can we interpret this bound? Let’s break into cases:
(I) Both the curvature and the stochasticity are much denser than the typical gradient. Here theory suggests SGD should converge faster than signSGD.

(II) Neither the curvature nor the stochasticity is much denser than the gradient. Here our theory suggests that signSGD should converge as fast as or faster than SGD, and also get the benefits of gradient compression.

(III) Neither of the above holds, for example the curvature is much denser than the gradient but the stochasticity is not. Then our theory is indeterminate about whether signSGD or SGD should converge faster.
Let’s briefly provide some intuition to understand how it’s possible that signSGD could outperform SGD. Imagine a scenario where the gradients are dense but there is a sparse set of extremely noisy components. Then the dynamics of SGD will be dominated by this noise, and SGD will effectively perform a random walk along these noisy components, ignoring almost all of the gradient signal. signSGD however will treat all components equally, so it will scale down the sparse noise and scale up the dense gradients comparatively, and thus make good progress.
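This scenario can be simulated directly. The sketch below uses a toy quadratic objective with huge noise on a single coordinate; the scales and step counts are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr, steps = 100, 0.01, 2000
x_sgd = np.ones(d)
x_sign = np.ones(d)
loss_sgd = loss_sign = 0.0

for _ in range(steps):
    # dense gradient of f(x) = 0.5 * ||x||^2, plus huge noise on one sparse coordinate
    noise = np.zeros(d)
    noise[0] = rng.normal(scale=100.0)
    x_sgd = x_sgd - lr * (x_sgd + noise)            # SGD: dominated by the noisy coordinate
    x_sign = x_sign - lr * np.sign(x_sign + noise)  # signSGD: noise is squashed to +/-1
    loss_sgd += 0.5 * np.sum(x_sgd ** 2)
    loss_sign += 0.5 * np.sum(x_sign ** 2)

# signSGD makes steady progress on the dense signal while SGD random-walks on the noise
assert loss_sign < loss_sgd
```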
To summarise, our theory suggests that when gradients are dense, signSGD should be more robust to large curvature and stochasticity on a sparse set of coordinates. In practice, we find that signSGD converges about as fast as SGD. That would suggest that we are either in regime (II) or (III) above. But what is the situation in practice, for the error landscape of deep neural networks?
To measure gradient and noise densities in practice, we use Welford’s algorithm (Welford, 1962; Knuth, 1997) to compute the true gradient and its stochasticity vector at every epoch of training for a Resnet-20 model on CIFAR-10. Welford’s algorithm is numerically stable and only takes a single pass through the data to compute the vectorial mean and variance. Therefore if we train a network for 160 epochs, we make an additional 160 passes through the data to evaluate these gradient statistics. Results are plotted in Figure 1. Notice that the gradient density and noise density are of the same order throughout training, and this indeed puts us in regime (II) or (III) as predicted by our theory.
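A sketch of Welford's single-pass mean and variance computation (not the authors' measurement code; the test data are synthetic):

```python
import numpy as np

def welford(stream):
    """Single-pass, numerically stable mean and population variance of a stream of vectors."""
    count, mean, m2 = 0, None, None
    for x in stream:
        count += 1
        if mean is None:
            mean, m2 = np.zeros_like(x), np.zeros_like(x)
        delta = x - mean
        mean = mean + delta / count
        m2 = m2 + delta * (x - mean)
    return mean, m2 / count

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=(10000, 4))
mean, var = welford(iter(data))

# agrees with the two-pass computation
assert np.allclose(mean, data.mean(axis=0))
assert np.allclose(var, data.var(axis=0))
```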
In Figure 4 of the supplementary, we present preliminary evidence that this finding generalises, by showing that gradients are dense across a range of datasets and network architectures. We have not yet devised an efficient means of measuring curvature densities, so we leave that for future work.
In the most common form of distributed training, workers (such as GPUs) each evaluate gradients on their own split of the data, and send the results up to a parameter-server. The parameter server aggregates the results and transmits them back to each worker (Li et al., 2014).
Up until this point in the paper, we have only analysed signSGD where the update is of the form
$$x_{k+1} = x_k - \eta\,\mathrm{sign}(\tilde g_k).$$
To get the benefits of compression, we want each worker to send the sign of the gradient evaluated only on its portion of the data. This leads to an update of the form
$$x_{k+1} = x_k - \eta\,\frac{1}{M} \sum_{m=1}^{M} \mathrm{sign}\big(\tilde g_k^{(m)}\big). \tag{good}$$
This scheme is good since what gets sent to the parameter server will be 1-bit compressed. But what gets sent back almost certainly will not. Could we hope for a scheme where all communication is 1-bit compressed?
What about the following scheme:
$$x_{k+1} = x_k - \eta\,\mathrm{sign}\left( \sum_{m=1}^{M} \mathrm{sign}\big(\tilde g_k^{(m)}\big) \right). \tag{best}$$
This is called majority vote, since each worker is essentially voting with its belief about the sign of the true gradient. The parameter server counts the votes, and sends its 1-bit decision back to every worker.
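A minimal sketch of the server-side aggregation (hypothetical helper names; assumes an odd number of workers so votes cannot tie):

```python
import numpy as np

def majority_vote_update(x, worker_grads, lr):
    """Parameter-server step: each worker sends sign bits; the server returns the sign of the vote."""
    votes = np.sign(worker_grads)            # M x d matrix of +/-1 bits (1 bit per coordinate)
    aggregate = np.sign(votes.sum(axis=0))   # elementwise majority; also 1 bit per coordinate
    return x - lr * aggregate

x = np.zeros(3)
worker_grads = np.array([[ 1.0, -2.0,  0.5],
                         [ 0.5, -1.0, -0.5],
                         [-1.0, -3.0,  2.0]])
x = majority_vote_update(x, worker_grads, lr=0.1)

# majority signs are [+1, -1, +1], so x moves to [-0.1, 0.1, -0.1]
assert np.allclose(x, [-0.1, 0.1, -0.1])
```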
Whilst we have proofs showing that both the (good) scheme and the majority vote scheme converge, majority vote is more elegant and more communication efficient, so we do not present the proof for the (good) scheme.
In Theorem 2 we first establish the general convergence rate of majority vote. Then we characterise a regime where majority vote enjoys a unilateral variance reduction, dividing the variance term in the bound by $\sqrt{M}$.
Theorem 2 (non-convex convergence rate of majority vote): run Algorithm 3 for $K$ iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size for each worker (independently of step $k$) as in Theorem 1.

Then (a) majority vote with $M$ workers converges at least as fast as signSGD in Theorem 1.

And (b) further assuming that the noise in each component of the stochastic gradient is unimodal and symmetric about the mean (e.g. Gaussian), majority vote converges at the unilaterally improved rate
$$\mathbb{E}\left[ \frac{1}{K} \sum_{k=0}^{K-1} \|g_k\|_1 \right]^2 \;\le\; \frac{1}{\sqrt{N}} \left[ \sqrt{\|\vec L\|_1} \left( f_0 - f^* + \tfrac{1}{2} \right) + \frac{2}{\sqrt{M}} \|\vec\sigma\|_1 \right]^2,$$
where $N$ is the cumulative number of stochastic gradient calls per worker up to step $K$.
The proof is given in the supplementary material, but here we sketch some details. Consider the signal-to-noise ratio of a single component of the stochastic gradient, defined as $S := |g_i| / \sigma_i$. For $S \le 1$ the gradient is small and it doesn’t matter if we get the sign wrong. For $S > 1$, we can show using a one-sided version of Chebyshev’s inequality (Cantelli, 1928) that the failure probability $q$ of that sign bit on an individual worker satisfies $q < \frac{1}{2}$. This means that the parameter server is essentially receiving a repetition code, and the majority vote decoder is known to drive down the failure probability of a repetition code exponentially in the number of repeats (MacKay, 2002).
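The repetition-code effect is easy to compute exactly. Assuming each worker's sign bit fails independently with some probability q < 1/2 (the value 0.3 below is an illustrative choice), the majority vote failure probability falls as workers are added:

```python
from math import comb

def majority_failure_prob(q, M):
    """P[majority of M workers wrong] when each sign bit is wrong independently w.p. q."""
    # ties are counted as failures, to be conservative
    return sum(comb(M, k) * q**k * (1 - q)**(M - k) for k in range((M + 1) // 2, M + 1))

q = 0.3  # single-worker failure probability below 1/2
probs = [majority_failure_prob(q, M) for M in (1, 3, 9, 27)]

# the decoder drives the error probability down as workers are added
assert all(a > b for a, b in zip(probs, probs[1:]))
```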
Remark: part (a) of the theorem does not describe a unilateral speedup over just using a single machine, which might hint that all those extra workers are wasted in this setting. This is not the case. From the proof sketch above, it should be clear that part (a) is an extremely conservative statement. In particular, we expect all regions of training where the signal-to-noise ratio of the stochastic gradient satisfies $S > 1$ to enjoy a significant speedup due to variance reduction. It’s just that since we don’t get the speedup when $S \le 1$, it’s hard to express this in a compact bound.
To sketch a proof for part (b), note that a sign bit from each worker is a Bernoulli trial; call its failure probability $q$. We can get tight control of $q$ by a convenient tail bound owing to Gauss (1823) that holds under conditions of unimodal symmetry. Then the sum of bits received by the parameter server is a binomial random variable, and we can use Cantelli’s inequality to bound its tail. This turns out to give tight enough control on the error probability of the majority vote decoder to prove the theorem.
Remark 1: assuming that the stochastic gradient of each worker is approximately symmetric and unimodal is very reasonable. In particular for increasing mini-batch size it will be an ever-better approximation by the central limit theorem. Figure 2 plots histograms of real stochastic gradients for deep neural networks. Even at batch-size 256 the stochastic gradient for an Imagenet model already looks Gaussian.
Remark 2: if you delve into the proof of Theorem 2 and graph all of the inequalities, you will notice that some of them are uniformly slack. This suggests that the assumptions of symmetry and unimodality can actually be relaxed to only hold approximately. This raises the possibility of proving a relaxed form of Gauss’ inequality and using a third moment bound in the Berry-Esseen theorem to derive a minimal batch size for which the majority vote scheme is guaranteed to work by the central limit theorem. We leave this for future work.
Remark 3: why does this theorem have anything to do with unimodality or symmetry at all? It’s because there exist very skewed or bimodal random variables $X$ with mean $\mu > 0$ such that $\mathbb{P}[X > 0]$ is arbitrarily small. This can be seen by applying Cantelli’s inequality, which is known to be tight, or by playing with simple two-point distributions. Distributions like these are a problem because they mean that adding more workers will actually drive up the error probability rather than driving it down. The beauty of the central limit theorem is that even for such a skewed or bimodal distribution, the mean of just a few tens of samples will already start to look Gaussian.
Momentum is a popular trick used by neural network practitioners that can, in our experience, speed up the training of deep neural networks and improve the robustness of algorithms to other hyperparameter settings. Instead of taking steps according to the gradient, momentum algorithms take steps according to a running average of recent gradients.
Existing theoretical analyses of momentum often rely on the absence of gradient stochasticity (e.g. Jin et al. (2017)) or convexity (e.g. Goh (2017)) to show that momentum’s asymptotic convergence rate can beat gradient descent.
It is easy to incorporate momentum into signSGD, merely by taking the sign of the momentum. We call the resulting algorithm Signum and present the algorithmic step formally in Algorithm 2. Signum fits into our theoretical framework, and we prove its convergence rate in Theorem 3.
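A minimal sketch of a Signum step on a toy quadratic; the learning rate, momentum parameter and step count are illustrative choices (see Algorithm 2 for the authors' exact formulation):

```python
import numpy as np

def signum_step(x, m, grad, lr, beta=0.9):
    """Signum: update the momentum buffer, then step along its sign."""
    m = beta * m + (1 - beta) * grad
    return x - lr * np.sign(m), m

x = np.array([3.0, -2.0])
m = np.zeros_like(x)
for _ in range(600):
    x, m = signum_step(x, m, grad=x, lr=0.01)  # gradient of 0.5 * ||x||^2 is x

# both coordinates are driven into a small neighbourhood of the minimum
assert np.all(np.abs(x) < 0.5)
```

The momentum buffer lags the gradient, so near the minimum the iterate oscillates in a band whose width scales with the learning rate times the momentum timescale.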
In Algorithm 2, set the learning rate, mini-batch size and momentum parameter respectively as
Our analysis also requires a warmup period to let the bias in the momentum settle down. The warmup should last for iterations, where is a constant that depends on the momentum parameter as follows:
Note that for , we have which is negligible. For the first iterations, accumulate the momentum as normal, but use the sign of the stochastic gradient to make updates instead of the sign of the momentum.
Let be the cumulative number of stochastic gradient calls up to step , i.e. . Then for we have
where we have used big-$O$ notation to hide numerical constants and a constant depending on the momentum parameter.
The proof is the greatest technical challenge of the paper, and is given in the supplementary material. We focus on presenting the Signum theorem in a modular form, anticipating that parts may be useful in future theoretical work. It involves a very general master lemma, Lemma 1, which can be used to help prove all the theorems in this paper.
Remark 1: switching optimisers after a warmup period is in fact commonly done by practitioners (Akiba et al., 2017).
Remark 2: the theory suggests that momentum can be used to control a bias-variance tradeoff in the quality of stochastic gradient estimates. Sending kills the variance term in due to averaging gradients over a longer time horizon. But averaging in stale gradients induces bias due to curvature of , and this blows up the term.
Remark 3: for generality, we state this theorem with a tunable learning rate. For variety, we give this theorem in any-time form with a growing batch size and decaying learning rate; this comes at the cost of additional logarithmic factors appearing in the bound.
More heuristic gradient compression schemes like TernGrad (Wen et al., 2017) quantise gradients into three levels $\{-1, 0, +1\}$. This can sometimes be desirable, and in practical settings we may wish to integrate ternary quantisation with our framework of majority vote. Our scheme should easily enable ternary quantisation—in both directions. This can be cast as “majority vote with abstention”. The scheme is as follows: workers send their vote to the parameter server, unless they are very unsure about the sign of the true gradient, in which case they send zero. The parameter server counts the votes, and if quorum is not reached (i.e. too many workers disagreed or abstained), the parameter server sends back zero. This extended algorithm should readily fit into our theory.
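A sketch of this "majority vote with abstention", with hypothetical threshold and quorum parameters (the paper does not specify them):

```python
import numpy as np

def worker_vote(g, threshold):
    """Send +/-1 when confident about the sign, 0 (abstain) when the gradient is small."""
    v = np.sign(g)
    v[np.abs(g) < threshold] = 0
    return v

def server_aggregate(votes, quorum):
    """Return the majority sign per coordinate, or 0 where quorum is not reached."""
    tally = votes.sum(axis=0)
    out = np.sign(tally)
    out[np.abs(tally) < quorum] = 0
    return out

votes = np.stack([worker_vote(g, threshold=0.1) for g in (
    np.array([ 0.5,  0.01, -0.3]),
    np.array([ 0.4, -0.02, -0.2]),
    np.array([-0.3,  0.03, -0.6]))])
result = server_aggregate(votes, quorum=2)

# coord 0: votes +1,+1,-1 tally below quorum; coord 1: all abstain; coord 2: unanimous -1
assert np.array_equal(result, [0, 0, -1])
```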
In Section 2 we pointed out that signSGD and Signum, like Adam, are members of the family of adaptive gradient methods. In all our experiments we find that Signum and Adam get extremely similar performance, although both lose out to SGD by about 2% test accuracy on Imagenet. Wilson et al. (2017) also observed that Adam tends to generalise slightly worse than SGD. It is still unclear whether this is due to the experimental baselines being biased towards models where SGD had previously been found to work well, or whether there is a deficiency in adaptive gradient methods like Adam. Our theory suggests situations where adaptive methods should outperform SGD, suggesting that these methods can be of high value—even more so given the natural compression properties of sign-based methods.
Perhaps Signum and Adam could be generalising slightly worse because we don't know how to properly regularise such methods. Whilst we found that neither standard weight decay nor the suggestion of Loshchilov & Hutter (2017) completely closed our Imagenet test set gap with SGD, it is possible that some other regularisation scheme might. One idea, suggested by our theory, is that signSGD could be squashing down noise levels. There is some empirical evidence, for example by Smith & Le (2018), that a certain level of noise can be good for generalisation, biasing the optimiser towards wider valleys in the objective function. Perhaps, then, adding Gaussian noise to the Signum update might help it generalise better. This can be achieved in a communication efficient manner in the distributed setting by sharing a random seed with each worker, and then generating the same noise on each worker. We leave this idea for future investigation, but note that due to Signum's inherent compression properties, getting it to generalise better is of high expected value.
We expect that our theoretical framework should be flexible enough to handle other interesting problems in distributed optimisation, such as delayed gradients and asynchronous worker updates. Delayed gradients can happen when communication channels are unreliable and packets can arrive later than expected. We note that these delayed or ‘stale’ gradients can be conceptualised as a form of momentum where the averaging function is not just exponential decay but exponential decay with a time lag. Therefore we expect that the theoretical framework of Signum should naturally extend to cover this case.
Finally, in Section 3 we discuss some geometric implications of our theory, and provide an efficient and robust experimental means of measuring one aspect—the ratio between noise and gradient density—through the Welford algorithm. We believe that since this density ratio is easy to measure, it may be useful to help guide those doing architecture search, to find network architectures which are amenable to fast training through gradient compression schemes.
We have presented a very general framework for studying sign-based methods in stochastic non-convex optimisation. We present the first non-vacuous bounds for gradient compression schemes, and elucidate the special geometries under which these schemes can be expected to succeed. Our theoretical framework is broad enough to handle signed-momentum schemes—like Signum—and also multi-worker distributed schemes—like majority vote.
Our work touches upon interesting aspects of the geometry of high-dimensional error surfaces, which we wish to explore in future work. But the next step for us will be to reach out to members of the distributed systems community to help benchmark the majority vote algorithm which shows such great theoretical promise for 1-bit compression in both directions between parameter-server and workers.
Riedmiller, M. & Braun, H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, 1993.

Proof of Theorem 1.
First let’s bound the improvement of the objective during a single step of the algorithm for one instantiation of the noise. Here $\mathbb{I}[\cdot]$ is the indicator function, $g_{k,i}$ denotes the $i$th component of the true gradient $g(x_k)$, and $\tilde g_{k,i}$ is a stochastic sample obeying Assumption 3.
First take Assumption 2, plug in the step from Algorithm 1, and decompose the improvement to expose the stochasticity-induced error:
Next we find the expected improvement at time $k$, conditioned on the previous iterate $x_k$.
So the expected improvement crucially depends on the probability that each component of the sign vector is correct, which is intuitively controlled by the relative scale of the gradient to the noise. To make this rigorous, first relax the probability, then use Markov’s inequality followed by Jensen’s inequality:
Here $\tilde\sigma_i^2$ refers to the variance of the mini-batch stochastic gradient estimate, computed over a mini-batch of size $n$. Therefore, by Assumption 3, we have that $\tilde\sigma_i^2 \le \sigma_i^2 / n$.
We now substitute these results and our learning rate and mini-batch settings into the expected improvement:
Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:
We can rearrange this inequality to yield the rate:
Since we are growing our mini-batch size, it will take $N = O(K^2)$ gradient calls to reach step $K$. Substitute this in, square the result, and we are done. ∎
Proof of Theorem 2. Before we introduce the unimodal symmetric assumption, let’s first address the claim that $M$-worker majority vote is at least as good as single-worker signSGD, as in Theorem 1, using only Assumptions 1 to 3.
Recall that a crucial step in Theorem 1 is showing that
$$\mathbb{P}\big[\mathrm{sign}(\tilde{g}_{k,i}) \neq \mathrm{sign}(g_{k,i})\big] \leq \frac{\sigma_i}{\sqrt{n_k}\,|g_{k,i}|}$$
for the $i^{th}$ component of the stochastic gradient with variance bound $\sigma_i^2 / n_k$.
The only difference in majority vote is that instead of using $\mathrm{sign}(\tilde{g}_{k,i})$ to approximate $\mathrm{sign}(g_{k,i})$, we are instead using the vote $\mathrm{sign}\big[\sum_{m=1}^M \mathrm{sign}(\tilde{g}_{k,i}^{(m)})\big]$ over $M$ workers. If we can show that the same bound in terms of $\sigma_i$ holds instead for
$$\mathbb{P}\left[\mathrm{sign}\Big(\sum_{m=1}^M \mathrm{sign}(\tilde{g}_{k,i}^{(m)})\Big) \neq \mathrm{sign}(g_{k,i})\right] \leq \frac{\sigma_i}{\sqrt{n_k}\,|g_{k,i}|}$$
then we are done, since the machinery of Theorem 1 can then be directly applied.
Define the signal-to-noise ratio of a component of the stochastic gradient as $S := \frac{\sqrt{n_k}\,|g_{k,i}|}{\sigma_i}$. Note that when $S \leq 1$ the bound is trivially satisfied, since a probability is never larger than one, so we need only consider the case $S > 1$.
Without loss of generality, assume that $g_{k,i}$ is negative. Then using Assumption 3 and Cantelli’s inequality (Cantelli, 1928), we get that the failure probability $q$ of a single worker satisfies
$$q := \mathbb{P}\big[\mathrm{sign}(\tilde{g}_{k,i}^{(m)}) \neq \mathrm{sign}(g_{k,i})\big] = \mathbb{P}\big[\tilde{g}_{k,i}^{(m)} - g_{k,i} \geq |g_{k,i}|\big] \leq \frac{1}{1 + S^2}$$
For $S > 1$ we then have failure probability $q < \frac{1}{2}$. If the failure probability of a single worker is smaller than $\frac{1}{2}$, then the server is essentially receiving a repetition code of the true gradient sign. Majority vote is the maximum likelihood decoder of the repetition code, and of course decreases the probability of error; see e.g. (MacKay, 2002). Therefore in all regimes of $S$ we have that
$$\mathbb{P}\left[\mathrm{sign}\Big(\sum_{m=1}^M \mathrm{sign}(\tilde{g}_{k,i}^{(m)})\Big) \neq \mathrm{sign}(g_{k,i})\right] \leq \frac{1}{1 + S^2} \leq \frac{1}{S} = \frac{\sigma_i}{\sqrt{n_k}\,|g_{k,i}|}$$
and we are done. ∎
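The repetition-code argument is easy to check numerically. The sketch below (our own; the single-worker failure probability $q = 0.2$ is just an illustrative value below one half) computes the exact probability that a majority vote over $M$ workers fails:

```python
import math

def majority_fail(q, M):
    """P[majority of M workers has the wrong sign], for workers
    failing independently with probability q (M odd)."""
    return sum(math.comb(M, k) * q**k * (1 - q)**(M - k)
               for k in range((M + 1) // 2, M + 1))

q = 0.2  # assumed single-worker sign failure probability, below 1/2
probs = [majority_fail(q, M) for M in (1, 3, 9, 27)]
print(probs)  # strictly decreasing in M: the vote decodes the repetition code
```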
That’s all well and good, but what we’d really like to show is that using $M$ workers provides a unilateral speedup. Is
$$\mathbb{P}\left[\mathrm{sign}\Big(\sum_{m=1}^M \mathrm{sign}(\tilde{g}_{k,i}^{(m)})\Big) \neq \mathrm{sign}(g_{k,i})\right] \leq \frac{1}{\sqrt{M}\,S}$$
too much to hope for?
Well, in the regime where $S$ is large such a speedup is very reasonable, since $q \leq \frac{1}{1+S^2}$ by Cantelli, and the repetition code actually supplies exponential reduction in failure rate. But we need to exclude very skewed or bimodal noise distributions, where $q$ stays close to $\frac{1}{2}$ and adding more voting workers will not help. That brings us naturally to part (b) of Theorem 2.
If we can show $\mathbb{P}[\text{vote fails}] \leq \frac{1}{\sqrt{M}\,S}$, we’ll be done, since the machinery of Theorem 1 follows through with $\sigma_i$ replaced everywhere by $\sigma_i / \sqrt{M}$. Note that the important quantity to control is the probability that the vote decodes the sign incorrectly.
Let $Z$ count the number of workers with the correct sign bit. To ensure that the majority vote is correct, $Z$ must be larger than $\frac{M}{2}$. But $Z$ is the sum of $M$ independent Bernoulli trials, and is therefore binomial with success probability $p$ and failure probability $q := 1 - p$ to be determined. Therefore we have reduced the problem to showing that
$$\mathbb{P}\left[Z \leq \frac{M}{2}\right] \leq \frac{1}{\sqrt{M}\,S}$$
where $Z$ is the number of successes of a binomial random variable $\mathrm{Bin}(M, p)$ and $S$ is our signal-to-noise ratio $S = \frac{\sqrt{n_k}\,|g_{k,i}|}{\sigma_i}$.
Let’s start by getting a bound on the success probability $p$ (or equivalently the failure probability $q$) of a single Bernoulli trial. Recall Gauss’ inequality for a unimodal random variable $X$ with mode $\nu$ and expected squared deviation $\tau^2$ from the mode (Gauss, 1823); for our symmetric distributions the mode equals the mean and $\tau^2$ is the variance:
$$\mathbb{P}\big[|X - \nu| > k\big] \leq \begin{cases} \frac{4\tau^2}{9k^2}, & k \geq \frac{2\tau}{\sqrt{3}} \\ 1 - \frac{k}{\sqrt{3}\,\tau}, & \text{otherwise} \end{cases}$$
Without loss of generality assume that $g_{k,i}$ is negative. Then applying symmetry followed by Gauss, the failure probability for the sign bit of a single worker satisfies
$$q = \mathbb{P}\big[\tilde{g}_{k,i}^{(m)} \geq 0\big] = \frac{1}{2}\,\mathbb{P}\big[|\tilde{g}_{k,i}^{(m)} - g_{k,i}| \geq |g_{k,i}|\big] \leq \begin{cases} \frac{2}{9S^2}, & S \geq \frac{2}{\sqrt{3}} \\ \frac{1}{2} - \frac{S}{2\sqrt{3}}, & \text{otherwise} \end{cases} =: \bar{q}$$
where we have defined $\bar{q}$ to be our $S$-dependent bound on $q$. Since $\bar{q} < \frac{1}{2}$, there is hope to show the speedup. Define $\epsilon := \frac{1}{2} - q$ to be the defect of $q$ from one half, and let $\bar{\epsilon} := \frac{1}{2} - \bar{q}$ be its $S$-dependent bound.
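As an independent check of this Gauss-inequality bound (our own sketch): Gaussian noise is unimodal and symmetric, and its exact sign failure probability is $\Phi(-S)$, which should sit below the bound in both regimes:

```python
import math

def gauss_bound(S):
    """S-dependent bound on a worker's sign failure probability q,
    from Gauss' inequality applied to symmetric unimodal noise."""
    if S <= 2 / math.sqrt(3):
        return 0.5 - S / (2 * math.sqrt(3))
    return 2 / (9 * S**2)

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# exact Gaussian failure probability vs. the unimodal-symmetric bound
checks = {S: (normal_cdf(-S), gauss_bound(S)) for S in (0.1, 0.5, 1.0, 2.0, 5.0)}
print(all(q <= qbar for q, qbar in checks.values()))
```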
Now that we have an analytical handle on the random variable $Z$, we may proceed to bound the probability that the vote fails. There are a number of different inequalities that we could use to bound the tail of a binomial random variable, but Cantelli’s inequality will be good enough for our purposes.
Let $Z' := M - Z$ denote the number of failures. $Z'$ is binomial with mean $Mq$ and variance $Mq(1-q)$. Then using Cantelli we get
$$\mathbb{P}\left[Z \leq \frac{M}{2}\right] = \mathbb{P}\left[Z' \geq \frac{M}{2}\right] = \mathbb{P}\big[Z' - Mq \geq M\epsilon\big] \leq \frac{1}{1 + \frac{M\epsilon^2}{q(1-q)}}$$
Now using the facts that $q(1-q) \leq \frac{1}{4}$ and $\epsilon \geq \bar{\epsilon}$, we get
$$\mathbb{P}\left[Z \leq \frac{M}{2}\right] \leq \frac{1}{1 + 4M\bar{\epsilon}^2} \leq \frac{1}{4\sqrt{M}\,\bar{\epsilon}}$$
where the last step follows since $1 + x \geq 2\sqrt{x}$. To finish, we need only show that $\frac{S}{4}$ is smaller than $\bar{\epsilon}$, or equivalently that its square is smaller than $\bar{\epsilon}^2$. Well, plugging in our bound on $q$ we get that
$$\bar{\epsilon} = \frac{1}{2} - \bar{q}$$
where
$$\bar{q} = \begin{cases} \frac{2}{9S^2}, & S \geq \frac{2}{\sqrt{3}} \\ \frac{1}{2} - \frac{S}{2\sqrt{3}}, & \text{otherwise.} \end{cases}$$
First take the case $S \leq \frac{2}{\sqrt{3}}$. Then $\bar{\epsilon} = \frac{S}{2\sqrt{3}}$, and $\frac{S}{2\sqrt{3}} \geq \frac{S}{4}$. Now take the case $S \geq \frac{2}{\sqrt{3}}$. Then $\bar{\epsilon} = \frac{1}{2} - \frac{2}{9S^2} \geq \frac{1}{3}$, and we have $\bar{\epsilon} \geq \frac{S}{4}$ by the condition on $S$.
So we have shown both cases, which proves $\bar{\epsilon} \geq \frac{S}{4}$, from which we get $\mathbb{P}\big[Z \leq \frac{M}{2}\big] \leq \frac{1}{\sqrt{M}\,S}$ and we are done. ∎
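The Cantelli tail step for the binomial failure count can also be checked numerically (our own sketch; $M = 25$ and $q = 0.3$ are illustrative values):

```python
import math

def binom_tail(M, q, t):
    """P[Z' >= t] for Z' ~ Binomial(M, q)."""
    return sum(math.comb(M, k) * q**k * (1 - q)**(M - k)
               for k in range(t, M + 1))

M, q = 25, 0.3                 # illustrative values; defect eps = 1/2 - q
eps = 0.5 - q
exact = binom_tail(M, q, (M + 1) // 2)            # exact vote-failure probability
cantelli = 1 / (1 + M * eps**2 / (q * (1 - q)))   # one-sided Cantelli bound
print(exact, "<=", cantelli)
```

As with the Markov step earlier, Cantelli is loose pointwise but suffices for the rate, because it decays with $M$.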
Now we generalize the arguments in the proof of signSGD and prove a master lemma that provides a general recipe for analyzing the approximate sign gradient method. This allows us to handle momentum and the majority voting scheme, hence proving Theorem 3 and Theorem 2.
Let $k$ and $K$ be integers satisfying $0 \leq k < K$. Consider the algorithm given by $x_{k+1} = x_k - \delta_k\,\mathrm{sign}(v_k)$, for a fixed positive sequence $\delta_k$, and where $v_k$ is a measurable and square-integrable function of the entire history up to time $k$, including $x_0, \ldots, x_k$ and all stochastic gradient oracle calls up to time $k$. Let $\xi(k)$ denote a rate function. If Assumption 1 and Assumption 2 are true, and in addition, for $k = 0, \ldots, K-1$,
$$\mathbb{E}\left[\sum_{i=1}^d |g_{k,i}|\,\mathbb{I}\big[\mathrm{sign}(v_{k,i}) \neq \mathrm{sign}(g_{k,i})\big]\right] \leq \xi(k) \qquad (2)$$
where the expectation is taken over all random variables, and the rate obeys $\xi(k) \to 0$ as $k \to \infty$ and $\delta_k \to 0$, then we have
$$\min_{0 \leq k < K} \mathbb{E}\,\|g_k\|_1 \to 0 \quad \text{as } K \to \infty.$$
In particular, if $\delta_k = \frac{\delta}{\sqrt{K}}$ and $\xi(k) \leq \frac{C}{\sqrt{K}}$, for some problem-dependent constant $C$, then we have
$$\min_{0 \leq k < K} \mathbb{E}\,\|g_k\|_1 \leq \frac{1}{\sqrt{K}}\left[\frac{f_0 - f^*}{\delta} + \frac{\delta\,\|\vec{L}\|_1}{2} + 2C\right].$$
Our general strategy will be to show that the expected objective improvement at each step will be good enough to guarantee a convergence rate in expectation. First let’s bound the improvement of the objective during a single step of the algorithm for a fixed $k$, and then take expectation. Note that $\mathbb{I}[\cdot]$ is the indicator function, and $v_{k,i}$ denotes the $i^{th}$ component of the vector $v_k$.
$$f_{k+1} - f_k \leq -\delta_k \|g_k\|_1 + \delta_k^2\, \frac{\|\vec{L}\|_1}{2} + 2\delta_k \sum_{i=1}^d |g_{k,i}|\,\mathbb{I}\big[\mathrm{sign}(v_{k,i}) \neq \mathrm{sign}(g_{k,i})\big]$$
Now, for $k \geq 0$, we need to find the expected improvement at time $k+1$ conditioned on the history up to time $k$, where the expectation is over the randomness of the stochastic gradient oracle. Note that $\mathbb{P}[E]$ denotes the probability of event $E$:
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \leq -\delta_k \|g_k\|_1 + \delta_k^2\, \frac{\|\vec{L}\|_1}{2} + 2\delta_k \sum_{i=1}^d |g_{k,i}|\,\mathbb{P}\big[\mathrm{sign}(v_{k,i}) \neq \mathrm{sign}(g_{k,i}) \mid x_k\big]$$
Note that $g_k$ becomes fixed when we condition on $x_k$. Further taking expectation over $x_k$ and applying (2), we get:
$$\mathbb{E}[f_{k+1} - f_k] \leq -\delta_k\, \mathbb{E}\,\|g_k\|_1 + \delta_k^2\, \frac{\|\vec{L}\|_1}{2} + 2\delta_k\, \xi(k) \qquad (3)$$
Rearrange the terms and sum (3) over $k = 0, \ldots, K-1$:
$$\sum_{k=0}^{K-1} \delta_k\, \mathbb{E}\,\|g_k\|_1 \leq f_0 - f^* + \sum_{k=0}^{K-1}\left(\delta_k^2\, \frac{\|\vec{L}\|_1}{2} + 2\delta_k\, \xi(k)\right)$$
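Since the master lemma is stated for any measurable $v_k$, it covers momentum. Here is a minimal numpy sketch of the Signum update, where $v_k$ is an exponential moving average of stochastic gradients (the toy problem and hyperparameters are our own assumptions, not from the paper):

```python
import numpy as np

def signum_step(x, m, g, lr=0.01, beta=0.9):
    """Signum: update the momentum buffer m, then step against sign(m)."""
    m = beta * m + (1 - beta) * g      # exponential moving average of gradients
    return x - lr * np.sign(m), m

# smoke test on a toy noisy quadratic f(x) = ||x||^2 / 2
rng = np.random.default_rng(0)
x, m = np.ones(10), np.zeros(10)
for _ in range(500):
    g = x + 0.1 * rng.standard_normal(10)
    x, m = signum_step(x, m, g)
print(float(np.abs(x).max()))  # contracted into a small neighbourhood of 0
```

The momentum buffer averages out gradient noise before the sign is taken, which is why the same master lemma machinery applies with a suitable rate function.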