signSGD: compressed optimisation for non-convex problems

02/13/2018
by   Jeremy Bernstein, et al.
0

Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is expected to converge at the same rate or faster than full-precision SGD. Measurements of the L1 versus L2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss, we prove that the non-convex convergence rate of majority vote matches that of distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve both communication efficiency and high accuracy.

READ FULL TEXT

page 11

page 12

research
05/27/2019

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Communication overhead is a major bottleneck hampering the scalability o...
research
07/20/2018

signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization

Stochastic gradient descent (SGD) is one of the most widely used optimiz...
research
10/11/2018

signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant

Training neural networks on large datasets can be accelerated by distrib...
research
09/30/2022

Downlink Compression Improves TopK Sparsification

Training large neural networks is time consuming. To speed up the proces...
research
02/14/2022

Orthogonalising gradients to speed up neural network optimisation

The optimisation of neural networks can be sped up by orthogonalising th...
research
01/28/2019

Error Feedback Fixes SignSGD and other Gradient Compression Schemes

Sign-based algorithms (e.g. signSGD) have been proposed as a biased grad...
research
08/26/2020

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

Adam is the important optimization algorithm to guarantee efficiency and...

Please sign up or login with your details

Forgot password? Click here to reset