On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

05/09/2019
by Hao Yu, et al.

Recent developments in large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from advances in distributed non-convex optimization techniques such as distributed Stochastic Gradient Descent (SGD). A series of recent works studies the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out computing capability by adding more computing nodes to the system, while reduced communication complexity is desirable because communication overhead is often the performance bottleneck in distributed systems. Momentum methods are increasingly adopted for training machine learning models, as they often converge faster and generalize better; for example, many practitioners use distributed SGD with momentum to train deep neural networks on big data. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD while also enjoying reduced communication complexity. This paper fills the gap by considering a distributed communication-efficient momentum SGD method and proving its linear speedup property.
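To make the setup concrete, below is a minimal single-process sketch of the general idea behind communication-efficient distributed momentum SGD: each of K workers runs local SGD-with-momentum steps on its own data shard, and models and momentum buffers are averaged only every H iterations. This is an illustrative sketch, not the authors' exact algorithm; the worker count, synchronization period, step sizes, and the toy least-squares objective are all assumptions chosen for brevity.

# Sketch: K workers each run local momentum SGD; parameters and momentum
# buffers are averaged only every H iterations (one communication round).
# K, H, lr, beta, and the quadratic objective are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, n_per_worker, K, H = 10, 200, 4, 8      # dimension, samples per worker, workers, sync period
lr, beta, T = 0.05, 0.9, 400               # step size, momentum factor, total iterations

# Each worker holds its own data shard; a quadratic keeps the example short,
# though the same loop applies to non-convex losses.
w_true = rng.normal(size=d)
A = [rng.normal(size=(n_per_worker, d)) for _ in range(K)]
b = [A_k @ w_true + 0.1 * rng.normal(size=n_per_worker) for A_k in A]

x = [np.zeros(d) for _ in range(K)]        # per-worker model copies
m = [np.zeros(d) for _ in range(K)]        # per-worker momentum buffers

def stoch_grad(k, x_k, batch=16):
    """Mini-batch stochastic gradient of worker k's local least-squares loss."""
    idx = rng.integers(0, n_per_worker, size=batch)
    Ak, bk = A[k][idx], b[k][idx]
    return Ak.T @ (Ak @ x_k - bk) / batch

for t in range(1, T + 1):
    for k in range(K):                     # local momentum SGD step on worker k
        m[k] = beta * m[k] + stoch_grad(k, x[k])
        x[k] = x[k] - lr * m[k]
    if t % H == 0:                         # communication round: average models
        x_avg = sum(x) / K                 # and momentum buffers across workers
        m_avg = sum(m) / K
        x = [x_avg.copy() for _ in range(K)]
        m = [m_avg.copy() for _ in range(K)]

x_final = sum(x) / K
print("final loss:", np.mean([np.mean((A[k] @ x_final - b[k]) ** 2) for k in range(K)]))

The linear speedup question the paper addresses is whether, with a suitably tuned step size, the number of stochastic gradient evaluations per worker needed to reach a target accuracy decreases roughly in proportion to the number of workers, while averaging only every H steps keeps the number of communication rounds at T/H rather than T.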

Related research

10/01/2020 - Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis
02/21/2020 - Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD
05/30/2019 - Global Momentum Compression for Sparse Communication in Distributed SGD
05/10/2022 - A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
11/20/2022 - Non-reversible Parallel Tempering for Deep Posterior Approximation
06/28/2019 - Asymptotic Network Independence in Distributed Optimization for Machine Learning
08/24/2020 - Periodic Stochastic Gradient Descent with Momentum for Decentralized Training
