CADA: Communication-Adaptive Distributed Adam

12/31/2020
by Tianyi Chen, et al.

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning, and it is often used with adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient method for distributed machine learning that can be viewed as the communication-adaptive counterpart of the celebrated Adam method, which justifies its name, CADA. The key components of CADA are a set of new rules, tailored to adaptive stochastic gradients, that can be implemented to reduce communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, while retaining convergence rates comparable to those of the original Adam. In numerical experiments, CADA achieves impressive empirical performance in reducing the total number of communication rounds.
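The core idea can be sketched in a few lines. Below is a minimal PyTorch illustration of a CADA-style lazy-upload rule, not the authors' reference implementation (see the CADA-master repository below for that): a worker computes a fresh stochastic gradient each round but skips the upload, letting the server reuse its stale gradient, whenever the change falls below a threshold. The function names, the default threshold, and the norm-based skipping test are illustrative assumptions.

```python
import torch

def should_upload(fresh_grad, stale_grad, threshold):
    # Upload only if the fresh gradient has drifted enough from the stale one
    # (relative, norm-based test; the exact CADA rules differ in detail).
    drift = sum((f - s).pow(2).sum() for f, s in zip(fresh_grad, stale_grad))
    scale = sum(f.pow(2).sum() for f in fresh_grad)
    return (drift > threshold * scale).item()

def worker_round(model, loss_fn, batch, stale_grad, threshold=0.05):
    """One local round: compute a stochastic gradient and decide whether to
    communicate it. Returns (gradient for the server to use, uploaded?)."""
    model.zero_grad()
    loss_fn(model(batch["x"]), batch["y"]).backward()
    fresh_grad = [p.grad.detach().clone() for p in model.parameters()]
    if stale_grad is None or should_upload(fresh_grad, stale_grad, threshold):
        return fresh_grad, True   # upload: server refreshes this worker's gradient
    return stale_grad, False      # skip: server reuses the stale gradient
```

On the server side, the gradients returned by all workers, whether fresh or stale, would be averaged and fed into a standard Adam update; communication is saved in every round where the upload flag comes back False.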


Related Research

02/26/2020

LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient Distributed Learning

This paper targets solving distributed machine learning problems such as...
11/20/2019

Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Recent years have witnessed the growth of large-scale distributed machin...
09/09/2019

Communication-Censored Distributed Stochastic Gradient Descent

This paper develops a communication-efficient algorithm to solve the sto...
05/25/2018

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

This paper presents a new class of gradient methods for distributed mach...
02/20/2020

Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

We study distributed optimization algorithms for minimizing the average ...
04/21/2016

Stabilized Sparse Online Learning for Sparse Data

Stochastic gradient descent (SGD) is commonly used for optimization in l...
08/16/2019

NUQSGD: Improved Communication Efficiency for Data-parallel SGD via Nonuniform Quantization

As the size and complexity of models and datasets grow, so does the need...

Code Repositories

CADA-master
