Parallel SGD: When does averaging help?

06/23/2016
by Jian Zhang, et al.

Consider a number of workers running SGD independently on the same pool of data and averaging the models every once in a while -- a common but not well understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence. For convex objectives, we show the benefit of frequent averaging depends on the gradient variance envelope. For non-convex objectives, we illustrate that this benefit depends on the presence of multiple globally optimal points. We complement our findings with multicore experiments on both synthetic and real data.
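Below is a minimal sketch of the scheme described above: several workers run SGD independently on the same data pool and periodically reset to the average model. The synthetic least-squares objective, step size, and averaging period are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of parallel SGD with periodic model averaging.
# The problem instance and hyperparameters below are assumptions
# chosen for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem shared by all workers.
n_samples, dim = 1000, 10
X = rng.normal(size=(n_samples, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)

def sgd_step(w, lr=0.01):
    """One SGD step on a uniformly sampled example."""
    i = rng.integers(n_samples)
    grad = (X[i] @ w - y[i]) * X[i]
    return w - lr * grad

def parallel_sgd(n_workers=4, n_rounds=50, steps_per_round=20):
    """Workers run SGD independently; every `steps_per_round` local
    steps, all models are replaced by their average."""
    workers = [np.zeros(dim) for _ in range(n_workers)]
    for _ in range(n_rounds):
        # Local phase: independent SGD on the same data pool.
        for k in range(n_workers):
            for _ in range(steps_per_round):
                workers[k] = sgd_step(workers[k])
        # Averaging phase: synchronize workers to the mean model.
        avg = np.mean(workers, axis=0)
        workers = [avg.copy() for _ in range(n_workers)]
    return workers[0]

w_hat = parallel_sgd()
print("parameter error:", np.linalg.norm(w_hat - w_true))
```

Decreasing `steps_per_round` corresponds to more frequent averaging (more communication); the paper's analysis concerns how this frequency interacts with gradient variance and, in the non-convex case, with the presence of multiple global optima.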


