Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

06/05/2023
by Tongtian Zhu, et al.

Decentralized stochastic gradient descent (D-SGD) enables collaborative learning across a massive number of devices simultaneously, without coordination by a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm in general non-convex, non-β-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) D-SGD contains a free uncertainty-evaluation mechanism that improves posterior estimation; (2) D-SGD exhibits a gradient-smoothing effect; and (3) the sharpness-regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
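To make the two objects in the abstract concrete, here is a minimal NumPy sketch of (a) one D-SGD iteration, i.e. gossip averaging with a doubly stochastic mixing matrix followed by a local stochastic gradient step, and (b) a Monte-Carlo estimate of an average-direction (expected-perturbation) SAM loss. The function names, the Gaussian perturbation model, and the scale `sigma` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def dsgd_step(params, W, grad_fn, lr, rng):
    """One D-SGD iteration.
    params: (n, d) array, one parameter vector per worker.
    W: (n, n) doubly stochastic mixing matrix (rows and columns sum to 1).
    grad_fn(x, i, rng): stochastic gradient of worker i's local loss at x."""
    n = params.shape[0]
    grads = np.stack([grad_fn(params[i], i, rng) for i in range(n)])
    # Gossip-average with neighbours, then take a local stochastic gradient step.
    return W @ params - lr * grads

def avg_direction_sam_loss(loss_fn, x, sigma, rng, num_samples=16):
    """Monte-Carlo estimate of the expected perturbed loss E_xi[loss(x + xi)]
    under Gaussian perturbations of scale sigma (an illustrative choice):
    the average-direction counterpart of worst-case SAM."""
    perturbed = x + sigma * rng.standard_normal((num_samples, x.shape[0]))
    return float(np.mean([loss_fn(p) for p in perturbed]))
```

When W is the full averaging matrix (every entry 1/n), each step exactly synchronizes all workers and D-SGD reduces to centralized SGD on the averaged iterate; for sparser communication topologies the workers' parameters spread around their average, and it is this consensus deviation that plays the role of the perturbation in the paper's asymptotic equivalence.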
