A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

05/10/2022
by Mingrui Liu, et al.

In distributed training of deep neural networks or Federated Learning (FL), practitioners usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the FL setting is still in its infancy: it remains unclear whether a gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in simultaneously dealing with a nonconvex loss function, a gradient that is not Lipschitz continuous, and skipped communication rounds. In this paper, we explore a relaxed-smoothness assumption on the loss landscape, which LSTM was shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have O(1/(Nϵ^4)) iteration complexity for finding an ϵ-stationary point, where N is the number of machines. This indicates that our algorithm enjoys linear speedup in the number of machines. We prove this result by introducing novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and in various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice and thus validate our theory.
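The scheme described above (each machine takes several clipped gradient steps locally, then the machines average their models) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy quartic objective, the hyperparameter values, and the specific clipping rule (step size min(lr, gamma / ||g||), one common form of clipping) are all assumptions chosen for illustration, and the machines are simulated sequentially in one process.

```python
import numpy as np

def clipped_step(x, grad, lr, gamma):
    """One clipped update: the distance moved is at most gamma (assumed rule)."""
    norm = np.linalg.norm(grad)
    step_size = lr if norm == 0 else min(lr, gamma / norm)
    return x - step_size * grad

def local_clipped_sgd(grad_fn, x0, n_machines, n_rounds, local_steps, lr, gamma, rng):
    """Simulate machines that run clipped SGD locally and average periodically."""
    x = np.array(x0, dtype=float)
    for _ in range(n_rounds):
        local_models = []
        for _ in range(n_machines):
            x_local = x.copy()  # every machine starts the round from the shared model
            for _ in range(local_steps):  # no communication during these steps
                x_local = clipped_step(x_local, grad_fn(x_local, rng), lr, gamma)
            local_models.append(x_local)
        x = np.mean(local_models, axis=0)  # one communication round: average models
    return x

def noisy_quartic_grad(x, rng):
    """Stochastic gradient of the toy objective f(x) = sum(x**4) / 4.

    The exact gradient x**3 is not globally Lipschitz, so unclipped SGD with a
    fixed step size can blow up when started far from the minimizer.
    """
    return x**3 + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_final = local_clipped_sgd(
    noisy_quartic_grad, x0=[5.0, -4.0],
    n_machines=4, n_rounds=50, local_steps=8,
    lr=0.05, gamma=0.1, rng=rng,
)
print("final iterate:", x_final)
```

With `local_steps=8`, the simulated machines exchange models once every eight updates rather than after every step, which is the communication saving the abstract refers to. The sketch says nothing about the convergence rate; it only shows the structure of the update.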

