A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

by Mingrui Liu, et al.

In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the FL setting is still in its infancy: it remains unclear whether a gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in simultaneously dealing with a nonconvex loss function, a gradient that is not Lipschitz continuous, and skipped communication rounds. In this paper, we explore a relaxed-smoothness assumption on the loss landscape, which LSTM was shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have O(1/(Nϵ^4)) iteration complexity for finding an ϵ-stationary point, where N is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and in various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice and thus validate our theory.
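The scheme described in the abstract (clipped local updates on each machine, with averaging only after several local steps) can be sketched roughly as follows. This is a minimal illustration, not the paper's exact algorithm: the clipping rule min(lr, gamma / ||g||), the toy loss f(x) = sum(x_i^4) / 4, the Gaussian gradient noise, and all function names and hyperparameter values are assumptions chosen for the example.

```python
import math
import random


def clipped_step(x, grad, lr, gamma):
    """One clipped update with step size min(lr, gamma / ||grad||) (assumed rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    step = lr if norm == 0.0 else min(lr, gamma / norm)
    return [xi - step * gi for xi, gi in zip(x, grad)]


def loss(x):
    """Toy loss f(x) = sum(x_i^4) / 4; a hypothetical stand-in, not from the paper."""
    return sum(xi ** 4 for xi in x) / 4.0


def stochastic_grad(x, rng, noise=0.1):
    """Gradient of the toy loss plus Gaussian noise (assumed noise model)."""
    return [xi ** 3 + rng.gauss(0.0, noise) for xi in x]


def local_clipping(x0, num_machines=4, rounds=50, local_steps=8,
                   lr=0.05, gamma=0.1, seed=0):
    """Simulate N machines that clip locally and average every `local_steps` steps."""
    rngs = [random.Random(seed + k) for k in range(num_machines)]
    replicas = [list(x0) for _ in range(num_machines)]
    for _ in range(rounds):
        # Local phase: each machine runs clipped SGD without communicating.
        for k in range(num_machines):
            for _ in range(local_steps):
                g = stochastic_grad(replicas[k], rngs[k])
                replicas[k] = clipped_step(replicas[k], g, lr, gamma)
        # Communication phase: average the replicas and broadcast the result.
        avg = [sum(r[i] for r in replicas) / num_machines
               for i in range(len(x0))]
        replicas = [list(avg) for _ in range(num_machines)]
    return replicas[0]


if __name__ == "__main__":
    start = [3.0, -2.0]
    final = local_clipping(start)
    print("initial loss:", loss(start))
    print("final loss:", loss(final))
```

With this rule, a single update never moves the iterate by more than gamma, which is what keeps steps bounded when the gradient is very large; communication happens once per round rather than once per step.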




EPISODE: Episodic Gradient Clipping with Periodic Resampled Corrections for Federated Learning with Heterogeneous Data

Gradient clipping is an important technique for deep neural networks wit...

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Recent developments on large-scale distributed machine learning applicat...

Distributed Training of Deep Neural Networks with Theoretical Analysis: Under SSP Setting

We propose a distributed approach to train deep neural networks (DNNs), ...

FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis

Federated Learning (FL) is an emerging learning scheme that allows diffe...

Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks

In this paper, we study distributed algorithms for large-scale AUC maxim...

Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning

The widely used stochastic gradient methods for minimizing nonconvex com...

Relaxed Earth Mover's Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

The Earth Mover's Distance (EMD) computes the optimal cost of transformi...
