Log In Sign Up

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

by   Mingrui Liu, et al.

In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicate with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have O(1/Nϵ^4) iteration complexity for finding an ϵ-stationary point, where N is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.


page 32

page 33


On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Recent developments on large-scale distributed machine learning applicat...

FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis

Federated Learning (FL) is an emerging learning scheme that allows diffe...

Distributed Training of Deep Neural Networks with Theoretical Analysis: Under SSP Setting

We propose a distributed approach to train deep neural networks (DNNs), ...

SGD May Never Escape Saddle Points

Stochastic gradient descent (SGD) has been deployed to solve highly non-...

Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks

In this paper, we study distributed algorithms for large-scale AUC maxim...

A Convergence Theory for Federated Average: Beyond Smoothness

Federated learning enables a large amount of edge computing devices to l...