Improving Efficiency in Large-Scale Decentralized Distributed Training

02/04/2020
by Wei Zhang, et al.

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant, Asynchronous Decentralized Parallel SGD (AD-PSGD), form a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases when the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost. We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. On an IBM P9 supercomputer, our system is able to train an LSTM acoustic model in 2.28 hours with 7.5% WER on the Switchboard (SWB) test set and 13.3% WER on the CallHome (CH) test set, and in 1.98 hours with 7.7% WER on SWB and 13.3% WER on CH, the fastest training time reported to date.
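As a rough illustration of the spectral-gap issue described above (a minimal sketch, not the paper's implementation; it assumes a simple ring topology in which each learner averages with itself and its two neighbors, and all names in it are illustrative), the Python snippet below builds the ring mixing matrix, computes its spectral gap 1 - |lambda_2|, and applies one synchronous D-PSGD-style update in which each learner mixes parameters with its neighbors and then takes a local gradient step.

# Minimal sketch (assumed ring topology, not the paper's implementation):
# spectral gap of a ring mixing matrix and one synchronous D-PSGD-style step.
import numpy as np

def ring_mixing_matrix(n):
    # Doubly stochastic matrix: each learner averages equally with itself
    # and its two ring neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def spectral_gap(W):
    # 1 minus the second-largest eigenvalue magnitude; a smaller gap
    # means slower information mixing across learners.
    eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eigvals[1]

def dpsgd_step(params, grads, W, lr=0.1):
    # One synchronous D-PSGD-style update: mix neighbors' parameters via W,
    # then take a local gradient step. params, grads: (n_learners, dim).
    return W @ params - lr * grads

# The gap shrinks as the number of learners grows, which is the
# convergence bottleneck the paper targets.
for n in (8, 16, 32, 64):
    print(n, round(spectral_gap(ring_mixing_matrix(n)), 5))

# One illustrative update with random parameters and gradients for 16 learners.
rng = np.random.default_rng(0)
params = rng.normal(size=(16, 4))
params = dpsgd_step(params, rng.normal(size=(16, 4)), ring_mixing_matrix(16))

For this ring topology the printed gap shrinks roughly as 1/n^2 as the number of learners n grows, which is the bottleneck the paper addresses by improving the spectral gap while keeping per-step communication cost low.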

