EmbRace: Accelerating Sparse Communication for Distributed Training of NLP Neural Networks

10/18/2021
by Shengwei Li, et al.

Distributed data-parallel training is widely used for natural language processing (NLP) neural network models. However, the embedding tables in NLP models, which hold a large portion of the parameters and introduce dramatic sparsity into communication, make it challenging to scale distributed training efficiently. Current distributed training frameworks mainly target dense models and neglect the sparsity of NLP models, resulting in significant communication overhead and relatively poor scalability. In this paper, we propose EmbRace, an efficient communication framework designed to accelerate the sparse communication of distributed NLP model training. EmbRace introduces Sparsity-aware Hybrid Communication, which combines AlltoAll and AllReduce to reduce the communication overhead of the sparse and dense data in NLP models. EmbRace further introduces 2D Communication Scheduling, which thoroughly overlaps communication with computation by optimizing the model computation procedure, relaxing the dependencies on embeddings, and scheduling communication with a priority queue. We implement EmbRace on top of PyTorch and Horovod and conduct comprehensive evaluations with four representative NLP models on two high-performance GPU clusters. Experimental results show that EmbRace achieves up to a 30.66X speedup over four popular distributed training baselines on 16-GPU clusters.
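
The hybrid scheme can be pictured with stock collective primitives. Below is a minimal sketch, not EmbRace's actual code: it assumes PyTorch's torch.distributed with an already-initialized process group, and the function name hybrid_step and its arguments are illustrative. Sparse embedding-gradient rows, pre-bucketed by destination rank, travel via AlltoAll, while dense gradients take the usual AllReduce path:

```python
# Minimal sketch (not EmbRace's actual implementation) of hybrid
# AlltoAll/AllReduce communication using stock torch.distributed.
# Assumes an initialized process group; with the NCCL backend all
# tensors must live on CUDA devices.
import torch
import torch.distributed as dist

def hybrid_step(sparse_grad_rows, row_dest_counts, dense_grads):
    """Exchange sparse embedding rows with AlltoAll; average dense
    gradients with AllReduce.

    sparse_grad_rows -- embedding-gradient rows, pre-sorted by destination rank
    row_dest_counts  -- number of rows bound for each rank (len == world_size)
    dense_grads      -- list of dense-parameter gradients, reduced in place
    """
    world_size = dist.get_world_size()
    device = sparse_grad_rows.device

    # First AlltoAll: tell each peer how many rows to expect from us.
    send_counts = torch.tensor(row_dest_counts, device=device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Second AlltoAll: move the sparse gradient rows themselves.
    recv_rows = sparse_grad_rows.new_empty(
        (int(recv_counts.sum()), sparse_grad_rows.shape[1]))
    dist.all_to_all_single(recv_rows, sparse_grad_rows,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=list(row_dest_counts))

    # Dense gradients take the usual AllReduce path.
    for g in dense_grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
        g.div_(world_size)

    return recv_rows  # rows this rank now owns and must apply locally
```

The point of routing only the touched embedding rows point-to-point is that it avoids densifying sparse gradients into full-table AllReduce traffic, which is where the communication savings on embedding-heavy models come from.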

