Lina: Accelerating Distributed Training of Sparsely Activated Models

10/31/2022
by Jiamin Li, et al.

Scaling model parameters usually improves model quality, but at the price of high computation overhead. Sparsely activated models, usually in the form of the Mixture of Experts (MoE) architecture, have constant computation cost compared with their dense counterparts, thus providing opportunities to train and serve a large model at reasonable cost. However, distributed training of an MoE model is prone to low efficiency, mainly due to the interleaved all-to-all communication during model computation. This paper makes three main contributions. First, we systematically analyze the all-to-all overhead in distributed training of MoE. Second, we propose a new communication scheduling scheme based on tensor partitioning that prioritizes all-to-all operations over other communication, owing to their blocking nature. Third, we introduce expert packing, which reduces the all-to-all transfer size, and incorporate optimizations to mitigate its overheads. Both techniques effectively tackle the all-to-all bottleneck, and we integrate them into a new system called Lina. Experiments on an A100 GPU testbed show that Lina improves the training step time of popular NLP models by up to 1.73x over the state of the art.
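As a rough illustration of the pattern the abstract describes, the PyTorch sketch below shows the blocking all-to-all that dispatches routed tokens to experts on other GPUs, and a tensor-partitioned variant that splits the same payload into several smaller all-to-all operations. This is an illustrative sketch, not code from the paper: the function names, the chunk count, and the synchronous loop are assumptions.

import torch
import torch.distributed as dist


def dispatch_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """One blocking all-to-all: rank r sends the i-th slice of `tokens`
    (split along dim 0) to rank i. Expert computation cannot start until
    this collective completes, which is the bottleneck the paper analyzes."""
    out = torch.empty_like(tokens)
    dist.all_to_all_single(out, tokens)
    return out


def dispatch_tokens_partitioned(tokens: torch.Tensor, chunks: int = 4) -> torch.Tensor:
    """Hypothetical tensor-partitioned variant: split the payload along the
    hidden dimension and issue several smaller all-to-alls whose results
    concatenate to the same output. The smaller operations are what a
    priority-aware scheduler could start earlier and interleave with other
    traffic such as data-parallel all-reduce."""
    received = []
    for send in tokens.chunk(chunks, dim=-1):
        send = send.contiguous()  # collectives expect contiguous buffers
        recv = torch.empty_like(send)
        dist.all_to_all_single(recv, send)
        received.append(recv)
    return torch.cat(received, dim=-1)

In the actual system, such micro-operations would be issued through a scheduler that gives all-to-all priority over concurrent all-reduce traffic; the loop above only illustrates the partitioning itself.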


Related research:

10/18/2021 · EmbRace: Accelerating Sparse Communication for Distributed Training of NLP Neural Networks
Distributed data-parallel training has been widely used for natural lang...

03/11/2023 · A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Mixture-of-Experts (MoE) is a neural network architecture that adds spar...

03/28/2022 · HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System
As giant dense models advance quality but require large-scale expensive ...

06/30/2022 · Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning
The second-order optimization methods, notably the D-KFAC (Distributed K...

05/25/2020 · Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity
Originated from distributed learning, federated learning enables privacy...

04/22/2023 · Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
The Mixture of Experts (MoE) model becomes an important choice of large ...

05/12/2021 · Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
Recent trend towards increasing large machine learning models require bo...
