A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

03/11/2023
by Siddharth Singh, et al.

Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over the baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs.
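To make the "sparsely activated expert blocks" concrete, the following is a minimal top-1-gated MoE feed-forward layer in PyTorch. It is a sketch for intuition only, not the DeepSpeed-TED implementation; the capacity-free routing and the 4x hidden expansion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    """Top-1 gated mixture-of-experts feed-forward block (illustrative)."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), e.g. a flattened (batch, seq) input
        probs = F.softmax(self.gate(x), dim=-1)
        weight, expert_idx = probs.max(dim=-1)  # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Only tokens routed to expert i touch its weights, so
                # parameter count scales with num_experts while the
                # per-token compute stays that of a single expert.
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Production MoE systems additionally impose per-expert capacity limits and add a load-balancing loss so that tokens spread evenly across experts; those details are omitted here.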

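The three-dimensional decomposition itself can be pictured as factoring the set of GPU ranks into tensor x expert x data process groups. The sketch below, assuming PyTorch's torch.distributed and an illustrative rank layout (tensor dimension varying fastest, then expert, then data), shows one way such groups could be constructed; it is not the paper's exact group-construction code.

```python
import torch.distributed as dist

def build_3d_groups(tensor_size: int, expert_size: int):
    """Partition ranks into tensor-, expert-, and data-parallel groups.

    Assumes dist.init_process_group() has already been called, and that
    rank = d * (E * T) + e * T + t for data index d, expert index e,
    tensor index t (an illustrative layout, not DeepSpeed-TED's own).
    """
    world = dist.get_world_size()
    assert world % (tensor_size * expert_size) == 0
    data_size = world // (tensor_size * expert_size)
    rank = dist.get_rank()
    tp_group = ep_group = dp_group = None

    # Tensor-parallel groups: blocks of consecutive ranks share one
    # sharded copy of the model's layers.
    for i in range(world // tensor_size):
        ranks = list(range(i * tensor_size, (i + 1) * tensor_size))
        g = dist.new_group(ranks)  # every rank must join every creation call
        if rank in ranks:
            tp_group = g

    # Expert-parallel groups: same (data, tensor) slot across experts;
    # the all-to-all that exchanges routed tokens runs over these ranks.
    for d in range(data_size):
        for t in range(tensor_size):
            base = d * expert_size * tensor_size + t
            ranks = [base + e * tensor_size for e in range(expert_size)]
            g = dist.new_group(ranks)
            if rank in ranks:
                ep_group = g

    # Data-parallel groups: same (expert, tensor) slot across replicas;
    # gradient all-reduces run over these ranks.
    for e in range(expert_size):
        for t in range(tensor_size):
            base = e * tensor_size + t
            ranks = [base + d * expert_size * tensor_size
                     for d in range(data_size)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g

    return tp_group, ep_group, dp_group
```

For example, 128 GPUs could be factored as tensor_size=4 and expert_size=16, leaving data_size=2; which factorization is best depends on the base model size and the interconnect, which is precisely the trade-off the hybrid algorithm exposes.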

