DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

01/14/2022
by Samyam Rajbhandari, et al.

As the training of giant dense models hits the boundary of hardware availability and capability, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. These training cost savings have been demonstrated for encoder-decoder models (prior works) and extend to a 5x saving for auto-regressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, providing fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency to serve massive MoE models, with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
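Since the abstract notes that DeepSpeed-MoE ships as part of the DeepSpeed library, the sketch below illustrates how a Transformer feed-forward block might be swapped for a DeepSpeed MoE layer. This is a minimal, assumed usage example, not code from the paper: the deepspeed.moe.layer.MoE parameter names (num_experts, k, use_residual) and the distributed-setup step follow the library's public documentation and should be checked against the installed DeepSpeed version.

```python
# Minimal sketch (assumed usage, not from the paper): replacing a Transformer
# FFN block with a DeepSpeed MoE layer. Verify parameter names and defaults
# against the DeepSpeed version you have installed.
import torch
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

# MoE expert parallelism expects a distributed environment, e.g. a script
# started with the `deepspeed` launcher.
deepspeed.init_distributed()

hidden_size = 1024

# The dense feed-forward block that each expert replicates.
expert_ffn = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

# Top-1 gated MoE layer with 128 experts. `use_residual=True` is assumed to
# enable the residual-expert variant (dense MLP output plus a gated expert
# correction), in the spirit of the PR-MoE design described in this work.
moe_layer = MoE(
    hidden_size=hidden_size,
    expert=expert_ffn,
    num_experts=128,
    k=1,
    use_residual=True,
)

x = torch.randn(8, 16, hidden_size)             # (batch, sequence, hidden)
output, aux_loss, expert_counts = moe_layer(x)  # returns (out, l_aux, counts)
print(output.shape, float(aux_loss))
```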

Related research

11/18/2022 · Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
Mixture of Experts (MoE) models with conditional execution of sparsely a...

12/09/2022 · Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Training large, deep neural networks to convergence can be prohibitively...

05/24/2022 · Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with ...

01/27/2023 · Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
Improving the deployment efficiency of transformer-based language models...

04/11/2023 · Training Large Language Models Efficiently with Sparsity and Dataflow
Large foundation language models have shown their versatility in being a...

05/04/2022 · Optimizing Mixture of Experts using Dynamic Recompilations
The Mixture of Experts architecture allows for outrageously large neural...

11/21/2022 · You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Large-scale Transformer models bring significant improvements for variou...
