Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

11/18/2022
by Young Jin Kim, et al.

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks, including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to their large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches that accelerate the computation of sparse models and significantly reduce memory consumption. While we achieve up to 26x speed-up in terms of throughput, we also shrink the model size to almost one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models at 27% lower cost and with significantly better quality compared to the existing solutions. This enables a paradigm shift in deploying large scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.
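The roughly 8x size reduction comes from storing each expert weight as a 4-bit integer plus a shared scale instead of a 32-bit float. The sketch below illustrates the general idea with simple symmetric per-row weight quantization in NumPy; the function names, packing scheme, and matrix shapes are illustrative assumptions, not the paper's actual fused GPU kernels.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-row 4-bit quantization of an expert weight matrix (illustrative)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map each row into [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    qu = q.astype(np.uint8)                                # keep the raw two's-complement bits
    packed = (qu[:, 0::2] & 0x0F) | ((qu[:, 1::2] & 0x0F) << 4)  # two weights per byte
    return packed, scale.astype(np.float32)

def dequantize_int4(packed: np.ndarray, scale: np.ndarray):
    """Unpack back to float weights for the expert's matrix multiply."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                     # restore sign of the 4-bit values
    hi = np.where(hi > 7, hi - 16, hi)
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = lo, hi
    return q.astype(np.float32) * scale

# One expert's feed-forward weight (shapes chosen only for illustration).
w = np.random.randn(4096, 1024).astype(np.float32)
packed, scale = quantize_int4(w)
w_hat = dequantize_int4(packed, scale)
print(packed.nbytes / w.nbytes)   # ~0.125: int4 storage is about 1/8 of fp32 (plus small per-row scales)
print(np.abs(w - w_hat).max())    # per-row quantization error stays small
```

In an inference framework like the one described, dequantization would typically be fused into the expert GEMM so full-precision weights are never materialized; it is kept as a separate step here only for clarity.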
