Scalable and Efficient MoE Training for Multitask Multilingual Models

09/22/2021
by Young Jin Kim, et al.

Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute cost grows sublinearly with their parameter count. In contrast with dense models, the sparse architecture of MoE offers opportunities to drastically grow model size with significant accuracy gains while consuming a much lower compute budget. However, supporting large-scale MoE training also brings its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in machine translation (MT) and multilingual natural language generation tasks. The system support for efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
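To make the sparse-activation and expert-pruning ideas concrete, below is a minimal PyTorch sketch of a top-1 gated MoE layer together with a utilization-based pruning helper. This is an illustration only, not the paper's method or the DeepSpeed MoE API; all names (SimpleMoELayer, prune_experts) and the pruning heuristic (keeping the most frequently routed experts) are assumptions made for the example.

```python
# Illustrative sketch: top-1 gated MoE layer plus utilization-based expert pruning.
# Not the DeepSpeed implementation; all names here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    """Top-1 gated MoE feed-forward layer: each token is routed to a single expert,
    so per-token compute stays roughly constant as the number of experts grows."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        top1 = probs.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # Only the selected expert runs on its tokens -> sparse activation.
                out[mask] = expert(x[mask]) * probs[mask, e].unsqueeze(-1)
        return out, top1


def prune_experts(layer: SimpleMoELayer, top1: torch.Tensor, keep: int) -> SimpleMoELayer:
    """Keep the `keep` most-used experts (by routing counts on sample tokens)
    to shrink the layer for faster inference; the gate is sliced to match."""
    counts = torch.bincount(top1, minlength=len(layer.experts)).float()
    keep_idx = counts.topk(keep).indices.sort().values
    pruned = SimpleMoELayer(layer.gate.in_features,
                            layer.experts[0][0].out_features, keep)
    pruned.gate.weight.data = layer.gate.weight.data[keep_idx].clone()
    pruned.experts = nn.ModuleList(layer.experts[i] for i in keep_idx.tolist())
    return pruned


if __name__ == "__main__":
    moe = SimpleMoELayer(d_model=16, d_ff=64, num_experts=8)
    tokens = torch.randn(32, 16)
    y, routes = moe(tokens)
    smaller = prune_experts(moe, routes, keep=2)  # e.g. retain the 2 busiest experts
    print(y.shape, smaller(tokens)[0].shape)
```

The sketch only conveys the two ideas named in the abstract: conditional (sparse) execution of experts, and dropping rarely used experts to cut inference cost; production systems add load-balancing losses, capacity limits, and expert parallelism across devices.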
