FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

04/08/2023
by Xiaonan Nie, et al.

With ever-increasing data volumes, there is a trend toward large-scale pre-trained models that store knowledge in an enormous number of model parameters. Training these models involves massive dense algebra and consumes a huge amount of hardware resources. Recently, sparsely gated Mixture-of-Experts (MoE) architectures have grown in popularity and demonstrated impressive pre-training scalability across various downstream tasks. However, such sparse conditional computation may not be as effective as expected in practical systems because of routing imbalance and routing fluctuation. More broadly, MoEs are becoming a new data analytics paradigm in the data life cycle and face unique challenges at scales, complexities, and granularities never before possible. In this paper, we propose FlexMoE, a novel DNN training framework that systematically and transparently addresses the inefficiency caused by dynamic dataflow. We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome routing imbalance and fluctuation through a dynamic expert-management and device-placement mechanism. We then introduce a scheduling module on top of the existing DNN runtime that monitors the data flow, makes scheduling plans, and dynamically adjusts the model-to-hardware mapping guided by real-time data traffic. A simple but efficient heuristic algorithm is employed to dynamically optimize the device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin). Results show that FlexMoE achieves superior performance over existing systems on real-world workloads: it outperforms DeepSpeed by 1.70x on average (up to 2.10x) and FasterMoE by 1.30x on average (up to 1.45x).
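The abstract describes a scheduling module that tracks real-time expert traffic and adjusts the expert-to-device mapping with a lightweight heuristic. The following is a minimal, hypothetical sketch of what such a traffic-guided placement heuristic could look like; `rebalance_placement`, its parameters, and the replicate/evict policy are illustrative assumptions, not FlexMoE's actual algorithm.

```python
# Minimal sketch (not FlexMoE's actual implementation) of a load-driven
# expert-placement heuristic: experts whose observed token traffic exceeds
# the average load gain an extra replica on the least-loaded device, while
# cold experts release surplus replicas. All names here are hypothetical.
from collections import defaultdict
from typing import Dict, List


def rebalance_placement(
    token_counts: Dict[int, int],       # expert_id -> tokens routed in this window
    placement: Dict[int, List[int]],    # expert_id -> device ids currently hosting it
    num_devices: int,
    imbalance_threshold: float = 1.5,
) -> Dict[int, List[int]]:
    """Return an adjusted expert-to-device mapping based on observed traffic."""
    avg_load = max(1, sum(token_counts.values()) // max(1, len(token_counts)))

    # Estimate per-device load, assuming traffic splits evenly across replicas.
    device_load = defaultdict(int)
    for eid, devices in placement.items():
        for d in devices:
            device_load[d] += token_counts.get(eid, 0) // max(1, len(devices))

    new_placement = {eid: list(devs) for eid, devs in placement.items()}

    # Visit experts from hottest to coldest.
    for eid, count in sorted(token_counts.items(), key=lambda kv: -kv[1]):
        replicas = new_placement[eid]
        per_replica = count / max(1, len(replicas))
        if per_replica > imbalance_threshold * avg_load:
            # Hot expert: add a replica on the least-loaded device not hosting it.
            candidates = [d for d in range(num_devices) if d not in replicas]
            if candidates:
                target = min(candidates, key=lambda d: device_load[d])
                replicas.append(target)
                device_load[target] += count // len(replicas)
        elif per_replica < avg_load / imbalance_threshold and len(replicas) > 1:
            # Cold expert: drop one replica to free memory for hot experts.
            victim = max(replicas, key=lambda d: device_load[d])
            replicas.remove(victim)

    return new_placement
```

In this sketch, hot experts gain replicas on the least-loaded devices while cold experts give replicas back, which mirrors the abstract's idea of steering the model-to-hardware mapping by real-time routing traffic rather than keeping a static placement.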
