SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System

05/20/2022
by Liang Shen, et al.

With the increasing diversity of ML infrastructures nowadays, distributed training over heterogeneous computing systems is desired to facilitate the production of big models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelism of various types. For scalable inference in a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, where it successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that SE-MoE outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. Particularly, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with 18% lower memory footprints. The code of the framework will be released at: https://github.com/PaddlePaddle/Paddle.
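The ring-of-sections inference scheme described above can be pictured with a minimal sketch: parameter sections are rotated through a fixed-size GPU-resident window in round-robin order, so a model larger than GPU memory can still be executed section by section. Everything below (the RingSection class, ring_inference, device_window, and the placeholder compute/copy calls) is hypothetical illustration, not the SE-MoE implementation.

```python
# Conceptual sketch (not the SE-MoE code): a ring of parameter "sections"
# shared between host (CPU) and device (GPU) memory. Only a fixed-size
# window of sections is resident on the device at any moment; inference
# walks the ring round-robin, evicting the oldest section as the next one
# is loaded. Section sizes, window size, and the copy/compute calls are
# stand-ins for real memory transfers and kernel launches.

from collections import deque


class RingSection:
    def __init__(self, idx, params):
        self.idx = idx          # position in the ring
        self.params = params    # placeholder for this section's weights
        self.on_device = False  # True while resident in GPU memory

    def load_to_device(self):   # stand-in for a host-to-device copy
        self.on_device = True

    def offload_to_host(self):  # stand-in for a device-to-host copy
        self.on_device = False


def ring_inference(sections, inputs, device_window=2):
    """Run the model section by section, keeping at most
    `device_window` sections on the device at any moment."""
    resident = deque()                      # sections currently on the device
    activations = inputs
    for section in sections:
        if len(resident) == device_window:  # window full: evict the oldest
            resident.popleft().offload_to_host()
        section.load_to_device()
        resident.append(section)
        # placeholder compute: a real system would run this section's layers
        activations = [a + section.idx for a in activations]
    while resident:                         # drain the window after the pass
        resident.popleft().offload_to_host()
    return activations


if __name__ == "__main__":
    ring = [RingSection(i, params=None) for i in range(6)]
    print(ring_inference(ring, inputs=[0.0], device_window=2))
```

In this toy setup the round-robin rotation bounds peak device memory to the window size regardless of total model size, which is the property the abstract attributes to the CPU-GPU ring of sections.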

