HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

03/28/2022
by Xiaonan Nie, et al.

As giant dense models advance model quality but require large-scale, expensive GPU clusters for training, the sparsely gated Mixture-of-Experts (MoE), a conditional-computation architecture, has been proposed to scale models while keeping the computation cost roughly constant. Specifically, a gate network routes each input to only a subset of the expert networks, so only those experts are activated. Existing MoE training systems support only some of the mainstream MoE gates (e.g., Top-K) and assume expensive high-bandwidth GPU clusters. In this paper, we present HetuMoE, a high-performance large-scale sparse MoE training system built on Hetu. HetuMoE provides multiple gating strategies and efficient GPU kernel implementations. To further improve training efficiency on commodity GPU clusters (e.g., nodes with only one NIC), we introduce a hierarchical AllToAll communication scheme that combines hierarchical networking with message aggregation. Compared with existing state-of-the-art MoE systems, HetuMoE obtains at least a 15% speedup, and outperforms DeepSpeed-MoE by up to 8.1x under the switch gate with a batch size of 32. The code is available at: https://github.com/PKU-DAIR/Hetu.
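To make the routing concrete, the sketch below implements a minimal top-1 (Switch-style) gate in plain NumPy: every token is scored against every expert, sent to its highest-scoring expert, and dropped once that expert reaches its capacity. This is an illustrative sketch of the general mechanism described in the abstract, not HetuMoE's GPU kernel implementations; the names switch_gate, w_gate, and capacity are our own placeholders.

```python
import numpy as np

def switch_gate(x, w_gate, capacity):
    """Minimal top-1 (Switch-style) gate sketch: softmax-score tokens against
    experts, route each token to its best expert, drop overflow beyond capacity."""
    logits = x @ w_gate                                     # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)               # softmax over experts
    expert_idx = probs.argmax(axis=1)                       # chosen expert per token
    gate_score = probs[np.arange(x.shape[0]), expert_idx]   # weight for the expert output

    num_experts = w_gate.shape[1]
    dispatch = [[] for _ in range(num_experts)]             # token ids routed to each expert
    for token, expert in enumerate(expert_idx):
        if len(dispatch[expert]) < capacity:                # respect per-expert capacity
            dispatch[expert].append(token)
    return dispatch, gate_score

# Example: 8 tokens, hidden size 4, 2 experts, at most 5 tokens per expert.
rng = np.random.default_rng(0)
print(switch_gate(rng.standard_normal((8, 4)), rng.standard_normal((4, 2)), capacity=5))
```

The hierarchical AllToAll mentioned above hinges on message aggregation: with a single NIC per node, sending one large buffer per destination node is far cheaper than many small per-GPU buffers. The helper below sketches only that regrouping step under assumed names (group_by_node, send_buffers); the actual intra-node and inter-node exchanges in HetuMoE run over the real network stack.

```python
def group_by_node(send_buffers, gpus_per_node):
    """Concatenate per-destination-GPU buffers that target the same node, so the
    inter-node exchange moves one aggregated message per destination node."""
    num_nodes = len(send_buffers) // gpus_per_node
    return [
        [item
         for gpu in range(node * gpus_per_node, (node + 1) * gpus_per_node)
         for item in send_buffers[gpu]]
        for node in range(num_nodes)
    ]

# Example: 4 destination GPUs spread over 2 nodes.
print(group_by_node([["t0"], ["t1"], ["t2"], ["t3"]], gpus_per_node=2))  # [['t0', 't1'], ['t2', 't3']]
```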

Related research

Dense-to-Sparse Gate for Mixture-of-Experts (12/29/2021)
Mixture-of-experts (MoE) is becoming popular due to its success in impro...

FastMoE: A Fast Mixture-of-Expert Training System (03/24/2021)
Mixture-of-Expert (MoE) presents a strong potential in enlarging the siz...

SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System (05/20/2022)
With the increasing diversity of ML infrastructures nowadays, distribute...

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference (08/23/2023)
Large language models (LLMs) based on transformers have made significant...

Lita: Accelerating Distributed Training of Sparsely Activated Models (10/31/2022)
Scaling model parameters usually improves model quality, but at the pric...

HINT: Hierarchical Mixture Networks For Coherent Probabilistic Forecasting (05/11/2023)
We present the Hierarchical Mixture Networks (HINT), a model family for ...

Revisiting Single-gated Mixtures of Experts (04/11/2023)
Mixture of Experts (MoE) are rising in popularity as a means to train ex...
