Dense-to-Sparse Gate for Mixture-of-Experts

12/29/2021
by Xiaonan Nie, et al.

Mixture-of-experts (MoE) is becoming popular due to its success in improving model quality, especially in Transformers. By routing tokens with a sparse gate to a few experts, each of which contains only part of the full model, MoE keeps the model size unchanged while significantly reducing per-token computation, which effectively scales neural networks. However, we find that the current approach of jointly training the experts and the sparse gate hurts model accuracy, diminishing the efficiency of expensive large-scale model training. In this work, we propose the Dense-To-Sparse gate (DTS-Gate) for MoE training. Specifically, instead of using a permanently sparse gate, DTS-Gate begins as a dense gate that routes tokens to all experts and then gradually and adaptively becomes sparser, routing each token to fewer experts. MoE with DTS-Gate naturally decouples the training of the experts and the sparse gate: all experts are trained first, and the sparse gate is learned afterwards. Experiments show that, compared with the state-of-the-art Switch-Gate on a GPT-MoE (1.5B) model with the OpenWebText dataset (40 GB), DTS-Gate obtains a 2.0x speed-up to reach the same validation perplexity, as well as higher FLOPs efficiency (a 1.42x speed-up).
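
To make the gating idea concrete, below is a minimal PyTorch-style sketch of a dense-to-sparse gate. It is not the authors' implementation: the linear schedule (shrinking the number of active experts while sharpening the routing softmax) and all names (DenseToSparseGate, k_final, total_steps) are illustrative assumptions about one way the dense-to-sparse transition described above could be realized.

# Sketch only: illustrative dense-to-sparse gate, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseToSparseGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int,
                 k_final: int = 2, total_steps: int = 10_000):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.num_experts = num_experts
        self.k_final = k_final          # number of experts kept once fully sparse
        self.total_steps = total_steps  # length of the dense-to-sparse schedule

    def forward(self, x: torch.Tensor, step: int):
        # x: [num_tokens, d_model] -> routing logits: [num_tokens, num_experts]
        logits = self.router(x)

        # Anneal from fully dense (all experts, soft weights) to sparse
        # (k_final experts) as training progresses. The schedule is a
        # hypothetical linear one, chosen for illustration.
        progress = min(step / self.total_steps, 1.0)
        k = max(self.k_final,
                round(self.num_experts - progress * (self.num_experts - self.k_final)))
        temperature = max(1.0 - progress, 0.1)  # sharpen the softmax over time

        probs = F.softmax(logits / temperature, dim=-1)

        # Keep only the top-k experts per token and renormalize their weights.
        topk_vals, topk_idx = probs.topk(k, dim=-1)
        weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        return topk_idx, weights  # expert assignments and combine weights


# Usage: early in training the gate is dense, later it is sparse.
# tokens = torch.randn(8, 512); gate = DenseToSparseGate(512, 16)
# idx, w = gate(tokens, step=0)       # k == 16 (all experts)
# idx, w = gate(tokens, step=10_000)  # k == 2  (sparse routing)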

