Mixture-of-Experts with Expert Choice Routing

02/18/2022
by Yanqi Zhou, et al.

Sparsely-activated Mixture-of-Experts (MoE) models allow the number of parameters to increase greatly while keeping the amount of computation for a given token or sample unchanged. However, a poor expert routing strategy (e.g., one resulting in load imbalance) can cause certain experts to be under-trained, leaving them under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts with an expert choice method: instead of letting tokens select the top-k experts, we let experts select the top-k tokens. As a result, each token can be routed to a variable number of experts, and each expert has a fixed bucket size. We systematically study pre-training speedups using the same computational resources as the Switch Transformer top-1 and GShard top-2 gating methods from prior work, and find that our method improves training convergence time by more than 2x. At the same computational cost, our method achieves higher performance when fine-tuned on 11 selected tasks from the GLUE and SuperGLUE benchmarks. At a smaller activation cost, our method outperforms the T5 dense model on 7 of the 11 tasks.
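
To make the routing rule concrete, below is a minimal NumPy sketch of expert-choice routing as described in the abstract: each expert selects its top-k tokens from a token-to-expert affinity matrix, with k determined by a capacity factor. The function name, the softmax normalization over the expert dimension, and the default capacity_factor value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def expert_choice_routing(x, w_gate, capacity_factor=2.0):
    """Minimal sketch of expert-choice routing (illustrative, not the paper's code).

    x:      [n_tokens, d_model] token representations
    w_gate: [d_model, n_experts] routing weights
    Each expert selects its top-k tokens, with
    k = n_tokens * capacity_factor / n_experts.
    """
    n_tokens, _ = x.shape
    n_experts = w_gate.shape[1]
    k = int(n_tokens * capacity_factor / n_experts)

    # Token-to-expert affinity scores, normalized over the expert dimension.
    logits = x @ w_gate                                   # [n_tokens, n_experts]
    scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)          # softmax over experts

    # Each expert (row of scores.T) picks its k highest-scoring tokens.
    # top_idx[e] holds the indices of the tokens routed to expert e;
    # gates[e] holds the corresponding routing weights.
    top_idx = np.argsort(-scores.T, axis=-1)[:, :k]            # [n_experts, k]
    gates = np.take_along_axis(scores.T, top_idx, axis=-1)     # [n_experts, k]
    return top_idx, gates
```

Because k is fixed per expert, expert load is balanced by construction, while a given token may appear in the top-k lists of zero, one, or several experts, matching the variable per-token expert count described above.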


Related research

On the Representation Collapse of Sparse Mixture of Experts (04/20/2022)
Sparse mixture of experts provides larger model capacity while requiring...

SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing (12/10/2022)
The mixture of Expert (MoE) parallelism is a recent advancement that sca...

From Sparse to Soft Mixtures of Experts (08/02/2023)
Sparse mixture of expert architectures (MoEs) scale model capacity witho...

Exploring Sparse Expert Models and Beyond (05/31/2021)
Mixture-of-Experts (MoE) models can achieve promising results with outra...

On the Adversarial Robustness of Mixture of Experts (10/19/2022)
Adversarial robustness is a key desirable property of neural networks. I...

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity (05/03/2023)
Mixture-of-experts (MoE) models that employ sparse activation have demon...

Towards More Effective and Economic Sparsely-Activated Model (10/14/2021)
The sparsely-activated models have achieved great success in natural lan...