From Sparse to Soft Mixtures of Experts

08/02/2023
by Joan Puigcerver, et al.

Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms standard Transformers (ViTs) and popular MoE variants (Tokens Choice and Experts Choice). For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while its inference time cost grows by only 2%, and it performs substantially better.
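The soft-assignment step described above can be summarized in a few lines. Below is a minimal sketch, assuming a single Soft MoE layer with m input tokens of dimension d, n experts, and p slots per expert; the names soft_moe_layer, phi, expert_fns, and slots_per_expert are illustrative choices for this sketch, not identifiers from the paper's reference implementation.

```python
import jax.numpy as jnp
from jax.nn import softmax

def soft_moe_layer(x, phi, expert_fns, slots_per_expert):
    """Sketch of a Soft MoE layer (illustrative, not the paper's reference code).

    x:          (m, d) input tokens.
    phi:        (d, n * slots_per_expert) learnable per-slot parameters.
    expert_fns: list of n functions, each mapping (slots_per_expert, d) -> (slots_per_expert, d).
    """
    logits = jnp.einsum("md,ds->ms", x, phi)            # token-to-slot affinities, shape (m, n*p)
    dispatch = softmax(logits, axis=0)                  # per slot: soft weights over all tokens
    combine = softmax(logits, axis=1)                   # per token: soft weights over all slots
    slot_inputs = jnp.einsum("ms,md->sd", dispatch, x)  # each slot is a weighted mix of all tokens
    # Each expert processes only its own block of slots.
    outs = []
    for i, f in enumerate(expert_fns):
        block = slot_inputs[i * slots_per_expert:(i + 1) * slots_per_expert]
        outs.append(f(block))
    slot_outputs = jnp.concatenate(outs, axis=0)        # (n*p, d)
    # Each output token is a weighted mix of all slot outputs.
    return jnp.einsum("ms,sd->md", combine, slot_outputs)
```

Because every expert sees exactly slots_per_expert combined inputs, the number of expert calls is set by the number of slots rather than by the sequence length, which is where the inference savings quoted above come from; and since both softmaxes are dense, the layer is fully differentiable, with no discrete routing decisions to drop tokens or destabilize training.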



Related research

02/18/2022 · Mixture-of-Experts with Expert Choice Routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of p...

05/28/2022 · Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
Sparsely activated transformers, such as Mixture of Experts (MoE), have ...

04/22/2022 · Balancing Expert Utilization in Mixture-of-Experts Layers Embedded in CNNs
This work addresses the problem of unbalanced expert utilization in spar...

03/30/2021 · BASE Layers: Simplifying Training of Large, Sparse Models
We introduce a new balanced assignment of experts (BASE) layer for large...

06/01/2022 · Task-Specific Expert Pruning for Sparse Mixture-of-Experts
The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pr...

08/11/2023 · Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Structural re-parameterization is a general training scheme for Convolut...

04/10/2019 · Soft Conditional Computation
Conditional computation aims to increase the size and accuracy of a netw...
