On the Representation Collapse of Sparse Mixture of Experts

04/20/2022
by Zewen Chi, et al.

Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs a routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
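To make the idea concrete, below is a minimal PyTorch sketch of routing scores computed on a low-dimensional hypersphere: token representations are projected to a small routing dimension, both token and expert embeddings are L2-normalized (placing them on the unit sphere), and the cosine similarities are rescaled by a learnable temperature before the softmax. This is an illustrative sketch rather than the authors' released implementation; the class name LowDimRouter, the routing dimension, and the temperature parameterization are assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowDimRouter(nn.Module):
    """Sketch of hypersphere routing: cosine scores in a low-dimensional space."""

    def __init__(self, hidden_dim: int, num_experts: int, routing_dim: int = 8):
        super().__init__()
        # Project hidden states down to a low-dimensional routing space.
        self.proj = nn.Linear(hidden_dim, routing_dim, bias=False)
        # One learnable embedding per expert in the same routing space.
        self.expert_embed = nn.Parameter(torch.randn(num_experts, routing_dim))
        # Learnable temperature that rescales the cosine similarities (assumed form).
        self.log_temperature = nn.Parameter(torch.zeros(()))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim)
        tokens = F.normalize(self.proj(hidden_states), dim=-1)   # unit hypersphere
        experts = F.normalize(self.expert_embed, dim=-1)         # unit hypersphere
        # Cosine similarity between every token and every expert, temperature-scaled.
        logits = tokens @ experts.t() / self.log_temperature.exp()
        return F.softmax(logits, dim=-1)                         # routing scores


# Toy usage: route 4 tokens with hidden size 32 to 8 experts.
router = LowDimRouter(hidden_dim=32, num_experts=8, routing_dim=8)
scores = router(torch.randn(4, 32))
top1_expert = scores.argmax(dim=-1)  # each token's best-matched expert
```

Because the scores depend only on directions on a low-dimensional sphere rather than on raw hidden-state magnitudes, the router places weaker pressure on tokens to collapse toward expert centroids in the full representation space.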


Related research

02/18/2022
Mixture-of-Experts with Expert Choice Routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of p...

04/18/2022
StableMoE: Stable Routing Strategy for Mixture of Experts
The Mixture-of-Experts (MoE) technique can scale up the model size of Tr...

06/01/2022
Task-Specific Expert Pruning for Sparse Mixture-of-Experts
The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pr...

09/26/2022
Diversified Dynamic Routing for Vision Tasks
Deep learning models for vision tasks are trained on large datasets unde...

05/31/2021
Exploring Sparse Expert Models and Beyond
Mixture-of-Experts (MoE) models can achieve promising results with outra...

06/07/2023
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
In deep learning, mixture-of-experts (MoE) activates one or few experts ...

11/29/2022
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) t...
