Towards More Effective and Economic Sparsely-Activated Model

10/14/2021
by Hao Jiang, et al.

Sparsely-activated models have achieved great success in natural language processing through large-scale parameters and relatively low computational cost, and have gradually become a feasible technique for training and deploying extremely large models. Due to the cost of communication, activating multiple experts is hardly affordable during training and inference. Therefore, previous work usually activates just one expert at a time to avoid additional communication cost. Such a routing mechanism limits the upper bound of model performance. In this paper, we first investigate the phenomenon that increasing the number of activated experts can boost model performance under a higher sparse ratio. To increase the number of activated experts without increasing the computational cost, we propose SAM (Switch and Mixture) routing, an efficient hierarchical routing mechanism that activates multiple experts on the same device (GPU). Our method sheds light on the training of extremely large sparse models, and experiments show that our models achieve significant performance gains with greatly improved efficiency.
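To make the hierarchical idea concrete, below is a minimal PyTorch sketch of a two-level, switch-then-mixture routing layer: a top-level switch gate sends each token to a single expert group (standing in for one device), and a local mixture gate combines the top-k experts inside that group, so activating several experts adds no cross-device communication. All class and parameter names (SAMRouting, n_groups, experts_per_group, k) and the gating details are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAMRouting(nn.Module):
    """Two-level (switch-then-mixture) routing sketch.

    Level 1 ("switch"): each token is dispatched to exactly one expert
    group, so cross-device traffic stays at single-expert cost.
    Level 2 ("mixture"): the top-k experts inside the chosen group are
    activated and their outputs combined, needing no extra communication.
    Names and gating details are assumptions, not the paper's code.
    """

    def __init__(self, d_model, n_groups, experts_per_group, k=2):
        super().__init__()
        self.n_groups = n_groups
        self.experts_per_group = experts_per_group
        self.k = k
        self.group_gate = nn.Linear(d_model, n_groups)  # level-1 switch gate
        self.local_gate = nn.Linear(d_model, n_groups * experts_per_group)
        self.experts = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.ReLU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(experts_per_group)
            ])
            for _ in range(n_groups)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        # Level 1: hard routing to a single group per token. In practice the
        # gate probabilities would also feed a load-balancing loss.
        group_idx = self.group_gate(x).argmax(dim=-1)  # (tokens,)
        local_logits = self.local_gate(x).view(
            x.size(0), self.n_groups, self.experts_per_group)

        out = torch.zeros_like(x)
        for g in range(self.n_groups):
            mask = group_idx == g
            if not mask.any():
                continue
            xg = x[mask]
            # Level 2: soft mixture over the top-k experts within the group.
            probs = F.softmax(local_logits[mask, g], dim=-1)
            topv, topi = probs.topk(self.k, dim=-1)
            topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize weights
            yg = torch.zeros_like(xg)
            for e in range(self.experts_per_group):
                hit = (topi == e).any(dim=-1)  # tokens that picked expert e
                if hit.any():
                    w = (topv * (topi == e).float()).sum(dim=-1)[hit]
                    yg[hit] = yg[hit] + w.unsqueeze(-1) * self.experts[g][e](xg[hit])
            out[mask] = yg
        return out


# Illustrative usage: 8 tokens of width 64, 4 groups of 4 experts, 2 active.
# layer = SAMRouting(d_model=64, n_groups=4, experts_per_group=4, k=2)
# y = layer(torch.randn(8, 64))  # output has the same shape as the input
```

The design point this sketch illustrates is that the level-1 switch keeps the expensive all-to-all dispatch at one destination per token, while the level-2 mixture recovers the quality benefit of multiple activated experts entirely within that destination.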


Related research:

05/31/2021 - Exploring Sparse Expert Models and Beyond
Mixture-of-Experts (MoE) models can achieve promising results with outra...

11/23/2021 - SpeechMoE2: Mixture-of-Experts Model with Improved Routing
Mixture-of-experts based acoustic models with dynamic routing mechanisms...

02/18/2022 - Mixture-of-Experts with Expert Choice Routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of p...

10/08/2021 - Taming Sparsely Activated Transformer with Stochastic Experts
Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can ...

01/11/2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inp...

04/10/2019 - Soft Conditional Computation
Conditional computation aims to increase the size and accuracy of a netw...

10/19/2022 - On the Adversarial Robustness of Mixture of Experts
Adversarial robustness is a key desirable property of neural networks. I...
