Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

05/28/2022
by Rui Liu, et al.

Sparsely activated transformers, such as Mixture of Experts (MoE) models, have received great interest because they can scale up model size dramatically without a significant increase in computational cost. To achieve this, MoE models replace the feed-forward sub-layer in transformers with a Mixture-of-Experts sub-layer and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models is to distribute experts and tokens across different machines, this routing strategy often incurs a large cross-machine communication cost, because a token and its assigned experts are likely to reside on different machines. In this paper, we propose Gating Dropout, which allows tokens to ignore the gating network and stay on their local machines, thus reducing cross-machine communication. Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance. We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model with faster wall-clock convergence and better BLEU scores across a variety of model sizes and datasets.
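To make the mechanism concrete, below is a minimal, single-process sketch of the idea under a simplified top-1 MoE layer. The names (GatingDropoutMoE, gate_drop_prob, local_expert_idx) are illustrative assumptions rather than the paper's released code; in a real sharded setup each expert lives on a different worker, so the "skip the gate" branch is what avoids the cross-machine dispatch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingDropoutMoE(nn.Module):
    def __init__(self, d_model, num_experts, local_expert_idx=0, gate_drop_prob=0.3):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.local_expert_idx = local_expert_idx  # expert co-located with this worker (assumed)
        self.gate_drop_prob = gate_drop_prob      # probability of skipping the gate (assumed value)

    def forward(self, x):  # x: (tokens, d_model)
        if self.training and torch.rand(()).item() < self.gate_drop_prob:
            # Gating Dropout branch: ignore the gating network and keep every
            # token on the local expert, so no cross-machine all-to-all is needed.
            return self.experts[self.local_expert_idx](x)

        # Normal top-1 routing; in a sharded implementation this dispatch is the
        # all-to-all communication step.
        probs = F.softmax(self.gate(x), dim=-1)
        top1 = probs.argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask]) * probs[mask, e].unsqueeze(-1)
        return out

moe = GatingDropoutMoE(d_model=16, num_experts=4)
tokens = torch.randn(8, 16)
print(moe(tokens).shape)  # torch.Size([8, 16])

In a distributed implementation, the branch that keeps tokens on the local expert skips both the dispatch and combine all-to-all exchanges of a standard MoE layer; the drop probability trades communication savings and regularization strength against routing fidelity.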

Related research

StableMoE: Stable Routing Strategy for Mixture of Experts (04/18/2022)
The Mixture-of-Experts (MoE) technique can scale up the model size of Tr...

Dropout Regularization in Hierarchical Mixture of Experts (12/25/2018)
Dropout is a very effective method in preventing overfitting and has bec...

From Sparse to Soft Mixtures of Experts (08/02/2023)
Sparse mixture of expert architectures (MoEs) scale model capacity witho...

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers (03/02/2023)
Despite their remarkable achievement, gigantic transformers encounter si...

Experts Weights Averaging: A New General Training Scheme for Vision Transformers (08/11/2023)
Structural re-parameterization is a general training scheme for Convolut...

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity (05/03/2023)
Mixture-of-experts (MoE) models that employ sparse activation have demon...

SliceOut: Training Transformers and CNNs faster while using less memory (07/21/2020)
We demonstrate 10-40 EfficientNets, and Transformer models, with minimal...
