Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

05/03/2023
by Haoran Xu, et al.

Mixture-of-experts (MoE) models that employ sparse activation have proven effective at significantly increasing the number of parameters while keeping the computational cost per token low. However, recent studies have established that MoE models are inherently parameter-inefficient: the performance improvement diminishes as the number of experts grows. We hypothesize that this parameter inefficiency results from all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks; for example, in a multilingual setting, languages with different resource levels might require different capacities. In light of this, we propose the Stratified Mixture of Experts (SMoE) model, which features a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on two multilingual machine translation benchmarks, where it outperforms multiple state-of-the-art MoE models. On a diverse 15-language dataset, SMoE improves translation quality over vanilla MoE by +0.93 BLEU points on average. Additionally, SMoE is parameter-efficient, matching vanilla MoE performance with roughly 50% fewer parameters.
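To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a sparsely activated feed-forward layer whose experts are stratified by capacity, i.e., they use different hidden sizes, with a learned top-1 router assigning each token to one expert. The module name StratifiedMoEFFN, the expert_hidden_sizes values, and the top-1 gating rule are all illustrative assumptions, since the abstract does not specify the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


class StratifiedMoEFFN(nn.Module):
    """Feed-forward MoE layer with experts of unequal (stratified) capacity."""

    def __init__(self, d_model=512, expert_hidden_sizes=(512, 1024, 2048)):
        super().__init__()
        # One expert per stratum; a larger hidden size means more capacity.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, hidden),
                nn.ReLU(),
                nn.Linear(hidden, d_model),
            )
            for hidden in expert_hidden_sizes
        ])
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, len(expert_hidden_sizes))

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> flatten into a stream of tokens.
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.router(tokens), dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)  # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            selected = top_idx == i
            if selected.any():
                # Scale by the gate probability so the router receives gradients.
                out[selected] = expert(tokens[selected]) * top_prob[selected].unsqueeze(-1)
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = StratifiedMoEFFN()
    out = layer(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])

This sketch only illustrates how experts of unequal capacity can coexist behind a single router; the actual stratified structure and capacity-assignment mechanism of SMoE may differ.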

research · 10/08/2021
Taming Sparsely Activated Transformer with Stochastic Experts
Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can ...

research · 05/23/2023
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules
Incorporating language-specific (LS) modules is a proven method to boost...

research · 02/18/2022
Mixture-of-Experts with Expert Choice Routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of p...

research · 12/19/2022
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
Compared to conventional bilingual translation systems, massively multil...

research · 01/30/2023
Alternating Updates for Efficient Transformers
It is well established that increasing scale in deep transformer network...

research · 05/28/2022
Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
Sparsely activated transformers, such as Mixture of Experts (MoE), have ...

research · 05/13/2020
A Mixture of h-1 Heads is Better than h Heads
Multi-head attentive neural architectures have achieved state-of-the-art...
