Efficient Language Modeling with Sparse all-MLP

03/14/2022
by Ping Yu, et al.

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work such as gMLP has shown that all-MLPs can match Transformers in language modeling but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both the feature and the input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to a 2× improvement in training efficiency compared with Transformer-based MoEs (GShard, Switch Transformer, Base Layers, and HASH Layers) as well as with dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks and find that it surpasses Transformer-based MoEs and dense Transformers.
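To make the idea of a sparsely activated MLP block concrete, below is a minimal PyTorch sketch of token-level MoE routing with top-1 (Switch-style) gating. It is illustrative only: the module names, sizes, and the exact gating rule are assumptions made for this sketch, and it does not reproduce the paper's two routing strategies or its MoE along the feature (hidden) dimension.

```python
# Minimal sketch of a sparsely activated MLP block with token-level
# mixture-of-experts routing (top-1 gating). Not the authors' code;
# names, dimensions, and the routing rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """One feed-forward expert: d_model -> d_hidden -> d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class TokenMoEMLP(nn.Module):
    """Routes each token to a single expert chosen by a learned gate."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [ExpertMLP(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        b, s, d = x.shape
        tokens = x.reshape(b * s, d)
        gate_probs = F.softmax(self.gate(tokens), dim=-1)  # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)         # top-1 routing
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so the router receives gradients.
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape(b, s, d)


if __name__ == "__main__":
    layer = TokenMoEMLP(d_model=64, d_hidden=256, num_experts=4)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

Because each token is processed by only one expert, adding experts grows parameter count (capacity) while the per-token compute stays roughly constant, which is the property the abstract refers to.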

Related research

Hash Layers For Large Sparse Models (06/08/2021)
We investigate the training of sparse layers that use different paramete...

Alternating Updates for Efficient Transformers (01/30/2023)
It is well established that increasing scale in deep transformer network...

Residual Mixture of Experts (04/20/2022)
Mixture of Experts (MoE) is able to scale up vision transformers effecti...

RealFormer: Transformer Likes Residual Attention (12/21/2020)
Transformer is the backbone of modern NLP models. In this paper, we prop...

Transformers are Universal Predictors (07/15/2023)
We find limits to the Transformer architecture for language modeling and...

On the Limitations of Sociodemographic Adaptation with Transformers (08/01/2022)
Sociodemographic factors (e.g., gender or age) shape our language. Previ...

Efficient Large Scale Language Modeling with Mixtures of Experts (12/20/2021)
Mixture of Experts layers (MoEs) enable efficient scaling of language mo...
