BASE Layers: Simplifying Training of Large, Sparse Models

03/30/2021
by Mike Lewis, et al.

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released at https://github.com/pytorch/fairseq/
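To make the routing idea concrete, below is a minimal sketch of balanced token-to-expert allocation posed as a linear assignment problem. This is not the released fairseq implementation (linked above); it assumes per-token expert affinity scores have already been computed, and it uses SciPy's Hungarian solver by replicating each expert's column so that every expert receives exactly the same number of tokens. The function name balanced_assignment and the toy sizes are illustrative only.

```python
# Minimal sketch (not the fairseq code): balanced token-to-expert assignment
# solved as a linear assignment problem, so each expert gets an equal share.
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores: np.ndarray) -> np.ndarray:
    """scores: [num_tokens, num_experts] token-expert affinity matrix.
    Returns the expert index for each token, with equal load per expert."""
    num_tokens, num_experts = scores.shape
    assert num_tokens % num_experts == 0, "tokens must divide evenly across experts"
    tokens_per_expert = num_tokens // num_experts
    # Replicate each expert column so it can accept tokens_per_expert tokens,
    # then negate scores because linear_sum_assignment minimizes cost.
    cost = -np.repeat(scores, tokens_per_expert, axis=1)
    row_ind, col_ind = linear_sum_assignment(cost)
    # Map replicated column indices back to the original expert indices.
    return col_ind // tokens_per_expert

# Example: route 8 tokens to 4 experts, exactly 2 tokens per expert.
rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 4))
assignment = balanced_assignment(scores)
print(assignment, np.bincount(assignment, minlength=4))  # counts are all 2
```

Because the assignment is constrained to be perfectly balanced, every expert processes the same number of tokens per batch, which is what guarantees even compute load without routing heuristics or an auxiliary balancing loss.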

Related research

04/18/2022 · StableMoE: Stable Routing Strategy for Mixture of Experts
The Mixture-of-Experts (MoE) technique can scale up the model size of Tr...

06/08/2021 · Hash Layers For Large Sparse Models
We investigate the training of sparse layers that use different paramete...

08/02/2023 · From Sparse to Soft Mixtures of Experts
Sparse mixture of expert architectures (MoEs) scale model capacity witho...

09/24/2021 · Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts
Training large-scale mixture of experts models efficiently on modern har...

07/19/2022 · MoEC: Mixture of Expert Clusters
Sparsely Mixture of Experts (MoE) has received great interest due to its...

06/07/2023 · ModuleFormer: Learning Modular Large Language Models From Uncurated Data
Large Language Models (LLMs) have achieved remarkable results. But exist...

06/06/2023 · Soft Merging of Experts with Adaptive Routing
Sparsely activated neural networks with conditional computation learn to...
