Efficient Large Scale Language Modeling with Mixtures of Experts

12/20/2021
by Mikel Artetxe, et al.

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ∼4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
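
To make the idea of conditional computation concrete, the sketch below shows one common way an MoE feed-forward layer can be wired: a learned router scores every token, and only the few experts selected for that token are actually evaluated. This is a minimal illustrative example in plain PyTorch under an assumed top-2 routing scheme; the class and parameter names (MoELayer, d_model, num_experts, top_k) are ours, and it is not the authors' released implementation.

```python
# Minimal sketch of a Mixture-of-Experts feed-forward layer with top-2 gating.
# Illustrative only: an assumed routing scheme, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        # Router that scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary position-wise feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten tokens for routing.
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        gate_logits = self.router(tokens)                        # (tokens, experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)                     # normalize the selected gates

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                mask = indices[:, k] == e
                if mask.any():
                    # Conditional computation: expert e only processes the tokens routed to it.
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](tokens[mask])
        return out.reshape(batch, seq, d_model)


# Example: route a small batch of token embeddings through 8 experts.
layer = MoELayer(d_model=64, d_hidden=256, num_experts=8)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

Because each token touches only two of the eight experts, the per-token compute stays close to that of a single dense feed-forward block even as the total parameter count grows with the number of experts, which is the scaling property the paper studies.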

Related research

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (12/13/2021)
Scaling language models with more data, compute and parameters has drive...

Likelihood-Based Diffusion Language Models (05/30/2023)
Despite a growing interest in diffusion-based language models, existing ...

HyperTuning: Toward Adapting Large Language Models without Back-propagation (11/22/2022)
Fine-tuning large language models for different tasks can be costly and ...

DEMix Layers: Disentangling Domains for Modular Language Modeling (08/11/2021)
We introduce a new domain expert mixture (DEMix) layer that enables cond...

Efficient Language Modeling with Sparse all-MLP (03/14/2022)
All-MLP architectures have attracted increasing interest as an alternati...

GPT-D: Inducing Dementia-related Linguistic Anomalies by Deliberate Degradation of Artificial Neural Language Models (03/25/2022)
Deep learning (DL) techniques involving fine-tuning large numbers of mod...

Do Generative Large Language Models need billions of parameters? (09/12/2023)
This paper presents novel systems and methodologies for the development ...
