Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

08/05/2022
by Margaret Li, et al.

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
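The following is a minimal sketch of the three BTM operations described above (branching a new expert from existing ones, training it on a new domain, and merging via parameter averaging or output ensembling), written against a toy PyTorch language model. The class and function names (TinyLM, branch_train, average_parameters, ensemble_logits) and the training loop details are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative BTM sketch; assumes PyTorch. Not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy next-token LM used only to illustrate the BTM operations."""
    def __init__(self, vocab_size=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq)
        return self.proj(self.embed(tokens))   # logits: (batch, seq, vocab)

def average_parameters(models):
    """Merge: collapse a set of ELMs into a single LM by parameter averaging."""
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return merged

def ensemble_logits(models, tokens):
    """Ensemble: average the experts' output distributions at inference time."""
    probs = torch.stack([F.softmax(m(tokens), dim=-1) for m in models]).mean(dim=0)
    return probs.log()

def branch_train(seed_models, domain_batches, steps=100, lr=1e-3):
    """Branch a new ELM from (a mixture of) existing ELMs, then train it
    independently on batches of token IDs from the new domain."""
    elm = average_parameters(seed_models) if len(seed_models) > 1 else copy.deepcopy(seed_models[0])
    opt = torch.optim.Adam(elm.parameters(), lr=lr)
    for _, tokens in zip(range(steps), domain_batches):
        logits = elm(tokens[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return elm
```

As in the abstract, the two merge modes trade off differently: ensembling keeps every expert around at inference and generalizes to new domains, while parameter averaging collapses the set back to one model for cheaper inference.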

