DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

05/17/2023
by Sang Michael Xie, et al.

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% over a baseline model trained using The Pile's default domain weights, and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
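The Group DRO step at the core of this procedure can be read as an exponentiated-gradient update on the domain weights, driven by each domain's excess loss of the proxy model relative to a reference model, followed by renormalization and smoothing toward the uniform distribution. The sketch below is a minimal, simplified rendering of that update in Python/NumPy; the function name and hyperparameters (eta, smoothing) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def update_domain_weights(alpha, excess_loss, eta=1.0, smoothing=1e-3):
    """One Group-DRO-style update of domain weights (simplified sketch).

    alpha:       current weights over the k domains, shape (k,), summing to 1
    excess_loss: per-domain loss of the proxy model minus the loss of a
                 pretrained reference model, shape (k,)
    """
    # Exponentiated-gradient ascent: upweight domains where the proxy model
    # lags furthest behind the reference model.
    alpha = alpha * np.exp(eta * np.clip(excess_loss, 0.0, None))
    alpha = alpha / alpha.sum()
    # Smooth with the uniform distribution so no domain's weight collapses to zero.
    k = alpha.shape[0]
    return (1.0 - smoothing) * alpha + smoothing / k

# Example: three domains, the second has the largest excess loss and is upweighted.
alpha = np.ones(3) / 3
alpha = update_domain_weights(alpha, np.array([0.1, 0.9, 0.3]))
print(alpha)
```

In this reading, the weights produced over the course of proxy-model training are aggregated into the final mixture proportions used to resample data for the larger model.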
