Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

04/06/2023
by Nolan Dey, et al.

We study recent research advances that improve large language models through efficient pre-training and scaling, together with open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following the DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly available models to show that all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings, including how Maximal Update Parameterization (μP) can further improve large model scaling, yielding better accuracy and more predictable hyperparameters at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
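As a concrete illustration, the sketch below loads one of the released checkpoints with the Hugging Face transformers library and notes the Chinchilla-style token budget the abstract refers to. It is a minimal sketch, not code from the paper: the checkpoint name cerebras/Cerebras-GPT-111M, the prompt, and the generation settings are assumptions based on the Hugging Face page linked above.

```python
# Minimal sketch, assuming the `transformers` library (with PyTorch) and the
# `cerebras/Cerebras-GPT-111M` checkpoint listed at https://huggingface.co/cerebras.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # the family spans 111M to 13B parameters

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Chinchilla-style compute-optimal pre-training uses roughly 20 tokens per
# parameter, so a 111M-parameter model sees on the order of 2.2B training tokens.
prompt = "Compute-optimal language models are trained by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```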


Related research

PAGnol: An Extra-Large French Generative Model (10/16/2021)
Access to large pre-trained models of varied architectures, in many diff...

Benchmarking down-scaled (not so large) pre-trained language models (05/11/2021)
Large Transformer-based language models are pre-trained on corpora of va...

Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales (04/14/2023)
As language models scale up, it becomes increasingly expensive to verify...

Scaling Data-Constrained Language Models (05/25/2023)
The current trend of scaling language models involves increasing both pa...

Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (06/24/2023)
Current trends to pre-train capable Large Language Models (LLMs) mostly ...

Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale (05/26/2023)
In recent years, language models have drastically grown in size, and the...

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models (07/12/2023)
The computation necessary for training Transformer-based language models...
