
What Language Model to Train if You Have One Million GPU Hours?

10/27/2022
by   Teven Le Scao, et al.

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM–the Big Science Large Open-science Open-access Multilingual language model–our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience .



1 Introduction

Figure 1: Smooth scaling of language modeling loss as compute budget and model size increase. We observe a power-law coefficient in line with kaplan2020scaling. We use this fit to estimate the optimal size and number of tokens to train on for the final model, given the available budget.

Recent years have seen the advent of large language models characterized by emergent capabilities (e.g., zero-shot generalization) arising from sheer scale alone radford2019language; brown2020gpt3. Scaling LLMs results in a predictable increase in performance: simple scaling laws connect the number of parameters, pretraining dataset size, and compute budget kaplan2020scaling; ganguli2022predictability; hoffmann2022training, providing a clear path towards more capable models. This paradigm shift has been fueled by the wide adoption of the Transformer vaswani2017attention, providing a scalable basis for practitioners to build upon.

In this paper, we design an architecture and training setup for a multilingual 100B+ parameters model (BLOOM, bigscience_workshop_2022), seeking to best use a fixed 1,000,000 A100-hours budget. Because of the costs involved with training large language models, we cannot exhaustively explore the landscape of possible models. Instead, we position ourselves as practitioners exploring "off-the-shelf" solutions. We thus test promising additions to the Transformer to attempt to reproduce their findings in a controlled, large-scale setting.

Although our main goal was to prepare the architecture and training setup of BLOOM, our findings are also valuable for practitioners building models in the 1-10B range, as they equally improve the performance of such smaller models. At variance with major works on large language models, we also make a significant effort towards reproducibility and openness: all of our pretrained models, code, and notes from our weekly meetings are made available. See Appendix A for the relevant links.

Contributions. We first study the impact of pretraining corpora, positional embeddings, activation functions, and embedding norm on zero-shot generalization. We base our study on the popular GPT-2 architecture radford2019language, with experiments at the 1.3B parameters scale. We then consider the impact of massive multilinguality, showing language-specific scaling laws in a multilingual setting for the first time. Finally, we describe our approach to drafting an architecture for the final 176B parameters BLOOM model.

2 Methods

Model                | Parameters | Dataset  | 112B tokens | 250B tokens | 300B tokens
OpenAI — Curie       | 6.7B       | —        |             |             | 49.28
OpenAI — Babbage     | 1.3B       | —        |             |             | 45.30
EleutherAI — GPT-Neo | 1.3B       | The Pile |             |             | 42.94
Ours                 | 13B        | OSCAR v1 |             |             | 47.09
Ours                 | 1.3B       | The Pile | 42.79       | 43.12       | 43.46
Ours                 | 1.3B       | C4       | 42.77       |             |
Ours                 | 1.3B       | OSCAR v1 | 41.72       |             |
Table 1: Pretraining datasets with diverse cross-domain high-quality data improve zero-shot generalization. Average accuracy on the EAI harness (higher is better) using different pretraining corpora, compared with baseline models. Bold is the best 1.3B model for the amount of tokens seen, underline is the best overall.

We first justify our choice to base our model on the popular recipe of combining a decoder-only model with an autoregressive language modeling objective, and introduce our experimental setup. We then discuss our evaluation benchmarks, and motivate our choice of zero-shot generalization as our key metric. Finally, we introduce the baselines we compare to throughout the paper.

2.1 Architecture and Pretraining Objective

In this paper, we base all models on a decoder-only Transformer pretrained with an autoregressive language modeling objective. This is a popular choice for large language models brown2020gpt3; rae2021scaling; thoppilan2022lamda, possibly because it lends itself to zero-shot application to many downstream tasks radford2019language. Alternatives include encoder-decoder models trained with a span-corruption objective (e.g., T5 raffel2019t5), as well as non-causal decoder models with visibility over a prefix (so-called Prefix LMs, liu2018generating; dong2019unified).

Our decision is motivated by the findings of wang2022language, which showed that decoder-only models combined with an autoregressive language modeling objective provide the best zero-shot generalization abilities immediately after pretraining. Although multitask finetuning Sanh2021MultitaskPT; wei2021finetuned will instead favor an encoder-decoder with span corruption for best zero-shot generalization, wang2022language found a compromise between these two practices. Following autoregressive pretraining, decoder-only models can be efficiently adapted into non-causal decoders, simply by extending pretraining with span corruption. This adaptation produces a second model, which can provide excellent zero-shot generalization after multitask finetuning. Accordingly, we follow their recommendation, and train an autoregressive decoder-only model first which we will later consider adapting and finetuning.

2.2 Experimental Setup

We follow the architecture of GPT-2 (radford2019language) and the hyperparameters of GPT-3 (brown2020gpt3). For the learning rate, we use a linear warm-up over 375M tokens up to a maximum value, followed by cosine decay to a minimum value. We use a 1M-token batch size, with a linear ramp-up over the first 4B tokens, and a sequence length of 2,048. We use the Adam optimizer kingma2014adam with the β1, β2, and ε values of brown2020gpt3, weight decay 0.1, and gradient clipping to 1.0. We also tie the word embedding and softmax matrices (tying). Unless noted otherwise, we conduct our experiments with 1.3B parameters models, pretraining on 112B tokens.
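For illustration, a minimal token-based sketch of this schedule follows. The peak and minimum learning-rate values are placeholders (the exact values are not reproduced in this text); the warm-up and ramp-up lengths follow the description above.

```python
import math

def lr_at(tokens_seen, peak_lr=2e-4, min_lr=2e-5,
          warmup_tokens=375e6, total_tokens=112e9):
    """Linear warm-up over 375M tokens, then cosine decay to a minimum value.
    peak_lr and min_lr are illustrative placeholders, not the values used in the paper."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = min((tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def tokens_per_batch(tokens_seen, target=1_048_576, rampup_tokens=4e9):
    """Linear batch-size ramp-up over the first 4B tokens, up to ~1M tokens per batch.
    The starting batch size is an unspecified implementation detail, treated as 0 here."""
    return int(target * min(tokens_seen / rampup_tokens, 1.0))
```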

We picked this model size and token count as a compromise between compute cost and the likelihood that our conclusions would transfer to the target 100B+ model. Notably, we needed to be able to reliably measure zero-shot generalization above random chance. We note that training 1.3B parameters models for 112B tokens brings them significantly above the optimality thresholds of kaplan2020scaling and hoffmann2022training.

The main architectural difference with GPT-3 is that all of our layers use full attention, while GPT-3 uses alternating sparse attention layers (sparse). The main value of sparse attention layers is to save compute with long sequence lengths. However, at the 100B+ scale, sparse attention layers provide negligible compute savings, as the vast majority of the compute is spent on the large feed-forward layers. kaplan2020scaling estimated the amount of compute per token to be C_forward ≈ 2N + 2 n_layer n_ctx d_model, where C_forward is the cost of the forward pass of a model with N parameters, n_layer is the number of layers, d_model is the hidden dimension, and n_ctx is the sequence length. This means that if d_model ≫ n_ctx / 12, the second term is negligible, which is the case for our final model, where n_ctx = 2,048 and d_model = 14,336.
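As a quick numeric check of this approximation, using the final model's dimensions from Table 8, the context-dependent term is indeed on the order of 1% of the per-token cost:

```python
# Per-token forward cost terms for the final 176B configuration
# (n_layer = 70, d_model = 14,336, n_ctx = 2,048).
N = 176e9                                        # parameters
n_layer, d_model, n_ctx = 70, 14_336, 2_048

dense_term = 2 * N                               # ~3.5e11 FLOP per token
attention_term = 2 * n_layer * n_ctx * d_model   # ~4.1e9 FLOP per token
print(attention_term / dense_term)               # ~0.012: the second term is ~1% of the total
```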

What is a FLOP exactly?

We report throughput per GPU in FLOPS and total budgets in PF-days (i.e., one PFLOPS sustained for a day). It is important to highlight that FLOPS are never directly measured, but always estimated, with widely different practices across papers. We refer to as model FLOP the estimates based on the formula C ≈ 6ND from kaplan2020scaling, where C is the total compute, N the model size, and D the number of tokens processed. These are the FLOP actually used to train the model, and the ones used for scaling laws. We refer to as hardware FLOP the estimates reported by our codebase, using the formula from narayanan2021efficient. These notably include gradient checkpointing, which trades additional computation for reduced memory needs, and a more thorough accounting of operations.
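A minimal helper showing the model-FLOP convention and its conversion to PF-days; the 1.3B-parameter / 112B-token example corresponds to the ablation setup above.

```python
def model_pf_days(n_params, n_tokens):
    """Model FLOP from the C = 6 * N * D approximation, converted to PF-days."""
    total_flop = 6 * n_params * n_tokens
    pf_day = 1e15 * 86_400          # one PFLOPS sustained for one day
    return total_flop / pf_day

# e.g., a 1.3B model trained on 112B tokens, as in the ablations above
print(model_pf_days(1.3e9, 112e9))  # ~10.1 PF-days
```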

2.3 Evaluation Benchmarks

We measure upstream performance using the language modeling loss on a held-out sample of the pretraining dataset. However, it is not always possible to compare losses across objectives and tokenizers. Moreover, as upstream performance is not always aligned with task performance Tay2021ScaleEI, we must also measure downstream performance explicitly. For this, we can use zero- or few-shot generalization, with or without task-specific finetuning.

Specifically, we choose to measure zero-shot generalization on a diverse set of tasks. Few-shot and zero-shot results are strongly correlated: we found a Pearson correlation coefficient of 0.93 between zero-shot and few-shot performance across model sizes in brown2020gpt3. We do not rely on finetuning as it is not how the main final model is likely to be used, given its size and the challenges associated with finetuning at the 100B+ scale.

We use the popular EleutherAI Language Model Evaluation Harness (EAI harness, eval-harness), evaluating models across 27 diverse tasks that are similar to those used in brown2020gpt3 (see Appendix C for a list of tasks). Overall, the random baseline on our benchmark sits at 33.3%.

2.4 Baselines

We use GPT-Neo gpt-neo, a 1.3B decoder-only autoregressive language model trained on the Pile gao2020pile, and GPT-3 brown2020gpt3, accessed via the OpenAI API. We evaluate two GPT-3 models, Babbage and Curie (these models are now referred to as text-babbage-001 and text-curie-001). Based on gaosize and on how close our computed results are to those reported in the original paper, we assume Babbage is 1.3B and Curie is 6.7B. However, as details of the OpenAI API are kept secret, there is no way to verify that the models are actually the ones described in brown2020gpt3; the number of pretraining tokens reported in Table 1 should thus be taken cautiously.

3 Impact of Pretraining Data

We first study the impact of pretraining data on zero-shot generalization. More diverse pretraining data, ideally curated from a cross-domain collection of high-quality datasets, has been suggested to help with downstream task performance and zero-shot generalization rossettnlg; gao2020pile.

3.1 Corpora

We evaluate three possible corpora, all commonly used to train large language models:

  • OSCAR v1 (ortiz2019oscar), a multilingual, filtered version of Common Crawl (the more recent OSCAR v2 is a better dataset, but it was not yet available when we started this project);

  • C4 (raffel2019t5), specifically its replication by AllenAI, a processed and filtered version of Common Crawl;

  • The Pile (gao2020pile), a diverse pretraining corpus that contains webscrapes from Common Crawl in addition to high-quality data from cross-domain sources such as academic texts and source code.

For each pretraining corpus, we train a 1.3B parameter model for 112B tokens. For the Pile specifically, motivated by good early results at 112B tokens, we train up to 300B tokens, to compare with GPT-3 models and validate against GPT-Neo.

3.2 Results

Evaluation results are outlined in Table 1. We find that training on the Pile produces models that are better at zero-shot generalization, with C4 a close second, and OSCAR significantly behind.

Importantly, this finding transfers to larger scales: as part of engineering test runs, a 13B model was trained on OSCAR for 300B tokens. We found this 13B model to underperform the 6.7B Curie model from the OpenAI API, which we attribute to the low quality of the English data in OSCAR.

We also note that our model trained on The Pile outperforms the 1.3B GPT-Neo trained on the same dataset. Finally, our 1.3B model still underperforms the 1.3B Babbage model from the OpenAI API by 1.6%. The most likely explanation is a difference in pretraining data, but we cannot investigate this further as the GPT-3 training dataset is neither publicly available nor reproducible.

Finding 1. Diverse cross-domain pretraining data combining web crawls with curated high-quality sources improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.

4 Architecture Ablations

We now consider ablation studies to better identify the best positional embedding, activation function, and embedding normalization placement.

4.1 Positional Embeddings

Positional Embedding Average EAI Results
None 41.23
Learned 41.71
Rotary 41.46
ALiBi 43.70
Table 2: ALiBi significantly outperforms other embeddings for zero-shot generalization. All models are trained on the OSCAR dataset for 112 billion tokens.

Background

Originally, both static sinusoidal position embeddings and learned position embeddings were proposed to capture positional information; the latter are popular in large language models brown2020gpt3. su2021roformer proposed rotary embeddings, where the query and key representations inside the self-attention mechanism are modified such that the attention captures relative distances between them. Recently, press2021alibi introduced ALiBi, a method which does not use embeddings at all, but instead directly attenuates the attention scores based on how far apart the queries and keys are.
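For illustration, a minimal sketch of an ALiBi-style bias in the spirit of press2021alibi; production implementations fold this into the attention kernel, and the slope recipe below assumes the number of heads is a power of two.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Distance-based attention bias: each head h gets a fixed slope m_h, and the
    pre-softmax score between query position i and key position j <= i is penalized
    by m_h * (i - j). Sketch only, not the exact training implementation."""
    # Geometric slopes as in the ALiBi paper, assuming n_heads is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j, i.e. how far back the key is from the query
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()
    # shape (n_heads, seq_len, seq_len); add this to the attention logits
    return -slopes[:, None, None] * distance
```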

Results

We compare learned, rotary, and ALiBi position embeddings, and include a baseline without position embeddings. Our results are presented in Table 2. Although learned positional embeddings outperform rotary embeddings, ALiBi yields significantly better results than all alternatives. We also confirm the findings of biderman2021nopos: a baseline with no positional information exhibits competitive performance. While bidirectional models require positional embeddings to determine the location of tokens, we find that autoregressive models can simply leverage the causal attention mask. We also confirm the ability of ALiBi to extrapolate to longer sequences than trained on in Figure 2. Note that the results in Table 2 do not use any extrapolation: ALiBi embeddings are a better choice even without taking their ability to extrapolate into account.

Figure 2: ALiBi embeddings can effectively extrapolate past the sequence length on which the model was trained, while rotary embeddings can not. This is in line with the findings of press2021alibi.
Activation function Average EAI Results
GELU 42.79
SwiGLU 42.95
Table 3: SwiGLU slightly outperforms GELU for zero-shot generalization. Models trained on The Pile for 112 billion tokens.

Finding 2. ALiBi positional embeddings significantly outperform other embeddings for zero-shot generalization.

4.2 Activation Functions

Background.

Large language models still mostly use the GELU activation hendrycks2016gaussian. We evaluate a recently proposed alternative, SwiGLU shazeer2020swiglu, which combines Gated Linear Units dauphin2016glu with the Swish activation function ramachandran2017searching.

SwiGLU uses extra parameters in the feed-forward layers. As suggested in shazeer2020swiglu, we compensate for this by reducing the hidden size of the feed-forward layer.
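A minimal sketch of such a SwiGLU feed-forward block; the 2/3 scaling of the hidden size is the compensation suggested by shazeer2020swiglu, and the exact rounding of the resulting size (e.g. the 5,456 mentioned below) is an implementation choice not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block where the gate uses the Swish/SiLU activation.
    The hidden size is scaled by 2/3 relative to the usual 4 * d_model so that the
    parameter count matches a standard GELU FFN despite the extra gate matrix."""
    def __init__(self, d_model: int):
        super().__init__()
        d_ff = int(2 / 3 * 4 * d_model)   # compensate for the additional gate projection
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```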

Results.

We present our results in Table 3. SwiGLU produces slightly better results than GELU. For our final model, we nevertheless adopted GELU, as we initially observed a lower throughput for SwiGLU. However, further benchmarking identified that this overhead was primarily caused by the change in the hidden size of the feed-forward network: the new size, 5,456, is divisible by neither the warp size of the GPU (Lashgar2013WarpSI) nor the number of streaming multiprocessors, resulting in both tile and wave quantization. We accordingly recommend using SwiGLU for future models.

4.3 Embedding Norm

bitsandbytes suggests that greater training stability can be achieved by adding an extra layer normalization layernorm after the embedding layer. We evaluate the performance impact of this modification in Table 4, and find that it incurs a significant reduction in the performance of the model. However, models above 100 billion parameters are notoriously unstable and require considerable engineering efforts to be kept stable; if this addition provides increased stability when training, it may be valuable.
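The modification under test is small; a sketch (not the exact BLOOM implementation) looks like:

```python
import torch.nn as nn

class EmbeddingWithNorm(nn.Module):
    """Token embedding followed by a LayerNorm, applied before the first
    Transformer block. Illustrative sketch of the stability trick evaluated here."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids):
        return self.norm(self.embed(token_ids))
```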

Embedding Norm Average EAI Results
No 43.46
Yes 42.24
Table 4: Layer normalization after the embedding layer diminishes performance significantly. Models trained on The Pile for 300 billion tokens.

Finding 3. Adding layer normalization after the embedding layer incurs a significant penalty on zero-shot generalization.

Model Size EN ZH ES FR VI AR HI UR Average
XGLM (XGLM) 7.5B 54.5 45 38.2 50.7 47.5 47.5 43.4 42.7 46.19
XGLM (reprod.) 7.5B 53.85 45.21 41.7 49.82 47.35 46.37 43.19 42.3 46.22
XGLM 1.7B 49.68 44.63 37.39 47.94 42.75 45.65 44.35 43.19 44.45
Ours 1.3B 49.9 44.53 36.77 46.51 45.75 43.41 45.95 42.91 44.47
Table 5: Our multilingual 1.3B model achieves accuracy on zero-shot XNLI in line with XGLM (XGLM). The first row gives the reported XGLM results, and the second our reproduction of those results to validate our multilingual evaluation setup. The last two rows show that our multilingual model matches the 1.7B XGLM results.

5 Multilinguality

Pretraining Average EAI Results
English-only 41.72
Multilingual 38.55
Table 6: Multilingual pretraining very significantly diminishes English zero-shot generalization. Both models trained on OSCAR for 112B tokens.

The majority of 100B+ language models have been trained in English, with notable exceptions in Chinese (zeng2021pangu; wu2021yuan) and Korean Kim2021WhatCC. Smaller massively multilingual models have seen wider adoption mT5, but these models are not well suited to zero-shot use. Recent results on large GPT-like multilingual models show that their English-only performance is usually disappointing XGLM.

Training data.

We train a multilingual model to evaluate the effectiveness and potential impacts of this practice. We again use the OSCAR dataset (ortiz2019oscar), but now include multiple languages rather than only English as in the earlier experiments: Arabic, Basque, Bengali, Chinese, Catalan, English, French, Hindi, Indonesian, Portuguese, Spanish, Urdu, and Vietnamese. We sample each language with a different probability, downsampling the most frequent languages and upsampling the least frequent ones so that all languages are represented. We estimate the sampling probabilities similarly to Xue2021mT5AM.
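A minimal sketch of this kind of exponentiated sampling, in the spirit of Xue2021mT5AM; the α value and token counts below are placeholders, not the ones used for our corpus.

```python
def sampling_probs(token_counts: dict, alpha: float = 0.3) -> dict:
    """Sample language i with probability proportional to (n_i / N) ** alpha,
    which downsamples high-resource languages and upsamples low-resource ones.
    alpha = 0.3 is the mT5 default; the value used for our corpus may differ."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Example with made-up counts: English dominates the raw data but is downsampled
# relative to Basque after exponentiation.
print(sampling_probs({"en": 300e9, "fr": 60e9, "eu": 0.5e9}))
```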

English-only evaluation.

We first evaluate our multilingual model on the same set of English benchmarks we have used previously, in Table 6. Multilinguality significantly lowers accuracy on the English benchmark, which is in line with the results from XGLM.

Multilingual evaluation.

Zero-shot multilingual evaluation is more challenging to set up because it requires writing new prompts for each language. Instead of manually writing prompts for each language, we therefore follow the strategy proposed by XGLM and use English prompts for non-English examples; this can be viewed as cross-lingual zero-shot generalization. XGLM validated this strategy by demonstrating that it achieves zero-shot performance on par with (and sometimes even better than) human-written language-specific prompts. This strategy also probes cross-lingual abilities.

We evaluate on XNLI (conneau2018xnli), a multilingual NLI dataset that covers 8 of the languages we use for training. Our evaluation is different from the zero-shot evaluation of the XTREME benchmark Hu2020XTREMEAM. XTREME first finetunes the model on the English training data of each downstream task, then evaluates it on the non-English dataset, attempting cross-lingual generalization. Our evaluation avoids any finetuning, and instead relies entirely on zero-shot generalization.

Results.

Table 5 shows the XNLI results of our multilingual model and how it compares to XGLM XGLM. We were able to reproduce the results of XGLM-7.5B, which validates our evaluation setup. Furthermore, the table shows that the performance of our 1.3B model is in line with that of the 1.7B XGLM model, validating that our multilingual setup achieves competitive results. It is worth noting that our 1.3B model is trained on only 112B tokens from 13 languages, while XGLM is trained on 500B tokens from 30 languages. As far as we are aware, this is the first independent replication of the main results of XGLM.

Language-specific scaling laws.

To explore how scale influences multilinguality, we train a wider range of models (i.e. 0.3-6B parameters) on a larger corpus of more than 300B tokens of text drawn from a variety of languages roots. In Figure 3, we show scaling laws for Arabic, Catalan, Code, English, Spanish, Basque, French, Indonesian, Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, Urdu, aggregated Niger-Congo languages, Portuguese, Vietnamese, Simplified and Traditional Chinese.

Smaller models struggle more with under-represented languages such as those in the Indic and Niger-Congo families. For example, the loss of the sub-1B models goes up at the end of training for Malayalam, Odia, and Telugu. As data is not repeated, this effect is unlikely to be due to overfitting; we interpret it as insufficient capacity in the model to handle many language representations, with data in the dominant languages causing catastrophic forgetting of the less represented ones. In contrast, the largest model sees its loss decrease smoothly for every language: larger models handle multilinguality more easily. Overall, scaling law coefficients are consistent across well-represented languages, differing only in offsets.
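For illustration, a minimal per-language fit of this kind, assuming a simple power-law form L(C) = offset * C ** (-exponent); the loss values below are made up purely to show the mechanics, not measurements from our runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, offset, exponent):
    """Assumed functional form for illustration: L(C) = offset * C ** (-exponent)."""
    return offset * compute ** (-exponent)

# Compute budgets (e.g. in PF-days) and per-language validation losses along training.
# These numbers are invented placeholders, used only to demonstrate the fitting step.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss_en = np.array([3.10, 2.90, 2.75, 2.62, 2.51])

(offset, exponent), _ = curve_fit(power_law, compute, loss_en, p0=(3.0, 0.05))
print(f"offset={offset:.2f}, exponent={exponent:.3f}")
```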

Figure 3: Scaling laws across languages for the smaller BLOOM models. Black line is Pareto frontier of optimality (best loss at a given compute), dashed line is best fit. Fit coefficients are detailed in Appendix B. All sufficiently represented languages exhibit similar scaling behaviour, with mostly differences in loss offsets.

6 Scaling to 176B parameters

We now detail how our previous findings influence our architecture and scaling decisions for the final 176B BLOOM model.

Compute allocation.

We have been allocated 18 weeks of dedicated use of a partition with 52 nodes of 8x 80GB A100 GPUs on the Jean Zay supercomputer. We set four nodes aside as spares, so that our compute budget amounts to 1,161,216 A100-hours in total. Assuming a throughput of 100 model TFLOPS per GPU, approximately corresponding to a state-of-the-art hardware throughput of 150 TFLOPS narayanan2021efficient, we have a compute budget of 4,838 PF-days for the model training. We round this down to 4,500 PF-days, the resulting 7% safety margin accounting for potential downtime and inefficiencies (e.g., batch size ramp-up) during training. To put this number in perspective, this is roughly 23% more than the training budget of GPT-3. Given this compute budget, our English-only scaling laws in Figure 1 predict an optimal allocation for training a 392B parameter model for 165B tokens. We will use these as bounds: the largest model we can afford is 392B parameters, and the minimum number of tokens to train on is 165B tokens.
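The budget arithmetic above can be reproduced as a back-of-the-envelope check, using only the numbers quoted in this section:

```python
# GPU-hour budget
gpus = 48 * 8                               # 52 nodes minus 4 spares, 8 GPUs each
hours = 18 * 7 * 24                         # 18 weeks of dedicated use
a100_hours = gpus * hours                   # = 1,161,216 A100-hours

# Convert to PF-days at the assumed sustained model throughput
model_tflops = 100e12                       # 100 model TFLOPS per GPU
pf_day = 1e15 * 86_400                      # one PFLOPS for one day
budget_pf_days = a100_hours * 3600 * model_tflops / pf_day
print(round(budget_pf_days))                # ~4838 PF-days, rounded down to 4,500

# Token count implied by C ~ 6 * N * D at the rounded budget and the 392B upper bound
C = 4500 * pf_day
N = 392e9
print(C / (6 * N) / 1e9)                    # ~165B tokens for a 392B-parameter model
```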

Model                    | Size [Bparams.] | Pretraining [Btokens] | Budget [PF-days] | Layers | Hidden dim. | Attention heads (num. × dim.)
LaMDA thoppilan2022lamda | 137             | 432                   | 4,106            | 64     | 8,192       | 128 × 64
GPT-3 brown2020gpt3      | 175             | 300                   | 3,646            | 96     | 12,288      | 96 × 128
J1-Jumbo J1WhitePaper    | 178             | 300                   | 3,708            | 76     | 13,824      | 96 × 144
PanGu-α zeng2021pangu    | 207             | 42                    | 604              | 64     | 16,384      | 128 × 128
Yuan wu2021yuan          | 245             | 180                   | 3,063            | 76     | 16,384      | —
Gopher rae2021scaling    | 280             | 300                   | 4,313            | 80     | 16,384      | 128 × 128
MT-530B smith2022using   | 530             | 270                   | 9,938            | 105    | 20,480      | 128 × 160
Table 7: State-of-the-art 100B+ models with publicly available details. The compute budget is expressed in model PF-days required for training, using the approximation of kaplan2020scaling. The number of tokens for LaMDA is inferred from its reported compute budget and size. Yuan did not report attention head details.
Config | Size [Bparams.] | Layers | Hidden dim. | Attention heads (num. × dim.) | Memory [GB] | Performance [sec/iter.] | [TFLOPs]
(1)    | 178             | 82     | 13,312      | 64 × 208                      | 63          | 104                     | 152
(2)    | 178             | 82     | 13,312      | 128 × 104                     | 60          | 109                     | 146
(3)    | 176             | 70     | 14,336      | 112 × 128                     | 59          | 105                     | 150
Table 8: We choose configuration (3) as the final configuration for our 176B model. (1) was rejected because of its high attention head dimension, and (3) was favored over (2) because of its higher throughput. Appendix D details all 20 final configurations benchmarked; only the best three are displayed here.

Model shape.

kaplan2020scaling studied how the loss depends on model shape, and found only a limited impact across a wide range of feed-forward ratios (feed-forward hidden size over model hidden size), aspect ratios (hidden size over number of layers), and attention head dimensions.

levine2020limits proposed a theoretically motivated and empirically backed law describing the optimal compromise between width and depth. They predict that 100B+ parameters models such as GPT-3 are too deep, while models in the 10B or smaller range are usually too shallow. For a GPT-3-sized model with 175B parameters, they predict an ideal depth of 80 layers.

6.1 Final Model Architecture

We set three main guidelines for our final model:


  • 300-400B tokens. We want to guarantee our model will train on around 300-400B tokens of data. This is in the upper range for models of the size we are pursuing, ensuring that low-resource languages will not be allocated too few tokens. Using the approximation C ≈ 6ND of kaplan2020scaling, with C = 4,500 PF-days and D = 300-400B tokens, this constrains the model size to be around 160-200B parameters.

  • 70-80 layers. From levine2020limits and the size constraint above, we estimate that our model should have between 70 and 80 layers.

  • Maximum throughput. Finally, we want the final architecture to have as high a throughput per GPU as possible, as more compute translates directly into longer pretraining and thus a better model. Engineering constraints also come into play here: wide, shallow models are typically easier to parallelize across nodes, up to the point where excessive tensor parallelism becomes necessary because of memory constraints.

We detail in Table 7 the architectures of current state-of-the-art 100B+ models. From these guidelines, we benchmark 20 model configurations, detailed in Appendix D. Among these configurations, we select three of particular interest, outlined in Table 8. They best fit our guidelines above, and offer high throughput, maximizing our training budget.

We discard configuration (1), as its attention heads are much larger than those of other models in the literature. Configuration (3) is shallower than recommended by levine2020limits, but delivers 3% higher throughput than (2). We thus choose configuration (3), both for its better throughput and because a shallower model introduces less latency at inference time.

7 Limitations

Optimal scaling.

Concurrent to this work, hoffmann2022training identified more compute-optimal scaling laws. For our compute budget, they would suggest a 50B parameter model trained for a trillion tokens. Interestingly, even in hindsight, it would have been difficult to follow this recommendation, as we would have been constrained by the availability of high-quality multilingual data and by the size of the BigScience training dataset, ROOTS roots. Note that our Figure 1 reproduces the methodology of kaplan2020scaling, as we did not account for the learning rate schedule as suggested by hoffmann2022training.
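As a rough cross-check, combining C ≈ 6ND with the commonly cited ~20 tokens-per-parameter ratio associated with hoffmann2022training (a simplification, not their exact law) lands close to this 50B-parameter / 1T-token recommendation:

```python
# Rough Chinchilla-style allocation check for the same training budget.
pf_day = 1e15 * 86_400
C = 4500 * pf_day                   # our training budget in FLOP

# Solve C = 6 * N * D with the simplified rule of thumb D = 20 * N.
N = (C / (6 * 20)) ** 0.5
D = 20 * N
print(f"N ~ {N / 1e9:.0f}B parameters, D ~ {D / 1e12:.2f}T tokens")
# ~57B parameters and ~1.1T tokens, consistent with the ~50B / ~1T figure above.
```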

Other hyperparameters.

In this work we have focused on a subset of the available hyperparameter space of large language models. We have investigated architecture decisions around positional embeddings, activation functions and the embedding norm. Alternative attention mechanisms tay2020long or optimizers are examples of other dimensions that could be investigated, potentially leading to improved models.

Efficient fine-tuning.

Our study is focused on zero-shot use and does not consider efficient fine-tuning lester2021power; zaken2021bitfit, which is quite relevant for large language models, and which may lead to different conclusions.

8 Conclusion

Seeking to establish the best possible model architecture that can be accommodated within a fixed 1,000,000 GPU-hours compute budget, we have presented an extensive study on principled modeling decisions for large language models.

First, we have found that complementing Common Crawl data with high-quality cross-domain curated data can boost zero-shot generalization, validating previous suggestions rossettnlg; gao2020pile. Through an ablation study, we have identified ALiBi as the position embedding of choice, confirmed the potential of SwiGLU, and highlighted that stabilizing techniques such as embedding normalization can come at the expense of zero-shot generalization. Exploring multilinguality, we have found that multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks, but that given enough scale they can learn under-resourced languages alongside larger ones. Finally, we identified a candidate architecture for BLOOM 176B, outlining the full reasoning behind every architectural parameter, including model shape.

At variance with previous 100B+ models, such as GPT-3 brown2020gpt3 or Gopher rae2021scaling, this project was conducted in the open, and resulted in a number of open-access artefacts. Notable similar projects conducted in parallel to this one include OPT zhang2022opt and GLM zeng2022glm, although they lacked the collaborative and massively multilingual components of this project.

We hope our work can help practitioners better understand modeling decisions, leading to better language models, and that this transparency will accelerate future similar work.

Acknowledgements

This work was granted access to the HPC resources of the Institut du développement et des ressources en informatique scientifique (IDRIS) of the Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by the Grand équipement national de calcul intensif (GENCI). In particular, all training runs were performed on the Jean-Zay cluster of IDRIS, and we want to thank the IDRIS team for their responsive support throughout the project, in particular Rémi Lacroix. Evaluations of GPT-3 models were provided in part by the Allen Institute for Artificial Intelligence. We thank Leo Gao for his expertise and advice on language model evaluation.

References

Appendix A Open artefacts: models, code, and logs

We make public all artefacts produced as part of this work:

Appendix B Multilingual scaling laws

Language    Proportion [%]    Exponent    Offset

Arabic 4.6 0.057 1.16
Catalan 1.1 0.057 1.11
Code 10.8 0.054 0.94
English 30.0 0.051 1.08
Spanish 10.8 0.050 1.01
Basque 0.15 0.069 1.28
French 12.9 0.047 1.06
Indonesian 1.2 0.051 1.14
Assamese 0.01 0.051 1.31
Bengali 0.5 0.037 1.15
Gujarati 0.04 0.051 1.30
Hindi 0.7 0.045 1.14
Kannada 0.06 0.046 1.26
Malayalam 0.1 0.044 1.17
Marathi 0.05 0.046 1.23
Nepali 0.07 0.055 1.25
Odia 0.04 0.044 1.25
Punjabi 0.05 0.043 1.20
Tamil 0.2 0.030 1.14
Telugu 0.09 0.056 1.31
Urdu 0.1 0.068 1.31
Niger-Congo (family) 0.03 0.039 1.22
Portuguese 4.9 0.049 1.05
Vietnamese 2.7 0.053 1.08
Chinese (simplified) 16.2 0.052 1.09
Chinese (traditional) 0.05 0.050 1.15

Table 9: Best scaling law fit per language, fit to the runs reported in Figure 3. Except for a handful of languages that are poorly represented in the overall mixture (Basque, most of the Indic family, and the Niger-Congo languages), scaling differs mostly in the offset, not in the exponent.

Appendix C Evaluation details

Task Type Random baseline
ARC (clark2018arc) Challenge Natural Language Inference 25.0
Easy 25.0
GLUE MRPC (dolan2016mrpc) Paraphrase Identification 50.0
QQP (iyer2019qqp) Paraphrase Identification 50.0
HellaSwag (zellers2019hellaswag) Sentence Completion 25.0
LAMBADA (paperno2016lambada) Sentence Completion 0.0
LogiQA (liu2020logiqa) Multiple-Choice Question Answering 25.0
MathQA (amini2019mathqa) Multiple-Choice Question Answering 20.1
MC-TACO (zhou2019mctaco) Multiple-Choice Question Answering 36.2
OpenBookQA (mihaylov2018openbookqa) Multiple-Choice Question Answering 25.0
PIQA (bisk2020piqa) Multiple-Choice Question Answering 50.0
PROST (aroca-ouellette2021prost) Multiple-Choice Question Answering 25.0
PubMedQA (jin2019pubmedqa) Multiple-Choice Question Answering 33.3
QNLI (rajpurkar2016squad; wang2019glue) Sentence Completion 50.0
RACE (lai2017large) Closed-Book Question Answering 25.0
SciQ (welbl2017sciq) Multiple-Choice Question Answering 25.0
SST (socher2013sst) Sentiment 50.0
SuperGLUE Boolq (clark2019boolq) Multiple-Choice Question Answering 50.0
COPA (gordon2012copa) Sentence Completion 50.0
MultiRC (kashabi2018multirc) Multiple-Choice Question Answering 5.8
RTE (dagan2005rte) Natural Language Inference 50.0
WIC (pilehavar2018wic) Word Sense Disambiguation 50.0
WSC (levesque2012winograd) Word Sense Disambiguation 50.0
TriviaQA (joshi2017triviaqa) Closed-Book Question Answering 0.0
WebQuestions (berant2013semantic) Closed-Book Question Answering 0.0
Winogrande (sakaguchi2019winogrande) Coreference resolution 50.0
WNLI (sakaguchi2019winogrande) Natural Language Inference 50.0
EAI harness 33.3
Table 10: Evaluation tasks considered in the EAI harness and random baselines.

Appendix D Architecture details


Architecture (Size [Bparams.], Hidden dim., Layers, Attention heads num./dim.) | Parallelism (Data, Tensor, Pipeline, MBS) | Performance (Memory [GB], [s/iter.], [TFLOPs])
206 14,336 82 128 112 8 4 12 2 OOM
203 13,312 94 128 104 8 4 12 2 67 124.1 146.1
195 12,288 106 128 96 8 4 12 2 67 121.4 143.7 96 128 4 79 120.3 145.0 128 2 65 118.8 146.9 64 192 67 116.5 149.8
184 12,288 100 64 192 16 4 6 2 OOM 1 OOM 8 8 4 72 121.0 136.2 2 61 140.0 117.9
178 13,312 82 128 104 8 4 12 2 60 108.8 145.7 104 128 62 123.7 128.1 64 208 4 74 104.8 151.2 4 8 52 111.8 141.8 8 4 2 63 104.5 151.7
176 14,336 70 128 112 8 4 12 2 60 105.9 148.1 112 128 59 104.5 150.1 64 224 4 73 102.3 153.3 2 59 102.0 153.7 4 8 12 40 121.6 128.9

Table 11: Throughput and memory usage of the considered model sizes. Rows that omit values share them with the row above. Note that pipeline parallelism here assigns equal "slots" to embeddings and Transformer layers. This is important to optimize pipeline use, as our multilingual embeddings are quite large (250k vocabulary).

Appendix E All Results


Ablation Dataset Embedding Activation Embedding Norm Parameters 112GT 250GT 300GT
Embeddings OSCAR Learned GELU No 1.3B 41.71
Embeddings OSCAR None GELU No 1.3B 41.23
Embeddings OSCAR Rotary GELU No 1.3B 41.46
Embeddings OSCAR ALiBi GELU No 1.3B 43.70
Dataset The Pile Learned GELU No 1.3B 42.79 43.12 43.46
Dataset C4 Learned GELU No 1.3B 42.77
Dataset OSCAR Learned GELU No 1.3B 42.79
Activation The Pile Learned GELU No 1.3B 42.79
Activation The Pile Learned SwiGLU No 1.3B 42.95
Embedding Norm The Pile Learned GELU No 1.3B 42.79 43.12 43.46
Embedding Norm The Pile Learned GELU Yes 1.3B 42.24
Multilinguality OSCAR-ML Learned GELU No 1.3B 38.55
Multilinguality OSCAR Learned GELU No 1.3B 41.72
Scale OSCAR Learned GELU No 1.3B 41.72
Scale OSCAR Learned GELU No 13B 47.09

Table 12: Summary of all results obtained in this study. The final three columns give the average EAI harness results after 112, 250, and 300 billion training tokens (GT). Some rows are duplicated for ease of reading.
Public Name OpenAI: babbage OpenAI: curie gpt-neo 1.3B
Dataset C4 OSCAR The Pile The Pile The Pile The Pile The Pile OSCAR The Pile OSCAR OSCAR OSCAR OSCAR-ML
Embeddings Learned Learned Learned Learned Learned Learned Learned Learned Learned Rotary ALiBi None Learned
Activation GELU GELU GELU GELU GELU GELU GELU GELU SwiGLU GELU GELU GELU GELU
Embedding Norm No No No No No No No No No No No No No
Parameters in billion 1.3 6.7 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 13 1.3 1.3 1.3 1.3 1.3
Tokens trained in billion 300 300 300 112 112 112 250 300 300 330 300 112 112 112 112 112
task metric
arc_challenge acc arc_challengeacc 0.276 0.334 0.231 0.243 0.249 0.258 0.264 0.260 0.242 0.250 0.322 0.247 0.236 0.252 0.249 0.212
arc_challenge acc_norm arc_challengeacc_norm 0.295 0.375 0.259 0.274 0.261 0.275 0.277 0.286 0.277 0.290 0.342 0.268 0.270 0.276 0.260 0.243
arc_easy acc arc_easyacc 0.597 0.685 0.562 0.561 0.560 0.556 0.569 0.601 0.568 0.582 0.681 0.557 0.554 0.575 0.537 0.484
arc_easy acc_norm arc_easyacc_norm 0.555 0.633 0.502 0.503 0.478 0.506 0.518 0.528 0.516 0.515 0.600 0.502 0.476 0.491 0.461 0.434
boolq acc boolqacc 0.629 0.666 0.620 0.546 0.566 0.520 0.551 0.606 0.558 0.566 0.587 0.540 0.584 0.563 0.526 0.597
copa acc copaacc 0.810 0.850 0.690 0.700 0.720 0.710 0.710 0.730 0.690 0.690 0.880 0.660 0.690 0.780 0.680 0.710
hellaswag acc hellaswagacc 0.429 0.504 0.387 0.422 0.404 0.374 0.385 0.405 0.378 0.380 0.542 0.379 0.410 0.422 0.395 0.340
hellaswag acc_norm hellaswagacc_norm 0.545 0.664 0.489 0.551 0.515 0.464 0.486 0.521 0.477 0.476 0.716 0.475 0.524 0.549 0.495 0.424
lambada acc lambadaacc 0.625 0.694 0.572 0.469 0.481 0.569 0.575 0.609 0.581 0.580 0.634 0.574 0.496 0.501 0.454 0.408
logiqa acc logiqaacc 0.201 0.215 0.197 0.206 0.237 0.210 0.218 0.203 0.217 0.223 0.232 0.215 0.210 0.215 0.237 0.218
logiqa acc_norm logiqaacc_norm 0.269 0.292 0.273 0.267 0.270 0.275 0.286 0.269 0.281 0.280 0.275 0.272 0.254 0.272 0.293 0.283
mathqa acc mathqaacc 0.244 0.251 0.241 0.233 0.222 0.249 0.248 0.263 0.246 0.245 0.238 0.245 0.234 0.237 0.215 0.223
mathqa acc_norm mathqaacc_norm 0.242 0.247 0.237 0.228 0.228 0.246 0.245 0.259 0.242 0.242 0.235 0.234 0.229 0.238 0.221 0.222
mc_taco f1 mc_tacof1 0.458 0.484 0.493 0.361 0.293 0.485 0.488 0.494 0.487 0.489 0.497 0.493 0.461 0.337 0.477 0.387
mrpc acc mrpcacc 0.578 0.684 0.684 0.684 0.588 0.684 0.684 0.684 0.679 0.679 0.677 0.684 0.684 0.684 0.679 0.302
mrpc f1 mrpcf1 0.718 0.812 0.812 0.812 0.702 0.812 0.812 0.812 0.808 0.809 0.806 0.812 0.812 0.812 0.808 0.090
multirc acc multircacc 0.018 0.015 0.018 0.018 0.026 0.023 0.024 0.023 0.025 0.008 0.018 0.026 0.009 0.011 0.016 0.040
openbookqa acc openbookqaacc 0.224 0.290 0.216 0.220 0.200 0.190 0.196 0.222 0.194 0.208 0.294 0.214 0.212 0.224 0.210 0.170
openbookqa acc_norm openbookqaacc_norm 0.336 0.386 0.336 0.336 0.328 0.316 0.314 0.334 0.302 0.312 0.412 0.320 0.344 0.340 0.332 0.276
piqa acc piqaacc 0.745 0.763 0.711 0.732 0.716 0.693 0.704 0.716 0.698 0.706 0.777 0.693 0.720 0.729 0.711 0.674
piqa acc_norm piqaacc_norm 0.746 0.772 0.711 0.730 0.721 0.705 0.705 0.717 0.698 0.701 0.788 0.689 0.721 0.731 0.711 0.682
prost acc prostacc 0.270 0.288 0.238 0.243 0.237 0.249 0.229 0.204 0.219 0.226 0.281 0.244 0.287 0.280 0.240 0.253
prost acc_norm prostacc_norm 0.260 0.295 0.308 0.293 0.303 0.268 0.271 0.268 0.292 0.305 0.283 0.276 0.296 0.332 0.300 0.313
pubmedqa acc pubmedqaacc 0.611 0.622 0.544 0.573 0.438 0.563 0.589 0.662 0.612 0.612 0.615 0.589 0.507 0.514 0.486 0.412
qnli acc qnliacc 0.512 0.529 0.499 0.476 0.507 0.505 0.506 0.505 0.499 0.499 0.517 0.498 0.493 0.481 0.493 0.493
qqp acc qqpacc 0.372 0.441 0.382 0.396 0.384 0.381 0.370 0.375 0.371 0.369 0.368 0.435 0.370 0.423 0.370 0.389
qqp f1 qqpf1 0.534 0.515 0.522 0.530 0.519 0.534 0.537 0.537 0.538 0.538 0.533 0.495 0.539 0.475 0.537 0.505
race acc raceacc 0.356 0.386 0.341 0.330 0.323 0.334 0.329 0.344 0.321 0.323 0.374 0.337 0.317 0.344 0.332 0.326
rte acc rteacc 0.585 0.552 0.603 0.502 0.534 0.563 0.549 0.578 0.563 0.549 0.524 0.527 0.545 0.524 0.527 0.505
sciq acc sciqacc 0.867 0.919 0.860 0.825 0.810 0.838 0.853 0.868 0.860 0.867 0.895 0.849 0.818 0.828 0.816 0.793
sciq acc_norm sciqacc_norm 0.809 0.896 0.770 0.747 0.717 0.755 0.762 0.792 0.791 0.803 0.815 0.770 0.718 0.728 0.698 0.702
sst acc sstacc 0.732 0.666 0.656 0.676 0.560 0.753 0.721 0.501 0.528 0.710 0.514 0.760 0.493 0.588 0.588 0.510
triviaqa acc triviaqaacc 0.115 0.195 0.052 0.027 0.025 0.056 0.065 0.058 0.047 0.049 0.133 0.050 0.031 0.039 0.028 0.021
webqs acc webqsacc 0.048 0.065 0.017 0.012 0.004 0.023 0.026 0.023 0.020 0.021 0.027 0.012 0.006 0.004 0.015 0.001
wic acc wicacc 0.495 0.500 0.500 0.495 0.508 0.495 0.500 0.500 0.498 0.500 0.498 0.500 0.498 0.492 0.500 0.500
winogrande acc winograndeacc 0.595 0.648 0.551 0.564 0.565 0.536 0.552 0.560 0.533 0.543 0.647 0.538 0.564 0.583 0.543 0.519
wsc acc wscacc 0.394 0.558 0.365 0.539 0.567 0.365 0.365 0.365 0.414 0.385 0.500 0.365 0.394 0.635 0.462 0.539
Avg acc 45.30% 49.28% 42.94% 42.77% 41.72% 42.79% 43.12% 43.46% 42.24% 43.08% 47.09% 42.95% 41.45% 43.70% 41.23% 38.55%