Recent years have seen the advent of large language models characterized by emergent capabilities (e.g., zero-shot generalization) arising from sheer scale alone radford2019language; brown2020gpt3. Scaling LLMs results in a predictable increase in performance: simple scaling laws connect the number of parameters, pretraining dataset size, and compute budget kaplan2020scaling; ganguli2022predictability; hoffmann2022training, providing a clear path towards more capable models. This paradigm shift has been fueled by the wide adoption of the Transformer vaswani2017attention, providing a scalable basis for practitioners to build upon.
In this paper, we design an architecture and training setup for a multilingual 100B+ parameter model (BLOOM, bigscience_workshop_2022), seeking to best use a fixed 1,000,000 A100-hours budget. Because of the costs involved with training large language models, we cannot exhaustively explore the landscape of possible models. Instead, we position ourselves as practitioners exploring "off-the-shelf" solutions. We thus test promising additions to the Transformer to attempt to reproduce their findings in a controlled, large-scale setting.
Although our main goal was to prepare the architecture and training setup of BLOOM, our findings are also valuable for practitioners building models in the 1-10B range, as they equally improve the performance of such smaller models. At variance with major works on large language models, we also make a significant effort towards reproducibility and openness: all of our pretrained models, code, and notes from our weekly meetings are made available. See Appendix A for the relevant links.
We first study the impact of pretraining corpora, positional embeddings, activation functions, and embedding norm on zero-shot generalization. We base our study on the popular GPT-2 architecture radford2019language, with experiments at the 1.3B-parameter scale. We then consider the impact of massive multilinguality, showing language-specific scaling laws in a multilingual setting for the first time. Finally, we describe our approach to drafting an architecture for the final 176B-parameter BLOOM model.
|Model||Size||Pretraining||Average EAI Results|
|OpenAI — Curie||6.7B||—||49.28|
|OpenAI — Babbage||1.3B||—||45.30|
|EleutherAI — GPT-Neo||1.3B||The Pile||42.94|
We first justify our choice to base our model on the popular recipe of combining a decoder-only model with an autoregressive language modeling objective, and introduce our experimental setup. We then discuss our evaluation benchmarks, and motivate our choice of zero-shot generalization as our key metric. Finally, we introduce the baselines we compare to throughout the paper.
2.1 Architecture and Pretraining Objective
In this paper, we base all models on a decoder-only Transformer pretrained with an autoregressive language modeling objective. This is a popular choice for large language models brown2020gpt3; rae2021scaling; thoppilan2022lamda, possibly because it lends itself to zero-shot application to many downstream tasks radford2019language. Alternatives include encoder-decoder models trained with a span-corruption objective (e.g., T5 raffel2019t5), as well as non-causal decoder models with visibility over a prefix (so-called Prefix LMs, liu2018generating; dong2019unified).
Our decision is motivated by the findings of wang2022language, which showed that decoder-only models combined with an autoregressive language modeling objective provide the best zero-shot generalization abilities immediately after pretraining. Although multitask finetuning Sanh2021MultitaskPT; wei2021finetuned will instead favor an encoder-decoder with span corruption for best zero-shot generalization, wang2022language found a compromise between these two practices. Following autoregressive pretraining, decoder-only models can be efficiently adapted into non-causal decoders, simply by extending pretraining with span corruption. This adaptation produces a second model, which can provide excellent zero-shot generalization after multitask finetuning. Accordingly, we follow their recommendation, and train an autoregressive decoder-only model first which we will later consider adapting and finetuning.
2.2 Experimental Setup
We follow the architecture of GPT-2 (radford2019language) and the hyperparameters of GPT-3 (brown2020gpt3). For the learning rate, we use linear warm-up over 375M tokens up to the maximum value, followed by cosine decay to a minimum value. We use a 1M-token batch size, with linear ramp-up over the first 4B tokens, and a sequence length of 2,048. We use the Adam optimizer kingma2014adam, with weight decay 0.1 and gradient clipping to 1.0. We also tie the word embedding and softmax matrices (tying). Unless noted otherwise, we conduct our experiments with 1.3B-parameter models, pretraining on 112B tokens.
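The token-based schedule above can be sketched as follows; the peak and minimum learning-rate values below are illustrative placeholders rather than the exact values used in our runs.

```python
import math

def learning_rate(tokens_seen, peak_lr=2e-4, min_lr=2e-5,
                  warmup_tokens=375e6, total_tokens=112e9):
    """Linear warm-up (measured in tokens) to peak_lr, then cosine decay
    to min_lr over the rest of training. peak_lr/min_lr are placeholders."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = min(1.0, (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warm-up and reaches the floor at the final training token.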
We picked this model size and dataset size as a compromise between compute cost and the likelihood that our conclusions would transfer to the target 100B+ model. Notably, we needed to be able to reliably measure zero-shot generalization above random chance. We note that training 1.3B-parameter models for 112B tokens brings them significantly above the optimality threshold of kaplan2020scaling, and of hoffmann2022training.
The main architectural difference with GPT-3 is that all our layers use full attention, while GPT-3 uses alternating sparse attention layers (sparse). The main value of sparse attention layers is to save compute with long sequence lengths. However, at the 100B+ scale, sparse attention layers provide negligible compute savings, as the vast majority of the compute is spent on the large feed-forward layers. kaplan2020scaling estimated the amount of compute per token for the forward pass to be:

$$C_{\text{forward}} = 2N + 2 n_{\text{layers}} n_{\text{ctx}} d_{\text{model}},$$

where $C_{\text{forward}}$ is the cost of the forward pass, $N$ is the number of non-embedding parameters, $n_{\text{layers}}$ is the number of layers, $d_{\text{model}}$ is the hidden dimension, and $n_{\text{ctx}}$ is the sequence length. Since $N \approx 12 n_{\text{layers}} d_{\text{model}}^2$, the second term is negligible if $d_{\text{model}} \gg n_{\text{ctx}}/12$, which is the case for our final model, where $d_{\text{model}} = 14{,}336$ and $n_{\text{ctx}} = 2{,}048$.
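To make this concrete, the two terms of the estimate can be evaluated for a model of the final BLOOM shape (roughly 176B parameters, 70 layers, hidden dimension 14,336, and 2,048-token context):

```python
def forward_flops_per_token(n_params, n_layers, d_model, n_ctx):
    """Kaplan et al.'s estimate: C_forward = 2*N + 2*n_layers*n_ctx*d_model."""
    return 2 * n_params + 2 * n_layers * n_ctx * d_model

# Final-model shape: ~176B parameters, 70 layers, d_model=14336, n_ctx=2048.
dense = 2 * 176e9             # parameter term, dominated by feed-forward layers
attn = 2 * 70 * 2048 * 14336  # context-dependent attention term
ratio = attn / (dense + attn) # attention term is ~1% of the total
```

At this scale the context-dependent term contributes on the order of one percent of forward-pass compute, which is why sparse attention saves almost nothing.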
What is a FLOP exactly?
We report throughput per GPU in FLOPS and total budgets in PF-days (i.e., one PFLOPS sustained for a day). It is important to highlight that FLOPS are never directly measured, but always estimated, with widely different practices across papers. We refer to as model FLOP the estimates based on the formula $C = 6ND$ from kaplan2020scaling, where $C$ is the total compute, $N$ the model size, and $D$ the number of tokens processed. These are the FLOP actually used to train the model, and which are used for scaling laws. We refer to as hardware FLOP the estimates reported by our codebase, using the formula from narayanan2021efficient. This notably includes gradient checkpointing, which trades additional computation for reduced memory needs, and a more thorough accounting of operations.
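As an example of the model-FLOP convention, the cost of one of our 1.3B-parameter ablation runs can be estimated as follows:

```python
PFLOP_DAY = 1e15 * 86400  # FLOP performed by one PFLOPS sustained for a day

def model_flop(n_params, n_tokens):
    """Model FLOP: C = 6*N*D, the scaling-law convention of kaplan2020scaling."""
    return 6 * n_params * n_tokens

# A 1.3B-parameter ablation model trained on 112B tokens:
cost_pf_days = model_flop(1.3e9, 112e9) / PFLOP_DAY  # ~10 PF-days
```

Hardware FLOP estimates would come out higher, since they also count recomputation from gradient checkpointing.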
2.3 Evaluation Benchmarks
We measure upstream performance using the language modeling loss on a held-out sample of the pretraining dataset. However, it is not always possible to compare losses across objectives and tokenizers. Moreover, as upstream performance is not always aligned with task performance Tay2021ScaleEI, we must also measure downstream performance explicitly, using zero-shot or few-shot generalization, with or without task-specific finetuning.
Specifically, we choose to measure zero-shot generalization on a diverse set of tasks. Few-shot and zero-shot results are strongly correlated: we found a Pearson correlation coefficient of 0.93 between zero-shot and few-shot performance across model sizes in brown2020gpt3. We do not rely on finetuning as it is not how the main final model is likely to be used, given its size and the challenges associated with finetuning at the 100B+ scale.
We use the popular EleutherAI Language Model Evaluation Harness (EAI harness, eval-harness), evaluating models across 27 diverse tasks that are similar to those used in brown2020gpt3 (see Appendix C for a list of tasks). Overall, the random baseline on our benchmark sits at 33.3%.
We use GPT-Neo gpt-neo, a 1.3B decoder-only autoregressive language model trained on the Pile gao2020pile, and GPT-3 brown2020gpt3, accessed via the OpenAI API. We evaluate two models, Babbage and Curie (now referred to as text-babbage-001 and text-curie-001). Based on gaosize and on how close our computed results are to those reported in the original paper, we assume Babbage is 1.3B while Curie is 6.7B. However, as details of the OpenAI API are kept secret, there is no way to make sure that the models are actually the ones described in brown2020gpt3 – the number of pretraining tokens reported in Table 1 is thus to be taken cautiously.
3 Impact of Pretraining Data
We first study the impact of pretraining data on zero-shot generalization. More diverse pretraining data, ideally curated from a cross-domain collection of high-quality datasets, has been suggested to help with downstream task performance and zero-shot generalization rossettnlg; gao2020pile.
We evaluate three possible corpora, all commonly used to train large language models:
OSCAR v1 (ortiz2019oscar), a multilingual, filtered version of Common Crawl (the more recent OSCAR v2 is a better dataset, but it was not available when we started this project);
C4 (raffel2019t5), specifically its replication by AllenAI, a processed and filtered version of Common Crawl;
The Pile (gao2020pile), a diverse pretraining corpus that contains webscrapes from Common Crawl in addition to high-quality data from cross-domain sources such as academic texts and source code.
For each pretraining corpus, we train a 1.3B parameter model for 112B tokens. For the Pile specifically, motivated by good early results at 112B tokens, we train up to 300B tokens, to compare with GPT-3 models and validate against GPT-Neo.
Evaluation results are outlined in Table 1. We find that training on the Pile produces models that are better at zero-shot generalization, with C4 a close second, and OSCAR significantly behind.
Importantly, this finding transfers to larger scales: as part of engineering test runs, a 13B model was trained on OSCAR for 300B tokens. We found this 13B model to underperform the 6.7B model from the OpenAI API, which we attribute to the low quality of the English data in OSCAR.
We also note that our model trained on the Pile outperforms the 1.3B GPT-Neo trained on the same dataset. Finally, our 1.3B model still underperforms the 1.3B model from the OpenAI API by 1.6%. The difference is most likely due to pretraining data, but we cannot investigate this further, as the GPT-3 training dataset is neither publicly available nor reproducible.
Finding 1. Diverse cross-domain pretraining data combining web crawls with curated high-quality sources improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.
4 Architecture Ablations
We now consider ablation studies to better identify the best positional embedding, activation function, and embedding normalization placement.
4.1 Positional Embeddings
Table 2: Average EAI results per positional embedding.
Originally, both static sinusoidal position embeddings and learned position embeddings were proposed to capture positional information; the latter are popular in large language models brown2020gpt3. su2021roformer proposed rotary embeddings, where the query and key representations inside the self-attention mechanism are modified such that the attention captures relative distances between them. Recently, press2021alibi introduced a method which does not use embeddings, instead directly attenuating the attention scores based on how far apart the keys and queries are.
We compare learned, rotary, and ALiBi position embeddings, and include a baseline without position embeddings. Our results are presented in Table 2. Although learned positional embeddings outperform rotary embeddings, ALiBi yields significantly better results than all alternatives. We also confirm the findings of biderman2021nopos: a baseline with no positional information exhibits competitive performance. While bidirectional models require positional embeddings to determine the location of tokens, we find autoregressive models can simply leverage the causal attention mask. We also confirm in Figure 2 the ability of ALiBi to extrapolate to sequences longer than those seen during training. Note that the results in Table 2 do not use any extrapolation: ALiBi embeddings are a better choice even without taking into account their ability to extrapolate.
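For reference, the ALiBi biases of press2021alibi can be sketched as follows. The head-slope formula below is the one given for power-of-two head counts, and the causal mask is folded in for illustration:

```python
def alibi_biases(seq_len, n_heads):
    """Static (non-learned) linear biases added to attention scores.

    Head h uses slope 2**(-8*(h+1)/n_heads) (power-of-two head counts);
    query i attending to key j <= i receives bias -slope*(i - j), so more
    distant keys are attenuated more. Future positions get -inf (causal mask).
    """
    slopes = [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]
    return [[[-m * (i - j) if j <= i else float("-inf")
              for j in range(seq_len)]
             for i in range(seq_len)]
            for m in slopes]

biases = alibi_biases(seq_len=4, n_heads=2)
```

Because the bias depends only on relative distance, nothing ties the model to the training sequence length, which is what enables extrapolation.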
Table 3: Average EAI results per activation function.
Finding 2. ALiBi positional embeddings significantly outperform other embeddings for zero-shot generalization.
4.2 Activation Functions
Large language models still mostly use the GELU activation hendrycks2016gaussian. We evaluate a recently proposed alternative, SwiGLU shazeer2020swiglu, which combines Gated Linear Units dauphin2016glu with the Swish activation function ramachandran2017searching.
SwiGLU uses extra parameters in the feed-forward layers. As suggested in shazeer2020swiglu, we compensate for this by reducing the hidden size of the feed-forward layer.
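A minimal numpy sketch of such a parameter-matched SwiGLU feed-forward layer is given below; the 2/3 scaling of the hidden size follows shazeer2020swiglu, while the shapes and random weights are purely illustrative:

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (Swish(x @ W_gate) * (x @ W_up)) @ W_down."""
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))  # Swish/SiLU: z * sigmoid(z)
    return (swish * (x @ w_up)) @ w_down

d_model = 8
# A GELU FFN has one 4*d_model hidden matrix pair; SwiGLU has three
# matrices, so the hidden size is scaled by 2/3 to match parameter counts.
d_ff = int(4 * d_model * 2 / 3)
rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(rng.normal(size=(2, d_model)), w_gate, w_up, w_down)
```

Note that the resulting hidden size is generally not a nice power of two, which matters for GPU throughput, as discussed below.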
We present our results in Table 3. SwiGLU produces slightly better results than GELU. For our final model, we adopted GELU, as we initially observed a lower throughput for SwiGLU. However, further benchmarking identified that this overhead was primarily associated with the change in the hidden size of the feedforward network. Indeed, this new size, 5,456, is divisible by neither the warp size of the GPU (Lashgar2013WarpSI) nor the number of streaming multiprocessors, resulting in both tile and wave quantization. We accordingly recommend using SwiGLU for future models.
4.3 Embedding Norm
bitsandbytes suggests that greater stability of training can be achieved by including an extra layer normalization layernorm after the embedding layer. We evaluate the performance impact of such a modification in Table 4. We note that this incurs a significant reduction in the performance of the model. However, models above 100 billion parameters are notoriously unstable and require considerable engineering efforts in order to be kept stable. If this addition provides increased stability when training, it may be valuable.
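The modification under study is simply a layer normalization applied to the output of the embedding lookup; a minimal sketch, omitting the learned scale and bias parameters:

```python
import numpy as np

def embed_with_norm(token_ids, embedding_table, eps=1e-5):
    """Token embedding lookup followed by layer normalization."""
    x = embedding_table[token_ids]
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

out = embed_with_norm(np.array([0, 2]), np.arange(12.0).reshape(3, 4))
```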
Table 4: Average EAI results with and without embedding norm.
Finding 3. Adding layer normalization after the embedding layer incurs a significant penalty on zero-shot generalization.
Table 6: Average EAI results per pretraining setup.
5 Multilinguality

The majority of 100B+ language models have been trained in English, with notable exceptions in Chinese (zeng2021pangu; wu2021yuan) and Korean Kim2021WhatCC. Smaller massively multilingual models have seen wider adoption mT5, but these models are not well suited to zero-shot use. Recent results on large GPT-like multilingual models show that English-only performance is usually disappointing XGLM.
We train a multilingual model to evaluate the effectiveness and potential impacts of this practice. We again use the OSCAR dataset (ortiz2019oscar), but now include multiple languages rather than only English as in the earlier experiments: Arabic, Basque, Bengali, Chinese, Catalan, English, French, Hindi, Indonesian, Portuguese, Spanish, Urdu, and Vietnamese. We sample each language with a different probability, downsampling the most frequent languages and upsampling the least frequent ones, so that all languages are represented. We estimate the sampling probabilities similarly to Xue2021mT5AM.
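This sampling scheme can be sketched with an exponentiated-probability rule. The exponent value of 0.3 below is the one used by mT5 and is only an assumption here, since we merely estimate probabilities in a similar fashion:

```python
def sampling_probs(token_counts, alpha=0.3):
    """Exponentiated sampling a la mT5: p_i proportional to n_i**alpha.

    alpha < 1 flattens the distribution, downsampling high-resource
    languages and upsampling low-resource ones. alpha=0.3 is the mT5
    value, used here purely for illustration.
    """
    weighted = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}

# Hypothetical per-language token counts:
probs = sampling_probs({"en": 1000e9, "fr": 200e9, "ur": 0.5e9})
```

With alpha=0.3, a 2,000x gap in raw data between two languages shrinks to roughly a 10x gap in sampling probability.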
We first evaluate our multilingual model on the same set of English benchmarks we have used previously, in Table 6. Multilinguality significantly lowers accuracy on the English benchmark, which is in line with the results from XGLM.
Zero-shot multilingual evaluation is more challenging to set up, because it requires writing new prompts for each new language. Therefore, instead of manually writing prompts for each language, we follow the strategy proposed by XGLM, using English prompts for non-English examples – this can be viewed as cross-lingual zero-shot generalization. XGLM validated this strategy by demonstrating that it achieves zero-shot performance on par with (and sometimes even better than) human-written language-specific prompts.
We evaluate on XNLI (conneau2018xnli), a multilingual NLI dataset that covers 8 of the languages we use for training. Our evaluation is different from the zero-shot evaluation of the XTREME benchmark Hu2020XTREMEAM. XTREME first finetunes the model on the English training data of each downstream task, then evaluates it on the non-English dataset, attempting cross-lingual generalization. Our evaluation avoids any finetuning, and instead relies entirely on zero-shot generalization.
Table 5 shows the XNLI results of our multilingual model and how it compares to XGLM XGLM. We were able to reproduce the results of XGLM-7.5B, which validates our evaluation setup. Furthermore, the table shows that the performance of our 1.3B model is in line with the XGLM 1.7B model, validating that our multilingual setup achieves competitive results. It is worth noting that our 1.3B model is trained on only 112B tokens from 13 languages, while XGLM is trained on 500B tokens from 30 languages. As far as we are aware, this is the first independent replication of the main results of XGLM.
Language-specific scaling laws.
To explore how scale influences multilinguality, we train a wider range of models (i.e. 0.3-6B parameters) on a larger corpus of more than 300B tokens of text drawn from a variety of languages roots. In Figure 3, we show scaling laws for Arabic, Catalan, Code, English, Spanish, Basque, French, Indonesian, Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, Urdu, aggregated Niger-Congo languages, Portuguese, Vietnamese, Simplified and Traditional Chinese.
Smaller models struggle more with under-represented languages, such as those in the Indic and Niger-Congo families. For example, the loss of the sub-1B models goes up at the end of training for Malayalam, Odia, and Telugu. As data is not repeated, this effect is unlikely to be due to overfitting; we interpret it as insufficient model capacity to handle many language representations, with data in the dominant languages causing catastrophic forgetting of less represented ones. In contrast, the largest model sees its loss decrease smoothly for every language: larger models handle multilinguality more easily. Overall, scaling law coefficients are consistent across well-represented languages, differing only in offsets.
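A per-language scaling law of the form L(N) = a * N^(-b) can be fit by linear regression in log-log space; a minimal sketch, here recovering a synthetic law rather than our measured losses:

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit L(N) = a * N**(-b) via least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    return np.exp(intercept), -slope  # (prefactor a, exponent b)

# Synthetic check: recover a known law L = 20 * N**(-0.07).
N = np.array([3e8, 1e9, 3e9, 6e9])
a, b = fit_power_law(N, 20.0 * N ** -0.07)
```

Comparing fitted (a, b) pairs across languages is what reveals shared exponents with language-specific offsets.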
6 Scaling to 176B parameters
We now detail how our previous findings influence our architecture and scaling decisions for the final 176B BLOOM model.
We have been allocated 18 weeks of dedicated use of a partition with 52 nodes of 8x 80GB A100 GPUs on the Jean Zay supercomputer. We set four nodes aside as spares, so that our compute budget amounts to 1,161,216 A100-hours in total. Assuming a throughput of 100 model TFLOPS, approximately corresponding to state-of-the-art hardware FLOPS of 150 narayanan2021efficient, we have a compute budget of 4,838 PF-days for the model training. We round this down to 4,500 PF-days, this 7% safety margin accounting for potential downtime and inefficiencies (e.g., batch size ramp-up) during training. To put this number in perspective, this is about 24% more than the 3,640 PF-days training budget of GPT-3. Given this compute budget, our English-only scaling laws in Figure 1 predict an optimal allocation for training a 392B parameter model for 165B tokens. We will use these as bounds: the largest model we can afford is 392B parameters, and the minimum number of tokens to train on is 165B tokens.
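The budget arithmetic above can be checked directly (48 usable nodes of 8 GPUs each for 18 weeks, at a sustained 100 model TFLOPS per GPU):

```python
gpus = 48 * 8                    # 52 nodes minus 4 spares, 8 A100s per node
a100_hours = gpus * 18 * 7 * 24  # 18 weeks of dedicated use

# Convert 100 TFLOPS sustained per GPU into PF-days
# (1 PF-day = 1e15 FLOPS sustained for 86,400 seconds):
pf_days = a100_hours * 100e12 * 3600 / (1e15 * 86400)
```

This reproduces the 1,161,216 A100-hours and roughly 4,838 PF-days quoted above.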
Table 7: Architectures of current 100B+ models (size, pretraining, budget, layers, hidden dim., attention heads).
Table 8: Benchmarked candidate configurations (size, layers, hidden dim., attention heads, memory, performance).
kaplan2020scaling studied the dependence of the loss on model shape, and found only a limited impact within a wide range of feed-forward ratios ($d_{\text{ff}} / d_{\text{model}}$), aspect ratios ($d_{\text{model}} / n_{\text{layers}}$), and attention head dimensions.
levine2020limits proposed a theoretically motivated and empirically backed law describing the optimal compromise between width and depth. They predict that 100B+ parameters models such as GPT-3 are too deep, while models in the 10B or smaller range are usually too shallow. For a GPT-3-sized model with 175B parameters, they predict an ideal depth of 80 layers.
6.1 Final Model Architecture
We set three main guidelines for our final model:
300-400B tokens. We want to guarantee our model will train on around 300-400B tokens of data. This is in the upper range for models in the size range we are pursuing, ensuring that low-resource languages will not be allocated too few tokens. Using the approximation $C \approx 6ND$ from kaplan2020scaling, with $C = 4{,}500$ PF-days and $D = 300$-400B tokens, this constrains the model size to be around 160-200B parameters.
70-80 layers. From levine2020limits and the size constraint above, we estimate that our model should have between 70 and 80 layers.
Finally, we want the final architecture to have as high a throughput per GPU as possible, as more compute translates directly into longer pretraining and thus a better model. Engineering constraints also come into play here: wide, shallow models are typically easier to parallelize across nodes, up to the point where excessive tensor parallelism becomes necessary due to memory constraints.
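Inverting the C ≈ 6ND approximation gives the model-size range quoted in the first guideline:

```python
PFLOP_DAY = 1e15 * 86400  # FLOP in one PFLOPS sustained for a day

def implied_model_size(budget_pf_days, n_tokens):
    """Solve C = 6*N*D for N, given a compute budget and a token count."""
    return budget_pf_days * PFLOP_DAY / (6 * n_tokens)

n_hi = implied_model_size(4500, 300e9)  # upper end, ~216B parameters
n_lo = implied_model_size(4500, 400e9)  # lower end, ~162B parameters
```

These values are broadly consistent with the roughly 160-200B parameter range cited in the guidelines, up to rounding of the budget.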
We detail in Table 7 the architectures of current state-of-the-art 100B+ models. From these guidelines, we benchmark 20 model configurations, detailed in Appendix D. Among these configurations, we select three of particular interest, outlined in Table 8. They best fit our guidelines above, and offer high throughput, maximizing our training budget.
We discard configuration (1), as its attention heads are much larger than those of other models in the literature. Configuration (3) is shallower than recommended by levine2020limits, but delivers 3% higher throughput than (2). We thus choose configuration (3), both for its better throughput and because a shallower model introduces less latency at inference time.
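As a sanity check on the chosen shape, the parameter count can be approximated with the standard N ≈ 12 · n_layers · d_model² estimate plus the embedding matrix. The shape values below are those of the final 176B model; the vocabulary size is an approximate, assumed figure:

```python
def approx_params(n_layers, d_model, vocab_size=0):
    """Kaplan et al.'s non-embedding estimate N ~= 12*n_layers*d_model**2,
    plus an embedding term vocab_size*d_model."""
    return 12 * n_layers * d_model ** 2 + vocab_size * d_model

# Final model: 70 layers, hidden dim 14,336, ~250k multilingual vocabulary.
n = approx_params(n_layers=70, d_model=14336, vocab_size=250_000)  # ~176B
```

The large multilingual vocabulary contributes a few billion embedding parameters on top of the transformer body.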
Concurrent to this work, hoffmann2022training identified more optimal scaling laws. For our compute budget, they would suggest a 50B parameter model trained for a trillion tokens. Interestingly, even in hindsight, it would have been difficult to follow this recommendation: we would have been constrained by the limited availability of high-quality multilingual data and by the size of the BigScience training dataset, ROOTS roots. Note that our Figure 1 reproduces kaplan2020scaling, as we did not account for the learning rate schedule as suggested by hoffmann2022training.
In this work we have focused on a subset of the available hyperparameter space of large language models. We have investigated architecture decisions around positional embeddings, activation functions and the embedding norm. Alternative attention mechanisms tay2020long or optimizers are examples of other dimensions that could be investigated, potentially leading to improved models.
Our study is focused on zero-shot use and does not consider efficient fine-tuning lester2021power; zaken2021bitfit, which is quite relevant for large language models, and which may lead to different conclusions.
Seeking to establish the best possible model architecture that can be accommodated within a fixed 1,000,000 GPU-hours compute budget, we have presented an extensive study on principled modeling decisions for large language models.
First, we have found that complementing Common Crawl data with high-quality cross-domain curated data can boost zero-shot generalization, validating previous suggestions rossettnlg; gao2020pile. Through an ablation study, we have identified ALiBi as the position embedding of choice, confirmed the potential of SwiGLU, and highlighted that stabilizing techniques such as embedding normalization sometimes come at the expense of zero-shot generalization. Exploring multilinguality, we have found that multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks, but that they can learn under-resourced languages along with larger ones if given enough scale. Finally, we identified a candidate architecture for BLOOM 176B, outlining the full reasoning behind every architectural parameter, including model shape.
At variance with previous 100B+ models, such as GPT-3 brown2020gpt3 or Gopher rae2021scaling, this project was conducted in the open, and resulted in a number of open-access artefacts. Notable similar projects conducted in parallel to this one include OPT zhang2022opt and GLM zeng2022glm, although they lacked the collaborative and massively multilingual components of this project.
We hope our work can help practitioners better understand modeling decisions, leading to better language models, and that this transparency will accelerate future similar work.
This work was granted access to the HPC resources of Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by Grand équipement national de calcul intensif (GENCI). In particular, all training runs were performed on the Jean Zay cluster of IDRIS, and we want to thank the IDRIS team for their responsive support throughout the project, in particular Rémi Lacroix. Evaluations of GPT-3 models were provided in part by the Allen Institute for Artificial Intelligence. We thank Leo Gao for his expertise and advice on language model evaluation.
Appendix A Open artefacts: models, code, and logs
We make public all artefacts produced as part of this work:
Models. All trained models are centralized at https://huggingface.co/bigscience;
Code. All code is available at https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/megatron;
Discussions and logbook. The notes from the weekly meetings of our working group are made available at https://docs.google.com/document/d/1qbIkhd6bvbOsJOWXL7SfKQ0jey3MWQYQb_SshqH1LII/.
Appendix B Multilingual scaling laws
Appendix C Evaluation details
|ARC (clark2018arc)||Challenge||Natural Language Inference||25.0|
|GLUE||MRPC (dolan2016mrpc)||Paraphrase Identification||50.0|
|QQP (iyer2019qqp)||Paraphrase Identification||50.0|
|HellaSwag (zellers2019hellaswag)||Sentence Completion||25.0|
|LAMBADA (paperno2016lambada)||Sentence Completion||0.0|
|LogiQA (liu2020logiqa)||Multiple-Choice Question Answering||25.0|
|MathQA (amini2019mathqa)||Multiple-Choice Question Answering||20.1|
|MC-TACO (zhou2019mctaco)||Multiple-Choice Question Answering||36.2|
|OpenBookQA (mihaylov2018openbookqa)||Multiple-Choice Question Answering||25.0|
|PIQA (bisk2020piqa)||Multiple-Choice Question Answering||50.0|
|PROST (aroca-ouellette2021prost)||Multiple-Choice Question Answering||25.0|
|PubMedQA (jin2019pubmedqa)||Multiple-Choice Question Answering||33.3|
|QNLI (rajpurkar2016squad; wang2019glue)||Sentence Completion||50.0|
|RACE (lai2017large)||Closed-Book Question Answering||25.0|
|SciQ (welbl2017sciq)||Multiple-Choice Question Answering||25.0|
|SuperGLUE||Boolq (clark2019boolq)||Multiple-Choice Question Answering||50.0|
|COPA (gordon2012copa)||Sentence Completion||50.0|
|MultiRC (kashabi2018multirc)||Multiple-Choice Question Answering||5.8|
|RTE (dagan2005rte)||Natural Language Inference||50.0|
|WIC (pilehavar2018wic)||Word Sense Disambiguation||50.0|
|WSC (levesque2012winograd)||Word Sense Disambiguation||50.0|
|TriviaQA (joshi2017triviaqa)||Closed-Book Question Answering||0.0|
|WebQuestions (berant2013semantic)||Closed-Book Question Answering||0.0|
|Winogrande (sakaguchi2019winogrande)||Coreference resolution||50.0|
|WNLI (wang2019glue)||Natural Language Inference||50.0|
Appendix D Architecture details
Appendix E All Results
|Public Name||OpenAI: babbage||OpenAI: curie||GPT-Neo 1.3B|
|Dataset||C4||OSCAR||The Pile||The Pile||The Pile||The Pile||The Pile||OSCAR||The Pile||OSCAR||OSCAR||OSCAR||OSCAR-ML|
|Parameters in billion||1.3||6.7||1.3||1.3||1.3||1.3||1.3||1.3||1.3||1.3||13||1.3||1.3||1.3||1.3||1.3|
|Tokens trained in billion||300||300||300||112||112||112||250||300||300||330||300||112||112||112||112||112|