PAGnol: An Extra-Large French Generative Model

10/16/2021 ∙ by Julien Launay, et al. ∙ LightOn ∙ Inria

Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and performant versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art on abstractive summarization. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.


1 Introduction

Large pre-trained language models are the workhorses of modern Natural Language Processing (NLP). The use of scalable and efficient attention-based Transformers Vaswani et al. (2017), rather than recurrent neural networks, has enabled increasingly large and capable models. Through self-supervised learning, these models learn contextual word embeddings, building a general representation of language. After this pre-training, they can be fine-tuned to target specific tasks (e.g. classification, parsing, summarization).

Three approaches dominate the field: (1) causal autoregressive decoder-only models, such as GPT Radford et al. (2018), learning from a general language modelling task; (2) bidirectional encoder-only models, such as BERT Devlin et al. (2018), learning from masked language modelling; (3) sequence-to-sequence models, such as BART Lewis et al. (2019) or T5 Raffel et al. (2020), combining both a bidirectional encoder and an autoregressive decoder, learning from a language denoising task. Encoder-only and sequence-to-sequence models excel in language understanding tasks, and have overshadowed autoregressive models, relegating them to a lesser option.

Autoregressive models have been shown to predictably benefit from increased size Kaplan et al. (2020); Henighan et al. (2020). Scaling laws establish a direct relationship between model size and end-task performance, justifying the training of increasingly large models Brown et al. (2020); Zeng et al. (2021); Kim et al. (2021); Wei et al. (2021). These laws can also inform design decisions, helping practitioners use their available compute budget optimally. A significant finding has been that larger models are more sample and compute efficient: with a given compute budget, it is preferable to train a larger model significantly short of convergence than to train a smaller model to convergence. Furthermore, at extreme scale, such as the 175 billion parameters of GPT-3 Brown et al. (2020), autoregressive models exhibit unique few-shot abilities: they can learn from a few prompted examples, without weight updates. This capability calls the current fine-tuning paradigm into question, and may make billion-parameter models more attractive and usable. Recent forays into prompt engineering/tuning Li and Liang (2021); Lester et al. (2021) have even seemingly bridged the gap between few-shot performance and fine-tuning.

Encoder-only (CamemBERT Martin et al. (2019) and FlauBERT Le et al. (2019)) and sequence-to-sequence models (BARThez Eddine et al. (2020a)) exist for the French language, and recently a decoder-only model with 1 billion parameters has been made available Simoulin and Crabbé (2021). We introduce PAGnol (PAG: Pré-Apprentissage Génératif; Marcel Pagnol was a famous French novelist) in this family, a collection of four French GPT-like models, and make the following contributions:

  • Largest French model. We train on CCNet and publicly release four models, with up to 1.5B parameters for PAGnol-XL. At the time of this work, this is the largest non-sparse French language model available, and we plan to explore increasingly large and powerful models in the future.

  • Optimal scaling. We use scaling laws to inform our training setup, resulting in optimal use of our compute budget. PAGnol-XL is trained with a budget of only 3 PF-days, just as much as the 13 times smaller CamemBERT. From our collection of models, we adjust scaling laws for the French language.

  • Dataset suitability. We highlight the importance of proper dataset pre-processing when training generative autoregressive models. While OSCAR has been relied on for French encoder-only models, we find it is not suited to PAGnol, leading to low-quality offensive outputs.

  • End-task performance. We evaluate on discriminative (FLUE) and generative tasks (question answering on FQuAD and summarization with OrangeSum) in the fine-tuning and prompt tuning regimes. We establish a new state of the art for summarization in French on OrangeSum.

2 Related work

Language models.

The design and training of neural language models able to create and process word embeddings is the cornerstone of modern NLP. Early on, self-supervised learning was identified as an efficient and scalable way to train such models. The use of deeper and more complex neural architectures enabled going from static embeddings (word2vec Mikolov et al. (2013), GloVe Pennington et al. (2014)) to contextual embeddings, allowing models to deal with polysemy. Although approaches such as ELMo Peters et al. (2018) and ULMFiT Howard and Ruder (2018) highlighted that learned representations can be transferred across downstream tasks, the poor scalability of RNNs prevented this vision from being fully realized.

By getting rid of the costly and delicate recurrent processing, attention-based Transformers Vaswani et al. (2017) spurred a wide interest in NLP. GPT Radford et al. (2018), a decoder-only variant of Transformers, demonstrated large-scale transfer learning from general language modelling to 12 NLU tasks. Along with the rise of easy-to-use libraries, encoder-only BERT Devlin et al. (2018), relying on masked language modelling, made NLP a commodity – wherein every practitioner could rely on a pre-trained language model and fine-tune it cheaply to a task of interest. BERT models are limited to "filling in the gap" for a span of words: this forbids their use in generative tasks (e.g. summarization).

With sequence-to-sequence models and pre-training through denoising tasks, the original architecture of Transformers made a comeback with BART Lewis et al. (2019), bridging the gap between the generative capabilities of decoder-only models and the downstream task performance of encoder-only models. Through gradually larger and more powerful architectures, state-of-the-art models are approaching human-level performance on many tasks.

Successive generations of GPT models have questioned the current fine-tuning paradigm. GPT-2 Radford et al. (2019), with 1.5 billion parameters, demonstrated that large language models could tackle entirely new tasks through few-shot learning (in other areas of machine learning, this has been referred to as zero-shot learning, as no weight updates are necessary). Without any fine-tuning, from just a few prompted examples, GPT-2 achieved fair performance on a number of complex downstream tasks. Furthering this endeavour, GPT-3 Brown et al. (2020), with 175 billion parameters, achieved state-of-the-art performance on some tasks, without the need for fine-tuning. This opens new possibilities for low-resource tasks, as well as paths to more natural interactions with these models: recent research suggests the gap between few-shot learning and fine-tuning may even be bridged through so-called prompt programming/tuning Li and Liang (2021); Lester et al. (2021).

Scaling laws.

More specifically to our setting, neural language models have been shown to predictably benefit from increased scale Kaplan et al. (2020). Their training dynamics are size-invariant, allowing test loss, parameter count, and dataset size to be correlated through smooth scaling laws. This is in fact true of all GPT-like autoregressive models, even when applied to image, multimodal, or mathematics modelling Henighan et al. (2020). Gains in autoregressive cross-entropy loss also directly translate to gains in end-task performance after fine-tuning. As they relate to compute budget, these predictions can be used to inform the training of large models.

Non-English generative models.

BERT-like models are now available in a broad number of languages, either as specialized models or as multilingual ones. This is less the case for generative models, perhaps because of issues in controlling the language used at generation time. For the French language, GPTfr is an autoregressive generative model, and BARThez Eddine et al. (2020a) targets some generative abilities. Smaller-scale efforts exist, such as BelGPT Louis (2020), but they are limited to small models. GPT models have been trained for German Schweter (2020), Chinese Zeng et al. (2021), and Arabic Antoun et al. (2021), among others.

3 Efficient training with scaling laws

Scaling.

We use scaling laws to inform the duration of the training of our largest models. Rather than training to convergence, which would be wasteful, we train to optimality, as predicted by the equations provided in Kaplan et al. (2020). This is akin to what has been done for GPT-3, and it enables us to keep our computational budget in line with that of CamemBERT, a model 13x smaller than PAGnol-XL. We find that training all of our models for a single epoch on the 30GT of CCNet enables us to reach optimality for the most expensive XL model. Table 3 presents the ratios between the compute budget C effectively used and the optimal budget (C/C_opt) or the convergence budget (C/C_conv). While our small model is trained to convergence, others are trained significantly short of it. We find that our training budgets are in line with the estimated 2.6 PF-days for the training of GPTfr-LARGE from Simoulin and Crabbé (2021).
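
To make these budgets concrete, the sketch below reproduces the back-of-the-envelope arithmetic behind such figures, using the common C ≈ 6ND approximation and the PF-day convention of Kaplan et al. (2020). It is an illustrative estimate only: the parameter counts and the ~30GT token budget are taken from the tables in this paper, and this is not the authors' accounting code.

```python
# Rough training-compute estimate: C ≈ 6 * N * D FLOP for N parameters and D tokens,
# with 1 PF-day = 1e15 FLOP/s * 86,400 s = 8.64e19 FLOP (Kaplan et al., 2020 convention).

PF_DAY_FLOP = 1e15 * 86_400  # 8.64e19 FLOP

def pf_days(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in PF-days."""
    return 6 * n_params * n_tokens / PF_DAY_FLOP

models = {"S": 124e6, "M": 355e6, "L": 773e6, "XL": 1.5e9}
tokens = 30e9  # one epoch over ~30GT of CCNet

for name, n in models.items():
    print(f"PAGnol-{name}: ~{pf_days(n, tokens):.1f} PF-days")
# Prints roughly 0.3, 0.7, 1.6 and 3.1 PF-days, in line with Table 3 (0.3/0.7/2/3).
```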

4 PAGnol

In this section, we describe the data, model, and training specifications for PAGnol. In Table 1, we highlight some of the key differences and similarities with other French models, and in Table 2 we present two multilingual models that we consider in the following.

CamemBERT FlauBERT BARThez GPTfr PAGnol (ours)
Language French French French French French
Parameters 110/335M 138/373M 165M 124M/1B 124M/355M/773M/1.5B
Context 512 512 768 1024 1024/2048
Dataset OSCAR Custom³ Custom Filtered Common Crawl CCNet
Dataset size 33GT/138GB 13GT/71GB 66GB 1.6/3.11GT 32GT
Tokenization SentencePiece 32k BPE 50k SentencePiece 50k BPE 50k BPE 50k
Compute [PF-days] 3/10 –⁴ – ?/2.6 0.3/0.7/2/3
Table 1: Model, data, and training setup for PAGnol and other French models. Data size is reported in gigatokens (GT), and compute in PF-days (1 PF-day ≈ 8.64 × 10^19 FLOP). PAGnol is the largest French model. Despite being significantly larger than existing models, its compute cost remains reasonable: as recommended by scaling laws, we train models to optimality, and not to convergence. ³FlauBERT and BARThez use a similar pre-training dataset assembling CommonCrawl, NewsCrawl, Wikipedia, and other smaller corpora. ⁴Insufficient data was provided by authors to infer compute budgets properly.
mBERT mBART
Language 104 languages 25 languages
Parameters 110M 610M
Context 512 768
Dataset Wikipedia CC25
180GT/1369GB
(10GT/57GB French)
Tokenization WordPiece 110k SentencePiece 250k
Compute [PF-days] 2
Table 2: Model, data, and training setup for the multilingual models including French that we consider. Data size is reported in gigatokens (GT), and compute in PF-days (1 PF-day ≈ 8.64 × 10^19 FLOP).

4.1 Pre-training data

Sourcing.

The Common Crawl (CC) project crawls and indexes publicly available web content. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the bulk of most NLP datasets nowadays. We consider in our experiments two datasets based on Common Crawl data: CCNet Wenzek et al. (2020) and OSCAR Ortiz Suárez et al. (2020). Once tokenized, OSCAR contains 33GT and CCNet 32GT. We use CCNet for all our main experiments and released models, and compare with results obtained on OSCAR in Section 5. We validate on the fr-wiki dataset (0.5GT) and the French TreeBank (650kT) Abeillé et al. (2003).

CCNet.

CCNet combines the usual fastText Joulin et al. (2017) pre-processing of CC data with an additional filtering step to select high-quality documents. This filtering is done through a language model trained on Wikipedia, ensuring a text quality similar to that of its articles. We use a version of CCNet identical to the one considered in the CamemBERT paper.

OSCAR.

OSCAR uses a fastText classifier to select documents and identify their languages, without any additional filtering. OSCAR is thus more varied, but more "noisy", than CCNet. OSCAR has been used to train other French language models such as CamemBERT.

Tokenization.

We use byte-level Byte-Pair Encoding (BPE), with a vocabulary of 50,262 tokens: 256 bytes, 6 special tokens, and 50,000 merges. Paragraphs are separated by an <EOS> token and documents are separated by a <SEP> token. We add a prefix space before tokenization, so that the first word in a paragraph is tokenized in the same way as if it was at any other position. This is similar to the setup of FlauBERT and GPT-2. For the models trained on OSCAR, we use a slightly smaller vocabulary size of 50,000 tokens.
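
As an illustration, the sketch below trains a byte-level BPE tokenizer with a comparable configuration using the Hugging Face tokenizers library. The corpus path and most of the special-token names are placeholders (only <EOS> and <SEP> are described above), and this is not the authors' actual tokenization pipeline.

```python
# Hedged sketch: byte-level BPE training with a prefix space, a 50,262-token
# vocabulary (256 bytes + 6 special tokens + 50,000 merges). Paths and most
# special-token names are illustrative assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(add_prefix_space=True)
tokenizer.train(
    files=["ccnet_fr_shard_00.txt"],   # placeholder corpus shard
    vocab_size=50_262,
    special_tokens=["<EOS>", "<SEP>", "<PAD>", "<UNK>", "<MASK>", "<CLS>"],  # illustrative
)
tokenizer.save_model(".")              # writes vocab.json and merges.txt

ids = tokenizer.encode("Bonjour, je suis un modèle de langue français.").ids
print(len(ids), ids[:10])
```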

4.2 Model specification

PAGnol n_params n_layers d_model n_heads C [PF-days] C/C_conv C/C_opt
S 124M 12 768 12 0.3 1.3 9.0
M 355M 24 1024 16 0.7 0.5 3.0
L 773M 36 1280 20 2 0.4 2.5
XL 1.5B 48 1600 25 3 0.2 1.3
Table 3: Model and training budgets for PAGnol. All models are trained on a single epoch of our 32GT CCNet-curated data. C is the compute budget used for the training of the model. C/C_conv is the ratio between C and C_conv, the compute budget derived from scaling laws to train the model to convergence. C/C_opt is the ratio between C and C_opt, the optimal compute budget derived from scaling laws.

PAGnol is a decoder-only autoregressive Transformer, reproducing the architectural choices of GPT-3, with up to 1.5 billion parameters. We evaluate four model sizes: small, medium, large, and extra-large, with architectures detailed in Table 3. We use a context size of 1,024 tokens for the S, M and L models. The XL uses a context size of 2,048, the largest at release for a French model. Additionally, we use Rotary Embeddings Su et al. (2021) in place of Learned Positional Embeddings for the XL model, since they provide much better training stability in the billion-parameter regime.
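
For reference, the sketch below shows one common formulation of rotary position embeddings (the "rotate-half" variant), applied to the query and key tensors before the attention product. It is illustrative and not necessarily the exact implementation used for PAGnol-XL.

```python
# Hedged sketch of rotary position embeddings (Su et al., 2021), "rotate-half" style.
import torch

def apply_rotary(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """x: (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Queries and keys are rotated before the attention dot product, so that q·k
# depends on the relative distance between positions rather than absolute indices.
```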

4.3 Pre-training

Training objective.

We use an autoregressive language modelling objective, where the model learns to predict the next word in a sentence. To improve efficiency, we always fill the context with as much text at possible, and inform the model about separate documents through the <SEP> token.

Optimization.

We use the Adam optimizer Kingma and Ba (2014) with a warmup followed by a cosine decay learning rate schedule. We find that proper initialization is key to training stability, and reproduce the setup effectively implemented by Megatron-LM Shoeybi et al. (2019): we initialize all weights from a normal distribution N(0, 0.02), and scale the weights of the layers preceding residual connections by 1/sqrt(2·n_layers). We tune hyperparameters over the first 10k steps, and pick the set with the best training perplexity.
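
The sketch below illustrates this setup: a Megatron-LM style initialization (normal init with the residual-path projections rescaled by 1/sqrt(2·n_layers)) and a warmup-then-cosine learning rate schedule. The standard deviation, the step counts, and the name filter used to select residual projections are assumptions for illustration, not PAGnol's exact values.

```python
# Hedged sketch of the initialization and learning-rate schedule described above.
import math
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

def init_weights(model: nn.Module, n_layers: int, std: float = 0.02) -> None:
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue  # keep default init for biases and layer-norm parameters
        nn.init.normal_(p, mean=0.0, std=std)
        # Rescale projections that feed the residual stream (attention output and
        # second MLP matrix); the name filter is an illustrative assumption.
        if "out_proj" in name or "fc2" in name:
            with torch.no_grad():
                p.div_(math.sqrt(2 * n_layers))

def warmup_cosine(optimizer, warmup_steps: int, decay_steps: int) -> LambdaLR:
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup
        progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay (to zero here)
    return LambdaLR(optimizer, lr_lambda)
```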

Distributed training.

All training runs were performed on the public GENCI supercomputer Jean Zay, on nodes with 4x or 8x V100 32GB GPUs and a 100Gb/s interconnect. We built our own GPT implementation from scratch in PyTorch Paszke et al. (2019), leveraging FairScale for distributed training Baines et al. (2021).

Models up to PAGnol-L can be trained using simple distributed data parallelism (DDP). However, PAGnol-XL does not fit in 32GB of memory. We use optimizer and model state sharding, along with activation checkpointing and CPU offloading to fit the model in memory. This results in a setup similar to ZeRO-3 Rajbhandari et al. (2021). It is beneficial to train even small models with this setup, as it allows for a larger batch size, and significantly higher GPU throughput.
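
A heavily simplified sketch of such a sharded setup with FairScale is given below. The wrapping granularity and the .blocks attribute are assumptions, and options such as CPU offloading are left to the library's documentation rather than spelled out here.

```python
# Hedged sketch: activation checkpointing plus parameter/gradient/optimizer-state
# sharding with FairScale, in the spirit of ZeRO-3. Illustrative only.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.checkpoint import checkpoint_wrapper

def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    # Re-compute activations in the backward pass instead of storing them.
    for i, block in enumerate(model.blocks):      # assumes a .blocks list of transformer layers
        model.blocks[i] = checkpoint_wrapper(block)
    # Shard model and optimizer state across data-parallel workers.
    return FSDP(model)

# Typical usage inside the training script (after torch.distributed initialization):
# model = shard_model(build_model().cuda())
# loss = model(batch).mean(); loss.backward(); optimizer.step()
```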

PAGnol fr-wiki FTB: whole    (train/val/test)
S 124M 43.38 23.87    (23.90, 24.38, 23.34)
M 355M 37.46 20.90    (20.92, 21.59, 20.46)
L 773M 34.81 19.97    (19.88, 21.19, 20.02)
XL 1.5B 28.85 16.18    (16.11, 16.67, 16.40)
Table 4: Validation perplexity on fr-wiki and on the whole French TreeBank (FTB) for PAGnol models after 1 epoch of training on 30GT of CCNet.

Perplexity.

We report the final validation perplexity after 1 epoch over 30GT in Table 4. We use the official 2019 French Wikipedia dumps and the French TreeBank dataset Abeillé et al. (2003) in its SPMRL instance Seddah et al. (2013) as our validation sets. Because we run all models for a single epoch on our data, there are limited risks of overfitting and memorization.

Scaling law for PAGnol models.

We fit a scaling law with the same functional form as Kaplan et al. (2020), that is, the following power law:

L(C) = (C_c / C)^(α_C)

where L is the validation loss of PAGnol models trained on CCNet, C_c is a constant, C is the compute in PF-days, and α_C is the scaling exponent. The fit is performed in log-log space, and constrained to remain under the efficient frontier, using cvxpy Agrawal et al. (2018). We exclude the L and XL models from the fit: due to the HPC environment, their training was performed in multiple splits, and at restart the optimizer state is not necessarily available, generating artefacts in the training and validation curves. Additionally, the use of Rotary Embeddings for the XL model would affect the scaling, and make it incomparable with the English models. We therefore trained two smaller models, an XXS and an XS, following the same architectural decisions as the larger ones, on the same datasets, and used these to fit a scaling law. We find a scaling exponent α_C for the French language that is smaller than its English counterpart from Kaplan et al. (2020). With the relatively important caveats that we are using different datasets, codebase, and hardware, it appears that French is less compute efficient than English: for the same improvement in validation loss, we therefore need to spend more compute for French than for English. The increased morphological complexity of French compared to English Seddah et al. (2010), and its on average longer sentences, could be factors explaining this discrepancy.
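
The sketch below shows one way to perform such a constrained fit with cvxpy: a power law is fitted in log-log space, with the fitted curve constrained to stay under the measured points (the efficient frontier). The (compute, loss) pairs and the least-squares objective are illustrative assumptions, not the actual PAGnol measurements or the authors' exact script.

```python
# Hedged sketch: fit L(C) = (C_c / C)**alpha_C under the efficient frontier with cvxpy.
import numpy as np
import cvxpy as cp

compute_pf_days = np.array([0.01, 0.03, 0.1, 0.3, 0.7])   # placeholder frontier points
val_loss = np.array([3.6, 3.4, 3.2, 3.05, 2.95])           # placeholder validation losses

log_c, log_l = np.log(compute_pf_days), np.log(val_loss)

alpha = cp.Variable(nonneg=True)     # scaling exponent alpha_C
b = cp.Variable()                    # equals alpha_C * log(C_c)
pred = b - alpha * log_c             # predicted log-loss, affine in (alpha, b)

# Least-squares fit in log-log space, constrained to lie on or below every observation.
problem = cp.Problem(cp.Minimize(cp.sum_squares(log_l - pred)), [pred <= log_l])
problem.solve()

alpha_c = float(alpha.value)
c_c = float(np.exp(b.value / alpha_c))
print(f"alpha_C = {alpha_c:.3f}, C_c = {c_c:.3g} PF-days")
```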

Figure 1: Scaling law relative to the compute for PAGnol models from XXS to M. We do not include L and XL: the interrupted nature of the training due to the HPC environment and the choice of Rotary Embeddings for the XL pollute validation curves with artefacts that negatively affect the quality of the fit.

5 Influence of the pre-training data

Existing French models (CamemBERT, FlauBERT, BARThez) have been trained on datasets based on a simple filtering pipeline. A fastText classifier is used to isolate content in French, deduplication is applied, and noisy content (phone numbers, code, etc.) is removed. While this has been sufficient for pre-training encoder-only models and sequence-to-sequence models, the lack of quality control may be an issue for free-form generation with a decoder-only model such as PAGnol. Moreover, recent work on the OSCAR dataset (used by CamemBERT) has found that it may contain up to 5% of non-linguistic content, or content in the wrong language, and 0.5% of explicit content for the French language Caswell et al. (2021).

We initially pre-trained PAGnol on the OSCAR dataset, and while experimenting with the model, we observed that it generated offensive and explicit content, even when not prompted for it. For instance, the prompt Bonjour je suis (Hello I am), despite being entirely innocuous, often resulted in pornographic content. This motivated our choice to switch to CCNet instead. For research purposes, we release the small and medium models trained on OSCAR. In future iterations of this document, we will provide a more detailed investigation of the content generated by PAGnol-CCNet and by PAGnol-OSCAR.

6 End-task performance

6.1 Discriminative tasks: FLUE

We evaluate our models on the Sentence Classification task of the FLUE evaluation setup Le et al. (2019). The task is a binary classification problem on reviews of Books, Music and DVDs taken from the Amazon website. Each review is assigned a score from 1 to 5, and is labeled as "negative" if the score is lower than 3 and "positive" otherwise. We also evaluate our models on the paraphrasing and natural language inference tasks (PAWS-X and XNLI). PAWS-X is a binary classification task where the model has to identify whether two sentences are semantically equivalent or not. XNLI is instead a 3-class problem where the model has to determine whether a premise entails, contradicts, or is neutral with respect to a given hypothesis. For training, we add a CLS token at the end of the review (but before the EOS token). We then replace the projector at the end of the model with a linear layer, and use the embedding of the CLS token to perform the classification.

Table 5 reports the test accuracy of the best hyperparameter configuration, along with a comparison with other French language models. All models are fine-tuned for 6 epochs, except the medium OSCAR and the extra-large CC-100, which were trained for 4 and 5 epochs respectively. For each model, we tune the learning rate and the weight decay over a grid of values. For the classification task, we use a cosine annealing scheduler that decays the learning rate to a small fraction of its original value over 5 epochs (3 for the medium OSCAR and 4 for the extra-large CC-100). We additionally checked whether adding dropout could improve the performance. For the PAWS-X and XNLI tasks, we tune the learning rate and use the same cosine annealing scheduler; PAWS-X training runs over 2 epochs, while XNLI training runs over 1. PAGnol models slightly underperform smaller BERT models, while being better than multilingual alternatives and their GPTfr counterparts. For PAGnol, performance improves with size but seems to saturate with the XL model, possibly because we had to use a lower batch size to fit the fine-tuning on the hardware. Additionally, while the generation quality of models trained on OSCAR is noticeably worse, they perform as well as or better than the corresponding models trained on CCNet on these discriminative tasks.
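
A minimal sketch of this classification head is given below, assuming a backbone that returns hidden states of shape (batch, sequence, d_model); the interface and names are hypothetical, not the authors' fine-tuning code.

```python
# Hedged sketch: a linear head on the CLS-token hidden state of a causal decoder.
import torch
from torch import nn

class PAGnolClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                 # pre-trained decoder without the LM projection
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, input_ids: torch.Tensor, cls_positions: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq_len, d_model)
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        cls_hidden = hidden[batch_idx, cls_positions]   # embedding of the CLS token
        return self.head(cls_hidden)             # (batch, n_classes) logits

# Training then minimises nn.CrossEntropyLoss() between these logits and the labels.
```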

Model Parameters Books Music DVD PAWS-X XNLI
MultiFiT Not Specified 91.25 89.55 93.40 - -
mBERT 110/340 M* 86.15 86.90 86.65 89.30 76.9
mBART 610 M 93.40 93.13 93.10 89.70 81.07
BARThez 216 M 94.47 94.97 93.17 88.90 80.73
CamemBERT-BASE 110 M 92.30 94.85 93.00 90.14 81.20
CamemBERT-LARGE 335 M 95.47 96.00 95.37 91.83 85.33
Flaubert-BASE 138 M 93.10 92.45 94.10 89.49 80.60
Flaubert-LARGE 373 M 95.00 94.10 95.85 89.34 83.40
GPTfr-BASE 124 M 88.30 86.90 89.30 83.30 75.60
GPTfr-LARGE 1 B 91.60 91.40 92.60 86.30 77.90
PAGnol-S OSCAR 124 M 92.05 92.60 91.70 84.19 76.10
PAGnol-M OSCAR 355 M 94.40 94.90 94.30 87.44 79.46
PAGnol-S CC-100 124 M 92.00 93.00 91.65 87.19 75.67
PAGnol-M CC-100 355 M 94.40 95.20 93.70 89.14 79.00
PAGnol-L CC-100 773 M 94.65 95.25 94.00 90.70 81.48
PAGnol-XL CC-100 1.5 B 94.65 95.35 94.18 89.47 81.83
Table 5: Results on the FLUE Benchmark including classification (Books, Music, DVD), paraphrasing (PAWS-X) and natural language inference (XNLI) tasks. The best overall results are highlighted in bold, and the best results for GPT models are underlined.

6.2 Generative task: FQuAD

FQuAD d’Hoffschmidt et al. (2020) is a native French question answering dataset, comprising more than 25,000 questions written by higher-education students from a set of Wikipedia articles, following the same philosophy as the English dataset SQuAD Rajpurkar et al. (2018). Given a document, a question and the corresponding answer, the question answering task is cast into the following format:

"<document> Question: <question> Réponse: <answer>"

where Réponse corresponds to Answer in French.

Given this input format, in a setup similar to pre-training, the likelihood of the sequence corresponding to the answer is maximized using the cross-entropy loss on the tokens corresponding to the answer. We use the Adam optimizer and tune the learning rate and weight decay. The different models were trained for 2 epochs. As noted by Radford et al. (2019), the performance of autoregressive models is still worse than that of question answering systems based on masked language models. Indeed, we evaluated the fine-tuning of OpenAI GPT-2 small and medium on SQuAD, and obtained EM and F1 scores in the same range as PAGnol on FQuAD (Table 7).
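
The sketch below illustrates this casting: the document, question and "Réponse:" marker form the prompt, and the loss is masked so that only the answer tokens contribute. The tokenizer interface is an assumption; -100 is the usual ignore index for PyTorch's cross-entropy.

```python
# Hedged sketch: build (input_ids, labels) pairs for generative question answering.
from typing import List, Tuple

def build_qa_example(tokenizer, document: str, question: str, answer: str) -> Tuple[List[int], List[int]]:
    prompt_ids = tokenizer.encode(f"{document} Question: {question} Réponse: ").ids
    answer_ids = tokenizer.encode(answer).ids
    input_ids = prompt_ids + answer_ids
    # Ignore the prompt positions in the loss; keep only the answer tokens.
    labels = [-100] * len(prompt_ids) + answer_ids
    return input_ids, labels

# With next-token prediction, labels are shifted by one inside the model, so the
# cross-entropy is computed exactly on the tokens of the answer.
```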

Model EM F1
CamemBERT-LARGE 82.1 92.2
CamemBERT-BASE 78.4 88.4
PAGnol-S OSCAR 31.7 52.8
PAGnol-M OSCAR 37.1 59.4
PAGnol-S CC-100 33.7 56.0
PAGnol-M CC-100 36.8 59.0
PAGnol-L CC-100 42.8 66.3
PAGnol-XL CC-100 44.4 68.5
Table 6: Question answering on FQuAD.
Model Size EM F1
GPT small 45.5 62.7
GPT medium 50.8 68.1
Table 7: GPT-2 small and medium model performance on SQuAD.

6.3 Generative task: OrangeSum

OrangeSum Eddine et al. (2020b) is a summarization dataset, considered to be the French equivalent of XSum Narayan et al. (2018). It is an abstractive dataset containing summaries of news articles from the "Orange Actu" website. Each article comes with a professionally-written title and abstract. Hence, the dataset includes two tasks: OrangeSum Title and OrangeSum Abstract. We evaluate PAGnol on the latter.

Similarly to our setup for question answering, given a news article and an abstract, we cast the summarization task into this format:

"<article> Summary: <abstract>"

We fine-tune our model on the cross-entropy loss computed only on the tokens of the summary. We optimize the learning rate and weight decay over the same ranges as for FLUE, using the same scheduler. We add dropout to improve the performance. We evaluate the fine-tuned model using greedy token generation and the ROUGE metric. This task, more geared towards generation, sees PAGnol-XL establish a new state of the art for summarization on OrangeSum.
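
For illustration, the sketch below pairs greedy decoding of a summary with ROUGE scoring via the rouge_score package. The model and tokenizer interfaces, the EOS id and the maximum generation length are assumptions, not the authors' evaluation script.

```python
# Hedged sketch: greedy summary generation followed by ROUGE-1/2/L scoring.
import torch
from rouge_score import rouge_scorer

@torch.no_grad()
def greedy_summary(model, tokenizer, article: str, max_new_tokens: int = 128, eos_id: int = 0) -> str:
    ids = tokenizer.encode(f"{article} Summary: ").ids
    prompt_len = len(ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))        # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())      # greedy: most likely next token
        if next_id == eos_id:
            break
        ids.append(next_id)
    return tokenizer.decode(ids[prompt_len:])      # only the generated summary

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
# scores = scorer.score(reference_abstract, generated_summary)  # R-1, R-2, R-L as in Table 8
```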

Model Parameters R-1 R-2 R-L
BARThez 216 M 31.44 12.77 22.23
PAGnol-S OSCAR 124 M 22.79 6.16 16.03
PAGnol-M OSCAR 355 M 24.89 7.87 17.78
PAGnol-S CC-100 124 M 26.47 9.49 17.12
PAGnol-M CC-100 355 M 28.20 10.80 20.79
PAGnol-L CC-100 773 M 28.12 11.05 20.81
PAGnol-XL CC-100 1.5 B 31.17 12.86 22.50
Table 8: Text summarization on the OrangeSum Abstract task. Best results are highlighted in bold, and second best are underlined.

7 Prompt Tuning

Human prompt engineering to extract good zero- and few-shot performance from large language models has motivated research in prompt tuning: placing some random vectors in the input sequence and optimizing their values, while keeping the pre-trained model weights fixed. The advantage of this approach is that the model does not change, and only the prompt is optimized. We follow the approach of Lester et al. (2021), and optimize a certain number of tokens in our soft prompt for the three aforementioned tasks. The best hyperparameters per size and per task have been selected through a grid search over the number of prompt tokens, the learning rate, and the dropout value. We show the results for FLUE, FQuAD, and OrangeSum in Tables 9, 10 and 11. We expected a smooth scaling in performance with size, progressively closing the gap with fine-tuning performance as shown by Lester et al. (2021); however, this scaling slows significantly when we reach the XL model. We suspect a bug in our implementation of prompt tuning with Rotary Embeddings, causing the performance hit; we therefore temporarily show the results for the XL model in italics in this setting. This is a work in progress and will be updated in a new release.
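
A minimal sketch of this soft-prompt setup is shown below: a small matrix of learnable embeddings is prepended to the input embeddings, and only this matrix is optimized while the pre-trained weights stay frozen. The backbone interface and the hyperparameter values are assumptions.

```python
# Hedged sketch of prompt tuning in the style of Lester et al. (2021).
import torch
from torch import nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # prepend the soft prompt

# Freeze the language model and optimise only the prompt parameters:
# for p in model.parameters(): p.requires_grad_(False)
# soft_prompt = SoftPrompt(n_tokens=20, d_model=1024)     # n_tokens is a tuned hyperparameter
# optimizer = torch.optim.Adam(soft_prompt.parameters(), lr=1e-3)
```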

PAGnol Books Music DVD
S 88.50 87.95 88.24
M 91.60 92.65 90.69
L 92.60 93.1 91.69
XL 92.50 93.25 92.14
Table 9: Prompt Tuning performance for sentence classification (CLS).
PAGnol EM F1
S 0.243 0.427
M 0.320 0.561
L 0.365 0.526
XL 0.403 0.450
Table 10: Prompt Tuning performance for question answering on FQuAD.
PAGnol R-1 R-2 R-L
S 24.54 8.98 18.45
M 27.80 10.56 20.29
L 28.25 11.05 21.03
XL 28.72 11.08 20.89
Table 11: Prompt tuning performance for text summarization on OrangeSum.

8 Discussion

At the risk of leaning on an overused cliché, the release of large neural language models has not only revolutionized the NLP field by bringing a major leap in performance on almost every task they were applied to; it has also crucially changed the perception of the risks of their potential misuse. This dramatic boost in performance has led the field to rely on the capacity of these large models to transfer their, in layman’s terms, “knowledge” to other tasks via various transfer learning modalities. Yet, with this transfer, the potential data biases inherent to the large corpora collected for pre-training are also liable to surface. Gehman et al. (2020) thoroughly demonstrated that all the generative language models they tested (from GPT-1 Radford et al. (2018), trained on Book Corpus only, to GPT-3 Brown et al. (2020) and CTRL Keskar et al. (2019), trained on various corpora including user-generated content and web-crawled datasets) were capable of producing toxic output in specific conditions, and presented different ways of alleviating this behaviour. Having been pre-trained on Common Crawl-based corpora, our models are certainly not immune to toxic content generation.

Gehman et al. (2020) explored two main ways of filtering toxic outputs: one based on pursuing the pre-training on less toxic or toxicity-annotated datasets; the other focused on altering the decoding strategy. Krause et al. (2020) proposed GeDi, which guides the output with smaller language models. Given the interest in generative models, this is an active line of research that we are currently pursuing.

More generally, the question of whether the pre-training data should be curated further, or whether the model output should instead be debiased or filtered directly depending on the downstream application in sight, is still the object of vivid debate in the community Bender et al. (2021); Goldberg (2021), although there is of course broad agreement on the responsible use of such technology.

In this respect, the release of a GPT-generated text detector by Antoun et al. (2021) alongside their Arabic GPT-2 model is an interesting step in this direction.

Regarding the environmental aspects of this work, our pre-training experiments consumed about 62k GPU-hours on the Jean Zay HPC cluster. Jean Zay being based in France, its energy mix consists of nuclear (65-75%), around 20% renewables, and the remainder of gas (or, more rarely, coal when electricity is imported from abroad) (S. Requena, Director of Jean Zay, personal communication).

Regarding the performance of our models, which almost systematically outperform their closest French counterparts, the GPTfr models Simoulin and Crabbé (2021), one explaining factor could be the size of our pre-training dataset (30B tokens vs 3B). Given our computing limitations, we chose from the beginning to use an experimental protocol as comparable as possible to the one used for the CamemBERT evaluation Martin et al. (2019); it would of course be interesting to perform a head-to-head comparison with the GPTfr models. In terms of raw performance, it has been consistently reported that GPT-based models at the billion-parameter scale provide inferior performance compared to their “regular” transformer counterparts in classic fine-tuning scenarios; our results confirm this for French as well, while highlighting the interest of our models in generation-oriented scenarios (such as text summarization, where PAGnol-XL establishes a new state of the art for French). As for English Lester et al. (2021), our encouraging preliminary prompt-tuning results suggest that this approach is promising and could be a way to close this performance gap.

9 Conclusion

We presented the PAGnol model collection, the first released large-scale generative model for French (released on May 4th, 2021: https://twitter.com/LightOnIO/status/1389579858754293761?s=20), and to date the largest neural language model for French. Trained on the CCNet corpus, we used scaling laws to inform our training setup, resulting in an optimal use of our training budget. The evaluation of our models on various end-tasks demonstrated, first, that the CCNet corpus was a better choice than the French instance of OSCAR when used for generation; second, that our models provide the same range of performance as their English counterparts, and establish a new state of the art for summarization in French on OrangeSum. PAGnol-XL and our smaller models are available at https://lair.lighton.ai/pagnol/.

Acknowledgments

This work was granted access to the HPC resources of IDRIS under the allocation 2020-AD011012024 made by GENCI, enabling us to use the Jean Zay supercomputer. We thank Stéphane Réquena and the support team for their valuable help. We also acknowledge Louis Martin for helping us reproduce the CamemBERT settings for CCNet. Djamé Seddah was partly funded by the French Research National Agency via the ANR project ParSiTi (ANR-16-CE33-0021).

References

  • A. Abeillé, L. Clément, and F. Toussenel (2003) Building a treebank for french. In Treebanks, pp. 165–187. Cited by: §4.1, §4.3.
  • A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd (2018) A rewriting system for convex optimization problems. Journal of Control and Decision 5 (1), pp. 42–60. Cited by: §4.3.
  • W. Antoun, F. Baly, and H. Hajj (2021) AraGPT2: pre-trained transformer for Arabic language generation. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual), pp. 196–207. External Links: Link Cited by: §2.
  • M. Baines, S. Bhosale, V. Caggiano, N. Goyal, S. Goyal, M. Ott, B. Lefaudeux, V. Liptchinsky, M. Rabbat, S. Sheiffer, A. Sridhar, and M. Xu (2021) FairScale: a general purpose modular pytorch library for high performance and large scale training. Note: https://github.com/facebookresearch/fairscale Cited by: §4.3.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big?. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Cited by: §8.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §2, §8.
  • I. Caswell, J. Kreutzer, L. Wang, A. Wahab, D. van Esch, N. Ulzii-Orshikh, A. Tapo, N. Subramani, A. Sokolov, C. Sikasote, et al. (2021) Quality at a glance: an audit of web-crawled multilingual datasets. arXiv preprint arXiv:2103.12028. Cited by: §5.
  • M. d’Hoffschmidt, W. Belblidia, T. Brendlé, Q. Heinrich, and M. Vidal (2020) FQuAD: french question answering dataset. arXiv preprint arXiv:2002.06071. Cited by: §6.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.
  • M. K. Eddine, A. J. Tixier, and M. Vazirgiannis (2020a) BARThez: a skilled pretrained french sequence-to-sequence model. arXiv preprint arXiv:2010.12321. Cited by: §1, §2.
  • M. K. Eddine, A. J. Tixier, and M. Vazirgiannis (2020b) BARThez: a skilled pretrained french sequence-to-sequence model. arXiv preprint arXiv:2010.12321. Cited by: §6.3.
  • Y. Goldberg (2021) A criticism of "on the dangers of stochastic parrots: can language models be too big". Note: https://gist.github.com/yoavg/9fc9be2f98b47c189a513573d902fb27 Cited by: §8.
  • T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. (2020) Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701. Cited by: §1, §2.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §4.1.
  • J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §1, §2, §3, §4.3.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. ArXiv abs/1909.05858. Cited by: §8.
  • B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, S. Kim, S. Kim, D. Seo, et al. (2021) What changes can large-scale language models bring? intensive study on hyperclova: billions-scale korean generative pretrained transformers. arXiv preprint arXiv:2109.04650. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab (2019) Flaubert: unsupervised language model pre-training for french. arXiv preprint arXiv:1912.05372. Cited by: §1, §6.1.
  • B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. External Links: 2104.08691 Cited by: §1, §2, §7, §8.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §2.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190. Cited by: §1, §2.
  • A. Louis (2020) BelGPT-2: a GPT-2 model pre-trained on French corpora.. Note: https://github.com/antoiloui/belgpt2 Cited by: §2.
  • L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot (2019) Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894. Cited by: §1, §8.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
  • S. Narayan, S. B. Cohen, and M. Lapata (2018) Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Cited by: §6.3.
  • P. J. Ortiz Suárez, L. Romary, and B. Sagot (2020) A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1703–1714. External Links: Link Cited by: §4.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037. Cited by: §4.3.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1, §2, §8.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: §1.
  • S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He (2021) ZeRO-infinity: breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857. Cited by: §4.3.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. arXiv preprint arXiv:1806.03822. Cited by: §6.2.
  • S. Schweter (2020) German gpt-2 model. Zenodo. External Links: Document, Link Cited by: §2.
  • D. Seddah, G. Chrupała, Ö. Çetinoğlu, J. van Genabith, and M. Candito (2010) Lemmatization and lexicalized statistical parsing of morphologically-rich languages: the case of French. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, Los Angeles, CA, USA, pp. 85–93. External Links: Link Cited by: §4.3.
  • D. Seddah, R. Tsarfaty, S. Kübler, M. Candito, J. D. Choi, R. Farkas, J. Foster, I. Goenaga, K. Gojenola Galletebeitia, Y. Goldberg, S. Green, N. Habash, M. Kuhlmann, W. Maier, J. Nivre, A. Przepiórkowski, R. Roth, W. Seeker, Y. Versley, V. Vincze, M. Woliński, A. Wróblewska, and E. Villemonte de la Clergerie (2013) Overview of the SPMRL 2013 shared task: a cross-framework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, Seattle, Washington, USA, pp. 146–182. External Links: Link Cited by: §4.3.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §4.3.
  • A. Simoulin and B. Crabbé (2021) Un modèle transformer génératif pré-entrainé pour le français. In Traitement Automatique des Langues Naturelles, pp. 245–254. Cited by: §1, §3, §8.
  • J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu (2021) Roformer: enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. Cited by: §4.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30, pp. 5998–6008. Cited by: §1, §2.
  • J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2021) Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652. Cited by: §1.
  • G. Wenzek, M. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and É. Grave (2020) CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 4003–4012. Cited by: §4.1.
  • W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, et al. (2021) PanGu-α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. Cited by: §1, §2.

Appendix A Pretraining details

In this section we provide more details on the hyperparameters used to train PAGnol models.

Models trained on OSCAR

The Small and Medium models trained on OSCAR use the Adam optimizer, with warmup steps followed by a cosine decay learning rate schedule. Additionally, we use gradient clipping and dropout, and no weight decay. The Small was trained on a single node with 8 NVIDIA V100s, and the Medium on two such nodes. In both cases we use the distributed data parallel (DDP) paradigm, with mixed precision. We show the differences in hyperparameters between the Small and Medium OSCAR runs in Table 12.

PAGnol OSCAR S M
Shared Emb. False True
LRmax-min 6e-4 – 6e-5 3e-4 – 3e-5
Batch Size 12 5
Table 12: Hyperparameters differences between the Small and Medium training runs on OSCAR.
PAGnol GPU-hours
S 480
M 1,920
L 6,720
XL 5,200
Total 14,320
Table 13: The Large model required more GPU-hours than expected, however several optimizations allowed us to perform the training of the XL model with a smaller budget.

Models trained on CC-100

We also use the Adam optimizer with a warmup and cosine learning rate decay schedule for the S, M, L, and XL models trained on CC-100. We use dropout and no weight decay for all models. Small, Medium, and Large are trained with DDP, use tied embeddings and gradient clipping, and are trained on respectively 1, 2 and 4 nodes with 8 NVIDIA V100s each. The XL model uses a smaller value for gradient clipping, a larger context size of 2,048 tokens, and different Adam hyperparameters. This model is trained on 10 nodes with 4 NVIDIA V100s each. Further differences between the hyperparameters of the training runs are shown in Table 14.

PAGnol CC-100 S M L XL
LRmax-min 5e-4–5e-5 3e-4–3e-5 2.5e-4–2.5e-5 2.5e-4–2.5e-5
Warmup Steps 5000 5000 5000 3600
Decay Steps 400,000 500,000 100,000 400,000
Batch Size 10 5 1 4
Table 14: Hyperparameters differences between Small, Medium, Large and Extra-Large training runs on CC-100.

Budget in GPU-Hours

We report in Table 13 the budget in GPU-hours needed to train each model, and the total budget.

Appendix B Samples from PAGnol XL

We provide in Tables 15, 16, 17, and 18 some examples of completions given by PAGnol XL when prompted with different text.

Prompt Bonjour Madame, est-ce que
Generation le château de la Noue se trouve dans la région ? Est-ce que les gens s’y déplacent en calèche ? Je me rends dans ce coin du monde quelques jours au mois de mai, c’est mon anniversaire. Est-ce que vous pouvez me conseiller quelque chose ? Merci beaucoup. Françoise Vernin.
Table 15: Completion for a random French sentence.
Prompt

Produis un horoscope étant donné un signe astrologique. ———- Bélier: Des nuages sur votre Ciel ! En couple, des moments de désaccord sont possibles. Vénus pourrait en effet compliquer vos relations avec votre conjoint… ———- Taureau: Vénus donnera la priorité à vos amours ! En couple, les événements en cours vous seront favorables à court ou long terme, celui qui vous aime… ———- Lion: Vos amours devraient vous combler. En couple, Jupiter vous offrira le cocooning harmonieux dont vous rêvez. Célibataire, inutile de papillonner:… ———- Vierge: Ambiance astrale morose. En couple, vous serez probablement tenté de chercher un dérivatif à l’ennui qui naît de l’habitude et du train-train… ———- Capricorne:

Generation Il est temps de changer de cap! En couple, Vénus s’opposera à Mars et Mercure. Ces deux planètes vous réjouiront par leur dynamisme. Mais, en tant que conjointe, vous serez plus fatiguée que jamais.
Table 16: Completion for a horoscope found in the magazine ELLE.
Prompt Quelque 180 000 nouveaux parents devraient en bénéficier d’ici au mois d’août 2022 : le gouvernement prévoit de distribuer des « bébés box » par l’intermédiaire des maternités à partir de février 2022. De quoi s’agit-il ?
Generation De boîtes contenant un kit composé de plusieurs couches, quelques vêtements et accessoires adaptés à l’âge et au sexe de l’enfant. Certains parents ont pu déjà recevoir ces kits grâce à une souscription mensuelle proposée par les enseignes de la grande distribution. L’offre a immédiatement rencontré un vif succès, des petits comme des grands, qui jouent à la fois un rôle de « maman » et de « papa » mais surtout avec ce budget…
Table 17: Completion for the beginning of a news article on the "« Bébé box » pour jeunes parents" from Le Monde website on the 14th of October 2021.
Prompt La septième réalisation du monumental Bong Joon Ho est fréquemment présentée comme une critique acerbe des inégalités qui minent la société coréenne. La lutte, d’abord larvée et de plus en plus patente, qui oppose deux familles aux extrêmes de l’échelle sociale semble étayer cette vision.
Generation Il ne s’agira donc pas de confronter les puissants à leurs limites mais plutôt de s’interroger sur le mensonge et la tromperie, les jeux et l’alcool, les privilèges et la corruption. Pour en arriver à cette conclusion, on n’a cependant pas la prétention de se demander comment une famille moyenne aisée pourrait être en mesure de maintenir les apparences dans un pays où il n’y a pas d’eau, ni de gaz, ni d’électricité. En tant que spectateur, nous suivons ce cadre tendu, ce huis clos. Puis, très vite, les événements s’ accélèrent, la mécanique bien huilée du thriller se met en marche pour un final imprévisible.
Table 18: Completion for the beginning of a critique on Parasite from the website Sens Critique. Accessed on the 14th of October 2021.