FlauBERT: Unsupervised Language Model Pre-training for French

12/11/2019 · by Hang Le, et al. · Université Grenoble Alpes, ESPCI

Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled text now available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized word representations such as OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2019), or XLNet (Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to complex NLP tasks (natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks are shared with the research community for further reproducible experiments in French NLP.


1 Introduction

A recent game-changing contribution in Natural Language Processing (NLP) was the introduction of deep unsupervised language representations pre-trained using only plain text corpora: GPT [radford2018improving], BERT [devlin2019bert], XLNet [yang2019xlnet], XLM [lample2019cross], RoBERTa [liu2019roberta]. Previous word embedding pre-training approaches, such as word2vec [Mikolov:2013:DRW:2999792.2999959] or GloVe [Pennington14glove:global], learn a single vector for each wordform. By contrast, these new models are called contextual: each token instance has a vector representation that depends on its right and left context. Initially based on RNN architectures [peters2018deep, ELMo], these models quickly converged towards the use of Transformer architectures [vaswani2017attention]. One very popular work [devlin2019bert, BERT] also open-sourced the trained models so that anyone can build an NLP system on top of the pre-trained model, saving time, energy and resources. As a consequence, unsupervised language model pre-training became a de facto standard to achieve state-of-the-art results in many NLP tasks. However, this was mostly demonstrated for English, even if a multilingual version of BERT called mBERT [devlin2019bert], covering more than a hundred languages in a single model, was also released by Google.

In this paper, we describe our methodology to build and open-source FlauBERT, a French BERT model that significantly outperforms mBERT in several downstream NLP tasks. (Three weeks before this submission, we learned of a similar project that resulted in a publication on arXiv [martin2019camembert]. However, we believe that these two works on French LMs are complementary since the NLP tasks we addressed are different, as are the training corpora and preprocessing pipelines. We also point out that our models were trained using the CNRS (French National Centre for Scientific Research) public research computational infrastructure and did not receive any assistance from a private stakeholder.) FlauBERT relies on freely available datasets and is made publicly available in different versions (https://github.com/getalp/Flaubert). Along with FlauBERT, we provide a unified evaluation setup for French NLP downstream tasks, named FLUE (French Language Understanding Evaluation).

2 Related Work

2.1 Pre-trained Language Models

Self-supervised pre-training on unlabeled text data (self-supervised learning is a special case of unsupervised learning where the unlabeled data itself provides the supervision signal) was first proposed in the task of neural language modeling [bengio2003neural, collobert2008unified], where it was shown that a neural network trained to predict the next word from prior words learns useful embedding representations, called word embeddings (each word is represented by a fixed vector). These representations were shown to play an important role in NLP, yielding state-of-the-art performance on multiple tasks [collobert2011natural], especially after the introduction of word2vec [Mikolov:2013:DRW:2999792.2999959] and GloVe [Pennington14glove:global], two efficient and effective algorithms for learning word embeddings.

A major limitation of word embeddings is that a word has only a single representation, even if it has multiple meanings depending on the context. Therefore, recent works have introduced a paradigm shift from context-free word embeddings to contextual embeddings: the representation of a word is a function of the entire input sequence, which allows encoding complex, high-level syntactic and semantic characteristics of words.

The concurrent release of two papers opened this line of research (it should be noted that learning contextual embeddings had previously been proposed by [mccann2017learned], but in a supervised fashion, using annotated machine translation data): ULMFiT [howard2018universal] relies on word embeddings derived from a "simple" LSTM [hochreiter1997long] language model, while in ELMo [peters2018deep] a bidirectional LSTM encodes the sentence and the final embedding of each token is built from the concatenation of the left-to-right and right-to-left representations. Another fundamental difference lies in how each model can be tuned to downstream tasks: while ELMo delivers different word vectors that can be interpolated, ULMFiT enables robust fine-tuning of the whole network w.r.t. the downstream task. Fine-tuning was shown to significantly boost performance, and this approach has been further developed in recent works such as MultiFiT [eisenschlos2019multifit] or, most prominently, Transformer-based [vaswani2017attention] architectures: GPT [radford2018improving], BERT [devlin2019bert], XLNet [yang2019xlnet], XLM [lample2019cross], RoBERTa [liu2019roberta], ALBERT [lan2019albert], T5 [raffel2019exploring]. These methods have successively established new state-of-the-art results on various NLP benchmarks (such as GLUE [wang-etal-2018-glue] or SQuAD [rajpurkar2018know]), surpassing previous methods by a large margin.

2.2 Pre-trained Language Models Beyond English

Given the impact of pre-trained LMs on downstream NLP tasks in English, several works have recently released pre-trained models for other languages. For instance, ELMo exists for Portuguese, Japanese, German and Basque (https://allennlp.org/elmo), while BERT was specifically trained for simplified and traditional Chinese (https://github.com/google-research/bert) and German (https://deepset.ai/german-bert). A Portuguese version of MultiFiT is also available (https://github.com/piegu/language-models). For French, such models are very recent, with the releases of CamemBERT [martin2019camembert] and of a pre-trained LM for French using the MultiFiT configuration (https://github.com/piegu/language-models).

Another trend considers a single model estimated for several languages with a shared vocabulary. The release of multilingual BERT (mBERT) for 104 languages (https://github.com/google-research/bert) pioneered this approach. More recent work leverages parallel data to build cross-lingual pre-trained models: LASER [artetxe2019massively] for 93 languages, and XLM [lample2019cross] and XLM-R [conneau2019unsupervised] for 100 languages.

2.3 Evaluation Protocol for French NLP Tasks

The existence of a multi-task evaluation benchmark such as GLUE [wang-etal-2018-glue] for English is highly beneficial to facilitate research in the language of interest. The GLUE benchmark has become a prominent framework for evaluating NLP models in English, and recent contributions based on pre-trained language models have achieved remarkable performance across a wide range of NLU tasks. The authors of GLUE have therefore introduced SuperGLUE [wang2019superglue], a new benchmark built on the principles of GLUE that includes a more challenging and diverse set of tasks. A Chinese version of GLUE (https://github.com/chineseGLUE/chineseGLUE) has also been developed to evaluate model performance on Chinese NLP tasks. To date, we are not aware of any such benchmark for French.

3 Building FlauBERT

In this section, we describe the training corpus, the text preprocessing pipeline, the model architecture and training configurations to build FlauBERT.

3.1 Training Data

Data collection

Our French text corpus consists of 24 sub-corpora gathered from different sources, covering diverse topics and writing styles, ranging from formal and well-written text (e.g. Wikipedia and books from Project Gutenberg, http://www.gutenberg.org) to random text crawled from the Internet (e.g. Common Crawl, http://data.statmt.org/ngrams/deduped2017). The data were collected from three main sources: (1) monolingual French data provided in the WMT19 shared tasks [li2019findings] (4 sub-corpora); (2) French text corpora from the OPUS collection [TIEDEMANN12.463] (8 sub-corpora); and (3) datasets from the Wikimedia projects [wiki:xxx] (8 sub-corpora).

We used the WikiExtractor tool (https://github.com/attardi/wikiextractor) to extract the text from Wikipedia. For the other sub-corpora, we either used our own tools to extract the text or downloaded it directly from the corresponding websites. The total size of the uncompressed text before preprocessing is 270 GB. More details can be found in Appendix A.1.

Data preprocessing

For all sub-corpora, we filtered out very short sentences as well as repetitive and non-meaningful content such as fax/telephone numbers, email addresses, etc. For Common Crawl, which is our largest sub-corpus with 215 GB of raw text, we applied aggressive cleaning to reduce its size to 43.4 GB. All the data were Unicode-normalized in a consistent way before being tokenized with the Moses tokenizer [koehn2007moses]. The resulting training corpus is 71 GB in size.

Our code for downloading and preprocessing data will be made publicly available.
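The released scripts implement this pipeline in full; the following is only a minimal sketch of the per-sentence filtering and tokenization steps described above. It assumes the sacremoses package as a stand-in for the Moses tokenizer, and the length threshold and noise patterns are illustrative, not the exact rules used for FlauBERT.

```python
import re
import unicodedata

from sacremoses import MosesTokenizer  # assumed stand-in for the Moses tokenizer scripts

_tokenizer = MosesTokenizer(lang="fr")
# Illustrative patterns for "non-meaningful" content (not the paper's exact rules).
_noise = re.compile(r"(\b\d{8,}\b|[\w.+-]+@[\w-]+\.[\w.-]+)")

def preprocess_sentence(line: str, min_tokens: int = 4):
    """Unicode-normalize, filter, and tokenize one line; return None if rejected."""
    line = unicodedata.normalize("NFC", line.strip())
    if not line:
        return None
    if _noise.search(line):          # drop fax/telephone numbers, e-mail addresses, ...
        return None
    tokens = _tokenizer.tokenize(line, escape=False)
    if len(tokens) < min_tokens:     # drop very short sentences
        return None
    return " ".join(tokens)
```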

3.2 Models and Training Configurations

Characteristic              BERT                        RoBERTa                     CamemBERT                      FlauBERT
Language                    English                     English                     French                         French
Training data               13 GB                       160 GB                      138 GB*                        71 GB**
Pre-training objectives     NSP and MLM                 MLM                         MLM                            MLM
Total parameters            110 M                       125 M                       110 M                          137 M
Tokenizer                   WordPiece 30K               BPE 50K                     SentencePiece 32K              BPE 50K
Masking strategy            Static + Sub-word masking   Dynamic + Sub-word masking  Dynamic + Whole-word masking   Dynamic + Sub-word masking
*: 282 GB before filtering/cleaning. **: 270 GB before filtering/cleaning.
Table 1: Comparison between FlauBERT and previous work.

Model architecture

FlauBERT has the same model architecture as BERT [devlin2019bert]: a multi-layer bidirectional Transformer [vaswani2017attention]. Following devlin2019bert, we propose two model sizes:

  • FlauBERT-BASE: L = 12, H = 768, A = 12;

  • FlauBERT-LARGE (still in training at the time of submission): L = 24, H = 1024, A = 16;

where L, H and A respectively denote the number of Transformer blocks, the hidden size, and the number of self-attention heads. As the Transformer architecture has become quite standard, we refer to vaswani2017attention for further details.
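For concreteness, the sketch below instantiates Transformer encoder stacks of the two sizes with plain PyTorch. The 4H feed-forward dimension, the GELU activation, and the use of nn.TransformerEncoder are common conventions assumed here, not details quoted from the paper.

```python
import torch.nn as nn

def make_encoder(n_layers: int, hidden: int, n_heads: int) -> nn.TransformerEncoder:
    """Stack of Transformer blocks with the given (L, H, A) configuration."""
    layer = nn.TransformerEncoderLayer(
        d_model=hidden,
        nhead=n_heads,
        dim_feedforward=4 * hidden,  # common 4*H convention (assumed)
        activation="gelu",
    )
    return nn.TransformerEncoder(layer, num_layers=n_layers)

flaubert_base_like = make_encoder(n_layers=12, hidden=768, n_heads=12)
flaubert_large_like = make_encoder(n_layers=24, hidden=1024, n_heads=16)
```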

Training objective and optimization

Pre-training of the original BERT [devlin2019bert] consists of two tasks: (1) a masked language model (MLM) objective, in which the model learns to predict randomly masked tokens; and (2) a next sentence prediction (NSP) task, in which, given a pair of input sentences (A, B), the model learns to predict whether B is the sentence that actually follows A.

devlin2019bert observed that removing NSP significantly hurts performance on some downstream tasks. However, the opposite was shown in later studies such as XLNet [yang2019xlnet], XLM [lample2019cross], and RoBERTa [liu2019roberta] (liu2019roberta hypothesized that the original BERT implementation may only have removed the loss term while retaining a bad input format, resulting in the observed performance degradation). Therefore, we only employ the MLM objective in FlauBERT.
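For readers unfamiliar with MLM, here is a minimal sketch of BERT-style dynamic token masking, using the usual 15% selection rate and the 80/10/10 mask/random/keep split; these exact proportions are the standard BERT defaults and are assumed rather than quoted from the paper.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (input_ids, labels) for masked language modelling on one sequence."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored position
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                               # predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                       # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```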

To optimize this objective, we followed liu2019roberta and used the Adam optimizer [kingma2014adam] with 24k warmup steps, a peak learning rate of 6e-4, β1 = 0.9, β2 = 0.98, ε = 1e-6, and a weight decay of 0.01.
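A minimal PyTorch sketch of this optimization setup is given below. The linear warmup to a peak learning rate follows the text; the inverse-square-root decay after warmup is an assumption (the decay shape is not spelled out here), and the hyper-parameter values are the RoBERTa-style ones listed above.

```python
import torch

def make_optimizer_and_scheduler(model, peak_lr=6e-4, warmup_steps=24_000):
    """Adam with weight decay; linear warmup then inverse-sqrt decay (decay shape assumed)."""
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=peak_lr,
        betas=(0.9, 0.98),
        eps=1e-6,
        weight_decay=0.01,
    )

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)   # linear warmup to the peak learning rate
        return (warmup_steps / step) ** 0.5      # inverse-sqrt decay (assumption)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```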

Other training details

A vocabulary of 50K sub-word units is built using the Byte Pair Encoding (BPE) algorithm [sennrich2016neural]. The only difference with RoBERTa is that the training texts are preprocessed and tokenized with a basic Moses tokenizer for French [koehn2007moses], as in XLM [lample2019cross], before applying BPE. We use fastBPE (https://github.com/glample/fastBPE), a very efficient implementation, to extract the BPE units and encode the corpora.
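A sketch of this two-step tokenization pipeline is shown below, assuming the sacremoses package as a stand-in for the Moses tokenizer; the fastBPE command-line calls in the comments follow the usage documented in the fastBPE repository and should be checked against it.

```python
# Step 1: pre-tokenize the raw corpus with a basic Moses tokenizer for French.
# Step 2: learn and apply 50K BPE codes with fastBPE, e.g. (CLI usage as in its README):
#   ./fast learnbpe 50000 corpus.tok.fr > bpe.codes
#   ./fast applybpe corpus.bpe.fr corpus.tok.fr bpe.codes
from sacremoses import MosesTokenizer  # assumed stand-in for the Moses tokenizer scripts

mt = MosesTokenizer(lang="fr")

def moses_pretokenize(path_in: str, path_out: str) -> None:
    """Write a whitespace-tokenized copy of the corpus, ready for BPE learning."""
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(mt.tokenize(line.strip(), return_str=True, escape=False) + "\n")
```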

FlauBERT-BASE is trained on 32 Nvidia V100 SXM2 32 GB GPUs for 410 hours. Each GPU processes a batch of 16 sequences and gradients are accumulated over 16 steps, yielding a total batch size of 8 192 sequences. The perplexity on the validation set is used as the stopping criterion, with a patience of 20 consecutive epochs. However, due to time constraints, we stopped training before this criterion was reached and used the checkpoint obtained after 224k steps for the evaluation tasks (we will update the results with the fully trained model in the camera-ready version if the paper is accepted).
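The effective batch size follows from 32 GPUs × 16 sequences × 16 accumulation steps = 8 192 sequences. Below is a minimal single-GPU sketch of the gradient-accumulation part only; the data-parallel dimension and the actual FlauBERT training loop are omitted, and loader and loss_fn are placeholders.

```python
def train_epoch(model, loader, loss_fn, optimizer, scheduler, accum_steps=16):
    """Accumulate gradients over `accum_steps` micro-batches before each parameter update."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), labels) / accum_steps  # scale so gradients average
        loss.backward()
        if step % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```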

Finally, we summarize the differences between FlauBERT and BERT, RoBERTa, CamemBERT in Table 1.

4 Flue

In this section, we compile a set of existing French language tasks to form an evaluation benchmark for French NLP that we call FLUE (French Language Understanding Evaluation). We select datasets from different domains, with different levels of difficulty and degrees of formality, and with varying amounts of training data. Three out of six tasks (Text Classification, Paraphrasing, Natural Language Inference) come from cross-lingual datasets, since we also aim to provide results from a monolingual pre-trained model to facilitate future studies of cross-lingual models, which have been drawing much research interest recently.

Table 2 gives an overview of the datasets, including their domains and training/development/test splits. The details are presented in the next subsections.

Dataset                     Domain            Train     Dev     Test
CLS-FR (Books)              Product reviews   2 000     -       2 000
CLS-FR (DVD)                Product reviews   1 999     -       2 000
CLS-FR (Music)              Product reviews   1 998     -       2 000
PAWS-X-FR                   General domain    49 401    1 992   1 985
XNLI-FR                     Diverse genres    392 702   2 490   5 010
French Treebank             Daily newspaper   14 759    1 235   2 541
FrenchSemEval               Diverse genres    55 206    -       3 199
Noun Sense Disambiguation   Diverse genres    818 262   -       1 445
Table 2: Description of the datasets included in our FLUE benchmark.

4.1 Text Classification

CLS

The Cross-Lingual Sentiment (CLS) dataset [prettenhofer2010cross] consists of Amazon reviews for three product categories (books, DVD, and music) in four languages: English, French, German, and Japanese. The English data come from the work of blitzer2006domain. Each sample contains a review text and the associated rating, from 1 to 5 stars. Following blitzer2006domain and prettenhofer2010cross, reviews with 3 stars are removed; reviews rated above 3 stars are labelled positive and those rated below 3 stars negative. There is one train set and one test set for each product category. The train and test sets are balanced, each containing around 1 000 positive and 1 000 negative reviews, for a total of 2 000 reviews per dataset. We take the French portion to create a binary text classification task in FLUE and report the accuracy on the test set.
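The labelling rule above can be summarized with the following small sketch; the field name "rating" is hypothetical and only serves to illustrate the mapping.

```python
def label_review(review: dict):
    """Map a CLS review to a binary label, or None for discarded 3-star reviews."""
    rating = review["rating"]          # hypothetical field name
    if rating == 3:
        return None                    # 3-star reviews are removed
    return 1 if rating > 3 else 0      # 1 = positive, 0 = negative
```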

4.2 Paraphrasing

PAWS-X

The Cross-lingual Adversarial Dataset for Paraphrase Identification (PAWS-X) [yang2019pawsx] is the extension of the Paraphrase Adversaries from Word Scrambling (PAWS) dataset [zhang2019paws] for English to six other languages: French, Spanish, German, Chinese, Japanese and Korean. PAWS contains English paraphrase identification pairs from Wikipedia and Quora in which the two sentences of a pair have a high lexical overlap, generated by LM-based word scrambling and back-translation followed by human judgement. The task is to identify whether the two sentences in a pair are semantically equivalent. Similarly to previous approaches for creating multilingual corpora, yang2019pawsx used machine translation to create the training set for each target language in PAWS-X from the English training set of PAWS; the development and test sets for each language were translated by human translators. We take the French datasets to perform the paraphrasing task and report the accuracy on the test set.

4.3 Natural Language Inference

XNLI

The Cross-lingual NLI (XNLI) corpus [conneau2018xnli] extends the development and test sets of the Multi-Genre Natural Language Inference (MultiNLI) corpus [williams2018broad] to 15 languages. The development and test sets for each language consist of 7 500 human-annotated examples, making up a total of 112 500 sentence pairs annotated with the labels entailment, contradiction, or neutral. Each sentence pair consists of a premise (p) and a hypothesis (h). The Natural Language Inference (NLI) task, also known as recognizing textual entailment (RTE), is to determine whether p entails, contradicts, or neither entails nor contradicts h. We take the French part of the XNLI corpus to form the development and test sets for the NLI task in FLUE. The training set is obtained from the machine-translated French version provided with XNLI. Following conneau2018xnli, we report the test accuracy.

4.4 Constituency Parsing and Part-of-Speech Tagging

Constituency parsing consists in assigning a constituency tree to a sentence in natural language. We perform constituency parsing on the French Treebank [Abeille2003], a collection of sentences extracted from the French daily newspaper Le Monde and manually annotated with constituency syntactic trees and part-of-speech tags. Specifically, we use the version of the corpus instantiated for the SPMRL 2013 shared task and described by seddah-etal-2013-overview. This version comes with a standard split of 14 759 sentences for training and, respectively, 1 235 and 2 541 sentences for the development and evaluation sets.

4.5 Word Sense Disambiguation Tasks

Word Sense Disambiguation (WSD) is a classification task which aims to predict the sense of words in a given context according to a specific sense inventory. We used two French WSD tasks: the FrenchSemEval task [segonne2019using], which targets verbs only, and a modified version of the French part of the Multilingual WSD task of SemEval 2013 [Navigli2013], which targets nouns.

Verb Sense Disambiguation

We carried out sense disambiguation experiments focused on French verbs using FrenchSemEval (FSE) [segonne2019using], an evaluation dataset in which verb occurrences were manually annotated with senses from the sense inventory of Wiktionary, a collaboratively edited open dictionary. FSE includes both the evaluation data and the sense inventory. The evaluation data consist of 3 199 manual annotations over a selection of 66 verbs, i.e. roughly 50 sense-annotated occurrences per verb. The sense inventory provided with FSE is a Wiktionary dump (04-20-2018) openly available via DBnary [serasset2012dbnary]. For a given sense of a target word, the sense inventory provides a definition along with one or more examples. For this task, we consider the examples of the sense inventory as training examples and test our model on the evaluation dataset.

Noun Sense Disambiguation

We propose a new challenging task for the WSD of French, based on the French part of the Multilingual WSD task of SemEval 2013 [Navigli2013], which targets nouns only. We adapted the task to use the WordNet 3.0 sense inventory [miller1995wordnet] instead of BabelNet [navigli2010babelnet], by converting the sense keys to WordNet 3.0 if a mapping exists in BabelNet, and removing them otherwise.

The result of the conversion process is an evaluation corpus composed of 306 sentences and 1 445 French nouns annotated with WordNet sense keys, and manually verified.

For the training data, we followed the method proposed by hadjsalahthese2018 and translated the SemCor [Miller1993] and the WordNet Gloss Corpus (the set of WordNet glosses semi-automatically sense-annotated and released as part of WordNet since version 3.0) into French, using the best English-French machine translation system of the fairseq toolkit (https://github.com/pytorch/fairseq) [ott2019fairseq]. Finally, we aligned the WordNet sense annotations from the source English words to the translated French words, using the alignment provided by the MT system.

We rely on WordNet sense keys instead of the original BabelNet annotations for the following two reasons. First, WordNet is a resource that is entirely manually verified, and widely used in WSD research [Navigli2009WSD]. Secondly, there is already a large quantity of sense annotated data based on the sense inventory of WordNet [vialhal01718237] that we can use for the training of our system.

We publicly release both our training data and the evaluation data in the UFSAC format [vialhal01718237] at https://zenodo.org/record/3549806.

5 Experiments and Results

In this section, we present the fine-tuning results of FlauBERT on the FLUE benchmark. We compare FlauBERT with multilingual BERT (mBERT) [devlin2019bert] and CamemBERT [martin2019camembert] on all tasks. In addition, for each task we include the best non-BERT model for comparison.

5.1 Text Classification

Model description

We followed the standard fine-tuning process of BERT [devlin2019bert]. Since this is a single-sentence classification task, the input is a degenerate text-∅ pair. The classification head is composed of the following layers, in order: dropout, linear, activation, dropout, and linear. The output dimensions of the two linear layers are respectively equal to the hidden size of the Transformer and the number of classes (2 in this case, as the task is binary classification). The dropout rate was set to 0.1.

We trained for 30 epochs with a batch size of 8, performing a grid search over four different learning rates. A random 20% split of the training data was used as a validation set, and the best-performing model on this set was then evaluated on the test set.
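A PyTorch sketch of the classification head described above is given below; the tanh activation and the 0.1 dropout rate are assumptions (the paper only names the layer types), and hidden_size / num_classes correspond to the Transformer hidden size and the number of labels.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """dropout -> linear -> activation -> dropout -> linear, as described above."""

    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),  # output dim = Transformer hidden size
            nn.Tanh(),                            # activation (tanh assumed)
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),  # output dim = number of classes (2 here)
        )

    def forward(self, pooled_sentence_embedding):
        return self.net(pooled_sentence_embedding)
```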

Model          Books    DVD      Music
MultiFiT       91.25    89.55    93.40
mBERT          86.15    86.90    86.65
CamemBERT      93.40    92.70    94.15
FlauBERT-BASE  93.40    92.50    94.30
Results reported in [eisenschlos2019multifit].
Table 3: Accuracy on the CLS dataset for French.

Results

Table 3 presents the final accuracy on the test set for each model. The results highlight the importance of a monolingual French model for text classification: both CamemBERT and FlauBERT-BASE outperform mBERT by a large margin. CamemBERT and FlauBERT-BASE perform equally well on this task.

5.2 Paraphrasing

Model description

The setup for this task is almost identical to the previous one, except that: (1) the input sequence is now a pair of sentences (A, B); and (2) the hyper-parameter search is performed on the development set (i.e. no validation split is needed).

Results

The final accuracy for each model is reported in Table 4. Our monolingual French model performs only slightly better than the multilingual model (mBERT), which could be attributed to the characteristics of the PAWS-X dataset: containing samples with a high lexical overlap, it has been shown to be an effective measure of a model's sensitivity to word order and syntactic structure [yang2019pawsx]. A multilingual model such as mBERT can therefore capture these features about as well as a monolingual model.

Model                    Accuracy
ESIM [chen2017enhanced]  66.2
mBERT                    89.3
CamemBERT                89.8
FlauBERT-BASE            89.9
Results reported in [yang2019pawsx].
Table 4: Results on the French PAWS-X dataset.

5.3 Natural Language Inference

Model description

As this task was also considered for CamemBERT [martin2019camembert], for a fair comparison we replicate the same experimental setup. As for paraphrasing, the model input is a pair of sentences. The classification head, however, consists of only one dropout layer followed by one linear layer.

Results

We report the final accuracy for each model in Table 5. The results confirm the superiority of the French models over the multilingual model (mBERT) on this task. FlauBERT-BASE performs moderately better than CamemBERT; however, neither of them exceeds the performance of XLM-R.

Model          Accuracy
XLM-R          85.2
mBERT          76.9
CamemBERT      81.2
FlauBERT-BASE  81.3
Results reported in [conneau2019unsupervised].
Results reported in [martin2019camembert].
Table 5: Results on the French XNLI dataset.

5.4 Constituency Parsing and POS Tagging

Model description

We use the parser described by kitaev-klein-2018-constituency and kitaev-etal-2019-multilingual, an openly available chart parser based on a self-attentive encoder (https://github.com/nikitakit/self-attentive-parser). We compare (i) a model without any pre-trained parameters, (ii) a model that additionally uses and fine-tunes fastText (https://fasttext.cc/) pre-trained embeddings, and (iii-v) models based on the large-scale pre-trained language models mBERT, CamemBERT, and FlauBERT. We use the default hyperparameters from kitaev-klein-2018-constituency for the first two settings and the hyperparameters from kitaev-etal-2019-multilingual when using pre-trained language models. We jointly perform part-of-speech (POS) tagging from the same input as the parser, in a multitask setting. For each setting we train three times with different random seeds and select the best model according to the development F-score.

For the final evaluation, we use the evaluation tool provided by the SPMRL shared task organizers (http://pauillac.inria.fr/~seddah/evalb_spmrl2013.tar.gz) and report the labelled F-score, the standard metric for constituency parsing evaluation, as well as POS tagging accuracy.

Model                                            Dev F1   Dev POS   Test F1   Test POS
Best published [kitaev-etal-2019-multilingual]      -        -      87.42        -
No pre-training                                  84.31     97.6     83.85      97.5
fastText pre-trained embeddings                  84.09     97.6     83.64      97.7
mBERT                                            87.25     98.1     87.52      98.1
CamemBERT [martin2019camembert]                  88.53     98.1     88.39      98.2
FlauBERT-BASE                                    88.95     98.2     89.05      98.1
Ensemble: FlauBERT-BASE + CamemBERT              89.32       -      89.28        -
Table 6: Constituency parsing and POS tagging results.

Results

We report constituency parsing results in Table 6. Without pre-training, we replicate the result of kitaev-klein-2018-constituency; fastText pre-trained embeddings bring no improvement over this already strong model. When using pre-trained language models, we observe that CamemBERT, with its language-specific training, improves over mBERT by 0.9 absolute F1. FlauBERT outperforms CamemBERT by 0.7 absolute F1 on the test set and obtains the best published results on the task for a single model. Regarding POS tagging, all large-scale pre-trained language models obtain similar results (98.1-98.2) and outperform the models without pre-training or with fastText embeddings (97.5-97.7).

In order to assess whether FlauBERT and CamemBERT are complementary for this task, we evaluate an ensemble of both models (last line in Table 6). The ensemble model improves by 0.4 absolute F over FlauBERT on the development set and 0.2 on the test set, obtaining the highest result for the task. This result suggests that both pre-trained language models are complementary and have their own strengths and weaknesses.

5.5 Word Sense Disambiguation

Verb Sense Disambiguation

The disambiguation was performed with the same supervised WSD method as segonne2019using. First, we compute sense vector representations from the examples found in the Wiktionary sense inventory: given a sense and its corresponding examples, we compute its vector representation by averaging the vector representations of its examples. Then, we tag each test instance with the sense whose representation is closest in cosine similarity. We used the contextual embeddings output by FlauBERT as the vector representation of any given instance (from the sense inventory or the test data) of a target word, and proceeded the same way with multilingual BERT and CamemBERT for comparison. We also compared our model with a simpler context representation, averaged word embeddings (AWE), which represents the context of the target word by averaging the embeddings of its surrounding words in a given window. We used fastText word embeddings with a window of size 5.
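A minimal sketch of this nearest-sense procedure is given below, under the assumption that encode(text) is a placeholder returning the contextual embedding of the target word occurrence in a sentence (the actual extraction from FlauBERT is not shown).

```python
import numpy as np

def sense_vectors(inventory, encode):
    """inventory: {sense_id: [example sentences for that sense]} -> sense vectors."""
    return {
        sense: np.mean([encode(example) for example in examples], axis=0)
        for sense, examples in inventory.items()
    }

def disambiguate(instance, vectors, encode):
    """Tag one test occurrence with the sense whose vector is closest in cosine similarity."""
    v = encode(instance)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(vectors, key=lambda sense: cosine(v, vectors[sense]))
```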

We report the results in Table 7. The monolingual models are significantly better than multilingual BERT. We also observe that FlauBERT-BASE performs worse than CamemBERT on this task; future work will investigate this difference in more detail.

Model          F1
fastText       34.90
mBERT          44.93
CamemBERT      51.09
FlauBERT-BASE  47.40
Table 7: F1 scores (%) on the Verb Disambiguation Task.

Noun Sense Disambiguation

We implemented a neural classifier similar to the one presented by vialhal02131872. This classifier forwards the output of a pre-trained language model to a stack of 6 trained Transformer encoder layers and predicts the synset of every input word through a softmax layer. The only difference between our model and that of vialhal02131872 is that we chose the same hyper-parameters as FlauBERT-BASE for the hidden size and the number of attention heads of these Transformer layers (more precisely, 768 and 12).
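A structural sketch of such a classifier is shown below: a pre-trained encoder (placeholder) followed by 6 additional Transformer encoder layers and a per-token linear projection over synset IDs. The hidden size and number of heads follow the FlauBERT-BASE values mentioned above; the default number of synsets is only a placeholder.

```python
import torch.nn as nn

class WSDClassifier(nn.Module):
    """Pre-trained LM outputs -> 6 Transformer encoder layers -> per-token synset scores."""

    def __init__(self, pretrained_lm, hidden=768, n_heads=12, n_synsets=117_000):
        super().__init__()
        self.lm = pretrained_lm                      # placeholder: returns (batch, seq, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.output = nn.Linear(hidden, n_synsets)   # softmax over synset IDs at training time

    def forward(self, inputs):
        hidden_states = self.lm(inputs)              # contextual embeddings of the input tokens
        return self.output(self.encoder(hidden_states))
```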

Model            Single (mean)   Single (std)   Ensemble
No pre-training  45.73           1.91           50.03
fastText         44.90           1.24           49.41
mBERT            53.48           1.44           56.89
CamemBERT        51.52           0.72           55.78
FlauBERT-BASE    50.78           1.55           54.19
Table 8: F1 scores (%) on the Noun Disambiguation Task.

At prediction time, we take the synset ID with the maximum value in the softmax layer (no filtering on the lemma of the target word is performed). We trained 4 models for every experiment and report the mean result and standard deviation of the individual models, as well as the result of an ensemble that averages the outputs of the softmax layers. Finally, we compared FlauBERT with CamemBERT, multilingual BERT, fastText, and a model with no input embeddings. We report the results in Table 8.
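A small sketch of this ensembling step (averaging the softmax distributions of the individually trained models before taking the argmax), with models standing for the four trained classifiers:

```python
import torch

def ensemble_predict(models, inputs):
    """Average the softmax distributions of several models and return synset predictions."""
    with torch.no_grad():
        probs = torch.stack([model(inputs).softmax(dim=-1) for model in models]).mean(dim=0)
    return probs.argmax(dim=-1)   # no filtering on the lemma of the target word
```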

On this task and with these settings, we first observe an advantage for multilingual BERT over both CamemBERT and FlauBERT-BASE. We think that this might be due to the fact that the training corpora we used are machine-translated from English to French, so the multilingual nature of mBERT probably makes it better suited to the task. Comparing CamemBERT to FlauBERT-BASE, we see a small advantage for the former, which we think might be due to the difference in word capitalization: FlauBERT-BASE uses a lowercased vocabulary while CamemBERT does not. Our upcoming model, however, keeps the true capitalization of words; we will update the paper with results using this model in the final version should it be accepted.

6 Conclusion

We present and release FlauBERT, a pre-trained language model for French. FlauBERT is trained on a large multiple-source corpus and obtains state-of-the-art results on a number of French NLP tasks, ranging from text classification and paraphrasing to syntactic parsing. FlauBERT is competitive with CamemBERT, another pre-trained language model for French developed in parallel, despite being trained on about half as much text data.

In order to make the pipeline entirely reproducible, we not only release the preprocessing and training scripts together with FlauBERT-BASE, but also provide a general benchmark for evaluating French NLP systems (FLUE). In the final version of the paper, we will release FlauBERT-LARGE, a larger version of FlauBERT, which was still in training at the time of submission.

7 Acknowledgements

This work benefited from the ‘Grand Challenge Jean Zay’ program which granted us computing time on the new CNRS Jean Zay supercomputer.

We thank Guillaume Lample and Alexis Conneau for their active technical support on using the XLM code.

A Appendix

A.1 Details on our French text corpus

Table 9 presents the statistics of all sub-corpora in our training corpus. We give the descriptions of each sub-corpus below.

Dataset Post-processed text size Number of Tokens (Moses) Number of Sentences
CommonCrawl [Buck-commoncrawl] 43.4 GB 7.85 B 293.37 M
NewsCrawl [li2019findings] 9.2 GB 1.69 B 63.05 M
Wikipedia [wiki:xxx] 4.2 GB 750.76 M 31.00 M
Wikisource [wiki:xxx] 2.4 GB 458.85 M 27.05 M
EU Bookshop [skadicnvsbillions] 2.3 GB 389.40 M 13.18 M
MultiUN [eisele2010multiun] 2.3 GB 384.42 M 10.66 M
GIGA [TIEDEMANN12.463] 2.0 GB 353.33 M 10.65 M
PCT 1.2 GB 197.48 M 7.13 M
Project Gutenberg 1.1 GB 219.73 M 8.23 M
OpenSubtitles [lison2016opensubtitles2015] 1.1 GB 218.85 M 13.98 M
Le Monde 664 MB 122.97 M 4.79 M
DGT [TIEDEMANN12.463] 311 MB 53.31 M 1.73 M
EuroParl [koehn2005europarl] 292 MB 50.44 M 1.64 M
EnronSent [styler2011enronsent] 73 MB 13.72 M 662.31 K
NewsCommentary [li2019findings] 61 MB 13.40 M 341.29 K
Wiktionary [wiki:xxx] 52 MB 9.68 M 474.08 K
Global Voices [TIEDEMANN12.463] 44 MB 7.88 M 297.38 K
Wikinews [wiki:xxx] 21 MB 3.93 M 174.88 K
TED Talks [TIEDEMANN12.463] 15 MB 2.92 M 129.31 K
Wikiversity [wiki:xxx] 10 MB 1.70 M 64.60 K
Wikibooks [wiki:xxx] 9 MB 1.67 M 65.19 K
Wikiquote [wiki:xxx] 5 MB 866.22 K 42.27 K
Wikivoyage [wiki:xxx] 3 MB 500.64 K 23.36 K
EUconst [TIEDEMANN12.463] 889 KB 148.47 K 4.70 K
Total 71 GB 12.79 B 488.78 M
Table 9: Statistics of sub-corpora after cleaning and pre-processing an initial corpus of 270 GB, ranked in the decreasing order of post-processed text size.

Datasets from WMT19 shared tasks

We used four corpora provided in the WMT19 shared task (http://www.statmt.org/wmt19/translation-task.html) [li2019findings].

  • Common Crawl includes text crawled from billions of pages on the internet.

  • News Crawl contains crawled news collected from 2007 to 2018.

  • EuroParl contains text extracted from the proceedings of the European Parliament.

  • News Commentary consists of text from news-commentary crawl.

Datasets from OPUS

OPUS (http://opus.nlpl.eu) is a growing resource of freely accessible monolingual and parallel corpora [TIEDEMANN12.463]. We collected the following French monolingual datasets from OPUS.

  • OpenSubtitles comprises translated movies and TV subtitles.

  • EU Bookshop includes publications from the European institutions.

  • MultiUN contains documents from the United Nations.

  • GIGA consists of newswire text and was made available in the WMT10 shared task (https://www.statmt.org/wmt10/).

  • DGT contains translation memories provided by the Joint Research Center (JRC).

  • Global Voices encompasses news stories from the website Global Voices.

  • TED Talks includes subtitles from TED Talks (https://www.ted.com) videos.

  • EUconst consists of text from the European constitution.

Wikimedia database

includes Wikipedia, Wiktionary, Wikiversity, etc. The content is built collaboratively by volunteers around the world (https://dumps.wikimedia.org/other/cirrussearch/current/).

  • Wikipedia is a free online encyclopedia including high-quality text covering a wide range of topics.

  • Wikisource includes source texts in the public domain.

  • Wikinews contains free-content news.

  • Wiktionary is an open-source dictionary of words, phrases, etc.

  • Wikiversity hosts learning resources, learning projects, and research.

  • Wikibooks includes open-content books.

  • Wikiquote consists of sourced quotations from notable people and creative works.

  • Wikivoyage includes information about travelling.

PCT

consists of patent documents collected and maintained internally by the GETALP team (http://lig-getalp.imag.fr/en/home/).

Project Gutenberg

contains free eBooks of different genres, mostly older classic works of world literature for which copyright has expired.

Le Monde

consists of articles from Le Monde (https://www.lemonde.fr) published from 1987 to 2003, collected and maintained internally by the GETALP team.

EnronSent

The EnronSent [styler2011enronsent] corpus is a part of the Enron Email Dataset, a massive dataset containing 500K messages from senior management executives at the Enron Corporation.
