Unsupervised Language Model Pre-training for French
Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized word representations such as OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2019), or XLNet (Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to complex NLP tasks (natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks are shared with the research community for further reproducible experiments in French NLP.
A recent game-changing contribution in Natural Language Processing (NLP) was the introduction of deep unsupervised language representations pre-trained using only plain text corpora: GPT [radford2018improving], BERT [devlin2019bert], XLNet [yang2019xlnet], XLM [lample2019cross], RoBERTa [liu2019roberta]. Previous word embedding pre-training approaches, such as word2vec [Mikolov:2013:DRW:2999792.2999959] or GloVe [Pennington14glove:global], learn a single vector for each wordform. By contrast, these new models are called contextual: each token instance has a vector representation that depends on its right and left context. Initially based on RNN architectures [peters2018deep, ELMo], these models quickly converged towards the use of Transformer architectures [vaswani2017attention]. One very popular work [devlin2019bert, BERT] also proposed to open-source the trained models so that anyone could easily build an NLP system using the pre-trained model, saving time, energy and resources. As a consequence, unsupervised language model pre-training became a de facto standard to achieve state-of-the-art results in many NLP tasks. However, this was mostly demonstrated for English, even if a multilingual version of BERT called mBERT was also released by Google [devlin2019bert], considering more than a hundred languages in a single model.
In this paper, we describe our methodology to build and open source FlauBERT, a French BERT model that significantly outperforms mBERT in several downstream NLP tasks. (Three weeks before this submission, we learned of a similar project that resulted in a publication on arXiv [martin2019camembert]. However, we believe that these two works on French LMs are complementary since the NLP tasks we addressed are different, as are the training corpora and preprocessing pipelines. We also point out that our models were trained using the CNRS (French National Centre for Scientific Research) public research computational infrastructure and did not receive any assistance from a private stakeholder.) FlauBERT relies on freely available datasets and is made publicly available in different versions (https://github.com/getalp/Flaubert). We also provide a general benchmark for evaluating French NLP systems, which is named FLUE (French Language Understanding Evaluation).
Self-supervised pre-training on unlabeled text data was first proposed in the task of neural language modeling [bengio2003neural, collobert2008unified], where it was shown that a neural network trained to predict the next word from prior words can learn useful embedding representations, called word embeddings (each word is represented by a fixed vector). (Self-supervised learning is a special case of unsupervised learning where unlabeled data is used as a supervision signal.) These representations were shown to play an important role in NLP, yielding state-of-the-art performance on multiple tasks [collobert2011natural], especially after the introduction of word2vec [Mikolov:2013:DRW:2999792.2999959] and GloVe [Pennington14glove:global], efficient and effective algorithms for learning word embeddings.
A major limitation of word embeddings is that a word can only have a single representation, even if it can have multiple meanings (e.g. depending on the context). Therefore, recent works have introduced a paradigm shift from context-free word embeddings to contextual embeddings: the representation for a word is a function of the entire input sequence, which allows encoding complex, high-level syntactic and semantic characteristics of words.
The concurrent release of two papers opened this line of research: ULMFiT [howard2018universal] relies on word embeddings derived from a “simple” LSTM [hochreiter1997long] language model, while in ELMo [peters2018deep] a bidirectional LSTM encodes the sentence to build the final embedding for each token from the concatenation of the left-to-right and right-to-left representations. (It should be noted that learning contextual embeddings had previously been proposed by [mccann2017learned], but in a supervised fashion as they used annotated machine translation data.) Another fundamental difference lies in how each model can be tuned to different downstream tasks: while ELMo delivers different word vectors that can be interpolated, ULMFiT enables robust fine-tuning of the whole network w.r.t. the downstream tasks. Fine-tuning was shown to significantly boost performance, and thus this approach has been further developed in recent works such as MultiFiT [eisenschlos2019multifit] or, most prominently, Transformer-based [vaswani2017attention] architectures: GPT [radford2018improving], BERT [devlin2019bert], XLNet [yang2019xlnet], XLM [lample2019cross], RoBERTa [liu2019roberta], ALBERT [lan2019albert], T5 [raffel2019exploring]. These methods have one after the other established new state-of-the-art results on various NLP benchmarks (such as GLUE [wang-etal-2018-glue] or SQuAD [rajpurkar2018know]), surpassing previous methods by a large margin.
Given the impact of pre-trained LMs on NLP downstream tasks in English, several works have recently released pre-trained models for other languages. For instance, ELMo exists for Portuguese, Japanese, German and Basque (https://allennlp.org/elmo), while BERT was specifically trained for simplified and traditional Chinese (https://github.com/google-research/bert) and German (https://deepset.ai/german-bert). A Portuguese version of MultiFiT is also available (https://github.com/piegu/language-models). For French, this is very recent, with the releases of CamemBERT [martin2019camembert] and of a pre-trained LM for French using the MultiFiT configuration (https://github.com/piegu/language-models).
Another trend considers one model estimated for several languages with a shared vocabulary. The release of multilingual BERT for 104 languages pioneered this approach (https://github.com/google-research/bert). A recent extension of this work leverages parallel data to build a cross-lingual pre-trained version of LASER [artetxe2019massively] for 93 languages, XLM [lample2019cross] and XLM-R [conneau2019unsupervised] for 100 languages.
The existence of a multi-task evaluation benchmark such as GLUE [wang-etal-2018-glue] for English is highly beneficial to facilitate research in the language of interest. The GLUE benchmark has become a prominent framework to evaluate the performance of NLP models in English. The recent contributions based on pre-trained language models have led to remarkable performance across a wide range of NLU tasks. The authors of GLUE have therefore introduced SuperGLUE [wang2019superglue]: a new benchmark built on the principles of GLUE, including a more challenging and diverse set of tasks. A Chinese version of GLUE (https://github.com/chineseGLUE/chineseGLUE) has also been developed to evaluate model performance in Chinese NLP tasks. As of now, we have not learned of any such benchmark for French.
In this section, we describe the training corpus, the text preprocessing pipeline, the model architecture and training configurations to build FlauBERT.
Our French text corpus consists of 24 sub-corpora gathered from different sources, covering diverse topics and writing styles, ranging from formal and well-written text (e.g. Wikipedia and books; http://www.gutenberg.org) to random text crawled from the Internet (e.g. Common Crawl; http://data.statmt.org/ngrams/deduped2017). The data were collected from three main sources: (1) monolingual data for French provided in WMT19 shared tasks [li2019findings, 4 sub-corpora]; (2) French text corpora offered in the OPUS collection [TIEDEMANN12.463, 8 sub-corpora]; and (3) datasets available in the Wikimedia projects [wiki:xxx, 8 sub-corpora].
We used the WikiExtractor tool (https://github.com/attardi/wikiextractor) to extract the text from Wikipedia. For the other sub-corpora, we either used our own tool to extract the text or downloaded them directly from their websites. The total size of the uncompressed text before preprocessing is 270 GB. More details can be found in Appendix A.1.
For all sub-corpora, we filtered out very short sentences as well as repetitive and non-meaningful content such as fax/telephone numbers, email addresses, etc. For Common Crawl, which is our largest sub-corpus with 215 GB of raw text, we applied aggressive cleaning to reduce its size to 43.4 GB. All the data were Unicode-normalized in a consistent way before being tokenized using the Moses tokenizer [koehn2007moses]. The resulting training corpus is 71 GB in size.
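The kind of filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the `MIN_TOKENS` threshold and the choice of NFC normalization are hypothetical, since the exact rules of the pipeline are not given in the text.

```python
import re
import unicodedata

# Hypothetical patterns and threshold (assumptions, not the actual rules used).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\+?\d[\d .-]{7,}\d")
MIN_TOKENS = 3


def clean_sentence(line):
    """Return the Unicode-normalized sentence, or None if it should be dropped."""
    text = unicodedata.normalize("NFC", line).strip()
    # Drop lines carrying contact details such as emails or phone/fax numbers.
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        return None
    # Drop very short sentences (whitespace-separated token count).
    if len(text.split()) < MIN_TOKENS:
        return None
    return text


def clean_corpus(lines):
    """Apply clean_sentence to every line and keep the survivors."""
    return [s for s in map(clean_sentence, lines) if s is not None]
```

A real pipeline would stream files rather than hold lists in memory, and the phone pattern here is deliberately crude (it would also discard some sentences containing long digit sequences).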
Our code for downloading and preprocessing data will be made publicly available.
|Model||BERT||RoBERTa||CamemBERT||FlauBERT|
|Training data||13 GB||160 GB||138 GB (a)||71 GB (b)|
|Pre-training objectives||NSP and MLM||MLM||MLM||MLM|
|Total parameters||110 M||125 M||110 M||137 M|
|Tokenizer||WordPiece 30K||BPE 50K||SentencePiece 32K||BPE 50K|
|Masking strategy||Static + Sub-word masking||Dynamic + Sub-word masking||Dynamic + Whole-word masking||Dynamic + Sub-word masking|
|(a), (b): 282 GB, 270 GB before filtering/cleaning.|
FlauBERT has the same model architecture as BERT [devlin2019bert], which consists of a multi-layer bidirectional Transformer [vaswani2017attention]. Following devlin2019bert, we propose two model sizes:
FlauBERT$_\text{BASE}$: $L = 12$, $H = 768$, $A = 12$,
FlauBERT$_\text{LARGE}$: $L = 24$, $H = 1024$, $A = 16$ (this model is still in training at the time of submission),
where $L$, $H$ and $A$ respectively denote the number of Transformer blocks, the hidden size, and the number of self-attention heads. As the Transformer has become quite standard, we refer to vaswani2017attention for further details.
Pre-training of the original BERT [devlin2019bert] consists of two supervised tasks: (1) a masked language model (MLM) that learns to predict randomly masked tokens; and (2) a next sentence prediction (NSP) task in which the model learns to predict whether B is the actual next sentence that follows A, given a pair of input sentences A,B.
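As an illustration of the MLM objective, the following sketch corrupts a token sequence and records which positions the model would have to predict. The 15% masking rate and the 80/10/10 replacement split are the defaults of the original BERT recipe and are assumptions here, as is the `<mask>` symbol; the text above does not state which values were used.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption)


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # replace with the mask symbol
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # replace with a random token
            else:
                corrupted.append(tok)                # keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Calling `mask_tokens` again each time a sequence is fed to the model yields a different mask pattern per pass, which is the usual meaning of dynamic masking; computing the pattern once during preprocessing corresponds to static masking.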
devlin2019bert observed that removing NSP significantly hurts performance on some downstream tasks. However, the opposite was shown in later studies, including yang2019xlnet [XLNet], lample2019cross [XLM], and liu2019roberta [RoBERTa]. (liu2019roberta hypothesized that the original BERT implementation may only have removed the loss term while still retaining a bad input format, resulting in performance degradation.) Therefore, we only employed the MLM objective in FlauBERT.
To optimize this objective function, we followed liu2019roberta and used the Adam optimizer [kingma2014adam] with the following parameters: warmup steps of 24k, peak learning rate of $6 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-6}$, and weight decay of 0.01.
A vocabulary of 50K sub-word units is built using the Byte Pair Encoding (BPE) algorithm [sennrich2016neural]. The only difference between our work and RoBERTa is that the training texts are preprocessed and tokenized using a basic tokenizer for French [koehn2007moses, Moses], as in XLM [lample2019cross], before the application of BPE. We use fastBPE (https://github.com/glample/fastBPE), a very efficient implementation, to extract the BPE units and encode the corpora.
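To make the BPE step concrete, here is a toy re-implementation of merge learning on a word-frequency dictionary. It is a didactic sketch only, not the fastBPE code: the `</w>` end-of-word marker and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dictionary."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab
```

For instance, with the counts `{"bas": 5, "base": 3, "bar": 2}` the first two merges are `("b", "a")` and then `("ba", "s")`, because those pairs are the most frequent at each step. In the setting described above, the input words would come from the Moses-tokenized corpus.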
FlauBERT$_\text{BASE}$ is trained on 32 Nvidia V100 SXM2 32 GB GPUs in 410 hours. Each GPU consumes a batch size of 16 sequences and the gradient is accumulated 16 times, making up a total batch size of 8192 sequences. The perplexity on the validation set is used as the stopping criterion, with a patience of 20 consecutive epochs. However, due to time constraints, we stopped the model before it reached this stopping criterion and used the pre-trained weights of the model trained for 224k steps for the evaluation tasks. (We will update the results with the fully trained model in the camera-ready version if the paper is accepted.)
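The batch-size arithmetic and the patience-based stopping rule can be written out as a short sketch. The function names are hypothetical, and the stopping rule is one common reading of "patience" (stop once the best validation perplexity is that many epochs old); the actual training code may define it differently.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps):
    """Number of sequences contributing to one optimizer update."""
    return num_gpus * per_gpu_batch * accumulation_steps


def should_stop(val_perplexities, patience):
    """True once the best (lowest) perplexity is at least `patience` epochs old."""
    if not val_perplexities:
        return False
    best_epoch = min(range(len(val_perplexities)),
                     key=val_perplexities.__getitem__)
    return len(val_perplexities) - 1 - best_epoch >= patience


# 32 GPUs x 16 sequences per GPU x 16 accumulation steps = 8192 sequences.
total = effective_batch_size(32, 16, 16)
```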
Finally, we summarize the differences between FlauBERT and BERT, RoBERTa, CamemBERT in Table 1.
In this section, we compile a set of existing French language tasks to form an evaluation benchmark for French NLP that we call FLUE (French Language Understanding Evaluation). We select datasets that vary in domain, level of difficulty, degree of formality, and amount of training samples. Three out of six tasks (Text Classification, Paraphrase, Natural Language Inference) are from cross-lingual datasets, since we also aim to provide results from a monolingual pre-trained model to facilitate future studies of cross-lingual models, which have been drawing much research interest recently.
Table 2 gives an overview of the datasets, including their domains and training/development/test splits. The details are presented in the next subsections.
|Dataset||Domain||Train||Dev||Test|
|CLS-FR (Books)||Product reviews||2 000||-||2 000|
|CLS-FR (DVD)||Product reviews||1 999||-||2 000|
|CLS-FR (Music)||Product reviews||1 998||-||2 000|
|PAWS-X-FR||General domain||49 401||1 992||1 985|
|XNLI-FR||Diverse genres||392 702||2 490||5 010|
|French Treebank||Daily newspaper||14 759||1 235||2 541|
|FrenchSemEval||Diverse genres||55 206||-||3 199|
|Noun Sense Disambiguation||Diverse genres||818 262||-||1 445|
The Cross-Lingual Sentiment (CLS) dataset [prettenhofer2010cross] consists of Amazon reviews for three product categories (books, DVD, and music) in four languages: English, French, German, and Japanese. The English dataset is obtained from the work of blitzer2006domain. Each sample contains a review text and the associated rating from 1 to 5 stars. Following blitzer2006domain and prettenhofer2010cross, ratings with 3 stars are removed. Positive reviews have ratings greater than 3 and negative reviews are those rated less than 3. There is one train and one test set for each product category. The train and test sets are balanced, including around 1 000 positive and 1 000 negative reviews for a total of 2 000 reviews in each dataset. We take the French portion to create the binary text classification task in FLUE and report the accuracy on the test set.
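The labelling rule and the reported metric can be expressed in a few lines. The function names and the `(text, stars)` input format are hypothetical conveniences for this sketch, not the dataset's actual file format.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (removed)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # 3-star reviews are removed


def build_examples(reviews):
    """reviews: iterable of (text, stars) pairs; returns (text, label) pairs."""
    examples = []
    for text, stars in reviews:
        label = rating_to_label(stars)
        if label is not None:
            examples.append((text, label))
    return examples


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```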
The Cross-lingual Adversarial Dataset for Paraphrase Identification PAWS-X [yang2019pawsx] is the extension of the Paraphrase Adversaries from Word Scrambling PAWS [zhang2019paws] for English to six other languages: French, Spanish, German, Chinese, Japanese and Korean. PAWS comprises English paraphrase identification pairs from Wikipedia and Quora in which the two sentences in a pair have a high lexical overlap ratio, generated by LM-based word scrambling and back translation followed by human judgement. The paraphrasing task is to identify whether the sentences in these pairs are semantically equivalent or not. Similar to previous approaches to creating multilingual corpora, yang2019pawsx used machine translation to create the training set for each target language in PAWS-X from the English training set in PAWS. The development and test sets for each language are translated by human translators. We take the related datasets for French to perform the paraphrasing task and report the accuracy on the test set.
The Cross-lingual NLI Corpus (XNLI) [conneau2018xnli] extends the development and test sets of the Multi-Genre Natural Language Inference corpus [williams2018broad, MultiNLI] to 15 languages. The development and test sets for each language consist of 7 500 human-annotated examples, making up a total of 112 500 sentence pairs annotated with the labels entailment, contradiction, or neutral. Each sentence pair includes a premise ($p$) and a hypothesis ($h$). The Natural Language Inference (NLI) task, also known as recognizing textual entailment (RTE), is to determine whether $p$ entails, contradicts or neither entails nor contradicts $h$. We take the French part of the XNLI corpus to form the development and test sets for the NLI task in FLUE. The train set is obtained from the machine translated version to French provided in XNLI. Following conneau2018xnli, we report the test accuracy.
Constituency parsing consists in assigning a constituency tree to a sentence in natural language. We perform constituency parsing on the French Treebank [Abeille2003], a collection of sentences extracted from the French daily newspaper Le Monde and manually annotated with constituency syntactic trees and part-of-speech tags. Specifically, we use the version of the corpus instantiated for the SPMRL 2013 shared task and described by seddah-etal-2013-overview. This version is provided with a standard split of 14 759 sentences for the training corpus, and 1 235 and 2 541 sentences for the development and evaluation sets respectively.
Word Sense Disambiguation (WSD) is a classification task which aims to predict the sense of words in a given context according to a specific sense inventory. We used two French WSD tasks: the FrenchSemEval task [segonne2019using], which targets verbs only, and a modified version of the French part of the Multilingual WSD task of SemEval 2013 [Navigli2013], which targets nouns.
We ran sense disambiguation experiments focused on French verbs using FrenchSemEval [segonne2019using, FSE], an evaluation dataset in which verb occurrences were manually sense-annotated with the sense inventory of Wiktionary, a collaboratively edited open-source dictionary. FSE includes both the evaluation data and the sense inventory. The evaluation data consists of 3 199 manual annotations among a selection of 66 verbs, which makes roughly 50 sense-annotated occurrences per verb. The sense inventory provided in FSE is a Wiktionary dump (04-20-2018) openly available via Dbnary [serasset2012dbnary]. For a given sense of a target key, the sense inventory offers a definition along with one or more examples. For this task, we considered the examples of the sense inventory as training examples and tested our model on the evaluation dataset.
We propose a new challenging task for the WSD of French, based on the French part of the Multilingual WSD task of SemEval 2013 [Navigli2013], which targets nouns only. We adapted the task to use the WordNet 3.0 sense inventory [miller1995wordnet] instead of BabelNet [navigli2010babelnet], by converting the sense keys to WordNet 3.0 if a mapping exists in BabelNet, and removing them otherwise.
The result of the conversion process is an evaluation corpus composed of 306 sentences and 1 445 French nouns annotated with WordNet sense keys, and manually verified.
For the training data, we followed the method proposed by hadjsalahthese2018, and translated the SemCor [Miller1993] and the WordNet Gloss Corpus (the set of WordNet glosses semi-automatically sense annotated, which has been released as part of WordNet since version 3.0) into French, using the best English-French Machine Translation system of the fairseq toolkit (https://github.com/pytorch/fairseq) [ott2019fairseq]. Finally, we aligned the WordNet sense annotation from the source English words to the translated French words, using the alignment provided by the MT system.
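The annotation transfer step can be pictured with a minimal sketch. The data layout below (a dictionary of source-side sense tags and a list of `(source index, target index)` alignment pairs) is an assumption made for illustration, as is the sense key used in the usage note; the actual alignment format produced by the MT system may differ.

```python
def project_senses(source_senses, alignment, target_length):
    """Copy sense tags from source tokens to the target tokens aligned with them.

    source_senses: {source token index: sense key}
    alignment:     list of (source index, target index) pairs
    target_length: number of tokens in the translated sentence
    """
    target_senses = [None] * target_length
    for src, tgt in alignment:
        sense = source_senses.get(src)
        # If several tagged source tokens align to one target, the last one wins.
        if sense is not None and 0 <= tgt < target_length:
            target_senses[tgt] = sense
    return target_senses
```

For example, with the English sentence "the bank closed" translated as "la banque a fermé", `project_senses({1: "bank%1:14:00::"}, [(0, 0), (1, 1), (2, 2), (2, 3)], 4)` tags only the second French token.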
We rely on WordNet sense keys instead of the original BabelNet annotations for the following two reasons. First, WordNet is a resource that is entirely manually verified, and widely used in WSD research [Navigli2009WSD]. Secondly, there is already a large quantity of sense annotated data based on the sense inventory of WordNet [vialhal01718237] that we can use for the training of our system.
We publicly release both our training data and the evaluation data in the UFSAC format [vialhal01718237] (https://zenodo.org/record/3549806).
In this section, we present FlauBERT fine-tuning results on the FLUE benchmark. We compare the performance of FlauBERT with Multilingual BERT [devlin2019bert, mBERT] and CamemBERT [martin2019camembert] on all tasks. In addition, for each task we also include the best non-BERT model for comparison.
We followed the standard fine-tuning process of BERT [devlin2019bert]. Since our task is single-sentence text classification, the input is a degenerate text-$\varnothing$ pair. The classification head is composed of the following layers, in order: dropout, linear, activation, dropout, and linear. The output dimensions of the linear layers are respectively equal to the hidden size of the Transformer and the number of classes (which is $2$ in this case, as the task is binary classification). The dropout rate was set to $0.1$.
We trained for 30 epochs using a batch size of 8 while performing a grid search over 4 different learning rates: $1 \times 10^{-5}$, $5 \times 10^{-5}$, $1 \times 10^{-6}$, and $5 \times 10^{-6}$. A random split of 20% of the training data was used as the validation set, and the best performing model on this set was then chosen for evaluation on the test set.
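The model selection protocol (random validation split, then choosing the learning rate with the best validation score) can be sketched as below. `train_and_score` is a hypothetical callback standing in for a full fine-tuning run that returns validation accuracy; the seed and function names are likewise assumptions.

```python
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the training examples for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def grid_search(learning_rates, train_and_score):
    """Run one training per learning rate; return (best_lr, best_score)."""
    scores = {lr: train_and_score(lr) for lr in learning_rates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores[best_lr]
```

Only the configuration selected on the validation split would then be evaluated on the test set, so that the test data plays no role in hyper-parameter selection.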
|Results reported in [eisenschlos2019multifit].|
Table 3 presents the final accuracy on the test set for each model. The results highlight the importance of a monolingual French model for text classification: both CamemBERT and FlauBERT$_\text{BASE}$ outperform mBERT by a large margin. CamemBERT and FlauBERT$_\text{BASE}$ perform equally well on this task.
The setup for this task is almost identical to the previous one, except that: (1) the input sequence is now a pair of sentences A,B; and (2) the hyper-parameter search is performed on the development data set (i.e. no validation split is needed).
The final accuracy for each model is reported in Table 4. One can observe that our monolingual French model performs only slightly better than a multilingual model (mBERT), which could be attributed to the characteristics of the PAWS-X dataset. Containing samples with a high lexical overlap ratio, this dataset has been shown to be an effective measure of model sensitivity to word order and syntactic structure [yang2019pawsx]. A multilingual model such as mBERT, therefore, could capture these features as well as a monolingual model.
|Results reported in [yang2019pawsx].|
As this task was also considered in [martin2019camembert, CamemBERT], we replicate the same experimental setup here for a fair comparison. Similar to paraphrasing, the model input of this task is also a pair of sentences. The classification head, however, consists of only one dropout layer followed by one linear layer.
We report the final accuracy for each model in Table 5. The results confirm the superiority of the French models compared to multilingual models (mBERT) for this task. FlauBERT$_\text{BASE}$ performs moderately better than CamemBERT. However, neither of them could exceed the performance of XLM-R.
|Results reported in [conneau2019unsupervised].|
|Results reported in [martin2019camembert].|
We use the parser described by kitaev-klein-2018-constituency and kitaev-etal-2019-multilingual. It is an openly available chart parser based on a self-attentive encoder (https://github.com/nikitakit/self-attentive-parser). We compare (i) a model without any pre-trained parameters, (ii) a model that additionally uses and fine-tunes fastText (https://fasttext.cc/) pre-trained embeddings, (iii-v) models based on large-scale pre-trained language models: mBERT, CamemBERT, and FlauBERT. We use the default hyperparameters from kitaev-klein-2018-constituency for the first two settings and the hyperparameters from kitaev-etal-2019-multilingual when using pre-trained language models. We jointly perform part-of-speech (POS) tagging based on the same input as the parser, in a multitask setting. For each setting we perform training three times with different random seeds and select the best model according to development F-score.
For final evaluation, we use the evaluation tool provided by the SPMRL shared task organizers (http://pauillac.inria.fr/~seddah/evalb_spmrl2013.tar.gz) and report labelled F-score, the standard metric for constituency parsing evaluation, as well as POS tagging accuracy.
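As a rough illustration of what a labelled F-score measures, the sketch below compares gold and predicted constituents represented as `(label, start, end)` triples. It is a simplified stand-in, not the SPMRL evaluation tool: that tool applies additional conventions (for instance regarding which nodes are scored) that are ignored here.

```python
from collections import Counter


def labelled_f_score(gold_trees, pred_trees):
    """Corpus-level labelled F-score over (label, start, end) constituents."""
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_trees, pred_trees):
        g, p = Counter(gold), Counter(pred)
        matched += sum((g & p).values())  # constituents with same label and span
        gold_total += sum(g.values())
        pred_total += sum(p.values())
    if matched == 0:
        return 0.0
    precision = matched / pred_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)
```

With a gold tree `[("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]` and a prediction that gets the `NP` span wrong, two of three constituents match on each side, so precision, recall and F-score all equal 2/3.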
|Best published [kitaev-etal-2019-multilingual]||87.42|
|fastText pre-trained embeddings||84.09||97.6||83.64||97.7|
We report constituency parsing results in Table 6. Without pre-training, we replicate the result from kitaev-klein-2018-constituency. FastText pre-trained embeddings do not bring improvement over this already strong model. When using pre-trained language models, we observe that CamemBERT, with its language-specific training, improves over mBERT by 0.9 absolute F$_1$. FlauBERT outperforms CamemBERT by 0.7 absolute F$_1$ on the test set and obtains the best published results on the task for a single model. Regarding POS tagging, all large-scale pre-trained language models obtain similar results (98.1-98.2), and outperform models without pre-training or with fastText embeddings (97.5-97.7).
In order to assess whether FlauBERT and CamemBERT are complementary for this task, we evaluate an ensemble of both models (last line in Table 6). The ensemble model improves by 0.4 absolute F$_1$ over FlauBERT on the development set and 0.2 on the test set, obtaining the highest result for the task. This result suggests that both pre-trained language models are complementary and have their own strengths and weaknesses.
The disambiguation was performed with the same supervised WSD method used by segonne2019using. First, we compute sense vector representations from the examples found in the Wiktionary sense inventory: given a sense $s$ and its corresponding examples, we compute the vector representation of $s$ by averaging the vector representations of its examples. Then, we tag each test instance with the sense whose representation is the closest based on cosine similarity. We used the contextual embeddings output by FlauBERT as vector representations for any given instance (from the sense inventory or the test data) of a target word. We proceeded the same way with MultilingualBERT and CamemBERT for comparison. We also compared our model with a simpler context vector representation called averaged word embeddings (AWE), which consists in representing the context of a target word by averaging its surrounding words in a given window size. We experimented with AWE using fastText word embeddings with a window of size 5. We report results in Table 7. Monolingual models are significantly better than multilingual BERT. We also observe that FlauBERT has lower performance than CamemBERT, and future work will investigate this difference in more detail.
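The two steps of this method (averaging example vectors per sense, then nearest-sense tagging by cosine similarity) can be sketched in plain Python. The vectors would in practice be contextual embeddings of the target word; here they are just lists of floats, and the sense identifiers in the usage note are made up for the example.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_sense_vectors(examples_by_sense):
    """Map each sense to the average of the vectors of its examples."""
    return {sense: mean_vector(vecs) for sense, vecs in examples_by_sense.items()}


def disambiguate(instance_vec, sense_vectors):
    """Return the sense whose vector is closest to the instance vector."""
    return max(sense_vectors, key=lambda s: cosine(instance_vec, sense_vectors[s]))
```

For example, with senses built from `{"voler_1": [[1.0, 0.0], [0.8, 0.2]], "voler_2": [[0.0, 1.0], [0.2, 0.8]]}`, an instance vector `[0.9, 0.1]` is tagged `voler_1`.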
We implemented a neural classifier similar to the classifier presented by vialhal02131872. This classifier forwards the output of a pre-trained language model to a stack of 6 trained Transformer encoder layers and predicts the synset of every input word through softmax. The only difference between our model and vialhal02131872 is that we chose the same hyper-parameters as for FlauBERT$_\text{BASE}$ for the feed-forward dimension $d_{ff}$ and the number of attention heads of the Transformer layers (more precisely, $d_{ff} = 3072$ and $A = 12$).
At prediction time, we take the synset ID which has the maximum value along the softmax layer (no filter on the lemma of the target is performed). We trained 4 models for every experiment, and we give the mean result and the standard deviation of the individual models, as well as the result of an ensemble of models, which averages the output of the softmax layer. Finally, we compared FlauBERT with CamemBERT, MultilingualBERT, fastText and with no input embeddings. We report results in Table 8.
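The ensembling rule described here, averaging the softmax outputs of several models and taking the argmax, can be sketched as follows. The logits and synset IDs in the usage note are placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, synset_ids):
    """Average the softmax outputs of all models and return (best ID, averages)."""
    probs = [softmax(logits) for logits in per_model_logits]
    avg = [sum(col) / len(probs) for col in zip(*probs)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return synset_ids[best], avg
```

With a single model this reduces to the plain prediction rule: `ensemble_predict([[0.2, 1.5, 0.1]], ["s1", "s2", "s3"])` returns `s2` as the predicted synset.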
On this task and with these settings, we first observe an advantage for MultilingualBERT over both CamemBERT and FlauBERT$_\text{BASE}$. We think that this might be due to the fact that the training corpora we used are machine translated from English to French, so the multilingual nature of mBERT probably makes it better suited to the task. Comparing CamemBERT to FlauBERT$_\text{BASE}$, while we see a small improvement in the former model, we think that this might be due to the difference in the capitalization process of words. Indeed, FlauBERT$_\text{BASE}$ uses a lowercase vocabulary while CamemBERT does not. However, our upcoming model keeps the true capitalization of words. We will update the paper with results using this model in the final version, should it be accepted.
We present and release FlauBERT, a pre-trained language model for French. FlauBERT is trained on a multiple-source corpus and obtains state-of-the-art results on a number of French NLP tasks, ranging from text classification and paraphrasing to syntactic parsing. FlauBERT is competitive with CamemBERT, another pre-trained language model for French developed in parallel, despite being trained on almost half as much text data.
In order to make the pipeline entirely reproducible, we not only release preprocessing and training scripts, together with FlauBERT$_\text{BASE}$, but also provide a general benchmark for evaluating French NLP systems (FLUE). In the final version of the paper, we will release FlauBERT$_\text{LARGE}$, a larger version of FlauBERT, still in training at the time of the submission of the paper.
This work benefited from the ‘Grand Challenge Jean Zay’ program which granted us computing time on the new CNRS Jean Zay supercomputer.
We thank Guillaume Lample and Alexis Conneau for their active technical support on using the XLM code.
Table 9 presents the statistics of all sub-corpora in our training corpus. We give the descriptions of each sub-corpus below.
|Dataset||Post-processed text size||Number of Tokens (Moses)||Number of Sentences|
|CommonCrawl [Buck-commoncrawl]||43.4 GB||7.85 B||293.37 M|
|NewsCrawl [li2019findings]||9.2 GB||1.69 B||63.05 M|
|Wikipedia [wiki:xxx]||4.2 GB||750.76 M||31.00 M|
|Wikisource [wiki:xxx]||2.4 GB||458.85 M||27.05 M|
|EU Bookshop [skadicnvsbillions]||2.3 GB||389.40 M||13.18 M|
|MultiUN [eisele2010multiun]||2.3 GB||384.42 M||10.66 M|
|GIGA [TIEDEMANN12.463]||2.0 GB||353.33 M||10.65 M|
|PCT||1.2 GB||197.48 M||7.13 M|
|Project Gutenberg||1.1 GB||219.73 M||8.23 M|
|OpenSubtitles [lison2016opensubtitles2015]||1.1 GB||218.85 M||13.98 M|
|Le Monde||664 MB||122.97 M||4.79 M|
|DGT [TIEDEMANN12.463]||311 MB||53.31 M||1.73 M|
|EuroParl [koehn2005europarl]||292 MB||50.44 M||1.64 M|
|EnronSent [styler2011enronsent]||73 MB||13.72 M||662.31 K|
|NewsCommentary [li2019findings]||61 MB||13.40 M||341.29 K|
|Wiktionary [wiki:xxx]||52 MB||9.68 M||474.08 K|
|Global Voices [TIEDEMANN12.463]||44 MB||7.88 M||297.38 K|
|Wikinews [wiki:xxx]||21 MB||3.93 M||174.88 K|
|TED Talks [TIEDEMANN12.463]||15 MB||2.92 M||129.31 K|
|Wikiversity [wiki:xxx]||10 MB||1.70 M||64.60 K|
|Wikibooks [wiki:xxx]||9 MB||1.67 M||65.19 K|
|Wikiquote [wiki:xxx]||5 MB||866.22 K||42.27 K|
|Wikivoyage [wiki:xxx]||3 MB||500.64 K||23.36 K|
|EUconst [TIEDEMANN12.463]||889 KB||148.47 K||4.70 K|
|Total||71 GB||12.79 B||488.78 M|
We used four corpora provided in the WMT19 shared task (http://www.statmt.org/wmt19/translation-task.html) [li2019findings].
Common Crawl includes text crawled from billions of pages on the Internet.
News Crawl contains crawled news collected from 2007 to 2018.
EuroParl comprises text extracted from the proceedings of the European Parliament.
News Commentary consists of text from news-commentary crawl.
OPUS (http://opus.nlpl.eu) is a growing resource of freely accessible monolingual and parallel corpora [TIEDEMANN12.463]. We collected the following French monolingual datasets from OPUS.
OpenSubtitles comprises translated movies and TV subtitles.
EU Bookshop includes publications from the European institutions.
MultiUN comprises documents from the United Nations.
GIGA consists of newswire text and is made available in the WMT10 shared task (https://www.statmt.org/wmt10/).
DGT contains translation memories provided by the Joint Research Center (JRC).
Global Voices encompasses news stories from the website Global Voices.
TED Talks includes subtitles from TED talks (https://www.ted.com) videos.
EUconst consists of text from the European constitution.
The Wikimedia projects include Wikipedia, Wiktionary, Wikiversity, etc. The content is built collaboratively by volunteers around the world (https://dumps.wikimedia.org/other/cirrussearch/current/).
Wikipedia is a free online encyclopedia including high-quality text covering a wide range of topics.
Wikisource includes source texts in the public domain.
Wikinews contains free-content news.
Wiktionary is an open-source dictionary of words, phrases etc.
Wikiversity comprises learning resources and learning projects or research.
Wikibooks includes open-content books.
Wikiquote consists of sourced quotations from notable people and creative works.
Wikivoyage includes information about travelling.
PCT consists of patent documents collected and maintained internally by the GETALP team (http://lig-getalp.imag.fr/en/home/).
Project Gutenberg contains free eBooks of different genres, which are mostly the world’s older classic works of literature for which copyright has expired.
Le Monde consists of articles from Le Monde (https://www.lemonde.fr) from 1987 to 2003, collected and maintained internally by the GETALP team.
The EnronSent [styler2011enronsent] corpus is a part of the Enron Email Dataset, a massive dataset containing 500K messages from senior management executives at the Enron Corporation.