CamemBERT: a Tasty French Language Model

11/10/2019 · by Louis Martin et al. · Facebook AI, Inria

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes their practical use, in all languages except English, very limited. Aiming to address this issue for French, we release CamemBERT, a French version of Bidirectional Encoder Representations from Transformers (BERT). We measure the performance of CamemBERT against multilingual models on multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the pretrained model for CamemBERT, hoping to foster research and downstream applications for French NLP.


1 Introduction

Pretrained word representations have a long history in Natural Language Processing (NLP), from non-neural methods [16, 6, 13] to neural word embeddings [53, 57] and to contextualised representations [58, 3]. Approaches shifted more recently from using these representations as an input to task-specific architectures to replacing these architectures with large pretrained language models. These models are then fine-tuned to the task at hand with large improvements in performance over a wide range of tasks [25, 64, 50, 65].

These transfer learning methods exhibit clear advantages over more traditional task-specific approaches, probably the most important being that they can be trained in an unsupervised manner. They nevertheless come with implementation challenges, namely the amount of data and computational resources needed for pretraining, which can reach hundreds of gigabytes of uncompressed text and require hundreds of GPUs [82, 50]. The latest transformer architecture uses as much as 750GB of plain text and 1024 TPU v3 chips (an ASIC capable of 420 teraflops with 128GB of high-bandwidth memory, https://cloud.google.com/tpu/) for pretraining [65]. This has limited the availability of these state-of-the-art models to the English language, at least in the monolingual setting. Even though multilingual models give remarkable results, they are often larger and their results still lag behind those of their monolingual counterparts [47]. This is particularly inconvenient as it hinders their practical use in NLP systems as well as the investigation of their language modeling capacity, for instance in the case of morphologically rich languages.

We take advantage of the newly available multilingual corpus OSCAR [55] and train a monolingual language model for French using the RoBERTa architecture. We pretrain the model, which we dub CamemBERT, and evaluate it on four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI). CamemBERT improves the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French.

We summarise our contributions as follows:

  • We train a monolingual BERT model on the French language using recent large-scale corpora.

  • We evaluate our model on four downstream tasks (POS tagging, dependency parsing, NER and natural language inference (NLI)), achieving state-of-the-art results in most tasks, confirming the effectiveness of large BERT-based models for French.

  • We release our model in a user-friendly format for popular open-source libraries so that it can serve as a strong baseline for future research and be useful for French NLP practitioners.

    (Model released at https://camembert-model.fr)

2 Related Work

From non-contextual to contextual word embeddings

The first neural word vector representations were non-contextualised word embeddings, most notably word2vec [53], GloVe [57] and fastText [52], which were designed to be used as input to task-specific neural architectures. Contextualised word representations such as ELMo [58] and flair [3] improved the expressivity of word embeddings by taking context into account. They improved the performance of downstream tasks when they replaced traditional word representations. This paved the way towards larger contextualised models that replaced the downstream architectures themselves in most tasks. These approaches, trained with language modeling objectives, range from LSTM-based architectures such as ULMFiT [33] to the successful transformer-based architectures such as GPT2 [64], BERT [25], RoBERTa [50] and, more recently, ALBERT [48] and T5 [65].

Non-contextual word embeddings for languages other than English

Since the introduction of word2vec [53], many attempts have been made to create monolingual models for a wide range of languages. For non-contextual word embeddings, the first two attempts were by [4] and [14], who created word embeddings for a large number of languages using Wikipedia. Later, [28] trained fastText word embeddings for 157 languages using Common Crawl and showed that using crawled data significantly increases the performance of the embeddings relative to those trained only on Wikipedia.

Contextualised models for languages other than English

Following the success of large pretrained language models, they were extended to the multilingual setting with multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md), a single multilingual model for 104 different languages trained on Wikipedia data, and later XLM [47], which greatly improved unsupervised machine translation. A few monolingual models have been released: ELMo models for Japanese, Portuguese, German and Basque (https://allennlp.org/elmo), and BERT for Simplified and Traditional Chinese and German (https://deepset.ai/german-bert).

However, to the best of our knowledge, no particular effort has been made toward training models for languages other than English at a scale similar to the latest English models (e.g. RoBERTa, trained on more than 100GB of data).

3 CamemBERT

Our approach is based on RoBERTa [50], which replicates and improves upon the original BERT by identifying key hyper-parameters for more robust performance.

In this section, we describe the architecture, training objective, optimisation setup and pretraining data used for CamemBERT.

CamemBERT differs from RoBERTa mainly in its use of whole-word masking and SentencePiece tokenisation [42].

Architecture

Similar to RoBERTa and BERT, CamemBERT is a multi-layer bidirectional Transformer [76]. Given the widespread usage of Transformers, we do not describe them in detail here and refer the reader to [76]. CamemBERT uses the original BERT-base configuration: 12 layers, 768 hidden dimensions and 12 attention heads, which amounts to 110M parameters.
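As an illustration, this configuration can be instantiated with the Hugging Face Transformers library roughly as follows; the vocabulary size and the use of the CamembertConfig class are assumptions based on the released model rather than details taken from the paper.

```python
from transformers import CamembertConfig, CamembertModel

# BERT-base layout used by CamemBERT: 12 layers, 768 hidden units, 12 heads.
# vocab_size=32005 is assumed here (32k SentencePiece tokens plus specials).
config = CamembertConfig(
    vocab_size=32005,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
model = CamembertModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```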

Pretraining objective

We train our model on the Masked Language Modeling (MLM) task. Given an input text sequence composed of N tokens x_1, ..., x_N, we select 15% of the tokens for possible replacement. Among those selected tokens, 80% are replaced with the special mask token, 10% are left unchanged and 10% are replaced by a random token. The model is then trained to predict the initial masked tokens using a cross-entropy loss.
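As a concrete illustration, the 80/10/10 replacement scheme can be sketched in PyTorch as follows; this is a simplification that ignores special tokens and whole-word masking (described below), not the fairseq implementation.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Select ~15% of positions, then apply the 80/10/10 mask/random/keep split.
    Illustrative sketch only; special tokens should be excluded in practice."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # non-selected positions are ignored by the loss

    # 80% of selected positions -> mask token
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[to_mask] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    # the remaining 10% are left unchanged
    return input_ids, labels
```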

Following RoBERTa we dynamically mask tokens instead of fixing them statically for the whole dataset during preprocessing. This improves variability and makes the model more robust when training for multiple epochs.

Since we segment the input sentence into subwords using SentencePiece, the input tokens to the model can be subwords. An upgraded version of BERT (https://github.com/google-research/bert/blob/master/README.md) and [35] have shown that masking whole words instead of individual subwords leads to improved performance. Whole-word masking (WWM) makes the training task more difficult because the model has to predict a whole word rather than only part of it given the rest. We therefore use WWM for CamemBERT: we first randomly sample 15% of the words in the sequence and then consider all subword tokens of these sampled words as candidates for replacement. This amounts to a proportion of selected tokens that is close to the original 15%. These tokens are then either replaced by mask tokens (80%), left unchanged (10%) or replaced by a random token (10%).
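A minimal sketch of the whole-word candidate selection, under the assumption that the default SentencePiece convention holds ('▁' marks the start of a word); special tokens are not handled here.

```python
import random

def whole_word_candidates(subword_tokens, word_prob=0.15):
    """Group subwords into words and sample ~15% of the words; every subword
    position of a sampled word becomes a replacement candidate."""
    words = []  # list of lists of token positions
    for i, tok in enumerate(subword_tokens):
        if tok.startswith("\u2581") or not words:  # '▁' starts a new word
            words.append([i])
        else:
            words[-1].append(i)

    candidates = []
    for positions in words:
        if random.random() < word_prob:
            candidates.extend(positions)  # mask the whole word
    return candidates  # the 80/10/10 split is then applied to these positions
```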

Subsequent work has shown that the next sentence prediction (NSP) task originally used in BERT does not improve downstream task performance [47, 50]; we therefore do not use NSP.

Optimisation

Following [50], we optimise the model using Adam [38] (β1 = 0.9, β2 = 0.98) for 100k steps. We use large batch sizes of 8192 sequences, each containing at most 512 tokens, and we enforce that each sequence contains only complete sentences. Additionally, we use the DOC-SENTENCES scenario from [50], which avoids mixing multiple documents in the same sequence and showed slightly better results.
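For illustration, DOC-SENTENCES packing can be sketched as follows; this is a simplification of the fairseq behaviour, and the function and argument names are ours.

```python
def pack_doc_sentences(documents, max_tokens=512):
    """Fill each training sequence with complete sentences from a single
    document, never crossing document boundaries and never exceeding
    max_tokens. `documents` is a list of documents, each a list of
    already-tokenised sentences (lists of token ids). Sentences longer than
    max_tokens would need truncation, which is omitted here."""
    sequences = []
    for doc in documents:
        current = []
        for sentence in doc:
            if current and len(current) + len(sentence) > max_tokens:
                sequences.append(current)
                current = []
            current = current + sentence
        if current:  # flush the last (possibly short) sequence of the document
            sequences.append(current)
    return sequences
```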

Segmentation into subword units

We segment the input text into subword units using SentencePiece [42]. SentencePiece is an extension of Byte-Pair Encoding (BPE) [72] and WordPiece [43] that does not require pre-tokenisation (at the word or token level), thus removing the need for language-specific tokenisers. We use a vocabulary size of 32k subword tokens, learned on sentences sampled from the pretraining dataset. For simplicity, we do not use subword regularisation (i.e. sampling from multiple possible segmentations) in our implementation.
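For illustration, a 32k-subword SentencePiece model can be trained with the sentencepiece Python package roughly as follows; the file names, model prefix and any options beyond the vocabulary size are placeholders, not the authors' exact settings.

```python
import sentencepiece as spm

# Train a 32k-subword model on a sample of the pretraining text
# (one sentence per line; the path is a placeholder).
spm.SentencePieceTrainer.train(
    input="oscar_fr_sample.txt",
    model_prefix="camembert_sp",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="camembert_sp.model")
# Segmentation of a French sentence into subword pieces.
print(sp.encode("Le camembert est délicieux.", out_type=str))
```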

Pretraining data

Pretrained language models can be significantly improved by using more data [50, 65]. We therefore use French text extracted from Common Crawl (https://commoncrawl.org/about/); in particular, we use OSCAR [55], a pre-classified and pre-filtered version of the November 2018 Common Crawl snapshot.

OSCAR is a set of monolingual corpora extracted from Common Crawl, specifically from the plain text WET format distributed by Common Crawl, which removes all HTML tags and converts all text encodings to UTF-8. OSCAR follows the same approach as [28] by using a language classification model based on the fastText linear classifier [36, 29] pretrained on Wikipedia, Tatoeba and SETimes, which supports 176 different languages.

OSCAR performs a deduplication step after language classification. It introduces no specialised filtering scheme other than keeping only paragraphs containing 100 or more UTF-8-encoded characters, which keeps OSCAR quite close to the original crawled data.
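A simplified sketch of such a pipeline, using the public fastText language-identification model (lid.176.bin) with a minimum-length filter and exact-match deduplication; the file paths are placeholders, and OSCAR's actual implementation differs in its details.

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")   # public fastText language-ID model
seen = set()

with open("commoncrawl_paragraphs.txt", encoding="utf-8") as f, \
     open("oscar_fr_like.txt", "w", encoding="utf-8") as out:
    for paragraph in f:
        paragraph = paragraph.strip()
        if len(paragraph) < 100:               # keep paragraphs >= 100 characters
            continue
        labels, _probs = lid.predict(paragraph)
        if labels[0] != "__label__fr":         # keep only French paragraphs
            continue
        if paragraph in seen:                  # simple exact-match deduplication
            continue
        seen.add(paragraph)
        out.write(paragraph + "\n")
```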

We use the unshuffled version of the French OSCAR corpus, which amounts to 138GB of uncompressed text and 32.7B SentencePiece tokens.

4 Evaluation

4.1 Part-of-speech tagging and dependency parsing

We first evaluate CamemBERT on the two downstream tasks of part-of-speech (POS) tagging and dependency parsing. POS tagging is a low-level syntactic task which consists in assigning to each word its corresponding grammatical category. Dependency parsing consists in predicting the labeled syntactic tree that captures the syntactic relations between words.

We run our experiments using the Universal Dependencies (UD) paradigm and its corresponding UD POS tag set [59] and UD treebank collection version 2.2 [75], which was used for the CoNLL 2018 shared task. We perform our work on the four freely available French UD treebanks in UD v2.2: GSD, Sequoia, Spoken, and ParTUT.

GSD [51] is the second-largest treebank available for French after the FTB (described in subsection 4.2); it contains data from blogs, news articles, reviews and Wikipedia. The Sequoia treebank (https://deep-sequoia.inria.fr) [22, 21] comprises more than 3000 sentences from the French Europarl, the regional newspaper L’Est Républicain, the French Wikipedia and documents from the European Medicines Agency. Spoken is a corpus converted automatically from the Rhapsodie treebank (https://www.projet-rhapsodie.fr) [44, 10] with manual corrections. It consists of 57 sound samples of spoken French with orthographic transcription and phonetic transcription aligned with sound (word boundaries, syllables and phonemes), as well as syntactic and prosodic annotations. Finally, ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others; the ParTUT data is derived from the already-existing parallel treebank Par(allel)TUT [70]. Table 1 summarises the sizes of the treebanks (https://universaldependencies.org).

Treebank N. Tokens N. Words N. Sentences
GSD 389,363 400,387 16,342
Sequoia 68,615 70,567 3,099
Spoken 34,972 34,972 2,786
ParTUT 27,658 28,594 1,020
Table 1: Sizes in number of tokens, words and sentences of the 4 treebanks used in the evaluation of POS tagging and dependency parsing.

We evaluate the performance of our models using the standard UPOS accuracy for POS tagging, and Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) for dependency parsing. We assume gold tokenisation and gold word segmentation as provided in the UD treebanks.
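Under gold tokenisation, these metrics reduce to simple per-word comparisons; a minimal sketch:

```python
def upos_accuracy(gold_tags, pred_tags):
    """Fraction of words with the correct UPOS tag."""
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """UAS: fraction of words whose predicted head is correct.
       LAS: fraction whose head and dependency label are both correct.
       Inputs are parallel lists over all words of a treebank."""
    total = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / total
    las = sum(
        gh == ph and gl == pl
        for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels)
    ) / total
    return uas, las
```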

Baselines

To demonstrate the value of building a dedicated version of BERT for French, we first compare CamemBERT to the multilingual cased version of BERT (designated as mBERT). We then compare our models to UDify [40]. UDify is a multitask and multilingual model based on mBERT that is near state-of-the-art on all UD languages including French for both POS tagging and dependency parsing.

It is relevant to compare CamemBERT to UDify on these tasks because UDify is the work that has pushed the performance of fine-tuning a BERT-based model end-to-end on downstream POS tagging and dependency parsing the furthest. Finally, we compare our model to UDPipe Future [73], a model ranked 3rd in dependency parsing and 6th in POS tagging in the CoNLL 2018 shared task [71]. UDPipe Future provides us with a strong baseline that does not make use of any pretrained contextual embeddings.

We will compare to the more recent cross-lingual language model XLM [47], as well as the state-of-the-art CoNLL 2018 shared task results with predicted tokenisation and segmentation in an updated version of the paper.

4.2 Named Entity Recognition

Named Entity Recognition (NER) is a sequence labeling task that consists in predicting which words refer to real-world objects, such as people, locations, artifacts and organisations. We use the French Treebank (FTB) [2] in its 2008 version introduced by Candito and Crabbé (2009), with NER annotations by Sagot et al. (2012). (This dataset has only been stored and used on Inria’s servers after signing the research-only agreement.) The NER-annotated FTB contains more than 12k sentences and more than 350k tokens extracted from articles of the newspaper Le Monde published between 1989 and 1995. In total, it contains 11,636 entity mentions distributed among 7 different types of entities, namely: 2025 mentions of “Person”, 3761 of “Location”, 2382 of “Organisation”, 3357 of “Company”, 67 of “Product”, 15 of “POI” (Point of Interest) and 29 of “Fictional Character”.

A large proportion of the entity mentions in the treebank are multi-word entities. For NER we therefore report the three metrics commonly used to evaluate models: precision, recall and F1 score. Precision measures the percentage of entities predicted by the system that are correctly tagged, recall measures the percentage of named entities present in the corpus that are found by the system, and the F1 score is the harmonic mean of precision and recall, giving a general idea of a model’s performance.
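For reference, a minimal sketch of entity-level precision, recall and F1 computed over sets of (start, end, type) spans, where an entity counts as correct only if both its boundaries and its type match:

```python
def entity_prf(gold_entities, pred_entities):
    """gold_entities and pred_entities are sets of (start, end, type) spans."""
    correct = len(gold_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(gold_entities) if gold_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```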

Baselines

Most of the advances in NER have been achieved on English, particularly focusing on the CoNLL 2003 [68] and the OntoNotes v5 [61, 60] English corpora. NER was traditionally tackled using Conditional Random Fields (CRFs) [45], which are quite suited to the task; CRFs were later used as decoding layers for Bi-LSTM architectures [34, 46], showing considerable improvements over CRFs alone. These Bi-LSTM-CRF architectures were then enhanced with contextualised word embeddings, which yet again brought major improvements to the task [58, 3]. Finally, large pretrained architectures established the current state of the art, showing a small yet important improvement over previous NER-specific architectures [25, 8].

For non-English NER, the CoNLL 2002 shared task included NER corpora for Spanish and Dutch [69], while CoNLL 2003 included a German corpus [68]. Here, the recent efforts of [74] established the state of the art for Spanish and Dutch, while [3] did so for German.

In French, no extensive work has been done due to the limited availability of NER corpora. We compare our model with the strong baselines established by [27], who trained both CRF and BiLSTM-CRF architectures on the FTB and enhanced them using heuristics and pretrained word embeddings.

4.3 Natural Language Inference

We also evaluate our model on the Natural Language Inference (NLI) task, using the French part of the XNLI dataset [23]. NLI consists in predicting whether a hypothesis sentence is entailed by, is neutral with respect to, or contradicts a premise sentence.

The XNLI dataset is the extension of the Multi-Genre NLI (MultiNLI) corpus [79] to 15 languages, obtained by manually translating the validation and test sets into each of those languages; the English training set is also machine translated into all languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples. As usual, NLI performance is evaluated using accuracy.

To evaluate a model on a language other than English (such as French), we consider the two following settings:

  • TRANSLATE-TEST: The French test set is machine translated into English, and then used with an English classification model. This setting provides a reasonable, although imperfect, way to circumvent the fact that no such data set exists for French, and results in very strong baseline scores.

  • TRANSLATE-TRAIN: The French model is fine-tuned on the machine-translated English training set and then evaluated on the French test set. This is the setting that we used for CamemBERT.

Baselines

For the TRANSLATE-TEST setting, we report results of the English RoBERTa to act as a reference.

In the TRANSLATE-TRAIN setting, we report the best scores from the previous literature along with ours: BiLSTM-max, the best model in the original XNLI paper; mBERT, whose French scores were reported in [81]; and XLM (MLM+TLM), the best model presented in [23].

5 Experiments

Model GSD Sequoia Spoken ParTUT
UPOS UAS LAS UPOS UAS LAS UPOS UAS LAS UPOS UAS LAS
UDPipe Future 97.63 90.65 88.06 98.79 92.37 90.73 95.91 82.90 77.53 96.93 92.17 89.63
UDify 97.83 93.60 91.45 97.89 92.53 90.05 96.23 85.24 80.01 96.12 90.55 88.06
mBERT 97.48 92.72 89.73 98.41 93.24 91.24 96.02 84.65 78.63 97.35 94.18 91.37
CamemBERT 98.19 94.82 92.47 99.21 95.56 94.39 96.68 86.05 80.07 97.63 95.21 92.90
Table 2: Final POS tagging and dependency parsing scores of CamemBERT, mBERT (fine-tuned in exactly the same conditions as CamemBERT), UDify as reported in the original paper, and UDPipe Future, on the test sets of 4 French treebanks (GSD, Sequoia, Spoken and ParTUT), averaged over 4 runs and assuming gold tokenisation. Best scores in bold, second best underlined.

In this section, we measure the performance of CamemBERT by evaluating it on the four aforementioned tasks: POS tagging, dependency parsing, NER and NLI.

5.1 Experimental Setup

Pretraining

We use the RoBERTa implementation in the fairseq library [56]. Our learning rate is warmed up for 10k steps up to a peak value larger than the original one, given our large batch size (8192), and then fades to zero with polynomial decay. We pretrain our model on 256 Nvidia V100 GPUs (32GB each) for 100k steps over the course of 17 hours.
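For reference, the warm-up and decay schedule can be sketched with a standard PyTorch LambdaLR scheduler; the peak learning rate below is a placeholder (the exact value is not reproduced here), the model is a stand-in, and this is not the fairseq code.

```python
import torch
import torch.nn as nn

def lr_lambda(step, warmup=10_000, total=100_000):
    """Linear warm-up over the first 10k steps, then power-1 polynomial
    decay to zero at step 100k."""
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

model = nn.Linear(768, 768)   # stand-in for the Transformer
peak_lr = 6e-4                # placeholder peak value
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# scheduler.step() is then called once per training step.
```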

Fine-tuning

For each task, we append the relevant predictive layer on top of CamemBERT’s Transformer architecture. Following the work done on BERT [25], for sequence classification and sequence labeling we append a linear layer, respectively, to the <s> special token and to the first subword token of each word. For dependency parsing, we plug in a bi-affine graph predictor head inspired by [26], following the work on multilingual parsing with BERT by [40]. We refer the reader to these two articles for more details on this module.
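As an illustration of the tagging setup, a minimal head that classifies each word from the encoder output of its first subword could look as follows; this is a hypothetical class, not the authors' exact implementation, and the bi-affine parsing head is not shown.

```python
import torch
import torch.nn as nn
from transformers import CamembertModel

class FirstSubwordTagger(nn.Module):
    """Linear classification layer over the first subword of each word."""
    def __init__(self, num_labels, model_name="camembert-base"):
        super().__init__()
        self.encoder = CamembertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, first_subword_positions):
        hidden = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Gather the representation of the first subword of each word.
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        word_repr = hidden[batch_idx, first_subword_positions]
        return self.classifier(word_repr)  # one label distribution per word
```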

We fine-tune CamemBERT independently for each task and each dataset. We optimise the model using the Adam optimiser [38] with a fixed learning rate, running a grid search over a combination of learning rates and batch sizes. We select the best model on the validation set out of the first 30 epochs.

Although it might push the performance even further, for all tasks except NLI we do not apply any regularisation techniques such as weight decay, learning-rate warm-up or discriminative fine-tuning. We show that fine-tuning CamemBERT in a straightforward manner leads to state-of-the-art results on most tasks and outperforms the existing BERT-based models in most cases.

The POS tagging, dependency parsing and NER experiments are run using Hugging Face’s Transformers library, extended to support CamemBERT and dependency parsing [80]. The NLI experiments use the fairseq library, following the RoBERTa implementation.
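For instance, the released checkpoint can be loaded through the Transformers library as follows; this is a usage sketch based on the current library API rather than the exact setup used in the experiments, and the classification head it adds is randomly initialised and must be fine-tuned.

```python
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=3  # entailment / neutral / contradiction for NLI
)

# Encode a (premise, hypothesis) pair and get the class logits.
inputs = tokenizer("Le camembert est délicieux.", "Le fromage est bon.",
                   return_tensors="pt")
logits = model(**inputs).logits
```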

5.2 Results

Part-of-Speech tagging and dependency parsing

For POS tagging and dependency parsing, we compare CamemBERT to three other near state-of-the-art models in Table 2. CamemBERT outperforms UDPipe Future by a large margin on all treebanks and all metrics. Despite a much simpler optimisation process, CamemBERT also beats UDify on all the available French treebanks.

CamemBERT also demonstrates higher performance than mBERT on these tasks, with a larger error reduction for parsing than for tagging. For POS tagging, we observe error reductions of respectively 0.71% for GSD, 0.81% for Sequoia, 0.7% for Spoken and 0.28% for ParTUT. For parsing, we observe error reductions in LAS of 2.96% for GSD, 3.33% for Sequoia, 1.70% for Spoken and 1.65% for ParTUT.

Natural Language Inference: XNLI

Model Acc. #Params
TRANSLATE-TEST
RoBERTa (en) 82.9 355M
TRANSLATE-TRAIN
BiLSTM-max [23] 68.3 -
mBERT 76.9 175M
XLM (MLM+TLM) [47] 80.2 250M
CamemBERT 81.2 110M
Table 3: Accuracy of models for French on the XNLI test set. Best scores in bold, second to best underlined.

On the XNLI benchmark, CamemBERT obtains improved performance over multilingual language models in the TRANSLATE-TRAIN setting (81.2 vs. 80.2 accuracy for XLM) while using less than half as many parameters (110M vs. 250M). However, its performance still lags behind models trained on the original English training set in the TRANSLATE-TEST setting, 81.2 vs. 82.9 for RoBERTa. It should be noted, however, that CamemBERT uses far fewer parameters than RoBERTa (110M vs. 355M).

Model Precision Recall F1
SEM (CRF) [27] 87.89 82.34 85.02
LSTM-CRF [27] 87.23 83.96 85.57
mBERT 81.80 83.73 82.75
CamemBERT (subword masking) 88.35 87.46 87.93
Table 4: Results for NER on the FTB. Best scores in bold, second best underlined. For CamemBERT we use subword masking here instead of WWM because of better performance on the validation set; see Table 5 in the appendix for details.

Named-Entity Recognition

For named entity recognition, our experiments show that CamemBERT achieves slightly better precision than the traditional CRF-based SEM architectures described above in Section 4.2 (CRF and BiLSTM-CRF), but a dramatic improvement in finding entity mentions, raising the recall score by 3.5 points. Both improvements result in a 2.36-point increase in F1 score with respect to the best SEM architecture (BiLSTM-CRF), establishing a new state of the art for NER on the FTB. Another important finding concerns the results obtained by mBERT. Previous work showed increased NER performance for German, Dutch and Spanish when mBERT is used as a contextualised word embedding within an NER-specific model [74], but our results suggest that the multilingual setting in which mBERT was trained is simply not enough to use it alone and fine-tune it for French NER, as it performs worse than even simple CRF models. This suggests that monolingual models could be better suited to NER.

5.3 Discussion

CamemBERT displays improved performance compared to prior work for the 4 downstream tasks considered. This confirms the hypothesis that pretrained language models can be effectively fine-tuned for various downstream tasks, as observed for English in previous work. Moreover, our results also show that dedicated monolingual models still outperform multilingual ones. We explain this point in two ways. First, the scale of the data is possibly essential to the performance of CamemBERT: we use 138GB of uncompressed text, versus roughly 57GB for mBERT (an estimate based on the size of the English Wikipedia, 12GB, and the claim on GitHub that “after concatenating all of the Wikipedias together, 21% of our data is English”). Second, with more data comes more diversity in the pretraining distribution. Reaching state-of-the-art performance on 4 different tasks and 6 different datasets requires robust pretrained models. Our results suggest that the variability in the downstream tasks and datasets considered is handled more efficiently by a general language model than by Wikipedia-pretrained models such as mBERT.

6 Conclusion

CamemBERT improves the state of the art for multiple downstream tasks in French. It is also lighter than other BERT-based approaches such as mBERT or XLM. By releasing our model, we hope that it can serve as a strong baseline for future research in French NLP, and we expect our experiments to be reproduced in many other languages. We will publish an updated version in the near future in which we will explore and release models trained for longer, with additional downstream tasks, baselines (e.g. XLM) and analyses. We will also train additional models with potentially cleaner corpora, such as CCNet [78], for more accurate performance evaluation and more complete ablations.

Acknowledgments

This work was partly funded by three French National grants from the Agence Nationale de la Recherche, namely projects PARSITI (ANR-16-CE33-0021), SoSweet (ANR-15-CE38-0011) and BASNUM (ANR-18-CE38-0003), as well as by the last author’s chair in the PRAIRIE institute (http://prairie-institute.fr/).

References

7 Appendix

7.1 Impact of Whole-Word Masking

Model Masking Steps UPOS UAS LAS NER XNLI
CamemBERT subword 100k 97.68 92.93 89.79 89.60 80.79
CamemBERT WWM 100k 97.75 93.06 89.90 88.39 81.71
Table 5: Comparison of subword and whole-word masking procedures on the validation sets of each task. Each score is an average of 4 runs with different random seeds. For POS tagging and dependency parsing, we average the scores over the 4 treebanks.

We analyse the impact of whole-word masking on the downstream performance of CamemBERT. As reported for English on other downstream tasks, whole-word masking improves downstream performance for all tasks except NER, as seen in Table 5. NER is highly sensitive to capitalisation, prefixes, suffixes and other subword features that a model can use to correctly identify entity mentions. The information gained by predicting masked tokens at the subword level therefore appears useful for NER, and replacing subword masking with whole-word masking seems to have a detrimental effect on downstream NER results.