Data-Efficient French Language Modeling with CamemBERTa

06/02/2023
by   Wissam Antoun, et al.

Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model's performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same number of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of the total number of training tokens. In addition to these results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective. https://gitlab.inria.fr/almanach/CamemBERTa
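The DeBERTaV3 training objective referenced above follows ELECTRA-style replaced token detection: a small generator proposes replacements for masked tokens, and the discriminator is trained to classify each token as original or replaced. The sketch below is a minimal, plain-Python illustration of the discriminator's per-token binary cross-entropy loss; it is not the authors' released implementation, and the function names are invented for illustration.

```python
import math

def rtd_labels(original_ids, corrupted_ids):
    # 1 where the generator replaced the token, 0 where it kept the original
    return [int(o != c) for o, c in zip(original_ids, corrupted_ids)]

def rtd_loss(logits, labels):
    # Mean binary cross-entropy over per-token "was this token replaced?" logits
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability of "replaced"
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Toy example: the generator swapped the middle token of a 3-token sequence
original = [5, 9, 2]
corrupted = [5, 7, 2]
labels = rtd_labels(original, corrupted)   # [0, 1, 0]
loss = rtd_loss([-2.0, 3.0, -1.5], labels)
```

Unlike MLM, which only receives a learning signal from the ~15% of masked positions, this loss is computed over every token in the sequence, which is the usual explanation for the data efficiency gains reported here.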
