From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

02/18/2022
by Simon Gabay, et al.

Language models for historical states of language are becoming increasingly important for the optimal digitisation and analysis of old textual sources. Because these historical states are both more complex to process and scarcer in the available corpora, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, where it outperforms previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEM_max corpus.


