RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

11/15/2022
by Pieter Delobelle, et al.

Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for fine-tuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
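The first step the abstract describes, extending a tokenizer's vocabulary with new high-frequency tokens from a more recent corpus, can be illustrated with a minimal sketch. This is a toy whitespace tokenizer with a hypothetical `extend_vocab` helper and an invented mini-corpus, not RobBERT's actual BPE tokenizer or the OSCAR data; the real pipeline would also resize the model's embedding matrix to match the new vocabulary size.

```python
from collections import Counter

def extend_vocab(vocab, corpus, min_freq=2):
    """Return vocab extended with tokens that appear at least
    min_freq times in the new corpus but are missing from vocab."""
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    known = set(vocab)
    new_tokens = sorted(t for t, c in counts.items()
                        if c >= min_freq and t not in known)
    return vocab + new_tokens

# Invented example: a small pre-2020 vocabulary and a newer corpus
# containing corona-related Dutch words.
old_vocab = ["de", "het", "een", "taal"]
corpus = [
    "de coronamaatregelen zijn versoepeld",
    "nieuwe coronamaatregelen aangekondigd",
    "het vaccin en het vaccin werken",
]
new_vocab = extend_vocab(old_vocab, corpus)
```

Rare words (here, anything seen only once) are left out so the vocabulary grows only with genuinely frequent new usage, mirroring the paper's focus on high-frequency tokens.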


