
RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

11/15/2022
by   Pieter Delobelle, et al.

Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.

1 Introduction

Large pre-trained transformer-based language models have become the standard in recent years Devlin et al. (2019); Brown et al. (2020); Liu et al. (2019). BERT-like models often achieve state-of-the-art results for classification and regression tasks, both at the document level and at the token level Devlin et al. (2019); Liu et al. (2019). Multilingual BERT models are BERT models trained on multiple languages at once; they are thus able to generalize between languages and are useful for multilingual tasks. However, monolingual models often outperform multilingual ones on monolingual tasks Martin et al. (2019); Delobelle et al. (2020).

BERT-like models are usually pre-trained once and then finetuned for many different tasks in the years to come. However, since language evolves and the meaning of concepts begins to drift Wang et al. (2010), these models become increasingly outdated with respect to new concepts, trends and meanings. To counteract this, one can pre-train a BERT model further on a more recent dataset when it becomes available Jin et al. (2021). This is similar to how BERT models are already often adapted to new domains by pre-training on large unlabeled corpora in the new domain before being finetuned, to increase their performance on downstream tasks within that domain Gururangan et al. (2020). The new domain could of course also simply be a more recent dataset on which to further pre-train the large language model.

In this work, we use the Dutch portion of the 2022 version of the OSCAR corpus Abadji et al. (2022) to further pre-train the state-of-the-art Dutch BERT model RobBERT Delobelle et al. (2020), which was trained on the 2019 OSCAR corpus Ortiz Suárez et al. (2019). We first extend the tokenizer to include new common (sub)words, and then further pre-train RobBERT v2 to capture the recent evolution of the Dutch language.

2 Background and Related Work

2.1 BERT Language Model

Large pre-trained transformer-based language models have dominated natural language processing leaderboards for the last couple of years. BERT-based models (e.g. BERT, BART and RoBERTa) perform exceptionally well on classification and regression tasks, such as sentiment analysis, part-of-speech tagging, coreference resolution, natural language inference and question answering Rogers et al. (2020); Liu et al. (2019). Generally speaking, these models are only pre-trained once using a large unlabeled training dataset, and then finetuned on different downstream tasks using smaller datasets. During the pre-training phase, the main task of the model is a form of the Cloze task called masked-language modeling (MLM). For RoBERTa-based models, this is the only pre-training task, since the original BERT next-sentence-prediction task was found not to contribute much to training Liu et al. (2019). In the MLM task, the model has to predict the right tokens for the masked and swapped tokens of texts from the original pre-training corpus. By solving this task, the model learns all sorts of linguistic features that are useful for downstream tasks.
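To make the MLM objective concrete, the sketch below uses the Hugging Face transformers library with the English roberta-base tokenizer as a stand-in for RobBERT's tokenizer; it only illustrates how a data collator masks a fraction of tokens and exposes those positions as labels, and is not the authors' training code.

```python
# Sketch of RoBERTa-style masked-language-modeling batches (illustrative only).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in tokenizer

# 15% of tokens are selected; most are replaced by <mask>, a few are swapped for
# random tokens or kept, and the model must predict the original token there.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("Dit is een voorbeeldzin over taalmodellen.")])
print(batch["input_ids"])  # some positions typically replaced by the mask id
print(batch["labels"])     # -100 everywhere except the selected positions
```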

An important part of the design of a BERT-based model is the tokenizer. The tokenizer uses a pre-computed vocabulary to map words to integers, which can then be used as input for the language model. The original BERT model used a top-down WordPiece tokenizer, which segments words into subwords. When constructing a WordPiece tokenizer, tokens are chosen by combining word units in a way that increases the likelihood of the training data the most. RoBERTa, on the other hand, uses byte-pair encoding (BPE), a similar method that merges the most frequent symbol pairs Sennrich et al. (2016); Wang et al. (2020); Liu et al. (2019). BPE and WordPiece tokenization differ only in how symbol pairs are added to the vocabulary.

BPE tokenizers are built by looking for the most common substrings in order to compress the dataset into as few tokens as possible. Starting from character tokens, they repeatedly merge existing tokens if their combination is frequent enough, regardless of whether the merge makes morphological sense. This way, longer stretches of text can on average fit in the model's fixed-size input. Note that while BPE tokenizers are the most popular type of tokenizer for current language models, they have the surprising artifact of adding words multiple times, namely with and without the word-boundary marker (Ġ). Different capitalizations of the first letter also cause duplicate tokens, as well as tokens that seemingly miss their first letter.
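As a toy illustration of this merging loop (a from-scratch sketch of the general BPE algorithm, not any library's or the authors' implementation), one merge step looks as follows:

```python
# Count the most frequent adjacent symbol pair over a tiny word-frequency table
# and merge it into a new token; real tokenizers repeat this until a vocabulary
# budget is reached.
from collections import Counter

corpus = {("c", "o", "r", "o", "n", "a"): 10, ("c", "o", "r", "p", "u", "s"): 4}

def most_frequent_pair(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # ('c', 'o') and ('o', 'r') are tied at 14
corpus = apply_merge(corpus, pair)  # e.g. ('co', 'r', 'o', 'n', 'a'): 10, ...
```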

While multilingual models generally perform quite well, monolingual models tend to outperform them on tasks in their particular language Martin et al. (2019); Delobelle et al. (2020). For Dutch, the state-of-the-art model is RobBERT, which was released in the first month of 2020 (Delobelle et al., 2020). The model was trained on scraped internet data from the Dutch OSCAR corpus Ortiz Suárez et al. (2019). BERTje and BERT-NL are parallel works released only a couple of weeks before the RobBERT paper; they are based on the standard BERT architecture and trained on more formal but smaller datasets de Vries et al. (2019); Brandsen et al. (2019). Thanks to its larger training dataset and the optimized RoBERTa architecture, the RobBERT model outperforms the other Dutch BERT models on most tasks.

2.2 Domain Adaptation

Further pre-training a large language model can help it better understand a new domain if a large unlabeled corpus from within that domain is available. Large language models, like BERT models, that are domain-adapted in this way increase their performance on downstream tasks within that domain. When adapting to a new domain, the vocabulary of the model's tokenizer is often extended to account for new common (sub-)words of the domain Yao et al. (2021); Röttger and Pierrehumbert (2021). Adding tokens has been shown to reliably increase performance Gu et al. (2021), and adapting to the domain by pre-training in multiple phases also generally offers large performance increases Gururangan et al. (2020). For Dutch, researchers have successfully adapted the RobBERT model to the medical Verkijk and Vossen (2021) and legal Boer et al. (2022) domains.

Röttger and Pierrehumbert (2021) experimented with adapting language models to more recent datasets. They found that language models performed better on test sets from the past than on those from the future, showing the importance of adapting to newer language use. Jin et al. (2021) also showed that further pre-training language models on more recent, emerging datasets is effective for some downstream tasks. However, it is hard to predict whether an updated language model that was further pre-trained has improved or degraded performance on downstream tasks. For example, the language model might suffer a form of catastrophic forgetting of information that is crucial for particular downstream tasks, leading to unexpectedly degraded performance Xu et al. (2020); De Lange et al. (2022).

3 Method

We propose a new RobBERT model to address the need for a more up-to-date Dutch model that takes into account many of the changes that have occurred in language usage over the past three years, including the COVID-19 pandemic. New terms and meanings for words have been introduced, and there is new world knowledge that RobBERT is not aware of. By retraining the model on a more recent dataset, we can ensure that it is better equipped to handle these changes. This section will discuss new training data (§ 3.1), propose a method to extend a tokenizer’s vocabulary (§ 3.2) and outline the pre-training regime (§ 3.3).

3.1 Pre-training Data

The Dutch section of the OSCAR corpus (Ortiz Suárez et al., 2019) is based on automatically language-classified Common Crawl documents (https://oscar-corpus.com/). The corpus has a release with data up until January 2022, which, in contrast with previous releases, is a corpus of full documents instead of pre-shuffled lines. This allows models to learn longer-range dependencies, which was shown to be beneficial for some tasks (Delobelle et al., 2021).

Similar to the RoBERTa document pre-processing Liu et al. (2019), we pre-processed the documents by splitting them into sentences and then maximizing the number of complete sentences that fit within the 512-token input limit. Compared to simply splitting a document into 512-token chunks, this pre-processing avoids ending and starting inputs on arbitrary tokens. This results in a training dataset of 77 GB with 122,364,485 separate documents.
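A minimal sketch of this packing step is shown below; the NLTK sentence splitter and the exact token budget handling are our assumptions for illustration, not the released pre-processing script.

```python
# Pack whole sentences into inputs of at most `max_len` tokens.
# Requires: nltk.download("punkt") for the Dutch sentence splitter.
from nltk.tokenize import sent_tokenize

def pack_document(text, tokenizer, max_len=512):
    inputs, current, current_len = [], [], 0
    for sentence in sent_tokenize(text, language="dutch"):
        n_tokens = len(tokenizer.tokenize(sentence))
        if current and current_len + n_tokens > max_len:
            inputs.append(" ".join(current))   # close the current input
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        inputs.append(" ".join(current))
    # Each entry contains only complete sentences; a single sentence longer than
    # max_len would still overflow and is ignored here for brevity.
    return inputs
```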

3.2 Extending BPE Vocabulary

Before pre-training, we extended the vocabulary used by RobBERT’s tokenizer to reflect the evolution of Dutch over the last three years. For example, words like “corona” or “COVID-19” were not in the original tokenizer since they were barely used before the pandemic. Adding such words allows the tokenizer to represent them more compactly within the input size of the model.
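For instance, with the released tokenizers (the exact token splits hinted at in the comments are indicative, not copied from the paper), the difference can be inspected directly:

```python
# Compare how the 2019 and 2022 tokenizers segment a pandemic-era word.
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
new_tok = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")

word = "coronamaatregelen"
print(old_tok.tokenize(word))  # several short subword pieces under the 2019 vocabulary
print(new_tok.tokenize(word))  # fewer pieces once corona-related subwords exist
```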

To add these tokens to the vocabulary of the original tokenizer, we first created a new BPE tokenizer on the 2022 corpus using the process described in section 2.1 Sennrich et al. (2016); Liu et al. (2019). After creating this new tokenizer, we add the tokens that are new compared to the original tokenizer to create a merged tokenizer. BPE tokenizers also use a dependency graph that specifies how tokens are merged into larger tokens, which also needs to be updated. For example, Figure 1 shows how “Coron” is composed of “Cor” and “on”, and is itself used in larger tokens. By iterating over all the merges of the new tokenizer, we keep a merge rule if the resulting token is in the set difference between the new and the original vocabulary; all other merges are already covered by the original tokenizer. This results in a merged vocabulary of 42,774 tokens, meaning we added 2,774 tokens to the original vocabulary.
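A rough sketch of this merging step, working directly on the vocab.json and merges.txt files that RoBERTa-style BPE tokenizers ship with, is given below; the file handling and id assignment are our assumptions, not the released implementation.

```python
# Merge a newly trained BPE tokenizer into the original one by appending the new
# tokens and the merge rules that produce them.
import json
from pathlib import Path

def merge_tokenizers(old_dir, new_dir, out_dir):
    old_dir, new_dir, out_dir = Path(old_dir), Path(new_dir), Path(out_dir)
    old_vocab = json.loads((old_dir / "vocab.json").read_text(encoding="utf-8"))
    new_vocab = json.loads((new_dir / "vocab.json").read_text(encoding="utf-8"))

    # Append tokens that only exist in the new tokenizer, continuing the old ids.
    merged_vocab = dict(old_vocab)
    for token in new_vocab:
        if token not in merged_vocab:
            merged_vocab[token] = len(merged_vocab)
    added = set(merged_vocab) - set(old_vocab)

    # Keep the old merge rules and append the new rules whose result is a newly
    # added token (e.g. "Cor on" -> "Coron"); all other merges are already covered.
    old_merges = (old_dir / "merges.txt").read_text(encoding="utf-8").splitlines()
    new_merges = (new_dir / "merges.txt").read_text(encoding="utf-8").splitlines()
    extra = [m for m in new_merges
             if not m.startswith("#") and m.replace(" ", "") in added]

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "vocab.json").write_text(
        json.dumps(merged_vocab, ensure_ascii=False), encoding="utf-8")
    (out_dir / "merges.txt").write_text(
        "\n".join(old_merges + extra) + "\n", encoding="utf-8")
```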

Interestingly, these new tokens also give insight into which (sub-)words society started using more, with some categorized examples in Table 3. Some new tokens were added simply because they now cross the frequency threshold, while others correspond to words that have become much more common, e.g. COVID-19-related words and new brand names.

Figure 1: Illustration of how tokens related to the COVID-19 pandemic are added based on existing tokens. Notice that the compound word with ‘maatregelen’ (measures) does not follow Dutch morphological splitting rules due to how BPE tokens are calculated.
                                           Benchmark scores
Model                      Params   Year   SA     CR     NER    POS    NLI     PPPL
RobBERT (v2)               116 M    2019                                        35.9
BERTje                     109 M    2019   93.0   98.3   88.3   96.3   83.94   144.2
RobBERTje (Non-shuffled)    74 M    2021   92.1
RobBERT-2022               119 M    2022

Table 1: Results on the benchmark tasks from Delobelle et al. (2020) as well as the pseudo-perplexity (PPPL, Salazar et al., 2019) on the latest Dutch OSCAR corpus. Indicated results are from their respective papers.

3.3 Continuing Pre-training

To make language models account for evolved language usage, one can either pre-train from scratch, continue pre-training on a more recent dataset, or distill the model using a more recent dataset Yao et al. (2021). While RobBERT has already been successfully distilled Delobelle et al. (2021), it is not trivial to use such a distillation process in this case due to the enlarged vocabulary. Since the new corpus is largely a superset of the previously used 2019 Dutch OSCAR corpus and still includes most of the scraped data of the older version, we take the pre-trained weights from the original RobBERT model and continue training on the expanded corpus. To account for the newly introduced tokens, we extended the embedding matrix with 2,774 new rows.
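In Hugging Face transformers terms, this warm start can be sketched as follows (assuming the public RobBERT v2 checkpoint and a merged tokenizer saved locally; this is not the exact training script):

```python
# Reuse the RobBERT v2 weights and grow the embedding matrix for the new tokens.
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./robbert-2022-tokenizer")  # merged tokenizer
model = RobertaForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

# Existing rows keep their pre-trained values; the 2,774 new rows are freshly
# initialised and learn their embeddings during the continued pre-training.
model.resize_token_embeddings(len(tokenizer))
```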

The pre-training was performed using the same training regime as Delobelle et al. (2020), which is equivalent to RoBERTa’s regime (Liu et al., 2019). We use gradient accumulation over 128 batches for an effective batch size of 1024 on a single 3080 Ti, and we use ADAM (Kingma and Ba, 2014) as optimizer with a warmup of 1k batches. We terminate our pre-training with early stopping based on validation set performance, which we evaluate every 100k steps using an MLM-specific version of perplexity (PPPL, Salazar et al., 2019).
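The pseudo-perplexity used for this early stopping can be sketched as follows: mask each token in turn, score it with the MLM head, and exponentiate the mean negative pseudo-log-likelihood (a simplified per-sentence version of Salazar et al. (2019), not the authors' evaluation code).

```python
# Simplified pseudo-perplexity (PPPL) for a single sentence.
import torch

def pseudo_perplexity(sentence, model, tokenizer):
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nll, count = 0.0, 0
    for i in range(1, len(input_ids) - 1):          # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[input_ids[i]].item()       # negative log-prob of the true token
        count += 1
    return float(torch.exp(torch.tensor(nll / count)))
```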

Model                   ACC      F1
Domain-adapted models
BERTje+                 77.7%
CoNTACT+                77.9%
General-purpose models
BERTje                  74.7%
RobBERT v2              74.9%    77.2%
RobBERT-2022            76.3%    79.3%

Table 2: Results on VaccinChat (Buhmann et al., 2022), a task on FAQs about COVID-19 vaccines. Results without an F1 score are as reported by Buhmann et al. (2022).

4 Evaluation and Results

We first evaluated the model on a variety of tasks to verify that there was minimal concept drift and no catastrophic forgetting (§ 4.1). We then also evaluated its performance on two recent COVID-related tasks and performed an embedding analysis (§ 4.2).

4.1 No or minimal shift for existing concepts

We first performed an evaluation on the same RobBERT benchmark tasks as the original model to analyze the effect on existing concepts and to ensure there was minimal concept drift and no catastrophic forgetting. A detailed description of the benchmarks can be found in Delobelle et al. (2021) and our training setup is in Appendix A. The results of the comparisons in Table 1 show that there is no decrease in accuracy or utility on most benchmarks, and even an improvement on two tasks.

Second, we tested whether the pseudo-perplexity on the original corpus from 2019 increased over the pseudo-perplexity of the original RobBERT model, since an increase would indicate that the original dataset is no longer modelled well by RobBERT-2022. The pseudo-perplexity of 9.40 is slightly higher than the 7.76 of RobBERT, which can be an indication that our assumption that the new dataset is a superset of the original one is not entirely correct. However, it is still reasonably low and shows that the new RobBERT-2022 model is still able to model the original data distribution.

4.2 Meaningful embeddings for new concepts

We evaluated the model’s embeddings for new concepts by first testing its performance on a task to classify tweets about COVID-19 measures in Belgium (Scott et al., 2021). While both the original RobBERT Delobelle et al. (2020) and multilingual BERT Devlin et al. (2019) achieve 73% test set accuracy on this task, RobBERT-2022 achieves 75% accuracy. This indicates that having these new tokens with meaningful embeddings helped RobBERT-2022 better classify this recent dataset.

Secondly, we compared the predictive performance of RobBERT-2022 on VaccinChat (Buhmann et al., 2022) to the original RobBERT model, to BERTje (de Vries et al., 2019), another Dutch model trained on data from 2019, and to CoNTACT (Lemmens et al., 2022), a domain-specific model for Dutch COVID-19 tweets based on RobBERT. The results in Table 2 demonstrate that the updated model benefits from the COVID-19-related tokens and training data, with an F1 score of 79.3% for the new model, compared to 77.2% for the original model. However, the performance of the domain-adapted models BERTje+ and CoNTACT+ (Buhmann et al., 2022) highlights that RobBERT-2022 is still a general-purpose language model and that further domain adaptation remains beneficial.

Figure 2: t-SNE visualisation of sentence embeddings for unseen sentences from the COVID-19 and OSCAR 2022 datasets. Notice the shift between the two models.

Thirdly, we performed an embedding analysis on sentences from the validation set to test whether the embeddings capture the same relevant information (Figure 2). Interestingly, this indicates that RobBERT-2022’s sentence embeddings have drifted from those of the original model.
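Such an analysis can be reproduced roughly as sketched below, assuming mean-pooled last-hidden-state sentence embeddings and scikit-learn's t-SNE; the paper does not specify its exact pooling choice and the example sentences are ours.

```python
# Project sentence embeddings of a model to 2D with t-SNE.
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

def sentence_embeddings(sentences, model_name):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)          # mean over real tokens

sentences = [
    "De coronamaatregelen worden opnieuw versoepeld.",
    "Iedereen moet een mondmasker dragen op de trein.",
    "Het boek leest vlot en is mooi uitgegeven.",
    "De recensent vond de roman erg voorspelbaar.",
]
emb = sentence_embeddings(sentences, "DTAI-KULeuven/robbert-2022-dutch-base")
points = TSNE(n_components=2, perplexity=2).fit_transform(emb.numpy())
```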

5 Model

RobBERT-2022 is available on HuggingFace as DTAI-KULeuven/robbert-2022-dutch-base (https://huggingface.co/DTAI-KULeuven/robbert-2022-dutch-base).
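For example, the checkpoint can be loaded directly with the Hugging Face transformers library (the example sentence below is ours):

```python
# Query the released model with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")
print(fill_mask("Tijdens de pandemie droeg iedereen een <mask> op de trein."))
```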

6 Conclusion & Future Work

We created an updated version of the state-of-the-art Dutch BERT-like model RobBERT by extending its vocabulary and further pre-training the model. Our analysis shows that further pre-training on a recent dataset increases performance on more recent tasks while maintaining performance on older tasks. This indicates that updating the model on more recent training corpora in the future is a fruitful endeavor. Since RobBERT was already successfully distilled into the RobBERTje model Delobelle et al. (2021), it would be interesting to evaluate the performance difference between a distilled RobBERT-2022 model and a further pre-trained RobBERTje model. We hope that, like its predecessor, this new RobBERT-2022 model helps to provide the Dutch NLP community with a model that can be used on datasets relying on more recent information.

References

Appendix A Experimental setup

A.1 Sentiment Analysis (SA)

We evaluate sentiment analysis on the Dutch Book Review Dataset (van der Burgh and Verberne, 2019) with standard splits. Our experiment consists of one run with the following hyperparameters; a sketch of how these settings map onto Hugging Face TrainingArguments follows the list:

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: 8

  • lr:

  • lr scheduler type: LINEAR

  • num train epochs: 10

  • optimizer: ADAMW

  • batch size: 4

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: 0.05
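The settings above might map onto Hugging Face TrainingArguments roughly as sketched below; the learning rate is elided in the list, so the value shown is a placeholder.

```python
# Approximate mapping of the listed hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="robbert-2022-dbrd",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,              # placeholder: the exact value is not listed
    lr_scheduler_type="linear",
    warmup_ratio=0.0,
    warmup_steps=20,
    weight_decay=0.05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=False,
    seed=1,
)
```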

A.2 Co-reference Resolution (CR)

We run five randomized training runs on the ‘die’-‘dat’ disambiguation task by Allein et al. (2020), where we vary the learning rate, the number of gradient accumulation steps and the weight decay. For computational reasons, we limit our datasets to 50k (training), 3k (validation) and 30k (testing) examples.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: . Best: 2

  • lr: . Best:

  • lr scheduler type: LINEAR

  • num train epochs: 1

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: . Best: 0.09

A.3 Named Entity Recognition (NER)

We evaluate NER on the CoNLL dataset with an experiment consisting of 10 runs with Bayesian optimisation (TPE) using the following hyperparameters, where we vary the learning rate, the number of gradient accumulation steps and the weight decay. We select the best-performing model based on its score on a separate validation set before evaluating this model on the test set.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: .

    Best: 4

  • lr: . Best:

  • lr scheduler type: LINEAR

  • num train epochs: 10

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: . Best: 0.08

A.4 Part-of-speech (POS) tagging

We also perform 10 runs with Bayesian optimisation (TPE) using the following hyperparameters, where we vary the learning rate, the number of gradient accumulation steps and the weight decay. We select the best-performing model based on its score on a separate validation set before evaluating this model on the test set.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: .

    Best: 1

  • lr: . Best:

  • lr scheduler type: LINEAR

  • num train epochs: 10

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: . Best: 0.06

A.5 Natural Language Inference (NLI)

Our experiment consists of 5 runs with the following hyperparameters, where most are fixed and the learning rate, weight decay and the number of gradient accumulation steps are randomly selected from the specified ranges.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: . Best: 8

  • lr: . Best:

  • lr scheduler type: LINEAR

  • num train epochs: 10

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: . Best: 0.025

A.6 COVID-19 topics

We take the task of extracting the topic of a Tweet, as categorised and labeled by Scott et al. (2021), with the following topics as possible labels:

  • closing-horeca

  • testing

  • schools

  • lockdown

  • quarantine

  • curfew

  • masks

  • vaccine

  • other-measure

  • not-applicable

Our experiment consists of 5 runs with the following hyperparameters, where most are fixed and the learning rate, weight decay and the number of gradient accumulation steps are randomly selected from the specified ranges.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: . Best: 2

  • lr: . Best:

  • lr scheduler type: LINEAR

  • num train epochs: 10

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: . Best: 0.065

A.7 VaccinChat

We follow the train, validation and test sets as introduced by Buhmann et al. (2022) on https://huggingface.co/datasets/clips/VaccinChatNL. Our experiment consists of 5 runs with the following hyperparameters for both RobBERT v2 and RobBERT-2022, where most are fixed and the learning rate, weight decay and the number of gradient accumulation steps are randomly selected from the specified ranges.

  • Number of gpus: 1 (1080 Ti)

  • adafactor: False

  • adam beta1: 0.9

  • adam beta2: 0.999

  • adam epsilon: 1e-08

  • deepspeed: None

  • fp16: False

  • gradient acc. steps: .
    Best RobBERT v2: 2
    Best RobBERT-2022: 4

  • lr: .
    Best RobBERT v2:
    Best RobBERT-2022:

  • lr scheduler type: LINEAR

  • num train epochs: 5

  • optimizer: ADAMW

  • batch size: 8

  • seed: 1

  • warmup ratio: 0.0

  • warmup steps: 20

  • weight decay: .
    Best RobBERT v2:
    Best RobBERT-2022: