The advent of neural networks in natural language processing (NLP) has significantly improved state-of-the-art results within the field. While recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) initially dominated the field, recent models started incorporating attention mechanisms and then later dropped the recurrent part and just kept the attention mechanisms in so-called transformer models(vaswaniAttention2017)
. This latter type of model caused a new revolution in NLP and led to popular language models like GPT-2(radford2018gpt; radford2019gpt2) and ELMo (peters2018elmo). BERT (devlinBERT2019a) improved over previous transformer models and recurrent networks by allowing the system to learn from input text in a bidirectional way, rather than only from left-to-right or the other way around. This model was later re-implemented, critically evaluated and improved in the RoBERTa model (liuRoBERTa2019).
These large-scale transformer models provide the advantage of being able to solve NLP tasks by having a common, expensive pre-training phase, followed by a smaller fine-tuning phase. The pre-training happens in an unsupervised way by providing large corpora of text in the desired language. The second phase only needs a relatively small annotated data set for fine-tuning to outperform previous popular approaches in one of a large number of possible language tasks.
While language models are usually trained on English data, some multilingual models also exist. These are usually trained on a large quantity of text in different languages. For example, Multilingual-BERT is trained on a collection of corpora in 104 different languages (devlinBERT2019a), and generalizes language components well across languages (pires2019multilingual). However, models trained on data from one specific language usually improve the performance of multilingual models for this particular language (martinCamemBERT2019; devriesBERTje2019). Training a RoBERTa model (liuRoBERTa2019) on a Dutch dataset thus has a lot of potential for increasing performance for many downstream Dutch NLP tasks. In this paper, we introduce RobBERT111The model named itself RobBERT when it was prompted with “Ik heet maskBERT.” (“My name is maskBERT.”), which we found quite a suitable name., a Dutch RoBERTa-based pre-trained language model, and critically test its performance using natural language tasks against other Dutch languages models.
2 Related Work
Transformer models have been successfully used for a wide range of language tasks. Initially, transformers were introduced for use in machine translation, where they vastly improved state-of-the-art results for English to German in an efficient manner (vaswaniAttention2017). This transformer model architecture resulted in a new paradigm in NLP with the migration from sequence-to-sequence recurrent neural networks to transformer-based models by removing the recurrent component and only keeping attention. This cornerstone was used for BERT, a transformer model that obtained state-of-the-art results for eleven natural language processing tasks, such as question answering and natural language inference (devlinBERT2019a). BERT is pre-trained with large corpora of text using two unsupervised tasks. The first task is word masking (also called the Cloze task (taylor1953cloze) or masked language model (MLM)), where the model has to guess which word is masked in certain position in the text. The second task is next sentence prediction. This is done by predicting if two sentences are subsequent in the corpus, or if they are randomly sampled from the corpus. These tasks allowed the model to create internal representations about a language, which could thereafter be reused for different language tasks. This architecture has been shown to be a general language model that could be fine-tuned with little data in a relatively efficient way for a very distinct range of tasks and still outperform previous architectures devlinBERT2019a.
Transformer models are also capable of generating contextualized word embeddings. These contextualized embeddings were presented by peters2018elmo and addressed the well known issue with a word’s meaning being defined by its context (e.g. “a stick” versus “let’s stick to”). This lack of context is something that traditional word embeddings like word2vec (mikolovEfficient2013) or GloVe (penningtonGlove2014) lack, whereas BERT automatically incorporates the context a word occurs in.
Another advantage of transformer models is that attention allows them to better resolve coreferences between words (joshi2019spanbert). A typical example for the importance of coreference resolution is “The trophy doesn’t fit in the brown suitcase because it’s too big.”, where the word “it” would refer to the the suitcase instead of the trophy if the last word was changed to “small” (levesque2012winograd). Being able to resolve these coreferences is for example important for translating to languages with gender, as suitcase and trophy have different genders in French.
Although BERT has been shown to be a useful language model, it has also received some scrutiny on the training and pre-processing of the language model. As mentioned before, BERT uses next sentence prediction (NSP) as one of its two training tasks. In NSP, the model has to predict whether two sentences follow each other in the training text, or are just randomly selected from the corpora. The authors of RoBERTa (liuRoBERTa2019) showed that while this task made the model achieve a better performance, it was not due to its intended reason, as it might merely predict relatedness rather than subsequent sentences. That devlinBERT2019a trained a better model when using NSP than without NSP is likely due to the model learning long-range dependencies in text from its inputs, which are longer than just the single sentence on itself. As such, the RoBERTa model uses only the MLM task, and uses multiple full sentences in every input. Other research improved the NSP task by instead making the model predict the correct order of two sentences, where the model thus has to predict whether the sentences occur in the given order in the corpus, or occur in flipped order (lan2019albert).
devlinBERT2019a also presented a multilingual model (mBERT) with the same architecture as BERT, but trained on Wikipedia corpora in 104 languages. Unfortunately, the quality of these multilingual embeddings is often considered worse than their monolingual counterparts. ronnqvistMultilingual2019 illustrated this difference in quality for German and English models in a generative setting. The monolingual French CamemBERT model (martinCamemBERT2019) also compared their model to mBERT, which performed poorer on all tasks. More recently, devriesBERTje2019 also showed similar results for Dutch using their BERTje model, outperforming multilingual BERT in a wide range of tasks, such as sentiment analysis and part-of-speech tagging. Since this work is concurrent with ours, we compare our results with BERTje in this paper.
3 Pre-training RobBERT
This section describes the data and training regime we used to train our Dutch RoBERTa-based language model called RobBERT.
We pre-trained our model on the Dutch section of the OSCAR corpus, a large multilingual corpus which was obtained by language classification in the Common Crawl corpus (ortizsuarezAsynchronous2019). This Dutch corpus has 6.6 billion words, totalling 39 GB of text. It contains 126,064,722 lines of text, where each line can contain multiple sentences. Subsequent lines are however not related to each other, due to the shuffled nature of the OSCAR data set. For comparison, the French RoBERTa-based language model CamemBERT (martinCamemBERT2019) has been trained on the French portion of OSCAR, which consists of 138 GB of scraped text.
Our data differs in several ways from the data used to train BERTje, a BERT-based Dutch language model (devriesBERTje2019). Firstly, they trained the model on an assembly of multiple Dutch corpora totalling only 12 GB. Secondly, they used WordPiece as subword embeddings, since this is what the original BERT architecture uses. RobBERT on the other hand uses Byte Pair Encoding (BPE), which is also used by GPT-2 radford2019gpt2 and RoBERTa (liuRoBERTa2019).
RobBERT shares its architecture with RoBERTa’s base model, which itself is a replication and improvement over BERT (liuRoBERTa2019). The architecture of our language model is thus equal to the original BERT model with 12 self-attention layers with 12 heads (devlinBERT2019a). One difference with the original BERT is due to the different pre-training task specified by RoBERTa, using only the MLM task and not the NSP task. The training thus only uses word masking, where the model has to predict which words were masked in certain positions of a given line of text. The training process uses the Adam optimizer (kingmaAdam2017) with polynomial decay of the learning rate and a ramp-up period of 1000 iterations, with parameters (a common default) and RoBERTa’s default . Additionally, we also used a weight decay of 0.1 as well as a small dropout of 0.1 to help prevent the model from overfitting (srivastavaDropout2014).
We used a computing cluster in order to efficiently pre-train our model. More specifically, the pre-training was executed on a computing cluster with 20 nodes with 4 Nvidia Tesla P100 GPUs (16 GB VRAM each) and 2 nodes with 8 Nvidia V100 GPUs (having 32 GB VRAM each). This pre-training happened in fixed batches of 8192 sentences by rescaling each GPUs batch size depending on the number of GPUs available, in order to maximally utilize the cluster without blocking it entirely for other users. The model trained for two epochs, which is over 16k batches in total. With the large batch size of 8192, this equates to 0.5M updates for a traditional BERT model. At this point, the perplexity did not decrease any further.
We evaluated RobBERT in several different settings on multiple downstream tasks. First, we compare its performance with other BERT-models and state-of-the-art systems in sentiment analysis, to show its performance for classification tasks. Second, we compare its performance in a recent Dutch language task, namely the disambiguation of demonstrative pronouns, which allows us to additionally compare the zero-shot performance of our and other BERT models, i.e. using only the pre-trained model without any fine-tuning.
|Task + model||ACC (95% CI) [%]||F1 [%]||ACC (95% CI) [%]||F1 [%]|
|Sentiment Analysis (DBRD)|
|RobBERT (ours)||86.730 (85.32, 88.14)||86.729||94.422 (93.47,95.38)||94.422|
|mBERT devlinBERT2019a||92.157 (92.06,92.25)||90.898||98.285 (98.24,98.33)||98.033|
|BERTje (devriesBERTje2019)||93.096 (92.84, 93.36)||91.279||98.268 (98.22,98.31)||98.014|
|RobBERT (ours)||97.006 (96.95, 97.07)||96.571||98.406 (98.36, 98.45)||98.169|
Results of RobBERT fine-tuned on several downstream tasks compared to the state of the art on the tasks. For accuracy, we also report the 95% confidence intervals.(Results annotated with * from vanderburghMerits2019, ** = from devriesBERTje2019, *** from alleinBinary2020)
4.1 Sentiment Analysis
We replicated the high-level sentiment analysis task used to evaluate BERTje (devriesBERTje2019) to be able to compare our methods. This task uses a dataset called Dutch Book Reviews Dataset (DBRD), in which book reviews scraped from hebban.nl are labeled as positive or negative vanderburghMerits2019. Although the dataset contains 118,516 reviews, only 22,252 of these reviews are actually labeled as positive or negative. The DBRD dataset is already split in a balanced 10% test and 90% train split, allowing us to easily compare to other models trained for solving this task. This dataset was released in a paper analysing the performance of an ULMFiT model (Universal Language Model Fine-tuning for Text Classification model) (vanderburghMerits2019).
We fine-tuned RobBERT on the first 10,000 training examples as well as on the full data set. While the ULMFiT model is first fine-tuned using the unlabeled reviews before training the classifier(vanderburghMerits2019), it is unclear whether BERTje also first fine-tuned on the unlabeled reviews or only used the labeled data for fine-tuning the pretrained model. It is also unclear how it dealt with reviews being longer than the maximum number of tokens allowed as input in BERT models, as the average book review length is 547 tokens, with 40% of the documents being longer than our RobBERT model can handle. For a safe comparison, we thus decided to discard the unlabeled data and only use the labeled data for training and test purposes (20,028 and 2,224 examples respectively), and compare approaches for dealing with too long input sequences. We trained our model for 2000 iterations with a batch size of 128 and a warm-up of 500 iterations, reaching a learning rate of . We found that our model performed better when trained on the last part of the book reviews than on the first part. This is likely due to this part containing concluding remarks summarizing the overall sentiment. While BERTje was slightly outperformed by ULMFiT devriesBERTje2019; vanderburghMerits2019, we can see that RobBERT achieves better performance than both on the test set, although the performance difference is not statistically significantly better than the ULMFiT model, as can be seen in Table 1.
4.2 Die/Dat Disambiguation
Aside from classic natural language processing tasks in previous subsections, we also evaluated its performance on a task that is specific to Dutch, namely disambiguating “die” and “dat” (= “that” in English). In Dutch, depending on the sentence, both terms can be either demonstrative or relative pronouns; in addition they can also be used in a subordinating conjunction, i.e. to introduce a clause. The use of either of these words depends on the gender of the word it refers to. Distinguishing these words is a task introduced by alleinBinary2020, who presented multiple models trained on the Europarl (koehnEuroparl2005a) and SoNaR corpora (oostdijkConstruction2013). The results ranged from an accuracy of 75.03% on Europarl to 84.56% on SoNaR.
For this task, we use the Dutch version of the Europarl corpus (koehnEuroparl2005a), which we split in 1.3M utterances for training, 319k for validation, and 399k for testing. We then process every sentence by checking if it contains “die” or “dat”, and if so, add a training example for every occurrence of this word in the sentence, where a single occurrence is masked. For the test set for example, this resulted in about 289k masked sentences. We then test two different approaches for solving this task on this dataset. The first approach is making the BERT models use their MLM task and guess which word should be filled in this spot, and check if it has more confidence in either “die” or “dat” (by checking the first 2,048 guesses at most, as this seemed sufficiently large). This allows us to compare the zero-shot BERT models, i.e. without any fine-tuning after pre-training, for which the results can be seen in Table 2. The second approach uses the same data, but creates two sentences by filling in the mask with both “die” and “dat”, appending both with the [SEP] token and making the model predict which of the two sentences is correct. The fine-tuning was performed using 4 Nvidia GTX 1080 Ti GPUs and evaluated against the same test set of 399k utterances. As before, we fine-tuned the model twice: once with the full training set and once with a subset of 10k utterances from the training set for illustrating the benefits of pre-training on low-resource tasks.
|ZeroR (majority class)||66.70|
RobBERT outperforms previous models as well as other BERT models both with as well as without fine-tuning (see Table 1 and Table 2). It is also able to reach similar performance using less data. The fact that zero-shot RobBERT outperforms other zero-shot BERT models is also an indication that the base model has internalised more knowledge about Dutch than the other two have. The reason RobBERT and other BERT models outperform the previous RNN-based approach is likely the transformers ability to deal better with coreference resolution (joshi2019spanbert), and by extension better in deciding which word the “die” or “dat” belongs to.
The training and evaluation code of this paper as well as the RobBERT model and the fine-tuned models are publicly available for download on https://github.com/iPieter/RobBERT.
6 Future Work
There are several possible improvements as well as interesting future directions for this research, for example in training similar models. First, as BERT-based models are a very active field of research, it is interesting to experiment with change the pre-training tasks with new unsupervised tasks when they are discovered, such as the sentence order prediction (lan2019albert). Second, while RobBERT is trained on lines that contain multiple sentences, it does not put subsequent lines of the corpus after each other due to the shuffled nature of the OSCAR corpus (ortizsuarezAsynchronous2019). This is unlike RoBERTa, which does put full sentences next to each other if they fit, in order to learn the long-range dependencies between words that the original BERT learned using its controversial NSP task. It could be interesting to use the processor used to create OSCAR in order to create an unshuffled version to train on, such that this technique can be used on the data set. Third, RobBERT uses the same tokenizer as RoBERTa, meaning it uses a tokenizer built for the English language. Training a new model using a custom Dutch tokenizer, e.g. using the newly released HuggingFace tokenizers library (wolfHuggingFace2019)
, could increase the performance even further. On the same note, incorporating more Unicode glyphs as separate tokens can also be beneficial for example for tasks related to conversational agents(delobelleTime2019).
RobBERT itself could also be used in new settings to help future research. First, RobBERT could be used in different settings thanks to the renewed interest of sequence-to-sequence models due to their results on a vast range of language tasks (raffelExploring2019; lewis2019bart). These models use a BERT-like transformer stack for the encoder and depending on the task a generative model as a decoder. These advances once again highlight the flexibility of the self-attention mechanism and it might be interesting to research the re-usability of RobBERT in these type of architectures. Second, there are many Dutch language tasks that we did not examine in this paper, for which it may also be possible to achieve state-of-the-art results when fine-tuned on this pre-trained model.
We introduced a new language model for Dutch based on RoBERTa, called RobBERT, and showed that it outperforms earlier approaches for Dutch language tasks, as well as other BERT-based language models. We thus hope this model can serve as a base for fine-tuning on other tasks, and thus help foster new models that might advance results for Dutch language tasks.
Pieter Delobelle was supported by the Research Foundation - Flanders under EOS No. 30992574 and received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme. Thomas Winters is a fellow of the Research Foundation-Flanders (FWO-Vlaanderen). Most computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government – department EWI. We are especially grateful to Luc De Raedt for his guidance as well as for providing the facilities to complete this project. We are thankful to Liesbeth Allein and her supervisors for inspiring us to use the die/dat task. We are also grateful to ott2019fairseq; paszke2019pytorch; Haghighi2018; wolfHuggingFace2019 for their software packages.