RoBERTa models for Polish
Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for the English language, for which many pre-trained models utilizing transformer-based architecture have been published in recent years. This has driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks, such as machine translation, question answering, or summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them have come close to the scale, in terms of corpus size and the number of parameters, of the largest English-language models. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements over previous approaches in eleven of them.
Unsupervised pre-training for Natural Language Processing (NLP) has gained popularity in recent years. The goal of this approach is to train a model on a large corpus of unlabeled text, and then use the representations the model generates as an input for downstream linguistic tasks. The initial popularization of these methods was related to the successful applications of pre-trained word vectors (embeddings), the most notable of which include Word2Vec [mikolov2013distributed], GloVe [pennington2014glove], and FastText [bojanowski2017enriching]. These representations have contributed greatly to the development of NLP. However, one of the main drawbacks of such tools was that the static word vectors did not encode contextual information. The problem was addressed in later studies by proposing context-dependent representations of words based on pre-trained neural language models. For this purpose, several language model architectures which utilize bidirectional long short-term memory (LSTM) layers have been introduced. Popular models such as ELMo [peters2018deep], ULMFiT [howard2018universal], and Flair [akbik2018contextual] have led to significant improvements in a wide variety of linguistic tasks. Shortly after, Devlin et al. [devlin-etal-2019-bert] introduced BERT - a different type of language model based on the transformer [vaswani2017attention] architecture. Instead of predicting the next word in a sequence, BERT is trained to reconstruct the original sentence from one in which some tokens have been replaced by a special mask token. Since the text representations generated by BERT have proved to be effective for NLP problems - even those which were previously considered challenging, such as question answering or common sense reasoning - more focus has been put on transformer-based language models. As a result, in the last two years we have seen a number of new methods based on that idea, with some modifications in the architecture or the training objectives. The approaches that have gained wide recognition include RoBERTa [liu2019roberta], Transformer-XL [dai-etal-2019-transformer], XLNet [yang2019xlnet], ALBERT [lan2019albert], and Reformer [kitaev2019reformer].
The vast majority of research on both transformer-based language models and transfer learning for NLP is targeted toward the English language. This progress does not translate easily to other languages. In order to benefit from recent advancements, language-specific research communities must adapt and replicate studies conducted in English to their native languages. Unfortunately, the cost of training state-of-the-art language models is growing rapidly [peng_2019], which leaves not only individual scientists, but also some research institutions, unable to reproduce experiments in their own languages. Therefore, we believe that it is particularly important to share the results of research - especially pre-trained models, datasets, and source code of the experiments - for the benefit of the whole scientific community. In this article, we describe our methodology for training two language models for the Polish language based on the BERT architecture. The smaller model follows the hyperparameters of an English-language BERT-base model, and the larger version follows the BERT-large model. To the best of our knowledge, the latter is the largest language model for Polish available to date, both in terms of the number of parameters (355M) and the size of the training corpus (135GB). We have released both pre-trained models publicly (https://github.com/sdadas/polish-roberta). We evaluate our models on several linguistic tasks in Polish, including nine from the KLEJ benchmark [rybak2020klej], and four additional tasks. The evaluation covers a set of typical NLP problems, such as binary and multi-class classification, textual entailment, semantic relatedness, ranking, and Named Entity Recognition (NER).
In this section we provide an overview of models based on the transformer architecture for languages other than English. Apart from English, the language on which NLP research is most focused currently is Chinese. This is reflected in the number of pre-trained models available [xu2020cluecorpus2020, chinese-bert-wwm, devlin-etal-2019-bert, sun2019ernie]. Other languages for which we found publicly available pre-trained models include Arabic [antoun2020arabert], Dutch [de2019bertje, delobelle2020robbert], Finnish [virtanen2019multilingual], French [martin2019camembert, le2019flaubert], German, Greek, Italian, Japanese, Korean, Malaysian, Polish, Portuguese [souza2019portuguese], Russian [kuratov2019adaptation], Spanish [CaneteCFP2020], Swedish, Turkish, and Vietnamese [nguyen2020phobert]. Models covering a few languages of the same family are also available, such as SlavicBERT (Bulgarian, Czech, Polish, and Russian) [arkhipov-etal-2019-tuning] and NordicBERT (https://github.com/botxo/nordic_bert), which covers Danish, Norwegian, Swedish, and Finnish. The topic of massive multilingual models covering tens, or in some cases more than a hundred languages, has attracted more attention in recent years. The original BERT model [devlin-etal-2019-bert] was released along with a multilingual version covering 104 languages. XLM [NIPS2019_8928] (fifteen, seventeen and 100 languages) and XLM-R [conneau2019unsupervised] (100 languages) were released in 2019. Although it is possible to use these models for languages in which no monolingual models are available, language-specific pre-training usually leads to better performance. To date, two monolingual models have been made available for Polish: HerBERT [rybak2020klej] and Polbert (https://github.com/kldarek/polbert), both of which utilize the BERT-base architecture.
Our contributions are as follows: 1) We trained two transformer-based language models for Polish, consistent with the BERT-base and BERT-large architectures. To the best of our knowledge, the second model is the largest language model trained for Polish to date, both in terms of the number of parameters and the size of the training corpus. 2) We proposed a method for collecting and pre-processing the data from the Common Crawl database to obtain clean, high-quality text corpora. 3) We conducted a comprehensive evaluation of our models on thirteen Polish linguistic tasks, comparing them to other available transformer-based models, as well as recent state-of-the-art approaches. 4) We made the source code of our experiments available to the public, along with the pre-trained models.
In this section, we describe our methodology for collecting and pre-processing the data used for training BERT-based language models. We then present the details of the training, explaining our procedure and the selection of hyperparameters used in both models.
Transformer-based models are known for their high capacity [jawahar:hal-02131630, kovaleva-etal-2019-revealing], which means that they can benefit from large quantities of text. An important step in the process of creating a language model, therefore, is to collect a sufficiently large text corpus. We have taken into account that the quality of the text used for training will also affect the final performance of the model. The easiest way to collect a large language-specific corpus is to extract it from Common Crawl - a public web archive containing petabytes of data crawled from web pages. The difficulty with this approach is that web-based data is often noisy and unrepresentative of typical language use, which could eventually have a negative impact on the quality of the model. In response to this, we have developed a procedure for filtering and cleaning the Common Crawl data to obtain a high-quality web corpus. The procedure is as follows:
We download full HTML pages (WARC files in Common Crawl), and use the resulting metadata to filter the documents written in Polish.
In the next step, we use a simple statistical language model (KenLM [heafield2011kenlm]), trained on a small Polish language corpus, to assess the quality of each extracted document. For each text, we compute the perplexity value and discard all texts with perplexity higher than 1000.
Finally, we remove all duplicated texts.
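The last two steps of the procedure above can be sketched as follows. This is a minimal illustration and not the code used in the study: `filter_corpus` and `toy_perplexity` are hypothetical names, the scorer is a stand-in for a trained KenLM model, and exact-match deduplication on whitespace-normalized text is an assumption, since the text does not say how duplicates are detected.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Threshold taken from the text: documents scoring above it are discarded.
MAX_PERPLEXITY = 1000.0


def filter_corpus(
    documents: Iterable[str],
    perplexity: Callable[[str], float],
    max_perplexity: float = MAX_PERPLEXITY,
) -> Iterator[str]:
    """Yield documents that pass the perplexity filter, skipping exact duplicates."""
    seen = set()
    for text in documents:
        # Quality filter: keep only texts the language model finds plausible.
        if perplexity(text) > max_perplexity:
            continue
        # Deduplication on whitespace-normalized text (an assumption).
        key = hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield text


def toy_perplexity(text: str) -> float:
    """Hypothetical stand-in scorer: penalizes texts with few alphabetic characters."""
    letters = sum(ch.isalpha() for ch in text)
    return 100.0 if letters / max(len(text), 1) > 0.5 else 5000.0


if __name__ == "__main__":
    docs = ["Ala ma kota.", "Ala  ma kota.", "1234 5678 !!!", "To jest zdanie."]
    for kept in filter_corpus(docs, toy_perplexity):
        print(kept)
```

In a real pipeline the `perplexity` argument would be backed by the statistical language model, and the generator form lets the filter run over a corpus that does not fit in memory.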
The full training corpus we collected is approximately 135GB in size, and is composed of two components: the web part and the base part. For the web part, which amounts to 115GB of the corpus, we downloaded three monthly dumps of Common Crawl data, from November 2019 to January 2020, and followed the pre-processing steps described above. The base part, which comprises the remaining 20GB, is composed of publicly available Polish text corpora: the Polish language version of Wikipedia (1.5GB), the Polish Parliamentary Corpus (5GB), and a number of smaller corpora from the CLARIN (http://clarin-pl.eu) and OPUS (http://opus.nlpl.eu) projects, as well as Polish books and articles.
The authors of the original BERT paper [devlin-etal-2019-bert] proposed two versions of their transformer-based language model: BERT-large (more parameters and higher computational cost), and BERT-base (fewer parameters, more computationally efficient). To train the models for the Polish language, we adopted the same architectures. Let $L$ denote the number of encoder blocks, $H$ denote the hidden size of the token representation, and $A$ denote the number of attention heads. Specifically, we used $L=12$, $H=768$, $A=12$ for the base model, and $L=24$, $H=1024$, $A=16$ for the large model. The large model was trained on the full 135GB text corpus, and the base model on only the 20GB base part. The training procedure we employed is similar to the one suggested in the RoBERTa pre-training approach [liu2019roberta]. Originally, BERT utilized two training objectives - Masked Language Modeling (MLM), and Next Sentence Prediction (NSP). We trained our models with the MLM objective only, since it has been shown that NSP fails to improve the performance of the pre-trained models on downstream tasks [liu2019roberta]. We also used dynamic token masking, and trained the model with a larger batch size than the original BERT. The base model was trained with a batch size of 8000 sequences for 125 000 training steps; the large model was trained with a batch size of 30 000 sequences for 50 000 steps. The reason for using such a large batch size for the bigger model is to stabilize the training process. During our experiments, we observed significant variations in training loss for smaller batch sizes, indicating that the initial combination of learning rate and batch size had caused an exploding gradient problem. To address the issue, we increased the batch size until the loss stabilized.
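As a rough sanity check on the model sizes, the sketch below estimates the number of encoder parameters and the number of training sequences processed. It assumes the standard BERT-base and BERT-large settings (12/768/12 and 24/1024/16 for blocks/hidden size/heads), a feed-forward inner size of four times the hidden size, a 50 000-token vocabulary and 512 positions; task heads are ignored, so the figures are approximate. `EncoderConfig` and `approx_parameters` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    layers: int            # L, number of encoder blocks
    hidden: int            # H, hidden size
    heads: int             # A, attention heads (does not change the count)
    vocab: int = 50_000    # maximum vocabulary size stated in the text
    positions: int = 512   # maximum sequence length stated in the text


def approx_parameters(cfg: EncoderConfig) -> int:
    """Rough parameter count: embeddings plus encoder blocks, no task heads."""
    h = cfg.hidden
    attention = 4 * (h * h + h)                 # Q, K, V and output projections
    feed_forward = 2 * (h * 4 * h) + 4 * h + h  # two dense layers, inner size 4H
    layer_norms = 2 * 2 * h                     # two LayerNorms per block
    per_block = attention + feed_forward + layer_norms
    embeddings = cfg.vocab * h + cfg.positions * h + 2 * h
    return embeddings + cfg.layers * per_block


BASE = EncoderConfig(layers=12, hidden=768, heads=12)
LARGE = EncoderConfig(layers=24, hidden=1024, heads=16)

# Sequences processed = batch size * number of steps (values from the text).
BASE_SEQUENCES = 8_000 * 125_000
LARGE_SEQUENCES = 30_000 * 50_000

if __name__ == "__main__":
    print(f"base:  ~{approx_parameters(BASE) / 1e6:.0f}M parameters, {BASE_SEQUENCES:,} sequences")
    print(f"large: ~{approx_parameters(LARGE) / 1e6:.0f}M parameters, {LARGE_SEQUENCES:,} sequences")
```

Under these assumptions the large configuration comes to roughly 354M parameters, close to the 355M quoted for the larger model; the small gap is plausibly due to layers not counted here.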
Both models were pre-trained with the Adam optimizer using the following optimization hyperparameters: . We utilized a learning rate scheduler with linear decay. The learning rate is first increased for a warm-up phase of 10 000 update steps to reach a peak of , and then linearly decreased for the remainder of the training. We also mimicked the dropout approach of the original BERT model: a dropout of 0.1 is applied on all layers and attention weights. The maximum length of a sequence was set to 512 tokens. We do not combine sentences from the training corpus: each is treated as a separate training sample. To encode input sequences into tokens, we employed SentencePiece [kudo-richardson-2018-sentencepiece] Byte Pair Encoding (BPE) algorithm, and set the maximum vocabulary size to 50 000 tokens.
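The shape of the learning rate schedule described above can be written down in a few lines. This is a sketch under stated assumptions: the warm-up is taken to be linear and to start from zero, the decay is taken to end at zero on the final step, and the peak value is left as a parameter. `learning_rate` is an illustrative name, not a function from the training code.

```python
def learning_rate(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to the peak (assumed linear).
        return peak_lr * step / warmup_steps
    # Decay phase: shrink linearly over the remaining steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # 10 000 warm-up steps as in the text; 50 000 total steps as for the large model.
    # A peak of 1.0 is used only to show the shape of the curve.
    for step in (0, 5_000, 10_000, 30_000, 50_000):
        print(step, learning_rate(step, peak_lr=1.0, warmup_steps=10_000, total_steps=50_000))
```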
In this section, we discuss the process and results of evaluating our language models on thirteen Polish downstream tasks. Nine of these tasks constitute the recently developed KLEJ benchmark [rybak2020klej]; three of them were introduced by Dadas et al. [dadas-etal-2020-evaluation]; and the last, named entity recognition, was a part of the PolEval (http://2018.poleval.pl/index.php/tasks) evaluation challenge. First, we compare the performance of our models with other Polish and multilingual language models evaluated on the KLEJ benchmark. Next, we present detailed per-task results, comparing our models with the previous state-of-the-art solutions for each of the tasks.
NKJP (Narodowy Korpus Języka Polskiego, the National Corpus of Polish) [przep-2012] is one of the largest text corpora of the Polish language, consisting of texts from Polish books, news articles, web content, and transcriptions of spoken conversations. A part of the corpus, known as the ‘one million subcorpus’, contains annotations of named entities from six categories: ‘persName’, ‘orgName’, ‘geogName’, ‘placeName’, ‘date’, and ‘time’. The authors of the KLEJ benchmark used this subset to create a named entity classification task [rybak2020klej]. First, they filtered out all sentences containing entities of more than one type. Next, they randomly assigned sentences to train, development, and test sets such that each named entity mention appears in only one of these splits. They undersampled the ‘persName’ class, and merged the ‘date’ and ‘time’ classes to improve class balance. Finally, they selected sentences without any named entity, and assigned them the ‘noEntity’ label. The resulting dataset consists of 20 000 sentences belonging to six classes. The task is to predict the presence and type of the named entity in a sentence. Classification accuracy is reported.
8TAGS is a corpus created by Dadas et al. [dadas-etal-2020-evaluation] for their study on sentence representations in the Polish language. This dataset was created automatically by extracting sentences from headlines and short descriptions of articles posted on the Polish social network, wykop.pl. It contains approximately 50 000 sentences, all longer than thirty characters, from eight popular categories: film, history, food, medicine, automotive, work, sport, and technology. The task is to assign a sentence to one of these classes, and classification accuracy is used as the measure.
CBD (Cyberbullying Detection) [ptaszynski2019results] is a binary classification task, the goal of which is to determine whether a Twitter message constitutes a case of cyberbullying or not. This was a sub-task of task 6 in the PolEval 2019 competition. The dataset prepared by the competition’s organizers contains 11 041 tweets, extracted from nineteen of the most popular Polish Twitter accounts in 2017. The F1-score was used to measure the performance of the models.
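For reference, the F1-score used here is the harmonic mean of precision and recall. The sketch below computes it for a binary task, assuming the score is taken with respect to the positive class (here, a message labeled as cyberbullying); `f1_score` is an illustrative helper, not the official scoring script of the competition.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [1, 1, 0, 0, 1]
    pred = [1, 0, 1, 0, 1]
    print(f"F1 = {f1_score(gold, pred):.3f}")
```

On an imbalanced dataset this measure is more informative than accuracy, because a model that always predicts the majority class receives a score of zero.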
DYK (‘Did you know?’, ‘Czy wiesz?’) [MarPtaRadzPia:13] is a dataset used for the evaluation and development of Polish language question answering systems. It consists of 4721 question-answer pairs obtained from the Czy wiesz… Polish Wikipedia project. The answer to each question was found in the linked Wikipedia article. Rybak et al. [rybak2020klej] used this dataset to devise a binary classification task, the goal of which is to predict whether the answer to the given question is correct or not. Positive answers were additionally marked within larger fragments of the answer text. Negative samples were selected by the BPE token overlap between a question and a possible answer. The F1-score is reported for this task.
PSC (The Polish Summaries Corpus) [OGRODNICZUK14.1211] is a corpus of manually created summaries of Polish language news articles. The dataset contains both abstractive free-word summaries and extraction-based summaries created by selecting text spans from the original documents. Based on PSC, Rybak et al. [rybak2020klej] formulated a text-similarity task. They generated positive pairs by matching each extractive summary with the two least similar abstractive ones for the same article. Negative pairs were obtained by finding the two most similar abstractive summaries for each extractive summary, but from different articles. To calculate the similarity between summaries, they used the BPE token overlap. The F1-score was used for evaluation.
PolEmo2.0 [kocon-etal-2019-multi-level] is a corpus of consumer reviews obtained from four domains: medicine, hotels, products, and school. Each of the reviews is annotated with one of four labels: positive, negative, neutral, or ambiguous. In general, the task is to choose the correct label, although here two special versions of the task are distinguished: PolEmo2.0-IN and PolEmo2.0-OUT. In PolEmo2.0-IN, both the training and test sets come from the same domains, namely medicine and hotels. In PolEmo2.0-OUT, however, the test set comes from the product and school domains. In both cases, accuracy was used for evaluation.
Allegro Reviews (AR) [rybak2020klej] is a sentiment analysis dataset of product reviews from the e-commerce marketplace, allegro.pl. Each review has a rating on a five-point scale, in which one is negative, and five is positive. The task is to predict the rating of a given review. The macro-average of the mean absolute error per class (wMAE) is applied for evaluation.
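The wMAE measure can be illustrated with a short sketch: the absolute error is averaged within each true rating class first, and the per-class means are then averaged with equal weight. The helper name `wmae` is illustrative, and the code follows only the definition given above, not the benchmark's own scoring script.

```python
from collections import defaultdict
from typing import Sequence


def wmae(y_true: Sequence[int], y_pred: Sequence[float]) -> float:
    """Macro-average of the mean absolute error computed per true class."""
    errors = defaultdict(list)
    for truth, pred in zip(y_true, y_pred):
        errors[truth].append(abs(truth - pred))
    # Mean error within each class, then an unweighted mean over classes.
    per_class = [sum(errs) / len(errs) for errs in errors.values()]
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Three one-star reviews predicted perfectly, one five-star review missed by 2.
    gold = [1, 1, 1, 5]
    pred = [1, 1, 1, 3]
    plain_mae = sum(abs(t - p) for t, p in zip(gold, pred)) / len(gold)
    print(f"plain MAE = {plain_mae}, wMAE = {wmae(gold, pred)}")
```

In this toy case the plain MAE is 0.5 while the wMAE is 1.0, which shows why the macro-average matters when some ratings are much more frequent than others.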
CDSC (The Compositional Distributional Semantics Corpus) [wroblewska-krasnowska-kieras-2017-polish] is a corpus of 10 000 human-annotated sentence pairs for semantic relatedness and entailment, in which image captions from forty-six thematic groups were used as sentences. Two tasks are proposed based on this dataset. The CDSC-R problem involves predicting the relatedness between a pair of sentences, on a scale of zero to five, in which zero indicates that the sentences are not related, and five indicates that they are highly related. In this task, the Spearman correlation is used as an evaluation measure. The CDSC-E task is to classify whether the premise entails the hypothesis (entailment), negates the hypothesis (contradiction), or is unrelated (neutral). For this task, accuracy is reported.
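The Spearman correlation used for the relatedness task is the Pearson correlation computed on ranks rather than raw values, so it rewards a model for ordering sentence pairs correctly even if its scores are not on the gold scale. The following is a minimal standard-library sketch with tied values receiving their average rank; the function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    gold = [0.1, 2.0, 3.5, 4.9]
    pred = [1.0, 4.0, 9.0, 100.0]  # different scale, same ordering
    print(spearman(gold, pred))
```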
SICK [dadas-etal-2020-evaluation] is a manually translated Polish language version of the English Natural Language Inference (NLI) corpus, SICK (Sentences Involving Compositional Knowledge) [marelli-etal-2014-sick], and consists of 10 000 sentence pairs. As with the CDSC dataset, two tasks can be distinguished here. SICK-R is the task of predicting the probability distribution of relatedness scores (ranging from 1 to 5) for the sentence pair, in which the Spearman correlation is used for evaluation. SICK-E is a multiclass classification problem in which the relationship between two sentences is classified as entailment, contradiction, or neutral. Accuracy is used once again to measure performance.
PolEval-NER 2018 [poleval2018-ner] was task 2 in the PolEval 2018 competition, the goal of which was to detect and assign the correct category and subcategory (if applicable) to a found named entity. In this study the task was simplified, as only the main categories had to be found. The effectiveness of the models is verified by the F1-score measure. This task was prepared on the basis of the NKJP dataset previously presented.
To evaluate our language models on downstream tasks, we fine-tuned them separately for each task. In our experiments, we encounter three types of problem: classification, regression, and Named Entity Recognition (NER). In classification tasks, the model is expected to predict a label from a set of two or more classes. Regression concerns the prediction of a continuous numerical value. NER is a special case of sequence tagging, i.e. predicting a label for each element in a sequence. The dataset for each problem consists of training and test parts, and in most cases also includes a validation part. The general fine-tuning procedure is as follows: we train our model on the training part of the dataset for a specific number of epochs. If the validation set is available, we compute the validation loss after each epoch, and select the model checkpoint with the best validation loss. For datasets without a validation set, we select the last epoch checkpoint. Then, we perform an evaluation on the test set using the selected checkpoint.
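The checkpoint selection rule described above is simple enough to state as code. The sketch assumes one checkpoint is saved per epoch and that validation losses, when available, are listed in epoch order; `select_checkpoint` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def select_checkpoint(val_losses: Optional[Sequence[float]], num_epochs: int) -> int:
    """Return the 0-based index of the epoch whose checkpoint is evaluated on the test set."""
    if val_losses:
        # Validation set available: pick the epoch with the lowest validation loss
        # (the earliest such epoch if several are tied).
        return min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])
    # No validation set: fall back to the checkpoint from the last epoch.
    return num_epochs - 1


if __name__ == "__main__":
    print(select_checkpoint([0.90, 0.50, 0.60], num_epochs=3))
    print(select_checkpoint(None, num_epochs=10))
```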
In the case of classification and regression tasks, we attach an additional fully-connected layer to the output of the [CLS] token, which always remains in the first position of a sequence. For classification, the number of outputs for this layer is equal to the number of classes, and the softmax activation function is used. For regression, it is a linear layer with a single output. The models are fine-tuned with the Adam optimizer using the following hyperparameters: . A learning rate scheduler with polynomial decay is utilized. The first 6% of the training steps are reserved for the warm-up phase, in which the learning rate is gradually increased to reach a peak of . By default, we train for ten epochs with a batch size of sixteen sequences. The specific fine-tuning steps and exceptions to the procedure are discussed below:
Classification on imbalanced datasets - Some of the binary classification datasets considered in the evaluation, such as CBD, DYK, and PSC, are imbalanced, which means that they contain significantly fewer samples of the first class than of the second class. To counter this imbalance, we utilize a simple resampling technique: samples for the minority class in the training set are duplicated, and some samples for the majority class are randomly discarded. We set the resampling factor to 3 for the minority class, and 1 (DYK, PSC) or 0.75 (CBD) respectively for the majority class. Additionally, we increase the batch size for those tasks to thirty-two.
Regression - In many cases, a regression task is restricted to a specific range of values for which the prediction is valid. For example, Allegro Reviews contains user reviews with ratings between one and five stars. For fine-tuning, we scale all the outputs of regression models to be within the range of $[0, 1]$, and then rescale them to their original range during evaluation. Before rescaling, any negative prediction is set to 0, and any prediction greater than 1 is limited to 1.
Named entity recognition - Since sequence tagging, in which the model is expected to generate per-token predictions, is different from simple classification or regression tasks, we decided to adapt an existing named entity recognition approach for fine-tuning using our language models. For this purpose, we employed the method of Shibuya and Hovy [shibuya2019nested], who proposed a transformer-based named entity recognition model with a Conditional Random Fields (CRF) inference layer, and multiple Viterbi-decoding steps to handle nested entities. In our experiments, we used the same hyperparameters as the authors.
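The resampling and regression rescaling steps from the list above can be sketched as follows. The helper names are illustrative, and the reading of the majority-class factor is an assumption: a factor of 0.75 is taken to mean that each majority sample is kept with probability 0.75, while an integer factor of 3 for the minority class means that each sample appears three times.

```python
import random
from typing import Any, List, Sequence, Tuple


def resample(
    samples: Sequence[Any],
    labels: Sequence[int],
    minority_label: int,
    minority_factor: int = 3,
    majority_factor: float = 1.0,
    seed: int = 0,
) -> List[Tuple[Any, int]]:
    """Duplicate minority samples and randomly drop a share of majority samples."""
    rng = random.Random(seed)
    resampled = []
    for sample, label in zip(samples, labels):
        if label == minority_label:
            resampled.extend([(sample, label)] * minority_factor)
        elif rng.random() < majority_factor:
            # Kept with probability majority_factor (always kept when it is 1.0).
            resampled.append((sample, label))
    rng.shuffle(resampled)
    return resampled


def to_unit(value: float, low: float, high: float) -> float:
    """Scale a regression target from [low, high] to [0, 1] for fine-tuning."""
    return (value - low) / (high - low)


def from_unit(prediction: float, low: float, high: float) -> float:
    """Clip a prediction to [0, 1], then map it back to the original range."""
    clipped = min(max(prediction, 0.0), 1.0)
    return low + clipped * (high - low)


if __name__ == "__main__":
    data = resample(["a", "b", "c", "d"], [1, 0, 0, 0], minority_label=1)
    print(sorted(label for _, label in data))
    # Allegro Reviews ratings run from one to five stars.
    print(to_unit(3, 1, 5), from_unit(1.2, 1, 5), from_unit(-0.1, 1, 5))
```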
In this section, we present the results of evaluating our language models on downstream tasks. We repeated the fine-tuning of the models for each task five times. The scores reported are the median values of those five runs. Table 1 shows the evaluation results on the KLEJ benchmark, in comparison with other available Polish and multilingual transformer-based models. The results of other approaches are taken from the KLEJ leaderboard. We split the table into two sections, comparing the BERT-base and BERT-large architectures separately. We can observe that there is a wider selection of base models, and most of them are multilingual, such as the original multilingual BERT (mBERT) [devlin-etal-2019-bert], SlavicBERT [arkhipov-etal-2019-tuning], XLM [NIPS2019_8928], and XLM-R [conneau2019unsupervised]. The only models pre-trained specifically for the Polish language are HerBERT [rybak2020klej] and Polbert. Among the base models, our approach outperforms others by a significant margin. In the case of large models, only the XLM-RoBERTa (XLM-R) pre-trained model has been available until now. XLM-RoBERTa is a recently published multilingual transformer trained on 2.5TB of data in 100 languages. It has been shown to be highly competitive against monolingual models. A direct comparison with our Polish language model demonstrates a consistent advantage of our model - it has achieved better results in seven of the nine tasks included in the KLEJ benchmark.
Table 2 shows a more detailed breakdown of the evaluation results, and includes all the tasks from the KLEJ benchmark, and four additional tasks: SICK-E, SICK-R, 8TAGS, and PolEval-NER 2018. For each task, we define the task type (classification, regression, or sequence tagging), the metric used for evaluation, the previous state-of-the-art, and our results including the absolute difference to the SOTA. The competition between XLM-R and our large model dominates the results, since both models have led to significant improvements in linguistic tasks for the Polish language. In some cases, the improvement over previous approaches is greater than 10%. For example, the CBD task was a part of the PolEval 2019 competition, in which the winning solution by [czapla2019universal] achieved an F1-score of 58.6. Both our model and the XLM-R large model outperform that by at least twelve points, achieving an F1-score of over 70. The comparison for the named entity recognition task is also interesting. The previous state-of-the-art solution by [dadas2019combining] is a model that combined neural architecture with external knowledge sources, such as entity lexicons or a specialized entity linking module based on data from Wikipedia. Our language model managed to outperform this method by 3.8 points without using any structured external knowledge. In summary, our model has demonstrated an improvement over existing methods in eleven of the thirteen tasks.
We have presented two transformer-based language models for Polish, pre-trained using a combination of publicly available text corpora and a large collection of methodically pre-processed web data. We have shown the effectiveness of our models by comparing them with other transformer-based approaches and recent state-of-the-art approaches. We conducted a comprehensive evaluation on a wide set of Polish linguistic tasks, including binary and multi-class classification, regression, and sequence labeling. In our experiments, the larger model performed better than other methods in eleven of the thirteen cases. To accelerate research on NLP for the Polish language, we have released the pre-trained models publicly.