FinEst BERT and CroSloEngual BERT: less is more in multilingual models

06/14/2020 ∙ by Matej Ulčar, et al. ∙ University of Ljubljana

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has, however, mostly focused on the English language. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS-tagging, and dependency parsing), using the multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.




1 Introduction

In natural language processing (NLP), a lot of research focuses on numeric word representations. Static pretrained word embeddings like word2vec [11] have recently been replaced by dynamic, contextual embeddings, such as ELMo [13] and BERT [3]. These generate a word vector based on the context the word appears in, mostly using the sentence as the context.

Large pretrained masked language models like BERT [3] and its derivatives achieve state-of-the-art performance when fine-tuned for specific NLP tasks. The research into these models has been mostly limited to English and a few other well-resourced languages, such as Mandarin Chinese, French, German, and Spanish. However, two massively multilingual masked language models have been released: multilingual BERT (mBERT) [3], trained on 104 languages, and the newer and even larger XLM-RoBERTa (XLM-R) [2], trained on 100 languages. While both mBERT and XLM-R achieve good results, it has been shown that monolingual models significantly outperform multilingual models [20, 10]. In our work, we reduced the number of languages in multilingual models to three: two similar less-resourced languages from the same language family, and English. The main reasons for this choice are to better represent each language and to keep a sensible sub-word vocabulary, as shown by Virtanen et al. [20]. We decided against producing monolingual models because we are interested in using the models in a multilingual setting and for cross-lingual knowledge transfer. By including English in each of the two models, we expect to better transfer existing prediction models from English to the involved less-resourced languages. An additional reason against purely monolingual models for less-resourced languages is the size of the training corpora: BERT-like models use the transformer architecture, which is known to be data hungry.

We thus trained two multilingual BERT models: FinEst BERT was trained on Finnish, Estonian, and English, while CroSloEngual BERT was trained on Croatian, Slovenian, and English. In the paper, we present the creation and evaluation of these models, which required considerable computational resources, unavailable to most NLP researchers. We make the models, which are valuable resources for the involved less-resourced languages, publicly available.

2 Training data and preprocessing

BERT models require large quantities of monolingual data. In Section 2.1 we first describe the corpora used, followed by a short description of their preprocessing in Section 2.2.

2.1 Datasets

We trained two new BERT models on five languages: Finnish, Estonian, Slovenian, Croatian, and English. To obtain high-quality models, we used large monolingual corpora for each language, some of them unavailable to the general public. For English, large corpora are readily available and they are much larger than for other languages. However, high-quality English language models already exist and English is not the main focus of this research. We therefore did not use all available English corpora, in order to prevent English from overwhelming the other languages in our models. Some corpora are available online under permissive licences, others are available only for research purposes or have limited availability. The corpora used in training are a mix of news articles and general web crawl, which we preprocessed and deduplicated. Details about the training set sizes are presented in Table 1, while their description can be found in works on the involved less-resourced languages, e.g., [18].

Language    CroSloEngual   FinEst
Croatian    31%            0%
Slovenian   23%            0%
English     47%            63%
Estonian    0%             13%
Finnish     0%             25%
Table 1: The training corpora sizes in number of tokens and the ratios for each language.

Language    FinEst   CroSloEngual
Croatian    /        27
Slovenian   /        28
English     157      23
Estonian    75       /
Finnish     97       /
Table 2: The sizes of corpora subsets in millions of tokens used to create wordpiece vocabularies.

2.2 Preprocessing

Before using the corpora, we deduplicated them for each language separately, using the Onion (ONe Instance ONly) tool. We applied the tool at the sentence level for those corpora that had their sentences shuffled, and at the paragraph level for the rest. As parameters, we used 9-grams with a duplicate content threshold of 0.9.
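To illustrate what the two parameters control, the following is a minimal sketch of n-gram-based deduplication. It is not Onion's implementation: the function names, whitespace tokenization, and the handling of very short units are our assumptions. A unit (a sentence or a paragraph) is dropped when at least the threshold share of its 9-grams has already been seen in earlier units.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(units: Iterable[str], n: int = 9, threshold: float = 0.9) -> List[str]:
    """Keep a unit unless at least `threshold` of its n-grams were seen before."""
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for unit in units:
        tokens = unit.split()  # assumption: plain whitespace tokenization
        grams = ngrams(tokens, n)
        if not grams:
            # Assumption: units shorter than n tokens are compared as a whole.
            grams = [tuple(tokens)]
        duplicate_share = sum(g in seen for g in grams) / len(grams)
        if duplicate_share < threshold:
            kept.append(unit)
        seen.update(grams)
    return kept
```

Calling `deduplicate(sentences)` on a list of sentences corresponds to the sentence-level setting, and passing paragraphs instead corresponds to the paragraph-level setting.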

BERT models are trained on subword (wordpiece) tokens. We created a wordpiece vocabulary using the bert-vocab-builder tool, which is built upon the tensor2tensor library [19]. We did not process the whole corpora in creating the wordpiece vocabulary, but only a smaller subset. To balance the language representation in the vocabulary, we used samples from each language. The sizes of corpora subsets are shown in Table 2. The created wordpiece vocabularies contain 74,986 tokens for the FinEst model and 49,601 tokens for the CroSloEngual model.

3 Architecture and training

We trained two BERT multilingual models. FinEst BERT was trained on Finnish, Estonian, and English corpora, with altogether billion tokens. CroSloEngual BERT was trained on Croatian, Slovenian, and English corpora with together billion tokens.

Both models use the bert-base architecture [3], which is a 12-layer bidirectional transformer encoder with a hidden layer size of 768 and altogether 110 million parameters. We used whole word masking for the masked language model training task. Both models are cased, i.e. the case information was preserved. We followed the hyper-parameter settings of Devlin et al. [3], except for the batch size and total number of steps. We trained the models for approximately 40 epochs with a maximum sequence length of 128 tokens, followed by approximately 4 epochs with a maximum sequence length of 512 tokens. The exact number of steps was calculated using the expression:

$$s = \frac{N_{tok} \cdot E}{b \cdot \lambda},$$

where $s$ is the number of steps the models were trained for, $N_{tok}$ is the number of tokens in the train corpora, $E$ is the desired number of epochs (in our case 40 and 4), $b$ is the batch size, and $\lambda$ is the maximum sequence length.
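The step calculation can be checked numerically. The sketch below reads the definitions above as steps = tokens × epochs / (batch size × sequence length), which is our reading of the expression, and inverts it to see which corpus size the step counts and batch sizes reported in this section imply. The function names are ours.

```python
def training_steps(num_tokens: float, epochs: float, batch_size: int, seq_len: int) -> float:
    """Steps needed to process `num_tokens` tokens `epochs` times."""
    return num_tokens * epochs / (batch_size * seq_len)


def implied_tokens(steps: float, epochs: float, batch_size: int, seq_len: int) -> float:
    """Invert the expression: corpus size implied by a reported step count."""
    return steps * batch_size * seq_len / epochs


# (steps, epochs, batch size, max sequence length) as reported in this section.
phases = {
    "FinEst, seq 128": (1.13e6, 40, 1024, 128),
    "FinEst, seq 512": (113e3, 4, 256, 512),
    "CroSloEngual, seq 128": (3.6e6, 40, 512, 128),
    "CroSloEngual, seq 512": (360e3, 4, 128, 512),
}

for name, (steps, epochs, batch, seq) in phases.items():
    tokens = implied_tokens(steps, epochs, batch, seq)
    print(f"{name}: about {tokens / 1e9:.2f} billion tokens")
```

Under this reading, the two training phases of each model imply the same corpus size (roughly 3.7 billion tokens for FinEst BERT and 5.9 billion for CroSloEngual BERT), which suggests the reading is consistent with the reported numbers. The step counts in the text are rounded, so these values are approximate.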

We trained FinEst BERT on a single Google Cloud TPU v3 for a total of 1.24 million steps where the first 1.13 million steps used the batch size of 1024 and sequence length 128, and the last 113 thousand steps used the batch size 256 and sequence length 512. Similarly, CroSloEngual BERT was trained on a single Google Cloud TPU v2 for a total of 3.96 million steps, where the first 3.6 million steps used the batch size of 512 and sequence length 128, and the last 360 thousand steps were trained with the batch size 128 and sequence length 512. Training took approximately 2 weeks for FinEst BERT and approximately 3 weeks for CroSloEngual BERT.
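The whole word masking mentioned earlier in this section can be illustrated with a minimal sketch. It assumes the usual wordpiece convention that continuation pieces start with "##", and it simplifies the BERT procedure: every selected word is replaced by "[MASK]", without the random-replacement and keep-original cases. The function name and the example tokens are ours.

```python
import random
from typing import List, Optional


def whole_word_mask(tokens: List[str], mask_prob: float = 0.15,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Mask all wordpieces of a word together instead of masking pieces independently."""
    rng = rng or random.Random()
    # Group wordpiece indices into words: a "##" piece continues the previous word.
    words: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        # One random draw per word, so a word is masked entirely or not at all.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked


# Hypothetical wordpiece sequence: three pieces of one word, then a one-piece word.
example = ["juo", "##ksen", "##tele", "talo"]
print(whole_word_mask(example, mask_prob=0.5, rng=random.Random(0)))
```

The design point is the single random draw per word: masking wordpieces independently would let the model recover a masked piece from its unmasked neighbours within the same word.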

4 Evaluation

We evaluated the two new BERT models on three downstream evaluation tasks available for the four involved less-resourced languages: named entity recognition (NER), part-of-speech tagging (POS), and dependency parsing (DP). We compared both models with the BERT-base-multilingual-cased model (mBERT) on the languages each model covers, i.e. FinEst BERT was compared with mBERT on Finnish, Estonian, and English, while CroSloEngual BERT was compared with mBERT on Croatian, Slovenian, and English.

4.1 Named Entity Recognition

The named entity recognition (NER) task is a sequence labeling task, which tries to correctly identify and classify each token from an unstructured text into one of the predefined named entity (NE) classes or, if the token is not part of an NE, to classify it as not a named entity. The most common named entity classes are personal names, locations, and organizations. We used various datasets, which do not cover the same set of classes. We therefore adapted the datasets to allow a more direct comparison between languages, by reducing them to the four labels they all have in common: PER (person), LOC (location), ORG (organization), and O (other). All tokens that are not named entities or that belong to any NE class other than person, location, or organization were labeled as ’O’.
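The label reduction described above can be sketched as a small mapping function. This is an illustration under assumptions, not the conversion script used for the datasets: the handling of BIO-style prefixes ("B-", "I-") and the example tags are hypothetical, since each dataset has its own tag inventory.

```python
KEPT_CLASSES = {"PER", "LOC", "ORG"}


def reduce_label(tag: str) -> str:
    """Map a dataset-specific NE tag to one of PER, LOC, ORG, or O."""
    # Assumption: tags may carry a BIO-style prefix such as "B-" or "I-".
    core = tag[2:] if tag[:2] in {"B-", "I-"} else tag
    core = core.upper()
    # Any class outside the three shared ones is folded into "O".
    return core if core in KEPT_CLASSES else "O"


# Hypothetical input tags for one sentence.
tags = ["B-PER", "I-PER", "O", "B-MISC", "B-LOC"]
print([reduce_label(t) for t in tags])
```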

For Croatian and Slovenian, we used data from hr500k [9] and ssj500k [7], respectively. Not all sentences in ssj500k are annotated, so we excluded those that are not annotated. The English dataset comes from the CoNLL-2003 shared task [17]. For Finnish, we used the Finnish News Corpus for NER [15], and for Estonian, we used Nimeüksuste korpus [8]. The statistics of each dataset are shown in Table 3.

Language PER LOC ORG Density N
Croatian 10241 7445 11216 0.057 506457
English 17050 12316 14613 0.146 301418
Estonian 8490 6326 6149 0.096 217272
Finnish 3402 2173 11258 0.087 193742
Slovenian 4478 2460 2667 0.049 194667
Table 3: The number of tokens labeled with each label (PER, LOC, ORG), the density of these labels (their sum divided by the number of all tokens) and the number of all tokens (N) for datasets in all languages.
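The density column of Table 3 can be recomputed from the other columns. The short check below uses the counts from the table and the definition in its caption (the sum of PER, LOC, and ORG tokens divided by the number of all tokens). The variable and function names are ours.

```python
# (PER, LOC, ORG, N) per language, copied from Table 3.
table3 = {
    "Croatian": (10241, 7445, 11216, 506457),
    "English": (17050, 12316, 14613, 301418),
    "Estonian": (8490, 6326, 6149, 217272),
    "Finnish": (3402, 2173, 11258, 193742),
    "Slovenian": (4478, 2460, 2667, 194667),
}


def label_density(per: int, loc: int, org: int, n_tokens: int) -> float:
    """Share of tokens labeled with one of the three NE classes."""
    return (per + loc + org) / n_tokens


for language, counts in table3.items():
    print(f"{language}: {label_density(*counts):.3f}")
```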

To evaluate the performance of BERT embeddings on the NER task, we trained NER models using Hugging Face’s Transformers library, basing the code on their NER example. We fine-tuned each of our BERT models with an added token classification head for 3 epochs on the NER data. We compared the results with the BERT-base-multilingual-cased (mBERT) model, which we fine-tuned with exactly the same parameters on the same data.

Train lang Test lang mBERT CroSloEngual
Croatian Croatian 0.795 0.894
Slovenian Slovenian 0.903 0.917
English English 0.940 0.949
Croatian English 0.793 0.866
English Croatian 0.638 0.798
Slovenian English 0.781 0.833
English Slovenian 0.736 0.843
Croatian Slovenian 0.825 0.908
Slovenian Croatian 0.755 0.847
Table 4: The results of NER evaluation task on Croatian, Slovenian, and English. The scores are average scores of the three named entity classes. A NER model was trained on “train language” dataset and tested on “test language” dataset using two different BERT models for all possible combinations of train and test languages.

We evaluated the models in a monolingual setting (training and testing on the same language) and a crosslingual setting (training on one language, testing on another). We present the results as macro average scores of the three NE classes, excluding ’O’ label. Comparison between CroSloEngual BERT and mBERT is shown in Table 4, comparison between FinEst BERT and mBERT is shown in Table 5.
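The macro averaging over the three NE classes can be sketched as follows. The text does not name the per-class metric, so the token-level F1 score used here is an assumption, as are the function name and the toy labels.

```python
from typing import List

CLASSES = ("PER", "LOC", "ORG")  # the 'O' label is excluded from the average


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class token-level F1 over PER, LOC, and ORG."""
    scores = []
    for cls in CLASSES:
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example: one spurious PER prediction and one missed ORG token.
gold = ["PER", "O", "LOC", "ORG", "O"]
pred = ["PER", "PER", "LOC", "O", "O"]
print(round(macro_f1(gold, pred), 3))
```

Because each class contributes equally to a macro average, a rare class such as LOC in the Finnish data weighs as much as a frequent one.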

The difference in performance of each BERT on English data is negligible. In other languages, our models outperform the multilingual BERT, and the difference is especially large in Croatian. In the crosslingual setting, both FinEst BERT and CroSloEngual BERT show a significant improvement over mBERT, especially when one of the two languages is English. This leads us to believe that multilingual BERT models with fewer languages are more suitable for crosslingual knowledge transfer.

Train lang Test lang mBERT FinEst
Finnish Finnish 0.922 0.959
Estonian Estonian 0.906 0.930
English English 0.940 0.942
Finnish English 0.692 0.810
English Finnish 0.770 0.901
Estonian English 0.765 0.815
English Estonian 0.762 0.839
Finnish Estonian 0.795 0.879
Estonian Finnish 0.839 0.912

Table 5: The results of NER evaluation task on Finnish, Estonian, and English. The scores are average scores of the three named entity classes. A NER model was trained on “train language” dataset and tested on “test language” dataset using two different BERT models for all possible combinations of train and test languages.

4.2 Part-of-speech tagging and dependency parsing

We evaluated BERT models on two more classification tasks: part-of-speech (POS) tagging and dependency parsing. In the POS tagging task, we attempt to correctly classify each token within a given set of grammatical categories (verb, adjective, punctuation, adverb, noun, etc.). The dependency parsing task attempts to predict the tree structure representing the syntactic relations between words in a given sentence.

We trained classifiers on universal dependencies (UD) treebank datasets, using universal part-of-speech (UPOS) tag set. For Croatian, we used treebank by Agić and Ljubešić [1]. For English, we used A Gold Standard Dependency Corpus [16]. For Estonian, we used Estonian Dependency Treebank [12], converted to UD. Finnish treebank used is based on the Turku Dependency Treebank [5], which was also converted to UD [14]. Slovenian treebank [4] is based on the ssj500k corpus [7].

We used the Udify tool [6] to train both POS tagging and dependency parsing classifiers at the same time. We fine-tuned each BERT model for 80 epochs on the treebank data. We kept the tool parameters at default values, except for the “warmup_steps” and “start_step” values, which we changed to equal the number of training batches in one epoch.

We present the results of POS tagging as UPOS accuracy score in Table 6 and Table 7. The difference in performance between BERT models is very small on this task. FinEst and CroSloEngual BERTs perform slightly better than mBERT on all languages in the monolingual setting, except Croatian, where mBERT and CroSloEngual BERT are equal. The differences are more pronounced in the cross-lingual setting. When training on Slovenian, Finnish, or Estonian data and testing on English data, CroSloEngual and FinEst BERT significantly outperform mBERT. On the other hand, when training on English and testing on Croatian, mBERT outperforms CroSloEngual BERT.

Train lang. Test lang. mBERT CroSloEngual
Croatian Croatian 0.983 0.983
English English 0.969 0.972
Slovenian Slovenian 0.987 0.991
English Croatian 0.876 0.869
English Slovenian 0.857 0.859
Croatian English 0.750 0.756
Croatian Slovenian 0.917 0.934
Slovenian English 0.686 0.723
Slovenian Croatian 0.920 0.935
Table 6: The embeddings quality measured on the UPOS tagging task, using UPOS accuracy score for FinEst BERT, CroSloEngual BERT and BERT-base-multilingual-cased (mBERT).
Train lang. Test lang. mBERT FinEst
English English 0.969 0.970
Estonian Estonian 0.972 0.978
Finnish Finnish 0.970 0.981
English Estonian 0.852 0.878
English Finnish 0.847 0.872
Estonian English 0.688 0.808
Estonian Finnish 0.872 0.913
Finnish English 0.535 0.701
Finnish Estonian 0.888 0.919
Table 7: The embeddings quality measured on the UPOS tagging task, using UPOS accuracy score for FinEst BERT, CroSloEngual BERT and BERT-base-multilingual-cased (mBERT).

We present the results of the dependency parsing task as unlabeled attachment score (UAS) and labeled attachment score (LAS). In the monolingual setting, CroSloEngual BERT shows improvement over mBERT on all three languages (Table 8), with the highest improvement on Slovenian and only a marginal improvement on English. FinEst BERT outperforms mBERT on Estonian and Finnish, with the biggest margin being on the Finnish data (Table 9). FinEst BERT and mBERT perform almost equally on English data.
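For readers unfamiliar with the two scores, the sketch below computes them for a single sentence using their standard definitions: UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the correct dependency relation. The function name and the toy parse are ours.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency relation)


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one or more sentences given per-token arcs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS counts tokens with the correct head, regardless of the relation label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens where both the head and the relation label are correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy three-token sentence: all heads are right, one relation label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(attachment_scores(gold, pred))
```

LAS can never exceed UAS, which matches the pattern in Tables 8 and 9.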

In the crosslingual setting, the results are similar to those seen on the POS tagging task. FinEst BERT and CroSloEngual BERT show major improvements over mBERT in the English-Estonian, English-Finnish, and English-Slovenian pairs, and minor improvements in the Estonian-Finnish and Croatian-Slovenian pairs. Again, mBERT outperformed CroSloEngual BERT when the dependency parser was trained on English data and tested on Croatian data, although only marginally and only in UAS (0.824 vs. 0.822).

Train lang. Test lang. mBERT-UAS mBERT-LAS CroSloEngual-UAS CroSloEngual-LAS
Croatian Croatian 0.930 0.891 0.940 0.903
English English 0.917 0.894 0.922 0.899
Slovenian Slovenian 0.938 0.922 0.957 0.947
English Croatian 0.824 0.724 0.822 0.725
English Slovenian 0.830 0.719 0.848 0.736
Croatian English 0.759 0.627 0.782 0.657
Croatian Slovenian 0.880 0.802 0.912 0.840
Slovenian English 0.741 0.578 0.794 0.648
Slovenian Croatian 0.861 0.773 0.891 0.810
Table 8: The embeddings quality measured on the dependency parsing task. Results are given as UAS and LAS for CroSloEngual BERT and BERT-base-multilingual-cased (mBERT).
Train lang. Test lang. mBERT-UAS mBERT-LAS FinEst-UAS FinEst-LAS
English English 0.917 0.894 0.918 0.895
Estonian Estonian 0.880 0.848 0.909 0.882
Finnish Finnish 0.898 0.867 0.933 0.915
English Estonian 0.697 0.531 0.768 0.591
English Finnish 0.706 0.561 0.781 0.624
Estonian English 0.633 0.492 0.726 0.567
Estonian Finnish 0.784 0.695 0.864 0.801
Finnish English 0.543 0.433 0.684 0.558
Finnish Estonian 0.782 0.691 0.852 0.778
Table 9: The embeddings quality measured on the dependency parsing task. Results are given as UAS and LAS for FinEst BERT and BERT-base-multilingual-cased (mBERT).

5 Conclusion

We built two large pretrained trilingual BERT-based masked language models, Croatian-Slovenian-English and Finnish-Estonian-English. We showed that the new CroSloEngual and FinEst BERTs perform substantially better than the massively multilingual mBERT on the NER task in both monolingual and cross-lingual settings. The results on POS tagging and DP tasks show considerable improvement of the proposed models for several monolingual and cross-lingual pairs, while they are worse than mBERT only when training on English and testing on Croatian.

In the future, we plan to investigate different combinations and proportions of less-resourced languages in the creation of pretrained BERT-like models, and to apply the newly trained BERT models to problems in the news media industry.


The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411. This paper is supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media). Research was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).


  • [1] Ž. Agić and N. Ljubešić (2015) Universal dependencies for Croatian (that work for Serbian, too). In The 5th Workshop on Balto-Slavic Natural Language Processing, pp. 1–8.
  • [2] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [4] K. Dobrovoljc, T. Erjavec, and S. Krek (2017) The universal dependencies treebank for Slovenian. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017).
  • [5] K. Haverinen, J. Nyblom, T. Viljanen, V. Laippala, S. Kohonen, A. Missilä, S. Ojala, T. Salakoski, and F. Ginter (2013) Building the essential resources for Finnish: the Turku dependency treebank. LREC.
  • [6] D. Kondratyuk and M. Straka (2019) 75 languages, 1 model: parsing universal dependencies universally. In Proceedings of the 2019 EMNLP-IJCNLP, pp. 2779–2795.
  • [7] S. Krek, K. Dobrovoljc, T. Erjavec, S. Može, N. Ledinek, N. Holz, K. Zupan, P. Gantar, T. Kuzman, J. Čibej, Š. Arhar Holdt, T. Kavčič, I. Škrjanec, D. Marko, L. Jezeršek, and A. Zajc (2019) Training corpus ssj500k 2.2. Note: Slovenian language resource repository CLARIN.SI.
  • [8] S. Laur (2013) Nimeüksuste korpus. Note: Center of Estonian Language Resources.
  • [9] N. Ljubešić, F. Klubička, Ž. Agić, and I. Jazbec (2016) New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Proceedings of the LREC 2016.
  • [10] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot (2019) CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.
  • [11] T. Mikolov, Q. V. Le, and I. Sutskever (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • [12] K. Muischnek, K. Müürisep, and T. Puolakainen (2016) Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies. In Proceedings of LREC 2016.
  • [13] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • [14] S. Pyysalo, J. Kanerva, A. Missilä, V. Laippala, and F. Ginter (2015) Universal dependencies for Finnish. In Proceedings of NoDaLiDa 2015.
  • [15] T. Ruokolainen, P. Kauppinen, M. Silfverberg, and K. Lindén (2020) A Finnish news corpus for named entity recognition. Language Resources and Evaluation 54 (1), pp. 247–272.
  • [16] N. Silveira, T. Dozat, M. de Marneffe, S. Bowman, M. Connor, J. Bauer, and C. D. Manning (2014) A gold standard dependency corpus for English. In Proceedings of LREC-2014.
  • [17] E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of CoNLL-2003, W. Daelemans and M. Osborne (Eds.), pp. 142–147.
  • [18] M. Ulčar and M. Robnik-Šikonja (2020) High quality ELMo embeddings for seven less-resourced languages. In Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4731–4738.
  • [19] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al. (2018) Tensor2Tensor for neural machine translation. In Proceedings of the AMT, pp. 193–199.
  • [20] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo (2019) Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.