What the [MASK]? Making Sense of Language-Specific BERT Models

03/05/2020 ∙ by Debora Nozza, et al. ∙ Università Bocconi

Recently, Natural Language Processing (NLP) has witnessed impressive progress in many areas, due to the advent of novel, pretrained contextual representation models. In particular, Devlin et al. (2019) proposed a model, called BERT (Bidirectional Encoder Representations from Transformers), which enables researchers to obtain state-of-the-art performance on numerous NLP tasks by fine-tuning the representations on their data set and task, without the need for developing and training highly-specific architectures. The authors also released multilingual BERT (mBERT), a model trained on a corpus of 104 languages, which can serve as a universal language model. This model obtained impressive results on a zero-shot cross-lingual natural language inference task. Driven by the potential of BERT models, the NLP community has started to investigate and generate an abundant number of BERT models that are trained on a particular language and tested on a specific data domain and task. This allows us to evaluate the true potential of mBERT as a universal language model, by comparing it to the performance of these more specific models. This paper presents the current state of the art in language-specific BERT models, providing an overall picture with respect to different dimensions (i.e., architectures, data domains, and tasks). Our aim is to provide an immediate and straightforward overview of the commonalities and differences between language-specific BERT models and mBERT. We also provide an interactive and constantly updated website that can be used to explore the information we have collected, at https://bertlang.unibocconi.it.




1 Introduction

In all natural languages, word meaning varies with and is determined by context, and one of the main challenges of Natural Language Processing (NLP) has been (and remains) to model this property of meaning. Embedding-based language models [17] have been shown to capture word meaning more efficiently than previous methods, allowing for both qualitative analysis of similarities and improved performance when used as input to predictive models. However, while embeddings represent word types based on their general contextual co-occurrences, they do not learn context-specific representations for each word token.

Recently, NLP has witnessed the advent of a groundbreaking new language model developed by Google researchers, called Bidirectional Encoder Representations from Transformers (BERT) [11]. It learns contextual representations for word tokens, thereby capturing their contextual variation in meaning. Contextualized BERT embeddings have since dominated the leaderboards of a wide variety of NLP tasks.

The power of BERT representations lies in the fact that it is essentially a pretrained model that can be fine-tuned on specific downstream tasks, which enables it to achieve state-of-the-art results. The fundamental underlying component of this architecture is the Transformer model [26], an attention-based mechanism that has been shown to be effective in many different tasks. Both the Transformer and BERT have gathered much attention, and there is now a wealth of research articles and blog posts describing the inner workings of these models [22, among others].

Given the overwhelming success of BERT, a multilingual BERT model (mBERT)111https://github.com/google-research/bert/blob/master/multilingual.md has been proposed, supporting over 100 languages, including Arabic, Dutch, French, German, Italian, and Portuguese. The model has been applied to data from different domains, like social media posts or newspaper articles, and has shown great capabilities in zero-shot cross-lingual tasks [20].

Due to the remarkable results of these models, an abundance of BERT model extensions has recently been introduced by researchers and industry practitioners from several countries: currently, there are around 5,000 repositories mentioning “bert” on GitHub.com, and we can expect further demand for BERT extensions. These models are trained on a particular language and tested on a specific data domain and task, with the promise of maximizing performance across more tasks in that language, saving other users further fine-tuning.

However, it has so far not been clearly demonstrated whether the advantage of training a language-specific model, rather than using the unspecific multilingual model, is worth the expense in terms of computational resources222Training these models requires an amount of computational resources that is unaffordable for many users, and comes with severe ecological costs: training BERT on a GPU is roughly equivalent to a trans-American flight in terms of CO2 emissions [24]..

Moreover, the NLP community is now facing the problem of organizing the plethora of models that are being released. These models are not only trained on different data sets, but also use different configurations and architectural variants. To give a concrete example, the original BERT model was trained using the WordPiece tokenizer [29], whereas a recent language-specific model (CamemBERT [16]) used the open-source SentencePiece tokenizer [12].
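To illustrate what such tokenizer choices amount to, the following is a minimal sketch of WordPiece-style greedy longest-match subword tokenization; the tiny vocabulary and the example words are made up purely for illustration, and real implementations (with ~30k-entry vocabularies) differ in details.

```python
# Greedy longest-match subword tokenization in the WordPiece style.
# The "##" prefix marks a continuation piece inside a word, as in BERT.
def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # word cannot be decomposed with this vocabulary
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for this example.
vocab = {"token", "##ization", "un", "##afford", "##able"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unaffordable", vocab))  # ['un', '##afford', '##able']
```

SentencePiece differs mainly in that it operates on raw text (treating whitespace as a normal symbol) and learns its segmentation without requiring pre-tokenized words.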

Identifying which model is best for a specific task, and whether the mBERT model is better than language-specific models, is a key step for future progress in NLP, and will impact the use of computational resources. Surveying both GitHub and the literature, we identified 30 different pretrained language-specific BERT models, covering 18 languages and tested on 29 tasks, resulting in 177 different performance results [14, 3, 16, 1, 13, 4, 27, 21, 9, 8]. We outline some of the parameters here, and introduce the associated website for up-to-date searches. We hope to give NLP researchers and practitioners a clear overview of the tradeoffs before they approach any NLP task with such a model.

The contributions of this paper are the following:

  1. we present an overall picture of language-specific BERT models from an architectural, task- and domain-related point of view;

  2. we summarize the performance of language-specific BERT models and compare with the performance of the multilingual BERT model (if available);

  3. we introduce a website to interactively explore state-of-the-art models. We hope this can serve as a shared repository for researchers to decide which model best suits their needs.

2 Bidirectional Encoder Representations from Transformers

We assume that most readers who are interested in the topic have a basic understanding of BERT and its components. However, for completeness’ sake, we include a brief and high-level overview of the most important aspects here.

2.1 BERT

BERT uses the Transformer [26] architecture to learn word embeddings. The Transformer is a recent architectural advancement that can be included in deep networks for sequence modeling. Instead of modeling sequences with RNNs or LSTMs, the Transformer learns global dependencies between input and output using only attention mechanisms.
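The attention mechanism at the core of the Transformer can be sketched in a few lines of plain Python; this follows the scaled dot-product formulation of Vaswani et al. [26], with toy dimensions and values chosen only for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per key position, summing to 1
        # Output is the weight-averaged value vector.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two query positions attending over three key/value positions (d_k = 2).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Because the weights form a probability distribution over positions, each output is a convex combination of the value vectors, which is what lets every token "attend" to every other token regardless of distance.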

Transformers greatly shifted the focus of the research community towards attention-based architectures. BERT incorporates the encoder component of this Transformer architecture.

Devlin et al. [11] introduced BERT in 2018 as a context-sensitive alternative to previous word embeddings (which assume a word always has the same representation, independent of its context). The model essentially stacks several Transformer encoder layers. It uses masks to blank out individual words, forcing the model to “fill in the blanks”, thereby increasing its context-sensitivity. Two key elements in the BERT pretraining process are the masked language model and next sentence prediction. In the former, a random subsample (in the BERT paper, 15%) of the words in a text is replaced by a [MASK] token, and the task is to predict the correct token. The latter is the task of predicting how likely one sentence is to follow another one in a text. See Figure 1 for a schematic view of BERT. Unlike traditional word embeddings, BERT representations are not a fixed lookup table, but require the full context to produce a word representation. The vocabulary is defined in advance and is based on WordPiece [29], a tokenization algorithm that generates sub-word tokens.
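The masking step of the masked language model can be sketched as follows. This is a simplification: the actual BERT recipe also replaces a fraction of the selected tokens with random words or leaves them unchanged, and the sentence and seed here are arbitrary.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random ~15% of tokens with [MASK]; return the corrupted
    sequence and the prediction targets (position -> original token)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted, targets = list(tokens), {}
    for i in positions:
        targets[i] = corrupted[i]   # the model must recover this token
        corrupted[i] = "[MASK]"
    return corrupted, targets

tokens = "the cat sat on the mat and looked at the dog".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

During pretraining, the model receives the corrupted sequence and is trained to predict the original token at each masked position from its full bidirectional context.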

Due to its large number of parameters, the model usually comes in a pretrained format, which can be fine-tuned on a specific task or data set. Simple classification layers can be stacked on top of the pretrained BERT to provide predictions for tasks such as sentiment analysis or text classification.
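Stacking a classification layer on top of a pooled BERT representation amounts to a linear map followed by a softmax. The sketch below uses a made-up 4-dimensional stand-in for BERT's 768-dimensional pooled output, with arbitrary weights; in practice the weights are learned during fine-tuning.

```python
import math

def classify(pooled, W, b):
    """Linear layer + softmax over a pooled sentence representation."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]         # class probabilities

pooled = [0.2, -0.1, 0.4, 0.05]          # stand-in for BERT's pooled [CLS] vector
W = [[0.5, -0.2, 0.1, 0.0],              # hypothetical weights, class "negative"
     [-0.3, 0.4, 0.2, 0.1]]              # hypothetical weights, class "positive"
b = [0.0, 0.1]
probs = classify(pooled, W, b)
print(probs)
```

Fine-tuning then trains this head (and, typically, the whole pretrained stack beneath it) on labeled task data.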

Figure 1: A schematic representation of BERT, masked language model and next sentence prediction. Different words have different meanings and BERT looks at the word context to generate contextual representations.

2.2 Multilingual BERT, ALBERT and RoBERTa

Subsequently, BERT was extended to cover several languages. Multilingual BERT (mBERT) was part of the original paper [11], and is pretrained over several languages using Wikipedia data. This allows for zero-shot learning across languages, i.e., training on data from one language and applying the model to data in another language.

Along the same lines, Lan et al. (2019) introduced A Lite BERT (ALBERT) to reduce the computational needs of BERT. ALBERT includes two parameter-reduction techniques that reduce the number of trainable parameters without significantly compromising performance. Moreover, the authors introduce an additional self-supervised loss, related to sentence-order prediction, meant to address the limits of the next sentence prediction task used in BERT. Another recent paper has shown that BERT is sensitive to different training strategies; its authors introduce RoBERTa [15] as a well-optimized version of BERT.

3 Making Sense of Language-Specific BERT Models

While multi- and cross-lingual BERT representations allow for zero-shot learning and capture universal semantic structures, they do gloss over language-specific differences. Consequently, a number of language-specific BERT models have been developed to fill that need. These models almost always showed better performance on the language they were trained for than the universal model.

In order to navigate this wealth of constantly changing information, a simple overview paper is no longer sufficient. While we aim to give a general overview here, we refer the interested reader to the constantly updated online resource, BertLang.

3.1 BertLang

We introduce BertLang (https://bertlang.unibocconi.it), a website where we have gathered different language-specific models that have been introduced for a variety of tasks and data sets. Most of the models are available as GitHub links, and some of them are described in research papers, but very few have been published in peer-reviewed conferences333We do not include resources that feature only a model without reporting any performance results.. In addition to providing a searchable interface, BertLang also provides the possibility to add new information. While we hope to independently verify the reported results in the future, for now we only list the various models and conditions.

We open-source both the data and the code used to build the website444https://github.com/MilaNLProc/bertlang; this will make it possible for other researchers to contribute to the collection of language-specific BERT models.

Figure 2: The BertLang website front-end interface.

Figure 2 shows the front-end page of our website: a table that contains languages, tasks, and performances of different models. We also provide links to the references and code from which we retrieved that information. Beyond this, we report the performance evaluation metric, the average performance obtained by the language-specific models and, where available, the corresponding performance of mBERT and their difference.

Task Metric Avg. lang-specific BERT Avg. mBERT Diff.
Named Entity Recognition F1 85.26 80.87 4.39
Natural Language Inference Accuracy 78.35 74.60 3.75
Paraphrase Identification Accuracy 88.44 87.74 0.70
Part of Speech Tagging Accuracy 97.06 95.87 1.19
Part of Speech Tagging UPOS 98.28 97.33 0.95
Sentiment Analysis Accuracy 90.17 83.80 6.37
Text Classification Accuracy 88.96 85.22 3.75
Table 1: Summary of average performance of different language-specific BERT models on various tasks.

3.2 Language-Specific BERT Models

The models we index vary along a number of dimensions, which we discuss below. The main distinction, however, is the specific language the model was trained on. The availability of data sets in that language determines the tasks and domains the model was applied to.

Table 1 shows a summary of the results for the most frequent NLP tasks investigated across several languages. The results clearly show that, on average, language-specific BERT models obtain higher results than mBERT on all the considered tasks. However, while this holds for averages, with the proliferation of languages, tasks, and data sets there is huge variation in individual performances. In the following, we analyze the possible views of the collected results in more detail.
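Averages like those in Table 1 are straightforward to recompute from a list of per-model results. The records below are hypothetical placeholders, not actual BertLang entries; they only illustrate the aggregation.

```python
from collections import defaultdict

# Hypothetical (task, language-specific score, mBERT score) records.
records = [
    ("NER", 86.0, 81.0),
    ("NER", 84.0, 80.0),
    ("Sentiment Analysis", 91.0, 84.0),
    ("Sentiment Analysis", 89.0, 83.0),
]

# Group the per-model results by task.
by_task = defaultdict(list)
for task, specific, mbert in records:
    by_task[task].append((specific, mbert))

# Average each column and report the difference, as in Table 1.
for task, pairs in sorted(by_task.items()):
    avg_specific = sum(s for s, _ in pairs) / len(pairs)
    avg_mbert = sum(m for _, m in pairs) / len(pairs)
    print(f"{task}: {avg_specific:.2f} vs {avg_mbert:.2f} "
          f"(diff {avg_specific - avg_mbert:+.2f})")
```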

Languages Covered

The proposed language-specific BERT models range from languages with a high number of resources available on the web for training (e.g., French, Italian) to low-resource languages, such as Yorùbá and Mongolian. To date, we cover 18 languages.

Interestingly, the results suggest that low-resource languages (e.g., Yorùbá and Arabic) are actually the ones with the highest improvement over mBERT. Since mBERT is trained on Wikipedia, this finding can probably be explained by the fact that developers of language-specific BERT models are more likely to be experts on other resources for that language, or to collect more data. This makes a greater difference for low-resource languages.


Architectures

The most popular architecture is the standard BERT one, but lately the introduction (and the strong performance) of both ALBERT and RoBERTa has made researchers consider these two models as well for pretraining language models.

RoBERTa has been used as the base model for the French CamemBERT [16], as well as the Italian GilBERTo555https://github.com/idb-ita/GilBERTo and UmBERTo666https://github.com/musixmatchresearch/umberto.

mBERT was used to initialize and fine-tune models for languages such as Russian [13], Slavic languages [4]777Here “Slavic” includes Russian, Bulgarian, Czech and Polish. and Yorùbá [1]. The latter is a noteworthy example of how the scarcity of available data in low-resource languages can be overcome: fine-tuning mBERT instead of pretraining from scratch allowed the authors to produce a model without access to large amounts of data.

NLP tasks

We currently index results for 29 NLP tasks. Table 1 reports the results for the most popular tasks in the collected data, with Named Entity Recognition (NER) being the most frequent task (22 entries). Looking at the source of the test data (see the released website for the complete information), we observe that some multilingual benchmark data sets are used for the same NLP task in different languages. Some of them have been released by research groups publishing in well-known NLP conferences [30, 23, 7, 28], while others have been released in conjunction with shared tasks such as SemEval or CoNLL [31, 18, 6, 5]. The latter group shows the effect shared tasks have on providing the NLP community with benchmark references.

Remarkably, the noun sense disambiguation task is the only task where language-specific BERT performances are lower than the mBERT ones. As stated by the authors [14], this could be due to the fact that the training corpora have been machine-translated from English to French, making mBERT probably better suited for the task than a model trained on native French.

Sentiment analysis is the task where language-specific BERT models obtain the highest improvements with respect to mBERT. Following the previous intuition, for Arabic [3] this can be explained by considering the peculiar language of the test data set, which demonstrates the ability of the language-specific AraBERT model to handle dialects — even if they were not explicitly included in the training set.

Beyond the well-known NLP tasks, it is interesting to note that language-specific tasks have been investigated as well, e.g., the Die/Dat (gendered determiners) disambiguation task in Dutch [10], obtaining impressive improvements over the state of the art [2] (about 23 percentage points of accuracy improvement).


Data Domains

There is a huge variety of domains considered in language-specific BERT models. We need to make a distinction, though, between the data sets used to pretrain the models and the data sets used to evaluate them.

Data used for pretraining mainly comes from three source corpora: (i) Wikipedia, (ii) the OPUS corpora [25], and (iii) OSCAR [19]. Wikipedia currently comprises more than 40 million articles, created and maintained as an open collaboration project in 301 different languages, making it the largest and most popular multilingual online encyclopedia; mBERT, for example, was trained over 100 different language-specific Wikipedia versions. OPUS is a freely available collection of parallel corpora covering over 90 languages; the largest domains covered by OPUS are legislative and administrative texts, translated movie subtitles, and localization data from open-source software projects [25]. OSCAR (Open Super-large Crawled ALMAnaCH coRpus) [19] is a huge multilingual corpus obtained by filtering the Common Crawl corpus, a multilingual collection of documents crawled from the internet.

Several models concatenate multiple sources to have enough data to pretrain BERT; for example, BERTje (Dutch BERT) concatenates news, book data, Wikipedia data, and other text. Languages with more limited availability of data, such as Yorùbá, have led researchers to fine-tune mBERT instead of pretraining from scratch. A notable case is the Italian BERT model AlBERTo [21], which is the only one trained exclusively on social media data (specifically, on 2 million Twitter posts in Italian).

On the other hand, data sets from different domains have been used to evaluate the models; these range from review data for sentiment analysis tasks to transcripts and news for more traditional tasks, such as part of speech tagging. News data are the most common domain, presumably because they are easier to retrieve, and because their more formal register makes them better suited for tasks such as part of speech tagging, dependency parsing, and named entity recognition. Similarly, social media posts from Twitter are mostly used in tasks like sentiment analysis and identification of offensive language.

4 Conclusions

BERT [11] has greatly improved results in many different NLP tasks and has become a mainstay of the community. Following this development, a multilingual BERT and several language-specific versions have been developed and contributed even more to the success of NLP applications.

In this paper, we have analyzed the current state of the art, showing which languages are covered, which tasks are tackled, and which domains are considered in pretrained language-specific BERT models. Moreover, we have underlined the huge variability among models and the difficulty for researchers to find the best model for a specific task, language, and domain. To this end, we have introduced BertLang, a website that allows researchers to search and explore the current state of the art with respect to language-specific BERT models.

In the future, we plan to provide independent verification of the reported results and direct comparisons of language-specific BERT models on specific domains and tasks. We plan to fine-tune the models on the same data, providing comparable performance values across models. We believe these comparisons will be beneficial to the community of both researchers and beginning practitioners in NLP.


  • [1] J. O. Alabi, K. Amponsah-Kaakyire, D. I. Adelani, and C. España-Bonet (2019) Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. arXiv preprint arXiv:1912.02481. Cited by: §1, §3.2.
  • [2] L. Allein, A. Leeuwenberg, and M. Moens (2020) Binary and multitask classification model for Dutch anaphora resolution: Die/Dat prediction. arXiv preprint arXiv:2001.02943. Cited by: §3.2.
  • [3] W. Antoun, F. Baly, and H. Hajj (2020) AraBERT: transformer-based model for Arabic language understanding. arXiv preprint arXiv:2003.00104. Cited by: §1, §3.2.
  • [4] M. Arkhipov, M. Trofimova, Y. Kuratov, and A. Sorokin (2019) Tuning multilingual transformers for language-specific named entity recognition. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 89–93. External Links: Link, Document Cited by: §1, §3.2.
  • [5] D. Benikova, C. Biemann, M. Kisselew, and S. Pado (2014) GermEval 2014 named entity recognition shared task: companion paper. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, pp. 104–112. Cited by: §3.2.
  • [6] C. Bosco, F. Tamburini, A. Bolioli, and A. Mazzei (2016) Overview of the EVALITA 2016 part of speech on Twitter for Italian task. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2016, Vol. 1749, pp. 1–7. Cited by: §3.2.
  • [7] A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 2475–2485. External Links: Link, Document Cited by: §3.2.
  • [8] Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, S. Wang, and G. Hu (2019) Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101. Cited by: §1.
  • [9] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim (2019) BERTje: a Dutch BERT model. arXiv preprint arXiv:1912.09582. Cited by: §1.
  • [10] P. Delobelle, T. Winters, and B. Berendt (2020) RobBERT: a dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. Cited by: §3.2.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.2, §4.
  • [12] T. Kudo and J. Richardson (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, pp. 66–71. External Links: Link, Document Cited by: §1.
  • [13] Y. Kuratov and M. Arkhipov (2019) Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213. Cited by: §1, §3.2.
  • [14] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab (2019) FlauBERT: unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372. Cited by: §1, §3.2.
  • [15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.2.
  • [16] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot (2019) CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894. Cited by: §1, §1, §3.2.
  • [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §1.
  • [18] R. Navigli, D. Jurgens, and D. Vannella (2013) SemEval-2013 task 12: multilingual word sense disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval 2013, pp. 222–231. External Links: Link Cited by: §3.2.
  • [19] P. J. Ortiz Suárez, B. Sagot, and L. Romary (2019) Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In Proceedings of the 7th Workshop on the Challenges in the Management of Large Corpora, CMLC-7, External Links: Link Cited by: §3.2.
  • [20] T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT?. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pp. 4996–5001. External Links: Link, Document Cited by: §1.
  • [21] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, and V. Basile (2019) ALBERTO: italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the 6th Italian Conference on Computational Linguistics, CLiC-it 2019, Cited by: §1, §3.2.
  • [22] A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. arXiv preprint arXiv:2002.12327. Cited by: §1.
  • [23] M. Sanguinetti and C. Bosco (2015) ParTUT: the Turin University Parallel Treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Studies in Computational Intelligence, Vol. 589, pp. 51–69. External Links: Link Cited by: §3.2.
  • [24] E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pp. 3645–3650. External Links: Link, Document Cited by: footnote 2.
  • [25] J. Tiedemann (2012) Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, pp. 2214–2218. External Links: Link Cited by: §3.2.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. External Links: Link Cited by: §1, §2.1.
  • [27] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo (2019) Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076. Cited by: §1.
  • [28] E. B. Völker, M. Wendt, F. Hennig, and A. Köhn (2019) HDT-UD: a very large universal dependencies treebank for German. In Proceedings of the 3rd Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pp. 46–57. Cited by: §3.2.
  • [29] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1, §2.1.
  • [30] Y. Yang, Y. Zhang, C. Tar, and J. Baldridge (2019) PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pp. 3685–3690. External Links: Link, Document Cited by: §3.2.
  • [31] D. Zeman, J. Hajic, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S. Petrov (2018) CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1–21. External Links: Link, Document Cited by: §3.2.