Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

by Shijie Wu, et al.
Johns Hopkins University

Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state-of-the-art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual) as a zero shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best-published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language specific features, and measure factors that influence cross-lingual transfer.


1 Introduction

Pretrained language representations with self-supervised objectives have become standard in a variety of NLP tasks Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2018), including sentence-level classification Wang et al. (2018), sequence tagging (e.g. NER) Sang and Meulder (2003) and SQuAD question answering Rajpurkar et al. (2016). Self-supervised objectives include language modeling, the cloze task Taylor (1953) and next sentence classification. These objectives continue the key ideas behind word embedding objectives like CBOW and skip-gram Mikolov et al. (2013a), but with deep neural context.

At the same time, cross-lingual embedding models have reduced the amount of cross-lingual supervision required to produce reasonable models; Conneau et al. (2017) and Artetxe et al. (2018) use only identical strings between languages as a pseudo bilingual dictionary to learn a mapping between two monolingually trained embeddings. For contextual embedding models, can joint training over multiple languages without any explicit mapping produce an effective cross-lingual representation? Surprisingly, the answer is (a partial) yes. BERT, a recently introduced pretrained model Devlin et al. (2018), now supports a multilingual model (mBERT) pretrained on concatenated Wikipedia data for 104 languages without any cross-lingual alignment or supervision Devlin (2018). The model does surprisingly well compared to cross-lingual word embeddings on zero-shot cross-lingual transfer in XNLI Conneau et al. (2018), a natural language inference dataset. Zero-shot cross-lingual transfer, also known as single source transfer, refers to training and selecting a model in a source language, often a high resource language like English, then transferring directly to a target language.

While this limited experiment is promising, it raises the question: does mBERT learn a cross-lingual space that supports zero-shot transfer? In this work, we evaluate mBERT as a zero-shot cross-lingual transfer model on five different NLP tasks: natural language inference, document classification, named entity recognition, part-of-speech tagging, and dependency parsing. We show that it achieves competitive or even state-of-the-art performance with the recommended scheme of fine-tuning all parameters Devlin et al. (2018). Additionally, we explore different fine-tuning and feature extraction schemes and demonstrate that with parameter freezing, we further outperform the suggested fine-tune-all approach. Furthermore, we explore the extent to which mBERT generalizes away from language-specific features by measuring accuracy on a language ID task using each layer of mBERT. Finally, we measure how subword tokenization influences cross-lingual transfer by measuring subword overlap between languages.

2 Background

(Zero-shot) Cross-lingual Transfer

Cross-lingual transfer learning is a type of transductive transfer learning with different source and target domains (Pan and Yang, 2010). A cross-lingual representation space is assumed to perform the cross-lingual transfer. Prior to the widespread use of cross-lingual word embeddings, task-specific models assumed coarse-grained representations like part-of-speech tags, in support of a delexicalized parser Zeman and Resnik (2008). More recently, cross-lingual word embeddings have been used in conjunction with task-specific neural architectures for tasks like named entity recognition Xie et al. (2018), part-of-speech tagging Kim et al. (2017) and dependency parsing Ahmad et al. (2018).

Cross-lingual Word Embeddings.

The quality of the cross-lingual space is essential for zero-shot cross-lingual transfer. Ruder et al. (2017) survey methods for learning cross-lingual word embeddings by either joint training or post-training mappings of monolingual embeddings. Conneau et al. (2017) and Artetxe et al. (2018) first show that two monolingual embeddings can be aligned by learning an orthogonal mapping with only identical strings as an initial heuristic bilingual dictionary.

Contextual Word Embeddings

ELMo Peters et al. (2018), a deep LSTM Hochreiter and Schmidhuber (1997) pretrained with a language modeling objective, learns contextual word embeddings. This contextualized representation outperforms stand-alone word embeddings, e.g. Word2Vec Mikolov et al. (2013b) and GloVe Pennington et al. (2014), with the same task-specific architecture in various downstream tasks. Instead of taking the representation from a pretrained model, GPT Radford et al. (2018) and Howard and Ruder (2018) also fine-tune all the parameters of the pretrained model for a specific task. In addition, GPT uses a transformer Vaswani et al. (2017) instead of an LSTM and jointly fine-tunes with the language modeling objective. Howard and Ruder (2018) propose another fine-tuning strategy that uses a different learning rate for each layer, with learning rate warmup and gradual unfreezing. Finally, concurrent work by Lample and Conneau (2019) incorporates bitext into BERT by training on pairs of parallel sentences. Schuster et al. (2019) align pretrained ELMo models of different languages by learning an orthogonal mapping and show strong zero-shot and few-shot cross-lingual transfer performance on dependency parsing with 5 Indo-European languages. Similar to multilingual BERT, Mulcaire et al. (2019) train a single ELMo on distantly related language pairs and show mixed results on the benefit of pretraining with distantly related languages.

3 Multilingual BERT


BERT Devlin et al. (2018) is a deep contextual representation based on a series of transformers trained with a self-supervised objective. One of the main differences between BERT and related work like ELMo and GPT is that BERT is trained by the Cloze task on words Taylor (1953), also referred to as masked language modeling, instead of right-to-left or left-to-right language modeling. This allows the model to freely encode information from both directions in each layer. Additionally, BERT optimizes a next sentence classification objective. At training time, 50% of the sentence pairs are consecutive sentences, while the rest are paired randomly. Instead of operating on words, BERT uses a subword vocabulary built with WordPiece Wu et al. (2016), a data-driven approach to breaking up a word into subwords.

Fine-tuning BERT

BERT shows strong performance by fine-tuning the transformer encoder followed by a softmax classification layer on various sentence classification tasks. A sequence of shared softmax classifications produces sequence tagging models for tasks like NER. Fine-tuning usually takes 3 to 4 epochs with a relatively small learning rate, for example, 3e-5.

Multilingual BERT

mBERT Devlin (2018) follows the same model architecture and training procedure as BERT, except with data from Wikipedia in 104 languages. Training makes no use of explicit cross-lingual signal, e.g. pairs of words, sentences or documents linked across languages. In mBERT, the WordPiece modeling strategy allows the model to share embeddings across languages. For example, “DNA” has a similar meaning even in distantly related languages like English and Chinese (“DNA” indeed appears in the vocabulary of mBERT as a stand-alone entry). To account for the varying sizes of Wikipedia training data in different languages, training uses a heuristic to subsample words from languages with a large Wikipedia and oversample words from languages with a small Wikipedia, both when running WordPiece and when sampling a training batch, random words for cloze and random sentences for next sentence classification.


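For completeness, we describe the Transformer used by BERT. Let $x$, $y$ be a sequence of subwords from a sentence pair. Note a special token [CLS] is prepended to $x$ and [SEP] is appended to both $x$ and $y$. The embedding is obtained by

$$\hat{h}^0_i = E(x_i) + E(i) + E(\mathbb{1}_x)$$
$$\hat{h}^0_{j+|x|} = E(y_j) + E(j+|x|) + E(\mathbb{1}_y)$$
$$h^0_\cdot = \text{Dropout}(\text{LN}(\hat{h}^0_\cdot))$$

where $E$ is the embedding function and LN is layer normalization Ba et al. (2016). The embeddings are followed by $M$ transformer blocks. In each transformer block,

$$h^{i+1}_\cdot = \text{Skip}(\text{FFN}, \text{Skip}(\text{MHSA}, h^i_\cdot))$$
$$\text{Skip}(f, h) = \text{LN}(h + \text{Dropout}(f(h)))$$
$$\text{FFN}(h) = \text{GELU}(h W_1^\top + b_1) W_2^\top + b_2$$

where GELU is an element-wise activation function Hendrycks and Gimpel (2016). In practice, $h^i \in \mathbb{R}^{(|x|+|y|) \times d_h}$, $W_1 \in \mathbb{R}^{4d_h \times d_h}$, $b_1 \in \mathbb{R}^{4d_h}$, $W_2 \in \mathbb{R}^{d_h \times 4d_h}$, and $b_2 \in \mathbb{R}^{d_h}$. MHSA is the multi-head self-attention function. We show how one new position $\hat{h}_i$ is computed.

$$[\cdots, \hat{h}_i, \cdots] = \text{MHSA}([h_1, \cdots, h_{|x|+|y|}])$$
$$\hat{h}_i = W_o \, \text{Concat}(h^1_i, \cdots, h^N_i) + b_o$$

In each attention, referred to as an attention head,

$$h^j_i = \sum_{k=1}^{|x|+|y|} \text{Dropout}(\alpha^{(i,j)}_k) \, W^j_V h_k$$
$$\alpha^{(i,j)}_k = \frac{\exp\left( (W^j_Q h_i)^\top W^j_K h_k / \sqrt{d_h/N} \right)}{\sum_{k'=1}^{|x|+|y|} \exp\left( (W^j_Q h_i)^\top W^j_K h_{k'} / \sqrt{d_h/N} \right)}$$

where $N$ is the number of attention heads, $h^j_i \in \mathbb{R}^{d_h/N}$, $W_o \in \mathbb{R}^{d_h \times d_h}$, $b_o \in \mathbb{R}^{d_h}$, and $W^j_Q, W^j_K, W^j_V \in \mathbb{R}^{d_h/N \times d_h}$.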

4 Tasks

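MLDoc: de en es fr it ja ru zh
NLI: ar bg de el en es fr hi ru sw th tr ur vi zh
NER: de en es nl zh
POS: bg da de en es fa hu it nl pl pt ro sk sl sv
Parsing: ar bg ca cs da de en es et fi fr he hi hr id it ja ko la lv nl no pl pt ro ru sk sl sv uk zh

Table 1: The 39 languages used in the 5 tasks.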

Does mBERT learn a cross-lingual representation, or does it produce a representation for each language in its own embedding space? We answer this question by considering five tasks in the zero-shot transfer setting. We assume labeled training data for each task in English, and transfer the trained model to a target language. We select a range of different types of tasks: document classification, natural language inference, named entity recognition, part-of-speech tagging, and dependency parsing. We cover zero-shot transfer from English to 38 languages in the 5 different tasks as shown in Tab. 1. In this section, we describe the tasks as well as task-specific layers.

4.1 Document Classification

We use MLDoc Schwenk and Li (2018), a balanced subset of the Reuters corpus covering 8 languages for document classification. The 4-way topic classification task decides between CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). We only use the first two sentences of a document for classification (or the first sentence alone if the document contains only one sentence); documents are segmented into sentences with NLTK Perkins (2014). The sentence pairs are provided to the mBERT encoder. The task-specific classification layer is a linear function mapping $h^{12}_0 \in \mathbb{R}^{d_h}$ into $\mathbb{R}^K$, and a softmax is used to get the class distribution. Document classification is evaluated by classification accuracy (ACC).

4.2 Natural Language Inference

We use XNLI Conneau et al. (2018), which covers 15 languages, for natural language inference. The 3-way classification includes entailment, neutral, and contradiction given a pair of sentences. We feed a pair of sentences directly into mBERT and the task-specific classification layer is the same as § 4.1. Natural language inference is evaluated by classification accuracy (ACC).

4.3 Named Entity Recognition

We use the CoNLL 2002 and 2003 NER shared tasks Sang (2002); Sang and Meulder (2003) which total 4 languages. In addition, we also consider a Chinese NER dataset Levow (2006). The labeling scheme is BIO with 4 types of named entities. We add a linear classification layer with softmax to obtain word level predictions. Since mBERT operates at the subword-level while the labeling is word-level, we only consider the first subword label if a word is broken into multiple subwords with masking. NER is evaluated by F1 of predicted entity (F1).
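The first-subword convention described above can be illustrated with a minimal sketch. The greedy `toy_wordpiece` splitter and the tiny vocabulary are hypothetical stand-ins for the real WordPiece tokenizer, and the function names are ours, not the paper's; the point is only how word-level labels are placed on first subwords while continuation subwords are masked out.

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match split of one word (hypothetical stand-in for WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def align_labels(words, labels, vocab):
    """Attach each word-level label to the word's first subword; mask the rest."""
    subwords, sub_labels, first_mask = [], [], []
    for word, label in zip(words, labels):
        for k, piece in enumerate(toy_wordpiece(word, vocab)):
            subwords.append(piece)
            sub_labels.append(label if k == 0 else None)  # None = ignored position
            first_mask.append(k == 0)
    return subwords, sub_labels, first_mask


# Illustrative vocabulary and sentence (assumptions, not taken from the datasets).
vocab = {"Johns", "Hop", "##kins", "is", "in", "Balt", "##imore"}
words = ["Johns", "Hopkins", "is", "in", "Baltimore"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
subwords, sub_labels, first_mask = align_labels(words, labels, vocab)
```

Only positions where `first_mask` is true would contribute to the loss and to the word-level prediction in this sketch, so the number of scored positions equals the number of words.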

4.4 Part-of-Speech Tagging

We use a subset of Universal Dependencies (UD) Treebanks (v1.4) Nivre et al. (2016), which covers 15 languages, following the setup of Kim et al. (2017). The task-specific labeling layer is the same as § 4.3. POS tagging is evaluated by the accuracy of predicted POS tags (ACC).

4.5 Dependency parsing

Following the setup of Ahmad et al. (2018), we use a subset of Universal Dependencies (UD) Treebanks (v2.2) Nivre et al. (2018), which includes 31 languages. Dependency parsing is evaluated by unlabeled attachment score (UAS) and labeled attachment score (LAS); punctuation (PUNCT) and symbols (SYM) are excluded, and during evaluation we threshold the sentence length to 140. We only predict the coarse-grained dependency label, following Ahmad et al. (2018). We use the model of Dozat and Manning (2016), a graph-based parser, as a task-specific layer. The LSTM encoder in this model is replaced by mBERT. Similar to § 4.3, we only take the representation of the first subword of each word. We use masking to prevent the parser from operating on non-first subwords.
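As a rough illustration of the evaluation described above, the following sketch computes UAS and LAS while skipping tokens tagged PUNCT or SYM. The tuple layout and the function name `attachment_scores` are assumptions made for this example; it is not the evaluation script used in the experiments.

```python
def attachment_scores(gold, pred, excluded_upos=("PUNCT", "SYM")):
    """Return (UAS, LAS) in percent.

    gold: list of (upos, head, deprel) per token.
    pred: list of (head, deprel) per token, aligned with gold.
    Tokens whose gold UPOS is in excluded_upos are not scored.
    """
    scored = correct_head = correct_both = 0
    for (upos, gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, pred):
        if upos in excluded_upos:
            continue
        scored += 1
        if pred_head == gold_head:
            correct_head += 1          # counts toward UAS
            if pred_rel == gold_rel:
                correct_both += 1      # counts toward LAS as well
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * correct_head / scored, 100.0 * correct_both / scored


# Toy sentence: the final punctuation token is excluded from scoring.
gold = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj"), ("PUNCT", 2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
```

In the toy sentence all three scored tokens have the correct head, but one has the wrong label, so LAS is lower than UAS; the mis-attached punctuation token has no effect on either score.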

5 Experiments

For each task, no preprocessing is performed except tokenization of words into subwords with WordPiece. At training time, the sequence has a maximum length of 128. We use the base cased multilingual BERT, which has $N = 12$ attention heads and $M = 12$ transformer blocks. The dropout probability is 0.1 and $d_h$ is 768. The model has 110M parameters. In fine-tuning, we select the best hyperparameters by searching a combination of batch size, learning rate and the number of fine-tuning epochs with the following range: learning rate $\{2 \times 10^{-5}, 3 \times 10^{-5}, 5 \times 10^{-5}\}$; batch size $\{16, 32\}$; number of epochs: $\{3, 4\}$. Note the best hyperparameters are selected by development performance in English. We use Adam for fine-tuning with $\beta_1$ of 0.9, $\beta_2$ of 0.999 and L2 weight decay of 0.01. We warm up the learning rate over the first 10% of batches and linearly decay the learning rate. See § A.1 for details on evaluation.
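The warmup-then-decay schedule mentioned above can be sketched as a plain function of the step index. This is a minimal sketch under two assumptions not stated in the text: warmup rises linearly from zero, and the decay reaches zero at the final step. The function name is ours.

```python
def warmup_linear_decay(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Learning rate at a given step: linear warmup, then linear decay.

    Assumes warmup starts at 0 and decay ends at 0 (both are assumptions).
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Example with 1000 batches and a peak learning rate of 3e-5.
schedule = [warmup_linear_decay(s, 1000, 3e-5) for s in range(1001)]
```

With 1000 batches, the rate climbs over the first 100 steps, peaks at step 100, and then falls linearly for the remaining 900 steps.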

5.1 Question #1: Is mBERT Multilingual?

en de zh es fr it ja ru Average

In language supervised learning

Schwenk and Li (2018) 92.2 93.7 87.3 94.5 92.1 85.6 85.4 85.7 89.5
mBERT 94.2 93.3 89.3 95.7 93.4 88.0 88.4 87.5 91.2
Zero-shot cross-lingual transfer
Schwenk and Li (2018) 92.2 81.2 74.7 72.5 72.4 69.4 67.6 60.8 73.9
Artetxe and Schwenk (2018) *† 89.9 84.8 71.9 77.3 78.0 69.4 60.3 67.8 74.9
mBERT 94.2 80.2 76.9 72.6 72.6 68.9 56.5 73.7 74.5
Table 2: MLDoc experiments. * denotes the model is pretrained with bitext, and † denotes concurrent work. Bold and underline denote best and second best.


We include two recent strong baselines. Schwenk and Li (2018) use MultiCCA, multilingual word embeddings trained with a bilingual dictionary Ammar et al. (2016), and convolutional neural networks. Concurrent to our work, Artetxe and Schwenk (2018) pretrain a multilingual sentence representation with a sequence-to-sequence model where the decoder only has access to a max-pooling of the encoder hidden states. The model requires bitext for training.

mBERT outperforms (Tab. 2) multilingual word embeddings and performs comparably with a multilingual sentence representation, even though mBERT does not have access to bitext. Interestingly, mBERT outperforms Artetxe and Schwenk (2018) in distantly related languages like Chinese and Russian and under-performs in closely related Indo-European languages.

en fr es de el bg ru tr ar vi th zh hi sw ur Average
Pseudo supervision with machine translated training data from English to target language
Lample and Conneau (2019) (MLM+TLM) *† 85.0 80.2 80.8 80.3 78.1 79.3 78.1 74.7 76.5 76.6 75.5 78.6 72.3 70.9 63.2 76.7
mBERT 82.1 76.9 78.5 74.8 72.1 75.4 74.3 70.6 70.8 67.8 63.2 76.2 65.3 65.3 60.6 71.6
Zero-shot cross-lingual transfer
Conneau et al. (2018) (X-LSTM) * 73.7 67.7 68.7 67.7 68.9 67.9 65.4 64.2 64.8 66.4 64.1 65.8 64.1 55.7 58.4 65.6
Artetxe and Schwenk (2018) *† 73.9 71.9 72.9 72.6 73.1 74.2 71.5 69.7 71.4 72.0 69.2 71.4 65.5 62.2 61.0 70.2
Lample and Conneau (2019) (MLM+TLM) *† 85.0 78.7 78.9 77.8 76.6 77.4 75.3 72.5 73.1 76.1 73.2 76.5 69.6 68.4 67.3 75.1
Lample and Conneau (2019) (MLM) † 83.2 76.5 76.3 74.2 73.1 74.0 73.1 67.8 68.5 71.2 69.2 71.9 65.7 64.6 63.4 71.5
mBERT 82.1 73.8 74.3 71.1 66.4 68.9 69 61.6 64.9 69.5 55.8 69.3 60.0 50.4 58.0 66.3
Table 3: XNLI experiments. * denotes the model is pretrained with cross-lingual signal including bitext or bilingual dictionary, and † denotes concurrent work.


We include three strong baselines; Artetxe and Schwenk (2018) and Lample and Conneau (2019) are concurrent to our work. Lample and Conneau (2019) with MLM is similar to mBERT; the main difference is that it only trains with the 15 languages of XNLI, and MLM+TLM also uses bitext as training data. Conneau et al. (2018) use supervised multilingual word embeddings with an LSTM encoder and max-pooling. After an English encoder and classifier are trained, the target encoder is trained to mimic the English encoder with ranking loss and bitext.

In Tab. 3, mBERT outperforms one model with bitext training but (as expected) falls short of models with more cross-lingual training information. Interestingly, mBERT and MLM are mostly the same except for the training languages, yet we observe that mBERT under-performs MLM by a large margin. We hypothesize that limiting pretraining to only those languages needed for the downstream task is beneficial. The gap between Artetxe and Schwenk (2018) and mBERT is larger in XNLI than in MLDoc, likely because XNLI is harder.

en nl es de zh Average (-en,-zh)
In language supervised learning
Xie et al. (2018) - 86.40 86.26 78.16 - 83.61
mBERT 91.97 90.94 87.38 82.82 93.17 87.05
Zero-shot cross-lingual transfer
Xie et al. (2018) - 71.25 72.37 57.76 - 67.13
mBERT 91.97 77.57 74.96 69.56 51.90 74.03
Table 4: NER tagging experiments.


We use Xie et al. (2018) as a zero-shot cross-lingual transfer baseline, which is state-of-the-art on CoNLL 2002 and 2003. It uses unsupervised bilingual word embeddings Conneau et al. (2017) with a hybrid of a character-level/word-level LSTM, self-attention, and a CRF. Pseudo training data is built by word-to-word translation with an induced dictionary from bilingual word embeddings.

mBERT outperforms a strong baseline by an average 6.9 point absolute F1 improvement and an 11.8 point absolute improvement in German with a simple one-layer 0-order CRF as a prediction function (Tab. 4). However, there is a large performance gap when transferring to distantly related languages like Chinese compared to a supervised learning baseline. This suggests further effort should focus on transferring between distantly related languages. We hypothesize that by sharing subwords across three closely related languages, mBERT more effectively transfers to the target language, especially for compound words in German. We provide further analysis in § 5.4.

lang bg da de en es fa hu it nl pl pt ro sk sl sv Average (-en)
In language supervised learning
mBERT 99.0 97.9 95.2 97.1 97.1 97.8 96.9 98.7 92.1 98.5 98.3 97.8 97.0 98.9 98.4 97.4
Low resource cross-lingual transfer
Kim et al. (2017) (1280) 95.7 94.3 90.7 - 93.4 94.8 94.5 95.9 85.8 92.1 95.5 94.2 90.0 94.1 94.6 93.3
Kim et al. (2017) (320) 92.4 90.8 89.7 - 90.9 91.8 90.7 94.0 82.2 85.5 94.2 91.4 83.2 90.6 90.7 89.9
Zero-shot cross-lingual transfer
mBERT 87.4 88.3 89.8 97.1 85.2 72.8 83.2 84.7 75.9 86.9 82.1 84.7 83.6 84.2 91.3 84.3

Table 5: POS tagging. Kim et al. (2017) use small amounts of training data in the target language.


We use Kim et al. (2017) as a reference. Note that they utilized a small amount of supervision in the target language as well as English supervision, so the results are not directly comparable. Tab. 5 shows a large (average) gap between mBERT and Kim et al. (2017). Interestingly, mBERT still outperforms Kim et al. (2017) with 320 sentences in German (de), Polish (pl), Slovak (sk) and Swedish (sv).

Dist mBERT(S) Baseline(Z) mBERT(Z) mBERT(Z+POS)
en 0 91.5/81.3 90.4/88.4 91.5/81.3 91.8/82.2
no 0.06 93.6/85.9 80.8/72.8 80.6/68.9 82.7/72.1
sv 0.07 91.2/83.1 81/73.2 82.5/71.2 84.3/73.7
fr 0.09 91.7/85.4 77.9/72.8 82.7/72.7 83.8/76.2
pt 0.09 93.2/87.2 76.6/67.8 77.1/64 78.3/66.9
da 0.1 89.5/81.9 76.6/67.9 77.4/64.7 79.3/68.1
es 0.12 92.3/86.5 74.5/66.4 78.1/64.9 79/68.9
it 0.12 94.8/88.7 80.8/75.8 84.6/74.4 86/77.8
ca 0.13 94.3/89.5 73.8/65.1 78.1/64.6 79/67.9
hr 0.13 92.4/83.8 61.9/52.9 80.7/65.8 80.4/68.2
pl 0.13 94.7/79.9 74.6/62.2 82.8/59.4 85.7/65.4
sl 0.13 88/77.8 68.2/56.5 72.6/51.4 75.9/59.2
uk 0.13 90.6/83.4 60.1/52.3 76.7/60 76.5/65.5
bg 0.14 95.2/85.5 79.4/68.2 83.3/62.3 84.4/68.1
cs 0.14 94.2/86.6 63.1/53.8 76.6/58.7 77.4/63.6
de 0.14 86.1/76.5 71.3/61.6 80.4/66.3 83.5/71.2
he 0.14 91.9/83.6 55.3/48 67.5/48.4 67/54.3
nl 0.14 94/85 68.6/60.3 78/64.8 79.9/67.1
ru 0.14 94.7/88 60.6/51.6 73.6/58.5 73.2/61.5
ro 0.15 92.2/83.2 65.1/54.1 77/58.5 76.9/62.6
id 0.17 86.3/75.4 49.2/43.5 62.6/45.6 59.8/48.6
sk 0.17 93.8/83.3 66.7/58.2 82.7/63.9 82.9/67.8
lv 0.18 87.3/75.3 70.8/49.3 66/41.4 70.4/48.5
et 0.2 88.8/79.7 65.7/44.9 66.9/44.3 70.8/50.7
fi 0.2 91.3/81.8 66.3/48.7 68.4/47.5 71.4/52.5
zh* 0.23 88.3/81.2 42.5/25.1 53.8/26.8 53.4/29
ar 0.26 87.6/80.6 38.1/28 43.9/28.3 44.7/32.9
la 0.28 85.2/73.1 48/35.2 47.9/26.1 50.9/32.2
ko 0.33 86/74.8 34.5/16.4 52.7/27.5 52.3/29.4
hi 0.4 94.8/86.7 35.5/26.5 49.8/33.2 58.9/44
ja* 0.49 94.2/87.4 28.2/20.9 36.6/15.7 41.3/30.9
AVER 0.17 91.3/82.6 64.1/53.8 71.4/54.2 73/58.9
Table 6: Dependency parsing results by language (UAS/LAS). * denotes delexicalized parsing in the baseline. S and Z denote supervised learning and zero-shot transfer. Bold and underline denote best and second best. We order the languages by word order distance to English.

Dependency Parsing

We use the best performing model on average in Ahmad et al. (2018) as a zero-shot transfer baseline, i.e. a transformer encoder with a graph-based parser Dozat and Manning (2016) and dictionary-supervised cross-lingual embeddings Smith et al. (2017). Dependency parsers, including Ahmad et al., assume access to POS tags: a cross-lingual representation. We consider two versions of mBERT: one with gold POS tags and one without. When POS tags are available, a POS tag embedding is concatenated with the final output of mBERT.

Tab. 6 shows that mBERT outperforms the baseline on average by 7.3 points UAS and 0.4 points LAS absolute improvement. Note that we highlight both the best and the second best result. Interestingly, the LAS is weaker than the baseline in many languages. With the help of POS tags, we further observe 1.6 points UAS and 4.7 points LAS absolute improvement on average. It appears that adding POS tags, which provide clearer cross-lingual representations, benefits mBERT.


Across all five tasks, mBERT demonstrates strong and sometimes state-of-the-art zero-shot cross-lingual performance without any cross-lingual signal. It outperforms cross-lingual embeddings in four tasks. With a small amount of target language supervision and cross-lingual signal, mBERT may improve further, and we leave this as future work. In short, mBERT is a surprisingly effective cross-lingual model for a wide range of NLP tasks.

5.2 Question #2: Does mBERT vary layer-wise?

Figure 6: Performance of different fine-tuning approaches compared with fine-tuning all mBERT parameters. Color denotes the absolute difference, and the number in each entry is the evaluation in the corresponding setting. Languages are sorted by mBERT zero-shot transfer performance. Three downward triangles indicate a performance drop larger than the legend's lower limit.

The goal of a deep neural network is to abstract to higher order representations as you progress up the hierarchy Yosinski et al. (2014). Peters et al. (2018) empirically show that for ELMo the lower layer is better at syntax while the upper layer is better at semantics. For mBERT, we would expect a similar generalization across the 12 layers, as well as an abstraction away from a specific language with higher layers. Does the zero-shot transfer performance vary with different layers?

We consider two schemes. First, we follow the feature-based transfer approach of ELMo by taking a learned weighted combination of all 12 layers of mBERT with a two-layer bidirectional LSTM with $d_h$ hidden size (Feat). Note the LSTM is trained from scratch and mBERT is fixed. For sentence and document classification, an additional max-pooling is used to extract a fixed-dimension vector. Second, when fine-tuning mBERT, we fix the bottom $n$ layers ($n$ included) of mBERT, where layer 0 is the input embedding. We consider $n \in \{0, 3, 6, 9\}$. See § A.2 for experiment details.

Freezing the bottom layers of mBERT, in general, improves the performance of mBERT in all five tasks (Fig. 6). For sentence-level tasks like document classification and natural language inference, we observe the largest improvement with $n = 6$. For word-level tasks like NER, POS tagging, and parsing, we observe the largest improvement with $n = 3$. More improvement in under-performing languages is observed.
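To make the layer-freezing scheme concrete, the sketch below decides which parameters stay fixed when the bottom layers are frozen, with layer 0 being the input embedding and the cutoff layer included. The parameter naming pattern (`embeddings.…`, `encoder.layer.<k>.…`, `classifier.…`) is an assumption for illustration, as is the cutoff value of 3 used in the example; in a real framework one would then disable gradients for the selected parameters.

```python
import re


def is_frozen(param_name, n):
    """True if the parameter belongs to layers 0..n (inclusive).

    Layer 0 is the input embedding; transformer block k (1-indexed) is layer k.
    The name patterns below are hypothetical and 0-index the blocks.
    """
    if param_name.startswith("embeddings."):
        return n >= 0
    match = re.match(r"encoder\.layer\.(\d+)\.", param_name)
    if match:
        block_layer = int(match.group(1)) + 1  # convert 0-indexed name to layer number
        return block_layer <= n
    return False  # task-specific layers always remain trainable


# Illustrative parameter names for a 12-block encoder plus a classifier head.
names = ["embeddings.word_embeddings.weight"]
names += [f"encoder.layer.{k}.attention.self.query.weight" for k in range(12)]
names += ["classifier.weight"]

frozen = [name for name in names if is_frozen(name, n=3)]
trainable = [name for name in names if not is_frozen(name, n=3)]
```

With a cutoff of 3, the embedding and the first three transformer blocks are fixed, while the remaining nine blocks and the task-specific layer are updated.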

In each task, the feature-based approach under-performs mBERT. We hypothesize that initialization from pretraining with lots of languages provides a very good starting point that is hard to beat. Additionally, the LSTM could also be part of the problem. In Ahmad et al. (2018) for dependency parsing, an LSTM encoder was worse than a transformer when transferring to languages with high word ordering distance to English.

5.3 Question #3: Does mBERT retain language specific information?

Figure 7: Language identification accuracy for different layers of mBERT. Layer 0 is the embedding layer and layer $i > 0$ is the output of the $i$-th transformer block.

Since mBERT does so well at learning a cross-lingual representation, it may do so by abstracting away from language-specific information, thus losing the ability to distinguish between languages. To test this theory we consider the task of language identification: does mBERT retain language-specific information? We use WiLI-2018 Thoma (2018), a dataset that includes over 200 languages from Wikipedia. We keep only those languages included in mBERT, leaving 99 languages (Hungarian, Western-Punjabi, Norwegian-Bokmal, and Piedmontese are not covered by WiLI). We take various layers of bag-of-words mBERT representation of the first two sentences of the test paragraph and add a linear classifier with softmax. We trained only the classifier while keeping mBERT fixed. See § A.3 for further details.

We found across all tested layers around 96% accuracy (Fig. 7). We see no clear difference between layers, suggesting each layer contains language-specific information; surprising given the zero-shot cross-lingual abilities. As mBERT generalizes its representations and creates cross-lingual representations, it maintains language specific details. This may be encouraged during pretraining since mBERT needs to retain enough language-specific information to perform the cloze task and select language-related subwords.

Figure 8: Relation between cross-lingual zero-shot transfer performance with mBERT and percentage of observed subwords at both type-level and token-level. Pearson correlation coefficient and $p$-value are shown in red.

5.4 Question #4: Does mBERT benefit by sharing subwords across languages?

As discussed in § 3, mBERT shares subwords in closely related languages or perhaps in distantly related languages. At training time, the representation of a shared subword is explicitly trained to contain enough information for the cloze task in all languages in which it appears. During fine-tuning for zero-shot cross-lingual transfer, if a subword in the target language test set also appears in the source language training data, the supervision could be leaked to the target language explicitly. However, all subwords interact in a non-interpretable way inside a deep network, and subword representations could overfit to the source language and potentially hurt transfer performance. In these experiments, we investigate how sharing subwords across languages affects the zero-shot cross-lingual transfer performance.

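To quantify how many subwords are shared across languages in any task, we calculate the percentage of observed subwords at type-level $p_{\text{type}}$ and token-level $p_{\text{token}}$ for each target language $\tau$.

$$p_{\text{type}} = \frac{|S^{\tau}_{\text{obs}}|}{|S^{\tau}_{\text{test}}|} \times 100$$
$$p_{\text{token}} = \frac{\sum_{w \in S^{\tau}_{\text{obs}}} c(w, S^{\tau}_{\text{test}})}{\sum_{w \in S^{\tau}_{\text{test}}} c(w, S^{\tau}_{\text{test}})} \times 100$$

where $S^{\tau}_{\text{obs}} = S^{\text{en}}_{\text{train}} \cap S^{\tau}_{\text{test}}$, $S^{\text{en}}_{\text{train}}$ is the set of all subwords in the English training set, $S^{\tau}_{\text{test}}$ is the set of all subwords in the test set of language $\tau$, and $c(w, S^{\tau}_{\text{test}})$ is the count of subword $w$ in the test set of language $\tau$.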
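The type-level and token-level percentages of observed subwords defined above can be computed with a few lines of code. This is a minimal sketch: the function name and the toy subword lists are illustrative assumptions, and real inputs would be the WordPiece-tokenized English training set and target-language test set.

```python
from collections import Counter


def subword_overlap(english_train_subwords, target_test_subwords):
    """Return (type-level %, token-level %) of test subwords observed in English training data."""
    train_types = set(english_train_subwords)
    test_counts = Counter(target_test_subwords)   # subword -> count in the test set
    if not test_counts:
        return 0.0, 0.0
    observed = train_types & set(test_counts)     # subword types seen in both sets
    p_type = 100.0 * len(observed) / len(test_counts)
    p_token = 100.0 * sum(test_counts[w] for w in observed) / sum(test_counts.values())
    return p_type, p_token


# Toy data: "in" is frequent in the target test set and also seen in English training.
english_train = ["the", "##s", "DNA", "in"]
target_test = ["DNA", "in", "in", "der", "##ung", "in"]
p_type, p_token = subword_overlap(english_train, target_test)
```

In the toy data, half of the test subword types are observed, but they account for four of the six test tokens, so the token-level percentage is higher than the type-level one. This is the pattern one would expect when shared subwords tend to be frequent.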

In Fig. 8, we show the relation between cross-lingual zero-shot transfer performance of mBERT and the token-level or type-level percentage of observed subwords ($p_{\text{token}}$ or $p_{\text{type}}$) for all five tasks with Pearson correlation. In four out of five tasks (not XNLI) we observed a strong positive correlation ($p < 0.05$) with a correlation coefficient larger than 0.5. In Indo-European languages, we observed that $p_{\text{token}}$ is usually around 50% to 75% while $p_{\text{type}}$ is usually less than 50%. This indicates that subwords shared across languages are usually high frequency (with the data-dependent WordPiece algorithm, subwords that appear in multiple languages with high frequency are more likely to be selected). We hypothesize that this could be used as a simple indicator for selecting the source language for cross-lingual transfer with a shared subword vocabulary. We leave this for future work.

6 Discussion

We show mBERT does well in a cross-lingual zero-shot transfer setting on five different tasks covering a large number of languages from different language families. Notably, it outperforms cross-lingual embeddings, which typically have more cross-lingual supervision. By simply fixing the bottom layers of mBERT during fine-tuning, we observe further performance gains. Language-specific information is also preserved in all layers of mBERT. Sharing subwords across languages helps the cross-lingual transfer of mBERT as a strong correlation is observed between the percentage of overlapping subwords and zero-shot transfer performance.

mBERT effectively learns a good multilingual representation with strong cross-lingual zero-shot transfer performance in various tasks. We recommend building future multilingual NLP models on top of mBERT or other models pretrained in a similar fashion. Even without explicit cross-lingual supervision, these models do very well. As we show with XNLI in § 5.1, while bitext is hard to obtain in low resource settings, a variant of mBERT pretrained with bitext Lample and Conneau (2019) shows even stronger performance. Future work could investigate how to use weak supervision to produce a better cross-lingual mBERT, or adapt an already trained model for cross-lingual use. With POS tagging in § 5.1, we show mBERT, in general, under-performs models with a small amount of supervision, while Devlin et al. (2018) show that in English NLP tasks, fine-tuning BERT only needs a small amount of data. Future work could investigate when cross-lingual transfer is helpful in NLP tasks of low resource languages.


Appendix A Additional Experiment Detail

A.1 Sequence length at evaluation

For MLDoc and XNLI, we have a maximum sequence length of 128. For NER and POS tagging, we use a sliding window approach with a maximum length of 128. After the first window, we keep the last 64 subwords from the previous window as context. In other words, for a non-first window, only (up to) 64 new subwords are added for prediction. For parsing, we limit the sequence length to the minimum of 512 and the total number of subwords in the first 140 tokens of the sentence, including words and punctuation.
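The sliding-window scheme for NER and POS tagging can be sketched as follows. This is a simplified illustration that ignores special tokens such as [CLS] and [SEP], which would take up part of the 128 positions in practice; the function name and the tuple layout are our own.

```python
def sliding_windows(num_subwords, max_len=128, context=64):
    """Split a long subword sequence into overlapping windows.

    Returns a list of (start, end, predict_from) tuples: the window covers
    subwords [start, end), and predictions are kept only for [predict_from, end).
    Each non-first window reuses the last `context` subwords of the previous one.
    """
    windows = []
    start = 0
    predicted_upto = 0
    while predicted_upto < num_subwords:
        end = min(start + max_len, num_subwords)
        windows.append((start, end, predicted_upto))
        predicted_upto = end
        start = end - context  # carry over the previous window's tail as context
    return windows


# A 300-subword sequence needs four windows.
windows = sliding_windows(300)
```

For a 300-subword input, the first window predicts 128 positions and each later window contributes at most 64 new positions, with the earlier 64 serving only as context.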

A.2 Question #2

The input is the same in both schemes. In the feature-based scheme, the LSTM and the weights that combine mBERT layers are trained by Adam with learning rate 1e-3. The batch size is 32. The learning rate is halved whenever the development evaluation does not improve. Training is stopped early when the learning rate drops below 1e-5. The fine-tuning approach is the same as fine-tuning all parameters of mBERT.
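The halving rule with early stopping can be sketched as a small simulation over a sequence of development scores. Two details are assumptions here: "does not improve" is read as "does not exceed the best score so far", and evaluation happens once per epoch. The function name is illustrative.

```python
def halve_on_plateau(dev_scores, initial_lr=1e-3, min_lr=1e-5):
    """Return the learning rate used in each epoch that actually runs.

    The rate is halved after any epoch whose dev score does not beat the best
    so far (an assumed reading), and training stops once it falls below min_lr.
    """
    lr = initial_lr
    best = float("-inf")
    history = []
    for score in dev_scores:
        history.append(lr)
        if score > best:
            best = score
        else:
            lr /= 2
            if lr < min_lr:
                break  # early stopping
    return history


# One good epoch followed by a long plateau.
history = halve_on_plateau([0.5] + [0.4] * 10)
```

Starting from 1e-3, seven consecutive halvings are needed before the rate falls below 1e-5, so in this example training stops after eight epochs even though more development scores are available.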

A.3 Question #3

The classifier is trained the same as the feature-based scheme in § A.2.