Emerging Cross-lingual Structure in Pretrained Language Models

11/04/2019 ∙ by Shijie Wu, et al.

We study the problem of multilingual masked language modeling, i.e. the training of a single model on concatenated text from multiple languages, and present a detailed study of several factors that influence why these models are so effective for cross-lingual transfer. We show, contrary to what was previously hypothesized, that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains. The only requirement is that there are some shared parameters in the top layers of the multi-lingual encoder. To better understand this result, we also show that representations from independently trained models in different languages can be aligned post-hoc quite effectively, strongly suggesting that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. For multilingual masked language modeling, these symmetries seem to be automatically discovered and aligned during the joint training process.




1 Introduction

Multilingual language models such as mBERT Devlin et al. (2019) and XLM Lample and Conneau (2019) enable effective cross-lingual transfer: it is possible to learn a model from supervised data in one language and apply it to another with no additional training. Recent work has shown that transfer is effective for a wide range of tasks Wu and Dredze (2019); Pires et al. (2019). These works speculate about why multilingual pretraining works (e.g. shared vocabulary), but they only experiment with a single reference mBERT and are unable to systematically measure these effects.

In this paper, we present the first detailed empirical study of the effects of different masked language modeling (MLM) pretraining regimes on cross-lingual transfer. Our first set of experiments is a detailed ablation study on a range of zero-shot cross-lingual transfer tasks. Much to our surprise, we discover that language universal representations emerge in pretrained models without the requirement of any shared vocabulary or domain similarity, and even when only a small subset of the parameters in the joint encoder are shared. In particular, by systematically varying the amount of shared vocabulary between two languages during pretraining, we show that the amount of overlap only accounts for a few points of performance in transfer tasks, much less than might be expected. By sharing parameters alone, pretraining learns to map similar words and sentences to similar hidden representations.

To better understand these effects, we also analyze multiple monolingual BERT models trained completely independently, both within and across languages. We find that monolingual models trained in different languages learn representations that align with each other surprisingly well, as compared to the same-language upper bound, even though they have no shared parameters. This result closely mirrors the widely observed fact that word embeddings can be effectively aligned across languages Mikolov et al. (2013). Similar dynamics are at play in MLM pretraining, and at least in part explain why the representations aligned so well with relatively little parameter tying in our earlier experiments.

This type of emergent language universality has interesting theoretical and practical implications. We gain insight into why the models transfer so well and open up new lines of inquiry into what properties emerge in common in these representations. The results also suggest that it should be possible to adapt pretrained models to new languages with little additional training, and that it may be possible to better align independently trained representations without having to jointly train on all of the (very large) unlabeled data that could be gathered. For example, concurrent work has shown that a pretrained MLM model can be rapidly fine-tuned to another language Artetxe et al. (2019).

2 Background

Language Model Pretraining

Our work follows the recent line of language model pretraining. ELMo Peters et al. (2018) first popularized learning representations from a language model. The representations are used in a transfer learning setup to improve performance on a variety of downstream NLP tasks. Follow-up work by Howard and Ruder (2018) and Radford et al. (2018) further improves on this idea by introducing end-task fine-tuning of the entire model and a transformer architecture Vaswani et al. (2017). BERT Devlin et al. (2019) represents another significant improvement by introducing masked language model and next-sentence prediction training objectives combined with a bi-directional transformer model.

BERT offers a multilingual version (dubbed mBERT), which is trained on Wikipedia data of over 100 languages. mBERT obtains strong performance on zero-shot cross-lingual transfer without using any parallel data during training Pires et al. (2019); Wu and Dredze (2019). This shows that multilingual representations can emerge from a shared Transformer with a shared subword vocabulary. XLM Lample and Conneau (2019) was introduced concurrently to mBERT. It is trained without the next-sentence prediction objective and introduces training based on parallel sentences as an explicit cross-lingual signal. XLM shows that cross-lingual language model pretraining leads to a new state of the art in cross-lingual transfer on XNLI Conneau et al. (2018), a natural language inference dataset, even when parallel data is not used. When pretrained on parallel data, the model does even better. The authors also use XLM models to pretrain supervised and unsupervised machine translation systems Lample et al. (2018). Other work has shown that mBERT outperforms the previous state of the art in cross-lingual transfer based on word embeddings on token-level NLP tasks Wu and Dredze (2019), and that adding character-level information Mulcaire et al. (2019) and using multi-task learning Huang et al. (2019) can improve cross-lingual performance.

Alignment of Word Embeddings

Researchers working on word embeddings noticed early that embedding spaces tend to be shaped similarly across different languages Mikolov et al. (2013). This inspired work in aligning monolingual embeddings. The alignment was done by using a bilingual dictionary to project words that have the same meaning close to each other Mikolov et al. (2013). This projection aligns the words outside of the dictionary as well, due to the similar shapes of the word embedding spaces. Follow-up efforts only required a very small seed dictionary (e.g., only numbers Artetxe et al. (2017)) or even no dictionary at all Conneau et al. (2017); Zhang et al. (2017). Other work has pointed out that word embeddings may not be as isomorphic as thought Søgaard et al. (2018), especially for distantly related language pairs Patra et al. (2019). Ormazabal et al. (2019) show that joint training can lead to more isomorphic word embedding spaces.

Schuster et al. (2019) showed that ELMo embeddings can be aligned by a linear projection as well. They demonstrate strong zero-shot cross-lingual transfer performance on dependency parsing. Wang et al. (2019) align mBERT representations and evaluate on dependency parsing as well. Artetxe et al. (2019) showed the possibility of rapidly fine-tuning a pretrained BERT model on another language.

Neural Network Activation Similarity

We hypothesize that, similar to word embedding spaces, language-universal structures emerge in pretrained language models. While computing word embedding similarity is relatively straightforward, the same cannot be said for the deep contextualized BERT models that we study. Recent work introduces ways to measure the similarity of neural network activations between different layers and different models Laakso and Cottrell (2000); Li et al. (2016); Raghu et al. (2017); Morcos et al. (2018); Wang et al. (2018). For example, Raghu et al. (2017) use canonical correlation analysis (CCA) and a new method, singular vector canonical correlation analysis (SVCCA), to show that early layers converge faster than upper layers in convolutional neural networks. Kudugunta et al. (2019) use SVCCA to investigate the multilingual representations obtained by the encoder of a massively multilingual neural machine translation system Aharoni et al. (2019). Kornblith et al. (2019) argue that CCA fails to measure meaningful similarities between representations that have a higher dimension than the number of data points and introduce centered kernel alignment (CKA) to solve this problem. They successfully use CKA to identify correspondences between activations in networks trained from different initializations.

Figure 1: On the impact of anchor points and parameter sharing on the emergence of multilingual representations. We train bilingual masked language models and remove parameter sharing for the embedding layers and the first few Transformer layers to probe the impact of anchor points and shared structure on cross-lingual transfer.
Figure 2: Probing the layer similarity of monolingual BERT models. We investigate the similarity of separate monolingual BERT models at different levels. We use an orthogonal mapping between the pooled representations of each model. We also quantify the similarity using the centered kernel alignment (CKA) similarity index.

3 Cross-lingual Pretraining

We study a standard multilingual masked language modeling formulation and evaluate performance on several different cross-lingual transfer tasks, as described in this section.

3.1 Multilingual Masked Language Modeling

Our multilingual masked language models follow the setup used by both mBERT and XLM. We use the implementation of Lample and Conneau (2019). Specifically, we consider continuous streams of 256 tokens and mask 15% of the input tokens which we replace 80% of the time by a mask token, 10% of the time with the original word, and 10% of the time with a random word. Note the random words could be foreign words. The model is trained to recover the masked tokens from its context Taylor (1953). The subword vocabulary and model parameters are shared across languages. Note the model has a softmax prediction layer shared across languages. We use BPE Sennrich et al. (2016) to learn subword vocabulary and Wikipedia for training data, preprocessed by Moses Koehn et al. (2007) and Stanford word segmenter (for Chinese only). During training, we sample a batch of continuous streams of text from one language proportionally to the fraction of sentences in each training corpus, exponentiated to the power .
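The corruption step described above can be sketched as follows. This is a minimal illustration rather than the implementation of Lample and Conneau (2019): the `MASK` symbol, the function name, and the choice to select each position independently with probability 15% are assumptions made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; the actual token depends on the vocabulary


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, targets); targets[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption: each position is selected independently with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif r < 0.9:
            pass  # 10%: keep the original word
        else:
            # 10%: random word; with a shared vocabulary it may come from the other language
            corrupted[i] = rng.choice(vocab)
    return corrupted, targets
```

Passing a vocabulary restricted to one language as `vocab` would correspond to sampling random words within a single language.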

Pretraining details

Each model is a Transformer Vaswani et al. (2017) with 8 layers, 12 heads and GELU activation functions Hendrycks and Gimpel (2016). The output softmax layer is tied with input embeddings Press and Wolf (2017). The embeddings dimension is 768, the hidden dimension of the feed-forward layer is 3072, and dropout is 0.1. We train our models with the Adam optimizer Kingma and Ba (2014) and the inverse square root learning rate scheduler of Vaswani et al. (2017) with learning rate and 30k linear warmup steps. We train each model with 8 NVIDIA V100 GPUs with 32GB of memory and mixed precision. We use batch size 96 for each GPU and each epoch contains 200k batches. We stop training at epoch 200 and select the best model based on English dev perplexity for evaluation.
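For reference, an inverse square root schedule with linear warmup can be sketched as below. The text does not give the learning rate value, so `peak_lr` is a placeholder argument; the function name is hypothetical. This parameterization (scale linearly to the peak, then decay with the inverse square root of the step) matches the schedule of Vaswani et al. (2017) up to a constant factor.

```python
def inverse_sqrt_lr(step, peak_lr, warmup_steps=30_000):
    """Learning rate at a given update step.

    peak_lr is a placeholder: the value used in the paper is not stated here.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first warmup_steps updates.
        return peak_lr * step / warmup_steps
    # Afterwards, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5
```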

3.2 Cross-lingual Evaluation

We consider three NLP tasks to evaluate performance: natural language inference (NLI), named entity recognition (NER) and dependency parsing (Parsing). We adopt the zero-shot cross-lingual transfer setting, where we (1) fine-tune the pretrained model and a randomly initialized task-specific layer with source language supervision (in this case, English) and (2) directly transfer the model to target languages with no additional training. We select the model and tune the hyperparameters with the English dev set. We report results averaged over the best two sets of hyperparameters.

Fine-tuning details

We fine-tune the model for 10 epochs for NER and Parsing and 200 epochs for NLI. For NER and Parsing, we search over batch size and learning rate. For XNLI, we search over batch size, encoder learning rate, and classifier learning rate. We use Adam with a fixed learning rate for XNLI; for NER and Parsing, we warm up the learning rate for the first 10% of batches and then decrease it linearly to 0. We save a checkpoint after each epoch.


We use the cross-lingual natural language inference (XNLI) dataset Conneau et al. (2018). The task-specific layer is a linear mapping to a softmax classifier, which takes the representation of the first token as input.


We use WikiAnn Pan et al. (2017), a silver NER dataset built automatically from Wikipedia, for English-Russian and English-French. For English-Chinese, we use CoNLL 2003 English Tjong Kim Sang and De Meulder (2003) and a Chinese NER dataset Levow (2006), with realigned Chinese NER labels based on the Stanford word segmenter. We model NER as BIO tagging. The task-specific layer is a linear mapping to a softmax classifier, which takes the representation of the first subword of each word as input. We adopt a simple post-processing heuristic to obtain a valid span, rewriting standalone I-X into B-X and B-X I-Y I-Z into B-Z I-Z I-Z, following the final entity type. We report the span-level F1.
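The post-processing heuristic can be sketched as follows. The function name is hypothetical, and one detail is an assumption: a standalone I-X followed by further I-* tags is treated as a single span that also takes the final entity type.

```python
def repair_bio(tags):
    """Rewrite a BIO tag sequence so that every span is valid and uniformly typed."""
    fixed = list(tags)
    i = 0
    while i < len(tags):
        if tags[i] == "O":
            i += 1
            continue
        # A span starts at B-* or at a standalone I-* and extends over the I-* tags that follow.
        j = i + 1
        while j < len(tags) and tags[j].startswith("I-"):
            j += 1
        label = tags[j - 1][2:]  # the final entity type decides the type of the whole span
        fixed[i] = "B-" + label
        for k in range(i + 1, j):
            fixed[k] = "I-" + label
        i = j
    return fixed
```

For example, `repair_bio(["B-PER", "I-LOC", "I-ORG"])` yields `["B-ORG", "I-ORG", "I-ORG"]`, and a lone `I-PER` surrounded by `O` becomes `B-PER`.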


Finally, we use the Universal Dependencies (UD v2.3) Nivre (2018) for dependency parsing. We consider the following four treebanks: English-EWT, French-GSD, Russian-GSD, and Chinese-GSD. The task-specific layer is a graph-based parser Dozat and Manning (2016), using representations of the first subword of each word as inputs. We measure performance with the labeled attachment score (LAS).

4 Dissecting mBERT/XLM models

Figure 3: Cross-lingual transfer of bilingual MLM on three tasks and language pairs under different settings. Other tasks and language pairs follow a similar trend. See Tab. 1 for full results.

We hypothesize that the following factors play important roles in what makes multilingual BERT multilingual: domain similarity, shared vocabulary (or anchor points), shared parameters, and language similarity. Without loss of generality, we focus on bilingual MLM. We consider three pairs of languages: English-French, English-Russian, and English-Chinese.

| Model | Domain | BPE Merges | Anchor Pts | Share Param. | Softmax | XNLI (Acc) fr | ru | zh | avg | NER (F1) fr | ru | zh | avg | Parsing (LAS) fr | ru | zh | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Default | Wiki-Wiki | 80k | all | all | shared | 73.6 | 68.7 | 68.3 | 70.2 | 79.8 | 60.9 | 63.6 | 68.1 | 73.2 | 56.6 | 28.8 | 52.9 |
| Domain Similarity (§ 4.1) | | | | | | | | | | | | | | | | | |
| Wiki-CC | Wiki-CC | - | - | - | - | 74.2 | 65.8 | 66.5 | 68.8 | 74.0 | 49.6 | 61.9 | 61.8 | 71.3 | 54.8 | 25.2 | 50.4 |
| Anchor Points (§ 4.2) | | | | | | | | | | | | | | | | | |
| Default anchors | - | 40k/40k | - | - | - | 74.0 | 68.1 | 68.9 | 70.3 | 79.8 | 60.9 | 63.6 | 68.1 | 73.2 | 56.6 | 28.8 | 52.9 |
| No anchors | - | 40k/40k | 0 | - | - | 72.1 | 67.5 | 67.7 | 69.1 | 74.0 | 57.9 | 65.0 | 65.6 | 72.3 | 56.2 | 27.4 | 52.0 |
| Extra anchors | - | - | extra | - | - | 74.0 | 69.8 | 72.1 | 72.0 | 76.1 | 59.7 | 66.8 | 67.5 | 73.3 | 56.9 | 29.2 | 53.1 |
| Parameter Sharing (§ 4.3) | | | | | | | | | | | | | | | | | |
| Sep Emb | - | 40k/40k | 0* | sep. emb | lang-specific | 72.7 | 63.6 | 60.8 | 65.7 | 75.5 | 57.5 | 59.0 | 64.0 | 71.7 | 54.0 | 27.5 | 51.1 |
| Sep Emb + L1-3 | - | 40k/40k | 0* | sep. emb + L1-3 | lang-specific | 69.2 | 61.7 | 56.4 | 62.4 | 73.8 | 46.8 | 53.5 | 58.0 | 68.2 | 53.6 | 23.9 | 48.6 |
| Sep Emb + L1-6 | - | 40k/40k | 0* | sep. emb + L1-6 | lang-specific | 51.6 | 35.8 | 34.4 | 40.6 | 56.5 | 5.4 | 1.0 | 21.0 | 50.9 | 6.4 | 1.5 | 19.6 |
| Sep L1-3 | - | 40k/40k | - | sep. L1-3 | - | 72.4 | 65.0 | 63.1 | 66.8 | 74.0 | 53.3 | 60.8 | 62.7 | 69.7 | 54.1 | 26.4 | 50.1 |
| Sep L1-6 | - | 40k/40k | - | sep. L1-6 | - | 61.9 | 43.6 | 37.4 | 47.6 | 61.2 | 23.7 | 3.1 | 29.3 | 61.7 | 31.6 | 12.0 | 35.1 |

Table 1: Dissecting bilingual MLM based on zero-shot cross-lingual transfer performance. - denotes the same as the first row (Default). The avg columns are the mean over fr, ru and zh. See Figure 3 for a visualization of three columns.

4.1 Domain Similarity

Multilingual BERT and XLM are trained on the Wikipedia comparable corpora. Domain similarity has been shown to affect the quality of cross-lingual word embeddings Conneau et al. (2017), but this effect is not well established for masked language models. We consider domain difference by training on Wikipedia for English and a random subset of Common Crawl of the same size for the other languages (Wiki-CC). We also consider a model trained with Wikipedia only, the same as XLM (Default) for comparison.

The first group in Tab. 1 shows that domain mismatch has a relatively modest effect on performance. XNLI and parsing performance drop around 2 points while NER drops over 6 points, averaged over all languages. One possible reason is that the labeled WikiAnn data consists of Wikipedia text, so domain differences between source and target language during pretraining hurt performance more. Indeed, for English and Chinese NER, where neither side comes from Wikipedia, performance only drops around 2 points.

4.2 Anchor points

Anchor points are identical strings that appear in both languages in the training corpus. Words like DNA or Paris appear in the Wikipedia of many languages with the same meaning. In mBERT, anchor points are naturally preserved due to joint BPE and shared vocabulary across languages. Anchor point existence has been suggested as a key ingredient for effective cross-lingual transfer since they allow the shared encoder to have at least some direct tying of meaning across different languages Lample and Conneau (2019); Pires et al. (2019); Wu and Dredze (2019). However, this effect has not been carefully measured.

We present a controlled study of the impact of anchor points on cross-lingual transfer performance by varying the amount of shared subword vocabulary across languages. Instead of using a single joint BPE with 80k merges, we use language-specific BPE with 40k merges for each language. We then build the vocabulary by taking the union of the vocabularies of the two languages and train a bilingual MLM (Default anchors). To remove anchor points, we add a language prefix to each word in the vocabulary before taking the union. A bilingual MLM trained on such data (No anchors) has no shared vocabulary across languages. However, it still has a single softmax prediction layer shared across languages and tied with input embeddings.
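A small sketch of the two vocabulary constructions, using toy vocabularies. The function names and the `en:` / `fr:` prefix format are hypothetical; the text only states that a language prefix is added to each word before taking the union.

```python
def anchor_points(vocab_a, vocab_b):
    """Identical strings that appear in both vocabularies."""
    return set(vocab_a) & set(vocab_b)


def build_vocab(vocab_a, vocab_b, lang_a="en", lang_b="fr", keep_anchors=True):
    """Union of two language-specific vocabularies, with or without anchor points."""
    if keep_anchors:
        # Default anchors: identical strings collapse into a single shared entry.
        return set(vocab_a) | set(vocab_b)
    # No anchors: a language prefix (hypothetical format) makes every entry language-specific.
    return {f"{lang_a}:{t}" for t in vocab_a} | {f"{lang_b}:{t}" for t in vocab_b}


en_vocab = {"DNA", "Paris", "the"}
fr_vocab = {"DNA", "Paris", "le"}
default_vocab = build_vocab(en_vocab, fr_vocab)
no_anchor_vocab = build_vocab(en_vocab, fr_vocab, keep_anchors=False)
```

With these toy inputs, the default union has 4 entries (two of them anchor points), while the prefixed union has 6 entries and no string shared between languages.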

We additionally increase anchor points by using a bilingual dictionary to create code-switched data for training a bilingual MLM (Extra anchors). Given two languages and a bilingual dictionary between them, we add anchors to the training data as follows. For each training word in the bilingual dictionary, we either leave it as is (70% of the time) or randomly replace it with one of the possible translations from the dictionary (30% of the time). We change at most 15% of the words in a batch and sample word translations from PanLex Kamholz et al. (2014) bilingual dictionaries, weighted according to their translation quality. (Footnote 1: Although we only consider pairs of languages, this procedure naturally scales to multiple languages, which could produce larger gains in future work.)
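The code-switching procedure can be illustrated with the sketch below. The dictionary format (a word mapped to a list of `(translation, quality)` pairs), the function name, and the random visiting order used to enforce the 15% cap are assumptions for the example, not details taken from the paper.

```python
import random


def code_switch(words, dictionary, rng, replace_prob=0.3, max_frac=0.15):
    """Replace some dictionary words with sampled translations.

    dictionary maps a word to a list of (translation, quality) pairs (assumed format).
    """
    budget = int(max_frac * len(words))  # change at most 15% of the words
    out = list(words)
    # Visit positions in random order so the cap does not favour the start of the batch.
    for i in rng.sample(range(len(words)), len(words)):
        if budget == 0:
            break
        entries = dictionary.get(words[i])
        # Words outside the dictionary are never changed; others are kept 70% of the time.
        if not entries or rng.random() >= replace_prob:
            continue
        translations, weights = zip(*entries)
        # Sample one translation, weighted by its quality score.
        out[i] = rng.choices(translations, weights=weights, k=1)[0]
        budget -= 1
    return out
```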

The second group of Tab. 1 shows cross-lingual transfer performance under the three anchor point conditions. Anchor points have a clear effect on performance and more anchor points help, especially in the less closely related language pairs (e.g. English-Chinese has a larger effect than English-French with over 3 points improvement on NER and XNLI). However, surprisingly, effective transfer is still possible with no anchor points. Comparing no anchors and default anchors, the performance of XNLI and parsing drops only around 1 point while NER drops over 2 points averaging over three languages. Overall, these results strongly suggest that we have previously overestimated the contribution of anchor points during multilingual pretraining.

4.3 Parameter sharing

Given that anchor points are not required for transfer, a natural next question is the extent to which we need to tie the parameters of the transformer layers. Sharing the parameters of the top layer is necessary to provide shared inputs to the task-specific layer. However, as seen in Figure 1, we can progressively separate the bottom layers 1:3 and 1:6 of the Transformer and/or the embedding layers (including positional embeddings) (Sep Emb; Sep Emb+L1-3; Sep Emb+L1-6; Sep L1-3; Sep L1-6). Since the prediction layer is tied with the embedding layer, separating the embedding layer also introduces a language-specific softmax prediction layer for the cloze task. Additionally, we only sample random words within one language during the MLM pretraining. During fine-tuning on the English training set, we freeze the language-specific layers and only fine-tune the shared layers.

The third group in Tab. 1 shows cross-lingual transfer performance under different parameter sharing conditions, where “Sep” denotes which layers are not shared across languages. Sep Emb (effectively no anchor points) drops more than No anchors, by about 3 points on XNLI and around 1 point on NER and parsing, suggesting that having a cross-language softmax layer also helps to learn cross-lingual representations. Performance degrades as fewer layers are shared for all pairs, and again the less closely related language pairs lose the most. Most notably, cross-lingual transfer performance drops to random when separating the embeddings and the bottom 6 layers of the transformer. However, reasonably strong levels of transfer are still possible without tying the bottom three layers. These trends suggest that parameter sharing is the key ingredient that enables the learning of an effective cross-lingual representation space, and that having language-specific capacity does not help learn a language-specific encoder for cross-lingual representation. Our hypothesis is that the representations that the models learn for different languages are similarly shaped and that models can reduce their capacity budget by aligning representations for text that has similar meaning across languages.

4.4 Language Similarity

Finally, in contrast to many of the experiments above, language similarity seems to be quite important for effective transfer. Looking at Tab. 1 column by column in each task, we observe performance drops as language pairs become more distantly related. Using extra anchor points helps to close the gap. However, the more complex tasks seem to have larger performance gaps and having language-specific capacity does not seem to be the solution. Future work could consider scaling the model with more data and cross-lingual signal to close the performance gap.

4.5 Conclusion

As summarized in Figure 3, parameter sharing is the most important factor. More anchor points help, but anchor points and shared softmax projection parameters are not necessary. Joint BPE and domain similarity contribute only a little to learning cross-lingual representations.

5 Similarity of BERT Models

To better understand the robust transfer effects of the last section, we show that independently trained monolingual BERT models learn representations that are similar across languages, much like the widely observed similarities in word embedding spaces found with methods such as word2vec. In this section, we show that independent monolingual BERT models produce highly similar representations when evaluated at the word level (§ 5.1.1) and sentence level (§ 5.1.2). We also plot the cross-lingual similarity of neural network activations with centered kernel alignment (§ 5.2) at each layer, to better visualize the patterns that emerge. We consider five languages: English, French, German, Russian, and Chinese.

5.1 Aligning Monolingual BERTs

To measure similarity, we learn an orthogonal mapping using the Procrustes Smith et al. (2017) approach:

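$$W^{\star} = \underset{W \in O_d(\mathbb{R})}{\operatorname{argmin}} \, \lVert WX - Y \rVert_F = UV^{T}$$

with $U \Sigma V^{T} = \text{SVD}(YX^{T})$, where $X$ and $Y$ are representations of two monolingual BERT models, sampled at different granularities as described below. We apply iterative normalization on $X$ and $Y$ before learning the mapping Zhang et al. (2019).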
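A minimal NumPy sketch of the orthogonal Procrustes solution follows. It assumes the two representation matrices store one example per row (so the product `Y.T @ X` plays the role of the cross-covariance between the two spaces), and it omits the iterative normalization step; the function name is hypothetical.

```python
import numpy as np


def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W.T - Y||_F for row-aligned X, Y of shape (n, d)."""
    # SVD of the d x d cross-covariance between the two spaces.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt


# Toy check: if Y is an exact rotation of X, the rotation is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ Q.T
W = procrustes(X, Y)
```

Mapped source representations are then obtained as `X @ W.T` and compared against `Y`.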

5.1.1 Word-level alignment

Figure 6: Alignment of word-level representations from monolingual BERT models on a subset of the MUSE benchmark. Figure 6(a) and Figure 6(b) are not comparable due to different embedding vocabularies.

In this section, we align both the non-contextual word representations from the embedding layers, and the contextual word representations from the hidden states of the Transformer at each layer.

For non-contextualized word embeddings, we define $X$ and $Y$ as the word embedding layers of monolingual BERT, which contain a single embedding per word (token). Note that in this case we only keep words consisting of a single subword. For contextualized word representations, we first encode 500k sentences in each language. At each layer, and for each word, we collect all contextualized representations of the word in the 500k sentences and average them to get a single embedding. Since BERT operates at the subword level, for one word we consider the average of all its subword embeddings. Eventually, we get one word embedding per layer. We use the MUSE benchmark Conneau et al. (2017), a bilingual dictionary induction dataset, for alignment supervision and evaluate the alignment on word translation retrieval. As a baseline, we use the first 200k embeddings of fastText Bojanowski et al. (2017) and learn the mapping using the same procedure as § 5.1. Note that we use a subset of the 200k fastText vocabulary, the same as for BERT, to get comparable numbers. We retrieve word translations by CSLS Conneau et al. (2017) with K=10.
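CSLS retrieval can be sketched as follows, using the standard definition from Conneau et al. (2017): the cosine similarity of a pair is doubled and penalized by the mean similarity of each side to its K nearest neighbours in the other space. The function name and the brute-force dense similarity matrix are choices made for this illustration.

```python
import numpy as np


def csls_retrieve(src, tgt, k=10):
    """Index of the best CSLS target for each row of src (already mapped into target space)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T  # (n_src, n_tgt) cosine similarities
    k_tgt = min(k, cos.shape[1])
    k_src = min(k, cos.shape[0])
    # Mean similarity of each source vector to its k nearest target neighbours.
    r_src = np.sort(cos, axis=1)[:, -k_tgt:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest source neighbours.
    r_tgt = np.sort(cos, axis=0)[-k_src:, :].mean(axis=0)
    scores = 2 * cos - r_src[:, None] - r_tgt[None, :]
    return scores.argmax(axis=1)
```

The penalty terms reduce the score of "hub" vectors that are close to many points, which is the motivation for using CSLS instead of plain nearest-neighbour search.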

In Figure 6, we report the alignment results under these two settings. Figure 6(a) shows that the subword embedding matrix of BERT, where each subword is a standalone word, can easily be aligned with an orthogonal mapping and obtains slightly better performance than the same subset of fastText. Figure 6(b) shows that the embedding matrix built from the average of all contextual embeddings of each word can also be aligned to obtain a bilingual dictionary of decent quality, although it underperforms fastText. We notice that contextual representations from higher layers obtain better results than those from lower layers.

5.1.2 Sentence-level alignment

Figure 7: Parallel sentence retrieval accuracy after Procrustes alignment of monolingual BERT models.

In this case, $X$ and $Y$ are obtained by average pooling the subword representations (excluding special tokens) of sentences at each layer of monolingual BERT. We use multi-way parallel sentences from XNLI Conneau et al. (2018), a cross-lingual natural language inference dataset, for alignment supervision and Tatoeba Schwenk et al. (2019), a sentence retrieval dataset, for evaluation.

Figure 7 shows the sentence similarity search results with nearest neighbor search and cosine similarity, evaluated by precision at 1, with four language pairs. Here the best result is obtained at lower layers. The performance is surprisingly good given that we only use 10k parallel sentences to learn the alignment, without any fine-tuning. As a reference, the state-of-the-art performance is over 95%, obtained by LASER Artetxe and Schwenk (2019) trained with millions of parallel sentences.
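The pooling and evaluation steps can be sketched as below. The function names and the boolean mask marking non-special tokens are assumptions; the sketch also assumes the source embeddings have already been mapped into the target space and that row i of each matrix belongs to the same translation pair.

```python
import numpy as np


def mean_pool(hidden, keep_mask):
    """Average the subword vectors of one sentence, skipping special tokens.

    hidden: (seq_len, d) array; keep_mask: (seq_len,) bool, False for special tokens.
    """
    return hidden[keep_mask].mean(axis=0)


def precision_at_1(src, tgt):
    """Fraction of source sentences whose cosine nearest neighbour is their translation."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())
```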

Figure 8: CKA similarity of mean-pooled multi-way parallel sentence representations at each layer. Note that en′ is a paraphrase of en obtained by back-translation (en-fr-en). The random encoder is only used for non-English sentences. L0 is the embedding layer while L1 to L8 are the corresponding transformer layers. The average row is the average of the 9 (L0-L8) similarity measurements.

5.1.3 Conclusion

These findings demonstrate that both word-level and sentence-level BERT representations can be aligned with a simple orthogonal mapping. Similar to the alignment of word embeddings Mikolov et al. (2013), this shows that BERT models, resembling deep CBOW, are similar across languages. This result gives more intuition on why mere parameter sharing is sufficient for multilingual representations to emerge in multilingual masked language models.

5.2 Neural network similarity

Based on the work of kornblith2019similarity, we examine the centered kernel alignment (CKA), a neural network similarity index that improves upon canonical correlation analysis (CCA), and use it to measure the similarity across both monolingual and bilingual masked language models. The linear CKA similarity measure is defined as follows:

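$$\text{CKA}(X, Y) = \frac{\text{HSIC}(K, L)}{\sqrt{\text{HSIC}(K, K)\,\text{HSIC}(L, L)}}, \qquad \text{HSIC}(K, L) = \frac{1}{(n-1)^{2}}\,\text{tr}(KHLH), \qquad K = XX^{T}, \; L = YY^{T}$$

In the above equations, $X$ and $Y$ correspond respectively to the matrices of the $d$-dimensional mean-pooled (excluding special tokens) subword representations at layer $l$ of the $n$ parallel source and target sentences, and $H$ corresponds to the centering matrix $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$.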

The linear CKA is invariant to both orthogonal transformation and isotropic scaling, but it is not invariant to arbitrary invertible linear transformations, a property that allows measuring the similarity between representations of higher dimension than the number of data points Kornblith et al. (2019).
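Linear CKA can be computed in a few lines. The sketch below uses the feature-space form given by Kornblith et al. (2019), which for a linear kernel is equivalent to the HSIC-based definition; the function name is hypothetical and rows are assumed to correspond to the same n parallel sentences in both matrices.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between X (n, d1) and Y (n, d2) whose rows describe the same n inputs."""
    # Centering the columns is equivalent to applying the centering matrix H.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))


# Toy check of the invariances stated above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # orthogonal transformation
rotated_scaled = 3.0 * X @ Q                    # orthogonal map plus isotropic scaling
stretched = X @ np.diag(np.arange(1.0, 9.0))    # invertible but not orthogonal
```

Here `linear_cka(X, rotated_scaled)` equals 1 up to floating-point error, whereas `linear_cka(X, stretched)` is strictly smaller, illustrating that the index is not invariant to general invertible linear maps.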

In Figure 8, we show the CKA similarity of monolingual models, compared with bilingual models and random encoders, on multi-way parallel sentences Conneau et al. (2018) for five language pairs: English to English′ (obtained by back-translation from French), French, German, Russian, and Chinese. The monolingual en′ model is trained on the same data as en but with a different random seed, and the bilingual en-en′ model is trained on English data but with separate embedding matrices as in § 4.3. The rest of the bilingual MLMs are trained with the Default setting. We only use the random encoder for non-English sentences.

Figure 8 shows that bilingual models have slightly higher similarity than monolingual models, with random encoders serving as a lower bound. Despite the slightly lower similarity between monolingual models, it still explains the alignment performance in § 5.1. Because the measurement is also invariant to orthogonal mapping, the CKA similarity is highly correlated with the sentence-level alignment performance in Figure 7, with over 0.9 Pearson correlation for all four language pairs. For monolingual and bilingual models, the first few layers have the highest similarity, which explains why Wu and Dredze (2019) find that freezing the bottom layers of mBERT helps cross-lingual transfer. The similarity gap between the monolingual and bilingual models decreases as the language pair becomes more distant. In other words, when languages are similar, using the same model increases representation similarity. On the other hand, when languages are dissimilar, using the same model does not help representation similarity much. Future work could consider how to best train multilingual models covering distantly related languages.

6 Discussion

In this paper, we show that multilingual representations can emerge from unsupervised multilingual masked language models with only parameter sharing of some Transformer layers. Even without any anchor points, the model can still learn to map representations coming from different languages into a single shared embedding space. We also show that isomorphic embedding spaces emerge from monolingual masked language models in different languages, similar to word2vec embedding spaces Mikolov et al. (2013). By using a simple linear mapping, we are able to align the embedding layers and the contextual representations of Transformers trained in different languages. We also use the CKA neural network similarity index to probe the similarity between BERT models and show that the early layers of the Transformers are more similar across languages than the last layers. All of these effects were stronger for more closely related languages, suggesting there is room for significant future work on more distant language pairs.