First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT

by Benjamin Müller, et al.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges when a model is fine-tuned on a task of interest in one language and evaluated on a distinct language not seen during fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance for the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
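
The abstract mentions a layer ablation in which part of the network is reinitialized before fine-tuning. A minimal toy sketch of that idea follows. It is not the authors' code: the model is a plain list of weight vectors, and the helper names (`make_toy_model`, `reinit_layers`), the 12-layer depth, the split after layer 8, and the 0.02 standard deviation are all assumptions made for illustration. The abstract does not state where the encoder ends and the predictor begins.

```python
import copy
import random


def make_toy_model(num_layers=12, width=4, seed=0):
    """Hypothetical stand-in for pretrained weights: one small vector per layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_layers)]


def reinit_layers(pretrained, layer_ids, seed=1, std=0.02):
    """Return a copy of `pretrained` with the chosen layers redrawn at random.

    The untouched layers keep their pretrained values, so any later
    fine-tuning starts from a partly pretrained, partly random model.
    """
    rng = random.Random(seed)
    model = copy.deepcopy(pretrained)
    for i in layer_ids:
        model[i] = [rng.gauss(0.0, std) for _ in model[i]]
    return model


pretrained = make_toy_model()

# Assumed split for illustration only: layers 0-7 play the role of the
# multilingual encoder, layers 8-11 the role of the task predictor.
predictor_layers = range(8, 12)
ablated = reinit_layers(pretrained, predictor_layers)

# Indices of the layers that still hold their pretrained values.
kept = [i for i in range(len(pretrained)) if ablated[i] == pretrained[i]]
print("layers kept from pretraining:", kept)
```

In an experiment of this kind, one would fine-tune both `pretrained` and `ablated` on the source language and compare their scores on the target language. A small gap after resetting the upper layers would be consistent with the claim that the predictor matters little for transfer.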




Can Monolingual Pretrained Models Help Cross-Lingual Classification?

Multilingual pretrained language models (such as multilingual BERT) have...

On the Universality of Deep COntextual Language Models

Deep Contextual Language Models (LMs) like ELMO, BERT, and their success...

On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning

Recent work has shown evidence that the knowledge acquired by multilingu...

A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models

Recent work on tokenizer-free multilingual pretrained models show promis...

It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT

Recent works have demonstrated that multilingual BERT (mBERT) learns ric...

SQuId: Measuring Speech Naturalness in Many Languages

Much of text-to-speech research relies on human evaluation, which incurs...

Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models

There is growing evidence that pretrained language models improve task-s...