XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

by Yaobo Liang, et al.

In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and to evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora of different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.




1 Introduction

Pre-training followed by fine-tuning has become a new NLP paradigm, in which general knowledge is first learnt from a large-scale corpus by self-supervised learning and then transferred to downstream tasks by task-specific fine-tuning. Three different types of pre-trained models have been explored recently: monolingual pre-trained models Radford et al. (2018); Devlin et al. (2019); Liu et al. (2019); Yang et al. (2019b); Dong et al. (2019); Lewis et al. (2019a), multilingual and cross-lingual pre-trained models Devlin et al. (2019); Conneau and Lample (2019); Huang et al. (2019); Conneau et al. (2019), and multimodal pre-trained models Lu et al. (2019); Li et al. (2020); Chen et al. (2019); Zhou et al. (2020). In this paper, we focus on cross-lingual pre-trained models because of their importance in alleviating the low-resource issue among languages, where an NLP task often has rich training data in one language (such as English) but little or no training data in other languages (such as French and German). To further advance the development of cross-lingual pre-trained models for various downstream tasks in different languages, this paper introduces XGLUE, a new benchmark dataset that can be used to: (i) train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and (ii) evaluate the generalization capabilities of cross-lingual pre-trained models across a diverse set of cross-lingual tasks.

The contribution of XGLUE is two-fold. First, it provides 11 diversified cross-lingual tasks covering both understanding and generation scenarios, which, to the best of our knowledge, is the first such attempt among cross-lingual dataset construction efforts. XTREME Hu et al. (2020) is concurrent work to XGLUE, but it includes cross-lingual understanding tasks only. Second, an extended version of Unicoder Huang et al. (2019) is described and evaluated as a strong cross-lingual pre-trained model baseline on XGLUE for both understanding and generation tasks. We also evaluate the base versions (12-layer) of Multilingual BERT Devlin et al. (2019), XLM Conneau and Lample (2019) and XLM-R Conneau et al. (2019) for comparison. We conduct comprehensive experiments on XGLUE, which not only yield interesting findings but also point to several ways to further improve cross-lingual pre-trained models.

2 XGLUE Benchmark (https://to-be-released.)

2.1 Pre-training Corpus

We collect two corpora, Small Corpus and Large Corpus, with different sizes for cross-lingual pre-training: the former can be used to evaluate new ideas effectively and the latter can be used to train large-scale models. Table 1 lists the data statistics.

2.1.1 Small Corpus (SC)

Multilingual Corpus

We extract raw sentences from the Wikipedia dump using WikiExtractor (https://github.com/attardi/wikiextractor), which leads to a 101G multilingual corpus covering 100 languages.

Bilingual Corpus

We use an in-house pipeline to extract bilingual sentence pairs from the Web, which leads to a 99G bilingual corpus covering 27 languages, including Arabic, Bulgarian, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Swedish, Swahili, Thai, Turkish, Urdu, Vietnamese and Chinese.

2.1.2 Large Corpus (LC)

Multilingual Corpus

Following Wenzek et al. (2019), we construct a clean version of Common Crawl (CC) (https://commoncrawl.org/) as the multilingual corpus. First, we use a language identification model trained on Wikipedia to classify the language of each page in CC. Then, we train a language model for each language using the corresponding part of the Wikipedia corpus, and use it to filter documents as Wenzek et al. (2019) did. We use one CC dump for English and twelve CC dumps for other languages. This leads to a 2,500G multilingual corpus covering 89 languages. We also include the 101G multilingual corpus described in Section 2.1.1.

Bilingual Corpus

We reuse the bilingual corpus described in Section 2.1.1. We will add CCMatrix Schwenk et al. (2019) in the future.

Corpus | Type | # of Languages | Size
Small Corpus | Multilingual | 100 | 101G
Small Corpus | Bilingual | 27 | 99G
Large Corpus | Multilingual | 100 | 2,500G+101G
Large Corpus | Bilingual | 27 | 99G
Table 1: The statistics of the two pre-training corpora.

2.2 Downstream Tasks

Task | # of Languages | Train | Dev | Test | Metric | Data Source
NER | 4 | 15.0K | 2.8K | 3.4K | F1 | ECI Multilingual Text Corpus
POS | 18 | 25.4K | 1.0K | 0.9K | ACC | UD Tree-banks (v2.5)
NC* | 5 | 100K | 9.6K | 9.6K | ACC | MSN
MLQA | 7 | 87.6K | 0.6K | 5.7K | F1 | Wikipedia
XNLI | 15 | 433K | 2.5K | 5K | ACC | MultiNLI Corpus
PAWS-X | 4 | 49.4K | 2K | 2K | ACC | Wikipedia
QADSM* | 3 | 100K | 10K | 10K | ACC | Commercial Search Engine
WPR* | 6 | 100K | 10K | 10K | nDCG | Commercial Search Engine
QAM* | 3 | 100K | 10K | 10K | ACC | Commercial Search Engine
QG* | 3 | 100K | 3.3K | 3.8K | BLEU-4 | Commercial Search Engine
NTG* | 5 | 1,170K | 10K | 10K | BLEU-4 | MSN
Table 2: The 11 downstream tasks in XGLUE. For each task, the training set is only available in English. Train denotes the number of labeled instances in the training set. Dev and Test denote the average numbers of labeled instances in the dev sets and test sets, respectively. * denotes that the corresponding dataset is constructed by this paper rather than taken from an existing task.
Table 3: The 19 languages covered by the 11 downstream tasks, including Arabic (ar), Bulgarian (bg), German (de), Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh).

We select 11 cross-lingual tasks for XGLUE, which are categorized into 3 groups: single-input understanding tasks, pair-input understanding tasks, and generation tasks. For each task, the training set is only available in English. To perform well on XGLUE, a model must learn a task from its English training set and then transfer this ability to test sets in other languages. Table 2 gives the dataset statistics and Table 3 lists the languages covered by all tasks.

2.2.1 Single-input Understanding Tasks


NER

We select a subset of the following two NER tasks, CoNLL-2002 NER Sang (2002) and CoNLL-2003 NER Sang and De Meulder (2003), to form this cross-lingual NER dataset. It covers 4 languages (English, German, Spanish and Dutch) and 4 types of named entities (Person, Location, Organization, and Miscellaneous entities that do not belong to the previous three types). F1 score is used as the metric.

POS Tagging (POS)

Following Kim et al. (2017), we select a subset of Universal Dependencies (UD) Treebanks (v2.5) Zeman et al. (2019), which covers 18 languages. Accuracy (ACC) of the predicted POS tags is used as the metric.

News Classification (NC)

This task aims to predict the category of a given news article. We collect (news article, news category) pairs from MSN (https://www.msn.com/). The dataset covers 10 different news categories and 5 languages: English, Spanish, French, German and Russian. Accuracy (ACC) of the multi-class classification is used as the metric.

2.2.2 Pair-input Understanding Tasks


MLQA

MLQA Lewis et al. (2019b) is a multilingual machine reading comprehension task, which contains QA annotations labeled in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Chinese. F1 score of the predicted answers is used as the metric.


XNLI

We reuse the original XNLI dataset Conneau et al. (2018) in XGLUE.


PAWS-X

PAWS-X Yang et al. (2019a) is a paraphrase identification dataset, which extends the Wikipedia portion of the PAWS Zhang et al. (2019) evaluation to more languages. We select 4 languages (English, Spanish, French and German) from the original dataset and use them in XGLUE. Accuracy (ACC) of the binary classification is used as the metric.

Query-Ads Matching (QADSM)

This task aims to find the most relevant ads for a given user query. We construct this dataset based on the user clicks obtained from a commercial search engine. It covers 3 languages: English, French and German. Each labeled instance is a 5-tuple: (query, ads keywords, ads title, ads description, label). Accuracy (ACC) of the binary classification is used as the metric.

Web Page Ranking (WPR)

This task aims to find the most relevant web page for a given user query. We construct this dataset based on the user clicks obtained from a commercial search engine. It covers 6 languages: English, German, French, Italian, Portuguese and Chinese. Each labeled instance is a 4-tuple: (query, web page title, web page snippet, label). The label takes one of 4 ratings: high relevance, medium relevance, low relevance and not related. Normalized Discounted Cumulative Gain (nDCG) is used as the metric.
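As a rough illustration of the WPR metric, the sketch below computes nDCG for a single query. The paper does not state which gain function or rank cutoff it uses, so the exponential gain, the log2 discount, the absence of a cutoff, and the mapping of the four ratings to the integers 3 down to 0 are all assumptions made here for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    # Exponential gain with a log2 rank discount (one common variant, assumed here).
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains_in_predicted_order):
    """nDCG for one query: DCG of the predicted ranking over DCG of the ideal ranking."""
    ideal = dcg(sorted(gains_in_predicted_order, reverse=True))
    return dcg(gains_in_predicted_order) / ideal if ideal > 0 else 0.0


# Assumed label mapping: 3 = high, 2 = medium, 1 = low relevance, 0 = not related.
perfect_ranking = ndcg([3, 2, 1, 0])   # pages already sorted by relevance
reversed_ranking = ndcg([0, 1, 2, 3])  # worst possible order for these labels
```

A task-level score would then typically be the mean of the per-query values, although the exact aggregation used for WPR is not specified in the text.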

QA Matching (QAM)

This task aims to determine whether a (question, passage) pair is a QA pair. We construct this dataset based on a commercial search engine. It covers 3 languages: English, French and German. Each labeled instance is a 3-tuple: (question, passage, label). The label indicates whether or not the passage is the answer to the question. Accuracy (ACC) of the binary classification is used as the metric.

2.2.3 Generation Tasks

Question Generation (QG)

This task aims to generate a natural language question for a given passage. We construct this dataset from the same data source as the QAM task, but here the inputs are passages and the outputs are questions. BLEU-4 score is used as the metric.

News Title Generation (NTG)

This task aims to generate a title for a given news body. We extract (news title, news body) pairs from MSN. The dataset covers 5 languages: German, English, French, Spanish and Russian. BLEU-4 score is used as the metric.

3 Pre-train Unicoder for Cross-lingual Understanding Tasks

We select Unicoder Huang et al. (2019) as the backbone model. Section 3 introduces a simplified version of Unicoder using two pre-training tasks (MLM and TLM) for cross-lingual understanding tasks. Section 4 describes how to extend Unicoder to cover cross-lingual generation tasks.

The original Unicoder Huang et al. (2019) includes more pre-training tasks besides MLM and TLM. To keep the baseline pre-trained model simple and to reduce the experimental cost, we use only these two most commonly used tasks. This means that, for understanding tasks, Unicoder is almost identical to XLM, apart from several hyper-parameter differences. We will add the results of Unicoder pre-trained with more tasks beyond MLM and TLM in the updated version.

3.1 Masked Language Model (MLM)

Following Devlin et al. (2019), this task extends the masked language model task to multiple languages. At each iteration, a batch is composed of sentences sampled from different languages. The sampling probability of a language $l_i$ is

$$\lambda_{l_i} = \frac{p_{l_i}^{\alpha}}{\sum_{j} p_{l_j}^{\alpha}},$$

where $p_{l_i}$ is the percentage of the language $l_i$ in the entire corpus and the smoothing factor $\alpha$ is set to 0.3. For each batch, we randomly sample 15% of the words and (i) replace them with a special symbol [MASK], (ii) replace them with a random token, or (iii) keep them unchanged, with probability 80%, 10% and 10%, respectively. For each token, we use only its word embedding and position embedding, and discard the segment embedding and language embedding.
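To make the effect of the smoothing factor concrete, the following sketch computes smoothed language sampling probabilities. It assumes the usual exponential smoothing, in which each language's corpus share is raised to the power of the smoothing factor and the results are renormalised; the function name and the corpus shares are hypothetical and are not taken from the paper.

```python
def language_sampling_probs(corpus_share, alpha=0.3):
    """Exponentially smoothed sampling distribution over languages.

    corpus_share maps a language code to its share of the corpus. Any
    non-negative weights work, because they are normalised first.
    """
    total = sum(corpus_share.values())
    smoothed = {lang: (share / total) ** alpha for lang, share in corpus_share.items()}
    norm = sum(smoothed.values())
    return {lang: value / norm for lang, value in smoothed.items()}


# Hypothetical corpus shares, chosen only for illustration.
shares = {"en": 0.80, "fr": 0.15, "sw": 0.05}
probs = language_sampling_probs(shares, alpha=0.3)
```

With a smoothing factor below 1, a low-resource language such as the hypothetical `sw` entry is sampled more often than its raw share would suggest, while the ranking of languages by frequency is preserved.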

3.2 Translation Language Model (TLM)

Following Conneau and Lample (2019), this task extends the MLM task to a bilingual corpus. Given a bilingual sentence pair, TLM first concatenates the two sentences into a single sequence, and then masks words using the same strategy as MLM. The pre-trained model learns to recover each masked word based on the bilingual context. We follow MLM to sample language pairs in each batch with $\alpha = 0.3$.
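The masking strategy shared by MLM and TLM can be sketched as below. This is a minimal illustration under stated assumptions: each token is selected independently with probability 15% (rather than sampling exactly 15% of a batch), no separator token is inserted between the two sentences, and the helper names `mask_tokens` and `tlm_example` are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """BERT-style masking: returns (corrupted tokens, {position: original token})."""
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # token not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, targets


def tlm_example(src_tokens, tgt_tokens, vocab, rng):
    """Concatenate a translation pair and mask it as one sequence."""
    return mask_tokens(src_tokens + tgt_tokens, vocab, rng)


rng = random.Random(0)
src = ["the", "cat", "sleeps"]
tgt = ["le", "chat", "dort"]
corrupted, targets = tlm_example(src, tgt, vocab=src + tgt, rng=rng)
```

Because the two sentences are masked as a single sequence, a masked word on one side can be recovered from its translation on the other side, which is the bilingual context the text refers to.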

4 Pre-train Unicoder for Cross-lingual Generation Tasks

An encoder-decoder architecture is employed to extend Unicoder to generation tasks, where the BPE embeddings are shared between the encoder and the decoder. Two separate generative tasks are proposed for Unicoder pre-training: Multilingual Denoising Auto-Encoding (xDAE) and Multilingual Future N-gram Prediction (xFNP).


4.1 Multilingual Denoising Auto-Encoding (xDAE)

Motivated by BART Lewis et al. (2019a), xDAE aims to predict the original text $X = (x_1, \ldots, x_{|X|})$ from a language $l_i$ based on its corrupted form $c(X)$, where $c(\cdot)$ is a noising function that takes an input text and returns a corrupted version of it.

Four different noising strategies for $c(\cdot)$ are explored in this paper. (1) Shuffle the input text $X$ by adding noise to the input indices and then re-ordering $X$ based on the rank of the noised indices. (2) Drop words with a probability of 0.1. (3) Replace 10% of the input words in $X$ with the [MASK] symbol. (4) Sample a number of token spans from $X$ with span lengths drawn from a Poisson distribution ($\lambda = 3$), and then replace each token span with a single [MASK] token. Here, 0-length spans correspond to the insertion of [MASK] tokens. Based on the performance of different noising strategies (Table 11), we select (4) and use it in pre-training. We leave finding better noising strategies for future work.
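Noising strategy (4) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how many spans are sampled, so a span is started at each position with a hypothetical probability `start_prob`, and the Poisson rate `lam` defaults to 3.0 as in BART's text infilling. The function names are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def sample_poisson(lam, rng):
    """Draw one Poisson-distributed integer using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def infill_noise(tokens, rng, start_prob=0.1, lam=3.0):
    """Replace sampled token spans with a single [MASK] token each."""
    noised = []
    i = 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = sample_poisson(lam, rng)
            noised.append(MASK)
            if span == 0:
                # A 0-length span is a pure insertion: the current token is kept.
                noised.append(tokens[i])
                i += 1
            else:
                i += span  # the whole span collapses into the single [MASK]
        else:
            noised.append(tokens[i])
            i += 1
    return noised


rng = random.Random(0)
tokens = [f"w{i}" for i in range(50)]
noised = infill_noise(tokens, rng)
```

Since every span is replaced by one [MASK] regardless of its length, the decoder has to infer how many tokens are missing as well as which ones.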

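We train Unicoder on this task by maximizing the following objective $\mathcal{L}_{xDAE}$:

$$\mathcal{L}_{xDAE} = \sum_{l_i \in L} \sum_{X \in l_i} \sum_{t=1}^{|X|} \log p(x_t \mid x_{<t}, c(X)),$$

where $L = \{l_1, \ldots, l_N\}$ denotes the $N$ languages, $X$ is an instance in the language $l_i$, and $p(x_t \mid x_{<t}, c(X))$ denotes the probability of generating a token $x_t$ at time step $t$ given $c(X)$ and $x_{<t}$.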

4.2 Multilingual Future N-gram Prediction (xFNP)

Motivated by ProphetNet Yan et al. (2020), xFNP introduces a future n-gram prediction mechanism to natural language generation. It encourages the model to plan for future tokens explicitly and prevents over-fitting on strong local correlations.

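Given an input text $X = (x_1, \ldots, x_{|X|})$ from a language $l_i$, we randomly mask $k$ token spans of $X$ to generate the masked text $X'$ as the input, and concatenate all masked token spans into $Y$ as the output. Details of this mask strategy are described in Section 6.1. After this, xFNP first encodes $X'$ to $H_{enc}$ with the encoder:

$$H_{enc} = \mathrm{Encoder}(X').$$

Then, instead of predicting only the next token at each time step, xFNP generates $n$ future tokens simultaneously at time step $t$ with the decoder:

$$p(y_t \mid y_{<t}, X'), \ldots, p(y_{t+n-1} \mid y_{<t}, X') = \mathrm{Decoder}(y_{<t}, H_{enc}).$$

Following Yan et al. (2020), we set $n = 2$.

We train Unicoder on this task by maximizing the following objective $\mathcal{L}_{xFNP}$:

$$\mathcal{L}_{xFNP} = \sum_{l_i \in L} \sum_{X \in l_i} \Big\{ \alpha_0 \sum_{t=1}^{|Y|} \log p(y_t \mid y_{<t}, X') + \alpha_1 \sum_{t=1}^{|Y|-1} \log p(y_{t+1} \mid y_{<t}, X') \Big\},$$

where $X'$ and $Y$ are generated from $X$ based on the method mentioned above. Following Yan et al. (2020), we set $\alpha_0 = \alpha_1 = 1$.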
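The idea of predicting two future tokens per decoding step can be illustrated with the toy sketch below. It assumes the objective is a weighted sum of two log-likelihood streams, one for the next token and one for the token after it, with both weights set to 1; the data layout `step_probs[t][j][token]` and the function names are hypothetical, and a real model would produce these probabilities with a decoder rather than a lookup table.

```python
import math


def future_ngram_targets(y, n=2):
    """For each decoding step t, the tokens (y_t, ..., y_{t+n-1}) that exist."""
    return [tuple(y[t:t + n]) for t in range(len(y))]


def xfnp_objective(step_probs, y, alphas=(1.0, 1.0)):
    """Weighted log-likelihood over the prediction streams for one target sequence.

    step_probs[t][j] maps a token to the probability assigned at step t
    to that token appearing at position t + j.
    """
    total = 0.0
    for j, alpha in enumerate(alphas):
        # Stream j has no target for the last j steps of the sequence.
        for t in range(len(y) - j):
            total += alpha * math.log(step_probs[t][j][y[t + j]])
    return total


y = ["a", "b", "c"]
targets = future_ngram_targets(y, n=2)
# Toy probabilities: every required token gets probability 0.5.
step_probs = [
    [{"a": 0.5}, {"b": 0.5}],
    [{"b": 0.5}, {"c": 0.5}],
    [{"c": 0.5}, {}],
]
value = xfnp_objective(step_probs, y)
```

For a three-token target, the next-token stream contributes three terms and the future-token stream contributes two, which is why the second stream's sum stops one step earlier.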

5 Related Work


GLUE Wang et al. (2019) includes 9 natural language understanding tasks that are labeled in English only. Compared to GLUE, XGLUE not only expands task annotations to multiple languages, but also includes natural language generation tasks. XNLI Conneau et al. (2018), NER Sang (2002); Sang and De Meulder (2003), POS Tagging Kim et al. (2017), MLQA Lewis et al. (2019b) and PAWS-X Yang et al. (2019a) are 5 multilingual datasets built for specific tasks. XGLUE not only includes these 5 existing tasks, but also introduces 6 new tasks selected from real-world scenarios (i.e., Search, Ads and News), which gives XGLUE more practical value. XTREME Hu et al. (2020) is concurrent work to XGLUE. Compared to it, XGLUE includes both understanding and generation tasks, which, to the best of our knowledge, is the first such attempt among cross-lingual dataset construction efforts.

Cross-lingual Pre-trained Model

Multilingual BERT (M-BERT) Devlin et al. (2019) performs pre-training on a multilingual corpus with the masked language model task. By sharing the model parameters and the vocabulary across all languages, M-BERT obtains cross-lingual capability over 102 languages. XLM Conneau and Lample (2019) performs cross-lingual pre-training on both a multilingual corpus and a bilingual corpus, by introducing the translation language model task into pre-training. Based on XLM, Unicoder Huang et al. (2019) uses more cross-lingual pre-training tasks and achieves better results on XNLI. XLM-R Conneau et al. (2019) is a RoBERTa Liu et al. (2019) version of XLM that does not use the translation language model in pre-training. It is trained on a much larger multilingual corpus (i.e., Common Crawl) and becomes the new state of the art on XNLI. In this paper, we use both the Common Crawl corpus and the bilingual corpus, aiming to build a stronger baseline model on XGLUE. BART Lewis et al. (2019a) and ProphetNet Yan et al. (2020) are two of the latest generative pre-trained models. We borrow ideas from these two works and extend Unicoder to cross-lingual generation tasks, which goes a step further to verify and explore different text generation approaches in cross-lingual scenarios.

6 Experiments

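Task | Model | ar | bg | de | el | en | es | fr | hi | it | nl | pl | pt | ru | sw | th | tr | ur | vi | zh | AVG
NER | M-BERT | - | - | 69.2 | - | 90.6 | 75.4 | - | - | - | 77.9 | - | - | - | - | - | - | - | - | - | 78.2
NER | XLM-R | - | - | 70.4 | - | 90.9 | 75.2 | - | - | - | 79.5 | - | - | - | - | - | - | - | - | - | 79.0
NER | Unicoder_LC | - | - | 71.8 | - | 91.1 | 74.4 | - | - | - | 81.6 | - | - | - | - | - | - | - | - | - | 79.7
POS | M-BERT | 52.4 | 85.0 | 88.7 | 81.5 | 95.6 | 86.8 | 87.6 | 58.4 | 91.3 | 88.0 | 83.2 | 88.3 | 78.8 | - | 43.3 | 69.2 | 53.8 | 54.3 | 58.3 | 74.7
POS | XLM-R | 67.3 | 88.8 | 92.2 | 88.2 | 96.2 | 89.0 | 89.9 | 74.5 | 92.6 | 88.5 | 86.9 | 89.7 | 86.9 | - | 57.9 | 72.7 | 62.1 | 55.2 | 60.4 | 79.9
POS | Unicoder_LC | 68.6 | 88.5 | 92.0 | 88.3 | 96.1 | 89.1 | 89.4 | 69.9 | 92.5 | 88.9 | 86.0 | 89.8 | 86.7 | - | 57.6 | 75.0 | 59.8 | 56.3 | 60.2 | 79.7
NC | M-BERT | - | - | 78.5 | - | 92.3 | 82.1 | 76.2 | - | - | - | - | - | - | - | 77.6 | - | - | - | - | 81.3
NC | XLM-R | - | - | 81.6 | - | 92.2 | 85.4 | 78.1 | - | - | - | - | - | - | - | 79.2 | - | - | - | - | 83.3
NC | Unicoder_LC | - | - | 81.8 | - | 92.5 | 85.4 | 77.5 | - | - | - | - | - | - | - | 79.6 | - | - | - | - | 83.4
MLQA | M-BERT | 50.9 | - | 63.8 | - | 80.5 | 67.1 | - | 47.9 | - | - | - | - | - | - | - | - | - | 59.5 | 55.4 | 60.7
MLQA | XLM-R | 56.4 | - | 62.1 | - | 80.1 | 67.9 | - | 60.5 | - | - | - | - | - | - | - | - | - | 67.1 | 61.4 | 65.1
MLQA | Unicoder_LC | 57.8 | - | 62.7 | - | 80.6 | 68.6 | - | 62.7 | - | - | - | - | - | - | - | - | - | 67.5 | 62.1 | 66.0
XNLI | M-BERT | 64.9 | 68.9 | 71.1 | 66.4 | 82.1 | 74.3 | 73.8 | 60.0 | - | - | - | - | 69.0 | 50.4 | 55.8 | 61.6 | 58.0 | 69.5 | 69.3 | 66.3
XNLI | XLM | 73.1 | 77.4 | 77.8 | 76.6 | 85.0 | 78.9 | 78.7 | 69.6 | - | - | - | - | 75.3 | 68.4 | 73.2 | 72.5 | 67.3 | 76.1 | 76.5 | 75.1
XNLI | XLM-R | 72.1 | 77.5 | 77.0 | 75.9 | 84.6 | 79.2 | 78.2 | 69.8 | - | - | - | - | 75.5 | 64.7 | 71.6 | 72.9 | 65.1 | 74.8 | 73.7 | 74.2
XNLI | Unicoder_SC | 68.5 | 73.2 | 71.6 | 71.6 | 82.9 | 75.0 | 74.7 | 66.0 | - | - | - | - | 70.6 | 64.1 | 67.0 | 68.7 | 62.5 | 71.2 | 69.7 | 70.5
XNLI | Unicoder_LC | 73.9 | 78.5 | 78.2 | 77.3 | 85.4 | 79.8 | 79.2 | 70.1 | - | - | - | - | 76.7 | 67.4 | 71.8 | 73.8 | 66.3 | 75.9 | 74.7 | 75.3
PAWS-X | M-BERT | - | - | 82.9 | - | 94.0 | 85.9 | 86.0 | - | - | - | - | - | - | - | - | - | - | - | - | 87.2
PAWS-X | XLM-R | - | - | 86.9 | - | 94.4 | 88.0 | 88.7 | - | - | - | - | - | - | - | - | - | - | - | - | 89.5
PAWS-X | Unicoder_LC | - | - | 87.4 | - | 94.9 | 88.8 | 89.3 | - | - | - | - | - | - | - | - | - | - | - | - | 90.1
QADSM | M-BERT | - | - | 60.9 | - | 69.0 | - | 63.8 | - | - | - | - | - | - | - | - | - | - | - | - | 64.6
QADSM | XLM-R | - | - | 65.9 | - | 70.6 | - | 68.6 | - | - | - | - | - | - | - | - | - | - | - | - | 68.3
QADSM | Unicoder_LC | - | - | 65.7 | - | 72.4 | - | 69.9 | - | - | - | - | - | - | - | - | - | - | - | - | 69.3
WPR | M-BERT | - | - | 76.5 | - | 78.2 | - | 76.0 | - | 68.5 | - | - | 76.5 | - | - | - | - | - | - | 61.8 | 72.9
WPR | XLM-R | - | - | 77.2 | - | 78.4 | - | 76.9 | - | 68.3 | - | - | 77.7 | - | - | - | - | - | - | 62.3 | 73.5
WPR | Unicoder_LC | - | - | 77.7 | - | 78.4 | - | 77.0 | - | 68.9 | - | - | 77.9 | - | - | - | - | - | - | 62.6 | 73.7
QAM | M-BERT | - | - | 57.7 | - | 71.2 | - | 55.0 | - | - | - | - | - | - | - | - | - | - | - | - | 61.3
QAM | XLM-R | - | - | 66.8 | - | 72.2 | - | 62.9 | - | - | - | - | - | - | - | - | - | - | - | - | 67.3
QAM | Unicoder_LC | - | - | 66.9 | - | 72.4 | - | 63.9 | - | - | - | - | - | - | - | - | - | - | - | - | 67.7
AVG (understanding) | XLM-R | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 75.6
AVG (understanding) | Unicoder_LC | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 76.1
QG | M-BERT | - | - | 0.0 | - | 13.1 | - | 0.3 | - | - | - | - | - | - | - | - | - | - | - | - | 4.5
QG | XLM | - | - | 0.0 | - | 13.5 | - | 0.0 | - | - | - | - | - | - | - | - | - | - | - | - | 4.5
QG | Unicoder_xDAE | - | - | 2.8 | - | 15.4 | - | 5.5 | - | - | - | - | - | - | - | - | - | - | - | - | 7.9
QG | Unicoder_xFNP | - | - | 1.4 | - | 15.7 | - | 2.4 | - | - | - | - | - | - | - | - | - | - | - | - | 6.5
NTG | M-BERT | - | - | 7.5 | - | 17.0 | 8.4 | 8.0 | - | - | - | - | - | 0.0 | - | - | - | - | - | - | 8.2
NTG | XLM | - | - | 5.4 | - | 17.0 | 7.9 | 5.7 | - | - | - | - | - | 0.0 | - | - | - | - | - | - | 7.2
NTG | Unicoder_xDAE | - | - | 9.1 | - | 16.7 | 11.1 | 10.6 | - | - | - | - | - | 7.5 | - | - | - | - | - | - | 11.0
NTG | Unicoder_xFNP | - | - | 8.6 | - | 17.7 | 10.7 | 8.3 | - | - | - | - | - | 0.1 | - | - | - | - | - | - | 9.1
AVG (generation) | XLM | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 5.9
AVG (generation) | Unicoder_xDAE | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 9.5
AVG (generation) | Unicoder_xFNP | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 7.8
Table 4: The overall evaluation results on XGLUE. We use M-BERT Devlin et al. (2019), XLM Conneau and Lample (2019) and XLM-R Conneau et al. (2019) as baselines. Unicoder_SC and Unicoder_LC are pre-trained using the small corpus and the large corpus, respectively. Unicoder_xDAE and Unicoder_xFNP are pre-trained by xDAE (for 15 languages) and xFNP (for 100 languages), respectively. For the results of M-BERT/XLM on generation tasks, we initialize the encoder-decoder model with M-BERT/XLM and fine-tune it on each downstream task without pre-training. All models are (12-layer) base ones. Given a task, each pre-trained model is fine-tuned using its English training set only, and then applied to all test sets in different languages. AVG (understanding) and AVG (generation) denote the average of the per-task average scores on the 9 understanding tasks and the 2 generation tasks, respectively. Due to GPU limitations and time cost considerations, Unicoder_LC is pre-trained using 10% of the large corpus only.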
Pivot | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | AVG
en | 85.4* | 79.2 | 79.8 | 78.2 | 77.3 | 78.5 | 76.7 | 73.8 | 73.9 | 75.9 | 71.8 | 74.7 | 70.1 | 67.4 | 66.3 | 75.3
fr | 84.0 | 79.9 | 80.3 | 78.8 | 77.4 | 79.2 | 77.0 | 73.6 | 73.7 | 76.7 | 72.7 | 75.3 | 73.0 | 67.4 | 68.3 | 75.8
es | 84.5 | 80.2* | 81.2* | 79.7 | 78.2 | 79.2 | 77.6 | 74.5 | 74.8 | 77.0 | 72.8 | 76.2 | 73.2 | 67.7 | 69.6 | 76.4*
de | 83.5 | 79.1 | 80.1 | 80.2* | 77.9 | 78.6 | 77.0 | 74.9 | 74.6 | 76.1 | 73.3 | 76.2 | 73.1 | 67.7 | 68.9 | 76.1
el | 83.8 | 80.1 | 81.0 | 78.6 | 79.6* | 79.3 | 77.0 | 74.2 | 74.9 | 77.1 | 73.5 | 75.9 | 72.7 | 69.1 | 69.1 | 76.4*
bg | 83.5 | 79.6 | 80.4 | 79.1 | 77.9 | 80.5* | 77.9 | 74.9 | 73.9 | 76.5 | 73.9 | 75.6 | 72.8 | 68.6 | 68.9 | 76.3
ru | 84.1 | 79.9 | 79.9 | 78.8 | 77.5 | 79.9 | 78.1* | 73.9 | 74.5 | 77.1 | 73.8 | 75.7 | 73.1 | 68.5 | 69.0 | 76.2
tr | 83.3 | 78.4 | 79.6 | 78.4 | 77.5 | 79.2 | 77.5 | 77.1* | 74.2 | 77.1 | 74.5 | 76.5 | 73.7 | 69.3 | 70.3* | 76.4*
ar | 83.2 | 78.9 | 79.5 | 77.6 | 77.4 | 78.6 | 77.0 | 75.4 | 76.8* | 76.8 | 74.0 | 76.0 | 73.0 | 69.5 | 69.3 | 76.2
vi | 83.2 | 78.6 | 79.1 | 77.7 | 76.6 | 78.9 | 77.5 | 75.3 | 74.7 | 78.5* | 73.5 | 76.8* | 73.1 | 67.8 | 69.0 | 76.0
th | 82.5 | 78.5 | 79.1 | 77.8 | 77.1 | 78.3 | 76.7 | 75.0 | 74.3 | 76.9 | 76.4* | 76.2 | 72.9 | 68.4 | 69.7 | 76.0
zh | 81.6 | 78.2 | 77.9 | 77.1 | 76.0 | 77.9 | 76.2 | 73.7 | 73.7 | 75.8 | 73.6 | 76.6 | 71.7 | 67.4 | 68.3 | 75.1
hi | 81.8 | 78.5 | 79.2 | 76.7 | 77.2 | 78.2 | 76.2 | 74.5 | 73.9 | 76.4 | 71.7 | 75.2 | 73.8* | 68.2 | 68.5 | 75.3
sw | 82.0 | 77.6 | 78.8 | 77.2 | 76.5 | 77.7 | 76.2 | 74.4 | 74.3 | 76.3 | 74.0 | 75.2 | 72.2 | 71.4* | 69.5 | 75.6
ur | 76.7 | 72.5 | 74.1 | 72.6 | 72.1 | 73.9 | 72.7 | 69.7 | 69.7 | 72.8 | 70.1 | 72.4 | 69.0 | 66.0 | 67.5 | 71.5
Table 5: Impacts of different pivot languages on XNLI. Given each pivot language, the corresponding fine-tuned XNLI results on all languages are listed in the same row. Each number marked with * is the best result in that column.

6.1 Experimental Settings

Understanding Tasks

The hyper-parameters are set as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, 512 max input length, 12 layers in encoder.

In the pre-training stage, we first initialize Unicoder with XLM-R Conneau et al. (2019), and then continue pre-training with an effective batch size of 8,192 obtained through gradient accumulation. We use the Adam optimizer with a linear warm-up Vaswani et al. (2017) and set the learning rate to 3e-5. We select different understanding tasks randomly in different batches.

In the fine-tuning stage, the batch size is set to 32. We use the Adam optimizer Kingma and Ba (2014) with warm-up and set the learning rate to 5e-6. For all understanding tasks, we fine-tune Unicoder models for 10 epochs, with two exceptions: for POS Tagging, we set the learning rate to 2e-5; for MLQA, we set the learning rate to 3e-5 and the batch size to 12, and train for 2 epochs, following BERT for SQuAD. After each epoch, we evaluate the fine-tuned model on the dev sets of all languages, and we select the model with the best average result across them.
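The checkpoint selection rule described above can be sketched as follows. The function name and the dev scores are hypothetical and serve only to illustrate picking the epoch with the best average dev result over all languages.

```python
def select_best_epoch(dev_scores_per_epoch):
    """Return (epoch_index, average_score) for the epoch with the best mean dev score.

    dev_scores_per_epoch is a list with one dict per epoch, mapping a
    language code to that epoch's dev-set score for the task.
    """
    averages = [sum(scores.values()) / len(scores) for scores in dev_scores_per_epoch]
    best_epoch = max(range(len(averages)), key=lambda i: averages[i])
    return best_epoch, averages[best_epoch]


# Hypothetical dev accuracies for three epochs (not taken from the paper).
history = [
    {"en": 84.0, "de": 76.0, "fr": 77.0},
    {"en": 85.0, "de": 78.0, "fr": 79.0},
    {"en": 85.5, "de": 77.0, "fr": 78.0},
]
best_epoch, best_average = select_best_epoch(history)
```

In this toy history the last epoch has the best English score, but the second epoch is selected because its average over all languages is higher.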

Generation Tasks

We evaluate Unicoder_xDAE and Unicoder_xFNP as two separate models.

For Unicoder_xDAE, the hyper-parameters are set as follows: 1,024 hidden units, 8 heads, GELU activation, a dropout rate of 0.1, 512 max input length, 12 layers in the encoder, and 6 layers in the decoder.

In the pre-training stage, we first initialize the encoder and decoder with XLM Conneau and Lample (2019), and then continue pre-training with an effective batch size of 1,024 obtained through gradient accumulation. We use the Adam optimizer with a linear warm-up and set the learning rate to 1e-4.

In the fine-tuning stage, the batch size is set to 32. We use the Adam optimizer Kingma and Ba (2014) with the learning rate set to 5e-6.

For Unicoder_xFNP, the hyper-parameters are set as follows: a hidden size of 1,024, 12 layers in the encoder, 12 layers in the decoder, 512 max input length, and a feed-forward filter size of 4,096.

In the pre-training stage, we pre-train the model from scratch, and follow ProphetNet Yan et al. (2020) to randomly mask a continuous span (with a fixed length of 9) in every 64 tokens. About 15% of the tokens in the original sequence are masked in this step. We replace 80% of the masked tokens with a special symbol [MASK], keep 10% unchanged, and replace the remaining 10% with random tokens. We set the batch size to 1,024 and the number of training steps to 120,000. The learning rate is set to 1e-4. We set the number of future tokens to 2.
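The span masking described above can be sketched as below. One span of 9 tokens per block of 64 tokens masks 9/64, or roughly 14%, of the sequence, which is consistent with the figure of about 15%. The sketch assumes non-overlapping consecutive blocks and leaves a trailing block shorter than the span unmasked; both choices, and the function name, are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_spans(tokens, vocab, rng, block=64, span=9):
    """Mask one contiguous span per block; return (masked input, target tokens)."""
    masked = list(tokens)
    targets = []
    for start in range(0, len(tokens), block):
        end = min(start + block, len(tokens))
        if end - start < span:
            continue  # trailing block too short to hold a full span (assumption)
        span_start = rng.randrange(start, end - span + 1)
        for i in range(span_start, span_start + span):
            targets.append(tokens[i])
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK               # 80%: [MASK] symbol
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: token left unchanged
    return masked, targets


rng = random.Random(0)
tokens = [f"t{i}" for i in range(128)]
masked, targets = mask_spans(tokens, vocab=tokens, rng=rng)
masked_fraction = len(targets) / len(tokens)  # 18 / 128 = 0.140625
```

The concatenated `targets` list plays the role of the decoder output, while `masked` is the encoder input.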

In the fine-tuning stage, we use the Adam optimizer Kingma and Ba (2014) and set the learning rate to 1e-4. We set the batch size to 64 and the number of warm-up steps to 1,000.

Pivot | en | es | fr | de | ru | AVG
en | 16.7 | 11.1 | 10.6 | 9.1 | 7.5 | 11.0
es | 8.5 | 16.0 | 9.9 | 7.9 | 7.7 | 10.0
fr | 8.6 | 11.3 | 17.3 | 9.0 | 7.2 | 10.7
de | 8.5 | 9.3 | 9.6 | 13.5 | 8.1 | 9.8
ru | 6.6 | 9.2 | 9.0 | 6.5 | 12.8 | 8.8
Table 6: Impacts of different pivot languages on NTG. Unicoder_xDAE is used and BLEU-4 is the metric.
Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | AVG
XLM-R (multi-language) | 85.7 | 81.5 | 82.5 | 81.2 | 79.7 | 81.7 | 80.0 | 79.0 | 77.1 | 80.1 | 77.9 | 79.2 | 76.5 | 73.0 | 71.3 | 79.1
Unicoder_LC (pivot-language) | 85.4 | 79.2 | 79.8 | 78.2 | 77.3 | 78.5 | 76.7 | 73.8 | 73.9 | 75.9 | 71.8 | 74.7 | 70.1 | 67.4 | 66.3 | 75.3
Unicoder_LC (multi-language) | 85.8 | 81.9 | 82.3 | 81.5 | 80.8 | 82.0 | 79.9 | 78.7 | 78.1 | 80.2 | 78.4 | 79.3 | 76.2 | 73.2 | 72.4 | 79.4
Table 7: Impact of multi-language fine-tuning on XNLI. "Pivot-language" and "multi-language" denote pivot-language fine-tuning (using English as the pivot) and multi-language fine-tuning, respectively. The XLM-R row gives the multi-language fine-tuning results based on XLM-R.
Model | en | es | fr | de | ru | AVG
Unicoder_xDAE (pivot-language) | 16.7 | 11.1 | 10.6 | 9.1 | 7.5 | 11.0
Unicoder_xDAE (multi-language) | 16.5 | 16.8 | 18.0 | 14.6 | 13.2 | 15.8
Table 8: Impact of multi-language fine-tuning on NTG. "Pivot-language" and "multi-language" denote pivot-language fine-tuning (using English as the pivot) and multi-language fine-tuning, respectively. BLEU-4 is the metric.
Model | XNLI | PAWS-X | NC | QAM | QADSM | AVG
Unicoder_LC (pivot-language) | 75.3 | 90.1 | 82.4 | 67.7 | 69.3 | 77.0
Unicoder_LC (multi-task) | 73.4 | 90.4 | 82.7 | 68.5 | 69.0 | 76.8
Table 9: Impacts of multi-task fine-tuning on XNLI, PAWS-X, NC, QAM and QADSM. "Pivot-language" and "multi-task" denote pivot-language fine-tuning (using English as the pivot) on each task and multi-task fine-tuning, respectively.
en
Input News: if you ’re planning a trip to europe , you probably want to check some famous landmarks off your list . but there are certain tourist traps you ’re better off missing . susana victoria perez has more .
Golden Title: do yourself a favor and avoid these tourist traps in europe
Generated Title: tourist traps you should avoid in europe

fr
Input News: alain juppe , candidat a la primaire de la droite , ” ne se sent pas engage ” par les investitures decidees par le parti les republicains preside par nicolas sarkozy , a affirme jeudi a l’ afp son directeur de campagne , gilles boyer . ” c’ est un processus mene a la hussarde . il n’ y a pas de volonte d’ equilibre et de rassemblement ” , a-t-il denonce , en affirmant que ” l’ accord politique ” entre les differents candidats a la primaire ” n’ a pas ete respecte ” .
Golden Title: legislatives : juppe ” ne se sent pas engage ” par les investitures
Generated Title: alain juppe : ” ne se sent pas engage ” par les investitures

de
Input News: vermutlich zur verteidigung seines reviers hat ein aggressiver bussard in baden-wurttemberg einen radfahrer zu fall gebracht , der sich dabei schwer verletzte . wie die polizei in ludwigsburg am freitag mitteilte , attackierte der greifvogel den 51-jahrigen am vortag auf einem radweg entlang einer landesstraße . der bussard flog demnach so tief auf den radler zu , dass dieser ausweichen musste und sturzte . den angaben zufolge erlitt der mann schwere verletzungen und wurde von rettungskraften in ein krankenhaus gebracht . ” aus luftiger hohe , von einem laternenmast aus , beobachtete der raubvogel anschließend die unfallaufnahme ” , hieß es im polizeibericht .
Golden Title: aggressiver bussard bringt radfahrer zu fall
Generated Title: aggressiver bussard in ludwigsburg sturzes radler

es
Input News: despues de la marcha de bruce willis por problemas de agenda , steve carrell le sustituira asi en la nueva pelicula que prepara woody allen . segun informa variety , el actor se une al reparto ya formado por blake lively , parker posey , kristen stewart , jesse eisenberg , jeannie berlin ,corey stoll , anna camp , y ken stott , entre otros . como siempre , los detalles de la trama son aun un secreto aunque el rodaje se encuentre actualmente en marcha . por otro lado , aun no hay fecha de estreno ni distribuidora para la pelicula sin titulo de woody allen . sin embargo , el director tiene aun pendiente de estreno su ultimo filme con emma stone y joaquin phoenix titula da irrational man que se estrenara el proximo 25 de septiembre .
Golden Title: steve carrell sustituye a bruce willis en la nueva pelicula de woody allen
Generated Title: steve carrell sustituira a steve carrell en woody allen

Table 10: Some input-output examples of Unicoder on NTG.
Noising Strategy | en | es | fr | de | ru | AVG
(1)+(2)+(3) | 16.7 | 10.6 | 10.4 | 9.2 | 7.4 | 10.9
(4) | 16.7 | 11.1 | 10.6 | 9.1 | 7.5 | 11.0
(1)+(2)+(3)+(4) | 17.0 | 10.4 | 10.0 | 9.5 | 7.7 | 10.9
Table 11: Impact of different noising strategies on NTG with pivot-language fine-tuning (using English as the pivot). BLEU-4 is the metric.
Model | fr | zh | AVG
XNLG Chi et al. (2019) | 36.31 | 38.91 | 37.61
Unicoder | 37.89 | 42.23 | 40.06
Table 12: The zero-shot results on Abstractive Summarization. Unicoder and XNLG are fine-tuned using English labeled data. ROUGE-L is the metric.

6.2 Main Result

Seven cross-lingual pre-trained models are evaluated and compared in Table 4: 12-layer M-BERT Devlin et al. (2019) trained on the Wikipedia corpus for 102 languages, 12-layer XLM Conneau and Lample (2019) trained on the Wikipedia corpus and a bilingual corpus for 15 languages, 12-layer XLM-R Conneau et al. (2019) trained on the Common Crawl corpus for 100 languages, 12-layer Unicoder_SC trained on the small corpus for 100 languages, 12-layer Unicoder_LC trained on the large corpus for 100 languages, Unicoder_xDAE trained on the Wikipedia corpus for 15 languages, and Unicoder_xFNP trained on the Wikipedia corpus for 100 languages. Note that all results are reproduced by this paper, except the XLM result on XNLI, which is from Conneau and Lample (2019).

As XLM-R (or Unicoder_LC) performs consistently better than XLM (or Unicoder_SC), we omit the results of XLM and Unicoder_SC on all tasks except XNLI. Given a downstream task, each pre-trained model is fine-tuned using its English training set, and then applied to all test sets in different languages.

Table 4 shows that: (1) Unicoder performs better than M-BERT and XLM-R on almost all tasks, as it leverages a bilingual corpus in pre-training. (2) Unicoder trained on the large corpus performs better than Unicoder trained on the small corpus, which shows that a larger corpus can lead to better models. (3) The two Unicoder variants that include generation tasks in pre-training perform better than M-BERT and XLM on generation tasks, as M-BERT and XLM do not include such tasks. XLM is unable to generate the correct languages on the fr and de QG test sets at all. (4) Of these two variants, the one pre-trained for 100 languages performs worse than the one pre-trained for 15 languages only, a gap we attribute to the difference in language coverage. We will add the results of Unicoder pre-trained for 100 languages, and the results of combining xDAE and xFNP into a unified pre-trained model, in the updated version.

6.3 Ablation Study

6.3.1 Pivot-language Fine-tuning

We define pivot-language fine-tuning as follows: (1) fine-tune the pre-trained model for a downstream task using its labeled data in a pivot language (e.g. English); (2) apply the resulting fine-tuned model to all languages. Table 4 uses English as the pivot language, as all tasks in XGLUE have labeled data in English. But is English the optimal choice? Would the results improve if we fine-tuned using other pivot languages?
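The two-step definition above can be sketched as a small evaluation loop. This is a minimal illustration, not the paper's training code: the `fit` callable, the `Example` type, and the toy `majority_fit` baseline are hypothetical stand-ins for a real fine-tuning routine and real task data, and accuracy is assumed as the metric.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label); a stand-in for real task data


def pivot_language_finetune(
    train_sets: Dict[str, List[Example]],
    test_sets: Dict[str, List[Example]],
    pivot: str,
    fit: Callable[[List[Example]], Callable[[str], int]],
) -> Dict[str, float]:
    """Fine-tune on the pivot language only, then score every test language."""
    predict = fit(train_sets[pivot])  # step (1): fine-tune on pivot-language data
    scores = {}
    for lang, examples in test_sets.items():  # step (2): apply to all languages
        correct = sum(predict(text) == label for text, label in examples)
        scores[lang] = correct / len(examples)
    return scores


def majority_fit(examples: List[Example]) -> Callable[[str], int]:
    """Toy 'fine-tuning' (hypothetical): always predict the most frequent label."""
    labels = [label for _, label in examples]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority
```

Swapping `pivot="en"` for another key of `train_sets` is all it takes to rerun the same protocol with a different pivot language, which is the comparison the rest of this subsection explores.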

In order to answer this question, we investigate the impacts of using different pivot languages in fine-tuning on XNLI and NTG, and list results of using different pivot languages on these 2 tasks in Table 5 and Table 6, respectively.

Table 5 and Table 6 show that: (1) For each test set, its best result is often achieved when the pre-trained model is fine-tuned on the training set in the same language. (2) For XNLI, the best pivot languages are Spanish (es), Greek (el) and Turkish (tr), rather than English (en), and for NTG, the best pivot language is English (en). This phenomenon shows a possibility to further improve the average performance of a cross-lingual pre-trained model on different downstream tasks, by selecting different pivot languages in fine-tuning. We leave explorations on pivot languages for future work.
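Both observations can be read off a pivot-by-test-language score matrix. The sketch below shows one way to do that; the function names are hypothetical and the numbers in `toy_results` are placeholders chosen for illustration, not values from Table 5 or Table 6.

```python
from typing import Dict

# results[pivot][test_lang] = score of the model fine-tuned on `pivot`
Results = Dict[str, Dict[str, float]]


def best_pivot_on_average(results: Results) -> str:
    """Pivot language with the highest mean score over all test languages."""
    averages = {p: sum(row.values()) / len(row) for p, row in results.items()}
    return max(averages, key=averages.get)


def best_pivot_per_test_language(results: Results) -> Dict[str, str]:
    """For each test language, the pivot language that gives the best score."""
    test_langs = next(iter(results.values())).keys()
    return {
        lang: max(results, key=lambda p: results[p][lang]) for lang in test_langs
    }


# Placeholder scores, NOT taken from the paper's tables.
toy_results = {
    "en": {"en": 80.0, "es": 70.0, "tr": 60.0},
    "es": {"en": 78.0, "es": 76.0, "tr": 62.0},
}
print(best_pivot_on_average(toy_results))
print(best_pivot_per_test_language(toy_results))
```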

6.3.2 Multi-language Fine-tuning

We investigate the impact of multi-language fine-tuning, which fine-tunes the pre-trained model for a downstream task using the available labeled data from different languages. As in the previous section, we report results on XNLI and NTG, since both tasks have labeled data in multiple languages.

Table 7 and Table 8 show that multi-language fine-tuning can achieve better results than pivot-language fine-tuning on both XNLI and NTG. This means we can quickly improve the average performance of a cross-lingual pre-trained model on a specific task over multiple languages by fine-tuning on the merged labeled data in these languages.
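As a rough sketch of what "merged labeled data" could look like in practice, the snippet below pools the per-language training sets into one shuffled list. The function name, the tuple layout, and the choice to shuffle with a fixed seed are assumptions made for illustration; the paper does not specify how the merge or sampling is done.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label); a stand-in for real task data


def merge_languages(
    train_sets: Dict[str, List[Example]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Pool all languages into one training list of (lang, text, label)."""
    merged = [
        (lang, text, label)
        for lang, examples in train_sets.items()
        for text, label in examples
    ]
    # Shuffle so that batches mix languages instead of seeing them in blocks.
    random.Random(seed).shuffle(merged)
    return merged
```

The merged list can then be passed to the same fine-tuning routine that pivot-language fine-tuning uses with a single language's data.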

6.3.3 Multi-task Fine-tuning

We investigate the impact of multi-task fine-tuning on XGLUE. To reduce the experimental cost, we perform this experiment on 5 understanding tasks only, including XNLI, PAWS-X, NC, QAM and QADSM. We first do fine-tuning using the merged English training set of these 5 tasks, and then evaluate the fine-tuned model on the test sets of these tasks. Evaluation results are listed in Table 9.

Table 9 shows that PAWS-X, NC and QAM benefit from joint fine-tuning, while the results on XNLI and QADSM decrease. We leave discovering the relationships between different tasks, for better pre-training and fine-tuning, to future work.
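One common way to fine-tune on a merged multi-task training set is to draw batches that each contain examples from a single task, so that the task name can select the right output head. The sketch below illustrates that idea only; the function name and batching scheme are assumptions, and the paper does not describe how its merged training set was batched.

```python
import random
from typing import Dict, Iterator, List, Tuple


def multitask_batches(
    task_data: Dict[str, List[dict]], batch_size: int, seed: int = 0
) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (task_name, batch) pairs in random order, one task per batch."""
    rng = random.Random(seed)
    batches = []
    for task, examples in task_data.items():
        shuffled = examples[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batches.append((task, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave tasks across training steps
    yield from batches
```

With the five understanding tasks used here, `task_data` would have the keys XNLI, PAWS-X, NC, QAM and QADSM, each mapped to its English training examples.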

6.3.4 Impacts of Noising Strategies

We investigate the impacts of different noising strategies (Section 4.1) in Unicoder, and list comparison results in Table 11, where (1)+(2)+(3) denotes the result of using the first three strategies in pre-training, (4) denotes the result of using the last strategy in pre-training, and (1)+(2)+(3)+(4) denotes the result of using all strategies in pre-training. We can see that (4) achieves the best average result on NTG. Therefore, all Unicoder results reported in this paper come from models pre-trained using (4) only. Table 10 gives some input-output examples of Unicoder on NTG.
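The AVG column of Table 11 can be recomputed directly from the per-language scores. The short check below copies the BLEU-4 values from Table 11 and confirms which setting has the highest mean; the variable names are illustrative.

```python
# BLEU-4 scores copied from Table 11 (NTG, English as the pivot language).
table_11 = {
    "(1)+(2)+(3)": {"en": 16.7, "es": 10.6, "fr": 10.4, "de": 9.2, "ru": 7.4},
    "(4)": {"en": 16.7, "es": 11.1, "fr": 10.6, "de": 9.1, "ru": 7.5},
    "(1)+(2)+(3)+(4)": {"en": 17.0, "es": 10.4, "fr": 10.0, "de": 9.5, "ru": 7.7},
}

# Mean over the five languages for each noising setting.
averages = {name: sum(s.values()) / len(s) for name, s in table_11.items()}
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
print("best:", best)
```

The recomputed means match the reported AVG column (10.9, 11.0, 10.9). Note that the margin of (4) over the other two settings is about 0.1 BLEU-4 on average, and that the setting using all four strategies is ahead on en, de and ru individually.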

We also compare Unicoder with XNLG (Chi et al., 2019) on the Abstractive Summarization task with the same experimental setting. The zero-shot comparison results are listed in Table 12. We can see that, by using xDAE only in pre-training, Unicoder significantly outperforms XNLG, which is pre-trained using 4 tasks: MLM, DAE, XMLM and XAE. This verifies the effectiveness of the fourth noising strategy described in Section 4.1 for generative tasks.
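Table 12 reports ROUGE-L, which scores a generated summary by its longest common subsequence (LCS) with the reference. The sketch below is a simplified sentence-level variant that assumes whitespace tokenization and a balanced F-measure; it is meant to show the idea behind the metric and is not the scoring script used for Table 12, whose tokenization and weighting may differ.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between a reference and a candidate (simplified ROUGE-L)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS does not require matched words to be adjacent, the score rewards candidates that keep the reference's word order even when words are dropped or inserted in between.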

6.4 Updates in the Next Version

We will add 3 updates in the next version: (1) the results of a 24-layer Unicoder on understanding tasks; (2) the results of a 12-layer Unicoder on understanding tasks, pre-trained with new tasks beyond MLM and TLM; (3) the comparison results of the two Unicoder generation variants on generation tasks, both pre-trained on the small corpus for 100 languages.

7 Conclusion

We present XGLUE as a new benchmark dataset for the cross-lingual community. Solid evaluations are conducted, with interesting results observed and discussed. We expect it to advance the development of cross-lingual pre-training, understanding and generation approaches and applications.


  • Chen et al. (2019) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Uniter: Learning universal image-text representations. arXiv.
  • Chi et al. (2019) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2019. Cross-lingual natural language generation via pre-training. In AAAI.
  • Conneau et al. (2019) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv.
  • Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In NeurIPS.
  • Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. NeurIPS.
  • Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv.
  • Huang et al. (2019) Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP.
  • Kim et al. (2017) Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838, Copenhagen, Denmark. Association for Computational Linguistics.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lewis et al. (2019a) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019a. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Lewis et al. (2019b) Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019b. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
  • Li et al. (2020) Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Zhou Zhou. 2020. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. AAAI.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. arXiv.
  • Sang and De Meulder (2003) Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  • Sang (2002) Erik F. Tjong Kim Sang. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning.
  • Schwenk et al. (2019) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019. Ccmatrix: Mining billions of high-quality parallel sentences on the web. arXiv.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. ICLR.
  • Wenzek et al. (2019) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. 2019. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
  • Yan et al. (2020) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.
  • Yang et al. (2019a) Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019a. Paws-x: A cross-lingual adversarial dataset for paraphrase identification. arXiv preprint arXiv:1908.11828.
  • Yang et al. (2019b) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. Xlnet: Generalized autoregressive pretraining for language understanding. NeurIPS.
  • Zeman et al. (2019) Daniel Zeman, Joakim Nivre, Mitchell Abrams, Noëmi Aepli, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, Colin Batchelor, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, 
Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Andre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Elena Klementieva, Arne Köhn, Kamil Kopacewicz, Natalia Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phuong Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Maria Liovina, Yuan Li, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Sarah McGuinness, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori, Tomohiko Morioka, Shinsuke Mori, Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Luong Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Atul Kr. 
Ojha, Adédayọ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Daria Petrova, Slav Petrov, Jason Phelan, Jussi Piitulainen, Tommi A Pirinen, Emily Pitler, Barbara Plank, Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roșca, Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Dage Särg, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Muh Shohibussirri, Dmitry Sichinava, Aline Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Shingo Suzuki, Zsolt Szántó, Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Tanaka, Isabelle Tellier, Guillaume Thomas, Liisi Torga, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Lars Wallin, Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan Wendt, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, 
Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Manying Zhang, and Hanzhi Zhu. 2019. Universal dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • Zhang et al. (2019) Yuan Zhang, Jason Baldridge, and Luheng He. 2019. Paws: Paraphrase adversaries from word scrambling. arXiv preprint arXiv:1904.01130.
  • Zhou et al. (2020) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and vqa.