XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

04/03/2020
by   Yaobo Liang, et al.

In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models on multilingual and bilingual corpora, and to evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora of different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend the recent cross-lingual pre-trained model Unicoder (Huang et al., 2019) to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12 layers) of Multilingual BERT, XLM and XLM-R for comparison.


