Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

by Haoyang Huang, et al.

We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages. Compared to similar efforts such as Multilingual BERT and XLM, three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives. We also find that fine-tuning on multiple languages together brings further improvement. Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline. On XNLI, a 1.8% averaged accuracy improvement (on 15 languages) is obtained. On XQA, which is a new cross-lingual dataset built by us, a 5.5% averaged accuracy improvement (on French and German) is obtained.
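As a rough illustration of the cross-lingual masked language model objective mentioned above, the sketch below builds one training input by concatenating a sentence with its translation and masking tokens in both, so the model can use the other language as context when recovering a masked word. This is a minimal assumption-laden sketch, not the paper's implementation; the function name, the `[MASK]`/`[SEP]` symbols, and the 15% masking rate are illustrative defaults.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"

def make_xmlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Sketch of a cross-lingual MLM input (hypothetical helper).

    Concatenates a bilingual sentence pair and randomly masks tokens
    on both sides; masked positions become prediction targets, so the
    model is encouraged to attend across languages to recover them.
    """
    rng = random.Random(seed)
    tokens = src_tokens + [SEP] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict this token
        else:
            inputs.append(tok)
            labels.append(None)   # position is not predicted
    return inputs, labels

# Example: an English sentence paired with its French translation.
inputs, labels = make_xmlm_example(
    "the cat sat".split(), "le chat était assis".split(), mask_prob=0.3)
```

In this setup the masked-token predictions are supervised exactly as in monolingual MLM; the only change is that the input sequence spans two languages, which is what lets the encoder learn cross-lingual mappings.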




