
Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

by   Bo Zheng, et al.

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. We therefore propose VoCap, an algorithm that determines the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at
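The abstract does not spell out how k-NN-based target sampling works, but the general idea of sampling softmax targets via nearest neighbours can be sketched as follows: instead of normalizing over the full (large) vocabulary, score only the gold target plus the k output embeddings nearest to the hidden state. This is a minimal illustrative sketch under that assumption, not the paper's actual implementation; the function name, the exact candidate-selection rule, and the use of an exhaustive neighbour search (a real system would use an approximate k-NN index) are all assumptions.

```python
import numpy as np

def knn_sampled_softmax_nll(hidden, target_id, out_emb, k=8):
    """Approximate negative log-likelihood for one token.

    Rather than computing softmax logits for every row of `out_emb`
    (shape: vocab_size x dim), we restrict the partition function to a
    small candidate set: the gold target plus the k output embeddings
    with the largest inner product against `hidden`.

    NOTE: for clarity this sketch finds neighbours by brute force,
    which still touches the full embedding table; a practical system
    would query a prebuilt (approximate) k-NN index instead.
    """
    scores = out_emb @ hidden                      # (V,) inner-product scores
    nearest = np.argpartition(-scores, k)[:k]      # indices of top-k neighbours
    candidates = np.unique(np.append(nearest, target_id))

    # Softmax restricted to the candidate set (log-sum-exp for stability).
    logits = out_emb[candidates] @ hidden
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return log_z - out_emb[target_id] @ hidden     # NLL over the candidate set
```

Because the gold target is always included in the candidate set, the returned loss is non-negative, and the cost of the normalization step scales with k + 1 rather than with the vocabulary size.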




XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge

Cross-lingual pre-training has achieved great successes using monolingua...

XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

In this paper, we introduce ELECTRA-style tasks to cross-lingual languag...

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

Pre-trained multilingual language models underpin a large portion of mod...

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Pipelined NLP systems have largely been superseded by end-to-end neural ...

Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training

The goal of stance detection is to determine the viewpoint expressed in ...

Improving Multilingual Models with Language-Clustered Vocabularies

State-of-the-art multilingual models depend on vocabularies that cover a...

Machine-Created Universal Language for Cross-lingual Transfer

There are two types of approaches to solving cross-lingual transfer: mul...