Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

09/15/2021
by Bo Zheng, et al.

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
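The abstract does not spell out how k-NN-based target sampling is implemented; the sketch below shows one plausible way such a scheme could work for a large-vocabulary softmax, restricting the loss computation to the k nearest output embeddings of each hidden state plus the gold target. All function names, tensor shapes, and the exact candidate-selection strategy here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def knn_sampled_softmax_loss(hidden, output_embedding, targets, k=1024):
    """Rough sketch of k-NN-based target sampling (details assumed).

    Instead of normalizing over the full vocabulary, each hidden state is
    scored only against its k nearest output embeddings plus the gold target.
    hidden:            (batch, dim)   hidden states at predicted positions
    output_embedding:  (vocab, dim)   softmax weight matrix
    targets:           (batch,)       gold token ids
    """
    # For brevity this sketch computes full similarities just to pick
    # candidates; a real system would use an approximate nearest-neighbor
    # index so that the full (batch, vocab) matrix is never materialized.
    scores = hidden @ output_embedding.t()                 # (batch, vocab)
    _, knn_ids = scores.topk(k, dim=-1)                    # (batch, k) candidates

    # Ensure the gold target is always in the candidate set (possible
    # duplicates with the k-NN candidates are ignored in this sketch).
    cand_ids = torch.cat([targets.unsqueeze(-1), knn_ids], dim=-1)  # (batch, k+1)

    # Gather candidate embeddings and compute logits over candidates only.
    cand_emb = output_embedding[cand_ids]                  # (batch, k+1, dim)
    logits = torch.einsum("bd,bkd->bk", hidden, cand_emb)  # (batch, k+1)

    # The gold target sits at position 0 of every candidate list.
    labels = torch.zeros(targets.size(0), dtype=torch.long, device=hidden.device)
    return F.cross_entropy(logits, labels)
```

The intended effect is that the cost of the output layer scales with the candidate set size k rather than the full (and, with VoCap, much larger) vocabulary size.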

