Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

09/15/2021
by Bo Zheng, et al.

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models because of limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax computation. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
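A minimal sketch may help make the capacity-allocation idea concrete. It assumes a per-language utility(lang, size) scoring function for how well a vocabulary of a given size covers that language's corpus (the paper's criterion is the average log probability of held-out text under per-language sub-word language models); the function names, weights, and budget below are illustrative, not the authors' implementation.

```python
import math
from typing import Callable, Dict, List


def allocate_vocab_capacity(
    languages: List[str],
    total_budget: int,
    step: int,
    utility: Callable[[str, int], float],
) -> Dict[str, int]:
    """Greedily distribute a shared vocabulary budget across languages.

    At every round, the next `step` sub-word slots go to the language
    whose utility improves the most. `utility(lang, size)` is a
    hypothetical stand-in for the paper's per-language quality measure.
    """
    # Seed every language with one step so no language is dropped entirely.
    alloc = {lang: step for lang in languages}
    spent = step * len(languages)
    while spent + step <= total_budget:
        # Marginal gain from granting `step` more slots to each language.
        gains = {
            lang: utility(lang, alloc[lang] + step) - utility(lang, alloc[lang])
            for lang in languages
        }
        best = max(gains, key=gains.get)
        alloc[best] += step
        spent += step
    return alloc


# Toy demo: concave utilities give diminishing returns, so languages with
# steeper curves receive proportionally more capacity. The weights are
# made up for illustration, not taken from the paper.
weights = {"en": 1.0, "sw": 1.6, "th": 1.4}


def util(lang: str, n: int) -> float:
    return weights[lang] * math.log(n)


print(allocate_vocab_capacity(list(weights), total_budget=30_000, step=1_000, utility=util))
```

Because the utility curves flatten as a language's vocabulary grows, the greedy loop naturally shifts later allocations toward languages that are still under-served, which is the behavior the abstract describes for under-represented languages.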
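The k-NN-based target sampling can be sketched similarly, here in PyTorch. The assumption: the full-vocabulary softmax is replaced by a cross-entropy over the gold token plus the k output embeddings that score highest against the hidden state, with brute-force search standing in for the approximate k-NN index one would use in practice. The function name and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def knn_sampled_softmax_loss(
    hidden: torch.Tensor,     # (batch, dim) hidden states at masked positions
    targets: torch.Tensor,    # (batch,) gold token ids
    out_embed: torch.Tensor,  # (vocab, dim) output embedding matrix
    k: int = 256,
) -> torch.Tensor:
    """Cross-entropy over a small per-example candidate set: the gold
    token plus the k tokens whose output embeddings score highest
    against the hidden state, instead of a softmax over the full
    (enlarged) vocabulary.
    """
    with torch.no_grad():
        # Brute-force neighbour search; in practice this lookup would be
        # served by an approximate k-NN index so the O(vocab) scan is avoided.
        scores = hidden @ out_embed.t()           # (batch, vocab)
        knn_ids = scores.topk(k, dim=-1).indices  # (batch, k)
    # Put the gold token at candidate index 0 so the labels are all zero.
    # (A duplicate of the gold token among the neighbours is tolerated
    # here for brevity.)
    cand = torch.cat([targets.unsqueeze(1), knn_ids], dim=1)        # (batch, k+1)
    cand_embed = out_embed[cand]                                    # (batch, k+1, dim)
    logits = torch.bmm(cand_embed, hidden.unsqueeze(2)).squeeze(2)  # (batch, k+1)
    labels = torch.zeros_like(targets)
    return F.cross_entropy(logits, labels)
```

The intuition behind this family of techniques is that the nearest neighbours are the most confusable negatives, so normalizing only over them approximates the full softmax while cutting the per-position cost from the full vocabulary size down to k+1 candidates.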


Related Research

09/26/2021
XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge
Cross-lingual pre-training has achieved great successes using monolingua...

06/30/2021
XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
In this paper, we introduce ELECTRA-style tasks to cross-lingual languag...

09/09/2023
Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages
Pre-trained multilingual language models underpin a large portion of mod...

03/11/2021
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Pipelined NLP systems have largely been superseded by end-to-end neural ...

09/13/2021
Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training
The goal of stance detection is to determine the viewpoint expressed in ...

10/24/2020
Improving Multilingual Models with Language-Clustered Vocabularies
State-of-the-art multilingual models depend on vocabularies that cover a...

05/22/2023
Machine-Created Universal Language for Cross-lingual Transfer
There are two types of approaches to solving cross-lingual transfer: mul...