Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

09/09/2023
by C. M. Downey et al.

Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining the large cross-lingual vocabulary and embedding matrix incurs considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Specifically, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, alongside the recently proposed FOCUS method. We demonstrate that: 1) embedding-replacement techniques from the monolingual transfer literature are inadequate for adapting multilingual models; 2) replacing cross-lingual vocabularies with smaller specialized ones provides an efficient way to improve performance in low-resource languages; and 3) simple embedding re-initialization techniques based on script-wise sub-distributions rival methods such as FOCUS, which rely on similarity scores obtained from an auxiliary model.
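To make the re-initialization idea concrete, the sketch below illustrates one plausible reading of specializing an embedding matrix to a new, compact vocabulary: tokens shared with the original vocabulary keep their old vectors, and unseen tokens are sampled from a distribution fit to the original embeddings of the same script. The helper `script_of`, the Gaussian sampling choice, and all names are illustrative assumptions, not the paper's exact procedure (which is also compared against FOCUS).

```python
# Minimal sketch of script-wise embedding re-initialization (illustrative only).
# All function and variable names here are hypothetical.
import unicodedata
import numpy as np


def script_of(token: str) -> str:
    """Crudely bucket a token by the Unicode name of its first alphabetic character."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"


def reinit_by_script(old_vocab, old_emb, new_vocab, rng=None):
    """Build an embedding matrix for `new_vocab` from `old_emb`.

    Overlapping tokens are copied directly; unseen tokens are drawn from a
    Gaussian fit to the old embeddings whose tokens share the same script.
    """
    rng = rng or np.random.default_rng(0)
    dim = old_emb.shape[1]
    old_index = {tok: i for i, tok in enumerate(old_vocab)}

    # Group the rows of the old embedding matrix by detected script.
    by_script = {}
    for tok, i in old_index.items():
        by_script.setdefault(script_of(tok), []).append(i)

    global_mean, global_std = old_emb.mean(axis=0), old_emb.std(axis=0)
    new_emb = np.empty((len(new_vocab), dim), dtype=old_emb.dtype)

    for j, tok in enumerate(new_vocab):
        if tok in old_index:                 # copy embeddings of overlapping tokens
            new_emb[j] = old_emb[old_index[tok]]
            continue
        rows = by_script.get(script_of(tok))
        if rows and len(rows) > 1:           # sample from the script-wise sub-distribution
            sub = old_emb[rows]
            new_emb[j] = rng.normal(sub.mean(axis=0), sub.std(axis=0))
        else:                                # fall back to the global distribution
            new_emb[j] = rng.normal(global_mean, global_std)
    return new_emb
```

In a full pipeline, the new vocabulary would come from a compact, language-specific tokenizer trained on the target-language corpus, and the re-initialized matrix would then be refined with LAPT.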

Related research

04/13/2022 - Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages
Multilingual pre-trained language models (PLMs) have demonstrated impres...

09/15/2021 - Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training
Compared to monolingual models, cross-lingual models usually require a m...

05/23/2023 - FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language
Using model weights pretrained on a high-resource language as a warm sta...

05/19/2022 - Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
Large pretrained multilingual models, trained on dozens of languages, ha...

04/18/2023 - Romanization-based Large-scale Adaptation of Multilingual Language Models
Large multilingual pretrained language models (mPLMs) have become the de...

03/03/2022 - Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages
Pre-trained multilingual language models such as mBERT and XLM-R have de...

03/09/2022 - Language Adaptive Cross-lingual Speech Representation Learning with Sparse Sharing Sub-networks
Unsupervised cross-lingual speech representation learning (XLSR) has rec...
