WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

12/13/2021
by Benjamin Minixhofer, et al.

Large pretrained language models (LMs) have recently gained popularity. Training these models requires ever more computational resources, and most existing models are trained on English text only; training comparable models in other languages is exceedingly expensive. To alleviate this problem, we introduce a method, called WECHSEL, for transferring English models to new languages. We exchange the tokenizer of the English model for a tokenizer in the target language and initialize the token embeddings so that they are close to semantically similar English tokens, using multilingual static word embeddings that cover English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to four other languages (French, German, Chinese, and Swahili). WECHSEL improves over a previously proposed method for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language, with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
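
The embedding initialization is the technical core: each token in the new target-language vocabulary starts out as a combination of the English model's embeddings of its semantically closest English tokens, where closeness is measured in a shared cross-lingual static embedding space. The sketch below only illustrates this idea; the function name `wechsel_style_init`, the top-k softmax weighting, and the assumption that aligned per-token static embeddings are already at hand are ours, and the paper's exact weighting scheme and its decomposition of word-level fastText vectors into subword vectors differ in detail.

```python
import numpy as np

def wechsel_style_init(src_model_emb, src_static, tgt_static, k=10):
    """Initialize target-language token embeddings from an English model.

    src_model_emb: (V_src, d_model) pretrained embedding matrix of the English LM.
    src_static:    (V_src, d_static) static embeddings of the English tokens,
                   mapped into a shared cross-lingual space.
    tgt_static:    (V_tgt, d_static) static embeddings of the target-language
                   tokens in the same space.
    Returns a (V_tgt, d_model) initialization for the new embedding matrix.
    """
    # Cosine similarity between every target token and every source token
    # in the shared static space.
    src_norm = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_norm = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_norm @ src_norm.T  # shape (V_tgt, V_src)

    tgt_model_emb = np.empty((tgt_static.shape[0], src_model_emb.shape[1]))
    for i, row in enumerate(sims):
        nn = np.argpartition(-row, k)[:k]    # k most similar English tokens
        w = np.exp(row[nn] - row[nn].max())  # softmax over their similarities
        w /= w.sum()
        # Convex combination of the model embeddings of the nearest neighbors.
        tgt_model_emb[i] = w @ src_model_emb[nn]
    return tgt_model_emb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = wechsel_style_init(
        src_model_emb=rng.normal(size=(1000, 768)),  # toy English embedding matrix
        src_static=rng.normal(size=(1000, 300)),     # toy aligned static vectors
        tgt_static=rng.normal(size=(800, 300)),
    )
    print(emb.shape)  # (800, 768)
```

After the embedding matrix is swapped in this way, the remaining transformer parameters are carried over from the English model and training simply continues on target-language text, which is what yields the reported savings over training from scratch.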


Related research

01/23/2023 · Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Most Transformer language models are primarily pretrained on English tex...

04/17/2022 · Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models
English pretrained language models, which make up the backbone of many m...

12/10/2020 · As good as new. How to successfully recycle English GPT-2 to make models for other languages
Large generative language models have been very successful for English, ...

05/23/2022 · KOLD: Korean Offensive Language Dataset
Although large attention has been paid to the detection of hate speech, ...

02/04/2021 · One Size Does Not Fit All: Finding the Optimal N-gram Sizes for FastText Models across Languages
Unsupervised word representation learning from large corpora is badly ne...

05/22/2023 · PrOnto: Language Model Evaluations for 859 Languages
Evaluation datasets are critical resources for measuring the quality of ...

09/16/2023 · Cross-Lingual Knowledge Editing in Large Language Models
Knowledge editing aims to change language models' performance on several...
