As good as new. How to successfully recycle English GPT-2 to make models for other languages

12/10/2020
by Wietse de Vries, et al.

Large generative language models have been very successful for English, but other languages lag behind due to data and computational limitations. We propose a method that may overcome these problems by adapting existing pre-trained language models to new languages. Specifically, we describe the adaptation of English GPT-2 to Italian and Dutch by retraining lexical embeddings without tuning the Transformer layers. As a result, we obtain lexical embeddings for Italian and Dutch that are aligned with the original English lexical embeddings, and we use this alignment to induce a bilingual lexicon. Additionally, we show how to scale up complexity by transforming the relearned lexical embeddings of GPT-2 small into the GPT-2 medium embedding space. This method minimises the amount of training required and prevents information learned by GPT-2 from being lost during adaptation. English GPT-2 models with relearned lexical embeddings can generate realistic sentences in Italian and Dutch, but on average these sentences are still identifiable as artificial by humans. Based on perplexity scores and human judgements, we find that generated sentences become more realistic with some additional full-model finetuning, especially for Dutch. For Italian, the generated sentences are judged on par with sentences produced by a GPT-2 model fully trained from scratch. Our work can be conceived as a blueprint for training GPT-2s for other languages, and we provide a 'recipe' to do so.
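The core of the recipe is to relearn only the lexical embeddings on target-language text while keeping the Transformer layers frozen, and then, to scale up, to map the relearned embeddings into the embedding space of a larger model. A minimal sketch of this idea follows, assuming the HuggingFace transformers library; the tokenizer path, the fresh embedding initialisation, and the least-squares small-to-medium mapping are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# 1) Load English GPT-2 small and a BPE tokenizer trained on the target language.
#    The tokenizer path is hypothetical; any Dutch or Italian BPE tokenizer would do.
model = GPT2LMHeadModel.from_pretrained("gpt2")
nl_tokenizer = GPT2TokenizerFast.from_pretrained("path/to/dutch-bpe-tokenizer")  # hypothetical path

# 2) Swap in a freshly initialised lexical embedding matrix sized to the new
#    vocabulary (input and output embeddings are tied in GPT-2, so one matrix suffices).
model.resize_token_embeddings(len(nl_tokenizer))
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

# 3) Freeze every parameter except the tied lexical embeddings, then train with the
#    usual causal language-modelling objective on target-language text.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("transformer.wte")

# 4) To scale up, fit a linear map from the small (768-d) to the medium (1024-d)
#    embedding space on the shared English vocabulary, and apply it to the
#    relearned embeddings before plugging them into GPT-2 medium.
small_en = GPT2LMHeadModel.from_pretrained("gpt2").get_input_embeddings().weight.detach().numpy()
medium_en = GPT2LMHeadModel.from_pretrained("gpt2-medium").get_input_embeddings().weight.detach().numpy()
W, *_ = np.linalg.lstsq(small_en, medium_en, rcond=None)         # (768, 1024) projection
nl_small = model.get_input_embeddings().weight.detach().numpy()  # relearned embeddings (after training)
nl_medium = nl_small @ W                                         # candidate GPT-2 medium embeddings
```

The projected embeddings would then serve as the lexical embeddings of GPT-2 medium, and, as the abstract notes, some additional full-model finetuning further improves the generated sentences.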

