Improving Multilingual Models with Language-Clustered Vocabularies

10/24/2020
by Hyung Won Chung, et al.

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model is expected to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on the key multilingual benchmarks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), as well as a factor-of-8 reduction in the out-of-vocabulary rate, all without increasing the size of the model or the data.
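To make the procedure concrete, below is a minimal sketch in Python of the clustered-vocabulary idea: represent each language by a normalized subword-statistics vector, group the languages automatically with k-means, build one vocabulary per cluster, and take the union. The toy corpora, the character n-gram featurization, and the top-k frequency cutoff are illustrative assumptions for this sketch only; the authors' actual pipeline derives language vectors from separately trained per-language subword vocabularies and trains a proper subword vocabulary per cluster.

```python
# Minimal sketch of language-clustered vocabularies (illustrative only;
# toy data and simplifications stand in for the paper's real pipeline).
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Toy monolingual corpora (stand-ins for real per-language training data).
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
    "fi": "nopea ruskea kettu hyppaa laiskan koiran yli",
    "et": "kiire pruun rebane hyppab laisa koera ule",
}

def char_ngrams(text, n=3):
    """Character n-grams as a cheap proxy for subword units."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 1) Represent each language as a normalized n-gram frequency vector
#    over the union of all n-grams seen in any corpus.
counts = {lang: Counter(char_ngrams(t)) for lang, t in CORPORA.items()}
feature_space = sorted(set().union(*counts.values()))
X = np.array([[counts[lang][g] for g in feature_space] for lang in CORPORA],
             dtype=float)
X /= X.sum(axis=1, keepdims=True)

# 2) Cluster the languages. k is fixed by hand here; the paper derives
#    the clusters automatically from per-language vocabulary vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 3) Build one small "vocabulary" per cluster (here: top n-grams by
#    frequency; a real system would run a subword learner per cluster),
#    then union the cluster vocabularies into the final vocabulary.
vocab = set()
for c in set(labels):
    cluster_counts = Counter()
    for lang, label in zip(CORPORA, labels):
        if label == c:
            cluster_counts.update(counts[lang])
    vocab.update(g for g, _ in cluster_counts.most_common(20))

print("cluster assignment:", dict(zip(CORPORA, labels)))
print("final vocabulary size:", len(vocab))
```

Because each cluster's vocabulary is trained only on related languages, subwords are shared where sharing helps and kept language-specific where it does not, which is the trade-off the abstract describes.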



Related research

Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages (05/26/2023)
Multilingual language models have recently gained attention as a promisi...

Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages (05/04/2023)
Multilingual language models have shown impressive cross-lingual transfe...

Training and Evaluation of a Multilingual Tokenizer for GPT-SW3 (04/28/2023)
This paper provides a detailed discussion of the multilingual tokenizer ...

Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training (09/15/2021)
Compared to monolingual models, cross-lingual models usually require a m...

Learning to Scale Multilingual Representations for Vision-Language Tasks (04/09/2020)
Current multilingual vision-language models either require a large numbe...

Parameter-Efficient Finetuning for Robust Continual Multilingual Learning (09/14/2022)
NLU systems deployed in the real world are expected to be regularly upda...

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (03/11/2021)
Pipelined NLP systems have largely been superseded by end-to-end neural ...
