Improving Multilingual Models with Language-Clustered Vocabularies

10/24/2020
by   Hyung Won Chung, et al.
0

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on key multilingual benchmark tasks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1) and factor of 8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

09/15/2021

Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Compared to monolingual models, cross-lingual models usually require a m...
10/25/2019

On the Cross-lingual Transferability of Monolingual Representations

State-of-the-art unsupervised multilingual models (e.g., multilingual BE...
04/09/2020

Learning to Scale Multilingual Representations for Vision-Language Tasks

Current multilingual vision-language models either require a large numbe...
03/19/2021

MuRIL: Multilingual Representations for Indian Languages

India is a multilingual society with 1369 rationalized languages and dia...
03/11/2021

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Pipelined NLP systems have largely been superseded by end-to-end neural ...
10/22/2020

Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension

We propose a simple method to generate large amounts of multilingual que...
10/12/2021

Learning Compact Metrics for MT

Recent developments in machine translation and multilingual text generat...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.