Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

05/20/2023
by   Ayyoob ImaniGooghari, et al.
7

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 languages, almost all of them low-resource. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and lowresource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/06/2022

Language Models are Multilingual Chain-of-Thought Reasoners

We evaluate the reasoning abilities of large language models in multilin...
research
09/14/2023

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Despite the progress we have recorded in the last few years in multiling...
research
05/24/2023

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Despite the major advances in NLP, significant disparities in NLP system...
research
10/24/2020

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Transfer learning based on pretraining language models on a large amount...
research
03/31/2023

Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

As large language models (LLMs) gain popularity among speakers of divers...
research
05/19/2021

Essay-BR: a Brazilian Corpus of Essays

Automatic Essay Scoring (AES) is defined as the computer technology that...
research
01/29/2022

Does Transliteration Help Multilingual Language Modeling?

As there is a scarcity of large representative corpora for most language...

Please sign up or login with your details

Forgot password? Click here to reset