SERENGETI: Massively Multilingual Language Models for Africa

12/21/2022
by Ife Adebara, et al.

Multilingual language models (MLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. So far, only 28 out of 2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that cover 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to four MLMs that cover varying numbers of African languages. SERENGETI outperforms the other models on 11 datasets across the eight tasks and achieves an average F1 of 82.27. We also perform error analysis on our models' performance and show the influence of mutual intelligibility when the models are applied under zero-shot settings. We will publicly release our models for research.
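As a rough illustration of the task-specific finetuning and F1 evaluation described above, the sketch below attaches a classification head to a pretrained multilingual encoder and scores predictions with an averaged F1. The checkpoint name, label count, and toy examples are assumptions for illustration only: the abstract does not give a released model identifier or dataset details, so a public XLM-R encoder is used as a stand-in.

```python
# Minimal sketch of finetuning a pretrained multilingual encoder on an NLU task
# and scoring with average F1, in the spirit of the evaluation described above.
# "xlm-roberta-base" is a stand-in checkpoint; the SERENGETI model ID is not
# stated in the text and would be substituted here once released.
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "xlm-roberta-base"  # placeholder encoder, not the paper's checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

# Toy batch standing in for one of the NLU datasets (texts and labels invented).
texts = ["Habari za leo?", "Molweni nonke"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one finetuning step; optimizer update omitted

# Evaluation: the abstract reports an average F1 (macro averaging assumed here).
preds = outputs.logits.argmax(dim=-1)
print("macro F1:", f1_score(labels.numpy(), preds.numpy(), average="macro"))
```

In a zero-shot setting, the same finetuned model would simply be evaluated on test data from a language that never appears in the finetuning set, which is where the abstract's point about mutual intelligibility becomes relevant.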


