MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

09/09/2023
by Sneha Kudugunta et al.

We introduce MADLAD-400, a manually audited, general-domain, 3T-token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400 and the role data auditing played in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, find that it is competitive with significantly larger models, and report results on different domains. In addition, we train an 8B-parameter language model and assess its results on few-shot translation. We make the baseline models available to the research community.

