Sabiá: Portuguese Large Language Models

by Ramon Pires et al.

As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" models will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
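The core recipe above is continued pretraining: take a model already fitted to a broad, multilingual corpus and resume language-model training on target-language text alone. The paper does this with GPT-J and LLaMA; as a toy stand-in, the sketch below illustrates the same effect with a Laplace-smoothed bigram model (an assumption for illustration only, not the paper's architecture): perplexity on a Portuguese-like snippet drops after continued training on target-language text.

```python
import math
from collections import Counter

def train_bigrams(counts, text):
    """Accumulate bigram counts from a whitespace-tokenized corpus."""
    toks = text.split()
    for a, b in zip(toks, toks[1:]):
        counts[(a, b)] += 1
    return counts

def perplexity(counts, text, vocab_size):
    """Laplace-smoothed bigram perplexity of `text`; lower is better."""
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    toks = text.split()
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (counts[(a, b)] + 1) / (totals[a] + vocab_size)
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))

# Toy corpora: a "diverse" pretraining mix vs. target-language-only text.
multilingual = "the cat sat on the mat o gato sentou no tapete"
portuguese = "o gato sentou no tapete o gato dormiu no tapete"
vocab_size = len(set((multilingual + " " + portuguese).split()))

counts = Counter()
train_bigrams(counts, multilingual)            # stand-in for broad pretraining
before = perplexity(counts, portuguese, vocab_size)
train_bigrams(counts, portuguese)              # continued monolingual pretraining
after = perplexity(counts, portuguese, vocab_size)
# `after` is lower than `before`: the model has shifted toward the target language
```

The same logic drives the paper's setup at scale: continued exposure to Portuguese reallocates probability mass toward target-language structure and content, which is what the Poeta evaluations then measure.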




