Sabiá: Portuguese Large Language Models

04/16/2023
by Ramon Pires, et al.

As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" models will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts, using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
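The core recipe described in the abstract is continued causal-language-model pretraining of an existing checkpoint on target-language text. The sketch below illustrates that idea with the Hugging Face Trainer, starting from a GPT-J checkpoint; the corpus file name, hyperparameters, and output directory are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of continued (monolingual) pretraining of an existing causal LM.
# Assumptions: a GPT-J checkpoint from the Hugging Face Hub and a plain-text
# Portuguese corpus in "portuguese_corpus.txt" (hypothetical file name).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/gpt-j-6b"  # checkpoint already pretrained on diverse corpora
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the target-language corpus as raw text and tokenize it.
dataset = load_dataset("text", data_files={"train": "portuguese_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard next-token (causal LM) objective, i.e. no masked-LM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="sabia-continued-pretraining",  # illustrative output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=1e-5,
    num_train_epochs=1,  # kept small: the paper spends a small fraction of the original budget
    bf16=True,
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The same loop applies to other base checkpoints such as LLaMA; the paper's contribution is the finding that this relatively cheap monolingual stage yields large gains on Portuguese benchmarks, not the training loop itself.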


research
12/23/2022

MicroBERT: Effective Training of Low-resource Monolingual BERTs through Parameter Reduction and Multitask Learning

Transformer language models (TLMs) are critical for most NLP tasks, but ...
research
09/14/2021

MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

Domain adaptive pretraining, i.e. the continued unsupervised pretraining...
research
12/31/2020

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

In this work we provide a systematic empirical comparison of pretrained ...
research
09/09/2021

Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

We present models which complete missing text given transliterations of ...
research
09/29/2020

Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank

Pretrained multilingual contextual representations have shown great succ...
research
03/10/2023

Naver Labs Europe (SPLADE) @ TREC NeuCLIR 2022

This paper describes our participation in the 2022 TREC NeuCLIR challeng...
research
04/18/2023

UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

Pretrained multilingual large language models have typically used heuris...
