BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

07/14/2022
by   Javier de la Rosa, et al.

The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl may contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4 and present a novel data-centric technique, which we name perplexity sampling, that enables the pre-training of language models in roughly half the number of steps and using one fifth of the data. The resulting models are comparable to the current state of the art, and even achieve better results on certain tasks. Our work demonstrates the versatility of Transformers and paves the way for small teams to train models on a limited budget. Our models are available at https://huggingface.co/bertin-project.
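The abstract leaves the mechanics of perplexity sampling implicit. The following is a minimal sketch of how such a filter could look, assuming each mC4 document has already been scored with a perplexity value (for example, from a KenLM model trained on clean Spanish text). The function names, the Gaussian weighting form, and all constants here are illustrative assumptions, not the paper's exact implementation.

import math
import random

def gaussian_weight(perplexity, mean, std, factor=0.78):
    # Favour mid-perplexity documents over very low (boilerplate-like)
    # and very high (noisy) ones. The Gaussian form and the 0.78 factor
    # are illustrative choices, not the paper's exact parameters.
    return factor * math.exp(-((perplexity - mean) ** 2) / (2 * std ** 2))

def perplexity_sample(docs, mean, std, seed=0):
    # docs: iterable of (text, perplexity) pairs, e.g. scored with KenLM.
    # Keeps each document with probability equal to its Gaussian weight,
    # so the retained subsample concentrates around the target perplexity.
    rng = random.Random(seed)
    for text, ppl in docs:
        if rng.random() < gaussian_weight(ppl, mean, std):
            yield text

# Hypothetical usage; real scores would come from a model trained on clean Spanish text.
corpus = [("un documento limpio de ejemplo", 230.0), ("ruido $$$ !!! ###", 5100.0)]
subsample = list(perplexity_sample(corpus, mean=250.0, std=150.0))

Because the weight peaks near the target mean, a subsample drawn this way discards both boilerplate-like and very noisy documents, which is what allows training on a fraction of the corpus.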

Related research

09/10/2021 · Pre-train or Annotate? Domain Adaptation with a Constrained Budget
    Recent work has demonstrated that pre-training in-domain language models...

10/30/2022 · Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts
    Explicit decomposition modeling, which involves breaking down complex ta...

06/05/2023 · On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
    This evidence-based position paper critiques current research practices...

04/23/2021 · Transfer training from smaller language model
    Large language models have led to state-of-the-art accuracies across a r...

11/09/2022 · Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
    The pre-training of masked language models (MLMs) consumes massive compu...

08/10/2022 · Alternating Cross-attention Vision-Language Model for Efficient Learning with Medical Image and Report without Curation
    Recent advances in vision-language pre-training have demonstrated astoun...

01/11/2021 · Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
    In deep learning, models typically reuse the same parameters for all inp...
