Lessons learned from the evaluation of Spanish Language Models

12/16/2022
by Rodrigo Agerri et al.

Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora, or by means of smaller-scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue that more research is needed to understand the factors underlying them. In particular, the effects of corpus size, corpus quality, and pre-training techniques need to be investigated further before Spanish monolingual models can be obtained that are significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem that requires marrying resources (monetary and/or computational) with the best research expertise and practice.
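As a concrete starting point, below is a minimal sketch of the kind of head-to-head comparison the abstract describes: loading one multilingual and one monolingual Spanish masked language model from the Hugging Face Hub and querying both on the same fill-mask input. The checkpoint identifiers (mBERT and BETO) are public Hugging Face names chosen for illustration, not necessarily the paper's exact model list, and the paper's actual evaluation fine-tunes models on downstream tasks rather than probing them as done here.

# Sketch: query a multilingual and a monolingual Spanish masked LM
# on the same fill-mask input and compare their top predictions.
from transformers import pipeline

checkpoints = {
    "multilingual": "bert-base-multilingual-cased",          # mBERT
    "monolingual": "dccuchile/bert-base-spanish-wwm-cased",  # BETO
}

text = "La capital de España es [MASK]."

for name, ckpt in checkpoints.items():
    fill = pipeline("fill-mask", model=ckpt)
    # Substitute each tokenizer's own mask token, since it can
    # differ between models.
    preds = fill(text.replace("[MASK]", fill.tokenizer.mask_token))
    top = ", ".join(f"{p['token_str']} ({p['score']:.2f})" for p in preds[:3])
    print(f"{name}: {top}")

A fuller replication in the spirit of the paper would fine-tune each checkpoint on Spanish downstream benchmarks and compare task scores rather than mask predictions.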

Related research

Large Language Models Meet NL2Code: A Survey (12/19/2022)
The task of generating code from a natural language description, or NL2C...

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (03/07/2023)
As language models grow ever larger, the need for large-scale high-quali...

L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources (02/02/2022)
We present L3Cube-MahaCorpus a Marathi monolingual data set scraped from...

Probing Multilingual Language Models for Discourse (06/09/2021)
Pre-trained multilingual language models have become an important buildi...

Evaluation of HTR models without Ground Truth Material (01/17/2022)
The evaluation of Handwritten Text Recognition (HTR) models during their...

Toward Reproducing Network Research Results Using Large Language Models (09/09/2023)
Reproducing research results in the networking community is important fo...

Headless Language Models: Learning without Predicting with Contrastive Weight Tying (09/15/2023)
Self-supervised pre-training of language models usually consists in pred...
