Spanish Language Models

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, together with their performance evaluations. Both models were pre-trained on the largest Spanish corpus known to date, totaling 570 GB of clean, deduplicated text processed for this work and compiled from the web crawls performed by the National Library of Spain between 2009 and 2019.
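
The released checkpoints can be queried directly for masked-token prediction. Below is a minimal sketch using the Hugging Face transformers library; the model identifier "PlanTL-GOB-ES/roberta-base-bne" is an assumption, as the abstract does not name the published checkpoints.

```python
# Minimal sketch: masked-language-model inference with the Spanish
# RoBERTa-base model via Hugging Face transformers.
# NOTE: the model ID below is assumed, not stated in the abstract.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# RoBERTa models use "<mask>" as the mask token.
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```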


