Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

09/16/2021
by Casimiro Pio Carrino, et al.

We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3,000 Spanish domains carried out in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been used to train domain-specific language models and to produce word embeddings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license on Zenodo (<https://zenodo.org/record/4561971#.YTI5SnVKiEA>).


