1 Introduction and Motivations
Transfer learning turned out to be exceptionally successful when large-scale language models leveraging unlabelled data to perform self-supervised learning were employed. Two paradigmatic examples are GPT and BERT. However, although gathering unlabelled data (raw text) is considerably cheaper than producing annotations, obtaining high-quality text is especially challenging in the biomedical and health domains for non-English languages.
Following the paradigm of "the web as a corpus", manually crawling websites belonging to the target domains of interest is a strategy worth exploring. CommonCrawl (https://commoncrawl.org/) is a very large repository of crawled websites; however, it requires preprocessing to extract the text relevant to the user. The OSCAR corpus was built by applying language identification to CommonCrawl. For the biomedical domain in English, BioBERT and SciBERT aggregated and processed different corpora (mostly from scientific articles) to develop domain-specific BERT models. For biomedical data in Spanish, there is an ongoing effort to develop textual resources [4, 6, 9], but few corpora are still available. Furthermore, a comprehensive biomedical corpus should ideally enable transfer to even more low-resourced sub-domains, such as the clinical one. To the best of our knowledge, CoWeSe therefore represents an unprecedented effort to build the largest biomedical and health-related corpus in Spanish. Unlike previous works, we crawl the web instead of scientific literature, providing a large-scale corpus with diverse contents.
2 Data Collection
We crawled the web using 3,338 manually curated links as seeds, with a crawling depth of 5. The seeds were selected to include diverse and relevant content, including websites categorized as Sites of Interest for Health by the Carlos III Health Institute (ISCIII, https://www.isciii.es/QueHacemos/Servicios/Biblioteca/Paginas/default.aspx). Although the majority of the content is in Spanish, content in Catalan, Galician, and Basque has also been included.
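A seed-based crawl with a fixed depth limit amounts to a breadth-first traversal of the link graph. The sketch below illustrates the idea with an in-memory graph standing in for real HTTP fetching; the function names and the toy graph are illustrative, not the actual crawler we used.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=5):
    """Breadth-first crawl: visit every page reachable from the
    seed URLs in at most max_depth link hops."""
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real HTTP fetching.
graph = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": []}
pages = crawl(["seed"], lambda u: graph.get(u, []), max_depth=2)
```

With `max_depth=2`, the crawl reaches `c` (two hops from the seed) but never follows its link to `d`.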
The crawling was performed during the first half of 2020, and we exclusively scraped websites whose robots.txt files allowed it, resulting in a total of 2,766 websites. The raw crawl amounts to about 905GB of WARC files. When extracting the text, we only considered paragraph and header HTML tags; formats other than HTML were not included.
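Restricting extraction to paragraph and header tags can be sketched with Python's standard html.parser module; this is a minimal stand-in, as the actual pipeline operates on WARC records rather than single HTML strings.

```python
from html.parser import HTMLParser

KEEP = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextExtractor(HTMLParser):
    """Collect text only from paragraph and header tags,
    ignoring navigation menus, scripts, sidebars, etc."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside kept tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(page):
    parser = TextExtractor()
    parser.feed(page)
    return "\n".join(parser.chunks)

page = "<nav>Menu</nav><h1>Salud</h1><p>Texto relevante.</p>"
```

Here `extract_text(page)` keeps the header and paragraph text while dropping the navigation menu.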
The selected websites belong to at least one of the following categories:
Informative websites about health issues.
Personal blogs from healthcare professionals.
Public health organizations.
We use the cleaning pipeline (https://github.com/TeMU-BSC/corpus-cleaner-acl) introduced in [1], a series of customized components that perform data parsing in different formats, sentence splitting, language detection, removal of noisy and ill-formed sentences, and content deduplication, and finally output the data with their original document boundaries. Due to the vast amount of data, we deployed the cleaning pipeline across 50 nodes of a High-Performance Computing cluster (https://www.bsc.es/support/MareNostrum4-ug.pdf).
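One of the pipeline stages, content deduplication, can be illustrated with a simple hash-based filter over normalized documents. This is a sketch of the general technique, not the actual corpus-cleaner implementation.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different
    copies of the same document hash to the same value."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Keep the first occurrence of each document, dropping
    exact duplicates up to case and whitespace."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["La salud es importante.",
        "la  salud es importante.",   # duplicate up to case/whitespace
        "Otro documento distinto."]
```

Applying `deduplicate(docs)` drops the second document, a normalized copy of the first.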
The pipeline is parametrized so that its behaviour can be configured according to the nature of the data being processed. We modified the pipeline's parameters to adapt its components to the peculiarities of the biomedical and health domains. Specifically, we increased the default language identification thresholds, since we found that biomedical Spanish is often predicted as Spanish with a lower probability than general-domain Spanish. We also decreased the minimum allowed document length, since the overall crawl is not large enough to afford overly aggressive filtering in this regard. After preprocessing, we obtained a cleaned corpus of 4.5GB of plain text from the original 905GB of WARC files. Table 1 shows some statistics of the cleaned corpus.
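The two adapted parameters, the language-identification threshold and the minimum document length, can be sketched as a single filtering predicate. The threshold values and the stub predictor below are illustrative placeholders, not the pipeline's actual defaults.

```python
def keep_document(doc, predict_language,
                  lang="es", lang_threshold=0.9, min_words=10):
    """Keep a document only if it is confidently identified as the
    target language and is not too short. Both knobs are tunable:
    the threshold controls how strict language filtering is, and
    min_words controls how short a valid document may be."""
    predicted, prob = predict_language(doc)
    if predicted != lang or prob < lang_threshold:
        return False
    return len(doc.split()) >= min_words

# Stub standing in for a real language-identification model.
def stub_predictor(doc):
    return ("es", 0.95) if "salud" in doc else ("es", 0.5)
```

A long document containing "salud" passes both checks, while a one-word document or a low-confidence prediction is filtered out.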
Table 1: Statistics of the cleaned CoWeSe corpus.
Size (plain text)       4.5GB
Tokens (wc word count)  745.70M
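The token count reported above corresponds to a wc-style whitespace split, which can be reproduced as follows (the sample sentence is illustrative):

```python
def wc_word_count(text):
    """Whitespace-delimited token count, equivalent to `wc -w`."""
    return len(text.split())

sample = "El corpus CoWeSe contiene texto biomédico en español."
```

For the sample sentence, `wc_word_count(sample)` counts eight whitespace-delimited tokens.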
The CoWeSe corpus is openly available under a Creative Commons Attribution 4.0 International license on Zenodo (https://doi.org/10.5281/zenodo.4561970).
3 Conclusions
In this work, we have introduced the CoWeSe corpus, the largest Spanish biomedical corpus to date. On the one hand, the data collection process ensures a vast size while gathering diverse content. On the other hand, the cleaning preprocessing produced high-quality, easy-to-use textual data. We believe the CoWeSe corpus will have a significant impact on the biomedical NLP community, encouraging the development of biomedical and health-related language models and tools in Spanish.
Acknowledgements
This work was partially funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL, and by the Future of Computing Center, a Barcelona Supercomputing Center and IBM initiative (2020).

References
-  Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4933–4946, Online, August 2021. Association for Computational Linguistics.
-  Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China, November 2019. Association for Computational Linguistics.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
-  Aitor Gonzalez-Agirre, Montserrat Marimon, Ander Intxaurrondo, Obdulia Rabal, Marta Villegas, and Martin Krallinger. PharmaCoNER: Pharmacological substances, compounds and proteins named entity recognition track. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pages 1–10, Hong Kong, China, November 2019. Association for Computational Linguistics.
-  Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, September 2019.
-  A. Miranda-Escalada, E. Farré, and M. Krallinger. Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
-  Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.
-  Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
-  Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. Medical word embeddings for Spanish: Development and evaluation. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 124–133, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics.
-  Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. A monolingual approach to contextualized word embeddings for mid-resource languages. CoRR, abs/2006.06202, 2020.