The success of transformer-based models in the general domain [devlin-etal-2019-bert] has encouraged the development of language models for domain-specific scenarios [chalkidis-etal-2020-legal, tai-etal-2020-exbert, DBLP:journals/corr/abs-1908-10063, DBLP:journals/corr/abs-1906-02124]. In the biomedical domain in particular, there has been a proliferation of models [peng-etal-2019-transfer, beltagy-etal-2019-scibert, alsentzer-etal-2019-publicly, pubmedbert] since the first BioBERT [10.1093/bioinformatics/btz682] model was published. Unfortunately, there is still a significant lack of biomedical and clinical models in languages other than English, despite the increasing efforts of the NLP community [Nvol2014ClinicalNL, schneider-etal-2020-biobertpt]. Consequently, general-domain pretrained language models supporting Spanish, such as mBERT [devlin-etal-2019-bert] and BETO [beto], are often used as a proxy in the absence of a genuine domain-specialized model. To fill this gap, we trained several biomedical language models, experimenting with a wide range of pretraining choices: masking with word and subword units, varying the vocabulary size, and varying the data domain. Furthermore, we applied cross-domain transfer and mixed-domain pretraining using biomedical and clinical data to train bio-clinical models intended for clinical applications. To evaluate our models, we chose Named Entity Recognition (NER), a fundamental information extraction task, defined here for both biomedical and clinical scenarios, the former based on biomedical documents and the latter on real hospital documents. As the main result, our models obtained a significant gain over both the mBERT and BETO models in all tasks. Additionally, we studied the impact of the model's vocabulary on several downstream tasks by performing a vocabulary-centric analysis.
The evaluation results highlight the importance of domain-specific pretraining over continual pretraining from general-domain data in a mid-resource scenario. Our main contributions are:
We release the first Spanish biomedical and clinical transformer-based pretrained language models, trained with the largest biomedical corpus known to date.
We assess the effectiveness of these models in three settings: two biomedical scenarios and a demanding clinical scenario with real hospital reports.
We show that biomedical models exhibit a remarkable cross-domain transfer ability to the clinical domain.
We perform an in-depth vocabulary and segmentation analysis, which offers insights into domain-specific pretraining and raises interesting open questions.
2 Related work
In recent years, several language models for both the biomedical and clinical domains have been trained through unsupervised pretraining of transformer-based architectures [kalyan2021ammu]. The first such model was BioBERT [10.1093/bioinformatics/btz682], in which the authors adapted the BERT model [devlin-etal-2019-bert], trained on general-domain data, to the biomedical domain by continual pretraining. Other works followed the same continual pretraining approach to train the BlueBERT [peng-etal-2019-transfer] and ClinicalBERT [alsentzer-etal-2019-publicly] models. When enough in-domain data is available, training from scratch has been used as an alternative to continual pretraining, leading to the SciBERT [beltagy-etal-2019-scibert] and PubMedBERT [pubmedbert] models. However, SciBERT uses mixed-domain data from the biomedical and computer science domains, while PubMedBERT leverages only data belonging to the biomedical domain. Interestingly, in [pubmedbert], the authors call into question the benefits of mixed-domain pretraining, measuring its negative impact on a set of downstream tasks from an extensive biomedical benchmark (named BLURB) that they provide.
Our pretraining approach also employed training from scratch with mixed-domain data, combining biomedical and clinical resources. However, in contrast to [beltagy-etal-2019-scibert, pubmedbert], i) we used a corpus 3-5 times smaller, ii) our mixed-domain data combines biomedical documents and clinical notes, and iii) we assessed the suitability of cross-domain transfer from the biomedical to the clinical domain.
3 Corpora
This work considers two corpora with very different sizes and domains: a clinical corpus and a biomedical one. The clinical corpus contains 91M tokens from more than 278K clinical documents (including discharge reports, clinical course notes and X-ray reports). For the biomedical corpus, we gathered data from a variety of sources, namely: a miscellany of medical content, essentially clinical cases (a clinical case report is a type of scientific publication where medical practitioners share patient cases); scientific literature from Scielo and PubMed; medical patents; a Wikipedia health crawler; the EMEA corpus; the Spanish content from Medline; the background corpus from the BARR2 Shared Task [barr2_corpus]; and a massive crawl of Spanish health-domain data. The crawling was conducted during 2020 on more than 3,000 Spanish domains associated with the biomedical field, covering medical societies, scientific societies, journals, research centers, pharmaceutical companies, health educational websites, patient associations, personal web pages of healthcare professionals, hospital websites, medical regulatory colleges, and healthcare institutions and organizations. Notice that, although the biomedical documents undoubtedly share a significant percentage of medical terms with clinical notes, the syntax and vocabulary may change radically due to the specific contexts and the idiosyncrasies of the user-generated content in clinical texts.
We cleaned each biomedical resource to obtain the final corpus, while the clinical corpus was left uncleaned. For each biomedical resource, we applied a cleaning pipeline with customized operations designed to read data in different formats, split it into sentences, detect the language, remove noisy and ill-formed sentences, deduplicate, and eventually output the data with the original document boundaries. Finally, to remove repetitive content, we concatenated the entire corpus and deduplicated it again, obtaining a total of 968M words. Table 1 shows detailed statistics of each dataset.
| Corpus name | No. tokens |
| Medical crawler (https://zenodo.org/record/4561971) | 745,705,946 |
| Clinical cases misc. | 102,855,267 |
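The deduplication step of the pipeline above can be sketched as follows; this is a deliberately simplified illustration (the function name and the exact normalization are hypothetical, not our production pipeline):

```python
import hashlib

def deduplicate_sentences(documents):
    """Remove sentences already seen anywhere in the corpus while
    preserving document boundaries (simplified sketch)."""
    seen = set()
    cleaned_docs = []
    for doc in documents:
        kept = []
        for sentence in doc:
            # Hash the normalized sentence to detect exact duplicates.
            key = hashlib.sha1(sentence.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        if kept:  # drop documents left empty after deduplication
            cleaned_docs.append(kept)
    return cleaned_docs
```

A real pipeline would additionally apply fuzzy matching and language filtering before this step.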
4 Models pretraining
The models trained in this work employ a RoBERTa [DBLP:journals/corr/abs-1907-11692] base architecture with 12 self-attention layers. Following the original training setup, we dropped the auxiliary Next Sentence Prediction task used in BERT and used masked language modelling as the sole pretraining objective. We experimented both with Subword Masking (SWM), as in [DBLP:journals/corr/abs-1907-11692], and with the Whole Word Masking (WWM) technique (https://github.com/google-research/bert), studied in [Cui2019PreTrainingWW], which masks all subwords belonging to the same word.
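The difference between SWM and WWM can be illustrated with a toy implementation of WWM over RoBERTa-style subwords. This is a minimal sketch, not the actual Fairseq implementation; the function name is hypothetical, while the "Ġ" prefix is RoBERTa's real word-boundary marker:

```python
import random

def whole_word_mask(subwords, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Whole Word Masking sketch: subwords starting with 'Ġ' open a new
    word; a selected word has ALL of its subwords masked, whereas SWM
    would sample each subword independently."""
    rng = rng or random.Random(0)
    # Group subword indices into words using the boundary marker.
    words, current = [], []
    for i, tok in enumerate(subwords):
        if tok.startswith("Ġ") and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)
    masked = list(subwords)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:  # mask every subword of the selected word
                masked[i] = mask_token
    return masked
```

With SWM, "Ġhemo" could be masked while "globina" stays visible, letting the model cheat by copying word fragments; WWM removes the whole word.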
We tokenized with the Byte-Level BPE algorithm introduced in [radford2019language] and employed in the original RoBERTa [DBLP:journals/corr/abs-1907-11692], unlike previous biomedical language models [beltagy-etal-2019-scibert, pubmedbert, 10.1093/bioinformatics/btz682] that use WordPiece [devlin-etal-2019-bert] or SentencePiece [kudo-2018-subword] segmentations. We learned cased vocabularies of 52k and 30k tokens to perform a comparative analysis.
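The core of BPE vocabulary learning can be sketched in a few lines. The toy character-level version below (a hypothetical helper, not the byte-level implementation we used) shows how a larger merge budget, i.e. a larger vocabulary, produces longer units:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE sketch: repeatedly merge the most frequent adjacent pair
    of symbols. The number of merges controls the vocabulary size."""
    words = [list(w) for w in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _freq = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)  # apply the learned merge
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words
```

Byte-level BPE applies the same procedure over UTF-8 bytes, which guarantees there are no out-of-vocabulary strings.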
We ran the training for 48 hours on 16 NVIDIA V100 GPUs with 16GB of RAM each, using the Adam optimizer [Adam] with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences (through gradient accumulation, as implemented in Fairseq [ott-etal-2019-fairseq]). We left the other hyperparameters at their default values from the original RoBERTa training. We then selected the model with the lowest perplexity on a holdout subset as the best model. Moreover, training was performed at the document level, preserving document boundaries; document-level training may be crucial to push the model towards the comprehension of entire documents, fostering the modelling of long-range dependencies.
We applied the pretraining method described above to train a variety of models, which can be divided into two groups, the Biomedical language models and the Bio-clinical language models.
Biomedical language models:
We used the biomedical corpora described in section 3, totalling about 968M tokens, to train biomedical language models. To study the impact of the masking mechanism and the vocabulary size, we experimented with the SWM and WWM techniques and with vocabularies of 52k and 30k tokens. We refer to the four variants as bio-52k-SWM, bio-52k-WWM, bio-30k-SWM and bio-30k-WWM.
Bio-clinical language models:
Due to the lack of a large-scale clinical corpus of comparable size to the biomedical one, we combined the biomedical and clinical corpora described in section 3 to train bio-clinical language models suitable for clinical settings. We also trained a bio-clinical variant that leverages a 52k-token vocabulary learned only from the clinical corpus. We refer to these two models as bio-cli-52k and bio-cli-52k-vocab-cli.
5 Downstream NER tasks
We chose Named Entity Recognition (NER) tasks as a testbed for our models, since NER is an essential building block for many biomedical Text Mining and NLP applications. We employed a standard linear layer as a token classification head, together with the BIO tagging schema [sang2000introduction], to fine-tune the pretrained models for the NER tasks. During fine-tuning, the parameters of both the pretrained model and the classification layer are learned with stochastic gradient descent.
Fine-tuning pretrained transformer-based language models for NER by adding a linear layer on top of them is common practice in the literature, both for general-purpose models [devlin-etal-2019-bert, DBLP:journals/corr/abs-1907-11692] and for domain-specific ones [10.1093/bioinformatics/btz682]. Remarkably, this method obtains impressive performances compared to more sophisticated classification layers, such as a Conditional Random Field layer on top of a Bidirectional Long Short-Term Memory network [panchendrarajan-amaresan-2018-bidirectional]. Furthermore, its simplicity allows a head-to-head comparison of different pretraining strategies and baseline models, emphasising the quality of the pretrained representations.
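For concreteness, the BIO tagging schema used with the classification head can be decoded back into entity spans as follows; a minimal sketch with hypothetical names:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) entity spans,
    with `end` exclusive. 'B-X' opens an entity of type X, 'I-X'
    continues it, and 'O' (or a type mismatch) closes it."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:  # close any open entity first
                spans.append((start, i, ent_type))
            start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # entity continues
        else:
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:  # entity running until the end of the sequence
        spans.append((start, len(tags), ent_type))
    return spans
```

During fine-tuning the linear head predicts one such tag per subword; spans are recovered from the tags of word-initial subwords.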
We applied the fine-tuning method described above to three different NER datasets. The first two come from shared tasks and contain annotations on curated medical data (clinical cases extracted from the medical literature). The third uses medical records from the ICTUSnet project (https://ictusnet-sudoe.eu/es/).
PharmaCoNER is a track on chemical and drug mention recognition in Spanish medical texts. The organizers collected a manually classified collection of clinical case report sections derived from open-access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC). The corpus contains a total of 1,000 clinical cases and 396,988 words and was manually annotated with 7,624 entity mentions, corresponding to four different mention types (for a detailed description, see https://temu.bsc.es/pharmaconer/). The track received several system submissions from the NLP community [stoeckel-etal-2019-specialization, Akhtyamova2020TestingCW, 9087359, xiong-etal-2019-deep].
CANTEMIST [miranda2020named] is a shared task specifically focusing on named entity recognition of tumor morphology in Spanish. The CANTEMIST corpus (https://doi.org/10.5281/zenodo.3878178) is a collection of 1,301 oncological case reports written in Spanish, with a total of 63,016 sentences and 1,093,501 tokens. Several systems employing different strategies have been proposed to tackle the task [vicomtech-cantemist, Xiong-cantemist, Vunikili-cantemist].
The ICTUSnet dataset consists of 1,006 hospital discharge reports of patients admitted for stroke at 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables. The dataset is part of the ICTUSnet project, whose main objective was the development of an information extraction system to support domain experts in identifying relevant information in discharge reports.
6 Evaluation and Main Results
We evaluated our models on two biomedical benchmarks, PharmaCoNER and CANTEMIST, as well as on the clinical ICTUSnet dataset, all described in section 5. For each NER dataset and model, we fine-tuned for 10 epochs with a batch size of 32 and a maximum sentence length of 512 tokens. After training, we selected as the best model the one with the highest F1 score on the development set. Finally, we computed the evaluation scores by feeding the test set to the best model. Tables 2 and 3 show the results for the biomedical models and bio-clinical models, respectively. We would like to remark that ICTUSnet is a challenging task, since it consists of real hospital discharge reports. Moreover, for the biomedical models, the ICTUSnet evaluation represents a cross-domain transfer experiment from the biomedical to the clinical domain.
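The evaluation metric can be sketched as strict entity-level matching, where a prediction counts only if both boundaries and type are exact. The function below is an illustrative simplification (hypothetical name; we do not claim it reproduces the official task scorers):

```python
def entity_f1(gold_spans, pred_spans):
    """Strict entity-level precision/recall/F1 over (start, end, type)
    spans: a prediction is a true positive only on an exact match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # exact boundary-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Strict matching is harsh on boundary errors, which is precisely why over-segmentation of long terms (analyzed in section 7) can hurt the reported scores.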
Overall, our models achieved the best scores, beating both mBERT and BETO significantly. The biomedical models showed a remarkable cross-domain transfer ability, achieving performances on the ICTUSnet task competitive with those of the bio-clinical models. Nonetheless, the bio-clinical models obtained the best performances, indicating that mixed-domain pretraining is a practical approach to mitigate the lack of large-scale real-world clinical data. These results confirm the suitability of pretraining from scratch in a scenario with medium-size resources. We finally point out that, although achieving the best task performance was not our primary objective, the results obtained are promising, suggesting that a more sophisticated classification layer could further improve task performance.
|- F1||89.48±0.60||89.62±0.57||89.47±0.33||89.85±0.47||87.46±0.23||88.18±0.41|
|- Precision||87.85±1.15||88.18±0.99||88.33±0.42||88.41±0.41||86.50±0.95||87.12±0.52|
|- Recall||91.18±0.74||91.11±0.40||90.64±0.76||91.35±0.58||88.46±0.57||89.28±0.69|
|- F1||83.87±0.41||83.00±0.17||82.85±0.36||83.23±0.34||82.61±0.67||82.42±0.06|
Evaluation scores (F1, Precision and Recall) for the PharmaCoNER, CANTEMIST and ICTUSnet NER tasks. We compare our biomedical models with two general-domain baseline models, namely multilingual BERT and BETO. Rows in light gray indicate that the biomedical model is performing cross-domain transfer on the ICTUSnet task, since it belongs to the clinical domain. Scores are averaged across 5 random runs, with standard deviations as error bars.
7 Discussion and Analysis
This section attempts to shed light on the evaluation results by conducting a vocabulary analysis and segmentation experiments.
7.1 Vocabulary overlap
We look at the evaluation results through the lens of the model's vocabulary. Undoubtedly, the vocabulary plays a crucial role during downstream transfer, since it is responsible for encoding the task-specific data. Intuitively, we expect that the greater the overlap between the vocabulary and the task's tokens, the better: a high overlap leverages more pretrained representations that could be beneficial for fine-tuning. Specifically, we hypothesize that the performance on each downstream task may be related to the number of vocabulary tokens used to encode the task's data. Therefore, we first tokenize the three NER tasks with each model's vocabulary and then calculate the vocabulary overlap with each task, expressed as a number of tokens. The results shown in Table 4 seem to support our hypothesis: the best evaluation scores for each task are obtained by the model with the maximum token overlap. Accordingly, the lower overlap exhibited by the baseline models may explain their lower evaluation scores.
|Model||PharmaCoNER||CANTEMIST||ICTUSnet|
|Bio-cli-52k||20,620 (40%)||22,001 (42%)||23,360 (45%)|
|Bio-cli-vocab-cli-52k||20,335 (39%)||23,095 (44%)||30,467 (59%)|
|Bio-52k||19,978 (38%)||20,951 (40%)||21,449 (41%)|
|Bio-30k||15,792 (53%)||16,302 (54%)||16,266 (54%)|
|BETO||12,829 (41%)||13,044 (42%)||13,388 (43%)|
|mBERT||11,084 (9%)||11,434 (9%)||13,187 (11%)|
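The overlap statistic reported in Table 4 can be computed as sketched below (hypothetical helper name; note that the percentage is relative to the model's vocabulary size, which is why the 30k-vocabulary model shows a higher percentage despite a lower count):

```python
def vocabulary_overlap(vocab, task_tokens):
    """Number and percentage of vocabulary entries actually used when
    encoding a task's data (sketch of the Table 4 statistic)."""
    used = set(task_tokens) & set(vocab)  # vocab entries the task exercises
    return len(used), 100.0 * len(used) / len(vocab)
```

Here `task_tokens` is the multiset of subwords obtained by running the model's own tokenizer over the task corpus.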
7.2 Impact of segmentation
Intuitively, it is reasonable to assume that a proper domain-specific segmentation should preserve the integrity of biomedical terms, minimizing the number of subword units required to encode them. As pointed out in [pubmedbert], models employing an out-of-domain vocabulary "are forced to divert parametrization capacity and training bandwidth to model biomedical terms using fragmented subwords". On the one hand, then, over-segmentation may have a negative influence on downstream performance. On the other hand, an under-segmentation that employs term-specific units could dramatically increase the vocabulary size; moreover, it could prevent the model from detecting relatedness between terms based on shared subwords, especially in morphologically rich terminology. Therefore, we conducted a meticulous analysis of domain-specific term segmentation to study the trade-off between under- and over-segmentation and shed light on its relation to the downstream NER performances.
7.2.1 Splitting terms
We analyzed, both qualitatively and quantitatively, how the different models under study segment biomedical terms. We retrieved all the biomedical terms used as annotations in the CANTEMIST, PharmaCoNER and ICTUSnet tasks and segmented them by applying each model's tokenizer. Table 7 shows the quality of segmentation on a random set of NER annotations, comparing mBERT, BETO and our best model, bio-cli-52k-vocab-cli. Table 5 shows the average number of subwords generated by each model, computed over all the NER annotations. As expected, mBERT and BETO split terms into many more pieces than our models on average. However, the variations across biomedical and clinical models also indicate that the vocabulary size and mixed-domain pretraining are relevant factors determining the final segmentation. Finally, Table 6 illustrates, for each model, how many terms are segmented into a given number of subwords (up to more than 4 subwords). Again, both mBERT and BETO tend to over-segment compared to our models. As an example, our best model (bio-cli-52k-vocab-cli) has roughly 50% of the PharmaCoNER annotations segmented into either one or two subwords, whereas mBERT and BETO have less than 20% of the annotations segmented into that many subwords.
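The fertility statistics behind Tables 5 and 6 can be sketched as follows (hypothetical helper; any tokenizer exposed as a `str -> list of subwords` callable can be plugged in):

```python
from collections import Counter

def segmentation_stats(terms, tokenize, max_bucket=4):
    """Return the average number of subwords per annotated term
    (Table 5) and a histogram of terms by subword count, pooling
    everything above `max_bucket` into one ">N" bucket (Table 6)."""
    lengths = [len(tokenize(t)) for t in terms]
    avg = sum(lengths) / len(lengths)
    hist = Counter(
        str(n) if n <= max_bucket else f">{max_bucket}" for n in lengths
    )
    return avg, hist
```

For the toy test below, terms are pre-split with "|" so the tokenizer is just `str.split`; in practice one would pass, e.g., a trained BPE tokenizer's encode function.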
7.2.2 Dissecting the F1 score
Finally, we seek a more precise relationship between the evaluation scores and a model's over-segmentation, expressed as the average number of subwords per term. Specifically, for each model and NER task, we group the annotations in the test split by the number of subwords they are split into and then recalculate the performance scores for each group of annotations. From the results presented in Figure 1, it can be observed that the F1 score decreases as the number of subwords increases. On average, this decreasing trend holds across the three tasks, with a more considerable variation starting from 7 subwords. The analysis suggests that over-segmentation should be avoided in order to obtain higher F1 scores, confirming the intuition that a helpful segmentation preserves term integrity. Note that, rather than criticising subword segmentation, we believe that further experiments unravelling the relationship between segmentation and downstream task performance could guide the design of an optimal biomedical segmentation.
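The grouping procedure behind Figure 1 can be sketched as follows (hypothetical names; `score_fn` stands in for the per-group F1 computation against the model's predictions):

```python
def score_by_subword_count(annotations, tokenize, score_fn):
    """Group test-set annotations by how many subwords the tokenizer
    splits them into, then score each group separately."""
    groups = {}
    for ann in annotations:
        groups.setdefault(len(tokenize(ann)), []).append(ann)
    # One score per subword-count bucket, in increasing bucket order.
    return {n: score_fn(group) for n, group in sorted(groups.items())}
```

Plotting the resulting scores against the bucket index yields the decreasing trend shown in Figure 1.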
8 Open Questions
In this section, we open up interesting questions motivated by the evaluation results and analysis presented in previous sections.
Is WWM better than SWM?
The comparison between the SWM and WWM pretraining techniques shows, in the case of the biomedical evaluation (see Table 2), that the impact of the latter is affected by the vocabulary size. In particular, the WWM technique shows consistent superiority only with a vocabulary size of 30k. This evidence contrasts with the finding, pointed out in [pubmedbert], that WWM is generally beneficial. We believe further ablation studies are necessary to elucidate the interplay between the vocabulary size and the masking mechanism.
Is mixed-domain pretraining beneficial?
The evaluation scores show that the bio-clinical models obtained the best performances across all tasks. Surprisingly, the bio-clinical model with the clinical vocabulary obtains the highest performance on the CANTEMIST and ICTUSnet test sets. These results suggest that mixed-domain data might not always degrade performance, questioning the finding presented in [pubmedbert], where the authors show the negative impact of mixed-domain pretraining with biomedical and computer science text, as applied in SciBERT [beltagy-etal-2019-scibert]. In our case, since we deal with two distinct but close domains, the biomedical and the clinical, we somewhat expect mixed-domain pretraining to profit from the additional training data. On the other hand, the results of the bio-cli-52k-vocab-cli model also highlight that, under the same training size conditions, the vocabulary plays an important role. In general, we believe further experiments are needed to understand the limitations of mixed-domain pretraining. In our evaluation scenario, however, we hypothesize that the reason behind the encouraging performances may be related to how much overlap the specific NER data has with each model's vocabulary, as partially supported by the results in Table 4.
9 Conclusions and Future Work
In this work, we trained the first biomedical and clinical transformer-based pretrained language models for Spanish. We then evaluated them on a set of NER tasks, including a demanding one based on real hospital discharge reports. Our models outperform two competitive baselines, namely mBERT and BETO, representing superior solutions for biomedical NLP applications in Spanish. Finally, we analyzed the results through an in-depth analysis of the models' vocabularies and segmentations. Throughout the work, we outlined some underexplored aspects of language model pretraining, such as the feasibility of the mixed-domain approach, the effectiveness of cross-domain transfer for clinical settings and the interplay between the vocabulary size and the token masking mechanism. Moreover, we showed the impact of different term segmentations on the evaluation scores, suggesting that over-segmentation can be detrimental for downstream tasks such as NER. Overall, we experimentally show that domain-specific pretraining has a greater positive impact than general-domain pretraining in a mid-resource scenario.
As future work, we suggest extending the evaluation to tasks other than NER, arguably the most studied task in the biomedical and clinical NLP literature but not the only relevant one. In addition, following the open questions raised in Section 8, we will perform more in-depth experiments to elucidate under which conditions mixed-domain pretraining is advantageous, and we will investigate the relationship between the vocabulary size and the masking mechanism.
This work was partially funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL, and the Future of Computing Center, a Barcelona Supercomputing Center and IBM initiative (2020).