The success of transformer-based models in the general domain [devlin-etal-2019-bert] has encouraged the development of language models for domain-specific scenarios [chalkidis-etal-2020-legal, tai-etal-2020-exbert, DBLP:journals/corr/abs-1908-10063, DBLP:journals/corr/abs-1906-02124]. In the biomedical domain in particular, there has been a proliferation of models [peng-etal-2019-transfer, beltagy-etal-2019-scibert, alsentzer-etal-2019-publicly, pubmedbert] since the first BioBERT [10.1093/bioinformatics/btz682] model was published. Unfortunately, there is still a significant lack of biomedical and clinical models in languages other than English, despite the increasing efforts of the NLP community [Nvol2014ClinicalNL, schneider-etal-2020-biobertpt]. Consequently, general-domain pretrained language models supporting Spanish, such as mBERT [devlin-etal-2019-bert] and BETO [beto], are often used as a proxy in the absence of a genuine domain-specialized model. To fill this gap, we trained several biomedical language models, experimenting with a wide range of pretraining choices: masking with word and subword units, varying the vocabulary size, and varying the data domain. Furthermore, we applied cross-domain transfer and mixed-domain pretraining using biomedical and clinical data to train bio-clinical models intended for clinical applications. To evaluate our models, we chose Named Entity Recognition (NER), a fundamental information extraction task, defined here for both biomedical and clinical scenarios, the former based on biomedical documents and the latter on real hospital documents. As the main result, our models obtained a significant gain over both the mBERT and BETO models in all tasks. Additionally, we studied the impact of the model's vocabulary on several downstream tasks by performing a vocabulary-centric analysis.
The evaluation results highlight the importance of domain-specific pretraining over continual pretraining from general-domain data in a mid-resource scenario. Our main contributions are:
We release the first Spanish biomedical and clinical transformer-based pretrained language models, trained with the largest biomedical corpus known to date.
We assess the effectiveness of these models in three settings: two biomedical scenarios and a demanding clinical scenario with real hospital reports.
We show that biomedical models exhibit a remarkable cross-domain transfer ability to the clinical domain.
We perform an in-depth vocabulary and segmentation analysis, which offers insights into domain-specific pretraining and raises interesting open questions.
2 Related work
In recent years, several language models for both the biomedical and clinical domains have been trained through unsupervised pretraining of transformer-based architectures [kalyan2021ammu]. The first such model was BioBERT [10.1093/bioinformatics/btz682], in which the authors adapted the BERT model [devlin-etal-2019-bert], trained on general-domain data, to the biomedical domain by continual pretraining. Other works followed the same continual pretraining approach to train the BlueBERT [peng-etal-2019-transfer] and ClinicalBERT [alsentzer-etal-2019-publicly] models. When enough in-domain data is available, training from scratch has been used as an alternative to continual pretraining, leading to the SciBERT [beltagy-etal-2019-scibert] and PubMedBERT [pubmedbert] models. However, SciBERT uses mixed-domain data from the biomedical and computer science domains, while PubMedBERT leverages only data belonging to the biomedical domain. Interestingly, in [pubmedbert], the authors call into question the benefits of mixed-domain pretraining, measuring its negative impact on a set of downstream tasks from an extensive biomedical benchmark (named BLURB) that they provide.
Our pretraining approach also employed training from scratch with mixed-domain data, combining biomedical and clinical resources. However, in contrast to [beltagy-etal-2019-scibert, pubmedbert], i) we used a corpus 3-5 times smaller, ii) our mixed-domain data combines biomedical documents and clinical notes, and iii) we assessed the suitability of cross-domain transfer from the biomedical to the clinical domain.
3 Corpora
This work considers two corpora with very different sizes and domains: a clinical corpus and a biomedical one. The clinical corpus contains 91M tokens from more than 278K clinical documents (including discharge reports, clinical course notes and X-ray reports). For the biomedical corpus, we gathered data from a variety of sources, namely: a miscellany of medical content, essentially clinical cases (a clinical case report is a type of scientific publication where medical practitioners share patient cases); scientific literature from Scielo and PubMed; medical patents; a Wikipedia health crawler; the EMEA corpus; the Spanish content from Medline; the background corpus from the BARR2 Shared Task [barr2_corpus]; and a massive crawl of Spanish health-domain data. The crawling was conducted during 2020 on more than 3,000 Spanish domains associated with the biomedical field, covering medical societies, scientific societies, journals, research centers, pharmaceutical companies, health educational websites, patient associations, personal web pages of healthcare professionals, hospital websites, medical regulatory colleges, and healthcare institutions and organizations. Notice that, although the biomedical documents undoubtedly share a significant percentage of medical terms with clinical notes, the syntax and vocabulary may change radically due to the specific contexts and the idiosyncrasies of the user-generated content in clinical texts.
We cleaned each biomedical resource to obtain the final corpus, while the clinical corpus was left uncleaned. For each biomedical resource, we applied a cleaning pipeline with customized operations designed to read data in different formats, split it into sentences, detect the language, remove noisy and ill-formed sentences, deduplicate, and eventually output the data with the original document boundaries. Finally, to remove repetitive content, we concatenated the entire corpus and deduplicated it again, obtaining a total of 968M words. Table 1 shows detailed statistics of each dataset.
| Corpus name | No. tokens |
| Medical crawler (https://zenodo.org/record/4561971) | 745,705,946 |
| Clinical cases misc. | 102,855,267 |
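The deduplication step of the pipeline above can be sketched as follows; this is a deliberately simplified illustration (the function name and the exact normalization are hypothetical, not our production pipeline):

```python
import hashlib

def deduplicate_sentences(documents):
    """Remove sentences already seen anywhere in the corpus while
    preserving document boundaries (simplified sketch)."""
    seen = set()
    cleaned_docs = []
    for doc in documents:
        kept = []
        for sentence in doc:
            # Hash the normalized sentence to detect exact duplicates.
            key = hashlib.sha1(sentence.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        if kept:  # drop documents left empty after deduplication
            cleaned_docs.append(kept)
    return cleaned_docs
```

A real pipeline would additionally apply fuzzy matching and language filtering before this step.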
4 Models pretraining
The models trained in this work employ a RoBERTa [DBLP:journals/corr/abs-1907-11692] base architecture with 12 self-attention layers. Following the original training setup, we dropped the auxiliary Next Sentence Prediction task used in BERT and used masked language modelling as the sole pretraining objective. We experimented both with Subword Masking (SWM), as in [DBLP:journals/corr/abs-1907-11692], and with the Whole Word Masking (WWM) technique (https://github.com/google-research/bert), studied in [Cui2019PreTrainingWW], which masks all subwords belonging to the same word.
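The difference between SWM and WWM can be illustrated with a toy implementation of WWM over RoBERTa-style subwords. This is a minimal sketch, not the actual Fairseq implementation; the function name is hypothetical, while the "Ġ" prefix is RoBERTa's real word-boundary marker:

```python
import random

def whole_word_mask(subwords, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Whole Word Masking sketch: subwords starting with 'Ġ' open a new
    word; a selected word has ALL of its subwords masked, whereas SWM
    would sample each subword independently."""
    rng = rng or random.Random(0)
    # Group subword indices into words using the boundary marker.
    words, current = [], []
    for i, tok in enumerate(subwords):
        if tok.startswith("Ġ") and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)
    masked = list(subwords)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:  # mask every subword of the selected word
                masked[i] = mask_token
    return masked
```

With SWM, "Ġhemo" could be masked while "globina" stays visible, letting the model cheat by copying word fragments; WWM removes the whole word.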
We tokenized with the Byte-Level BPE algorithm introduced in [radford2019language] and employed in the original RoBERTa [DBLP:journals/corr/abs-1907-11692], unlike previous biomedical language models [beltagy-etal-2019-scibert, pubmedbert, 10.1093/bioinformatics/btz682] that use WordPiece [devlin-etal-2019-bert] or SentencePiece [kudo-2018-subword] segmentations. We learned cased vocabularies of 52k and 30k tokens to perform a comparative analysis.
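The core of BPE vocabulary learning can be sketched in a few lines. The toy character-level version below (a hypothetical helper, not the byte-level implementation we used) shows how a larger merge budget, i.e. a larger vocabulary, produces longer units:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE sketch: repeatedly merge the most frequent adjacent pair
    of symbols. The number of merges controls the vocabulary size."""
    words = [list(w) for w in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _freq = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)  # apply the learned merge
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words
```

Byte-level BPE applies the same procedure over UTF-8 bytes, which guarantees there are no out-of-vocabulary strings.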
We ran the training for 48 hours on 16 NVIDIA V100 GPUs with 16GB of RAM each, using the Adam optimizer [Adam] with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences (through gradient accumulation, as implemented in Fairseq [ott-etal-2019-fairseq]). We left the other hyperparameters at their default values from the original RoBERTa training. We then selected the model with the lowest perplexity on a holdout subset as the best model. Moreover, training was performed at the document level, preserving document boundaries; document-level training may be crucial to push the model towards the comprehension of entire documents, fostering the modelling of long-range dependencies.
We applied the pretraining method described above to train a variety of models, which can be divided into two groups, the Biomedical language models and the Bio-clinical language models.
Biomedical language models:
We used the biomedical corpora described in section 3, totalling about 968M tokens, to train biomedical language models. To study the impact of the masking mechanism and the vocabulary size, we experimented with the SWM and WWM techniques and with vocabularies of 52k and 30k tokens. We refer to the four variants as bio-52k-SWM, bio-52k-WWM, bio-30k-SWM and bio-30k-WWM.
Bio-clinical language models:
Due to the lack of a large-scale clinical corpus of comparable size to the biomedical one, we combined the biomedical and clinical corpora described in section 3 to train bio-clinical language models suitable for clinical settings. We also trained a bio-clinical variant that leverages a 52k-token vocabulary learned only from the clinical corpus. We refer to these two models as bio-cli-52k and bio-cli-52k-vocab-cli.
5 Downstream NER tasks
We chose Named Entity Recognition (NER) tasks as a testbed for our models, since NER is an essential building block for many biomedical Text Mining and NLP applications. We employed a standard linear layer as a token classification head, together with the BIO tagging schema [sang2000introduction], to fine-tune the pretrained models for the NER tasks. During fine-tuning, the parameters of both the pretrained model and the classification layer are learned with stochastic gradient descent.
Fine-tuning pretrained transformer-based language models for NER by adding a linear layer on top of them is common practice in the literature, both for general-purpose models [devlin-etal-2019-bert, DBLP:journals/corr/abs-1907-11692] and for domain-specific ones [10.1093/bioinformatics/btz682]. Remarkably, this method obtains impressive performances compared to more sophisticated classification layers, such as a Conditional Random Field layer on top of a Bidirectional Long Short-Term Memory network [panchendrarajan-amaresan-2018-bidirectional]. Furthermore, its simplicity allows a head-to-head comparison of different pretraining strategies and baseline models, emphasising the quality of the pretrained representations.
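For concreteness, the BIO tagging schema used with the classification head can be decoded back into entity spans as follows; a minimal sketch with hypothetical names:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) entity spans,
    with `end` exclusive. 'B-X' opens an entity of type X, 'I-X'
    continues it, and 'O' (or a type mismatch) closes it."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:  # close any open entity first
                spans.append((start, i, ent_type))
            start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # entity continues
        else:
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:  # entity running until the end of the sequence
        spans.append((start, len(tags), ent_type))
    return spans
```

During fine-tuning the linear head predicts one such tag per subword; spans are recovered from the tags of word-initial subwords.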
We applied the fine-tuning method described above to three different NER datasets. The first two come from shared tasks and contain annotations on curated medical data (clinical cases extracted from the medical literature). The third uses medical records from the ICTUSnet project (https://ictusnet-sudoe.eu/es/).
PharmaCoNER is a track on chemical and drug mention recognition in Spanish medical texts. The organizers collected a manually classified collection of clinical case report sections derived from open-access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC). The corpus contains a total of 1,000 clinical cases and 396,988 words and was manually annotated with 7,624 entity mentions, corresponding to four different mention types (for a detailed description, see https://temu.bsc.es/pharmaconer/). The track received several system submissions from the NLP community [stoeckel-etal-2019-specialization, Akhtyamova2020TestingCW, 9087359, xiong-etal-2019-deep].
CANTEMIST [miranda2020named] is a shared task specifically focusing on named entity recognition of tumor morphology in Spanish. The CANTEMIST corpus (https://doi.org/10.5281/zenodo.3878178) is a collection of 1,301 oncological case reports written in Spanish, with a total of 63,016 sentences and 1,093,501 tokens. Several systems employing different strategies have been proposed to tackle the task [vicomtech-cantemist, Xiong-cantemist, Vunikili-cantemist].
The ICTUSnet dataset consists of 1,006 hospital discharge reports of patients admitted for stroke at 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables. The dataset is part of the ICTUSnet project, whose main objective was the development of an information extraction system to support domain experts in identifying relevant information in discharge reports.
6 Evaluation and Main Results
We evaluated our models on two biomedical benchmarks, PharmaCoNER and CANTEMIST, as well as on the clinical ICTUSnet dataset, all described in section 5. For each NER dataset and model, we fine-tuned for 10 epochs with a batch size of 32 and a maximum sentence length of 512 tokens. After training, we selected as the best model the one with the highest F1 score on the development set. Finally, we computed the evaluation scores by feeding the test set to the best model. Tables 2 and 3 show the results for the biomedical models and bio-clinical models, respectively. We would like to remark that ICTUSnet is a challenging task, since it consists of real hospital discharge reports. Moreover, for the biomedical models, the ICTUSnet evaluation represents a cross-domain transfer experiment from the biomedical to the clinical domain.
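The evaluation metric can be sketched as strict entity-level matching, where a prediction counts only if both boundaries and type are exact. The function below is an illustrative simplification (hypothetical name; we do not claim it reproduces the official task scorers):

```python
def entity_f1(gold_spans, pred_spans):
    """Strict entity-level precision/recall/F1 over (start, end, type)
    spans: a prediction is a true positive only on an exact match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # exact boundary-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Strict matching is harsh on boundary errors, which is precisely why over-segmentation of long terms (analyzed in section 7) can hurt the reported scores.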
Overall, our models achieved the best scores, beating both mBERT and BETO significantly. The biomedical models showed a remarkable cross-domain transfer ability, achieving performances on the ICTUSnet task competitive with those of the bio-clinical models. Nonetheless, the bio-clinical models obtained the best performances, indicating that mixed-domain pretraining is a practical approach to mitigate the lack of large-scale real-world clinical data. These results confirm the suitability of pretraining from scratch in a scenario with medium-size resources. We finally point out that, although achieving the best task performance was not our primary objective, the results obtained are promising, suggesting that a more sophisticated classification layer could further improve task performance.
|- F1||89.48±0.60||89.62±0.57||89.47±0.33||89.85±0.47||87.46±0.23||88.18±0.41|
|- Precision||87.85±1.15||88.18±0.99||88.33±0.42||88.41±0.41||86.50±0.95||87.12±0.52|
|- Recall||91.18±0.74||91.11±0.40||90.64±0.76||91.35±0.58||88.46±0.57||89.28±0.69|
|- F1||83.87±0.41||83.00±0.17||82.85±0.36||83.23±0.34||82.61±0.67||82.42±0.06|
Evaluation scores (F1, Precision and Recall) for the PharmaCoNER, CANTEMIST and ICTUSnet NER tasks. We compare our biomedical models with two general-domain baseline models, namely multilingual BERT and BETO. Rows in light gray indicate that the biomedical model is performing cross-domain transfer on the ICTUSnet task, since it belongs to the clinical domain. Scores are averaged across 5 random runs, with standard deviations as error bars.
7 Discussion and Analysis
This section attempts to shed light on the evaluation results by conducting a vocabulary analysis and segmentation experiments.
7.1 Vocabulary overlap
We look at the evaluation results through the lens of the model's vocabulary. Undoubtedly, the vocabulary plays a crucial role during downstream transfer, since it is responsible for encoding the task-specific data. Intuitively, we expect that the greater the overlap between the vocabulary and the task's tokens, the better: a high overlap leverages more pretrained representations that could be beneficial for fine-tuning. Specifically, we hypothesize that the performance on each downstream task may be related to the number of vocabulary tokens used to encode the task's data. Therefore, we first tokenize the three NER tasks with each model's vocabulary and then calculate the vocabulary overlap with each task, expressed as a number of tokens. The results shown in Table 4 seem to support our hypothesis: the best evaluation scores for each task are obtained by the model with the maximum token overlap. Accordingly, the lower overlap exhibited by the baseline models may explain their lower evaluation scores.
|Model||PharmaCoNER||CANTEMIST||ICTUSnet|
|Bio-cli-52k||20,620 (40%)||22,001 (42%)||23,360 (45%)|
|Bio-cli-vocab-cli-52k||20,335 (39%)||23,095 (44%)||30,467 (59%)|
|Bio-52k||19,978 (38%)||20,951 (40%)||21,449 (41%)|
|Bio-30k||15,792 (53%)||16,302 (54%)||16,266 (54%)|
|BETO||12,829 (41%)||13,044 (42%)||13,388 (43%)|
|mBERT||11,084 (9%)||11,434 (9%)||13,187 (11%)|
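The overlap statistic reported in Table 4 can be computed as sketched below (hypothetical helper name; note that the percentage is relative to the model's vocabulary size, which is why the 30k-vocabulary model shows a higher percentage despite a lower count):

```python
def vocabulary_overlap(vocab, task_tokens):
    """Number and percentage of vocabulary entries actually used when
    encoding a task's data (sketch of the Table 4 statistic)."""
    used = set(task_tokens) & set(vocab)  # vocab entries the task exercises
    return len(used), 100.0 * len(used) / len(vocab)
```

Here `task_tokens` is the multiset of subwords obtained by running the model's own tokenizer over the task corpus.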
7.2 Impact of segmentation
Intuitively, it is reasonable to assume that a proper domain-specific segmentation should preserve the integrity of biomedical terms, minimizing the number of subword units required to encode them. As pointed out in [pubmedbert], models employing an out-of-domain vocabulary "are forced to divert parametrization capacity and training bandwidth to model biomedical terms using fragmented subwords". On the one hand, then, over-segmentation may have a negative influence on downstream performance. On the other hand, an under-segmentation that employs term-specific units could dramatically increase the vocabulary size; moreover, it could prevent the model from detecting relatedness between terms based on shared subwords, especially in morphologically rich terminology. Therefore, we conducted a meticulous analysis of domain-specific term segmentation to study the trade-off between under- and over-segmentation and shed light on its relation to the downstream NER performances.
7.2.1 Splitting terms
We analyzed, both qualitatively and quantitatively, how the different models under study segment biomedical terms. We retrieved all the biomedical terms used as annotations in the CANTEMIST, PharmaCoNER and ICTUSnet tasks and segmented them by applying each model's tokenizer. Table 7 shows the quality of segmentation on a random set of NER annotations, comparing mBERT, BETO and our best model, bio-cli-52k-vocab-cli. Table 5 shows the average number of subwords generated by each model, computed over all the NER annotations. As expected, mBERT and BETO split terms into many more pieces than our models on average. However, the variations across biomedical and clinical models also indicate that the vocabulary size and mixed-domain pretraining are relevant factors determining the final segmentation. Finally, Table 6 illustrates, for each model, how many terms are segmented into a given number of subwords (up to more than 4 subwords). Again, both mBERT and BETO tend to over-segment compared to our models. As an example, our best model (bio-cli-52k-vocab-cli) has roughly 50% of the PharmaCoNER annotations segmented into either one or two subwords, whereas mBERT and BETO have less than 20% of the annotations segmented into that many subwords.
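The fertility statistics behind Tables 5 and 6 can be sketched as follows (hypothetical helper; any tokenizer exposed as a `str -> list of subwords` callable can be plugged in):

```python
from collections import Counter

def segmentation_stats(terms, tokenize, max_bucket=4):
    """Return the average number of subwords per annotated term
    (Table 5) and a histogram of terms by subword count, pooling
    everything above `max_bucket` into one ">N" bucket (Table 6)."""
    lengths = [len(tokenize(t)) for t in terms]
    avg = sum(lengths) / len(lengths)
    hist = Counter(
        str(n) if n <= max_bucket else f">{max_bucket}" for n in lengths
    )
    return avg, hist
```

For the toy test below, terms are pre-split with "|" so the tokenizer is just `str.split`; in practice one would pass, e.g., a trained BPE tokenizer's encode function.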
7.2.2 Dissecting the F1 score
Finally, we seek a more precise relationship between the evaluation scores and a model's over-segmentation, expressed as the average number of subwords per term. Specifically, for each model and NER task, we group the annotations in the test split by the number of subwords they are split into and then recalculate the performance scores for each group of annotations. From the results presented in Figure 1, it can be observed that the F1 score decreases as the number of subwords increases. On average, this decreasing trend holds across the three tasks, with a more considerable variation starting from 7 subwords. The analysis suggests that over-segmentation should be avoided in order to obtain higher F1 scores, confirming the intuition that a helpful segmentation preserves term integrity. Note that, rather than criticising subword segmentation, we believe that further experiments unravelling the relationship between segmentation and downstream task performance could guide the design of an optimal biomedical segmentation.
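The grouping procedure behind Figure 1 can be sketched as follows (hypothetical names; `score_fn` stands in for the per-group F1 computation against the model's predictions):

```python
def score_by_subword_count(annotations, tokenize, score_fn):
    """Group test-set annotations by how many subwords the tokenizer
    splits them into, then score each group separately."""
    groups = {}
    for ann in annotations:
        groups.setdefault(len(tokenize(ann)), []).append(ann)
    # One score per subword-count bucket, in increasing bucket order.
    return {n: score_fn(group) for n, group in sorted(groups.items())}
```

Plotting the resulting scores against the bucket index yields the decreasing trend shown in Figure 1.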
8 Open Questions
In this section, we open up interesting questions motivated by the evaluation results and analysis presented in previous sections.
Is WWM better than SWM?
The comparison between the SWM and WWM pretraining techniques shows, in the case of the biomedical evaluation (see Table 2), that the impact of the latter is affected by the vocabulary size. In particular, the WWM technique shows consistent superiority only with a vocabulary size of 30k. This evidence contrasts with the finding, pointed out in [pubmedbert], that WWM is generally beneficial. We believe further ablation studies are necessary to elucidate the interplay between the vocabulary size and the masking mechanism.
Is mixed-domain pretraining beneficial?
The evaluation scores show that the bio-clinical models obtained the best performances across all tasks. Surprisingly, the bio-clinical model with the clinical vocabulary obtains the highest performance on the CANTEMIST and ICTUSnet test sets. These results suggest that mixed-domain data might not always degrade performance, questioning the finding presented in [pubmedbert], where the authors show the negative impact of mixed-domain pretraining with biomedical and computer science text, as applied in SciBERT [beltagy-etal-2019-scibert]. In our case, since we deal with two distinct but close domains, the biomedical and the clinical, we somewhat expect mixed-domain pretraining to profit from the additional training data. On the other hand, the results of the bio-cli-52k-vocab-cli model also highlight that, under the same training size conditions, the vocabulary plays an important role. In general, we believe further experiments are needed to understand the limitations of mixed-domain pretraining. In our evaluation scenario, however, we hypothesize that the reason behind the encouraging performances may be related to how much overlap the specific NER data has with each model's vocabulary, as partially supported by the results in Table 4.
9 Conclusions and Future Work
In this work, we trained the first biomedical and clinical transformer-based pretrained language models for Spanish. We then evaluated them on a set of NER tasks, including a demanding one based on real hospital discharge reports. Our models outperform two competitive baselines, namely mBERT and BETO, representing superior solutions for biomedical NLP applications in Spanish. Finally, we analyzed the results through an in-depth analysis of the models' vocabularies and segmentations. Throughout the work, we outlined some underexplored aspects of language model pretraining, such as the feasibility of the mixed-domain approach, the effectiveness of cross-domain transfer for clinical settings and the interplay between the vocabulary size and the token masking mechanism. Moreover, we showed the impact of different term segmentations on the evaluation scores, suggesting that over-segmentation can be detrimental for downstream tasks such as NER. Overall, we experimentally show that domain-specific pretraining has a greater positive impact than general-domain pretraining in a mid-resource scenario.
As future work, we suggest extending the evaluation to tasks other than NER, arguably the most studied task in the biomedical and clinical NLP literature but not the only relevant one. In addition, following the open questions raised in Section 8, we will perform more in-depth experiments to elucidate under which conditions mixed-domain pretraining is advantageous, and we will investigate the relationship between the vocabulary size and the masking mechanism.
This work was partially funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL, and the Future of Computing Center, a Barcelona Supercomputing Center and IBM initiative (2020).