1 Introduction
There has been a sharp increase in the number of research papers in the biomedical domain since the pandemic began. Scientists around the world are conducting experiments and clinical trials to learn more about the effects of the pandemic on global health and the economy. As a result, journals are flooded with biomedical literature, and it is getting difficult to find articles that are relevant, robust, and credible. According to various reports, over 100,000 papers have already been published on COVID-19 alone, and PubMed comprises over 30 million citations for biomedical literature. As reports on new discoveries and insights add to this already overwhelming amount of literature, the need for advanced computational tools for text mining and information extraction is more important than ever.
Recent progress in deep learning techniques for natural language processing (NLP) has led to significant advances on a wide range of tasks and applications. The domain of biomedical text mining has likewise seen improvement: performance in biomedical named entity recognition (BioNER), which automatically extracts entities such as diseases, genes/proteins, chemicals, and species, has substantially improved Hong and Lee (2020); Lee et al. (2020). BioNER can be used to build biomedical knowledge graphs, on which other NLP tasks such as relation extraction and question answering (QA) depend. Thus, improved BioNER performance can lead to better performance on these more complex NLP tasks. Named entities in biomedical literature have several characteristics that make their extraction from text particularly challenging Zhou et al. (2004), including descriptive naming conventions (e.g. ‘normal thymic epithelial cells’), abbreviations (e.g. ‘IL2’ for ‘Interleukin 2’), non-standardized naming conventions (e.g. ‘Nacetylcysteine’, ‘N-acetyl-cysteine’, ‘NAcetylCysteine’, etc.), and conjunction and disjunction (e.g. ‘91 and 84 kDa proteins’ comprises two entities, ‘91 kDa proteins’ and ‘84 kDa proteins’). Traditionally, NER models for biomedical literature performed well using feature engineering, i.e. carefully selecting features from the text. These features can be linguistic, orthographic, morphological, or contextual Campos et al. (2012). Selecting the right features to represent target entities requires expert knowledge and many trial-and-error experiments, is often time consuming, and leads to highly specialized models that work only in narrow domains.
Models based on convolutional neural networks were proposed to tackle sequence tagging problems Collobert et al. (2011). This kind of neural network architecture and learning algorithm reduced the need for domain-specific feature engineering. However, these networks could not exploit the earlier context that could improve named entity recognition. RNNs can capture earlier information through backpropagation, but they suffer from vanishing and exploding gradients and do not handle long-term dependencies well. The gradients carry the information for parameter updates, and the text sequences for NER are generally long; for longer sequences, gradients become vanishingly small, resulting in no updates of the weights Bengio et al. (1994). These problems are addressed by a special RNN architecture, Long Short-Term Memory (LSTM), capable of handling long-term dependencies Hochreiter and Schmidhuber (1997).
The BiLSTM-CRF neural architecture produces state-of-the-art performance on NER tasks. It comprises two components: a BiLSTM that predicts labels by capturing information from the text in both directions, and a CRF that computes transition compatibility between all possible pairs of labels on neighboring tokens. This architecture is now considered standard for sequence labeling problems Lample et al. (2016). It generally uses vector representations of words (word embeddings) as input to the LSTMs. Word2Vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) are popular context-independent vector representations of words. Character-level features of the text are often incorporated into the word embedding layer to improve the performance of NER models Kim et al. (2015).
The use of BiLSTM-CRFs along with suitable word embeddings led to significant improvements in the performance of NER models, and researchers started experimenting with this architecture for biomedical named entity recognition. Some models used character-level embeddings along with word embeddings pre-trained on a large entity-independent corpus (PubMed abstracts); these models outperformed the earlier state of the art for BioNER Habibi et al. (2017); Luo et al. (2018); Verwimp et al. (2017). All the word embeddings used until then were context-independent and cannot address the polysemous, context-dependent nature of words. The introduction of contextualized string embeddings such as flair embeddings Akbik et al. (2018) and ELMo Peters et al. (2018) solved this problem. These context-dependent word embeddings, when used with BiLSTM-CRFs, outperformed all previous models in named entity recognition. Transformer-based Vaswani et al. (2017) language representation models such as BERT Devlin et al. (2018) also achieved state-of-the-art performance in NER. However, applying these NLP methodologies to biomedical literature has limitations because of the different word distributions of general and biomedical corpora. Since recent language representation models are mostly trained on general-domain text, they often perform poorly on biomedical corpora. Recent state-of-the-art solutions have shown that using a language representation model pre-trained on biomedical corpora (such as PubMed abstracts and PMC full-text articles) gives the best results for biomedical named entity recognition Hong and Lee (2020); Lee et al. (2020).
This paper presents BioNerFlair, a novel architecture for biomedical named entity recognition. BioNerFlair uses contextualized string embeddings (flair, pre-trained on the biomedical domain) along with GloVe embeddings at the token embedding layer; a sequence tagger based on BiLSTM-CRFs is then used to extract named entities from biomedical literature. I evaluate the performance of BioNerFlair on eight benchmark datasets. BioNerFlair outperforms earlier state-of-the-art models on five datasets while showing performance close to the previous best models on the other three.
2 Materials and methods
The following sections present a description of the corpora used for evaluation. Furthermore, a technical description of the architecture used along with details of evaluation metrics is given.
2.1 Datasets
The statistics of the biomedical named entity recognition datasets are listed in Table 1. BioNerFlair is evaluated on eight standard corpora covering diseases, genes/proteins, drugs/chemicals, and species: the NCBI Doğan et al. (2014) and BC5CDR Li et al. (2016) corpora for disease, the BC5CDR Li et al. (2016) and BC4CHEMD Krallinger et al. (2015) corpora for drug/chemical, the BC2GM Smith et al. (2008) and JNLPBA Kim et al. (2004) corpora for gene/protein, and the LINNAEUS Gerner et al. (2010) and Species-800 Pafilis et al. (2013) corpora for species. These datasets are widely used by biomedical NLP researchers for testing BioNER models. All the datasets are tagged with the IOB tagging scheme. For fair comparison with other state-of-the-art techniques, the same training/validation/test split from earlier works Lee et al. (2020); Wang et al. (2019) is adopted.
Table 1. Statistics of the evaluation corpora: dataset, entity type, and number of annotations.
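The IOB scheme used by these corpora can be illustrated with a small, library-free sketch that converts IOB tags back into entity spans; the tokens, tags, and function name here are illustrative, not drawn from the corpora themselves:

```python
def iob_to_spans(tokens, tags):
    """Convert IOB-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # begin a new entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)           # continue the current entity
        else:                               # "O" or an inconsistent I- tag
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:                             # flush a trailing entity
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"]
tags   = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease"]
print(iob_to_spans(tokens, tags))
# -> [('BRCA1', 'Gene'), ('breast cancer', 'Disease')]
```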
2.2 Model architecture
BioNerFlair comprises three layers: a token embedding layer that produces contextualized vector representations of the input sequence, whose output is passed into a vanilla BiLSTM-CRF sequence labeler (a BiLSTM layer followed by a CRF layer), as depicted in Figure 2. This architecture gives state-of-the-art results on BioNER tasks.
2.2.1 Token embedding layer
The token embedding layer takes as input a sequence of tokens and outputs a fixed-dimensional vector representation of each token. The output is the concatenation (Equation 1) of pre-computed GloVe embeddings Pennington et al. (2014) and contextualized flair embeddings Akbik et al. (2018) pre-trained on roughly 3 million full texts and about 25 million abstracts from PubMed:

e_t = [e_t(GloVe) ; e_t(flair)]    (1)

Analysis by Akbik et al. (2018) shows that combining flair embeddings with classic word embeddings improves the performance of NER models; in BioNerFlair, GloVe embeddings are therefore combined with flair embeddings.
Flair embeddings are contextualized character-level word embeddings that combine the best attributes of different kinds of embeddings. Since recent studies Lee et al. (2020); Dang et al. (2018) show that pre-training on biomedical corpora significantly improves the performance of BioNER models, this study uses a flair embedding model pre-trained on biomedical data, which appears to capture latent syntactic and semantic similarities. Flair embeddings produce a vector representation from hidden states computed not only over the characters of a word but also over the characters of its surrounding context, as illustrated in Figure 1. Because the flair embedding is pre-trained on biomedical corpora and extracts context from linguistic features at the character level, it handles the rare, misspelled, and inconsistently named words that occur frequently in biomedical literature very well.
2.2.2 Bidirectional Long Short-Term Memory (BiLSTM)
A Long Short-Term Memory network (LSTM) is a special kind of RNN introduced by Hochreiter and Schmidhuber (1997), explicitly designed to avoid the long-term dependency problem. LSTMs do not suffer from the vanishing and exploding gradient problems; unlike plain RNNs, they can therefore remember information over long periods of time. LSTMs are equipped with memory cells and an adaptive gating mechanism that regulates the information added to or removed from the memory cells. A typical LSTM has three gates: a sigmoid layer that decides what information to remove (forget gate), a combination of a sigmoid and a tanh layer that decides what new information to add (input gate), and another sigmoid layer that decides the output (output gate). The LSTM memory cell is implemented using the following equations:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

In the above equations, σ denotes the logistic sigmoid function, ⊙ denotes element-wise multiplication, and i, f, o, and c are the input gate, forget gate, output gate, and cell vectors. In BioNerFlair, the final word embeddings are passed into a BiLSTM network, as it captures both past and future features efficiently for a given time step Lample et al. (2016); Graves et al. (2013); Huang et al. (2015). The bidirectional LSTM network is trained using backpropagation through time Boden (2002).
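The memory-cell update above can be sketched as a single NumPy step. This is a toy illustration of the gate equations under randomly initialized parameters, not the actual tagger implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM memory-cell update; W, U, b hold per-gate parameters."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f * c_prev + i * c_tilde   # new cell state
    h_t = o * np.tanh(c_t)           # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
W = {g: rng.normal(size=(d_hid, d_in)) for g in "ifoc"}
U = {g: rng.normal(size=(d_hid, d_hid)) for g in "ifoc"}
b = {g: np.zeros(d_hid) for g in "ifoc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
```

A bidirectional LSTM simply runs one such recurrence left-to-right and another right-to-left, concatenating the two hidden states per token.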
2.2.3 Conditional Random Fields
Conditional Random Fields (CRFs) Lafferty et al. (2001) are a probabilistic discriminative sequence modeling framework that brings in all the advantages of MEMMs Ratnaparkhi (1996); McCallum et al. (2000) while also solving the label bias problem.
Given a training dataset of data sequences x(1), …, x(N) to be labeled and their corresponding label sequences y(1), …, y(N), CRFs maximize the log-likelihood of the conditional probability of the label sequences given their data sequences, that is:

L(θ) = Σ_n log p_θ(y(n) | x(n))
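For a linear-chain CRF, the conditional log-probability in this objective can be computed exactly with the forward algorithm. A small NumPy sketch with toy emission and transition scores (an illustration of the math, not the paper's implementation):

```python
import numpy as np
from itertools import product

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))
    return s.squeeze() if axis is None else s.squeeze(axis=axis)

def crf_log_likelihood(emissions, transitions, labels):
    """log p(labels | x) for a linear-chain CRF.

    emissions: (T, K) per-token label scores;
    transitions: (K, K) score for moving from label i to label j."""
    T, K = emissions.shape
    # unnormalized score of the given label sequence
    score = emissions[0, labels[0]]
    for t in range(1, T):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    # log partition function Z via the forward recursion
    alpha = emissions[0].copy()
    for t in range(1, T):
        alpha = emissions[t] + logsumexp(alpha[:, None] + transitions, axis=0)
    return score - logsumexp(alpha)

# toy scores: 3 tokens, 2 labels
rng = np.random.default_rng(1)
em, tr = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
ll = crf_log_likelihood(em, tr, [0, 1, 1])
```

Summing exp(log-likelihood) over all possible label sequences recovers exactly 1, which is a convenient correctness check for the forward recursion.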
2.3 Evaluation metrics
The performance of BioNerFlair is evaluated by training a model for each dataset. I used the pre-processed versions of the BioNER datasets provided by Lee et al. (2020), with the same data split for training and testing the models. Models are evaluated using precision (P), recall (R), and F1 score on the test corpora. A predicted entity is counted as correct if and only if both its entity type and its boundary exactly match an annotation in the test data. Precision and recall are computed from true positives (TP), false positives (FP), and false negatives (FN). All calculations are done using the flair NLP library.
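Under this exact-match criterion, precision, recall, and F1 follow directly from the TP/FP/FN counts. A library-free sketch (the paper itself relies on flair's evaluator; the span representation here is an assumption for illustration):

```python
def exact_match_prf(gold, predicted):
    """gold, predicted: sets of (start, end, entity_type) spans.

    A prediction counts as a TP only if boundary AND type match exactly."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(2, 3, "Gene"), (5, 7, "Disease")}
pred = {(2, 3, "Gene"), (5, 6, "Disease")}   # wrong boundary -> one FP and one FN
p, r, f1 = exact_match_prf(gold, pred)        # p = r = f1 = 0.5
```

Note how a boundary error is doubly penalized: the mismatched prediction is a false positive and the missed gold span is a false negative.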
3 Results and discussion
3.1 Experimental setups
All models are trained using the Flair NLP library, a simple framework for state-of-the-art NLP tasks built directly upon PyTorch. I used a GPU (12 GB) provided for free by Google Colab to train the models. The maximum sequence length was set to 512 to get the best training speed without running out of GPU memory, and the mini-batch size for all experiments was set to 32.
Model training starts with an initial learning rate of 0.1, a patience of 3, and an annealing factor of 0.5. A high learning rate of 0.1 works well at the start when using the stochastic gradient descent optimizer and is gradually reduced as the model converges. The flair embeddings dropout is set to 0.5. These hyperparameters are the same for all models. Because of the small size of the training data and the fast GPU, the training time of most models was less than an hour. For the BC4CHEMD dataset, however, the model could not fit into GPU memory, which increased training time to around 5 hours.
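The annealing schedule described above (halve the learning rate once the validation score stops improving for the patience window) can be sketched as follows. This is a simplified, hypothetical mirror of what flair's trainer does internally; the function and variable names are illustrative:

```python
def anneal_lr(dev_scores, initial_lr=0.1, patience=3, anneal_factor=0.5):
    """Return the learning rate used at each epoch, given dev-set scores."""
    lr, best, bad_epochs, lrs = initial_lr, float("-inf"), 0, []
    for score in dev_scores:
        lrs.append(lr)
        if score > best:                  # validation improved: reset counter
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:    # plateau reached: anneal
                lr *= anneal_factor       # e.g. 0.1 -> 0.05
                bad_epochs = 0
    return lrs

# dev F1 plateaus for three epochs, so the rate is halved before epoch 7
print(anneal_lr([1, 2, 3, 3, 3, 3, 4]))
```

Together with the initial rate of 0.1, patience of 3, and annealing factor of 0.5, this reproduces the "start high, decay on plateau" behavior the paper relies on.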
The Flair NLP library also ships with HunFlair Weber et al. (2020), a NER tagger for biomedical text. HunFlair comes with models for genes/proteins, chemicals, diseases, species, and cell lines. HunFlair models are trained on multiple datasets at the same time, due to which they outperform tools like SciSpacy Neumann et al. (2019) on unseen text but do not give state-of-the-art results on gold-standard datasets. In BioNerFlair, I trained models from scratch for each dataset, giving the results reported below. I also tried to fine-tune the HunFlair models on each target corpus, but those models do not fit within 12 GB of GPU memory.
3.2 Experimental results
Results of BioNerFlair on the different datasets are shown in Table 2, where its performance is compared with other recent state-of-the-art methods. BioNerFlair outperformed state-of-the-art methods on five of the eight datasets while showing near-best performance on the remaining three. The biggest improvement is in the gene/protein category: BioNerFlair achieves the best F1 score of 90.17, up from 84.72, on the BC2GM corpus, and an F1 score of 88.73, up from 78.58, on the JNLPBA corpus. For the species category, BioNerFlair achieves the best F1 score of 85.48, up from 74.98, on the Species-800 corpus, while getting the second-best score on the LINNAEUS corpus. The same pattern holds for the disease and drug/chemical categories, where BioNerFlair achieves state-of-the-art results on one dataset while getting a near-best score on the other. Although BioNerFlair does not achieve the best results on the BC5CDR corpus for disease and chemical, the results are still competitive with other recent methods, and significant improvements can be seen on the other datasets.
3.3 Use of different word embeddings
In BioNerFlair, I use GloVe embeddings and flair embeddings at the token embedding layer. The Flair NLP library provides stacked embeddings, which allow different embeddings to be combined: classic word embeddings, character embeddings, contextualized word embeddings, and pre-trained transformer embeddings. We can therefore experiment with different combinations of embeddings for sequence labeling tasks. The initial plan for this experiment was to use the concatenation of XLNet Yang et al. (2019), GloVe embeddings, and the pooled variant of flair embeddings Akbik et al. (2019). However, this combination requires a lot of GPU memory, which is why I used the combination of embeddings described above. With more resources, the performance of BioNER models could possibly be improved further.
In conclusion, this article presents BioNerFlair, a method for training biomedical named entity recognition models using flair plus GloVe embeddings and a BiLSTM-CRF sequence tagger. It shows that using contextualized word embeddings pre-trained on biomedical corpora significantly improves the results of BioNER models. I evaluated the performance of BioNerFlair on eight datasets, achieving state-of-the-art results on five of them. For future work, I plan to experiment with different contextualized and transformer-based word embeddings to further improve the performance of biomedical named entity recognition models.
I would like to thank the Department of Computer Science and Engineering, Medi-Caps University for the support. I also thank the anonymous reviewers for their comments and suggestions.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Availability and implementation
Source code and data are available at https://github.com/harshpatel1014/-BioNerFlair
Conflict of interest statement
Declarations of interest: none
- Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 724–728. Cited by: §3.3.
- Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Cited by: §1, §2.2.1.
- Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §1.
- A guide to recurrent neural networks and backpropagation. The Dallas project. Cited by: §2.2.2.
- Biomedical named entity recognition: a survey of machine-learning tools. Theory and Applications for Advanced Text Mining, pp. 175–195. Cited by: §1.
- Natural language processing (almost) from scratch. Journal of machine learning research 12 (ARTICLE), pp. 2493–2537. Cited by: §1.
- D3NER: biomedical named entity recognition using crf-bilstm improved with fine-tuned embeddings of various linguistic information. Bioinformatics 34 (20), pp. 3539–3546. Cited by: §2.2.1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
- NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics 47, pp. 1–10. Cited by: §2.1.
- LINNAEUS: a species name identification system for biomedical literature. BMC bioinformatics 11 (1), pp. 85. Cited by: §2.1.
- Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 34 (23), pp. 4087–4094. Cited by: Table 2.
- Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §2.2.2.
- Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33 (14), pp. i37–i48. Cited by: §1, Table 1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §2.2.2.
- DTranNER: biomedical named entity recognition with deep learning-based label-label transition model. BMC bioinformatics 21 (1), pp. 53. Cited by: §1, §1, Table 2.
- Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §2.2.2.
- Introduction to the bio-entity recognition task at jnlpba. In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp. 70–75. Cited by: §2.1.
- Character-aware neural language models. arXiv preprint arXiv:1508.06615. Cited by: §1.
- The chemdner corpus of chemicals and drugs and its annotation principles. Journal of cheminformatics 7 (1), pp. 1–17. Cited by: §2.1.
- Conditional random fields: probabilistic models for segmenting and labeling sequence data. Cited by: §2.2.3.
- Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360. Cited by: §1, §2.2.2.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §1, §1, §2.1, §2.2.1, §2.3, Table 1, Table 2.
- BioCreative v cdr task corpus: a resource for chemical disease relation extraction. Database 2016. Cited by: §2.1.
- A transition-based joint model for disease named entity recognition and normalization. Bioinformatics 33 (15), pp. 2363–2371. Cited by: Table 2.
- An attention-based bilstm-crf approach to document-level chemical named entity recognition. Bioinformatics 34 (8), pp. 1381–1388. Cited by: §1, Table 2.
- Maximum entropy markov models for information extraction and segmentation. In ICML, Vol. 17, pp. 591–598. Cited by: §2.2.3.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
- Scispacy: fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669. Cited by: §3.1.
- The species and organisms resources for fast and accurate identification of taxonomic names in text. PloS one 8 (6), pp. e65390. Cited by: §2.1.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §2.2.1.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §1.
- A maximum entropy model for part-of-speech tagging. In Conference on empirical methods in natural language processing, Cited by: §2.2.3.
- Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition. In Machine Learning for Healthcare Conference, pp. 383–402. Cited by: Table 2.
- Overview of biocreative ii gene mention recognition. Genome biology 9 (S2), pp. S2. Cited by: §2.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
- Character-word lstm language models. arXiv preprint arXiv:1704.02813. Cited by: §1.
- Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 35 (10), pp. 1745–1752. Cited by: §2.1.
- HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition. arXiv preprint arXiv:2008.07347. Cited by: §3.1.
- Document-level attention-based bilstm-crf incorporating disease dictionary for disease named entity recognition. Computers in biology and medicine 108, pp. 122–132. Cited by: Table 2.
- Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §3.3.
- Collabonet: collaboration of deep neural networks for biomedical named entity recognition. BMC bioinformatics 20 (10), pp. 249. Cited by: Table 2.
- Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 20 (7), pp. 1178–1190. Cited by: §1.
- Clinical concept extraction with contextual word embedding. arXiv preprint arXiv:1810.10566. Cited by: Table 1.