CollaboNet for Biomedical Named Entity Recognition
Background: Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition. Results: To address the lack of data and the entity type misclassification problem, we propose CollaboNet which utilizes a combination of multiple NER models. In CollaboNet, models trained on a different dataset are connected to each other so that a target model obtains information from other collaborator models to reduce false positives. Every model is an expert on their target entity type and takes turns serving as a target and a collaborator model during training time. The experimental results show that CollaboNet can be used to greatly reduce the number of false positives and misclassified entities including polysemous words. CollaboNet achieved state-of-the-art performance in terms of precision, recall and F1 score. Conclusions: We demonstrated the benefits of combining multiple models for BioNER. Our model has successfully reduced the number of misclassified entities and improved the performance by leveraging multiple datasets annotated for different entity types. Given the state-of-the-art performance of our model, we believe that CollaboNet can improve the accuracy of downstream biomedical text mining applications such as bio-entity relation extraction.
The amount of biomedical text continues to increase rapidly. There were 4.7 million full-text articles accessible online in PubMed Central in 2017. One of the obstacles in utilizing biomedical text data is that it is too large for a human to read or even search for needed information. This has led to the demand for automated extraction of valuable information. Text mining can be used to turn the time-consuming task into a fully automated job [2, 3, 4, 5, 6, 7].
Named Entity Recognition (NER) is the computerized procedure of recognizing and labeling entities in given texts. In the biomedical domain, typical entity types include disease, chemical, gene and protein.
Biomedical named entity recognition (BioNER) is an essential building block of many downstream text mining applications such as extracting drug-drug interactions  and disease-treatment relations . BioNER is also used when building a sophisticated biomedical entity search tool  that enables users to pose complex queries to search for bio-entities.
NER in biomedical text mining is focused mainly on dictionary-, rule-, and machine learning-based approaches [11, 12, 13, 14, 15, 16]. Dictionary-based systems have a simple and intuitive structure but they cannot handle unseen entities or polysemous words, resulting in low recall [11, 12]. Moreover, building and maintaining a comprehensive and up-to-date dictionary involves a considerable amount of manual work. The rule-based approach is more scalable, but it needs hand-crafted feature sets to fit a model to a dataset [13, 14]. These rule- and dictionary-based approaches can achieve high precision but can produce incorrect predictions when a new word, which is not in the training data, appears in a sentence (out-of-vocabulary problem). This out-of-vocabulary problem occurs frequently in the biomedical domain, as it is common for a new biomedical term, such as a new drug name, to be registered in this domain.
Recently, studies have demonstrated the effectiveness of deep learning based methods. Sahu and Anand demonstrated the efficiency of a Recurrent Neural Network (RNN) for NER in biomedical text. The model by Sahu and Anand is composed of a bidirectional Long Short-Term Memory Network (BiLSTM) and Conditional Random Field (CRF). Sahu and Anand also used character level word embeddings but could not demonstrate their benefits. Habibi et al. combined the BiLSTM-CRF model implementation of Lample et al. and the word embeddings of Pyysalo et al. Habibi et al. utilized character level word embeddings to capture characteristics, such as orthographic features, of biomedical entities and achieved state-of-the-art performance, demonstrating the effectiveness of character level word embeddings in BioNER.
Although these models showed some promising results, NER is still a very challenging task in the biomedical domain for the following reasons. First, a limited amount of training data is available for BioNER tasks. Gold-standard datasets contain annotations of one or two entity types. For example, the NCBI corpus  includes annotations of diseases but not of other types of entities such as genes and proteins. On the other hand, the JNLPBA corpus  contains annotations of only genes and proteins. Therefore, the data for each entity type comprises only a small portion of the total amount of annotated data.
Multi-task learning (MTL) is a method for training a single model for multiple tasks at the same time. MTL can leverage different datasets that are collected for different but related tasks. Although extracting genes is different from extracting chemicals, both tasks require learning some common features that can help understand the linguistic expressions of biomedical texts. Crichton et al. developed an MTL model that was trained on various source datasets containing annotations of different subsets of entity types. An MTL model by Wang et al. achieved performance comparable to that of the state-of-the-art single task NER models. Inspired by the previous studies, we propose CollaboNet which uses the collaboration of multiple models. Unlike the conventional MTL methods which use only a single static model, CollaboNet is composed of multiple models trained on different datasets for different tasks. Each model in CollaboNet is trained on a dataset annotated for a specific type of entity and becomes an expert on its own entity type.
Despite the high recall obtained by the MTL based models, the precision of these models is relatively low. Since MTL based models are trained on multiple types of entities and larger training data, they have a broader coverage of various biomedical entities, which naturally results in high recall. On the other hand, as the MTL models are trained on combinations of different entity types, they tend to have difficulty in differentiating among entity types, resulting in lower precision.
Another reason NER is difficult in the biomedical domain is that an entity could be labeled as different entity types depending on its textual context. In our experiments, we observed that many incorrect predictions were a result of the polysemy problem, in which a word, for example, can be used as both a gene and disease name. Models designed to predict disease entities misidentify some genes as diseases. This misidentification of entity types increases the false positive rate. For instance, BiLSTM-CRF based models for disease entities mistakenly label the gene name “BRCA1” as a disease entity because there are disease names such as “BRCA1 abnormalities” or “Brca1-deficient” in the training set. Besides, the training set that annotates “VHL” (Von Hippel-Lindau disease) as a disease entity confuses the models because VHL is also used as a gene name, since the mutation of this gene causes VHL disease.
To solve the false positive problems due to polysemous words, CollaboNet aggregates the results of collaborator models, and uses them as an additional input to the target model. Consider the case of predicting the disease entity VHL utilizing the outputs of gene and chemical models. Once a gene model predicts VHL as a gene, the gene model informs a disease model that VHL is a gene entity so that the disease model will not predict VHL as a disease. In CollaboNet, each model is individually trained on an entity type and then further trained on the outputs of other models that are trained on the other entity types. The models in CollaboNet take turns in being the target and collaborator models during training. Consequently, each model is an expert in its own domain and helps improve the accuracy by leveraging the multi-domain information from the other models.
In the following section, we first discuss a BiLSTM-CRF model for biomedical named entity recognition. The overall structure of the BiLSTM-CRF model is illustrated in Figure 1. Next, we introduce the structure of CollaboNet, which is comprised of a set of BiLSTM-CRF models as shown in Figure 2.
Named entity recognition involves annotating words in a sentence as named entities. More formally, given an input sequence $S = (w_1, \dots, w_T)$, we predict corresponding labels $Y = (y_1, \dots, y_T)$. We use the BIOES scheme for representing $Y$, where B stands for Beginning, I for Inside, O for Outside, E for End, and S for Single.
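To make the BIOES scheme concrete, the following Python sketch converts entity spans into tags. The function name, the span representation (inclusive token indices), and the example sentence are assumptions made for illustration and are not taken from the CollaboNet code.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert non-overlapping entity spans into BIOES tags.

    spans: list of (start, end) token indices, with end inclusive.
    """
    tags = ["O"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"            # single-token entity
        else:
            tags[start] = "B"            # first token of the entity
            for i in range(start + 1, end):
                tags[i] = "I"            # tokens strictly inside
            tags[end] = "E"              # last token of the entity
    return tags


# Hypothetical sentence with two disease mentions.
tokens = ["Patients", "with", "von", "Hippel-Lindau", "disease", "or", "cancer"]
tags = spans_to_bioes(len(tokens), [(2, 4), (6, 6)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A multi-token mention receives B, I, and E tags, while a single-token mention receives S, so entity boundaries can be read off the tag sequence alone.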
Word embedding is an effective way of representing words. As word embeddings capture semantic and syntactic meanings of words, they have been widely used in various natural language processing tasks including named entity recognition. The experiment of Habibi et al. showed that word embeddings trained on biomedical corpora notably improved the performance of BioNER models. Pyysalo et al. were the first to suggest training word embeddings on biomedical corpora from PubMed, PubMed Central (PMC), and Wikipedia. The results of Pyysalo et al. and Habibi et al. suggest that using word embeddings trained on biomedical corpora is essential for BioNER. We also use the trained word embeddings provided by Pyysalo et al. For each word $w_t$ in a sequence $S$, we denote the word represented by a word embedding as $e^{w}_t \in \mathbb{R}^{d_w}$, where $d_w$ is the dimension of the word embedding.
To give our model character level morphological information (e.g., ‘-ase’ is common in protein entities), we also leverage the character level information of each word. We build character level word embeddings (CLWEs) using a convolution neural network (CNN), similar to the work of Santos and Zadrozny. Given a word $w$ composed of $M$ characters, we represent $w = (c_1, \dots, c_M)$, where $c_j \in \mathbb{R}^{d_c}$ is a randomly initialized character embedding for each unique character. Note that unlike the word embeddings trained on separate biomedical corpora, character embeddings are learned from only the BioNER task. For the CNN, padding of the proper size ($(k-1)/2$) according to window size $k$ should be attached before and after each word. We obtain a window vector $v_j$ by simply concatenating the character embedding $c_j$ with the character embeddings of the $(k-1)/2$ characters on both sides:

$$v_j = \left[c_{j-(k-1)/2}; \dots; c_j; \dots; c_{j+(k-1)/2}\right]$$

From the window vector $v_j$, we perform a convolution operation as follows:

$$z_j = W^{c} v_j + b^{c}$$

where $W^{c}$ and $b^{c}$ denote a trainable filter and bias, respectively. We obtain the element-wise maximum values over all positions $j$, and the output is a character level word embedding denoted as $e^{c}$. We concatenate the character level word embedding with the word embedding trained on biomedical corpora, $e^{w}$, as $x = [e^{w}; e^{c}]$ to utilize both representations in our model.
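The character-level convolution described above can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the function name, the zero-vector padding, and the toy dimensions are hypothetical, and a real implementation would use a tensor library with trained parameters.

```python
import random


def char_cnn_embedding(word, char_emb, filters, bias, k):
    """Character-level word embedding via one convolution and max pooling.

    char_emb: dict mapping a character to its embedding (list of d_c floats).
    filters:  list of rows, each of length k * d_c (one row per output unit).
    bias:     list with one bias per output unit.
    k:        odd window size.
    """
    d_c = len(next(iter(char_emb.values())))
    half = (k - 1) // 2
    pad = [0.0] * d_c                       # assumed zero padding
    chars = [pad] * half + [char_emb[ch] for ch in word] + [pad] * half
    outputs = []
    for j in range(len(word)):
        # Window vector: concatenation of k neighbouring character embeddings.
        window = [v for c in chars[j:j + k] for v in c]
        outputs.append([
            sum(w * v for w, v in zip(row, window)) + b
            for row, b in zip(filters, bias)
        ])
    # Element-wise maximum over all character positions.
    return [max(column) for column in zip(*outputs)]


rng = random.Random(0)
d_c, k, n_out = 4, 3, 5                     # toy sizes, not the paper's settings
char_emb = {ch: [rng.uniform(-1, 1) for _ in range(d_c)] for ch in "kinase"}
filters = [[rng.uniform(-1, 1) for _ in range(k * d_c)] for _ in range(n_out)]
bias = [0.0] * n_out
clwe = char_cnn_embedding("kinase", char_emb, filters, bias, k)
print(len(clwe))
```

Because the maximum is taken over positions, the output length depends only on the number of filters and not on the length of the word.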
A Recurrent Neural Network (RNN) is a neural network that effectively handles variable-length inputs. RNNs have proven to be useful in various natural language processing tasks including language modeling, speech recognition and machine translation [28, 29, 30]. Long Short-Term Memory (LSTM) is one of the most frequently used variants of recurrent neural networks. Our model uses the LSTM architecture from Graves et al. Given the outputs of an embedding layer $x_1, \dots, x_T$, the LSTM computes a hidden state $h_t$ at each time step $t$ from its input, forget and output gates and its memory cell.
In these computations, $\sigma$ and $\tanh$ denote a logistic sigmoid function and a hyperbolic tangent function, respectively, and $\odot$ is an element-wise product. We use a forward LSTM that extracts the representations of inputs in the forward direction, and we use a backward LSTM that represents the inputs in the backward direction.
We concatenate the two states coming from the forward LSTM and the backward LSTM to form the hidden states of the bi-directional LSTM (BiLSTM). BiLSTM, proposed by Schuster and Paliwal, has been extensively used in various sequence encoding tasks. We obtain a set of hidden states $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are hidden states of forward and backward LSTMs, respectively, at a time step $t$.
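As an illustration of the recurrence and of how forward and backward states are concatenated, here is a small pure-Python sketch. It assumes a common LSTM formulation without peephole connections, which may differ in detail from the variant used by the authors; all function names, the parameter layout, and the toy sizes are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def init_params(in_dim, hidden, rng):
    """Random parameters for gates i, f, o and the candidate cell g."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(hidden, in_dim), mat(hidden, hidden), [0.0] * hidden) for g in "ifog"}


def lstm_step(x, h_prev, c_prev, params):
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate cell
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(xs, params, hidden):
    h, c = [0.0] * hidden, [0.0] * hidden
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        states.append(h)
    return states


def bilstm(xs, p_fwd, p_bwd, hidden):
    fwd = run_lstm(xs, p_fwd, hidden)
    bwd = run_lstm(xs[::-1], p_bwd, hidden)[::-1]   # re-align to input order
    return [f + b for f, b in zip(fwd, bwd)]        # concatenate per time step


rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
p_fwd, p_bwd = init_params(3, 2, rng), init_params(3, 2, rng)
states = bilstm(xs, p_fwd, p_bwd, hidden=2)
print(len(states), len(states[0]))
```

Each concatenated state has twice the hidden size, since it carries the left context from the forward pass and the right context from the backward pass.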
While BiLSTM handles long term dependency problems as well as backward dependency issues, modeling dependencies among adjacent output tags helps improve the performance of the sequence labeling models . We applied a Conditional Random Field (CRF) to the output layer of the BiLSTM to capture these dependencies.
First, we compute the probability of each label $y_t$ given the sequence $S$ as follows:

$$p(y_t \mid S) = \mathrm{softmax}(W_y h_t + b_y)$$

where $h_t$ is the BiLSTM hidden state at time step $t$, $W_y$ and $b_y$ are parameters of the fully connected layer for BIOES tags, and the $\mathrm{softmax}$ function computes the probability of each tag. Based on the probability and the CRF layer, our training objective to minimize combines two terms, $\mathcal{L}_{CE}$ and $\mathcal{L}_{LL}$, where $\mathcal{L}_{CE}$ is the cross entropy loss for the label $y_t$, and $\mathcal{L}_{LL}$ is the negative sentence-level log likelihood. The score of a tag is the summation of the transition score and the emission score from our LSTM at time step $t$.
At test time, we use Viterbi decoding to find the most probable sequence given the outputs of the BiLSTM-CRF model.
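Viterbi decoding over emission and transition scores can be sketched as below. This is a generic textbook version, not the authors' implementation; the function name and the toy scores are illustrative assumptions.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions:   T x K list of per-step tag scores.
    transitions: K x K list, transitions[i][j] = score of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
free = [[0.0, 0.0], [0.0, 0.0]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]     # penalize switching tags
print(viterbi(emissions, free))
print(viterbi(emissions, sticky))
```

With zero transition scores the decoder simply follows the best emission at each step, whereas a strong penalty on switching tags keeps the path on a single tag, which is how transition scores discourage invalid tag sequences.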
CollaboNet, our novel NER model, is composed of multiple BiLSTM-CRF models, and following existing terminology, we call each BiLSTM-CRF model a single-task model (STM). In CollaboNet, each STM is trained on a specific dataset as illustrated in Figure 2, and each STM is regarded as an expert on a particular entity type. These experts help each other since the knowledge of each expert is transferred to all the other experts. Training CollaboNet consists of phases, and in each phase, except for the first preparation phase, every STM is trained on a single dataset for one epoch.
More formally, let us denote a set of datasets as $\mathcal{D} = \{D_1, \dots, D_N\}$, and a single-task model as $m_i^k$, which is trained on the $i$-th dataset in phase $k$. In the preparation phase ($k = 0$) of CollaboNet, each STM is trained independently on a corresponding dataset until the performance of each model converges.
Note that an STM in the preparation phase ($k = 0$) is the same as a single BiLSTM-CRF model. In the preparation phase, we assume that each model has obtained the maximum amount of knowledge about the $i$-th dataset.
In the subsequent phases $k$, where $k \geq 1$, we select an STM $m_i^k$ which is an expert on the dataset $D_i$. We refer to the selected STM as the target model, and the remaining STMs as the collaborator models. To train the target model $m_i^k$, we use inputs from the target dataset $D_i$ and outputs from the collaborator models $m_j$ ($j \neq i$). We train each STM on its dataset for one epoch, and change the target STM as follows:

$$\tilde{x}_i = x_i \oplus \mathrm{AGG}_{j \neq i}\left(\alpha \, h_j(x_i)\right) \qquad (14)$$

where $\oplus$ denotes concatenation and $\mathrm{AGG}$ denotes an aggregation operation such as max pooling or concatenation. We used weighted max pooling for the aggregation operation. $x_i$ is the input sequences of the $i$-th dataset, and $h_j(x_i)$ is the output of the $j$-th collaborator model on those inputs, defined by Equation 7. When aggregating the results of collaborator models, we multiply each of the results by a weighting factor $\alpha$. The results $\tilde{x}_i$ are used to train the model $m_i^k$. Using the outputs obtained by Equation 14, we train $m_i^k$ for one epoch, and it becomes $m_i^{k+1}$ in the next phase. The CRF layer is attached to the final output of $m_i^k$. Once we iterate over all the target datasets in $\mathcal{D}$, the next phase begins.
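The aggregation step can be illustrated with a short Python sketch of weighted max pooling followed by concatenation with the target model's own inputs. The function names, the use of a single shared weighting factor, and the toy numbers are assumptions for illustration; the released CollaboNet code may organize this differently.

```python
def weighted_max_pool(collab_outputs, alpha):
    """Aggregate collaborator outputs by element-wise max after scaling.

    collab_outputs: list over collaborator models; each entry is a T x H list
                    of per-token output vectors for the same sentence.
    alpha:          weighting factor applied to every collaborator output.
    """
    num_tokens = len(collab_outputs[0])
    pooled = []
    for t in range(num_tokens):
        scaled = [[alpha * v for v in output[t]] for output in collab_outputs]
        pooled.append([max(column) for column in zip(*scaled)])
    return pooled


def target_input(x, collab_outputs, alpha):
    """Concatenate each token's input vector with the aggregated collaborator vector."""
    pooled = weighted_max_pool(collab_outputs, alpha)
    return [x_t + agg_t for x_t, agg_t in zip(x, pooled)]


# One sentence with two tokens, two collaborators, output size 2 (toy values).
x = [[1.0, 2.0], [3.0, 4.0]]
gene_out = [[2.0, 8.0], [0.0, 0.0]]
chem_out = [[6.0, 4.0], [2.0, 0.0]]
augmented = target_input(x, [gene_out, chem_out], alpha=0.5)
print(augmented)
```

Max pooling keeps the size of the aggregated vector fixed regardless of the number of collaborators, so adding another expert model does not change the input dimension of the target model.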
We used 5 datasets (BC2GM, BC4CHEMD, BC5CDR [35, 36, 37, 38], JNLPBA, NCBI), all of which were collected by Crichton et al. Each of the 5 datasets was constructed from MEDLINE abstracts, and we used the BIOES notation format for named entity labels. Each dataset focuses on one of the three biomedical entity types: disease, chemical, and gene/protein. We did not use cell-type entity tags from JNLPBA for the entity types.
All the datasets are comprised of pairs of input sentences and biomedical entity labels for the sentences. While the JNLPBA dataset has only training and test sets, the other four datasets contain training, development and test sets. For JNLPBA, we used part of its training set as its development set which is the same size as its test set. Also, we found that the JNLPBA dataset from Crichton et al.  contained sentences that were incorrectly split. So we preprocessed the original dataset by Kim et al.  with a more accurate sentence separation.
The BC5CDR dataset has the sub-datasets BC5CDR-chem, BC5CDR-disease and BC5CDR-both, and they contain chemical entity types, disease entity types, and both entity types, respectively. We reported the performance on BC5CDR-chem and BC5CDR-disease. We have a total of six datasets: BC2GM, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, JNLPBA, and NCBI.
For the evaluation of the named entity recognition task, true positives are counted from exact matches between predicted entity spans and ground truth spans based on the BIOES notation.
We also designed and applied a simple post-processing step that corrects invalid BIOES sequences. This simple step improved precision by about 0.1% to 0.5%, and thus boosted the F1 score by about 0.04% to 0.3%.
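The text does not spell out the repair rules used in this post-processing step. The sketch below shows one plausible set of rules, which is an assumption and not the authors' procedure: an entity left open is closed at its last token, and an I or E tag with no preceding B is treated as the start of an entity.

```python
def repair_bioes(tags):
    """Return a copy of tags in which every entity is either S or B I* E.

    The rules here are hypothetical and only illustrate the idea.
    """
    fixed = list(tags)
    inside = False                      # True while an entity is open
    for t, tag in enumerate(fixed):
        if inside:
            if tag in ("I", "E"):
                inside = tag == "I"     # E closes the entity, I continues it
                continue
            # The entity was left open: close it at the previous token.
            fixed[t - 1] = "S" if fixed[t - 1] == "B" else "E"
            inside = False
        # From here on, no entity is open.
        if tag == "I":
            fixed[t] = "B"              # orphan I starts a new entity
            inside = True
        elif tag == "E":
            fixed[t] = "S"              # orphan E becomes a single-token entity
        elif tag == "B":
            inside = True
    if inside:                          # entity still open at sentence end
        fixed[-1] = "S" if fixed[-1] == "B" else "E"
    return fixed


print(repair_bioes(["B", "I", "O"]))
print(repair_bioes(["O", "I", "E"]))
```

Valid sequences pass through unchanged, so applying the step can only alter predictions that could not have matched a ground truth span exactly in their original form.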
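Precision, recall and F1 scores were used to evaluate the models: $\mathrm{Precision} = C / M$, $\mathrm{Recall} = C / N$, and $\mathrm{F1} = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$, where
M = total number of predicted entities in the sequence.
N = total number of ground truth entities in the sequence.
C = total number of correct entities.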
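As a worked illustration of exact-match evaluation, the sketch below extracts entity spans from BIOES tags and computes the three scores from M, N, and C. The helper names and the example tag sequences are hypothetical, and the span extraction assumes well-formed BIOES input.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans in a well-formed BIOES sequence."""
    spans = set()
    start = None
    for t, tag in enumerate(tags):
        if tag == "S":
            spans.add((t, t))
            start = None
        elif tag == "B":
            start = t
        elif tag == "E":
            if start is not None:
                spans.add((start, t))
            start = None
        elif tag == "O":
            start = None
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    C = len(gold & pred)                # exact matches only
    M = len(pred)                       # predicted entities
    N = len(gold)                       # ground truth entities
    precision = C / M if M else 0.0
    recall = C / N if N else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["O", "B", "I", "E", "O", "S"]
pred = ["O", "B", "E", "O", "O", "S"]   # first entity has a wrong right boundary
print(precision_recall_f1(gold, pred))
```

In this example the prediction with the wrong boundary counts as both a false positive and a false negative, which is why span errors lower precision and recall at the same time under exact matching.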
We used the 200 dimensional word embedding (WE) by Pyysalo et al.  which was trained on PubMed, PubMed Central (PMC) and Wikipedia text, and it contains about 5 million words. Word2vec  was used to train the word embedding. For character level word embedding (CLWE), we used window sizes of 3, 5, and 7.
We used the AdaGrad optimizer with an initial learning rate of 0.01, which was exponentially decayed for each epoch by 0.95. The dimension of the character embedding was 30 and the dimension of the character level word embedding was 200*3. We used 300 hidden units for both forward and backward LSTMs. We applied dropout to two parts of CollaboNet: outputs of CLWE (0.5) and BiLSTM (0.3). The mini-batch size for our experiment was 10.
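The learning rate schedule can be written out as a one-line formula. The sketch below assumes that the decay factor is applied once per completed epoch, starting from epoch 0; the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.01, decay=0.95):
    """Learning rate after `epoch` completed epochs of exponential decay."""
    return initial * decay ** epoch


for epoch in (0, 1, 10):
    print(epoch, learning_rate(epoch))
```

Under this assumption the rate falls to roughly 60% of its initial value after ten epochs, since 0.95 raised to the tenth power is about 0.599.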
Most of our hyperparameter settings are similar to those of Wang et al. Only a few settings such as the dropout rates were different from the hyperparameters of Wang et al. We tuned these hyperparameters using validation sets.
The preparation phase for 6 datasets takes approximately 900 minutes, which is the same amount of time it takes to train 6 single-task models. The rest of the phases require 3000 minutes for complete training. If we exclude BC4CHEMD, the largest dataset, then the training time for the remaining phases is reduced to 1500 minutes, which is half the time otherwise required for these phases.
The experimental results of the baseline models and CollaboNet are provided in Table 5 and Table 5, respectively. Table 5 shows the results of the single-task models (STMs), whereas Table 5 shows the comparison between the existing state-of-the-art multi-task learning model (MTM) and our CollaboNet.
Since Wang et al. used BC5CDR-both for their experiments, we reran their models on BC5CDR-chem and BC5CDR-disease for a fair comparison with other models. The rerun scores are denoted with asterisks. We conducted 10 experiments with 10 different random initializations on our STM. We take the arithmetic mean over the 6 datasets to compare the overall performance of each model.
Table 5 shows the results of the STMs of Habibi et al.  and Wang et al.  (baseline STMs), and our STM on the 6 datasets. While the baseline STMs applied BiLSTM for the Character Level Word Embedding (CLWE) layer [25, 18], our STM used Convolution Neural Network (CNN) for the CLWE layer.
On average, our STM significantly outperforms the baseline STMs in terms of precision, recall and F1 score. Although Sahu and Anand tried to improve the performance of NER models with a CNN based CLWE layer, they failed to do so. In our experiments, however, our STM outperforms the other baseline STMs, demonstrating the effectiveness of an STM with a CNN based CLWE layer.
CollaboNet achieves higher precision and F1 score than most STM models on all datasets. On average, CollaboNet has improved both precision and recall. CollaboNet also outperforms the multi-task model (MTM) from Wang et al. on 4 out of 6 datasets (Table 5). While multi-task learning has improved performance in previous studies, using CollaboNet, which consists of expert models trained for each entity type, could further improve biomedical named entity recognition performance.
Compared to baseline models, CollaboNet achieves higher precision on average. Even though we observe a slight increase in recall, the increase in precision is more valuable than that in recall when considering the practical use of bioNER systems. Important information tends to be repeated in a large text corpus. Therefore, missing a few entities may not hinder the performance of an entire system, as this can be compensated elsewhere. However, incorrect information and the propagation of errors can affect the entire system.
In Table 5, we report the error types of our STM and CollaboNet. We define bio-entity error as recognizing different types of biomedical entities as target entity types. For instance, recognizing ‘VHL’ as a gene when it was used as a disease in a sentence is a bio-entity error. Note that a bio-entity error could occur when an entity is a polysemous word (e.g. VHL), or comprised of multiple words (e.g. BRCA1 deficient), and thus correcting bio-entity errors requires contextual information or supervision of other entity type models. The error analysis was conducted on 4334 errors of our STM and 3966 errors of CollaboNet on 5 datasets (BC2GM, BC5CDR-chem, BC5CDR-disease, JNLPBA, NCBI). Error analysis was conducted on models which showed best performance in our experiments.
The error analysis of our STM, which is a single BiLSTM-CRF model, shows that a large share of errors are bio-entity errors, which comprise up to 49.3% of the total errors in JNLPBA. According to the error analysis of our STM model, bio-entity errors constitute 1333 errors out of 4334 errors, comprising 30.8% of all the errors. Although bio-entity error was not the most common error type, the importance of bio-entity errors is much greater than that of other errors such as span errors, which were the most common error type, constituting 38% of all errors. While most span errors tend to come from subjective annotations, or can be easily fixed by non-experts, bio-entity errors are difficult to detect, even for biomedical researchers. Also, for biomedical text mining methods, such as drug-drug interaction extraction, span errors can cause minor errors but bio-entity errors could lead to completely different results.
In CollaboNet, each expert model is trained on a single entity type dataset, and their training inputs are a concatenation of word embeddings and outputs of the other expert models. We expect that the other expert models will transfer knowledge on their respective entity to the target model, and thus improve the bio-entity type error problem by collaboration. As Table 5 shows, CollaboNet performs better than our STM in detecting polysemy and other entity types. Among 3966 errors from CollaboNet, 736 errors are bio-entity errors, comprising 18.6% of all the errors.
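The reported proportions follow directly from the error counts given above. The short check below recomputes them; the helper name is hypothetical, and the last line is a derived figure (the relative drop in the number of bio-entity errors) rather than a value reported in the text.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


stm_share = share(1333, 4334)           # bio-entity errors of our STM
collabo_share = share(736, 3966)        # bio-entity errors of CollaboNet
relative_drop = share(1333 - 736, 1333) # derived: reduction in bio-entity error count

print(stm_share, collabo_share, relative_drop)
```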
We sampled the predictions of CollaboNet and those of our STM (single-task model) to further understand the strengths of CollaboNet in Table 5.
The first example from chemical dataset in Table 5 shows our expected result from CollaboNet. Our STM annotates antilymphocyte globulin as a chemical entity. However, it is clear that the entity is not a chemical but a type of globulin which is a protein. The second example sentence from the chemical dataset is about an ACE / ARB entity. Again, our STM misidentifies the entity as a chemical entity. On the other hand, in CollaboNet, the target model (chemical model) obtains knowledge from one of the collaborator models (the gene/protein model) to avoid mistakenly recognizing the entity as a chemical entity. As globulin or ACE entities appear in the gene/protein dataset, the chemical model obtains information from the gene/protein model.
In the disease dataset, the first example shows a multi-word entity in parentheses. As a gene model can pass syntactic and semantic information about a word e.g., mutated and its surrounding words to a disease model, CollaboNet can abstain from predicting A-T, mutated as the disease entity, which our STM model failed to do. The second example in the disease dataset is on cardiac troponin T. Since cardiac + noun in biomedical text can be easily considered as a disease name, our STM misidentified this word as a disease entity. However, with the help of a gene model, CollaboNet did not mark it as a disease entity.
The gene/protein entity type further demonstrates the effectiveness of CollaboNet in reducing bio-entity type errors. Two example sentences contain abbreviations, which are one of the distinct characteristics of gene entities. LMB and cHD are incorrectly predicted as gene/protein entities by our STM, since lots of gene/protein entities are abbreviations. However, the target model (gene/protein model) in CollaboNet can obtain information on leptomycin and disease from the chemical and disease models, respectively. With the help of information from collaborator models, CollaboNet can effectively increase the precision of other entity type models.
In addition, we found some labels in the ground truth set, which we believe are incorrect. Tsai et al.  also reported that the inconsistent annotations in the JNLPBA corpus limit the NER system. We report our findings in Table 6.
In the first row of Table 6, the gene/protein entity osteopontin was not marked in the ground truth labels, whereas our network correctly predicted it as a gene entity. The second row also displays questionable results of the ground truth labels. Although Ig and bcl-6, which are abbreviations of Immunoglobulin and B-cell lymphoma 6, were not labeled in the ground truth labels, our model detected them as gene/protein entities. The example sentences of gene/protein annotations in Table 6 were reviewed by several domain experts and medical doctors. As shown in the third row, beta-muricholate is a chemical entity but it was not annotated in the ground truth labels. However, the last row shows another type of annotation error. Contrast media is a general term for a medium used in medical imaging and, since it is not a proper noun, it is not a named entity.
These examples show the presence of incorrect ground truth labels, which can harm the performance of bioNER models. However, we believe that these missed or misidentified ground truth labels can be corrected by our system.
In this paper, we introduced CollaboNet, which consists of multiple BiLSTM-CRF models, for biomedical named entity recognition. While existing models were only able to handle datasets with a single entity type, CollaboNet leverages multiple datasets and achieves the highest F1 scores. Unlike recently proposed multi-task models, CollaboNet is built upon multiple single-task NER models (STMs) that send information to each other for more accurate predictions. In addition to the performance improvement over multi-task models, CollaboNet differentiates between biomedical entities that are polysemous or have similar orthographic features. As a result, our model achieved state-of-the-art performance on four bioNER datasets in terms of F1 score, precision and recall. Although our model requires a large amount of memory and time, which existing multi-task models require as well, the simple structure of CollaboNet allows researchers to build another expert model for different entity types in CollaboNet. As CollaboNet obtains higher precision than other models, we plan to apply CollaboNet in a biomedical text mining system.
The source code of CollaboNet and the datasets are available at https://github.com/wonjininfo/CollaboNet.
BiLSTM: Bidirectional long short-term memory; BioNER: Biomedical named entity recognition; CE: Character embedding; CLWE: Character level word embedding; CNN: convolution neural network; CRF: Conditional random field; LSTM: long short-term memory; MTL: Multi-task learning; MTM: Multi-task model; NER: Named entity recognition; NLP: Natural language processing; PMC: PubMed Central; STM: Single-task model; RNN: Recurrent neural network; WE: Word embedding
This work was supported by the National Research Foundation of Korea (NRF-2016M3A9A7916996, NRF-2017M3C4A7065887) and National IT Industry Promotion Agency grant funded by the Ministry of Science and ICT and Ministry of Health and Welfare (NO. C1202-18-1001, Development Project of The Precision Medicine Hospital Information System (P-HIS))
The authors declare that they have no competing interests.
WY, CHS, JL and JK conceived the idea. WY and JL designed the model. WY and CHS developed CollaboNet. CHS experimented and collected analysis examples and results. WY, JL and JK wrote the manuscript. JK, as the supervisor of WY, CHS and JL, provided guidance on the experiment. All authors read and approved the final manuscript.
We are sincerely grateful to Inah Chang for conducting manual error counting. We appreciate Susan Kim for editing the manuscript.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.