The healthcare industry is undergoing a technological transformation, driven in part by the increasing adoption of conversational agents, including voice assistants (such as Cortana, Alexa, Google Assistant, and Siri) and medical chatbots (see https://www.usertesting.com/about-us/press/press-releases/healthcare-chatbot-apps-on-the-rise-but-cx-falls-short). These conversational agents hold great potential to transform the way consumers connect with providers, but recent reports suggest that they need to improve their performance and usability in order to fulfill that potential.
The field of biomedical text understanding has received increased attention in the last few years. With the increased accessibility of medical records, scientific reports, and publications, along with better tools and algorithms to process these large datasets, precision medicine and diagnosis have the potential to make treatments much more effective. In addition, the relations between different symptoms and diseases, and the side effects of different medicines, can be identified more accurately through text mining on these datasets. The performance of biomedical text understanding systems depends heavily on the accuracy of the underlying Biomedical Named Entity Recognition (BioNER) component, i.e., the ability of these systems to recognize and classify the different types of biomedical entities/slots present in the input utterances.
Slot tagging and Named Entity Recognition (NER) extract semantic constituents by using the words of an input sentence/utterance to fill in predefined slots in a semantic frame. The slot tagging problem can be regarded as a supervised sequence labeling task, which assigns an appropriate semantic label to each word in the given input sentence/utterance. The slot type could be a quantity (such as time, temperature, flight number, or building number) or a named entity such as a person or location. One of the key challenges these text understanding systems face is better identification of symptoms and other entities in their input utterances. In other words, these conversational agents need to improve their slot tagging capabilities (UserTesting.com, 2019). In general, a conversational agent's ability to perform tasks such as topic classification or intent detection depends heavily on its ability to accurately identify slot types in the input, and researchers have been trying both to improve the named entity recognition component and to use the improved component for downstream tasks. Some examples of this trend include entity-aware topic classification (Ahmadvand et al., 2019), de-identification of medical records for conversational systems (Cohn et al., 2019), and improving the inquisitiveness of conversational systems through NER (Reshmi and Balakrishnan, 2018).
Building an NER component with high precision and recall is technically challenging for several reasons: 1) hand-crafted features are required for each label to increase system performance; 2) extensively labelled training datasets are lacking; 3) for conversational agents, the slot tagger may be deployed on limited-memory devices, which requires model compression or knowledge distillation to fit into memory; and 4) for some domains, such as the health/biomedical domain, it is hard to find a single training dataset that covers all the required slot types.
Many of the current state-of-the-art BioNER systems rely on hand-crafted features for each of the labels in the data. Computing these hand-crafted features takes most of the computation time. Furthermore, the use of hand-crafted features results in a system that is optimized for the training dataset. Recent advances in neural network-based feature learning have helped researchers develop BioNER systems that no longer depend on manual feature engineering. However, the performance of these neural network-based techniques depends heavily on the availability of large, high-quality labelled training datasets. Such datasets can be very difficult to obtain in the biomedical domain because of cost and privacy issues.
Multi-task learning is one way to address the lack of extensive training data. The basic premise behind multi-task learning is that different datasets may have semantic and syntactic similarities, and that these similarities can help train a better-optimized model than one trained on a single dataset. It also helps to reduce model overfitting. However, multi-task learning models without pre-trained weights can take a long time to train on large datasets. In contrast, pre-trained language model based approaches (Devlin et al., 2018; Peters et al., 2018; Radford et al., 2018) combined with multi-task learning have recently started showing promising results (Liu et al., 2019).
To overcome these challenges, we present a multi-task transformer-based neural architecture for slot tagging and apply it to the biomedical domain (MT-BioNER). We treat the training of a slot tagger using multiple datasets covering different slot types as a multi-task learning problem. The experimental results on the biomedical domain show that the proposed approach outperforms the previous state-of-the-art methods for slot tagging on different benchmark biomedical datasets in terms of both efficiency (time and memory) and effectiveness. These results can be used by biomedical conversational agents to better identify entities in the input utterances.
2. Related Work
Many recent studies on BioNER have used neural networks to overcome the requirement of hand-crafted feature generation. Crichton et al. (2017) used a word and its surrounding context as input to a convolutional neural network, while Habibi et al. (2017) used word embeddings as input to a BiLSTM-CRF model for named entity recognition. Both of these models are based on single-task/single-dataset approaches and cannot use the information contained in multiple datasets, as is done in multi-task learning (MTL). Although MTL-based approaches have been widely used in NLP research (for example, Collobert and Weston (2008) used MTL for standard NLP tasks such as POS tagging and chunking), the application of MTL in biomedical text mining has not seen promising results, primarily because many of the approaches (e.g., Crichton et al., 2017) used word-level features as input and ignored sub-word information, which can be quite important in the biomedical domain. To overcome this challenge, Wang et al. (2018) used a multi-task BiLSTM-CRF model with an additional context-dependent BiLSTM layer for modelling character sequences and were able to beat the benchmark results on five canonical biomedical datasets. Our proposed system differs from their scheme in that we combine the advantages of both multi-task learning and pre-trained language models. Furthermore, character-based LSTM models can be slower in terms of training and scoring time. An actual comparison of our model with the work of Wang et al. (2018) is described later in this paper.
Pre-trained language model based approaches have been popular for different biomedical text mining tasks. For instance, Sachan et al. (2017) used a transfer learning based approach to pre-train the weights of an LSTM by training it in both the forward and reverse directions using Word2Vec (Mikolov et al., 2013) embeddings trained on a large collection of biomedical documents. In such approaches, the Word2Vec model needs to be fine-tuned according to the variations in the biomedical data (Pyysalo et al., 2013). Recent developments of the ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018) language models have proven the effectiveness of contextualized representations for different NLP tasks. These contextual representation learning models are pre-trained with unsupervised objectives on text data. For instance, BERT (Devlin et al., 2018) uses a multi-layer bidirectional Transformer trained on plain text for masked word prediction and next sentence prediction tasks. However, these unsupervised models must be fine-tuned to achieve better results on specific prediction tasks, using additional task-specific layers and datasets. Examples of BERT-based models fine-tuned for the biomedical domain include:
- BioBERT (Lee et al., 2019), which fine-tunes the BERT model on PubMed abstracts and PMC full-text articles along with the English Wikipedia and Books corpora;
- ClinicalBERT (Huang et al., 2019), which fine-tunes the BERT model for the purpose of hospital readmission prediction;
- the models of Alsentzer et al. (2019), who fine-tuned BERT/BioBERT for named entity recognition on a single dataset (task) and for de-identification on clinical texts.
We can see two clear trends in the related research literature described in this section: 1) multi-task learning approaches that are able to use sub-word information beat the other models for biomedical named entity recognition; and 2) contextual representation models like BERT have been quite successful for language modeling related problems. Based on these observations, we see a lot of potential in approaches that combine multi-task learning with language model pre-training, and this paper presents one such approach. Our approach takes inspiration from the recent multi-task deep neural network (MT-DNN; Liu et al., 2019), which combines multi-task learning with the BERT language model but is optimized for general natural language understanding (NLU) tasks. In contrast, our model, MT-BioNER, is optimized for biomedical named entity recognition, using BioBERT as the shared layers and the different datasets in the task-specific layers.
3.1. Model Architecture
As described in the introduction and related work sections, our model combines pre-trained language models (using BERT as the shared layers) and transfer learning (using task-specific output layers). Figure 2 shows the architecture of the MT-BioNER model. The lower layers are shared across all tasks/datasets, while the top layers are specific to each dataset (entity types). The input sentence is first represented as a sequence of embedding vectors, one for each token, each built from word, position, and segment embeddings. Then the Transformer encoder captures the contextual information for each token and generates the shared contextual embedding vectors. Finally, for each task/dataset, a shallow layer is used to generate task-specific representations, followed by the operations necessary for entity recognition. The input word sequence and output label sequence are represented as shown in Figure 1. The model architecture details are as follows:
Lexicon Encoder layer: The input sentence is a sequence of tokens. Two additional tokens, [CLS] and [SEP], are used to represent the start and end of a sentence, respectively: the first token is always [CLS] and the last token is [SEP]. The lexicon encoder maps the token sequence into a sequence of input embedding vectors, one per token, constructed by summing the corresponding word, segment, and position embeddings, to produce the input to the Transformer Encoder layers.
Transformer Encoder layers: The encoder of the multi-layer bidirectional Transformer (Vaswani et al., 2017) is used to map the input representation vectors into contextual embedding vectors. In our experiments, we use the BERT encoder model (Devlin et al., 2018) as the shared layers across the different tasks (datasets). We fine-tune the BERT model alongside the task-specific layers during the training phase using a multi-task objective function.
Task-specific layers: We use a shallow linear layer for each of the tasks/datasets. Depending on the slot types covered by each training dataset, we treat each dataset as a separate slot tagging task and add a separate output linear layer as the last layer of the network to learn the entities in that dataset. In our experiments, we use a softmax layer. We also conducted additional experiments in which we replaced the softmax layer with two layers, a feedforward layer and a CRF layer, but training took longer and we did not observe a significant improvement in test accuracy.
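The layering described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: `shared_encoder` is a hypothetical stand-in for the BERT encoder, the label sets are abbreviated, the hidden size is an arbitrary assumption, and all weights are random. It only shows how one shared representation can feed a separate softmax head per dataset.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # toy hidden size (assumption; a real BERT-base encoder uses 768)

# Abbreviated IOB label sets per dataset (illustrative, not the full tag sets).
LABELS = {
    "BC2GM": ["O", "B-Gene", "I-Gene"],
    "NCBI-Disease": ["O", "B-Disease", "I-Disease"],
}

def shared_encoder(tokens):
    """Hypothetical stand-in for the shared Transformer encoder:
    returns one HIDDEN-dimensional vector per token."""
    vectors = []
    for tok in tokens:
        rng = random.Random(tok)  # deterministic toy features per token
        vectors.append([rng.uniform(-1, 1) for _ in range(HIDDEN)])
    return vectors

def make_head(num_labels):
    """Randomly initialised linear layer: one weight row per label."""
    return [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
            for _ in range(num_labels)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One task-specific output layer per dataset.
HEADS = {name: make_head(len(labels)) for name, labels in LABELS.items()}

def tag(tokens, dataset):
    """Run the shared encoder, then the head that belongs to `dataset`."""
    head, labels = HEADS[dataset], LABELS[dataset]
    tags = []
    for vec in shared_encoder(tokens):
        logits = [sum(w * x for w, x in zip(row, vec)) for row in head]
        probs = softmax(logits)
        tags.append(labels[probs.index(max(probs))])
    return tags

sentence = ["[CLS]", "BRCA1", "mutations", "cause", "cancer", "[SEP]"]
gene_tags = tag(sentence, "BC2GM")
disease_tags = tag(sentence, "NCBI-Disease")
```

Because the weights are random, the predicted tags are meaningless; the point is only that both calls reuse the same encoder output and differ in the final layer.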
We use an objective function similar to that of Wang et al. (2018) to train the multi-task model. The formal definition of the multi-task setting is as follows. Given $m$ datasets, for $i \in \{1, \dots, m\}$, each dataset $D_i$ consists of $n_i$ training samples, i.e., $D_i = \{(w_j^i, y_j^i)\}_{j=1}^{n_i}$. We denote the training set for each dataset as $X^i = \{x_1^i, \dots, x_{n_i}^i\}$, where $x_j^i$ is the sequence of feature representations of the input word sequence $w_j^i$. The set of labels for each dataset is $Y^i = \{y_1^i, \dots, y_{n_i}^i\}$, where $y_j^i$ is the output label sequence of $w_j^i$. The multi-task model consists of $m$ different models, each trained on a separate dataset, while sharing part of the model parameters across datasets. The loss function $\mathcal{L}$ of the multi-task model is
$$\mathcal{L} = \sum_{i=1}^{m} \lambda_i \mathcal{L}_i = -\sum_{i=1}^{m} \lambda_i \log P(Y^i \mid X^i; \theta, \theta_i) \qquad (1)$$
The training maximizes the log-likelihood of the label sequence given the input sequence for each training dataset, as shown in Eq. 1, where the cross-entropy loss is used as the loss function $\mathcal{L}_i$. Cross-entropy loss increases as the predicted probability diverges from the actual label. The contribution of each dataset $D_i$ is controlled by the hyperparameter $\lambda_i$. In our experiments, we assume that all the datasets have the same contribution and set $\lambda_i$ to the same value for all datasets. The BERT shared-layer and task-dependent layer parameters are represented by $\theta$ and $\theta_i$, respectively. We conducted two variants of transfer learning. In the first, we freeze the shared layers and only fine-tune the task-specific layers (that is, $\theta$ is held fixed for all datasets). In the second, we fine-tune the whole network (that is, both $\theta$ and $\theta_i$ are updated).
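As a rough illustration of the weighted multi-task objective, the following sketch sums per-token cross-entropy terms within each dataset and then combines the dataset losses with per-dataset weights. The probabilities are made-up toy values, the function names are hypothetical, and the weight of 1.0 per dataset is an assumption standing in for the equal-contribution setting described above.

```python
import math

def sequence_nll(token_probs, gold_labels):
    """Negative log-likelihood of one label sequence: the sum over tokens of
    -log p(gold label), i.e. the per-token cross-entropy terms."""
    return -sum(math.log(probs[gold])
                for probs, gold in zip(token_probs, gold_labels))

def multitask_loss(batches, weights):
    """Weighted sum of the per-dataset losses."""
    total = 0.0
    for dataset, examples in batches.items():
        dataset_loss = sum(sequence_nll(probs, gold) for probs, gold in examples)
        total += weights[dataset] * dataset_loss
    return total

# Toy predicted distributions (one dict per token) with their gold labels.
batches = {
    "BC5CDR": [
        ([{"O": 0.9, "B-Chemical": 0.1}, {"O": 0.2, "B-Chemical": 0.8}],
         ["O", "B-Chemical"]),
    ],
    "NCBI-Disease": [
        ([{"O": 0.5, "B-Disease": 0.5}], ["B-Disease"]),
    ],
}
weights = {dataset: 1.0 for dataset in batches}  # equal contribution (assumed value)
loss = multitask_loss(batches, weights)
```

With equal weights the total is simply the negative log of the product of the gold-label probabilities, here `-log(0.9 * 0.8 * 0.5)`; a less confident prediction for any gold label raises the loss.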
We use stochastic gradient descent (SGD) to learn the parameters of all shared layers and task-specific layers, as shown in Algorithm 1 (based on Liu et al. (2019)). We initialize the shared layers with the pre-trained BERT model, while the task-dependent layers are initialized randomly. After creating mini-batches for each dataset, we combine all the datasets and shuffle the mini-batches. In each epoch, we train the model on all the mini-batches: at each batch iteration, a mini-batch corresponding to one task is selected (from all the datasets), and the model is updated according to the objective for that task.
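The training procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: `update_fn` is a hypothetical callback standing in for one gradient step on the selected task's objective, and the toy datasets are lists of integers rather than sentences.

```python
import random

def make_minibatches(examples, batch_size):
    """Split one dataset into consecutive mini-batches."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

def epoch_schedule(datasets, batch_size, rng):
    """Pool (task, mini-batch) pairs from every dataset and shuffle them."""
    schedule = [(task, batch)
                for task, examples in datasets.items()
                for batch in make_minibatches(examples, batch_size)]
    rng.shuffle(schedule)
    return schedule

def train(datasets, num_epochs, batch_size, update_fn, seed=0):
    """Visit every mini-batch of every dataset once per epoch, in mixed order."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        for task, batch in epoch_schedule(datasets, batch_size, rng):
            update_fn(task, batch)  # one step on the objective of `task`

# Toy usage: record which task each update belongs to.
toy_datasets = {"BC2GM": list(range(10)), "BC5CDR": list(range(6))}
log = []
train(toy_datasets, num_epochs=2, batch_size=4,
      update_fn=lambda task, batch: log.append(task))
```

With a batch size of 4, the toy datasets yield 3 and 2 mini-batches respectively, so each epoch performs 5 updates whose task order is reshuffled every epoch.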
| Dataset | Size | Entity Types and Counts |
|---|---|---|
| BC2GM | 20,000 sentences | Gene/Protein (24,583) |
| BC5CDR | 1,500 articles | Chemical (15,935), Disease (12,852) |
| NCBI-Disease | 793 abstracts | Disease (6,881) |
| JNLPBA | 2,404 abstracts | Gene/Protein (35,336), Cell Line (4,330), DNA (10,589), Cell Type (8,649), RNA (1,069) |
4.1. Data Preparation
We evaluate the performance of the proposed approach on four benchmark datasets used by Sachan et al. (2017). Table 1 summarizes these datasets in terms of size and entity counts. We used these publicly available datasets in order to make the experiments reproducible. We use the same training, development, and test set splits as Crichton et al. (2017) for each dataset. As in Wang et al. (2018) and Sachan et al. (2017), we use both the training and development sets to train the final model. As part of the data preprocessing step, word labels are encoded using an IOB scheme. In this scheme, for example, a word describing a disease entity is tagged with "B-Disease" if it is at the beginning of the entity and "I-Disease" if it is inside the entity. All other words, which do not describe entities of interest, are tagged as "O".
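The IOB encoding described above can be shown on a small example. The sentence, the entity spans, and the helper name `iob_encode` are illustrative assumptions and are not taken from the benchmark datasets.

```python
def iob_encode(tokens, entities):
    """Convert entity spans into IOB tags.

    `entities` is a list of (start, end, type) token spans with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags

tokens = ["Patients", "with", "breast", "cancer", "received", "tamoxifen", "."]
entities = [(2, 4, "Disease"), (5, 6, "Chemical")]
tags = iob_encode(tokens, entities)
# tags == ['O', 'O', 'B-Disease', 'I-Disease', 'O', 'B-Chemical', 'O']
```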
|Dataset||Metric||Single-Task Learning||Multi-Task Learning|
|JNLPBA (Genes etc.)||Precision||69.42||71.35||71.39||70.91||-||67.40|
4.2. Training and Evaluation Details
We test our method on the four benchmark datasets used by Sachan et al. (2017). All the neural network models are trained on one Tesla K80 GPU using the PyTorch framework. To train our neural models, we use the BertAdam optimizer with a learning rate of 5e-5, a linear scheduler with a warmup proportion of 0.4, and a weight decay of 0.01 applied at every epoch of training. We use a batch size of 32 and a maximum sentence length of 128 words. We use the BioBERT model (Lee et al., 2019) as the domain-specific language model. Lee et al. (2019) also presented the use of BioBERT for the biomedical NER scenario, but their scheme develops a separate model for each dataset.
We compare our proposed MT-BioNER model with state-of-the-art BioNER systems: the single-task LSTM-CRF model of Habibi et al. (2017) (BiLSTM), the multi-task model of Wang et al. (2018) (MTM-CW), and the transfer learning approach of Sachan et al. (2017) (BiLM-NER). Table 2 shows the precision, recall, and F1 scores of the models trained on three and four datasets. The BioBERT model is used as the shared layers for these results. From the results, we see that our approach trained on three datasets obtains the highest recall and F1 scores. Note that our model is based on a multi-task approach and still achieves better performance than the single-task transfer learning approach of Sachan et al. (2017) (BiLM-NER), in which the whole network is trained on a single dataset only. Moreover, our model trained on four datasets performs better on recall and F1-score than the MTM-CW approach, which is also a multi-task model. Another interesting result is that our model achieves the highest recall among all the approaches, although its precision is lower. To further improve the precision, dictionaries could be added as features, which would be interesting future work. The results also show the potential of the BERT language model to provide semantic feature representations for multi-task NER.
5.2. Training/Scoring Time and Model Size
As mentioned before, all the neural network models are trained on one Tesla K80 GPU. We compare the training time of our model with that of Wang et al. (2018), since we use it for benchmark comparison and they have made their model publicly available. We find that, for four datasets, training takes on average about 40 minutes per epoch (1.45 sec/mini-batch, with a total of 1,537 mini-batches and a batch size of 32); with a total of 8 epochs, the full training on four datasets takes less than 6 hours. On average, this is equivalent to 0.36 sec per sentence to train the final model, which makes our model at least twice as fast to train as the model of Wang et al. (2018). One of the main reasons is that the model of Wang et al. (2018) uses character-level and word-level bidirectional LSTM layers, which can take more time to train.
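The training-time figures quoted above can be cross-checked with a few lines of arithmetic. The interpretation of the 0.36 sec figure as total training time divided by the number of training sentences is an assumption, as is treating every mini-batch as full (which makes the sentence count an upper bound).

```python
SEC_PER_MINIBATCH = 1.45
MINIBATCHES_PER_EPOCH = 1537
BATCH_SIZE = 32
EPOCHS = 8

epoch_seconds = SEC_PER_MINIBATCH * MINIBATCHES_PER_EPOCH
epoch_minutes = epoch_seconds / 60            # roughly 37 minutes per epoch
total_hours = EPOCHS * epoch_seconds / 3600   # just under 5 hours in total

# Upper bound on the number of training sentences (assumes full mini-batches).
sentences = MINIBATCHES_PER_EPOCH * BATCH_SIZE
sec_per_sentence = EPOCHS * epoch_seconds / sentences  # about 0.36 sec

print(f"{epoch_minutes:.1f} min/epoch, {total_hours:.2f} h total, "
      f"{sec_per_sentence:.2f} sec/sentence")
```

Under these assumptions the numbers are mutually consistent: about 37 minutes per epoch, under 6 hours for 8 epochs, and about 0.36 sec per sentence.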
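| Dataset | Metric | BERT-Base | BioBERT |
|---|---|---|---|
| (all) | Epochs | 50 | 8 |
| BC2GM | Precision | 77.86 | 80.33 |
| BC2GM | Recall | 80.57 | 82.82 |
| BC2GM | F1 | 79.19 | 81.55 |
| BC5CDR | Precision | 85.92 | 87.99 |
| BC5CDR | Recall | 87.79 | 90.16 |
| BC5CDR | F1 | 86.85 | 89.06 |
| NCBI | Precision | 84.78 | 84.50 |
| NCBI | Recall | 86.90 | 88.98 |
| NCBI | F1 | 85.83 | 86.68 |
| JNLPBA | Precision | 65.40 | 67.40 |
| JNLPBA | Recall | 75.58 | 79.35 |
| JNLPBA | F1 | 70.12 | 72.89 |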
As another comparison, we study the scoring (inference) time, since it plays an important role in real-world applications. We measure the scoring time on a single dataset, the BC5CDR test set: prediction takes 80 seconds, which is at least twice as fast as the model of Wang et al. (2018); we think the character-level LSTM layers could be the main reason. We should also mention that the prediction time of a multi-task model is in general shorter than that of multiple single-task models such as Sachan et al. (2017), since the shared layers can be run once and the remaining shallow task-dependent layers can be run in parallel. Model size is another factor in real-world applications. With our approach, the model size is 430 MB, which is larger than the 220 MB model of Wang et al. (2018). Since we use the BERT base model with 12 Transformer layers, the larger size was expected. Moreover, compared to single-task models, which require multiple models to identify multiple entity types (from different datasets), multi-task approaches provide a single model for all the entity types across different datasets, which is a big advantage in real-world production environments.
5.3. Fine-tuning Approaches Study
To achieve the best results, we fine-tune the shared language model alongside the task-specific layers during the training phase. An interesting question is what happens if we train only the task-dependent layers and freeze the language model at its pre-trained weights. We run this experiment with the same parameters as the other experiments and observe poor performance: for all the datasets, the F1-score is less than 60%. This poor performance can be explained as follows. Since we train only the task-dependent layers, the shared layers act as a fixed, non-trainable embedding layer. The trainable part of the network is therefore just a shallow linear layer, and a linear layer on top of a fixed embedding, trained in a multi-task fashion, may not give good results, as observed in this study.
5.4. Domain Adaptation Study
The benchmark results shown in Table 2 are achieved with BioBERT, an in-domain language model, as the shared layers. To analyze the domain adaptation scenario, we run the same experiment with the general BERT base model. In these experiments, we use the BERT-base-Cased model, since BioBERT is a fine-tuned version of BERT-base-Cased. Table 3 shows the performance comparison of these two scenarios. We observe that the BERT base model comes within a small margin of the state-of-the-art results, but it requires a greater number of training iterations to adapt to the new domain. To further improve these results, one might first fine-tune the BERT language model on all the datasets, save the new language model, and then use it in our approach. This is very similar to the work of Sachan et al. (2017) on transfer learning in BiLM-NER, in which they created their own language model using a character-level CNN layer with word embedding layers as the lexicon encoder and bidirectional LSTM layers to extract the contextual embedding vectors. We did not experiment with this approach in this paper, since we focused on the multi-task aspect of the work, but it could be an interesting experiment for future work.
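As a quick consistency check on the Table 3 figures, the reported F1 values can be compared with the harmonic mean of the reported precision and recall. The sketch below assumes the standard definition F1 = 2PR / (P + R); the dictionary simply transcribes the table, and small gaps are expected because the published precision and recall are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (dataset, model) -> (precision, recall, reported F1), transcribed from Table 3.
TABLE_3 = {
    ("BC2GM", "BERT-Base"): (77.86, 80.57, 79.19),
    ("BC2GM", "BioBERT"): (80.33, 82.82, 81.55),
    ("BC5CDR", "BERT-Base"): (85.92, 87.79, 86.85),
    ("BC5CDR", "BioBERT"): (87.99, 90.16, 89.06),
    ("NCBI", "BERT-Base"): (84.78, 86.90, 85.83),
    ("NCBI", "BioBERT"): (84.50, 88.98, 86.68),
    ("JNLPBA", "BERT-Base"): (65.40, 75.58, 70.12),
    ("JNLPBA", "BioBERT"): (67.40, 79.35, 72.89),
}

# Largest absolute difference between recomputed and reported F1.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in TABLE_3.values())

# Per-dataset F1 gain of BioBERT over BERT-Base.
gains = {
    dataset: round(TABLE_3[(dataset, "BioBERT")][2]
                   - TABLE_3[(dataset, "BERT-Base")][2], 2)
    for dataset in ("BC2GM", "BC5CDR", "NCBI", "JNLPBA")
}
```

The recomputed values agree with the reported ones to within rounding, and `gains` makes the size of the in-domain advantage explicit for each dataset.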
5.5. Training Schema Study
To achieve the best results, we used the training scheme of MT-DNN (Liu et al., 2019), in which, at each training iteration (epoch), we train on all the batches of all the datasets, selecting a random batch from all datasets at each step. Wang et al. (2018) use a different training scheme, in which a random dataset is selected at each iteration and different batches of that specific dataset are fed into the pipeline for training. Using this scheme, they run several hundred epochs to train the model, which makes sense since they also train the language model part of their approach. In our experiments, we need a smaller number of iterations owing to the availability of the pre-trained language model. Thus, we run a similar experiment, but instead of random selection we iterate over the datasets in order (Algorithm 2). Figure 3 shows the performance of training Algorithms 1 and 2 over the number of iterations. It is clear that, compared to Algorithm 1, which achieves the best results in fewer iterations, training scheme 2 performs poorly and its learning process is not stable, since the model becomes biased toward the latest dataset on which it was trained.
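The difference between the two batch orderings can be made concrete with a small sketch. The mini-batch counts below are hypothetical, and the functions are illustrative stand-ins rather than the authors' Algorithms 1 and 2; the sketch only compares how long the model would see consecutive mini-batches from a single dataset under each ordering.

```python
import random

def shuffled_schedule(batch_counts, rng):
    """Pool the mini-batches of all datasets and shuffle them."""
    schedule = [task for task, n in batch_counts.items() for _ in range(n)]
    rng.shuffle(schedule)
    return schedule

def ordered_schedule(batch_counts):
    """Iterate over the datasets in order, one whole dataset at a time."""
    return [task for task, n in batch_counts.items() for _ in range(n)]

def longest_run(schedule):
    """Length of the longest streak of consecutive batches from one dataset."""
    best = run = 1
    for prev, cur in zip(schedule, schedule[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Hypothetical mini-batch counts per dataset (not taken from the paper).
batch_counts = {"BC2GM": 40, "BC5CDR": 30, "NCBI-Disease": 20, "JNLPBA": 50}

mixed = shuffled_schedule(batch_counts, random.Random(0))
in_order = ordered_schedule(batch_counts)

mixed_run = longest_run(mixed)
ordered_run = longest_run(in_order)  # equals the largest dataset's batch count
```

In the ordered schedule every epoch ends with a long streak of updates from the last dataset, which is in line with the bias toward the most recently seen dataset described above; the shuffled schedule interleaves tasks and keeps such streaks short.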
6. Conclusion & Future Work
Conversational agents such as Cortana, Alexa, and Siri are continuously increasing their capabilities by adding new domains. Supporting a new domain involves the design and development of a number of NLU components for domain classification, intent classification, and slot tagging (including named entity recognition). First, each component only performs well when trained on a large amount of labeled data. Second, these components are deployed on limited-memory devices, which requires some model compression. Third, for some domains, such as the health domain, it is hard to find a single training dataset that covers all the required slot types. In this paper, we presented a multi-task transformer-based neural architecture for slot tagging that overcomes these problems and applied it to the biomedical domain (MT-BioNER). We formulated the training of a slot tagger using multiple datasets covering different slot types as a multi-task learning problem. We also reported the training and scoring times and compared them to recent work. The experimental results on the biomedical domain show that the MT-BioNER model trained using the proposed approach outperforms the previous state-of-the-art systems for slot tagging on different benchmark biomedical datasets in terms of both efficiency (time and memory) and effectiveness. The model has shorter training and inference times and a smaller memory footprint than the baseline of multiple single-task models. This is another advantage of our approach, which can play an important role in real-world applications. We ran extra experiments to study the impact of fine-tuning the shared language model, as it turned out to be a crucial part of our approach.
Using the BERT base model and comparing it with BioBERT showed an interesting result: for domains in which no specific in-domain BERT model exists, the base model also performs well, at the cost of a few percentage points in F1-score and more training iterations. Furthermore, we have shown through detailed analysis that the training algorithm plays a very important role in achieving strong performance. The resulting slot tagger can be used by a conversational agent to better identify entities in the input utterances.
Lastly, we highlight several future directions for improving the multi-task BioNER model. Based on the MT-BioNER results on three and four datasets, a deeper investigation into the effect of dataset/entity types on multi-task learning approaches could be interesting future work. Overall, this study shows how the named entity recognition capabilities of different biomedical systems can be obtained by employing recent trends from deep learning and multi-task learning research. Incorporating such techniques can help researchers overcome the shortage of extensively labeled training data, which is hard to obtain in the biomedical domain.
As future work, we want to further explore the impact of overlap between the datasets on the overall model performance. Adding JNLPBA to the training datasets degrades the overall performance of the model. One possible reason is that JNLPBA contains many genes and proteins that are represented as short abbreviations and codes. These abbreviations can overlap with the entities in other datasets, resulting in confusion and degradation of the overall model. We would like to explore ways to handle such overlap between entities. We would also like to perform a comparative analysis of different models on the same input sentences to highlight the strengths of our model relative to other models.
Although the datasets and experiments in this paper focus on the biomedical domain, the techniques and algorithms presented here can be used in other domains and in general conversational systems that are not focused on the biomedical domain. In future work, we also want to analyze the performance of our NER model on general-domain conversational systems.
- ConCET: entity-aware topic classification for open-domain conversational agents. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1371–1380.
- Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323.
- Audio de-identification: a new entity recognition task.
- A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
- A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics 18 (1), pp. 368.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Exploring deep knowledge resources in biomedical name recognition. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 96–99.
- Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33 (14), pp. i37–i48.
- ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
- TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 32 (18), pp. 2839–2846.
- BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
- HITSZ_CDR system for disease and chemical named entity recognition and relation extraction. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Sevilla, Vol. 2015, pp. 196–201.
- Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Distributional semantics resources for biomedical text processing. In Proceedings of LBM 2013, pp. 39–44.
- Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Enhancing inquisitiveness of chatbots through NER integration. In 2018 International Conference on Data Science and Engineering (ICDSE), pp. 1–5.
- Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition. arXiv preprint arXiv:1711.07908.
- Healthcare chatbot diagnosis.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 35 (10), pp. 1745–1752.