End-to-end Joint Entity Extraction and Negation Detection for Clinical Text

12/13/2018 ∙ by Parminder Bhatia, et al. ∙ Amazon

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially for low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset.





In recent years, natural language processing (NLP) techniques have demonstrated increasing effectiveness in clinical text mining. Electronic health record (EHR) narratives, e.g., discharge summaries and progress notes, contain a wealth of medically relevant information such as diagnosis information and adverse drug events. Automatic extraction of such information and representation of clinical knowledge in standardized formats could be employed for a variety of purposes such as clinical event surveillance, decision support [Jin et al.2018], pharmacovigilance, and drug efficacy studies.

Although many NLP applications that successfully extract findings from medical reports have been developed in recent years, identifying assertions such as positive (present), negative (absent), and hypothetical remains a challenging task, especially to generalize [Wu et al.2014]. However, identifying assertions is critical since negative and uncertain findings are frequent in clinical notes, and information extraction algorithms that do not distinguish between them will not paint a clear picture of the patient.

In this paper, we focus on identifying the negated findings in a multi-task setting [Bhatia, Arumae, and Celikkaya2018]. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and negation detection. Previous efforts in this area include both rule-based and machine-learning approaches.

Rule-based systems rely on negation keywords and rules to determine the cue of negation. NegEx [Chapman et al.2001] is a widely used algorithm that consists of ontology lookup to index findings, and negation regular expression search in a fixed scope. ConText [Harkema et al.2009] extends NegEx to other attributes like hypothetical and makes the scope variable by searching for a termination term. NegBio [Peng et al.2018] uses a universal dependency graph for scope detection. Another similar work is [Gkotsis et al.2016], which utilizes a constituency-based parse tree to prune out the parts outside the scope. However, these approaches use rules and regular expressions for cue detection, which rely solely on surface text and thus are limited when attempting to capture complex syntactic constructions such as long noun phrases.
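To make the fixed-scope limitation concrete, here is a toy sketch of cue matching with a fixed token window. It is not the NegEx implementation: the cue list, the window size, and the function name `negated_findings` are assumptions made for illustration only.

```python
import re

# Hypothetical cue list and window size, chosen only for illustration.
NEGATION_CUES = ("no", "denies", "without", "negative for")
WINDOW = 3  # number of tokens after a cue that count as "in scope"


def negated_findings(sentence, findings):
    """Return the findings that appear within WINDOW tokens after a cue."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    negated = set()
    for i in range(len(tokens)):
        for cue in NEGATION_CUES:
            cue_tokens = cue.split()
            if tokens[i:i + len(cue_tokens)] != cue_tokens:
                continue
            start = i + len(cue_tokens)
            # Pad with spaces so that only whole-token matches count.
            scope = " " + " ".join(tokens[start:start + WINDOW]) + " "
            for finding in findings:
                if " " + finding + " " in scope:
                    negated.add(finding)
    return sorted(negated)


# The short phrase is caught; the finding at the end of a long noun phrase
# falls outside the fixed window and is missed.
print(negated_findings("Patient denies chest pain but reports fever.",
                       ["chest pain", "fever"]))
print(negated_findings("No evidence of acute lower lobe pneumonia.",
                       ["pneumonia"]))
```

In the second sentence the cue is found, but the finding sits beyond the window, which is the kind of long-noun-phrase failure described above.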

Kernel-based approaches are also very common, especially in the 2010 i2b2/VA task of predicting assertions. The state-of-the-art in that challenge applies support vector machines (SVM) to assertion prediction as a separate step after concept extraction [de Bruijn et al.2011]. They train classifiers to predict assertions of each concept word, and a separate classifier to predict the assertion of the whole concept.

[Shivade et al.2015] proposes Augmented Bag of Words Kernel (ABoW), which generates features based on NegEx rules along with bag-of-words features. [Cheng, Baldwin, and Verspoor2017] uses CRF for classification of cues and scope detection. These machine learning based approaches often suffer in generalizability, the ability to perform well on unseen text.

Recently, neural network models such as [Fancellu, Lopez, and Webber2016] and [Rumeng, Jagannatha Abhyuday, and Hong2017] have been proposed. [Fancellu, Lopez, and Webber2016] exploits feedforward and bidirectional Long Short Term Memory (BiLSTM) networks for generic negation scope detection. This is a slightly different task since the negation cue is assumed to be given as input. Most relevant to our work is [Rumeng, Jagannatha Abhyuday, and Hong2017], where gated recurrent units (GRUs) are used to represent the clinical events and their context, along with an attention mechanism. Given a text annotated with events, it classifies the presence and period of the events. However, this approach is not end-to-end as it does not predict the events. Additionally, these models generally require a large annotated corpus for good performance. Unfortunately, such clinical text data is not easily available.

Multi-task learning (MTL) is one of the most effective solutions for knowledge transfer across tasks. In the context of neural network architectures, MTL is performed by sharing parameters across models, for example by pretraining with word embeddings [Bhatia, Guthrie, and Eisenstein2016, Bojanowski et al.2016], a popular approach for most NLP tasks. In this paper, we propose an MTL approach to negation detection that overcomes some of the limitations of existing models, such as data accessibility. MTL leverages overlapping representations across sub-tasks, and we look towards parameter sharing methods [Peng and Dredze2017] to transfer the overlapping representation between the two tasks.

To the best of our knowledge, this is the first work to jointly model named entity and negation in an end-to-end system. Our main contributions are summarized below:


  • An end-to-end hierarchical neural model consisting of shared encoder and different decoding schemes to jointly extract entities and negations. Using our proposed model, we obtain substantial improvement over prior models for both entities and negations on the 2010 i2b2/VA challenge task as well as a proprietary de-identified clinical note dataset for medical conditions.

  • A conditional softmax shared decoder model to overcome the problem of low resource settings (datasets that have limited amounts of training data), which achieves state-of-the-art results across different datasets.

  • A thorough empirical analysis of parameter sharing for low resource setting highlighting the significance of the shared decoder.


We first present a standard neural framework for named entity recognition. To facilitate multi-task learning, we expand on that architecture by building a two decoder model. Then, to overcome the issues of the two decoder model we propose a single shared decoder model. Finally, we introduce the Conditional softmax shared decoder.

Named Entity Recognition Architecture

A sequence tagging problem such as NER can be formulated as maximizing the conditional probability distribution over tags $y$ given an input sequence $x$, and model parameters $\theta$.

$$P(y \mid x, \theta) = \prod_{t=1}^{T} P(y_t \mid x_t, y_{1:t-1}, \theta)$$

$T$ is the length of the sequence, and $y_{1:t-1}$ are tags for the previous words. The architecture we use as a foundation is that of [Lample et al.2016, Yang, Salakhutdinov, and Cohen2016]. The model consists of three main components: the (i) character and (ii) word encoders, and the (iii) decoder/tagger.
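As a small illustration of this objective, the sketch below scores a gold tag sequence by summing the log of the per-timestep conditional probabilities, which is the log of the product over timesteps. The per-step distributions are hand-written stand-ins for model outputs, and the function name `sequence_log_likelihood` and the tag names are assumptions.

```python
import math


def sequence_log_likelihood(step_distributions, gold_tags):
    """Sum over timesteps of log P(gold tag at t | input, previous tags).

    step_distributions[t] is a dict mapping each tag to its probability at
    timestep t, already conditioned on the input and the previous tags.
    """
    return sum(
        math.log(distribution[tag])
        for distribution, tag in zip(step_distributions, gold_tags)
    )


# Hand-written distributions for a two-token sequence (illustrative values).
steps = [
    {"O": 0.9, "B-problem": 0.1},
    {"O": 0.2, "B-problem": 0.8},
]
log_likelihood = sequence_log_likelihood(steps, ["O", "B-problem"])
print(log_likelihood)  # log(0.9) + log(0.8)
```

Maximizing this quantity over the training set is equivalent to minimizing the cross-entropy of the gold tags.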


Given an input sequence $x \in \mathbb{N}^{T}$ whose coordinates indicate the words in the input vocabulary, we first encode the character level representation for each word. For each $x_t$ the corresponding sequence $c^{(t)} \in \mathbb{R}^{L \times e_c}$ of character embeddings is fed into an encoder, where $L$ is the length of a given word and $e_c$ is the size of the character embedding. The character encoder employs two LSTM units which produce $\overrightarrow{h}^{(t)}_{1:L}$ and $\overleftarrow{h}^{(t)}_{1:L}$, the forward and backward hidden representations, respectively, where $L$ is the last timestep in both sequences. We concatenate the last timestep of each of these as the final encoded representation, $h_c^{(t)} = [\overrightarrow{h}^{(t)}_{L} \,\|\, \overleftarrow{h}^{(t)}_{L}]$, of $x_t$ at the character level.

The output of the character encoder is concatenated with a pre-trained word embedding, $m_t = [h_c^{(t)} \,\|\, \mathrm{emb}_{word}(x_t)]$, which is used as the input to the word level encoder.

Using learned character embeddings alongside word embeddings has been shown to be useful for learning word level morphology, as well as mitigating loss of representation for out-of-vocabulary words. Similar to the character encoder, we use a BiLSTM to encode the sequence at the word level. The word encoder does not lose resolution, meaning the output at each timestep is the concatenated output of both word LSTMs, $h_t = [\overrightarrow{h}_t \,\|\, \overleftarrow{h}_t]$.

Decoder and Tagger

Finally, the concatenated output of the word encoder is used as input to the decoder, along with the label embedding of the previous timestep. During training we use teacher forcing [Williams and Zipser1989] to provide the gold standard label as part of the input.


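$$o_t = \mathrm{LSTM}(o_{t-1}, [h_t \,\|\, \hat{y}_{t-1}])$$

$$\hat{y}_t = \mathrm{softmax}(W o_t + b^{s})$$

where $h_t$ is the word encoder output at timestep $t$, $\hat{y}_{t-1}$ is the label embedding of the previous timestep, $W \in \mathbb{R}^{n \times d}$, $d$ is the number of hidden units in the decoder LSTM, and $n$ is the number of tags. The model is trained in an end-to-end fashion using a standard cross-entropy objective.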
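The way teacher forcing shapes the decoder input can be sketched as follows: at each timestep the word encoder output is concatenated with the embedding of the previous gold tag. This is a minimal sketch, not the authors' code. The start symbol `<s>`, the function name `teacher_forced_inputs`, and the tiny vectors are assumptions, and the decoder LSTM itself is omitted.

```python
def teacher_forced_inputs(encoder_outputs, gold_tags, tag_embeddings,
                          start_tag="<s>"):
    """Build decoder inputs for training with teacher forcing.

    The input at timestep t is the word encoder output at t concatenated
    with the embedding of the gold tag at t - 1 (a start symbol at t = 0).
    """
    inputs = []
    previous_tag = start_tag
    for encoder_output, gold_tag in zip(encoder_outputs, gold_tags):
        inputs.append(list(encoder_output) + list(tag_embeddings[previous_tag]))
        previous_tag = gold_tag  # feed the gold label, not the prediction
    return inputs


# Toy two-token example with 2-dimensional encoder outputs and
# 1-dimensional tag embeddings (illustrative values only).
encoder_outputs = [[1.0, 2.0], [3.0, 4.0]]
tag_embeddings = {"<s>": [0.0], "B-problem": [1.0], "O": [2.0]}
decoder_inputs = teacher_forced_inputs(
    encoder_outputs, ["B-problem", "O"], tag_embeddings
)
print(decoder_inputs)
```

At inference time no gold tag is available, so the previous predicted tag would take its place in the same concatenation.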

Figure 1: Two decoder model, upper decoder for NER and the lower decoder for negation, where encoder provides same input to both the decoders.
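(A) 2010 i2b2/VA Dataset

| Task | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| Named Entity | Chalapathy et al. [2016] | 0.844 | 0.834 | 0.839 |
| Named Entity | Independent NER (baseline) | 0.857 | 0.841 | 0.848 |
| Named Entity | Two Decoder | 0.849 | 0.855 | 0.851 |
| Named Entity | Shared Decoder | 0.852 | 0.821 | 0.834 |
| Named Entity | Conditional Decoder | 0.854 | 0.858 | 0.855 |
| Negation | NegEx | 0.896 | 0.799 | 0.845 |
| Negation | ABoW Kernel | 0.899 | 0.900 | 0.900 |
| Negation | Independent Negation (baseline) | 0.81 | 0.85 | 0.82 |
| Negation | Two Decoder | 0.894 | 0.908 | 0.899 |
| Negation | Shared Decoder | 0.87 | 0.902 | 0.882 |
| Negation | Conditional Decoder | 0.919 | 0.891 | 0.905 |

(B) Proprietary Medical Condition Dataset

| Task | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| Named Entity | LSTM:CRF | 0.82 | 0.84 | 0.83 |
| Named Entity | Independent NER | 0.88 | 0.848 | 0.863 |
| Named Entity | Two Decoder | 0.876 | 0.861 | 0.868 |
| Named Entity | Shared Decoder | 0.864 | 0.841 | 0.857 |
| Named Entity | Conditional Decoder | 0.878 | 0.872 | 0.874 |
| Negation | NegEx | 0.403 | 0.932 | 0.563 |
| Negation | Independent Negation | 0.84 | 0.82 | 0.83 |
| Negation | Two Decoder | 0.931 | 0.865 | 0.897 |
| Negation | Shared Decoder | 0.921 | 0.85 | 0.878 |
| Negation | Conditional Decoder | 0.928 | 0.874 | 0.899 |

Table 1: Test set performance during multi-task training. Table 1A displays results from i2b2. Table 1B uses our medical condition data. The baseline is the current state-of-the-art optimized architecture.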

Two Decoder Model

To facilitate the multi-task learning setting, we started with a two decoder model consisting of two decoders which use the shared encoder representation to jointly predict entities and the negation attribute (Figure 1). This is a standard architecture used in the multi-task learning setting, consisting of a separate decoder LSTM for each task, each followed by its own softmax. This model mitigates the issues associated with rule-based models that rely solely on surface text, and thus are limited when attempting to capture complex syntactic constructions. With a shared contextual encoder representation consisting of character and word embedding based models, the proposed architecture provides an effective solution for knowledge transfer across tasks, thus consolidating the ability to perform well on unseen text. However, this architecture is not scalable, since the number of decoders scales linearly with the number of attributes. Another problem we observed with this architecture is performance degradation in an extremely low resource setting, where the additional parameters prevent the model from generalizing well.
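A rough back-of-the-envelope sketch of why the decoder parameter count grows with the number of attributes is given below. It uses the standard LSTM parameter formula with one bias vector per gate, and it assumes the layer sizes reported under Model settings are per direction (a 2 × 100 word encoder output, a 50-dimensional tag embedding, and a decoder hidden size of 50). Both the function names and that reading of the sizes are assumptions, and frameworks that keep two bias vectors per gate would give slightly larger counts.

```python
def lstm_param_count(input_size, hidden_size):
    """Weights and biases of one LSTM layer, one bias vector per gate."""
    gates = 4  # input, forget, cell candidate, output
    return gates * (hidden_size * (input_size + hidden_size) + hidden_size)


def decoder_params(num_decoders, input_size, hidden_size):
    """Total decoder LSTM parameters when every task has its own decoder."""
    return num_decoders * lstm_param_count(input_size, hidden_size)


# Assumed sizes: BiLSTM word encoder output plus previous-tag embedding.
DECODER_INPUT = 2 * 100 + 50
DECODER_HIDDEN = 50

two_decoder = decoder_params(2, DECODER_INPUT, DECODER_HIDDEN)
single_decoder = decoder_params(1, DECODER_INPUT, DECODER_HIDDEN)
print(two_decoder, single_decoder)  # 120400 60200
```

Under these assumptions each additional attribute adds another full decoder LSTM, so the decoder cost is linear in the number of attributes.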

Shared Decoder Model

To overcome the issues with the two decoder model, we propose a shared decoder model (Figure 2). We share the encoder and decoder for the two tasks, and the common output from the decoder is fed into two different softmax layers, one for entities and one for negations.

Figure 2: Shared decoder model
Figure 3: Conditional softmax decoder model

Conditional Softmax Decoder Model

While the single decoder model is more scalable, we found that this model did not perform as well for negation as the two decoder model. This can be attributed to the fact that negation occurs less frequently than the entities, thus the decoder primarily focuses on making entity extraction predictions. To mitigate this issue and provide more context to negation attributes, we add an additional input, which is the softmax output from entity extraction (Figure 3). Thus, the model learns more about the input as well as the label distribution from the entity extraction prediction. As an example, we use negation only for the problem entity in the i2b2 dataset. Providing the entity prediction distribution helps the negation model to make better predictions. The negation model learns that if the prediction probability is not inclined towards the problem entity, then it should not predict negation irrespective of the word representation.


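$$\hat{y}^{Ent}_t = \mathrm{softmax}^{Ent}(W^{Ent} o_t + b^{Ent})$$

$$\hat{y}^{Neg}_t = \mathrm{softmax}^{Neg}(W^{Neg} [o_t \,\|\, \hat{y}^{Ent}_t] + b^{Neg})$$

where $o_t$ is the output of the shared decoder and $\hat{y}^{Ent}_t$ is the softmax output of the entity at time step $t$.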
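The conditioning step can be sketched in a few lines: the entity head is a softmax over a linear map of the shared decoder output, and the negation head is a softmax over a linear map of that same output concatenated with the entity distribution. This is a minimal sketch with toy weights, not the authors' implementation. The helper names, the separate bias vectors, and all numeric values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def linear(weights, bias, vector):
    """Affine map: one output per weight row."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def conditional_heads(decoder_output, w_ent, b_ent, w_neg, b_neg):
    """Entity softmax first, then a negation softmax conditioned on it."""
    entity_probs = softmax(linear(w_ent, b_ent, decoder_output))
    # The negation head sees the decoder output plus the entity distribution.
    negation_input = list(decoder_output) + entity_probs
    negation_probs = softmax(linear(w_neg, b_neg, negation_input))
    return entity_probs, negation_probs


# Toy sizes: decoder output of size 2, 3 entity tags, 2 negation tags.
decoder_output = [0.5, -1.0]
w_ent = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_ent = [0.0, 0.0, 0.0]
w_neg = [[0.1, 0.2, 2.0, 0.0, 0.0], [0.3, 0.1, 0.0, 1.0, 1.0]]
b_neg = [0.0, 0.0]
entity_probs, negation_probs = conditional_heads(
    decoder_output, w_ent, b_ent, w_neg, b_neg
)
print(entity_probs, negation_probs)
```

Because the negation weights also cover the entity probabilities, the negation head can learn to suppress negation when the entity distribution does not favor the relevant entity type.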

| Training data | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| 5% | Two Decoder | 0.525 | 0.719 | 0.607 |
| 5% | Conditional Decoder | 0.658 | 0.684 | 0.671 |
| 10% | Two Decoder | 0.720 | 0.781 | 0.749 |
| 10% | Conditional Decoder | 0.824 | 0.808 | 0.816 |
| 20% | Two Decoder | 0.864 | 0.797 | 0.829 |
| 20% | Conditional Decoder | 0.854 | 0.828 | 0.842 |

Table 2: Conditional softmax decoder is more robust in extreme low resource setting than its two decoder counterpart.


Dataset We evaluated our model on two datasets. First is the 2010 i2b2/VA challenge dataset for “test, treatment, problem” (TTP) entity extraction and assertion detection (i2b2 dataset). Unfortunately, only part of this dataset was made public after the challenge, therefore we cannot directly compare with NegEx and ABoW results. We followed the original data split from [R. Chalapathy and Piccardi2016] of 170 notes for training and 256 for testing. The second dataset is proprietary and consists of 4,200 de-identified annotated clinical notes with medical conditions (proprietary dataset). A summary of the datasets is presented in Table 3.

| | 2010 i2b2/VA | Proprietary |
|---|---|---|
| Tags | 13 | 37 |
| Notes | 426 | 4200 |
| Tokens | 416K | 1.5M |

Table 3: Overview of the i2b2 and the proprietary medical condition datasets.

Model settings

Word, character and tag embeddings are 100, 25, and 50 dimensions, respectively. Word embeddings are initialized using GloVe, while character and tag embeddings are learned. Character and word encoders have 50 and 100 hidden units, respectively, while the decoder LSTM has a hidden size of 50. Dropout is used after every LSTM, as well as for word embedding input. We use Adam as an optimizer. Our model is built using MXNet. Hyperparameters are tuned using Bayesian Optimization [Snoek, Larochelle, and Adams2012].

Training details Our models are trained until convergence, and we use the development set for both tasks to evaluate performance for early stopping. We performed two sets of experiments. The first set evaluates the performance of NER and negation assertion of the baseline, two decoder, shared decoder and conditional softmax decoder models on i2b2 and the medical condition datasets. The second set uses low resource settings, where we evaluate the performance of negation assertion of the conditional softmax decoder model on 5%, 10% and 20% of the proprietary medical condition training data. Development and test sets are kept at the original size.
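The low resource setup can be sketched as a subsampling step applied to the training split only, leaving the development and test sets untouched. The text does not say how the subsets were drawn, so sampling uniformly at random at the note level, the fixed seed, and the function name `subsample_training` are all assumptions.

```python
import random


def subsample_training(train_notes, fraction, seed=0):
    """Return a random subset of the training notes of the given fraction."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset_size = max(1, round(len(train_notes) * fraction))
    return rng.sample(train_notes, subset_size)


# Stand-in note identifiers; real notes would be annotated documents.
train_notes = list(range(1000))
subsets = {
    fraction: subsample_training(train_notes, fraction)
    for fraction in (0.05, 0.10, 0.20)
}
print({fraction: len(subset) for fraction, subset in subsets.items()})
```

Only the training split passes through this function, matching the statement that development and test sets are kept at the original size.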


Since there has been no prior work which has solved the two tasks as a joint model, we report the best results for both the individual tasks (Table 1). We observe that our baseline model for NER (Independent NER) presented in the methodology section outperforms the best model [R. Chalapathy and Piccardi2016] on the i2b2 challenge. The two decoder and the conditional softmax decoder (Conditional Decoder) models achieve even better results for NER than our baseline model, with the conditional decoder model achieving a new state of the art for the 2010 i2b2/VA challenge task. The shared decoder underperformed the other two models. This can be attributed to a single decoder which primarily focuses on making entity extraction predictions, which are more frequent than negations. The conditional decoder outperformed the baseline model on the negation prediction task and achieved an improvement of about 8% in F1 score compared to the baseline model, which suggests that modeling the named entity and negation tasks together helps in achieving better results than each of the tasks done independently.

We compare our models for negation detection against NegEx [Chapman et al.2001] and ABoW [Shivade et al.2015], which has the best results for the negation detection task on the i2b2 dataset. The conditional decoder model outperforms both NegEx and ABoW (Table 1). The low performance of NegEx and ABoW is mainly attributed to the fact that they use ontology lookup to index findings and negation regular expression search within a fixed scope. A similar trend was observed in the medical condition dataset. The important thing to note is the low F1 score for NegEx. This can primarily be attributed to abbreviations and misspellings in clinical notes, which cannot be handled well by rule-based systems.

To understand the advantage of the conditional decoder, we evaluated our model in extreme low data settings, where we used a sample of our training data. We observed that the conditional decoder outperforms the two decoder model and achieved an improvement of about 6% in F1 score with 5% and 10% of the training data (Table 2). As we increase the data size, the performance gap narrows, which suggests that the conditional decoder is robust in low resource settings.
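The gap can be read directly off Table 2. The short sketch below recomputes F1 as the harmonic mean of precision and recall and tabulates the difference between the two models at each data size. The numbers are copied from Table 2, and the helper names are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported F1 from Table 2: (two decoder, conditional decoder).
TABLE_2_F1 = {
    "5%": (0.607, 0.671),
    "10%": (0.749, 0.816),
    "20%": (0.829, 0.842),
}


def f1_gaps(table):
    """Conditional decoder F1 minus two decoder F1 for each data size."""
    return {
        size: round(conditional - two_decoder, 3)
        for size, (two_decoder, conditional) in table.items()
    }


gaps = f1_gaps(TABLE_2_F1)
print(gaps)
# F1 recomputed from the reported precision and recall at 5% data.
print(round(f1_score(0.525, 0.719), 3), round(f1_score(0.658, 0.684), 3))
```

The gap is largest at the two smallest data sizes and shrinks at 20%, which is the trend the paragraph above describes.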


In this paper, we have shown that named entity and negation assertion can be modeled in a multi-task setting. Joint learning with sharing of parameters provides better contextual representation and helps in alleviating problems associated with using neural networks for negation detection, thereby achieving better results than the rule-based systems. Our proposed conditional softmax decoder achieves the best results across both tasks and remains robust in extreme low data settings. For future work, we plan to investigate the model on other related tasks such as relation extraction and normalization, as well as the use of advanced conditional models.


  • [Bhatia, Arumae, and Celikkaya2018] Bhatia, P.; Arumae, K.; and Celikkaya, B. 2018. Dynamic transfer learning for named entity recognition. arXiv preprint arXiv:1812.05288.
  • [Bhatia, Guthrie, and Eisenstein2016] Bhatia, P.; Guthrie, R.; and Eisenstein, J. 2016. Morphological priors for probabilistic neural word embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 490–500.
  • [Bojanowski et al.2016] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
  • [Chapman et al.2001] Chapman, W. W.; Bridewell, W.; Hanbury, P.; Cooper, G. F.; and Buchanan, B. G. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics 34(5):301–310.
  • [Cheng, Baldwin, and Verspoor2017] Cheng, K.; Baldwin, T.; and Verspoor, K. 2017. Automatic negation and speculation detection in veterinary clinical text. In Proceedings of the Australasian Language Technology Association Workshop 2017, 70–78.
  • [de Bruijn et al.2011] de Bruijn, B.; Cherry, C.; Kiritchenko, S.; Martin, J.; and Zhu, X. 2011. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. Journal of the American Medical Informatics Association 18(5):557–562.
  • [Fancellu, Lopez, and Webber2016] Fancellu, F.; Lopez, A.; and Webber, B. 2016. Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 495–504.
  • [Gkotsis et al.2016] Gkotsis, G.; Velupillai, S.; Oellrich, A.; Dean, H.; Liakata, M.; and Dutta, R. 2016. Don’t let notes be misunderstood: A negation detection method for assessing risk of suicide in mental health records. In Proceedings of the Third Workshop on Computational Lingusitics and Clinical Psychology, 95–105.
  • [Harkema et al.2009] Harkema, H.; Dowling, J. N.; Thornblade, T.; and Chapman, W. W. 2009. Context: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of biomedical informatics 42(5):839–851.
  • [Jin et al.2018] Jin, M.; Bahadori, M. T.; Colak, A.; Bhatia, P.; Celikkaya, B.; Bhakta, R.; Senthivel, S.; Khalilia, M.; Navarro, D.; Zhang, B.; et al. 2018. Improving hospital mortality prediction with medical named entities and multimodal learning. arXiv preprint arXiv:1811.12276.
  • [Lample et al.2016] Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, 260–270.
  • [Peng and Dredze2017] Peng, N., and Dredze, M. 2017. Multi-task domain adaptation for sequence tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 91–100.
  • [Peng et al.2018] Peng, Y.; Wang, X.; Lu, L.; Bagheri, M.; Summers, R.; and Lu, Z. 2018. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings 2017:188.
  • [R. Chalapathy and Piccardi2016] R. Chalapathy, E. Z. B., and Piccardi, M. 2016. Bidirectional lstm-crf for clinical concept extraction. arXiv preprint arXiv:1611.08373.
  • [Rumeng, Jagannatha Abhyuday, and Hong2017] Rumeng, L.; Jagannatha Abhyuday, N.; and Hong, Y. 2017. A hybrid neural network model for joint prediction of presence and period assertions of medical events in clinical notes. In AMIA Annual Symposium Proceedings, volume 2017, 1149. American Medical Informatics Association.
  • [Shivade et al.2015] Shivade, C.; de Marneffe, M.-C.; Fosler-Lussier, E.; and Lai, A. M. 2015. Extending negex with kernel methods for negation detection in clinical text. In Proceedings of the Second Workshop on Extra-Propositional Aspects of Meaning in Computational Semantics (ExProM 2015), 41–46.
  • [Snoek, Larochelle, and Adams2012] Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, 2951–2959.
  • [Williams and Zipser1989] Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1(2):270–280.
  • [Wu et al.2014] Wu, S.; Miller, T.; Masanz, J.; Coarr, M.; Halgrim, S.; Carrell, D.; and Clark, C. 2014. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PloS one 9(11):e112774.
  • [Yang, Salakhutdinov, and Cohen2016] Yang, Z.; Salakhutdinov, R.; and Cohen, W. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.