The field of question answering (QA) has seen significant progress with several resources, models and benchmark datasets. Pre-trained neural language encoders like BERT devlin2018bert and its variants seo2016bidirectional; zhang2019sg have achieved near-human or even better performance on popular open-domain QA tasks such as SQuAD 2.0 rajpurkar2016squad. While there has been some progress in biomedical QA on medical literature vsuster2018clicr; tsatsaronis2012bioasq, existing models have not been similarly adapted to the clinical domain on electronic medical records (EMRs).
Community-shared large-scale datasets like emrQA pampari2018emrqa allow us to apply state-of-the-art models, establish benchmarks, innovate and adapt them to clinical domain-specific needs. emrQA enables question answering from electronic medical records (EMRs), where a question is asked by a physician against a patient’s medical record (clinical notes). Thus, we adapt these models for EMR QA while focusing on model generalization via the following: (1) learning to predict the logical form (a structured semantic representation that captures the answering needs corresponding to a natural language question) along with the answer and (2) incorporating medical entity embeddings into models for EMR QA. We now examine the motivation behind these.
A physician interacting with a QA system on EMRs may ask the same question in several different ways; one physician may frame a question as “Is the patient allergic to penicillin?” whereas another could frame it as “Does penicillin cause any allergic reactions to the patient?”. Since paraphrasing is a common form of generalization in natural language processing (NLP) bhagat2009acquiring, a QA model should be able to generalize well to such paraphrased question variants that may not be seen during training (and avoid simply memorizing the questions). However, current state-of-the-art models do not consider the use of meta-information such as the semantic parse or logical form of the questions in unstructured QA. In order to give the model the ability to understand the semantic information about the answering needs of a question, we frame our problem in a multitask learning setting where the primary task is extractive QA and the auxiliary task is the logical form prediction of the question.
Fine-tuning on medical corpora (MIMIC-III, PubMed johnson2016mimic; lee2020biobert) helps models like BERT align their representations according to medical vocabulary (since they were previously trained on open-domain corpora such as WikiText zhu2015aligning). However, another challenge for developing EMR QA models is that different physicians can use different medical terminology to express the same entity; e.g., “heart attack” vs. “myocardial infarction”. Mapping these phrases to the same UMLS semantic type (https://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml), disease or syndrome (dsyn), provides common information between such medical terminologies. Incorporating such entity information about tokens in the context and question can further improve the performance of QA models for the clinical domain.
Our contributions are as follows:
We establish state-of-the-art benchmarks for EMR QA on a large clinical question answering dataset, emrQA pampari2018emrqa.
We demonstrate that incorporating an auxiliary task of predicting the logical form of a question helps the proposed models generalize well over unseen paraphrases, improving the overall performance on emrQA by over BERT devlin2018bert and by over clinicalBERT alsentzer2019publicly. We support this hypothesis by running our proposed model over both emrQA and another clinical QA dataset, MADE jagannatha2019overview.
The predicted logical form for unseen paraphrases helps in understanding the model better and provides a rationale (explanation) for why the answer was predicted for the provided question. This information is critical in the clinical domain as it provides an accompanying answer justification for clinicians.
We incorporate medical entity information by including entity embeddings via the ERNIE architecture zhang2019ernie and observe that the model accuracy and ability to generalize goes up by over BERT devlin2018bert.
2 Problem Formulation
We formulate the EMR QA problem as a reading comprehension task. Given a natural language question (asked by a physician) and a context, where the context is a set of contiguous sentences from a patient’s EMR (unstructured clinical notes), the task is to predict the answer span from the given context. Along with the (question, context, answer) triplet, clinical entities extracted from the question and context are also available as input, as is the logical form (LF), a structured representation that captures the answering needs of the question through the entities, attributes and relations required to be in the answer pampari2018emrqa. A question may have multiple paraphrases, all of which map to the same LF (and the same answer; Fig. 1).
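To make the inputs of this formulation concrete, the following is a minimal sketch of one training example as a Python data structure. The class name, the field names, the toy clinical sentence and the entity type assignments are all illustrative assumptions and are not taken from the emrQA release.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EMRQASample:
    """One reading-comprehension example (all field names are illustrative)."""
    question: str
    context: str            # contiguous sentences from the clinical note
    answer_start: int       # character offset of the answer inside `context`
    answer_text: str
    logical_form: str
    # (surface text, UMLS semantic type) pairs found in the question/context
    question_entities: List[Tuple[str, str]] = field(default_factory=list)
    context_entities: List[Tuple[str, str]] = field(default_factory=list)

    def answer_span(self) -> Tuple[int, int]:
        """Return the (start, end) character span of the answer, end exclusive."""
        return self.answer_start, self.answer_start + len(self.answer_text)


# Invented toy note sentence, not real patient data.
context = "The patient was started on aspirin 81 mg daily."
sample = EMRQASample(
    question="What is the dosage of aspirin?",
    context=context,
    answer_start=context.index("81 mg"),
    answer_text="81 mg",
    logical_form="MedicationEvent (aspirin) [dosage=x]",
    question_entities=[("aspirin", "clnd")],                     # hypothetical tags
    context_entities=[("aspirin", "clnd"), ("81 mg", "qnco")],   # hypothetical tags
)
start, end = sample.answer_span()
```

Storing the answer as a character span keeps the example independent of any particular tokenizer; a model-specific pipeline would map the span to token indices.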
In this section, we briefly describe BERT devlin2018bert, ERNIE zhang2019ernie and our proposed model.
3.1 Bidirectional Encoder Representations from Transformers (BERT)
BERT devlin2018bert uses multi-layer bidirectional Transformer vaswani2017attention networks to encode contextualised language representations. BERT representations are learned from two tasks: masked language modeling taylor1953cloze and next sentence prediction. We chose BERT because pre-trained BERT models can be fine-tuned with just one additional inference layer, and because it achieved state-of-the-art results on a wide range of tasks, including question answering (SQuAD rajpurkar2016squad; rajpurkar2018know) and language inference (MultiNLI williams2017broad). clinicalBERT alsentzer2019publicly yielded superior performance on clinical NLP tasks such as the i2b2 named entity recognition (NER) challenges uzuner20112010. It was created by further fine-tuning BERT on biomedical and clinical corpora (MIMIC-III) johnson2016mimic.
3.2 Enhanced Language Representation with Informative Entities (ERNIE)
We adopt the ERNIE framework zhang2019ernie to integrate entity-level clinical concept information into the BERT architecture, which has not been explored in previous work. ERNIE has shown significant improvement on entity typing and relation classification tasks, as it utilises extra entity information provided by knowledge graphs. ERNIE uses BERT to extract contextualized token embeddings and a multi-head attention model to generate entity embeddings. These two sets of embeddings are aligned and provided as input to an information fusion layer, which produces entity-enriched token embeddings. For a token ($w_j$) and its aligned entity ($e_k$), the information fusion process is as follows:

$$h_j = \sigma(W_t w_j + W_e e_k + b)$$

Here $h_j$ represents the entity-enriched token embedding, $\sigma$ is the non-linear activation function, $W_t$ refers to an affine layer for token embeddings and $W_e$ refers to an affine layer for entity embeddings. For the tokens without corresponding entities, the information fusion process becomes:

$$h_j = \sigma(W_t w_j + b)$$
Initially, each entity embedding is assigned randomly and is fine-tuned along with token embeddings throughout the training procedure. The ERNIE architecture would be applicable to the model even if the logical forms are not available.
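The information fusion step can be illustrated with a small, dependency-free sketch: a token embedding and, when available, its aligned entity embedding are each passed through a linear map, summed with a bias, and sent through a non-linearity. The function names, the toy dimensions and the choice of GELU as the activation are assumptions of this sketch and not a description of the released implementation.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU; the choice of non-linearity is an assumption of this sketch."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def fuse(token: Vector, entity: Optional[Vector],
         w_t: Matrix, w_e: Matrix, bias: Vector) -> Vector:
    """Entity-enriched token embedding: act(W_t * token + W_e * entity + bias).

    When the token has no aligned entity, the entity term is dropped.
    """
    pre = [t + b for t, b in zip(matvec(w_t, token), bias)]
    if entity is not None:
        pre = [p + e for p, e in zip(pre, matvec(w_e, entity))]
    return [gelu(p) for p in pre]


# Toy dimensions: 2-d token space, 3-d entity space, 2-d fused space.
w_t = [[1.0, 0.0], [0.0, 1.0]]
w_e = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
bias = [0.0, 0.0]
token = [1.0, -1.0]
entity = [2.0, 2.0, 2.0]

with_entity = fuse(token, entity, w_t, w_e, bias)
without_entity = fuse(token, None, w_t, w_e, bias)
```

Because the entity embedding has its own projection, the token and entity spaces may have different dimensions, which matches the use of a separate embedding space for entities.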
3.3 Multi-task Learning for Extractive QA
In order to improve the ability of a QA model to generalize over paraphrases, it helps to provide the model with information about the logical form that links these paraphrases. Since the answer to all the paraphrased questions is the same (and hence, the logical form is the same), we constructed a multi-task learning framework to incorporate the logical form information into the model. Thus, along with predicting the answer span, we added an auxiliary task to also predict the corresponding logical form of the question. Multi-task learning provides an inductive bias to enhance the primary task’s performance via auxiliary tasks weng2019multimodal. In our setting, the primary task is span detection of the answer and the auxiliary task is logical form prediction for both emrQA and MADE (both datasets are explained in detail in § 4). The final loss for our model is defined as:

$$\mathcal{L}_{model} = \omega \mathcal{L}_{lf} + (1-\omega)\mathcal{L}_{span}$$

where $\omega$ is the weight given to the loss of the auxiliary task ($\mathcal{L}_{lf}$), logical form prediction, $\mathcal{L}_{span}$ is the loss for answer span prediction and $\mathcal{L}_{model}$ is the final loss for our proposed model. The multi-task learning model can work with both BERT and ERNIE as the base model. Figure 2 depicts the proposed multi-task model, which predicts both the answer and the logical form given a question, and the ERNIE architecture that is used to learn entity-enriched token embeddings.
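As a rough illustration of how a weighted multi-task objective of this kind can be computed, the sketch below combines a span-prediction loss with a logical form classification loss. It assumes softmax cross-entropy for both tasks, an average of start and end losses for the span, and a convex combination controlled by a single weight; these choices, like all function names, are assumptions of the sketch.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def span_loss(start_logits: List[float], end_logits: List[float],
              start: int, end: int) -> float:
    """Average of the start-position and end-position losses (assumed form)."""
    return 0.5 * (cross_entropy(start_logits, start)
                  + cross_entropy(end_logits, end))


def multitask_loss(start_logits: List[float], end_logits: List[float],
                   start: int, end: int,
                   lf_logits: List[float], lf_label: int,
                   omega: float) -> float:
    """Weighted sum of the auxiliary (LF) loss and the primary (span) loss."""
    l_span = span_loss(start_logits, end_logits, start, end)
    l_lf = cross_entropy(lf_logits, lf_label)
    return omega * l_lf + (1.0 - omega) * l_span


# Uniform logits over 4 context tokens and 2 candidate logical forms.
loss = multitask_loss([0.0] * 4, [0.0] * 4, start=1, end=2,
                      lf_logits=[0.0, 0.0], lf_label=0, omega=0.3)
```

Setting the weight to zero recovers a single-task span model, which makes it easy to compare the two training regimes with the same code path.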
We used the emrQA (https://github.com/panushri25/emrQA) and MADE (https://bio-nlp.org/index.php/projects/39-nlp-challenges) datasets for our experiments. We provide a brief summary of each dataset and the methodology followed to split these datasets into train and test sets.
The emrQA corpus pampari2018emrqa is the only community-shared clinical QA dataset that consists of questions, posed by physicians against electronic medical records (EMRs) of a patient, along with their answers. The dataset was developed by leveraging existing annotations available for other clinical natural language processing (NLP) tasks (i2b2 challenge datasets uzuner20112010). It is a credible resource for clinical QA as logical forms that are generated by a physician help slot fill question templates and extract corresponding answers from annotated notes. Multiple question templates can be mapped to the same logical form (LF), as shown in Table 1, and are referred to as paraphrases of each other.
|LF: MedicationEvent (medication) [dosage=x]|
|How much medication does the patient take per day?|
|What is her current dose of medication?|
|What is the current dose of the patient’s medication?|
|What is the current dose of medication?|
|What is the dosage of medication?|
|What was the dosage prescribed of medication?|
The emrQA corpus has over question, logical form, and answer/evidence triplets; an example of a context, question, its logical form and a paraphrase is shown in Fig. 1. The evidences are the sentences from the clinical note that are relevant to a particular question. There are total logical forms in the emrQA dataset (https://github.com/panushri25/emrQA/blob/master/templates/templates-all.csv).
The MADE 1.0 jagannatha2019overview dataset was hosted as an adverse drug reaction (ADR) and medication extraction challenge from EMRs. This dataset was converted into a QA dataset by following the same procedure as described for emrQA pampari2018emrqa. The MADE QA dataset is smaller than emrQA, as emrQA consists of multiple datasets taken from i2b2 uzuner20112010 whereas MADE only has relations and entity mentions specific to ADRs and medications. This resulted in a clinical QA dataset with different properties from emrQA. MADE also has fewer logical forms (8 LFs) than emrQA because of fewer entities and relations. The 8 LFs for MADE are provided in Appendix B.
4.1 Train/test splits
The emrQA dataset is generated using a semi-automated process that normalizes real physician questions to create question templates, associates expert-annotated logical forms with each template and slot-fills them using annotations for various NLP tasks from i2b2 challenge datasets (e.g., Fig. 1). emrQA is rich in paraphrases as physicians often tend to express the same information need in different ways. As shown in Table 1, all paraphrases of a question map to the same logical form. Thus, if a model has observed some of the paraphrases, it should be able to generalize to the others effectively with the help of their shared logical form “MedicationEvent (medication) [dosage=x]”. In order to simulate this, and test the true capability of the model to generalize to unseen paraphrased questions, we create a splitting scheme and refer to it as the paraphrase-level split.
The basic idea is that some of the question templates would be observed by the model during training and the remaining ones would be used during validation and testing. The steps taken for creating this split are enumerated below:
First, the clinical notes are separated into train, val and test sets. Then the question, logical form and context triplets are generated for each set resulting in the full dataset. Here the context is the set of contiguous sentences from the EMR.
Then for each logical form (LF), of its corresponding question templates are chosen for the train dataset and the rest are kept for the validation and test datasets. Considering the LF shown in Table 1, four of the question templates would be assigned for training and the remaining two would be assigned for validation/testing. Any sample in the training dataset whose question is generated from the validation/test template set would then be discarded. Similarly, any validation or test sample with a question generated from the training template set would be discarded.
To compare the generalizability performance of our model, we also keep a version of the training dataset with both sets of question templates (training and validation/test). Essentially, a baseline model which has observed all the question templates should be able to perform better on the validation/test template set than a model which has only observed the training template set. This comparison would help us in measuring the improvement in performance with the help of logical forms even when a set of question templates is not observed by the model.
The dataset statistics for both emrQA and MADE are shown in Table 2. The training set with both question template sets is shown with ‘(r)’ appended as a suffix, as it is essentially a random split, whereas the training set with only the training question templates is appended with ‘(pl)’ for paraphrase-level split.
|Dataset||Split||Train||Val||Test|
|emrQA||# Samples (pl)||133,589||21,666||19,401|
|emrQA||# Samples (r)||198,118||21,666||19,401|
|MADE||# Samples (pl)||73,224||4,806||9,235|
|MADE||# Samples (r)||113,975||4,806||9,235|
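The paraphrase-level split described in § 4.1 can be sketched as follows. The sketch assumes that each sample records the question template it was generated from, that templates are assigned per logical form with a seeded shuffle, and that the training fraction is a free parameter (the value used below is only chosen so that four of the six Table 1 templates land in training). The function names and the sample dictionaries are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple


def split_templates(lf_to_templates: Dict[str, List[str]],
                    train_fraction: float,
                    seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Assign each LF's templates to a training set or a held-out set."""
    rng = random.Random(seed)
    train: Set[str] = set()
    held_out: Set[str] = set()
    for lf in sorted(lf_to_templates):
        templates = sorted(lf_to_templates[lf])
        rng.shuffle(templates)
        n_train = round(train_fraction * len(templates))
        train.update(templates[:n_train])
        held_out.update(templates[n_train:])
    return train, held_out


def keep(samples: List[dict], allowed: Set[str]) -> List[dict]:
    """Discard samples whose question template is not in `allowed`."""
    return [s for s in samples if s["template"] in allowed]


LF = "MedicationEvent (medication) [dosage=x]"
TEMPLATES = {LF: [
    "How much medication does the patient take per day?",
    "What is her current dose of medication?",
    "What is the current dose of the patient’s medication?",
    "What is the current dose of medication?",
    "What is the dosage of medication?",
    "What was the dosage prescribed of medication?",
]}
train_templates, held_out_templates = split_templates(TEMPLATES, train_fraction=4 / 6)

# Samples are generated per note split first, then filtered by template set.
train_samples = [{"note": "train-note", "template": t} for t in TEMPLATES[LF]]
test_samples = [{"note": "test-note", "template": t} for t in TEMPLATES[LF]]

train_pl = keep(train_samples, train_templates)     # paraphrase-level training set
train_r = train_samples                             # random split: all templates kept
test_pl = keep(test_samples, held_out_templates)    # only unseen paraphrases
```

Splitting the notes before filtering by template keeps the test contexts disjoint from the training contexts, so the paraphrase-level training set differs from the random one only in which question templates it contains.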
In this section, we briefly discuss the experimental settings, clinical entity extraction method, implementation details of our proposed model and evaluation metrics for our experiments.
5.1 Experimental Setting
As a reading comprehension style task, the model has to identify the span of the answer given the question-context pair. For both the emrQA and MADE datasets, the span is marked as the answer to the question and the sentence is marked as the evidence. Hence, we perform extractive question answering at two levels: sentence and paragraph.
Sentence setting: The evidence sentence which contains the answer span is provided as the context to the question and the model has to predict the span of the answer, given the question.
Paragraph setting: Clinical notes are noisy and often contain incomplete sentences, lists and embedded tables, making it difficult to segment paragraphs in notes. Hence, we decided to define the context as the evidence sentence and the sentences around it. We randomly chose the length of the paragraph and another number less than the length of the paragraph. We chose that many contiguous sentences prior to the evidence sentence in the EMR, and the remaining sentences of the paragraph after the evidence sentence. We adopted this strategy because the model could otherwise have benefited from the information that the evidence sentence is exactly in the middle of a fixed-length paragraph. The model has to predict the span of the answer from the resulting paragraph (context) given the question.
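A minimal sketch of this paragraph construction is given below. It assumes that the evidence sentence counts towards the paragraph length, that the paragraph is truncated at note boundaries, and that a maximum paragraph length is supplied by the caller; none of these details, nor the function name, come from the released code.

```python
import random
from typing import List, Tuple


def build_paragraph(sentences: List[str], evidence_idx: int,
                    max_len: int, rng: random.Random) -> Tuple[List[str], int]:
    """Return (paragraph sentences, position of the evidence sentence in it)."""
    para_len = rng.randint(1, max_len)        # total number of sentences
    n_before = rng.randint(0, para_len - 1)   # strictly less than para_len
    n_after = para_len - 1 - n_before         # evidence sentence included (assumption)
    start = max(0, evidence_idx - n_before)   # clip at the beginning of the note
    end = min(len(sentences), evidence_idx + n_after + 1)  # clip at the end
    return sentences[start:end], evidence_idx - start


note = [f"sentence {i}" for i in range(10)]
paragraph, position = build_paragraph(note, evidence_idx=4, max_len=5,
                                      rng=random.Random(0))
```

Because both the paragraph length and the number of preceding sentences are sampled, the position of the evidence sentence varies from example to example and cannot be exploited as a shortcut.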
The datasets are appended with ‘-p’ and ‘-s’ for the paragraph and sentence settings respectively. The sentence setting is a relatively easier setting for the model than the paragraph setting because the scope of the answer is narrowed down to fewer tokens and there is less noise. For both settings, as also mentioned in § 4, we kept the train set where all the question templates (paraphrases) are observed by the model during training; it is referred to with the ‘(r)’ suffix, suggesting ‘random’ selection and no filtering based on question templates (paraphrases). All these dataset abbreviations are shown in the first column of Table 3.
5.2 Extracting Entity Information
MetaMap aronson2001effective uses a knowledge-intensive approach to discover different clinical concepts referred to in the text according to the Unified Medical Language System (UMLS) bodenreider2004unified. The clinical ontologies embedded in MetaMap, such as SNOMED spackman1997snomed and RxNorm liu2005rxnorm, are quite useful in extracting entities across diagnoses, medications, procedures and signs/symptoms. We shortlisted these entities (semantic types) by mapping them to the entities which were used for creating the logical forms of the questions, as these are the main entities for which the question has been posed. The selected entities are: acab, aggp, anab, anst, bpoc, cgab, clnd, diap, emod, evnt, fndg, inpo, lbpr, lbtr, phob, qnco, sbst, sosy and topp. Their descriptions are provided in Appendix C.
These filtered entities (Table 7), extracted from MetaMap, are provided to ERNIE. A separate embedding space is defined for the entity embeddings, which are passed through a multi-head attention layer vaswani2017attention before interacting with token embeddings in the information fusion layer. The entity-enriched token embeddings are then used to predict the span of the answer from the context. We fine-tuned these entity embeddings along with the token embeddings, as opposed to using pre-learned entity embeddings that are kept fixed during downstream tasks zhang2019ernie. The architecture is illustrated in Fig. 2.
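The filtering and token alignment of entities can be sketched as below. The input is assumed to be a list of pre-extracted mentions given as (start token, end token, semantic type) tuples; this is a hypothetical intermediate format and not MetaMap's actual output. Tagging every token of a mention, rather than only its first token, is likewise an assumption of the sketch.

```python
from typing import List, Optional, Tuple

# The 19 semantic types selected in this section.
SELECTED_TYPES = frozenset({
    "acab", "aggp", "anab", "anst", "bpoc", "cgab", "clnd", "diap", "emod",
    "evnt", "fndg", "inpo", "lbpr", "lbtr", "phob", "qnco", "sbst", "sosy",
    "topp",
})

Mention = Tuple[int, int, str]  # (start token, end token exclusive, semantic type)


def filter_mentions(mentions: List[Mention]) -> List[Mention]:
    """Keep only mentions whose semantic type is in the selected list."""
    return [m for m in mentions if m[2] in SELECTED_TYPES]


def align_to_tokens(num_tokens: int, mentions: List[Mention]) -> List[Optional[str]]:
    """Give each token the type of the first mention covering it, else None."""
    tags: List[Optional[str]] = [None] * num_tokens
    for start, end, sem_type in mentions:
        for i in range(start, min(end, num_tokens)):
            if tags[i] is None:
                tags[i] = sem_type
    return tags


tokens = ["what", "is", "the", "current", "dose", "of", "aspirin"]
raw_mentions = [(4, 5, "qnco"), (6, 7, "phsu")]  # 'phsu' is not a selected type
tags = align_to_tokens(len(tokens), filter_mentions(raw_mentions))
```

Tokens tagged `None` correspond to the case without an aligned entity, for which only the token embedding enters the information fusion layer.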
5.3 Implementation Details
The BERT model was released with pre-trained weights as $\text{BERT}_{\text{base}}$ and $\text{BERT}_{\text{large}}$. $\text{BERT}_{\text{base}}$ has fewer parameters but achieved state-of-the-art results on a number of open-domain NLP tasks. We performed our experiments with $\text{BERT}_{\text{base}}$ and hence, from here onwards, we refer to $\text{BERT}_{\text{base}}$ as BERT. A fine-tuned version of $\text{BERT}_{\text{base}}$ on clinical notes was released as clinicalBERT (cBERT) alsentzer2019publicly. We use cBERT as the multi-head attention model for getting the token representations in ERNIE. We refer to this version of ERNIE, with entities from MetaMap, as cERNIE for clinical ERNIE. Our final multi-task learning model, which incorporates an auxiliary task of predicting logical forms, is referred to as M-cERNIE for multi-task clinical ERNIE. The code for all the models is provided at https://github.com/emrQA/bionlp_acl20.
For our extractive question answering task, we utilised exact match and F1-score for evaluation as per earlier literature rajpurkar2016squad.
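For reference, the two metrics can be computed along the lines of the sketch below, which follows the commonly used SQuAD-style convention: answers are lower-cased, stripped of punctuation and English articles, and then compared either as whole strings (exact match) or as bags of tokens (F1). Whether exactly this normalization was applied in our evaluation scripts is an assumption of the sketch.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lower-case, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("81 mg.", "81 mg")
f1 = f1_score("81 mg daily", "81 mg")
```

Exact match rewards only fully correct spans, while F1 gives partial credit when the predicted span overlaps the gold span, which is why F1 is never lower than exact match on the same predictions.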
6 Results and Discussion
In this section, we compare the results of all the models that we introduced in § 3. With the help of different experiments, we try to analyse whether the induced entity and logical form information help the model in achieving better performance or not. We also analyse the logical form predictions to understand whether it provides a rationale for the answer predicted by our proposed model. The compiled results for all the models are shown in Table 3. The hyper-parameter values for the best performing models are provided in Appendix A.
|Dataset||Model||F1-score||Exact match|
|emrQA-s||cBERT||74.75 (+2.62)||67.25 (+1.44)|
|emrQA-s||cERNIE||77.39 (+5.26)||70.17 (+4.36)|
|emrQA-s||M-cERNIE||79.87 (+7.74)||71.86 (+6.05)|
|emrQA-p||cBERT||65.45 (+1.26)||57.58 (+1.28)|
|emrQA-p||cERNIE||66.15 (+1.96)||59.80 (+3.5)|
|emrQA-p||M-cERNIE||67.21 (+3.02)||61.22 (+4.92)|
|MADE-s||cBERT||70.19 (+1.74)||62.00 (+1.27)|
|MADE-s||cERNIE||71.51 (+3.06)||65.31 (+4.58)|
|MADE-s||M-cERNIE||73.83 (+5.38)||67.53 (+6.8)|
|MADE-p||cBERT||64.97 (+1.58)||58.94 (+1.45)|
|MADE-p||cERNIE||65.71 (+2.32)||60.55 (+3.06)|
|MADE-p||M-cERNIE||64.58 (+1.19)||59.39 (+1.9)|
Does clinical entity information improve models’ performance?
Across all settings, the F1-score of cERNIE improves by over BERT and over cBERT. The exact match performance improved by over BERT and over cBERT. Also, as expected, the performance in sentence setting (-s) improved relatively more than it did in paragraph-setting. The entity-enriched tokens help in identifying the tokens which are required by the question. For example, in Fig. 3, the token ‘infiltrative’ in the question as well as the context get highlighted with the help of the identified entity ‘topp’ (therapeutic or preventive procedure) and then relevant tokens in the context, chest x ray, get highlighted with the relevant entity ‘diap’ (diagnostic procedure). This information aids the model in narrowing down its focus to highlighted diagnostic procedures in the context for answer extraction.
Does logical form information help the model generalize better?
In order to answer this question, we compared the performance of our M-cERNIE model to cERNIE model and observed an improvement of in F1-score and an improvement of in exact match performance. Here as well, the performance improvement is more for sentence setting (-s) as compared to the paragraph setting (-p). This helps the model in understanding the information need expressed in the question and helps in narrowing down its focus to certain tokens as the candidate answer. As seen in example 3, the logical form helps in understanding that the ‘dose’ of ‘medication’ needs to be extracted from the context where ‘dose’ was already highlighted with the help of the entity embedding of ‘qnco’.
Overall, the performance of our proposed model improves the F1-score by and exact-match by over BERT model. Thus, embedding clinical entity information with the help of further fine-tuning, entity-enriching and logical form prediction help the model in performing better over the unseen paraphrases by a significant margin. For emrQA, the performance of M-cERNIE is still below the upper bound performance of the cBERT model which is achieved when all the question templates are observed (emrQA-s/p (r)) by the model but for MADE, in sentence setting (-s), the performance of M-cERNIE is even better than the upper bound model performance. For MADE-p dataset the performance dropped a little when the LF prediction information is added to the model which might be because MADE-p only has logical forms (Appendix B) in total, resulting in low variety between the questions. Thus, the auxiliary task did not add much value to the learning of the base model (cERNIE) at paragraph level.
Does the model provide a supporting rationale via logical form (LF) prediction?
We analyzed the performance of M-cERNIE on the MADE-s and emrQA-s datasets for logical form prediction, as we saw the most improvement in the sentence setting (-s). We calculated macro-weighted precision, recall and F1-score for logical form classification. The model achieved an F1-score of for both datasets in the exact match setting, as shown in Table 4. We analysed the confusion matrix of predicted LFs and observed that the model mainly gets confused between logical forms which convey similar semantic information, as shown in Fig. 4.
As Fig. 4 shows, both logical forms refer to quite similar information; hence, we decided to also obtain performance metrics (precision, recall and F1-score) in a relaxed setting. We designed this relaxed setting to be more realistic: the tokens of the predicted and actual logical forms are matched, rather than the whole logical form. An example of logical form tokenization is shown in Fig. 5.
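The relaxed matching can be sketched as a token-overlap score between the predicted and the actual logical form. The tokenizer below simply extracts alphanumeric runs, which is an assumption and may differ from the tokenization illustrated in Fig. 5; the function names are likewise illustrative. The two logical forms in the example are taken from Appendix B.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize_lf(logical_form: str) -> List[str]:
    """Split a logical form into alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[A-Za-z0-9]+", logical_form)


def relaxed_prf(predicted: str, actual: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 between two logical forms."""
    pred = Counter(tokenize_lf(predicted))
    gold = Counter(tokenize_lf(actual))
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted_lf = "MedicationEvent (medication) given ConditionEvent (x) OR SymptomEvent (x)"
actual_lf = ("MedicationEvent (treatment) OR ProcedureEvent (treatment) given "
             "ConditionEvent (x) OR SymptomEvent (x)")
precision, recall, f1 = relaxed_prf(predicted_lf, actual_lf)
```

In the exact match setting this prediction would count as an error, whereas the relaxed score credits the shared event types and relation, reflecting that the two logical forms express closely related information needs.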
The model achieves an F1-score of for emrQA-s and for MADE-s in the relaxed setting (Table 4). This suggests that the model can efficiently identify important semantic information from the question, which is critical for efficient QA. During inference, the M-cERNIE models yield a rationale regarding a new test question (unseen paraphrase) by predicting the logical form of the question as an auxiliary task. For example, the LF in Fig. 1 provides a rationale that any lab or procedure event related to the condition event needs to be extracted from the EMR for diagnosis.
Can logical form information be induced in multi-class QA tasks as well?
To answer this question, we performed another experiment where the model has to distinguish evidence sentences from non-evidence sentences, making it a two-class classification task. The model is provided a tuple of a question and a sentence, and it has to predict whether the sentence is evidence or not. The final loss of the model ($\mathcal{L}_{model}$) changes to:

$$\mathcal{L}_{model} = \omega \mathcal{L}_{lf} + (1-\omega)\mathcal{L}_{evidence}$$

where $\omega$ is the weight given to the loss of the auxiliary task ($\mathcal{L}_{lf}$), logical form prediction, $\mathcal{L}_{evidence}$ is the loss for evidence classification and $\mathcal{L}_{model}$ is the final loss for our proposed model. We conducted our experiments on the emrQA dataset as evidence sentences were provided in it. In the multi-class setting, the [CLS] token representation is used for evidence classification as well as logical form prediction.
The multi-task entity enriched model (M-cERNIE) achieved an absolute improvement of over cBERT and over cERNIE. This suggests that the inductive bias introduced via LF prediction does help in improving the overall performance of the model for multi-class QA as well.
7 Related Work
In the general domain, BERT-based models are on the top of different leader boards across various tasks, including QA tasks rajpurkar2018know; rajpurkar2016squad. The authors of nogueira2019passage applied BERT to the MS-MARCO passage retrieval QA task and observed improvement over state of the art results. nogueira2019document further extended the work by combining BERT with re-ranking of predictions for queries that will be issued for each document. However, BERT-based models have not been adapted to answering physician questions on EMRs.
In the case of domain-specific QA, logical forms or semantic parses are typically used to integrate the domain knowledge associated with knowledge base (KB) structured QA datasets, where a model is learnt for mapping a natural language question to an LF. GeoQuery zelle1996learning and ATIS dahl1994expanding are the oldest known manually generated question-LF annotations on closed-domain databases. QALD lopez2013evaluating, FREE 917 cai2013large and SIMPLEQuestions bordes2015large contain hundreds of hand-crafted questions and their corresponding database queries. Prior work has also used LFs as a way to generate questions via crowd-sourcing wang2015building. WEBQuestions berant2013semantic contains thousands of questions from Google search where the LFs are learned as latent representations in helping answer questions from Freebase. Prior work has not investigated the utility of logical forms in unstructured QA, especially as a means to generalize the QA model across different paraphrases of a question.
There have been efforts to use multi-task learning for efficient question answering. For example, the authors of mccann2018natural learned multiple tasks together, resulting in an overall boost in the performance of the model on SQuAD rajpurkar2016squad. Similarly, the authors of lu201912 utilised information across different tasks which lie at the intersection of vision and natural language processing to improve the performance of their model across all tasks. The authors of rawat2019naranjo utilised weak supervision while predicting the answer, but not much work has been done to incorporate the logical form of the question for unstructured question answering in a multi-task setting. Hence, we decided to explore this direction and incorporate the structured semantic information of the questions for extractive question answering.
The proposed entity-enriched QA models trained with an auxiliary task improve over the state-of-the-art models by about across the large-scale clinical QA dataset, emrQA pampari2018emrqa (as well as MADE jagannatha2019overview). We also show that multitask learning for logical forms along with the answer results in better generalizing over unseen paraphrases for EMR QA. The predicted logical forms also serve as an accompanying justification to the answer and help in adding credibility to the predicted answer for the physician.
This work is supported by MIT-IBM Watson AI Lab, Cambridge, MA USA.
Appendix A Model Hyper-parameters
Most of the hyper-parameters across our models remained the same: learning rate: , weight decay: , warm-up proportion: and hidden dropout probability: . The parameters that varied across models for different datasets are enumerated in Table 6. The hyper-parameters provided in Table 6 are for all models in a particular dataset. This also suggests that even after adding an auxiliary task, the proposed model doesn’t need a lot of hyper-parameter tuning.
Appendix B Logical forms (LFs) for MADE dataset
1. MedicationEvent (medication) [sig=x]
2. MedicationEvent (medication) causes ConditionEvent (x) OR SymptomEvent (x)
3. MedicationEvent (medication) given ConditionEvent (x) OR SymptomEvent (x)
4. [ProcedureEvent (treatment) given/conducted ConditionEvent (x) OR SymptomEvent (x)] OR [MedicationEvent (treatment) given ConditionEvent (x) OR SymptomEvent (x)]
5. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] OR ProcedureEvent (x) [date=x] given ConditionEvent (problem) OR SymptomEvent (problem)
6. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] given ConditionEvent (problem) OR SymptomEvent (problem)
7. MedicationEvent (treatment) OR ProcedureEvent (treatment) given ConditionEvent (x) OR SymptomEvent (x)
8. MedicationEvent (treatment) OR ProcedureEvent (treatment) improves/worsens/causes ConditionEvent (x) OR SymptomEvent (x)
Appendix C Selected entities from MetaMap
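The list of selected semantic types in the form of entities and their brief descriptors is provided in Table 7.
|Entity||Description|
|acab||Acquired Abnormality|
|aggp||Age Group|
|anab||Anatomical Abnormality|
|anst||Anatomical Structure|
|bpoc||Body Part, Organ, or Organ Component|
|cgab||Congenital Abnormality|
|clnd||Clinical Drug|
|diap||Diagnostic Procedure|
|emod||Experimental Model of Disease|
|evnt||Event|
|fndg||Finding|
|inpo||Injury or Poisoning|
|lbpr||Laboratory Procedure|
|lbtr||Laboratory or Test Result|
|phob||Physical Object|
|qnco||Quantitative Concept|
|sbst||Substance|
|sosy||Sign or Symptom|
|topp||Therapeutic or Preventive Procedure|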