has seen great strides due to the rise of pre-trained modelsDevlin et al. (2019). But in high-stakes domains like medical information extraction (Irvin et al., 2019; McDermott et al., 2020; Smit et al., 2020)
, machine learning models are still too error-prone to use broadly. Since they are not perfect, they typically play the role of assisting users in tasks like building cohortsPons et al. (2016) or in providing clinical decision support Demner-Fushman et al. (2009). To be most usable in conjunction with users, these systems should not just produce a decision, but a correctly-sourced justification that can be checked Rudie et al. (2019).
Our goal is to study document-level information extraction systems that are both accurate and which make predictions based on the correct information (Doshi-Velez and Kim, 2017). This process involves identifying what evidence the model actually used, verifying the model’s prediction based on that evidence, and checking whether that evidence aligns with what humans would use, which would allow a user to more quickly see if the system is correct. For example, in Figure 1, localizing the prediction of mass effect (a feature expressing whether there is evidence of brain displacement by a mass like a tumor) to the first two sentences allows a trained user in a clinical decision support setting to easily verify what was extracted here. Our evidence extraction hews to principles of both faithfulness and plausibility (Jain et al., 2020; Jacovi and Goldberg, 2020; Miller, 2019).
Rather than use complex approaches with intermediate latent variables for extraction Lei et al. (2016), we focus on what can be done with off-the-shelf pre-trained models Liu et al. (2019) using post-hoc interpretation. We explore various techniques for attribution but primarily use DeepLIFT Shrikumar et al. (2019) to find key parts of each document that were used by the model. We ask two questions: first, can we identify the document sentences that truly contributed to the prediction (faithfulness)? Using the ranking of sentences provided by DeepLIFT, we extract a set of sentences where the model returns nearly the same prediction as before, thus verifying that these sentences are a sufficient explanation for the model. Second, do these document sentences align with what users annotated (plausibility)? Unsurprisingly, we find that this alignment is low in a basic Transformer model.
To further improve the alignment with human annotation, we consider injecting small amounts of token-level labeled data. Critically, in the brain MRI extraction setting we consider (see Table 1), large-scale token-level annotation is not available; most instances in the dataset only have document-level labels from existing clinical decision support systems, making it a weakly-supervised setting (Pruthi et al., 2020a; Patel et al., 2020). We explore two methods for using this small amount of annotation, chiefly based around supervising or regularizing the model’s behavior. One notion is entropy maximization that the model should be uncertain when it isn’t exposed to sufficient evidence (Feng et al., 2019). Another is attention regularization where the model’s attentions should focus on the key pieces of evidence; while not a perfect cue (Jain and Wallace, 2019), we can investigate whether this then leads to a model whose explanations leverage this information more heavily.
We validate our methods first on a small dataset of radiologists’ observations from brain MRIs. These reports are annotated with document-level key features related to different aspects of the report, which we want to extract in a faithful way. We see positive results here even in a small-data condition, but to understand how this method would scale with larger amounts of data, we adapt the DocRED relation extraction task to be a document-level classification task. The question of which sentence in the document describes the relation between the two entities, if there even is one, is still quite challenging, and we show our techniques can lead to improvements in a weakly-labeled setting here as well.
Our contributions are (1) We apply evidence extraction methods to document-level IE, emphasizing a new brain MRI dataset that we annotate. (2) We explore using weak sentence-level supervision in two techniques adapted from prior work; (3) We evaluate pre-trained models and evidence extraction through DeepLIFT for plausibility compared to human annotation, while ensuring faithfulness of the evidence.
| Severe encephalomalacia in the temporal lobes and frontal lobes bilaterally with reactive gliosis in the left frontal lobe.  Moderate enlargement of the ventricular system.  No abnormal enhancement.  Near complete opacification of the left maxillary sinus. …|
We start with an example from brain MRI reports in Table 1. Medical information extraction involves tasks such as identifying important medical terms from text (Irvin et al., 2019; Smit et al., 2020) and normalizing names into standard concepts using domain-specific ontologies (Cho et al., 2017). One application in clinical decision support, shown here, requires extracting the values of certain key features (clinical findings) from these reports or medical images Rudie et al. (2021); Duong et al. (2019). This extraction should be accurate, but it should also make predictions that are correctly-sourced, to facilitate review by a radiologist or someone else using the system Rauschecker et al. (2020); Cook et al. (2018).
The finding section of a brain MRI radiology report often describes these key features in both explicit and implicit ways. For instance, contrast enhancement, one of our key features, is mentioned explicitly much of the time; see no abnormal enhancement
in the third sentence. A rule-based system can detect this type of evidence easily. But some key features are harder to identify and require reasoning over context and draw on implicit cues. For example,severe encephalomalacia in the first sentence and enlargement of the ventricular system in the following sentence are both implicit signs of positive mass effect and either is sufficient to infer the label. It is significantly harder to built a rule-based extractor for this case. Learning-based systems have the potential to do much better here, but lack of understanding about their behavior can lead to hard-to-predict failure modes, such as acausal prediction of key features (e.g., inferring evidence about mass effect from a hypothesized diagnosis somewhere in the report, where the causality is backwards).
Our work aims to leverage the ability of learning-based systems to capture implicit features while improving their ability to make correctly-sourced predictions.
2.2 Problem Setting
The problem we tackle in this work is document-level information extraction. Let be a document consisting of sentences. The document is annotated with a set of labels where is an auxiliary input specifying a particular task for this document (e.g., mass effect) and is the label associated with that task from a discrete label space. In our adaptation of the DocRED task, we consider
to classify the relationship (if any) between a pair of entitiesin a document, defined in Section 4.1.2.
Our method takes a pair and then computes the label from a predictor . We can then extract evidence post-hoc using a separate procedure such as a feature attribution method:
In addition to the labels , we assume access to a small number of examples with additional supervision in each domain. That is, for a triple, we also assume we are given a set of ground-truth evidence with sentence indices . This evidence should be sufficient to compute the label, but not always necessary; for example, if multiple sentences can contribute to the prediction, they might all be listed as supporting evidence here. See Section 3.3 for more details.
2.3 Related Work
Our work fits into a broader thread of relation extraction (Han et al., 2020). Due to the cost of collecting large-scale data with good quality, distant supervision (DS) (Mintz et al., 2009) and ways to denoise auto-labeled data from DS (Surdeanu et al., 2012; Wang et al., 2018) have been widely explored. However, the sentence-level setting typically features much less ambiguity about evidence needed to predict a relation compared to the document-level setting we explore. Several document-level RE datasets (Li et al., 2016a; Peng et al., 2017) have been proposed as well as efforts to tackle these tasks (Christopoulou et al., 2019; Xiao et al., 2020; Guoshun et al., 2020), which we explicitly build off of.
To identify the sentences that the model considers as evidence, we draw on a recent body of work in explainable NLP focused on identifying salient features of the input. These primarily consist of input attribution techniques, such as LIME (Ribeiro et al., 2016), input reductions (Li et al., 2016b; Feng et al., 2018), attention-based explanations (Bahdanau et al., 2015) and gradient-based methods (Simonyan et al., 2014; Selvaraju et al., 2017; Sundararajan et al., 2017; Shrikumar et al., 2017). In present work, we extract rationales using the DeepLIFT (DL) method (Sundararajan et al., 2017). Rather than focus on comparing techniques, we instead focus on doing a thorough evaluation of the capabilities of DL.111We found our qualitative conclusions to be the same with integrated gradients Sundararajan et al. (2017), but DeepLIFT overall performed better.
Frameworks for interpretable pipelines
Our goal of building a system grounded in evidence draws heavily on recent work on attribution techniques and model explanations, particularly notions of faithfulness and plausibility. Faithfulness refers to how accurately the explanation provided by the model truly reflects the information it used in the reasoning process (Jain et al., 2020). On the other hand, plausibility indicates to what extent the interpretation provided by the model makes sense to a person.222The ERASER benchmark DeYoung et al. (2020) is a notable recent effort to evaluate explanation plausibility. However, we do not consider it here; we focus on the document-level IE setting, and many of the ERASER tasks are not suitable or relevant for the approaches we consider, either being not natural (FEVER) or not having the same challenges as document-level classification.
“Select-then-predict” approaches are one way to enforce faithfulness in pipelines Jain et al. (2020): important snippets from inputs are extracted and passed through a classifier to make predictions. Past work has used hard (Lei et al., 2016) or soft (Zhang et al., 2016) rationales, and other work has explicitly looked at tradeoffs in the amount of text extracted (Paranjape et al., 2020).
Jacovi and Goldberg (2020) note several problems with this setup. Our work aims to align model behavior with what cues we expect a model to use (plausibility), but uses the predict-select-verify paradigm Jacovi and Goldberg (2020) to ensure that these are actually sufficient cues for the model. Like our work, Pruthi et al. (2020a) simultaneously trained a BERT-based model (Devlin et al., 2019) for the prediction task and a linear-CRF (Lafferty et al., 2001) module on top of it for the evidence extraction task with shared parameters. Compared to their work, we focus explicitly on what can be done with pre-trained models alone, not augmenting the model for evidence extraction.
The systems we devise take pairs as input and return (a) predicted labels for each ; (b) sets of extracted evidence sentences from an interpretation method. Figure 1 shows the basic setting.
3.1 Transformer Classification Model
We use RoBERTa (Liu et al., 2019) as our document classifier due to its strong performance in classification, training to minimize log loss in a standard way. For each of our two domains, we use different pre-trained weights, as described in the training details in Appendix A.1. The task inputs are described in Section 4.1
3.2 Interpretation for Evidence Extraction
Our base technique for evidence extraction uses the DeepLIFT method Shrikumar et al. (2019) to identify key input tokens. From our model , we compute attribution scores with respect to the predicted class for each token in the RoBERTa input representation. DeepLIFT attributes the change in the output from a reference output in terms of the difference in input from the reference input 333Our reference consists of replacing the inputs in with [MASK] tokens from RoBERTa..
We average over the absolute value of attribution score for each token in that sentence to give sentence-level scores . These give us a ranking of the sentences. Given a fixed number of evidence sentences to extract, we can extract the top sentences by these scores.
To verify the extracted evidence Jacovi and Goldberg (2020), our main technique (Sufficient) feeds the model increasingly large subsets of the document ranked by attribution scores (e.g., first , then , etc.) until it (a) makes the same prediction as when taking the whole document as input and (b) assigns that prediction at least
times the probability444The value of is a tolerance hyper-parameter for selecting sentences and it set to throughout the experiments. when the whole document is taken as input. We consider this attribution faithful: it is a subset of the input supporting the model’s decision judged as important by the attribution method.
3.3 Improving Evidence Extraction
While many document-level extraction settings do not have token-level attributions labeled for every decision, one can in practice annotate a small fraction of a dataset with such ground-truth rationales. This is indeed the case for our brain MRI case study. Past work has shown significant benefits from integrating this supervision into learning Strout et al. (2019); Dua et al. (2020); Pruthi et al. (2020b).
Assume that a subset of our labeled data consists of tuples with ground truth evidence sentence indices . We consider two modifications to our model training, namely attention regularization (Pruthi et al., 2020b), entropy maximization (Feng et al., 2018), and their combination. An illustration of both methods is shown in Figure 2.
Attention regularization encourages our model to leverage more information from . Specifically, let be a set of attentions from the [CLS] token in the final layer to all tokens in , where
is a vector of attentions for each token in sentence. During learning, we add the following loss over all evidence sentences to the training objective: , encouraging the model to attend to the labeled evidence sentences.
When there is no sufficient information contained in the text to infer any predictions, entropy maximization encourages a model to be uncertain, represented by a uniform probability distribution across all classes(DeYoung et al., 2020; Feng et al., 2019). Doing so should encourage the model to not make predictions based on irrelevant sentences. We can achieve this by taking a reduced document as input by removing evidence from original document . We treat pairs as extra training examples where we aim to maximize the entropy over all possible .
4.1 Datasets and Evaluation Metrics
We investigate our methods on (a) a small collection of brain MRI reports from radiologists’ observations; and (b) a modified version of the DocRED datatset. The statistics for both datatsets are included in Appendix B. For both datasets, we evaluate on task accuracy (captured by either accuracy or prediction macro-F1) as well as evidence selection accuracy (macro-F1) or precision, measuring how well the model’s evidence selection aligns with human annotations. We will use the Sufficient method defined in Section 3.2 to select evidence sentences which guarantee that our predictions on the given evidence subsets will match the model’s predictions on the full document. For the brain MRI report dataset, we evaluate evidence extraction by precision since human annotators typically only need to refer to one sentence to reach the conclusion but our model and baselines may extract more than one sentence.
4.1.1 Brain MRI Reports
We present a new dataset of radiology reports from brain MRIs. It consists of the “findings” sections of reports, which present observations about the image, with labels for pre-selected key features by attending physicians and fellows. Crucially, these features are labeled based on the original radiology image, not the report. The document-level labels are therefore noisy because the radiologists’ labels may disagree with the findings written in the report.
A key feature is an observable variable , which can take on emission values . We focus on the evaluation of two key features, namely contrast enhancement and mass effect, since they appear in most of manually annotated reports. For our RoBERTa classification model, we only feed the document and train separate classifiers for each key feature, with no shared parameters between these.
We have a moderate number (327) of reports that have noisy labels from the process above. We treat these as our training set. However, all of these labels are document-level.
To evaluate models’ performance on more fine-grained evidence labels, we randomly select unlabeled reports (not overlapping with the 327 for training) and asked four radiology residents to (1) assign key feature labels and reach consensus, while (2) highlighting sentences that support their decision making. We use Prodigy555https://prodi.gy as our annotation interface. See Appendix C for more details about our annotation instructions.
Pseudo sentence-level supervision
Since we only have limited number of annotated reports for evaluation, we need a way to prepare weak sentence-level supervision while training. To achieve this, we use sentences selected by a rule-based system as pseudo evidence to supervise models’ behavior. We use 10% of this as supervision while training for consistency with the DocRED setting.
Our rule-based system uses keyword matching to identify instances of mass effect and contrast enhancement in the reports. We use negspaCy666https://spacy.io/universe/project/negspacy to detect negations of these key features.
For the results in Section 5, we evaluate on reports that contain ground truth fine-grained annotations for either contrast enhancement or mass effect, respectively. There are 64 and 68 out of 86 documents total in each of these categories. We call this the BrainMRI set. When we restrict to this set for evaluation, all of the documents we study where the annotators labeled something related to contrast enhancement end up having an explicit mention of it. However, for mass effect, this is not always the case; Table 6 shows an example where mass effect is discussed implicitly in the first sentence.
4.1.2 Adapted DocRED
DocRED (Yao et al., 2019) is a document-level relation extraction (RE) dataset with large scale human annotation of relevant evidence sentences. Unlike sentence-level RE tasks (Qin et al., 2018; Alt et al., 2020), it requires reading multiple sentences and reasoning about complex interactions between entities. We adapt this to a document-level relation classification task: a document and two entity mentions within the document are provided and the task is to predict the relation between and . We synthesize these examples from the original dataset, and also sample random entity pairs from documents to which we assign an NA class in order to construct negative pairs exhibiting no relation.
The model input is represented as: [CLS]<ent-1>[SEP]<ent-2>[SEP]<doc>[SEP].
To make the setting more realistic, we do not use the large-scale evidence annotation and assume there is limited sentence-level supervision available. To be specific, we include 10% fine-grained annotations in our adapted DocRED dataset.
|Model Names||Input Text|
|Ent||Sentences containing at least one of the two query entities|
|First2||First two sentences|
|First3||First three sentences|
|BestPair||Two sentences yielding highest prediction prob. (incl. variants using regularization)|
|Sufficient||Sufficient sentences selected by DL (incl. variants using regularization)|
Due to the richer and higher-quality supervision in the DocRED setting, we conduct a larger set of ablations and comparisons there. We compare against a subset of these models in the radiology setting.
We consider a number of baselines for adapted DocRED which return both predicted labels as well as evidence. (1) Direct predicts the relation directly from the entity pairs without any sentences as input, using a model trained with just these inputs. (2) FullDoc takes the full document as selected evidence and uses the base RoBERTa model (3) Ent takes all sentences with entity mentions and as input; (4) First2, First3 retrieve the first and sentences, respectively; and (5) BestPair chooses the best sentence pair by first taking each individual sentence as input to the model and then picking top two sentences having highest probabilities on their predictions. This approximates an erasure-based method like LIME Ribeiro et al. (2016) in contrast to our DeepLIFT method.
Sufficient is our main method for both datasets, which we then augment with additional supervision as described in Section 3.3. We use subscripts attn, entropy, both and none to represent our types of regularization, which stands for attention regularization, entropy maximization, the combination of two, and neither.
We report both the accuracy and F for the model as well as the evidence selection F comparing to human judgments. We also report results in the reduced setting, where only the selected evidence sentences are fed to the RoBERTa model (trained over whole documents) as input. For our Sufficient method, this accuracy is the same as the full method by construction, but note that it can differ for other methods. This reduced setting serves as a sanity check for the faithfulness of our explanation techniques.
|Full Doc||Reduced Doc|
|Sufficientnone||66.6||42.1||Identical to Full Doc||16.5||2.84|
|Sufficientnone||69.5||60.9||Identical to Full Doc||33.5||2.84|
5.1 Results on Brain MRI
Table 3 shows the performance of our models and baselines in terms of label prediction and evidence extraction. In the mass effect setting, our Sufficientboth model achieves the highest evidence extraction precision of the learning-based models and nearly matches that of the rule-based system. It is difficult to be more reliable than a rule-based system, which will nearly always make correctly-sourced predictions. But this model is able to combine that reliability with the higher F of a learned model. Note that due to the high base rates of certain findings, we focus on F instead of accuracy. We see a similar pattern on contrast enhancement as well, although the evidence precision is lower in that case.
|Full Doc||Reduced Doc|
|Sufficientnone||83.0||66.0||Identical to Full Doc||67.2||1.42|
These results show that learning-based systems make accurate predictions in this domain, and that their evidence extraction can be improved with better training, even in spite of the small size of the training set. In section 5.2, we focus on the adapted DocRED setting, which allows us to examine our model’s performance in a higher-data regime.
Attribution scores are more peaked at the occurrence of key terms.
We conduct analysis on how the attribution scores from Sufficientboth are peaked around the correct evidence compare to that from Sufficientnone using our manually annotated set BrainMRI. To quantify this analysis, we take the mean of instance-wise average and maximum of the normalized attribution mass falling into a few explicit tokens. In particular, we consider enhancement for contrast enhancement and effect for mass effect, which are common explicit indicators in the context of specified key features. The results in Table 5 show attribution scores being peaked around the correct terms, highlighting that these models can be guided to not only make correct predictions but attend to the right information.
|Model||Mass Effect||Ctr. Enhance.|
|Model||An Example of mass effect, label: positive, evidence: 0 or 6|
|Sufficientnone|| These images show evidence of downward displacement of the brain stem with collapse of the interpeduncular cistern and caudal displacement of the mammary bodies typical for intracranial hypertension.  There is diffuse pachymeningeal enhancement evident.  Bilateral extra axial collections are evident the do not conform to the imaging characteristics of CSF are seen overlying the hemispheres.  These likely reflect blood tinged hygromas and there does appear to be a blood products in the deep tendon portion of the right sided collection on the patient’s left see image 14 series 2. There does appear to be a discrete linear subdural hematoma along the right tentorial leaf.  Subdural collection is noted on both sides of the falx as well.  There is mass effect at the level of the tentorial incisure due to transtentorial herniation with deformity of the midbrain.  There is no evidence an acute infarct.  No parenchymal hemorrhage is evident.  Apart from the meningeal enhancement there is no abnormal enhancement noted.|
|Sufficientboth|| These images show evidence of downward displacement of the brain stem with collapse of the interpeduncular cistern and caudal displacement of the mammary bodies typical for intracranial hypertension.  There is diffuse pachymeningeal enhancement evident.  Bilateral extra axial collections are evident the do not conform to the imaging characteristics of CSF are seen overlying the hemispheres.  These likely reflect blood tinged hygromas and there does appear to be a blood products in the deep tendon portion of the right sided collection on the patient’s left see image 14 series 2.  There does appear to be a discrete linear subdural hematoma along the right tentorial leaf.  Subdural collection is noted on both sides of the falx as well.  There is mass effect at the level of the tentorial incisure due to transtentorial herniation with deformity of the midbrain.  There is no evidence an acute infarct.  No parenchymal hemorrhage is evident.  Apart from the meningeal enhancement there is no abnormal enhancement noted.|
Table 6 shows visualizations of attribution scores for an example in BrainMRI using DeepLIFT. Notice that even though baseline models make correct predictions, their attribution mass is diffuse over the document. For example, imaging characteristics of CSF and see image 14 series 2 are unrelated to the diagnosis but may reflect a spurious correlation in the dataset; the high accuracy of this model may be for the wrong reasons. With the help of regularization, our model is capable of capturing implicit cues such as downward displacement of the brain stem, although it is trained on an extremely small training set with only explicit cues like mass effect in a weak sentence-level supervision framework.
5.2 Results on Adapted DocRED
Comparison to baselines
We see that the Ent baseline is quite strong at DocRED evidence extraction. However, our best method still exceeds this method on both label accuracy as well as evidence extraction while extracting more succinct explanations. We see that the ability to extract a variable-length explanation is key, with First2, First3 and BestPair performing poorly. Notably, these methods exhibit a drop in accuracy in the reduced doc setting for each method compared to the full doc setting, showing that the explanations extracted are not faithful.
Learning-based models with appropriate regularization perform relatively better in this larger-data setting
From Table 3 and Table 4, we can observe that various regularization techniques applied to Sufficient models maintain or improve overall model performance on both key feature and relation classification. We see that our Sufficient methods do not compromise on accuracy but make predictions based on plausible evidence sets, which is more evident when we have richer training data. We perform error analysis to probe further into our model’s behavior in Appendix D.
Faithfulness of techniques
One may be concerned that, like attention values Jain and Wallace (2019), our feature attribution methods may not faithfully reflect the computation of the model. We emphasize again that the Sufficient paradigm on top of the DeepLIFT method is faithful by some definition. For a model , we measure the faithfulness by checking the agreement between and , where is the extracted evidence we feed into the same model under the reduced document setting. This is shown for all methods in the “Reduced doc” columns in Tables 3 and 4. We see a drop in performance from techniques such as BestPair: the full model does not make the same judgment on these evidence subsets, but by definition it does in the Sufficient setting.
As further evidence of faithfulness, we note that only a relatively small number of evidence sentences, in line with human annotations, are extracted in the Sufficient method. These small subsets are indicated by feature attribution methods and sufficient to reproduce the original model predictions with high confidence. We believe this constitutes strong evidence that these explanations are faithful.
In this work, we develop techniques to employ small amount of token-annotated data to improve reliability of document-level IE systems in two domains. We systematically evaluate our model from perspectives of faithfulness and plausibility and show that we can substantially improve models’ capability in focusing on supporting evidence while maintaining their prediction performance, leading to models that are “right for the right reasons” and avoid learning spurious patterns.
This work was partially supported by NSF Grant IIS-1814522 and a Texas Health Catalyst grant. Thanks to Scott Rudkin, Gregory Mittl, Raghav Mattay, and Chuan Liang for assistance with the annotation.
- Probing linguistic features of sentence-level representations in neural relation extraction. In Proceedings of ACL.
- Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).
- A method for named entity normalization in biomedical articles: application to diseases and plants. BMC Bioinformatics 18 (1).
- Connecting the dots: document-level neural relation extraction with edge-oriented graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Bayesian network interface for assisting radiology interpretation and education. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, J. Zhang and P. Chen (Eds.).
- What can natural language processing do for clinical decision support? Journal of Biomedical Informatics 42 (5), pp. 760–772.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- ERASER: a benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Towards a rigorous science of interpretable machine learning. arXiv: Machine Learning.
- Benefits of intermediate annotations in reading comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5627–5634.
- Convolutional neural network for automated FLAIR lesion segmentation on clinical brain MR imaging. American Journal of Neuroradiology 40 (8), pp. 1282–1290.
- Misleading failures of partial-input baselines. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Reasoning with latent structure refinement for document-level relation extraction. In Proceedings of ACL.
- Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of ACL.
- More data, more relations, more context and more openness: a review and outlook for relation extraction. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, pp. 745–758.
- CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 590–597.
- Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3543–3556.
- Learning to faithfully rationalize by construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 282–289.
- Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
- BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016, pp. baw068.
- Understanding neural networks through representation erasure. CoRR abs/1612.08220.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint.
- Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR).
- CheXpert++: approximating the CheXpert labeler for speed, differentiability, and probabilistic output. CoRR abs/2006.15229.
- Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38.
- Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011.
- An information bottleneck approach for controlling conciseness in rationale extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Weakly supervised medication regimen extraction from medical conversations. In ClinicalNLP@EMNLP.
- Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5, pp. 101–115.
- Natural language processing in radiology: a systematic review. Radiology 279 (2), pp. 329–343. PMID: 27089187.
- Weakly- and semi-supervised evidence extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020.
- Evaluating explanations: how much do explanations from the teacher aid students? arXiv abs/2012.00893.
- Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 295 (3), pp. 626–637.
- "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1135–1144.
- . Journal of Digital Imaging 34 (4), pp. 1049–1058.
- Artificial intelligence system for automated brain MR diagnosis performs at level of academic neuroradiologists and augments resident performance. In Proceedings of the Society for Imaging Informatics in Medicine (SIIM).
- Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV).
- Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML ’17, pp. 3145–3153.
- Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR abs/1312.6034.
- CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In EMNLP.
- Do human rationales improve machine explanations? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 56–62.
- Axiomatic attribution for deep networks.
- Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 455–465.
- Adversarial multi-lingual neural relation extraction. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1156–1166.
- Denoising relation extraction from document-level distant supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- DocRED: a large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Appendix A Reproducibility
a.1 Implementation Details
We train all RoBERTa models for 15 epochs with early stopping on a single TITAN-Xp GPU. We use AdamW Loshchilov and Hutter (2019) as our optimizer and initialize the model with roberta-base for DocRED and biomed-roberta-base (Gururangan et al., 2020) for the brain MRI data, both with M parameters. The batch size is set to 16, and the learning rate is 1e-5 with a linear warmup schedule.
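The warmup schedule can be sketched as follows. This is an illustrative reimplementation (the actual training uses the scheduler bundled with the Transformers library), and the step counts in the example are made up for demonstration:

```python
def linear_warmup_lr(step, total_steps, warmup_steps, base_lr=1e-5):
    """Linearly ramp the learning rate up to base_lr over warmup_steps,
    then decay it linearly back to 0 by total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Illustrative values only (not the paper's actual step counts).
schedule = [linear_warmup_lr(s, total_steps=1000, warmup_steps=100)
            for s in range(1000)]
```

The peak of the schedule coincides with the end of warmup (step 100 here), after which the rate decays toward zero for the remainder of training.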
The maximum number of tokens in each document is capped at 296 for modified DocRED and 360 for radiology reports. These numbers are chosen such that the number of tokens for around 95% of the documents is within these limits. We do not perform extensive hyperparameter tuning in this work. The hidden state of the [CLS] token from the final layer is fed as input to a linear projection head to make predictions.
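The way the 296/360 caps were derived can be sketched with a small quantile helper; `length_cap` is a hypothetical function written for illustration, not code from the paper:

```python
def length_cap(token_counts, coverage=0.95):
    """Return the smallest length cap such that roughly `coverage`
    of the documents fit within it (a simple empirical quantile)."""
    counts = sorted(token_counts)
    idx = min(len(counts) - 1, int(coverage * len(counts)))
    return counts[idx]

# Toy corpus: 100 documents with lengths 1..100 tokens.
cap = length_cap(range(1, 101))  # a cap covering ~95% of documents
```

Documents longer than the cap are truncated before being fed to the model.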
Appendix B Dataset statistics
We provide statistics for both the adapted DocRED and the brain MRI reports datasets in Table 7. Both datasets are in English, and DocRED is publicly available at https://github.com/thunlp/DocRED.
Appendix C Annotation Instructions
The annotation instructions are provided in Figure 3. These were developed jointly with the annotators. In particular, decisions to exclude normal brain activity and confounders such as SVID were made to increase interannotator agreement after an initial round of annotation, making it easier for the labeling to focus on a single core disease or diagnosis per report.
Appendix D Error Analysis
Table 8: Representative error cases.

| Case | Example |
|---|---|
| Predicts correctly and extracts right evidence | Delphine “Delphi” Greenlaw is a fictional character on the New Zealand soap opera Shortland Street, who was portrayed by Anna Hutchison between 2002 and 2004. … |
| Predicts debatably correct answer, extracts reasonable evidence | Anton Erhard Martinelli (1684 – September 15, 1747) was an Austrian architect and master-builder of Italian descent. Martinelli was born in Vienna. … Anton Erhard Martinelli supervised the construction of several important buildings in Vienna, such as … He designed … He died in Vienna in 1747. |
| Predicts incorrectly on examples requiring a high amount of reasoning | Kurt Tucholsky (9 January 1890 – 21 December 1935) was a German-Jewish journalist, satirist, and writer. He also wrote under the pseudonyms Kaspar Hauser (after the historical figure), Peter Panter, Theobald Tiger and Ignaz Wrobel. … |
| Selects more sentences than are needed | Henri de Boulainvilliers … was a French nobleman, writer and historian. … Primarily remembered as an early modern historian of the French State, Boulainvilliers also published an early French translation of Spinoza’s Ethics and … The Comte de Boulainvilliers traced his lineage to … Much of Boulainvilliers’ historical work … |
The first example in Table 8 shows a representative case where our model predicts the correct relation and extracts reasonable supporting evidence. Unsurprisingly, this happens most often in simple cases when reasoning over the interaction of sentences is not required.
We observe a few common types of errors. The first is potential alternative relations or evidence. In around of our randomly selected error cases, our model either predicts debatably correct relations or picks up sentences that are related but not perfectly aligned with the human annotations. The second row in Table 8 illustrates an example where the two entities exhibit multiple relationships; the model’s prediction is correct (Vienna is the place where Martinelli was both born and died), but it differs from the annotated ground truth and supporting evidence. Such relations are relatively frequent in this dataset; a more complex multi-label prediction format would be necessary to fully support them.
Another type of error involves complex logical reasoning. Even when our model extracts the right evidence, it still fails in around of random error cases that require high-level reasoning. For example, to correctly predict the relation between Theobald Tiger and 21 December 1935 in the third example in Table 8, a model needs to recognize that Theobald Tiger and Kurt Tucholsky are in fact the same entity by way of the pseudonym relation, which is challenging to recognize.
Finally, the model sometimes selects more sentences than are truly needed. Interestingly, this is an error in terms of evidence plausibility but not in terms of prediction. The number of extracted sentences is very high in around of the random error cases. The last row of Table 8 is a representative example of this kind of error. Although the model has likely already extracted the right evidence in the first two steps, it continues selecting unnecessary sentences because the prediction confidence is not yet high enough, a drawback of the evidence-selection procedure described in Section 4.2. Moreover, our model extracts one more sentence on average when predicting incorrect relations, suggesting that in these cases it does not cleanly focus on the correct information.
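This over-selection failure mode can be reproduced with a minimal sketch of the confidence-driven selection loop: sentences are added in attribution-ranked order until the model's confidence on the selected subset reaches a threshold. The `predict` callable and the threshold below are stand-ins, not the paper's actual model or hyperparameters:

```python
def select_evidence(ranked_sents, predict, threshold):
    """Greedily add sentences (highest attribution first) until the
    model's confidence on the subset reaches `threshold`. If confidence
    recovers slowly, the loop keeps adding sentences -- the
    over-selection behavior discussed above."""
    selected = []
    for sent in ranked_sents:
        selected.append(sent)
        if predict(selected) >= threshold:
            break
    return selected

# Stand-in model whose confidence grows slowly with each added sentence,
# so extra sentences get pulled in before the threshold is reached.
toy_confidence = lambda sents: 0.2 * len(sents)
evidence = select_evidence(["s1", "s2", "s3", "s4", "s5"],
                           toy_confidence, threshold=0.9)
```

With a sharper confidence function (one that recovers after the first one or two sentences), the same loop terminates early; the failure cases correspond to flat confidence curves like the toy one above.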