Making Document-Level Information Extraction Right for the Right Reasons

Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated, but nevertheless can be inferred from the report's text. However, document-level neural models can easily learn spurious correlations from irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. While this basic approach can extract reasonable evidence, it can be regularized with small amounts of evidence supervision during training, which substantially improves the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019) and show that models' plausibility can be improved with no loss in accuracy.



1 Introduction

Document-level information extraction (Yao et al., 2019; Christopoulou et al., 2019; Xiao et al., 2020; Guoshun et al., 2020) has seen great strides due to the rise of pre-trained models (Devlin et al., 2019). But in high-stakes domains like medical information extraction (Irvin et al., 2019; McDermott et al., 2020; Smit et al., 2020), machine learning models are still too error-prone to use broadly. Since they are not perfect, they typically play the role of assisting users in tasks like building cohorts (Pons et al., 2016) or providing clinical decision support (Demner-Fushman et al., 2009). To be most usable in conjunction with users, these systems should not just produce a decision, but a correctly-sourced justification that can be checked (Rudie et al., 2019).

Figure 1: Our basic model setup. A Transformer-based model makes document-level predictions on an example of our brain MRI reports. An interpretation method extracts the evidence sentences used by the model. Three criteria (accuracy, faithfulness, and plausibility) govern our system.

Our goal is to study document-level information extraction systems that are both accurate and which make predictions based on the correct information (Doshi-Velez and Kim, 2017). This process involves identifying what evidence the model actually used, verifying the model’s prediction based on that evidence, and checking whether that evidence aligns with what humans would use, which would allow a user to more quickly see if the system is correct. For example, in Figure 1, localizing the prediction of mass effect (a feature expressing whether there is evidence of brain displacement by a mass like a tumor) to the first two sentences allows a trained user in a clinical decision support setting to easily verify what was extracted here. Our evidence extraction hews to principles of both faithfulness and plausibility (Jain et al., 2020; Jacovi and Goldberg, 2020; Miller, 2019).

Rather than use complex approaches with intermediate latent variables for extraction (Lei et al., 2016), we focus on what can be done with off-the-shelf pre-trained models (Liu et al., 2019) using post-hoc interpretation. We explore various techniques for attribution but primarily use DeepLIFT (Shrikumar et al., 2019) to find key parts of each document that were used by the model. We ask two questions. First, can we identify the document sentences that truly contributed to the prediction (faithfulness)? Using the ranking of sentences provided by DeepLIFT, we extract a set of sentences on which the model returns nearly the same prediction as before, thus verifying that these sentences are a sufficient explanation for the model. Second, do these sentences align with what users annotated (plausibility)? Unsurprisingly, we find that this alignment is low in a basic Transformer model.

To further improve the alignment with human annotation, we consider injecting small amounts of token-level labeled data. Critically, in the brain MRI extraction setting we consider (see Table 1), large-scale token-level annotation is not available; most instances in the dataset only have document-level labels from existing clinical decision support systems, making it a weakly-supervised setting (Pruthi et al., 2020a; Patel et al., 2020). We explore two methods for using this small amount of annotation, chiefly based around supervising or regularizing the model’s behavior. One notion is entropy maximization that the model should be uncertain when it isn’t exposed to sufficient evidence (Feng et al., 2019). Another is attention regularization where the model’s attentions should focus on the key pieces of evidence; while not a perfect cue (Jain and Wallace, 2019), we can investigate whether this then leads to a model whose explanations leverage this information more heavily.

We validate our methods first on a small dataset of radiologists’ observations from brain MRIs. These reports are annotated with document-level key features related to different aspects of the report, which we want to extract in a faithful way. We see positive results here even in a small-data condition, but to understand how this method would scale with larger amounts of data, we adapt the DocRED relation extraction task to be a document-level classification task. The question of which sentence in the document describes the relation between the two entities, if there even is one, is still quite challenging, and we show our techniques can lead to improvements in a weakly-labeled setting here as well.

Our contributions are: (1) we apply evidence extraction methods to document-level IE, emphasizing a new brain MRI dataset that we annotate; (2) we explore using weak sentence-level supervision in two techniques adapted from prior work; and (3) we evaluate pre-trained models and evidence extraction through DeepLIFT for plausibility compared to human annotation, while ensuring faithfulness of the evidence.

Report Finding
[0] Severe encephalomalacia in the temporal lobes and frontal lobes bilaterally with reactive gliosis in the left frontal lobe. [1] Moderate enlargement of the ventricular system. [2] No abnormal enhancement. [3] Near complete opacification of the left maxillary sinus. …
mass_effect: negative evid: [0, 1] implicit
side: bilateral evid: [0] explicit
t2: increased evid: [0] implicit
contrast_enhancement: No evid: [2] explicit
Table 1: Example from annotated brain MRI reports. Labels and supporting evidence for key features are annotated for this example report presented. “Explicit” means the label of given key feature can be directly inferred by the highlighted terms; “implicit” instead indicates that it requires domain knowledge and potential reasoning skills to label. We want the model to identify implicit features while not leveraging dataset biases or reasoning incorrectly about explicit ones.

2 Background

2.1 Motivation

We start with an example from brain MRI reports in Table 1. Medical information extraction involves tasks such as identifying important medical terms from text (Irvin et al., 2019; Smit et al., 2020) and normalizing names into standard concepts using domain-specific ontologies (Cho et al., 2017). One application in clinical decision support, shown here, requires extracting the values of certain key features (clinical findings) from these reports or medical images (Rudie et al., 2021; Duong et al., 2019). This extraction should be accurate, but it should also make predictions that are correctly sourced, to facilitate review by a radiologist or someone else using the system (Rauschecker et al., 2020; Cook et al., 2018).

The finding section of a brain MRI radiology report often describes these key features in both explicit and implicit ways. For instance, contrast enhancement, one of our key features, is mentioned explicitly much of the time; see "no abnormal enhancement" in the third sentence. A rule-based system can detect this type of evidence easily. But some key features are harder to identify, requiring reasoning over context and drawing on implicit cues. For example, "severe encephalomalacia" in the first sentence and "enlargement of the ventricular system" in the following sentence are both implicit signs of positive mass effect, and either is sufficient to infer the label. It is significantly harder to build a rule-based extractor for this case. Learning-based systems have the potential to do much better here, but lack of understanding about their behavior can lead to hard-to-predict failure modes, such as acausal prediction of key features (e.g., inferring evidence about mass effect from a hypothesized diagnosis somewhere in the report, where the causality is backwards).

Our work aims to leverage the ability of learning-based systems to capture implicit features while improving their ability to make correctly-sourced predictions.

2.2 Problem Setting

The problem we tackle in this work is document-level information extraction. Let $D$ be a document consisting of sentences $s_1, \ldots, s_n$. The document is annotated with a set of labels $\{(t_i, y_i)\}$, where $t_i$ is an auxiliary input specifying a particular task for this document (e.g., mass effect) and $y_i$ is the label associated with that task from a discrete label space. In our adaptation of the DocRED task, we consider $t$ to classify the relationship (if any) between a pair of entities $(e_1, e_2)$ in a document, defined in Section 4.1.2.

Our method takes a pair $(D, t)$ and computes the label from a predictor: $\hat{y} = f(D, t)$. We can then extract evidence post-hoc using a separate procedure such as a feature attribution method: $\hat{E} = g(D, t, f)$.

In addition to the labels $y$, we assume access to a small number of examples with additional supervision in each domain. That is, for a $(D, t, y)$ triple, we also assume we are given a set of ground-truth evidence sentence indices $E \subseteq \{1, \ldots, n\}$. This evidence should be sufficient to compute the label, but not always necessary; for example, if multiple sentences can contribute to the prediction, they might all be listed as supporting evidence here. See Section 3.3 for more details.
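As a concrete illustration of this data setup, one hypothetical representation of an annotated example (field names are our own, not from the paper) might be:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Example:
    """One annotated document-task pair; field names are illustrative only."""
    sentences: List[str]  # document D, split into sentences s_1..s_n
    task: str             # auxiliary input t, e.g. "mass_effect"
    label: str            # document-level label y
    # gold evidence sentence indices E; empty for the weakly-labeled majority
    evidence: List[int] = field(default_factory=list)
```

Most training examples would carry only `sentences`, `task`, and `label`; the small supervised subset additionally fills `evidence`.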

2.3 Related Work

Our work fits into a broader thread of relation extraction (Han et al., 2020). Due to the cost of collecting large-scale data with good quality, distant supervision (DS) (Mintz et al., 2009) and ways to denoise auto-labeled data from DS (Surdeanu et al., 2012; Wang et al., 2018) have been widely explored. However, the sentence-level setting typically features much less ambiguity about evidence needed to predict a relation compared to the document-level setting we explore. Several document-level RE datasets (Li et al., 2016a; Peng et al., 2017) have been proposed as well as efforts to tackle these tasks (Christopoulou et al., 2019; Xiao et al., 2020; Guoshun et al., 2020), which we explicitly build off of.

Explanation techniques

To identify the sentences that the model considers as evidence, we draw on a recent body of work in explainable NLP focused on identifying salient features of the input. These primarily consist of input attribution techniques, such as LIME (Ribeiro et al., 2016), input reduction (Li et al., 2016b; Feng et al., 2018), attention-based explanations (Bahdanau et al., 2015), and gradient-based methods (Simonyan et al., 2014; Selvaraju et al., 2017; Sundararajan et al., 2017; Shrikumar et al., 2017). In the present work, we extract rationales using the DeepLIFT (DL) method (Shrikumar et al., 2019). Rather than focus on comparing techniques, we instead focus on doing a thorough evaluation of the capabilities of DL.[1]

[1] We found our qualitative conclusions to be the same with integrated gradients (Sundararajan et al., 2017), but DeepLIFT overall performed better.

Frameworks for interpretable pipelines

Our goal of building a system grounded in evidence draws heavily on recent work on attribution techniques and model explanations, particularly notions of faithfulness and plausibility. Faithfulness refers to how accurately the explanation provided by the model reflects the information it used in the reasoning process (Jain et al., 2020). Plausibility, on the other hand, indicates to what extent the interpretation provided by the model makes sense to a person.[2]

[2] The ERASER benchmark (DeYoung et al., 2020) is a notable recent effort to evaluate explanation plausibility. However, we do not consider it here; we focus on the document-level IE setting, and many of the ERASER tasks are not suitable or relevant for the approaches we consider, either being not natural (FEVER) or not having the same challenges as document-level classification.

“Select-then-predict” approaches are one way to enforce faithfulness in pipelines (Jain et al., 2020): important snippets from inputs are extracted and passed through a classifier to make predictions. Past work has used hard (Lei et al., 2016) or soft (Zhang et al., 2016) rationales, and other work has explicitly looked at tradeoffs in the amount of text extracted (Paranjape et al., 2020).

Jacovi and Goldberg (2020) note several problems with this setup. Our work aims to align model behavior with what cues we expect a model to use (plausibility), but uses the predict-select-verify paradigm Jacovi and Goldberg (2020) to ensure that these are actually sufficient cues for the model. Like our work, Pruthi et al. (2020a) simultaneously trained a BERT-based model (Devlin et al., 2019) for the prediction task and a linear-CRF (Lafferty et al., 2001) module on top of it for the evidence extraction task with shared parameters. Compared to their work, we focus explicitly on what can be done with pre-trained models alone, not augmenting the model for evidence extraction.

3 Methods

The systems we devise take $(D, t)$ pairs as input and return (a) a predicted label for each pair; and (b) a set of extracted evidence sentences from an interpretation method. Figure 1 shows the basic setting.

3.1 Transformer Classification Model

We use RoBERTa (Liu et al., 2019) as our document classifier due to its strong performance in classification, training to minimize log loss in a standard way. For each of our two domains, we use different pre-trained weights, as described in the training details in Appendix A.1. The task inputs are described in Section 4.1.

3.2 Interpretation for Evidence Extraction

Our base technique for evidence extraction uses the DeepLIFT method (Shrikumar et al., 2019) to identify key input tokens. From our model $f$, we compute attribution scores with respect to the predicted class for each token in the RoBERTa input representation. DeepLIFT attributes the change in the output from a reference output in terms of the difference of the input from a reference input.[3]

[3] Our reference consists of replacing the inputs in $D$ with [MASK] tokens from RoBERTa.

We average the absolute value of the attribution scores over the tokens in each sentence to give sentence-level scores, which give us a ranking of the sentences. Given a fixed number $k$ of evidence sentences to extract, we can take the top $k$ sentences by these scores.
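A minimal sketch of this aggregation step (the function names and the use of NumPy are our own; the paper does not specify an implementation):

```python
import numpy as np

def sentence_scores(token_attr, sent_spans):
    """Average |attribution| over the tokens of each sentence span (start, end)."""
    return [float(np.mean(np.abs(token_attr[s:e]))) for s, e in sent_spans]

def rank_sentences(scores):
    """Sentence indices ordered by score, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The resulting ranking is what the verification step below consumes.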

To verify the extracted evidence (Jacovi and Goldberg, 2020), our main technique (Sufficient) feeds the model increasingly large subsets of the document ranked by attribution scores (the top sentence, then the top two, etc.) until it (a) makes the same prediction as when taking the whole document as input and (b) assigns that prediction at least $\alpha$ times the probability assigned when the whole document is taken as input.[4] We consider this attribution faithful: it is a subset of the input supporting the model's decision judged as important by the attribution method.

[4] The value of $\alpha$ is a tolerance hyperparameter for selecting sentences and is held fixed throughout the experiments.
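The verification loop can be sketched as follows (the `predict` callback, function name, and the default tolerance value are our own; the paper's tolerance setting is not reproduced here):

```python
def sufficient_evidence(predict, sentences, ranking, alpha=0.9):
    """Grow the top-ranked sentence set until the model reproduces its
    full-document prediction with at least alpha times the probability.
    predict(list_of_sentences) -> (label, probability)."""
    full_label, full_prob = predict(sentences)
    for k in range(1, len(ranking) + 1):
        subset = sorted(ranking[:k])  # restore document order
        label, prob = predict([sentences[i] for i in subset])
        if label == full_label and prob >= alpha * full_prob:
            return subset
    return list(range(len(sentences)))  # fall back to the whole document
```

By construction, running the classifier on the returned subset yields the same label as the full document, which is why the "Reduced Doc" accuracy of Sufficient matches the full-document accuracy.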

Figure 2: An illustration of attention regularization and entropy maximization using the example in Table 1. The model is predicting the label for key feature t2.

3.3 Improving Evidence Extraction

While many document-level extraction settings do not have token-level attributions labeled for every decision, one can in practice annotate a small fraction of a dataset with such ground-truth rationales. This is indeed the case for our brain MRI case study. Past work has shown significant benefits from integrating this supervision into learning (Strout et al., 2019; Dua et al., 2020; Pruthi et al., 2020b).

Assume that a subset of our labeled data consists of $(D, t, y)$ tuples with ground-truth evidence sentence indices $E$. We consider two modifications to our model training, namely attention regularization (Pruthi et al., 2020b) and entropy maximization (Feng et al., 2018), as well as their combination. An illustration of both methods is shown in Figure 2.

Attention regularization

Attention regularization encourages our model to leverage more information from the annotated evidence $E$. Specifically, let $A$ be the set of attentions from the [CLS] token in the final layer to all tokens in $D$, where $a_i$ is the vector of attentions over the tokens of sentence $s_i$. During learning, we add a loss term over all evidence sentences in $E$ to the training objective, encouraging the model to attend to the labeled evidence sentences.
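One plausible form of such a loss term, maximizing the [CLS] attention mass on evidence tokens, can be sketched as follows (this is our own formulation; the paper's exact objective is not reproduced here):

```python
import numpy as np

def attention_reg_loss(cls_attn, evidence_mask):
    """-log of the [CLS]-token attention mass falling on evidence tokens.
    cls_attn: attention distribution over tokens (sums to 1);
    evidence_mask: 1 for tokens inside gold evidence sentences, else 0."""
    mass = float(np.sum(np.asarray(cls_attn) * np.asarray(evidence_mask)))
    return -float(np.log(max(mass, 1e-12)))
```

The loss is zero only when all attention mass lies on evidence tokens, and grows as attention drifts to irrelevant sentences.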

Entropy maximization

When the text does not contain sufficient information to infer a prediction, entropy maximization encourages the model to be uncertain, represented by a uniform probability distribution across all classes (DeYoung et al., 2020; Feng et al., 2019). Doing so should discourage the model from making predictions based on irrelevant sentences. We achieve this by taking as input a reduced document $D'$, formed by removing the evidence sentences $E$ from the original document $D$. We treat $(D', t)$ pairs as extra training examples on which we aim to maximize the entropy of the predicted distribution over all possible labels.
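In training, maximizing entropy on reduced documents amounts to adding a negative-entropy term to the loss; a sketch under our own formulation:

```python
import numpy as np

def neg_entropy_loss(logits):
    """Negative entropy of softmax(logits); minimizing this term pushes the
    model toward a uniform distribution on evidence-free inputs."""
    z = np.asarray(logits, dtype=float)
    z -= z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()   # softmax
    return float(np.sum(p * np.log(p + 1e-12)))
```

The minimum is reached at the uniform distribution, where the loss equals $-\log C$ for $C$ classes.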

4 Experiments

4.1 Datasets and Evaluation Metrics

We investigate our methods on (a) a small collection of brain MRI reports from radiologists' observations; and (b) a modified version of the DocRED dataset. The statistics for both datasets are included in Appendix B. For both datasets, we evaluate task accuracy (captured by either accuracy or prediction macro-F1) as well as evidence selection accuracy (macro-F1) or precision, measuring how well the model's evidence selection aligns with human annotations. We use the Sufficient method defined in Section 3.2 to select evidence sentences, which guarantees that our predictions on the given evidence subsets match the model's predictions on the full document. For the brain MRI report dataset, we evaluate evidence extraction by precision, since human annotators typically only need to refer to one sentence to reach the conclusion, but our model and baselines may extract more than one sentence.

4.1.1 Brain MRI Reports

We present a new dataset of radiology reports from brain MRIs. It consists of the “findings” sections of reports, which present observations about the image, with labels for pre-selected key features by attending physicians and fellows. Crucially, these features are labeled based on the original radiology image, not the report. The document-level labels are therefore noisy because the radiologists’ labels may disagree with the findings written in the report.

A key feature is an observable variable that can take on a discrete set of values. We focus on the evaluation of two key features, namely contrast enhancement and mass effect, since they appear in most of the manually annotated reports. For our RoBERTa classification model, we feed only the document and train separate classifiers for each key feature, with no shared parameters between them.


We have a moderate number (327) of reports that have noisy labels from the process above. We treat these as our training set. However, all of these labels are document-level.

To evaluate models' performance on more fine-grained evidence labels, we randomly selected 86 unlabeled reports (not overlapping with the 327 used for training) and asked four radiology residents to (1) assign key feature labels and reach consensus, while (2) highlighting sentences that support their decision making. We use Prodigy as our annotation interface. See Appendix C for more details about our annotation instructions.

Pseudo sentence-level supervision

Since we have only a limited number of annotated reports, which are reserved for evaluation, we need a way to prepare weak sentence-level supervision for training. To achieve this, we use sentences selected by a rule-based system as pseudo evidence to supervise the model's behavior. We use 10% of this as supervision during training, for consistency with the DocRED setting.

Rule-based system

Our rule-based system uses keyword matching to identify instances of mass effect and contrast enhancement in the reports. We use negspaCy to detect negations of these key features.
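As an illustration only (the keyword list and regex negation pattern here are hypothetical, and the actual system uses negspaCy rather than a regex), such a matcher might look like:

```python
import re

# Illustrative negation cues; the real system relies on negspaCy's NegEx rules.
NEGATION = re.compile(r"\b(no|not|without|negative for)\b", re.IGNORECASE)

def match_key_feature(sentence, keywords=("mass effect", "enhancement")):
    """Return (matched, negated) for one report sentence."""
    hit = any(kw in sentence.lower() for kw in keywords)
    negated = bool(hit and NEGATION.search(sentence))
    return hit, negated
```

Sentences flagged by such rules serve as the pseudo evidence used for weak supervision.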

Data split

For the results in Section 5, we evaluate on reports that contain ground-truth fine-grained annotations for contrast enhancement or mass effect, respectively; 64 and 68 of the 86 total documents fall into these categories. We call this the BrainMRI set. When we restrict to this set for evaluation, all of the documents we study where the annotators labeled something related to contrast enhancement end up having an explicit mention of it. For mass effect, however, this is not always the case; Table 6 shows an example where mass effect is discussed implicitly in the first sentence.

4.1.2 Adapted DocRED

DocRED (Yao et al., 2019) is a document-level relation extraction (RE) dataset with large-scale human annotation of relevant evidence sentences. Unlike sentence-level RE tasks (Qin et al., 2018; Alt et al., 2020), it requires reading multiple sentences and reasoning about complex interactions between entities. We adapt this to a document-level relation classification task: a document $D$ and two entity mentions $e_1$ and $e_2$ within the document are provided, and the task is to predict the relation between $e_1$ and $e_2$. We synthesize these examples from the original dataset, and also sample random entity pairs from documents, to which we assign an NA class, in order to construct negative pairs exhibiting no relation.

The model input is represented as: [CLS]<ent-1>[SEP]<ent-2>[SEP]<doc>[SEP].
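For concreteness, the input string can be assembled as follows (a sketch; we keep the paper's [CLS]/[SEP] notation, though RoBERTa's tokenizer actually uses <s>/</s> special tokens):

```python
def build_input(ent1, ent2, doc_sentences, cls="[CLS]", sep="[SEP]"):
    """Assemble [CLS]<ent-1>[SEP]<ent-2>[SEP]<doc>[SEP] as a single string."""
    return f"{cls}{ent1}{sep}{ent2}{sep}{' '.join(doc_sentences)}{sep}"
```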

To make the setting more realistic, we do not use the large-scale evidence annotation and assume there is limited sentence-level supervision available. To be specific, we include 10% fine-grained annotations in our adapted DocRED dataset.

Model Names Input Text
Direct None
FullDoc Full document
Ent Sentences containing at least one of the two query entities
First2 First two sentences
First3 First three sentences
BestPair Two sentences yielding highest prediction prob. (incl. variants using regularization)
Sufficient Sufficient sentences selected by DL (incl. variants using regularization)
Table 2: Model names used in the experiments and their associated evidence given as inputs.

4.2 Models

Due to the richer and higher-quality supervision in the DocRED setting, we conduct a larger set of ablations and comparisons there. We compare against a subset of these models in the radiology setting.


We consider a number of baselines for adapted DocRED which return both predicted labels and evidence. (1) Direct predicts the relation directly from the entity pair without any sentences as input, using a model trained with just these inputs. (2) FullDoc takes the full document as the selected evidence and uses the base RoBERTa model. (3) Ent takes as input all sentences containing a mention of $e_1$ or $e_2$. (4) First2 and First3 retrieve the first two and three sentences, respectively. (5) BestPair chooses the best sentence pair by first taking each individual sentence as input to the model and then picking the two sentences with the highest prediction probabilities. This approximates an erasure-based method like LIME (Ribeiro et al., 2016), in contrast to our DeepLIFT method.

Sufficient is our main method for both datasets, which we then augment with additional supervision as described in Section 3.3. We use the subscripts attn, entropy, both, and none to denote attention regularization, entropy maximization, the combination of the two, and neither, respectively.

Table 2 summarizes the abbreviated names of models and their inputs for quick reference. The training details of our models are described in Appendix A.1.


We report both accuracy and F1 for the model, as well as evidence selection F1 compared to human judgments. We also report results in the reduced setting, where only the selected evidence sentences are fed to the RoBERTa model (trained over whole documents) as input. For our Sufficient method, this accuracy is the same as the full method by construction, but note that it can differ for other methods. This reduced setting serves as a sanity check for the faithfulness of our explanation techniques.
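The per-instance evidence metrics can be computed and then macro-averaged; a sketch with our own helper names:

```python
def evidence_precision(predicted, gold):
    """Fraction of extracted sentence indices that appear in the gold evidence set."""
    if not predicted:
        return 0.0
    return len(set(predicted) & set(gold)) / len(predicted)

def evidence_f1(predicted, gold):
    """Harmonic mean of evidence precision and recall for one instance."""
    p = evidence_precision(predicted, gold)
    r = len(set(predicted) & set(gold)) / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```

Precision is the headline metric for BrainMRI (where gold evidence is often a single sentence), and F1 for adapted DocRED.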

Model Label Evidence
Full Doc Reduced Doc
Acc F1 Acc F1 Pre Len
Mass Effect
Rule 77.9 11.8 77.9 11.8 84.8 1.46
Sufficientnone 66.6 42.1 Identical to Full Doc 16.5 2.84
Sufficientattn 69.2 47.6 65.6 2.31
Sufficiententropy 45.3 0.0 15.8 2.50
Sufficientboth 76.7 60.0 77.8 1.51
Contrast Enhancement
Rule 68.8 56.5 68.8 56.5 87.1 1.67
Sufficientnone 69.5 60.9 Identical to Full Doc 33.5 2.84
Sufficientattn 85.8 81.0 60.7 2.48
Sufficiententropy 71.5 59.5 25.2 2.55
Sufficientboth 90.8 87.2 71.7 1.50
Table 3: Model performance on BrainMRI. Models are evaluated under two settings by taking (a) the full document (Full Doc); (b) selected evidence (Reduced Doc) as inputs. Rule is the baseline mentioned in Section 4.1.1. Pre stands for the precision of evidence selection, and Len is the average number of sentences extracted.

5 Results

5.1 Results on Brain MRI

Table 3 shows the performance of our models and baselines in terms of label prediction and evidence extraction. In the mass effect setting, our Sufficientboth model achieves the highest evidence extraction precision of the learning-based models and nearly matches that of the rule-based system. It is difficult to be more reliable than a rule-based system, which will nearly always make correctly-sourced predictions. But this model is able to combine that reliability with the higher F1 of a learned model. Note that due to the high base rates of certain findings, we focus on F1 instead of accuracy. We see a similar pattern on contrast enhancement as well, although the evidence precision is lower in that case.

Model Label Evidence
Full Doc Reduced Doc
Acc F1 Acc F1 F1 Len
Direct 66.4 45.3
FullDoc 83.0 66.0 83.0 66.0 34.9 8.03
First2 75.3 58.1 47.9 2.00
First3 77.5 60.7 44.6 3.00
Ent 82.4 65.4 61.5 3.93
BestPairnone 83.0 66.0 73.9 55.3 39.2 2.00
BestPairattn 83.2 65.0 73.4 53.5 43.9 2.00
BestPairentropy 81.8 64.2 78.5 58.2 52.3 2.00
BestPairboth 82.7 66.5 81.6 65.3 66.2 2.00
Sufficientnone 83.0 66.0 Identical to Full Doc 67.2 1.42
Sufficientattn 83.2 65.0 70.3 1.45
Sufficiententropy 81.8 64.2 69.9 1.65
Sufficientboth 82.7 66.5 73.1 1.65
human 1.59
Table 4: Model performance on adapted DocRED. Models are evaluated under two settings as in BrainMRI.

These results show that learning-based systems make accurate predictions in this domain, and that their evidence extraction can be improved with better training, even in spite of the small size of the training set. In section 5.2, we focus on the adapted DocRED setting, which allows us to examine our model’s performance in a higher-data regime.

Attribution scores are more peaked at the occurrence of key terms.

We analyze how much more peaked the attribution scores from Sufficientboth are around the correct evidence compared to those from Sufficientnone, using our manually annotated BrainMRI set. To quantify this, we take the mean of the instance-wise average and maximum of the normalized attribution mass falling on a few explicit tokens. In particular, we consider enhancement for contrast enhancement and effect for mass effect, which are common explicit indicators in the context of the specified key features. The results in Table 5 show attribution scores are peaked around the correct terms, highlighting that these models can be guided to not only make correct predictions but also attend to the right information.

Model Mass Effect Ctr. Enhance.
Mean Max Mean Max
Sufficientnone 7.3 7.4 28.6 29.8
Sufficientboth 18.9 19.2 37.9 42.0
Table 5: Distributions of attribution mass over explicit cues (“enhancement” for contrast enhancement and “effect” for mass effect) for our best model and the baseline. Mean/Max is the mean of instance-wise average/maximum of the normalized attribution mass falling on the given token. Concentration on these cues increases with attention regularization.
Model An Example of mass effect,  label: positive,  evidence: 0 or 6
Sufficientnone [0] These images show evidence of downward displacement of the brain stem with collapse of the interpeduncular cistern and caudal displacement of the mammary bodies typical for intracranial hypertension. [1] There is diffuse pachymeningeal enhancement evident. [2] Bilateral extra axial collections are evident the do not conform to the imaging characteristics of CSF are seen overlying the hemispheres. [3] These likely reflect blood tinged hygromas and there does appear to be a blood products in the deep tendon portion of the right sided collection on the patient’s left see image 14 series 2. [4]There does appear to be a discrete linear subdural hematoma along the right tentorial leaf. [5] Subdural collection is noted on both sides of the falx as well. [6] There is mass effect at the level of the tentorial incisure due to transtentorial herniation with deformity of the midbrain. [7] There is no evidence an acute infarct. [8] No parenchymal hemorrhage is evident. [9] Apart from the meningeal enhancement there is no abnormal enhancement noted.
Sufficientboth [0] These images show evidence of downward displacement of the brain stem with collapse of the interpeduncular cistern and caudal displacement of the mammary bodies typical for intracranial hypertension. [1] There is diffuse pachymeningeal enhancement evident. [2] Bilateral extra axial collections are evident the do not conform to the imaging characteristics of CSF are seen overlying the hemispheres. [3] These likely reflect blood tinged hygromas and there does appear to be a blood products in the deep tendon portion of the right sided collection on the patient’s left see image 14 series 2. [4] There does appear to be a discrete linear subdural hematoma along the right tentorial leaf. [5] Subdural collection is noted on both sides of the falx as well. [6] There is mass effect at the level of the tentorial incisure due to transtentorial herniation with deformity of the midbrain. [7] There is no evidence an acute infarct. [8] No parenchymal hemorrhage is evident. [9] Apart from the meningeal enhancement there is no abnormal enhancement noted.
Table 6: An illustration of models’ attribution scores over a report from BrainMRI using DeepLIFT, with and without regularization techniques. Sufficient_both appears to leverage more information from the right sentences.

Table 6 shows visualizations of attribution scores for an example in BrainMRI using DeepLIFT. Notice that even though baseline models make correct predictions, their attribution mass is diffuse over the document. For example, imaging characteristics of CSF and see image 14 series 2 are unrelated to the diagnosis but may reflect a spurious correlation in the dataset; the high accuracy of this model may be for the wrong reasons. With the help of regularization, our model is capable of capturing implicit cues such as downward displacement of the brain stem, even though it is trained on an extremely small training set containing only explicit cues like mass effect, under weak sentence-level supervision.
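The sentence-level shading in Table 6 corresponds to aggregating signed token-level attributions over each sentence's token span. A minimal sketch of that aggregation (the function and the toy numbers are illustrative, not the paper's actual values):

```python
def sentence_scores(token_attrs, sent_spans):
    """Aggregate signed token-level attribution scores into one score
    per sentence by summing over each sentence's token span."""
    return [sum(token_attrs[start:end]) for start, end in sent_spans]

# Toy example: six token attributions over two sentences.
attrs = [4, 1, -1, 3, 2, 0]
spans = [(0, 3), (3, 6)]
print(sentence_scores(attrs, spans))  # [4, 5]
```

In practice the token-level scores would come from an attribution method such as DeepLIFT, and sentences would then be ranked by these aggregated scores.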

5.2 Results on Adapted DocRED

Comparison to baselines

We see that the Ent baseline is quite strong at DocRED evidence extraction. However, our best method still exceeds it on both label accuracy and evidence extraction while extracting more succinct explanations. The ability to extract a variable-length explanation is key: First2, First3, and BestPair all perform poorly. Notably, each of these methods exhibits a drop in accuracy in the reduced-doc setting compared to the full-doc setting, showing that the explanations they extract are not faithful.

Learning-based models with appropriate regularization perform relatively better in this larger-data setting

From Tables 3 and 4, we observe that the various regularization techniques applied to Sufficient models maintain or improve overall model performance on both key-feature and relation classification. Our Sufficient methods do not compromise on accuracy but make predictions based on plausible evidence sets, an effect that is more evident when richer training data is available. We perform an error analysis in Appendix D to probe further into our model's behavior.

Faithfulness of techniques

One may be concerned that, like attention values (Jain and Wallace, 2019), our feature attribution methods may not faithfully reflect the computation of the model. We emphasize again that the Sufficient paradigm on top of the DeepLIFT method is faithful by some definition. For a model f, we measure faithfulness by checking the agreement between f(D) and f(E), where E is the evidence extracted from document D that we feed into the same model under the reduced-document setting. This is shown for all methods in the “Reduced doc” columns of Tables 3 and 4. We see a drop in performance for techniques such as BestPair: the full model does not make the same judgment on these evidence subsets, whereas by definition it does in the Sufficient setting.
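This agreement check can be sketched as follows; the helper, the toy keyword model, and the example documents are all illustrative stand-ins for the real classifier and data:

```python
def faithfulness_agreement(predict, documents, evidences):
    """Fraction of examples where the model's prediction on the full
    document matches its prediction on the extracted evidence alone
    (the reduced-document setting)."""
    agree = sum(1 for doc, ev in zip(documents, evidences)
                if predict(doc) == predict(ev))
    return agree / len(documents)

# Toy model: predicts 1 iff the word "herniation" appears anywhere.
predict = lambda sents: int(any("herniation" in s for s in sents))
docs = [["no herniation", "normal study"], ["mass effect", "herniation seen"]]
evs  = [["no herniation"], ["mass effect"]]
print(faithfulness_agreement(predict, docs, evs))  # 0.5
```

Here the second example's evidence drops the sentence that actually drove the prediction, so the reduced-document prediction flips and agreement falls to 0.5; a faithful extraction method should keep this number high.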

As further evidence of faithfulness, we note that the Sufficient method extracts only a relatively small number of evidence sentences, in line with human annotations. These small subsets, indicated by feature attribution methods, are sufficient to reproduce the original model predictions with high confidence. We believe this constitutes strong evidence that these explanations are faithful.

6 Conclusion

In this work, we develop techniques that employ a small amount of token-annotated data to improve the reliability of document-level IE systems in two domains. We systematically evaluate our model in terms of faithfulness and plausibility and show that we can substantially improve models' ability to focus on supporting evidence while maintaining their prediction performance, leading to models that are “right for the right reasons” and avoid learning spurious patterns.


Acknowledgments

This work was partially supported by NSF Grant IIS-1814522 and a Texas Health Catalyst grant. Thanks to Scott Rudkin, Gregory Mittl, Raghav Mattay, and Chuan Liang for assistance with the annotation.


References

  • C. Alt, A. Gabryszak, and L. Hennig (2020). Probing linguistic features of sentence-level representations in neural relation extraction. In Proceedings of ACL.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
  • H. Cho, W. Choi, and H. Lee (2017). A method for named entity normalization in biomedical articles: application to diseases and plants. BMC Bioinformatics 18(1).
  • F. Christopoulou, M. Miwa, and S. Ananiadou (2019). Connecting the dots: document-level neural relation extraction with edge-oriented graphs. In Proceedings of EMNLP-IJCNLP.
  • T. Cook, J. C. Gee, R. N. Bryan, J. T. Duda, P. Chen, E. Botzolakis, S. Mohan, A. Rauschecker, J. Rudie, and I. Nasrallah (2018). Bayesian network interface for assisting radiology interpretation and education. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications.
  • D. Demner-Fushman, W. W. Chapman, and C. J. McDonald (2009). What can natural language processing do for clinical decision support? Journal of Biomedical Informatics 42(5), pp. 760–772.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186.
  • J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace (2020). ERASER: a benchmark to evaluate rationalized NLP models. In Proceedings of ACL.
  • F. Doshi-Velez and B. Kim (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint.
  • D. Dua, S. Singh, and M. Gardner (2020). Benefits of intermediate annotations in reading comprehension. In Proceedings of ACL, pp. 5627–5634.
  • M. T. Duong, J. D. Rudie, J. Wang, L. Xie, S. Mohan, J. C. Gee, and A. M. Rauschecker (2019). Convolutional neural network for automated FLAIR lesion segmentation on clinical brain MR imaging. American Journal of Neuroradiology 40(8), pp. 1282–1290.
  • S. Feng, E. Wallace, and J. Boyd-Graber (2019). Misleading failures of partial-input baselines. In Proceedings of ACL.
  • S. Feng, E. Wallace, A. Grissom II, M. Iyyer, P. Rodriguez, and J. Boyd-Graber (2018). Pathologies of neural models make interpretations difficult. In Proceedings of EMNLP.
  • G. Nan, Z. Guo, I. Sekulić, and W. Lu (2020). Reasoning with latent structure refinement for document-level relation extraction. In Proceedings of ACL.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020). Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of ACL.
  • X. Han, T. Gao, Y. Lin, H. Peng, Y. Yang, C. Xiao, Z. Liu, P. Li, J. Zhou, and M. Sun (2020). More data, more relations, more context and more openness: a review and outlook for relation extraction. In Proceedings of AACL-IJCNLP, pp. 745–758.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. Mong, S. Halabi, J. Sandberg, R. Jones, D. Larson, C. Langlotz, B. Patel, M. Lungren, and A. Ng (2019). CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of AAAI, pp. 590–597.
  • A. Jacovi and Y. Goldberg (2020). Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? In Proceedings of ACL.
  • S. Jain and B. C. Wallace (2019). Attention is not explanation. In Proceedings of NAACL-HLT, pp. 3543–3556.
  • S. Jain, S. Wiegreffe, Y. Pinter, and B. C. Wallace (2020). Learning to faithfully rationalize by construction. In Proceedings of ACL.
  • J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pp. 282–289.
  • T. Lei, R. Barzilay, and T. Jaakkola (2016). Rationalizing neural predictions. In Proceedings of EMNLP.
  • J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, and Z. Lu (2016a). BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016, baw068.
  • J. Li, W. Monroe, and D. Jurafsky (2016b). Understanding neural networks through representation erasure. arXiv preprint abs/1612.08220.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint.
  • I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In Proceedings of ICLR.
  • M. B. A. McDermott, T. H. Hsu, W. Weng, M. Ghassemi, and P. Szolovits (2020). CheXpert++: approximating the CheXpert labeler for speed, differentiability, and probabilistic output. arXiv preprint abs/2006.15229.
  • T. Miller (2019). Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009). Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP, pp. 1003–1011.
  • B. Paranjape, M. Joshi, J. Thickstun, H. Hajishirzi, and L. Zettlemoyer (2020). An information bottleneck approach for controlling conciseness in rationale extraction. In Proceedings of EMNLP.
  • D. Patel, S. Konam, and S. P. Selvaraj (2020). Weakly supervised medication regimen extraction from medical conversations. In ClinicalNLP@EMNLP.
  • N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. Yih (2017). Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5, pp. 101–115.
  • E. Pons, L. M. M. Braun, M. G. M. Hunink, and J. A. Kors (2016). Natural language processing in radiology: a systematic review. Radiology 279(2), pp. 329–343.
  • D. Pruthi, B. Dhingra, G. Neubig, and Z. C. Lipton (2020a). Weakly- and semi-supervised evidence extraction. In Findings of EMNLP.
  • D. Pruthi, B. Dhingra, L. B. Soares, M. Collins, Z. C. Lipton, G. Neubig, and W. W. Cohen (2020b). Evaluating explanations: how much do explanations from the teacher aid students? arXiv preprint abs/2012.00893.
  • P. Qin, W. Xu, and W. Y. Wang (2018). Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of ACL.
  • A. M. Rauschecker, J. D. Rudie, L. Xie, J. Wang, M. T. Duong, E. J. Botzolakis, A. M. Kovalovich, J. Egan, T. C. Cook, R. N. Bryan, I. M. Nasrallah, S. Mohan, and J. C. Gee (2020). Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 295(3), pp. 626–637.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016). “Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of KDD, pp. 1135–1144.
  • J. D. Rudie, J. Duda, M. T. Duong, P. Chen, L. Xie, R. Kurtz, J. B. Ware, J. Choi, R. R. Mattay, E. J. Botzolakis, J. C. Gee, R. N. Bryan, T. S. Cook, S. Mohan, I. M. Nasrallah, and A. M. Rauschecker (2021). Brain MRI deep learning and Bayesian inference system augments radiology resident performance. Journal of Digital Imaging 34(4), pp. 1049–1058.
  • J. Rudie, L. Xie, J. Wang, J. Duda, J. Choi, R. Mattay, P. Chen, R. N. Bryan, E. Botzolakis, I. Nasrallah, T. Cook, S. Mohan, J. Gee, and A. Rauschecker (2019). Artificial intelligence system for automated brain MR diagnosis performs at level of academic neuroradiologists and augments resident performance. In Proceedings of SIIM.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017). Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of ICCV.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017). Learning important features through propagating activation differences. In Proceedings of ICML, pp. 3145–3153.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2019). Learning important features through propagating activation differences. arXiv preprint abs/1704.02685.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint abs/1312.6034.
  • A. Smit, S. Jain, P. Rajpurkar, A. Pareek, A. Ng, and M. Lungren (2020). CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of EMNLP.
  • J. Strout, Y. Zhang, and R. Mooney (2019). Do human rationales improve machine explanations? In Proceedings of the ACL Workshop BlackboxNLP, pp. 56–62.
  • M. Sundararajan, A. Taly, and Q. Yan (2017). Axiomatic attribution for deep networks. arXiv preprint abs/1703.01365.
  • M. Surdeanu, J. Tibshirani, R. Nallapati, and C. D. Manning (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP-CoNLL, pp. 455–465.
  • X. Wang, X. Han, Y. Lin, Z. Liu, and M. Sun (2018). Adversarial multi-lingual neural relation extraction. In Proceedings of COLING, pp. 1156–1166.
  • C. Xiao, Y. Yao, R. Xie, X. Han, Z. Liu, M. Sun, F. Lin, and L. Lin (2020). Denoising relation extraction from document-level distant supervision. In Proceedings of EMNLP.
  • Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, and M. Sun (2019). DocRED: a large-scale document-level relation extraction dataset. In Proceedings of ACL.
  • Y. Zhang, I. Marshall, and B. C. Wallace (2016). Rationale-augmented convolutional neural networks for text classification. In Proceedings of EMNLP.

Appendix A Reproducibility

A.1 Implementation Details

We train all RoBERTa models for 15 epochs with early stopping on a TITAN Xp GPU. We use AdamW (Loshchilov and Hutter, 2019) as our optimizer and initialize the model with roberta-base for DocRED and biomed-roberta-base (Gururangan et al., 2020) for brain MRI data, both with approximately 125M parameters. The batch size is set to 16, and the learning rate is 1e-5 with a linear warmup schedule.

The maximum number of tokens in each document is capped at 296 for modified DocRED and 360 for radiology reports; these caps are chosen so that around 95% of documents fall within the limit. We do not perform extensive hyperparameter tuning in this work. The hidden state of the [CLS] token from the final layer is fed into a linear projection head to make predictions.
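A cap covering roughly 95% of documents can be computed as a nearest-rank percentile over training-set token counts; a small sketch (the helper name and the toy lengths are ours, not the paper's):

```python
def token_cap(doc_lengths, coverage_pct=95):
    """Pick a maximum-token cap so that `coverage_pct` percent of
    documents fit without truncation (nearest-rank percentile,
    computed with exact integer arithmetic)."""
    ordered = sorted(doc_lengths)
    rank = -(-len(ordered) * coverage_pct // 100)  # ceiling division
    return ordered[rank - 1]

# Toy corpus of 20 documents with lengths 10, 20, ..., 200 tokens:
lengths = [10 * i for i in range(1, 21)]
print(token_cap(lengths))  # 190
```

With real data, documents longer than the returned cap would simply be truncated at that many tokens before being fed to the model.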

Appendix B Dataset statistics

We provide the statistics for both the adapted DocRED and brain MRI reports datasets in Table 7. Both datasets are in English, and the DocRED dataset is publicly available.

Appendix C Annotation Instructions

The annotation instructions are provided in Figure 3. These were developed jointly with the annotators. In particular, decisions to exclude normal brain activity and confounders such as SVID were made to increase interannotator agreement after an initial round of annotation, making it easier for the labeling to focus on a single core disease or diagnosis per report.

Appendix D Error Analysis

Dataset        | Setting | # doc. | # inst. | # word/inst. | # sent./inst. | # relation | NA%
Adapted DocRED | train   | 3053   | 38180   | 203          | 8.1           | 96+1       | 33
Adapted DocRED | val     | 1000   | 12323   | 203          | 8.1           | 96+1       | 33
Brain MRI      | train   | 327    | 327     | 177          | 11.6          | –          | –
Brain MRI      | val     | 86     | 86      | 132          | 10.1          | –          | –
Table 7: Statistics of the two document-level IE datasets. Each document may have multiple entity pairs of interest, giving rise to multiple instances in the adapted DocRED setting. For adapted DocRED, we have 96 relations from the data plus an NA relation that we introduce for 1/3 of the data.
Figure 3: Annotation instructions.
Type Example
Predicts correctly and extracts right evidence [0] Delphine “Delphi” Greenlaw is a fictional character on the New Zealand soap opera Shortland Street, who was portrayed by Anna Hutchison between 2002 and 2004. …
Predicted relation: country of origin Relation: country of origin
Extracted Evidence: [0] Annotated Evidence: [0]
Predicts debatably correct answer, extracts reasonable evidence [0] Anton Erhard Martinelli (1684 – September 15 , 1747) was an Austrian architect and master - builder of Italian descent. [1] Martinelli was born in Vienna. … [3] Anton Erhard Martinelli supervised the construction of several important buildings in Vienna, such as … [4] He designed … [6] He died in Vienna in 1747.
Predicted relation: place of birth Relation: place of death
Extracted Evidence: [1] Annotated Evidence: [0, 6]
Predicts incorrectly on examples requiring a high amount of reasoning [0] Kurt Tucholsky (9 January 1890 – 21 December 1935) was a German - Jewish journalist, satirist, and writer. [1] He also wrote under the pseudonyms Kaspar Hauser (after the historical figure), Peter Panter, Theobald Tiger and Ignaz Wrobel. …
Predicted relation: NA Relation: date of death
Extracted Evidence: [0] Annotated Evidence: [0]
Selecting more sentences than are needed [0] Henri de Boulainvilliers … was a French nobleman, writer and historian. … [2] Primarily remembered as an early modern historian of the French State, Boulainvilliers also published an early French translation of Spinoza’s Ethics and … [3] The Comte de Boulainvilliers traced his lineage to … [5] Much of Boulainvilliers’ historical work …
Predicted relation: country of citizenship Relation: country of citizenship
Extracted Evidence: [2, 0, 1, 5, 4, 3] Annotated Evidence: [0, 2]
Table 8: Four types of representative examples illustrating models’ behavior. In our adapted DocRED task, models are asked to predict relations between head and tail entities. We use the model with the best evidence extraction performance for illustration. Sentences in extracted evidence are ranked by DeepLIFT score.

The first example in Table 8 shows a representative case where our model predicts the correct relation and extracts reasonable supporting evidence. Unsurprisingly, this happens most often in simple cases where reasoning over the interaction of sentences is not required.

We observe a few common types of errors. The first is potential alternatives for relations or evidence extraction: in a substantial fraction of our randomly selected error cases, our model either predicts debatably correct relations or picks up sentences that are related but not perfectly aligned with the human annotations. The second row in Table 8 illustrates an example where the two entities exhibit multiple relationships; the model’s prediction is correct (Vienna is the place where Martinelli was both born and died) but differs from the annotated ground-truth relation and supporting evidence. Such relations are relatively frequent in this dataset; a more complex multi-label prediction format would be necessary to fully support them.

Another type of error involves complex logical reasoning. Even when our model extracts the right evidence, it still fails on a portion of the random error cases that require high-level reasoning capability. For example, to correctly predict the relation between Theobald Tiger and 21 December 1935 in the third example in Table 8, a model needs to recognize that Theobald Tiger and Kurt Tucholsky are in fact the same entity by way of the pseudonym relation, which is challenging to recognize.

Finally, the model sometimes selects more sentences than are truly needed. Interestingly, this is an error in terms of evidence plausibility but not in terms of prediction. The number of extracted sentences is very high in a portion of the random error cases; the last row of Table 8 is a representative example of this kind of error. Although the model has likely already extracted the right evidence in its first two steps, it continues selecting unnecessary sentences because its prediction confidence is not yet high enough, a drawback of our way of selecting evidence discussed in Section 4.2. Moreover, our model extracts one more sentence on average when predicting incorrect relations, suggesting that in these cases it does not cleanly focus on the correct information.
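The stopping rule behind this over-selection behavior — adding sentences in attribution order until the reduced document yields a confident enough prediction — can be sketched as follows (the function names and the toy sufficiency check are illustrative, not the paper's actual implementation):

```python
def select_evidence(sent_ids_by_attr, is_sufficient):
    """Greedily add sentence ids in descending attribution order until
    the reduced document reproduces the full-document prediction with
    high enough confidence (is_sufficient returns True). If confidence
    never clears the threshold, every sentence ends up selected, which
    is exactly the over-selection failure mode."""
    evidence = []
    for sid in sent_ids_by_attr:
        evidence.append(sid)
        if is_sufficient(evidence):
            break
    return evidence

# Toy check: sentences ranked [2, 0, 5, 1] by attribution; the model is
# "confident" once sentences 0 and 2 are both included.
suff = lambda ev: {0, 2} <= set(ev)
print(select_evidence([2, 0, 5, 1], suff))  # [2, 0]
```

If `is_sufficient` is too strict (confidence never clears the threshold), the loop runs to the end of the ranking and selects every sentence, mirroring the plausibility errors in the last row of Table 8.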