CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

April 20, 2020. Akshay Smit et al., Stanford University.

The extraction of labels from radiology text reports enables large-scale training of medical imaging models. Existing approaches to report labeling typically rely either on sophisticated feature engineering based on medical domain knowledge or manual annotations by experts. In this work, we investigate BERT-based approaches to medical image report labeling that exploit both the scale of available rule-based systems and the quality of expert annotations. We demonstrate superior performance of a BERT model first trained on annotations of a rule-based labeler and then finetuned on a small set of expert annotations augmented with automated backtranslation. We find that our final model, CheXbert, is able to outperform the previous best rules-based labeler with statistical significance, setting a new SOTA for report labeling on one of the largest datasets of chest x-rays.

1 Introduction

The extraction of labels from radiology text reports enables important clinical applications, including large-scale training of medical imaging models Wang et al. (2017). Many natural language processing systems have been designed to label reports using sophisticated feature engineering of medical domain knowledge Pons et al. (2016). On chest x-rays, the most common radiological exam, rule-based methods have been engineered to label some of the largest available datasets Johnson et al. (2019). While these methods have generated considerable advances, they have been unable to capture the full diversity of complexity, ambiguity and subtlety of natural language in the context of radiology reporting.

Figure 1: The method for training CheXbert combines existing medical report labelers with hand-annotations for accurate radiology report labeling.

More recently, Transformers have demonstrated success in end-to-end radiology report labeling Drozdov et al. (2020); Wood et al. (2020). However, these methods have shifted the burden from feature engineering to hand-annotation, which requires considerable expert time and expertise for high quality labels. Moreover, they do not take advantage of existing feature-engineered labelers, which represent state-of-the-art on many medical tasks.

In this work, we introduce a simple method for combining the benefits of existing radiology report labelers with hand-annotations to achieve highly accurate automated radiology report labeling. In this approach, a pretrained BERT model Devlin et al. (2019) is first trained on the outputs of an existing labeler and then finetuned on a small corpus of expert annotations augmented with automated backtranslation. We apply this approach to the task of radiology report labeling of chest x-rays, and call our resulting model CheXbert.

CheXbert outperforms the previous best reported labeler on an external dataset, MIMIC-CXR Johnson et al. (2019), with an improvement of 0.052 (95% CI 0.037, 0.067) on the F1 metric, and only 0.010 F1 away from a radiologist performance benchmark. We expect this method of training medical report labelers is broadly useful within the medical domain, where collection of expert labels is expensive, and feature engineered labelers already exist for many medical retrieval tasks.

2 Related Work

Many natural language processing systems have been developed to extract structured labels from free-text radiology reports Pons et al. (2016); Yadav et al. (2016); Hassanpour et al. (2017); Annarumma et al. (2019); Savova et al. (2010); Wang et al. (2018); Chen et al. (2018); Bozkurt et al. (2019). In many cases, these methods have relied on heavy feature engineering that includes controlled vocabularies and grammatical rules to find and classify properties of radiological findings. NegEx Chapman et al. (2001), a popular component of rule-based methods, uses simple regular expressions for detecting negation of findings and is often used in combination with ontologies such as the Unified Medical Language System (UMLS) Bodenreider (2004). NegBio Peng et al. (2017), an extension of NegEx, utilizes universal dependencies for pattern definition and subgraph matching for graph traversal search, includes uncertainty detection in addition to negation detection for multiple pathologies in chest x-ray reports, and is used to generate labels for the ChestX-Ray14 dataset Wang et al. (2017).

The CheXpert labeler Irvin et al. (2019) improves upon NegBio on chest x-ray report classification through more controlled extraction of mentions and an improved NLP pipeline and rule set for uncertainty and negation extraction. The CheXpert labeler has been applied to generate labels for the CheXpert dataset and MIMIC-CXR Johnson et al. (2019), which are amongst the largest chest x-ray datasets publicly available. Deep learning approaches have also been trained using expert-annotated sets of radiology reports Xue et al. (2019). In these cases, training set size, which often drives the performance of deep learning approaches, is limited by radiologist time and expertise. Chen et al. (2017) trained CNNs with GloVe embeddings Pennington et al. (2014) on 1000 radiologist-labeled reports for classification of pulmonary embolism in chest CT reports and improved upon the previous rule-based SOTA, peFinder Chapman et al. (2011). Bustos et al. (2019) train both recurrent and convolutional networks in combination with attention mechanisms on 27,593 physician-labeled radiology reports and apply their labeler to generate labels for the rest of their dataset.

More recently, Transformer-based models have also been applied to the task of radiology report labeling. Drozdov et al. (2020) train classifiers using BERT Devlin et al. (2019) and XLNet Yang et al. (2020) on 3,856 radiologist-labeled reports to detect normal and abnormal labels. Wood et al. (2020) develop ALARM, a head MRI report classifier built on BioBERT Lee et al. (2019) models trained on 1,500 radiologist-labeled reports, and demonstrate improvement over simpler fixed-embedding and word2vec-based Mikolov et al. (2013) models Zech et al. (2018).

Our work is closely related to approaches that reduce the number of expert annotations required for training medical report labelers Callahan et al. (2019); Ratner et al. (2020); Banerjee et al. (2018). A method of weak supervision known as data programming Ratner et al. (2018) has seen successful application to medical report labeling: in this method, users write heuristic labeling functions that programmatically label training data.

Saab et al. (2019) use data programming to incorporate labeling functions consisting of regular expressions that look for phrases in radiology reports, developed with the help of a clinical expert in a limited time window, to label for intracranial hemorrhage in head CTs. Dunnmon et al. (2019) demonstrate that in under 8 hours of cumulative clinician time, a data programming method can approach the efficacy of large hand-labeled training sets annotated over months or years for training medical imaging models, including chest x-ray classifiers on the task of normal / abnormal detection. Beyond data programming approaches, Drozdov et al. (2020) develop a fully unsupervised approach utilizing a Siamese Neural Network and Gaussian Mixture Models, reporting performance similar to the CheXpert labeler on the simplified task of assigning normal and abnormal labels, without requiring any radiologist-labeled reports.

Figure 2: Labeler architecture

3 Methods

3.1 Task

The report labeling task is to extract the presence of one or more clinically important observations (e.g. consolidation, edema) from a free-text radiology report. More formally, a labeler takes sentences from a radiology report as input and, for each of 13 observations, outputs one of the following classes: blank, positive, negative, or uncertain. For the 14th observation, No Finding, the labeler outputs only one of two classes: blank or positive.
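As an illustrative reference, the sketch below encodes this output space; the 14 observation names are those listed in Table 2, while the integer encoding itself is an assumption made for illustration.

```python
# Output space of the report labeling task (sketch).
# Observation names follow Table 2; the integer class encoding is illustrative.

CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices",
    "No Finding",
]

# The first 13 observations take one of four classes; No Finding is blank/positive only.
CLASSES = {"blank": 0, "positive": 1, "negative": 2, "uncertain": 3}
NO_FINDING_CLASSES = {"blank": 0, "positive": 1}
```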

3.2 Data

Two large datasets of chest x-rays, CheXpert Irvin et al. (2019) (consisting of 224,316 images) and MIMIC-CXR Johnson et al. (2019) (consisting of 377,110 images), are used in this study. Both datasets have corresponding radiology reports that have been labeled for the same set of 14 observations using the CheXpert labeler Irvin et al. (2019), applied to the Impression section or to other parts of the radiology report.

A subset of each of these datasets has been annotated by expert radiologists. On CheXpert, a total of 1000 reports (the CheXpert manual set) were reviewed by 2 board-certified radiologists, with disagreements resolved by consensus. On MIMIC-CXR, a total of 687 reports (the MIMIC-CXR test set) were reviewed by 2 board-certified radiologists and manually labeled for the same 14 medical observations as in CheXpert. In this study, CheXpert is used for the development of models, and the MIMIC-CXR test set is used for evaluation.

3.3 Model Architecture

All models use a modification of the BERT-base architecture Devlin et al. (2019) with 14 linear heads (as shown in Figure 2): 12 heads correspond to various medical abnormalities, 1 to medical support devices, and 1 to No Finding. Each radiology report is tokenized, and the number of tokens in each input sequence is capped at 512. The final layer's hidden state corresponding to the CLS token is then fed as input to each of the linear heads.
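A minimal sketch of this architecture, assuming a PyTorch implementation with the HuggingFace transformers BertModel (a tooling assumption; the paper does not specify its implementation), might look as follows:

```python
import torch.nn as nn
from transformers import BertModel

class ReportLabeler(nn.Module):
    """BERT-base encoder with 14 linear heads (sketch of Figure 2):
    13 observations with 4 classes each and No Finding with 2 classes."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size  # 768 for BERT-base
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 4) for _ in range(13)]   # blank/positive/negative/uncertain
            + [nn.Linear(hidden, 2)]                    # No Finding: blank/positive
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # hidden state of the [CLS] token
        return [head(cls) for head in self.heads]       # one logit vector per observation
```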

3.4 Training Details

For all our models, unless otherwise specified, we finetune all layers of the BERT model, including the embeddings, and feed the CLS token into the 14 linear heads to generate class scores for each medical observation. All models are trained using cross-entropy loss and Adam optimization, with the learning rate chosen in accordance with Devlin et al. (2019) for fine-tuning tasks. The cross-entropy losses for each of the 14 observations are added to produce the final loss. During training, we periodically evaluate our model on the dev set and save the checkpoint with the highest performance averaged over all 14 observations. All models are trained using 3 TITAN-XP GPUs with a batch size of 18, as larger batches do not fit in GPU memory.
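Continuing the model sketch above, one training step could be implemented as below; the learning rate of 2e-5 is an assumed value within the range Devlin et al. (2019) recommend for fine-tuning, not one confirmed by the text, and the batch dictionary keys are likewise illustrative.

```python
import torch
import torch.nn.functional as F

model = ReportLabeler()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # assumed value, see lead-in

def training_step(batch):
    """One optimization step: the per-observation cross-entropy losses
    are summed to produce the final loss, as described above."""
    logits_per_obs = model(batch["input_ids"], batch["attention_mask"])
    loss = sum(
        F.cross_entropy(logits, batch["labels"][:, i])  # labels: (batch_size, 14) class indices
        for i, logits in enumerate(logits_per_obs)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```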

3.5 Evaluation

Models are evaluated on their average performance on three retrieval tasks: positive extraction, negative extraction, and uncertainty extraction. For each of the tasks, the class of interest (e.g. negative for the negative extraction and uncertain for the uncertainty extraction) is treated as the positive class, and the other classes are considered negative. For each of the 14 radiological observations, we compute a weighted average of the F1 scores on each of the above three tasks, weighted by the support for each class, which we call the weighted-F1 metric, henceforth simply abbreviated to F1.

We report the simple average of the F1 across all of the radiological observations. We include the 95% two-sided confidence intervals of the F1 using the nonparametric percentile bootstrap method Efron and Tibshirani (1986) with 1000 bootstrap replicates.
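A sketch of this evaluation, assuming per-report label arrays and using scikit-learn and NumPy (tooling assumptions), is given below:

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_f1(y_true, y_pred):
    """Weighted-F1 for one observation: the F1 of the positive, negative and
    uncertain classes (each treated in turn as the positive class), averaged
    with weights equal to each class's support in the ground truth."""
    scores, weights = [], []
    for cls in ("positive", "negative", "uncertain"):
        support = np.sum(y_true == cls)
        if support == 0:
            continue                                   # class absent: no contribution
        scores.append(f1_score(y_true == cls, y_pred == cls))
        weights.append(support)
    return np.average(scores, weights=weights)

def bootstrap_ci(y_true, y_pred, n_boot=1000, seed=0):
    """95% two-sided percentile-bootstrap CI of the weighted-F1 over reports."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    replicates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                    # resample reports with replacement
        replicates.append(weighted_f1(y_true[idx], y_pred[idx]))
    return np.percentile(replicates, [2.5, 97.5])
```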

Model F1 (95% CI)
Manual T-rad-cls 0.286 (0.265, 0.305)
T-rad-tokens 0.396 (0.374, 0.416)
T-rad-biobert 0.616 (0.587, 0.639)
T-rad-clinicalbert 0.677 (0.651, 0.699)
T-rad 0.705 (0.680, 0.725)
T-rad-bt 0.729 (0.702, 0.749)
Current SOTA CheXpert 0.743 (0.719, 0.764)
Auto T-auto 0.755 (0.731, 0.774)
Hybrid T-hybrid 0.775 (0.753, 0.795)
CheXbert (T-hybrid-bt) 0.795 (0.772, 0.815)
Benchmark Radiologist 0.805 (0.784, 0.823)
Table 1: Performance of models in ascending order of average F1 score with 95% confidence intervals included, with comparisons to CheXpert and radiologist benchmark.
Category CheXbert Improvement over CheXpert
Pleural Other 0.652 (0.477, 0.792) 0.174 (0.084, 0.287)
Pneumonia 0.797 (0.748, 0.845) 0.113 (0.061, 0.162)
Fracture 0.782 (0.645, 0.892) 0.111 (0.017, 0.215)
Consolidation 0.855 (0.782, 0.916) 0.083 (0.000, 0.163)
No Finding 0.608 (0.448, 0.721) 0.065 (-0.026, 0.143)
Pneumothorax 0.937 (0.903, 0.967) 0.055 (0.020, 0.090)
Enlarged Cardiom. 0.666 (0.579, 0.739) 0.053 (-0.016, 0.123)
Cardiomegaly 0.810 (0.760, 0.856) 0.046 (0.015, 0.082)
Edema 0.887 (0.849, 0.920) 0.023 (-0.002, 0.050)
Lung Lesion 0.689 (0.573, 0.798) 0.006 (-0.037, 0.044)
Support Devices 0.873 (0.837, 0.907) 0.006 (-0.011, 0.022)
Atelectasis 0.919 (0.887, 0.949) 0.002 (-0.031, 0.038)
Pleural Effusion 0.904 (0.875, 0.931) -0.001 (-0.019, 0.014)
Lung Opacity 0.751 (0.698, 0.800) -0.011 (-0.048, 0.025)
Average 0.795 (0.772, 0.815) 0.052 (0.037, 0.067)
Table 2: The F1 scores for CheXbert as well as improvements over the CheXpert model on the MIMIC-CXR test set, reported with 95% confidence intervals.

4 Experiments

4.1 Supervision Strategies

We investigate models trained using three strategies: trained only on radiologist-labeled reports, trained only on labels generated automatically by the CheXpert labeler Irvin et al. (2019), and trained on a combination of the two.

Radiologist Labels

T-rad is obtained by training the model on the CheXpert manual set, finetuning all layers. As baselines, we also train models that freeze all weights in the BERT layers and only update the weights in the linear heads: T-rad-cls is identical to T-rad in architecture, while T-rad-tokens averages the non-padding output tokens as the input into the linear heads rather than using the CLS token output. All models are trained using a random 75%-25% train-dev split on this set, and are trained until convergence.
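A sketch of these two baseline variants, continuing the model sketch from Section 3.3 and again assuming a PyTorch implementation, could be:

```python
import torch

def freeze_encoder(model):
    """For the T-rad-cls and T-rad-tokens baselines, the BERT encoder is
    frozen and only the linear heads remain trainable."""
    for param in model.bert.parameters():
        param.requires_grad = False

def mean_pool(last_hidden_state, attention_mask):
    """T-rad-tokens-style pooling: average the non-padding output tokens
    instead of using the [CLS] hidden state."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)            # avoid division by zero
    return summed / counts
```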

Automatic Labels

T-auto is obtained using labels generated by the rule-based CheXpert labeler, described in Irvin et al. (2019). T-auto is trained using a random 85%-15% train-dev split of the CheXpert training set, different from the models trained on radiologist labels. T-auto is trained for 6 epochs, longer than the usual 2-4 epochs for BERT fine-tuning tasks, as slightly higher performance is observed with the additional training.

Hybrid Labels

T-hybrid is obtained by initializing a model with the weights of T-auto, and then finetuning it on radiologist-labeled reports, as for T-rad.
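In implementation terms, this amounts to loading the T-auto weights before finetuning; a minimal sketch, with a hypothetical checkpoint filename, follows:

```python
import torch

# Hybrid supervision sketch: initialize from the weights of T-auto, then
# finetune on the radiologist-labeled reports ("t_auto_checkpoint.pt" is hypothetical).
model = ReportLabeler()
model.load_state_dict(torch.load("t_auto_checkpoint.pt"))
# ...then run the same finetuning procedure used for T-rad on the CheXpert manual set.
```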

Results

As shown in Table 1, T-rad achieves an F1 of 0.705, significantly higher than the performance of the baselines, with T-rad-cls at 0.286 and T-rad-tokens at 0.396. T-auto achieves a higher F1 of 0.755. Superior performance is obtained by T-hybrid, with an F1 of 0.775.

4.2 Biomedical Language Representations

We investigate the effect of having models pre-trained on biomedical data. For the following models, we use an identical training procedure to T-rad, but initialize the weights differently. T-rad-biobert is obtained by using BioBERT weight initializations Lee et al. (2019). BioBERT was obtained by further pretraining the BERT weights on a large biomedical corpus comprising PubMed abstracts (4.5 billion words) and PMC full-text articles (13.5 billion words). Similarly, T-rad-clinicalbert is obtained by using Clinical BioBERT weight initializations Alsentzer et al. (2019), which were obtained by further pretraining the BioBERT weights on 2 million clinical notes from the MIMIC-III v1.4 database.

Results

As shown in Table 1, T-rad-biobert achieves an F1 of 0.616 and T-rad-clinicalbert achieves an F1 of 0.677, both lower than the 0.705 achieved by T-rad, by margins of 0.089 and 0.028 respectively.

4.3 Data Augmentation using backtranslation

We investigate the use of backtranslation to improve the performance of the models. Backtranslation is designed to generate alternate formulations of sentences by translating them to another language and back. Although backtranslation has been successfully used to augment text data in a variety of NLP tasks (Yu et al., 2018; Poncelas et al., 2018), to the best of our knowledge, the technique is yet to be applied to a medical report extraction task. In this experiment, we augment the CheXpert manual set using Facebook-FAIR’s winning submission to the WMT’19 news translation task Ng et al. (2019) to generate backtranslations. Although this submission includes models that produce German/English and Russian/English translations, initial experiments with Russian did not demonstrate semantically correct translations, so we only report experiments with German. We use beam search with a beam size of 1 to select the single most likely translation. T-rad-bt is obtained by using an identical training procedure to T-rad on the augmented dataset that is twice the size of the original CheXpert manual set. Similarly, T-hybrid-bt is obtained using an identical training procedure to T-hybrid.
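As an illustration of the augmentation step, the sketch below backtranslates a sentence through German using the Facebook-FAIR WMT'19 models exposed through fairseq's torch.hub interface; the specific single-model variants and preprocessing shown here are assumptions rather than details confirmed by the text.

```python
import torch

# English -> German -> English backtranslation with beam size 1
# (model variants and tokenization choices are assumptions; see lead-in).
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")

def backtranslate(sentence: str) -> str:
    """Return an alternate formulation of the sentence via German."""
    german = en2de.translate(sentence, beam=1)   # single most likely translation
    return de2en.translate(german, beam=1)

# Example: backtranslate("redemonstration of multiple right-sided rib fractures")
```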

Results

As shown in Table 1, T-rad-bt achieves an F1 of 0.729, higher than the 0.705 of T-rad. Similarly, T-hybrid-bt achieves an F1 of 0.795, higher than the 0.775 of T-hybrid.

4.4 Comparison to previous SOTA and radiologist benchmark

We compare the performance of our best model to the previous best reported labeler, the CheXpert labeler Irvin et al. (2019), and to a radiologist benchmark. CheXpert is an automated rule-based labeler that extracts mentions of conditions like pneumonia by searching against a large manually curated list of words associated with the condition and then classifies mentions as uncertain, negative, or positive using rules on a universal dependency parse of the report. For the radiologist benchmark, we collect annotations on the MIMIC-CXR test set from an additional board-certified radiologist. In this study, T-hybrid-bt is the best model, which we call CheXbert. We report the improvement of CheXbert over the CheXpert labeler by computing the paired differences in F1 scores on 1000 bootstrap replicates and provide the mean difference along with a 95% two-sided confidence interval.
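A sketch of this paired comparison, reusing the weighted_f1 metric sketched in Section 3.5 (the array inputs and helper name are assumptions), is shown below:

```python
import numpy as np

def paired_bootstrap_improvement(y_true, pred_a, pred_b, metric, n_boot=1000, seed=0):
    """Paired differences in a metric (e.g. the weighted_f1 sketched earlier)
    on bootstrap replicates of the test reports: returns the mean difference
    and a 95% two-sided percentile confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                # same resampled reports for both labelers
        diffs.append(metric(y_true[idx], pred_a[idx]) - metric(y_true[idx], pred_b[idx]))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return np.mean(diffs), (low, high)
```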

Results

We observe that CheXbert has a statistically significant improvement over the current SOTA, CheXpert, which achieves a score of 0.743. Table 2 shows the F1 per class (along with 95% confidence intervals) for CheXbert and for the improvements over CheXpert. CheXbert records an increase on all but 2 medical conditions: the largest improvements are observed for Pleural Other [0.174 (0.084, 0.287)], Pneumonia [0.113 (0.061, 0.162)], Fracture [0.111 (0.017, 0.215)] and Consolidation [0.083 (0.000, 0.163)]. Further significant improvements are observed for Pneumothorax [0.055 (0.020, 0.090)] and Cardiomegaly [0.046 (0.015, 0.082)]. Overall, CheXbert achieves a statistically significant improvement on F1 of 0.052 (95% CI 0.037, 0.067). The board-certified radiologist achieves an F1 of 0.805, 0.010 F1 points higher than the performance of CheXbert.

Report Segment and Labels Reasoning
…two views of chest demonstrate cariomegaly with no focal consolidation… Cardiomegaly CheXpert: Blank ✗ T-auto: Positive ✓ T-auto, in contrast to CheXpert, recognizes conditions with misspellings in the report like “cariomegaly” in place of “cardiomegaly”.
consistent with acute and/or chronic pulmonary edema…. Edema CheXpert: Positive ✓ T-auto: Uncertain ✗ T-auto incorrectly detects uncertainty in the edema label, likely from the “and/or”; CheXpert correctly classifies this example as positive.
Cardiomediastinal silhouette stable Enlarged Cardiomediastinum CheXpert: Positive ✗ T-auto: Positive ✗ CheXbert: Uncertain ✓ T-auto and CheXpert both incorrectly label this example as positive for enlarged cardiomediastinum; CheXbert correctly classifies it as uncertain, likely recognizing that “stable” is associated with uncertainty of the condition.
Table 3: This table contains phrases from reports where CheXpert, T-auto, and CheXbert provide different labels. The correct label is indicated by a checkmark. The CheXpert versus T-auto comparisons are conducted on the CheXpert manual set. The CheXbert versus T-auto and CheXpert comparisons are conducted on the MIMIC-CXR test set.

5 Analysis

5.1 T-auto versus CheXpert

We analyze whether T-auto, which is trained exclusively on labels from CheXpert (a rules-based labeler) can generalize beyond those rules.

We look at specific examples where T-auto correctly labels a condition while CheXpert mislabels that condition on the CheXpert manual set. In one example, T-auto is able to correctly detect uncertainty in the phrase “cannot be entirely excluded,” which CheXpert is not able to detect because the phrase does not match any rule in its ruleset. Similarly, in another example, for the phrase “no evidence of pneumothorax or bony fracture,” T-auto detects the negation of fracture indicated by “no evidence of”. On the other hand, CheXpert marks fracture as positive since the phrasing does not match any negation construct in its ruleset. T-auto, in contrast to CheXpert, also recognizes conditions with misspellings in the report like “cariomegaly” in place of “cardiomegaly” and “mediastnium” in place of “mediastinum”. Examples of T-auto correctly labeling conditions mislabeled by CheXpert are provided in Table A3 of the Appendix. Table A4 of the Appendix contains examples of CheXpert correctly labeling conditions mislabeled by T-auto.

5.2 CheXbert versus T-auto and CheXpert

We analyze how CheXbert improves upon T-auto and CheXpert, looking at several examples where CheXbert is more accurate than both. In one example, while T-auto incorrectly marks “mild cardiomegaly” as uncertain, CheXbert marks “mild cardiomegaly” as positive for cardiomegaly, which is consistent with the ground truth read. In another example with the phrase “Right hilum appears slightly more prominent,” CheXbert correctly classifies enlarged cardiomediastinum as positive, while T-auto and CheXpert fail to detect any mention of this condition. Furthermore, CheXbert correctly labels certain nuanced statements that both CheXpert and T-auto mislabel. On the example containing the sentence “Pulmonary edema on ____ has almost entirely cleared”, both T-auto and CheXpert label edema as negative, while CheXbert correctly labels edema as positive. In another example containing the phrase “edema has cleared from much of the lungs”, CheXbert correctly labels edema as positive, in contrast to T-auto and CheXpert. In an example containing the phrase “The left aspect of the heart border is unremarkable,” CheXbert correctly marks cardiomegaly as negative, unlike T-auto and CheXpert. Additionally, CheXbert is able to detect uncertainty indicated by phrases like “could well reflect” or statements that the “possibility” of a condition would be “impossible to exclude.” In an example containing the phrase “New bibasilar opacities, which given the clinical history are suspicious for aspiration,” CheXbert correctly marks lung opacity as positive, while CheXpert and T-auto incorrectly detect uncertainty (associating “suspicious” with “opacities”). Examples with full report impressions in which CheXbert predicts correctly while both CheXpert and T-auto make mistakes can be found in Table A5 of the Appendix.

5.3 Report Changes with Backtranslation

We analyze the phrasing and vocabulary changes that backtranslation introduces into the reports. Backtranslation frequently rephrases text. For instance, the sentence “redemonstration of multiple right-sided rib fractures” is backtranslated to “redemonstration of several rib fractures of the right side”. Backtranslation also introduces some error: the phrase “left costophrenic angle” becomes “left costophrine angle” (“costophrine” is not a word), and the phrase “left anterior chest wall pacer in place” is backtranslated to “pacemaker on the left front of the chest wall”, which omits the critical attribute of being in place. In many examples, the backtranslated text paraphrases medical vocabulary into semantic equivalents: “cutaneous” becomes “skin”, “left clavicle” becomes “left collarbone”, “osseous” becomes “bone” or “bony”, “anterior” becomes “front”, and “rib fracture” becomes “broken ribs”. More backtranslations with analyses are provided in Table A6 of the Appendix.

6 Limitations

Our study has several limitations. First, our hybrid and auto approaches rely on the availability of an existing labeler to generate labels. Second, our report labeler has a maximum input size of 512 tokens and would require further engineering to work on longer medical/radiology reports. Third, our task is limited to the 14 observations labeled for, and we do not test the model's ability to label rarer conditions. However, CheXbert can mark No Finding as blank, which can indicate the presence of another condition if the other 13 conditions are also blank. Fourth, the ground truth labels for the MIMIC-CXR test set were provided by a single board-certified radiologist. By using more radiologists (with a majority vote or decision by consensus), we could potentially demonstrate a less biased comparison to the radiologist benchmark. Fifth, while we do test performance on a dataset from a hospital unseen in training, additional datasets across institutions could be useful in further establishing the model's ability to generalize.

7 Conclusion

In this study, we propose a simple method for combining existing report labelers with hand-annotations for accurate radiology report labeling. In this method, a pretrained BERT model is first trained on the outputs of a labeler, and then further finetuned on the manual annotations, the set of which is augmented using backtranslation. We report five findings with the resulting model, which we call CheXbert. First, we find that CheXbert outperforms models only trained on radiologist-labeled reports, or on the labeler outputs. Second, we find that CheXbert outperforms models pretrained on biomedical data. Third, we find that CheXbert outperforms models which do not use backtranslation. Fourth, we find that CheXbert outperforms the previous best labeler, CheXpert (which was rules-based), with an improvement of 0.052 (95% CI 0.037, 0.067) on the F1 metric. Fifth, we find that CheXbert is 0.010 F1 points from the radiologist performance benchmark, suggesting that the gap to ceiling performance is narrow.

We expect this method of training medical report labelers is broadly useful within the medical domain, where collection of expert labels can produce a small set of high quality labels, and existing feature engineered labelers can produce labels at scale. Extracting highly accurate labels from medical reports by taking advantage of both sources can enable many important downstream tasks, including the development of more accurate and robust medical imaging models required for deployment.

Acknowledgments

We would like to acknowledge the Stanford Machine Learning Group (stanfordmlgroup.github.io) and the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI.stanford.edu) for infrastructure support. Thanks to Alistair Johnson for support in the radiologist benchmark, to Jeremy Irvin for support in the CheXpert labeler, and to Alex Tamkin for helpful comments.

References

  • E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78. External Links: Link, Document Cited by: §4.2.
  • M. Annarumma, S. J. Withey, R. J. Bakewell, E. Pesce, V. Goh, and G. Montana (2019) Automated Triaging of Adult Chest Radiographs with Deep Artificial Neural Networks. Radiology 291 (1), pp. 196–202. Note: Publisher: Radiological Society of North America External Links: ISSN 0033-8419, Link, Document Cited by: §2.
  • I. Banerjee, M. C. Chen, M. P. Lungren, and D. L. Rubin (2018) Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort. Journal of Biomedical Informatics 77, pp. 11–20 (eng). External Links: ISSN 1532-0480, Document Cited by: §2.
  • O. Bodenreider (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (Database issue), pp. D267–D270. External Links: ISSN 0305-1048, Link, Document Cited by: §2.
  • S. Bozkurt, E. Alkim, I. Banerjee, and D. L. Rubin (2019) Automated Detection of Measurements and Their Descriptors in Radiology Reports Using a Hybrid Natural Language Processing Algorithm. Journal of Digital Imaging 32 (4), pp. 544–553 (en). External Links: ISSN 1618-727X, Link, Document Cited by: §2.
  • A. Bustos, A. Pertusa, J. Salinas, and M. de la Iglesia-Vayá (2019) PadChest: A large chest x-ray image dataset with multi-label annotated reports. arXiv:1901.07441 [cs, eess]. Note: arXiv: 1901.07441 External Links: Link Cited by: §2.
  • A. Callahan, J. A. Fries, C. Ré, J. I. Huddleston, N. J. Giori, S. Delp, and N. H. Shah (2019) Medical device surveillance with electronic health records. npj Digital Medicine 2 (1), pp. 1–10 (en). Note: Number: 1 Publisher: Nature Publishing Group External Links: ISSN 2398-6352, Link, Document Cited by: §2.
  • B. E. Chapman, S. Lee, H. P. Kang, and W. W. Chapman (2011) Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. Journal of Biomedical Informatics 44 (5), pp. 728–737 (en). External Links: ISSN 1532-0464, Link, Document Cited by: §2.
  • W. W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan (2001) A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics 34 (5), pp. 301–310 (en). External Links: ISSN 1532-0464, Link, Document Cited by: §2.
  • M. C. Chen, R. L. Ball, L. Yang, N. Moradzadeh, B. E. Chapman, D. B. Larson, C. P. Langlotz, T. J. Amrhein, and M. P. Lungren (2017) Deep Learning to Classify Radiology Free-Text Reports. Radiology 286 (3), pp. 845–852. Note: Publisher: Radiological Society of North America External Links: ISSN 0033-8419, Link, Document Cited by: §2.
  • P. Chen, H. Zafar, M. Galperin-Aizenberg, and T. Cook (2018) Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports. Journal of Digital Imaging 31 (2), pp. 178–184 (en). External Links: ISSN 1618-727X, Link, Document Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. Note: arXiv: 1810.04805 External Links: Link Cited by: §1, §2, §3.3, §3.4.
  • I. Drozdov, D. Forbes, B. Szubert, M. Hall, C. Carlin, and D. J. Lowe (2020) Supervised and unsupervised language modelling in Chest X-Ray radiological reports. PLOS ONE 15 (3), pp. e0229963 (en). Note: Publisher: Public Library of Science External Links: ISSN 1932-6203, Link, Document Cited by: §1, §2, §2.
  • J. Dunnmon, A. Ratner, N. Khandwala, K. Saab, M. Markert, H. Sagreiya, R. Goldman, C. Lee-Messer, M. Lungren, D. Rubin, and C. Ré (2019) Cross-Modal Data Programming Enables Rapid Medical Machine Learning. arXiv:1903.11101 [cs, eess, stat]. Note: arXiv: 1903.11101 External Links: Link Cited by: §2.
  • B. Efron and R. Tibshirani (1986) Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1 (1), pp. 54–75 (EN). Note: Publisher: Institute of Mathematical Statistics External Links: ISSN 0883-4237, 2168-8745, Link, Document, MathReview Entry Cited by: §3.5.
  • S. Hassanpour, C. P. Langlotz, T. J. Amrhein, N. T. Befera, and M. P. Lungren (2017) Performance of a Machine Learning Classifier of Knee MRI Reports in Two Large Academic Radiology Practices: A Tool to Estimate Diagnostic Yield. American Journal of Roentgenology 208 (4), pp. 750–753. Note: Publisher: American Roentgen Ray Society External Links: ISSN 0361-803X, Link, Document Cited by: §2.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng (2019) CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv:1901.07031 [cs, eess]. Note: arXiv: 1901.07031 External Links: Link Cited by: §2, §3.2, §4.1, §4.1, §4.4.
  • A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng (2019) MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv:1901.07042 [cs, eess]. Note: arXiv: 1901.07042 External Links: Link Cited by: §1, §1, §2, §3.2.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pp. btz682. Note: arXiv: 1901.08746 External Links: ISSN 1367-4803, 1460-2059, Link, Document Cited by: §2, §4.2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. External Links: Link Cited by: §2.
  • N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook FAIR’s WMT19 News Translation Task Submission. arXiv:1907.06616 [cs]. Note: arXiv: 1907.06616 External Links: Link Cited by: §4.3.
  • Y. Peng, X. Wang, L. Lu, M. Bagheri, R. Summers, and Z. Lu (2017) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. arXiv:1712.05898 [cs]. Note: arXiv: 1712.05898 External Links: Link Cited by: §2.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §2.
  • A. Poncelas, D. Shterionov, A. Way, G. M. d. B. Wenniger, and P. Passban (2018) Investigating Backtranslation in Neural Machine Translation. arXiv:1804.06189 [cs]. Note: arXiv: 1804.06189 External Links: Link Cited by: §4.3.
  • E. Pons, L. M. M. Braun, M. G. M. Hunink, and J. A. Kors (2016) Natural Language Processing in Radiology: A Systematic Review. Radiology 279 (2), pp. 329–343. Note: Publisher: Radiological Society of North America External Links: ISSN 0033-8419, Link, Document Cited by: §1, §2.
  • A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré (2018) Snorkel MeTaL: Weak Supervision for Multi-Task Learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM’18, Houston, TX, USA, pp. 1–4. External Links: ISBN 978-1-4503-5828-6, Link, Document Cited by: §2.
  • A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré (2020) Snorkel: rapid training data creation with weak supervision. The VLDB Journal 29 (2), pp. 709–730 (en). External Links: ISSN 0949-877X, Link, Document Cited by: §2.
  • K. Saab, J. Dunnmon, R. Goldman, A. Ratner, H. Sagreiya, C. Ré, and D. Rubin (2019) Doubly Weak Supervision of Deep Learning Models for Head CT. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Lecture Notes in Computer Science, Cham, pp. 811–819 (en). External Links: ISBN 978-3-030-32248-9, Document Cited by: §2.
  • G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute (2010) Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association : JAMIA 17 (5), pp. 507–513. External Links: ISSN 1067-5027, Link, Document Cited by: §2.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471. Note: arXiv: 1705.02315 External Links: Link, Document Cited by: §1, §2.
  • Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu (2018) Clinical information extraction applications: A literature review. Journal of Biomedical Informatics 77, pp. 34–49 (en). External Links: ISSN 1532-0464, Link, Document Cited by: §2.
  • D. A. Wood, J. Lynch, S. Kafiabadi, E. Guilhem, A. A. Busaidi, A. Montvila, T. Varsavsky, J. Siddiqui, N. Gadapa, M. Townend, M. Kiik, K. Patel, G. Barker, S. Ourselin, J. H. Cole, and T. C. Booth (2020) Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM). arXiv:2002.06588 [cs]. Note: arXiv: 2002.06588 External Links: Link Cited by: §1, §2.
  • K. Xue, Y. Zhou, Z. Ma, T. Ruan, H. Zhang, and P. He (2019) Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text. (en). External Links: Link Cited by: §2.
  • K. Yadav, E. Sarioglu, H. A. Choi, W. B. Cartwright, P. S. Hinds, and J. M. Chamberlain (2016) Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury. Academic Emergency Medicine: Official Journal of the Society for Academic Emergency Medicine 23 (2), pp. 171–178 (eng). External Links: ISSN 1553-2712, Document Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2020) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 [cs]. Note: arXiv: 1906.08237 External Links: Link Cited by: §2.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. arXiv:1804.09541 [cs]. Note: arXiv: 1804.09541 External Links: Link Cited by: §4.3.
  • J. Zech, M. Pain, J. Titano, M. Badgeley, J. Schefflein, A. Su, A. Costa, J. Bederson, J. Lehar, and E. K. Oermann (2018) Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology 287 (2), pp. 570–580. Note: Publisher: Radiological Society of North America External Links: ISSN 0033-8419, Link, Document Cited by: §2.

Appendix A Appendix

Condition Positive Negative Uncertain Blank
Atelectasis 29,818 (15.66%) 1,018 (0.53%) 29,832 (15.66%) 129,792 (68.15%)
Cardiomegaly 23,302 (12.23%) 7,809 (4.10%) 6,682 (3.51%) 152,667 (80.16%)
Consolidation 12,977 (6.81%) 19,397 (10.18%) 24,345 (12.78%) 133,741 (70.22%)
Edema 49,725 (26.11%) 15,867 (8.33%) 11,746 (6.17%) 113,122 (59.39%)
Enlarged Cardiomed. 9,129 (4.79%) 15,165 (7.96%) 10,278 (5.40%) 155,888 (81.85%)
Fracture 7,364 (3.87%) 1,960 (1.03%) 488 (0.26%) 180,648 (94.85%)
Lung Lesion 6,955 (3.65%) 758 (0.40%) 1,084 (0.57%) 181,663 (95.38%)
Lung Opacity 94,156 (49.44%) 5,006 (2.63%) 4,404 (2.31%) 86,894 (45.62%)
No Finding 16,795 (8.82%) NA NA 173,665 (91.18%)
Pleural Effusion 77,028 (40.44%) 25,097 (13.18%) 9,565 (5.02%) 78,770 (41.36%)
Pleural Other 2,481 (1.30%) 210 (0.11%) 1,801 (0.95%) 185,968 (97.64%)
Pneumonia 4,647 (2.44%) 1,851 (0.97%) 15,907 (8.35%) 168,055 (88.24%)
Pneumothorax 17,688 (9.29%) 47,566 (24.97%) 2,704 (1.42%) 122,502 (64.32%)
Support Devices 107,601 (56.50%) 5,319 (2.79%) 910 (0.48%) 76,630 (40.23%)
Table A1: After removing duplicate reports from the CheXpert dataset (excluding the CheXpert manual set), we are left with a total of 190,460 reports, whose class prevalences are displayed in this table for each medical condition. This comprises the train and dev sets used to train T-auto.
T-auto > CheXpert CheXbert > CheXpert CheXbert > Radiologist Benchmark
Blank 0 20 47
Positive -22 0 45
Negative 14 51 15
Uncertain 16 34 -15
Total 8 105 92
Table A2: This table records the differences in the number of times labels were correctly assigned by one model versus another model. For example, in the first column, named “T-auto > CheXpert,” we report the difference between the number of times T-auto correctly classifies a label and the number of times CheXpert correctly classifies a label. We record the differences between a pair of models by category (blank, positive, negative, uncertain) and by total. These occurrences are obtained on the MIMIC-CXR test set.
Example Labels Reasoning
1. life support devices unchanged in position. 2. redemonstration of diffuse nodular air space opacities which are unchanged from prior examination which may represent air space pulmonary edema versus infection, as clinically correlated. 3. cardiomediastinal silhouette within normal limits for size and unchanged. 4. redemonstration of comminuted mid left clavicular fracture again noted.
Edema
Radiologist: Uncertain
CheXpert: Positive
T-auto: Uncertain
T-auto appears to detect uncertainties about conditions indicated by words like “may” and “versus”. In this case, the phrase did not match an uncertainty detection rule in the CheXpert classifier.
1. initial chest x-ray performed on _____ @ ___ hours demonstrates stable position of the right subclavian line, with a new right upper extremity picc line in place with the tip directed towards the right brachiocephalic confluence. there has been interval development of left basilar patchy airspace opacity, which likely represents atelectasis, although consolidation cannot be entirely excluded. 2. subsequent chest x-ray performed @ 1747 hours demonstrates the right upper extremity picc line tip now extending into the left brachiocephalic confluence. there is no other interval change. 3. subsequent chest x-ray performed @ 1903 hours demonstrates the right upper extremity picc line tip now at the right brachiocephalic confluence, with no other interval change.
Consolidation
Radiologist: Uncertain
CheXpert: Positive
T-auto: Uncertain
Unlike CheXpert, T-auto correctly detects uncertainty conveyed in the phrase “cannot be entirely excluded”.
1. no radiographic evidence of acute cardiopulmonary disease. 2. no evidence of pneumothorax or bony fracture.
Fracture
Radiologist: Negative
CheXpert: Positive
T-auto: Negative
In this example, T-auto is able to detect a negation indicated by “no evidence of”. CheXpert is not able to pick up this negation construction as part of its ruleset.
Table A3: This table contains examples where T-auto correctly assigns a label while CheXpert misassigns that label on the CheXpert manual set. We include speculative reasoning as to why this may have happened.
Example Labels Reasoning
1.persistent low lung volumes. patchy basilar and mid lung airspace opacities, left greater than right, likely represent atelectasis. minimal improved aeration. no new focal consolidation. 2.mild cardiomegaly. persistent small bilateral pleural effusions, left greater than right. 3.sternotomy wires are intact. partially calcified left breast implant is noted.
Cardiomegaly
Radiologist: Positive
CheXpert: Positive
T-auto: Uncertain
T-auto mistakenly labels “mild cardiomegaly” as uncertain for cardiomegaly.
1.frontal and lateral views of the chest demonstrate marked enlargement of the cardiac silhouette. 2.there are diffuse increased interstitial markings and prominence of the central vasculature, consistent with acute and/or chronic pulmonary edema. 3.small bilateral pleural effusions. bibasilar opacities most likely represent atelectasis.
Edema
Radiologist: Positive
CheXpert: Positive
T-auto: Uncertain
T-auto may have incorrectly detected uncertainty from “and/or,” which is a conjunction between “acute” and “chronic”.
Table A4: This table contains examples where CheXpert correctly assigns a label while T-auto misassigns that label on the CheXpert manual set. We include speculative reasoning as to why this may have happened.
Example Labels Reasoning
Worsening, now severe, bilateral pulmonary edema. Supervening pneumonia can certainly not be excluded in the appropriate clinical setting. Interval removal of endotracheal tube. Cardiomediastinal silhouette stable.
Enlarged Cardiomediast.
Radiologist: Uncertain
CheXpert: Positive
T-auto: Positive
CheXbert: Uncertain
In the labeling instructions for radiologists, if a condition is “stable” or shows no change compared to previous readings, the radiologist is told to mark the condition as uncertain. While T-auto does not capture this complexity, CheXbert correctly labels enlarged cardiomediastinum as uncertain.
Removal of dialysis catheter with no evidence of pneumothorax. Heart is mildly enlarged and is accompanied by vascular engorgement and new\n septal lines consistent with interstitial edema. Small pleural effusions have\n increased in size in the interval.
Cardiomegaly
Radiologist: Positive
CheXpert: Blank
T-auto: Blank
CheXbert: Positive
While T-auto and CheXpert do not detect cardiomegaly based on descriptions like “heart is mildly enlarged” or “heart is enlarged,” CheXbert more consistently labels examples with this phrase as positive for cardiomegaly.
New bibasilar opacities, which given the clinical history are suspicious for aspiration, possibly developing pneumonia.
Lung Opacity
Radiologist: Positive
CheXpert: Uncertain
T-auto: Uncertain
CheXbert: Positive
The word “suspicious” does not modify “opacities” in this sentence. Although CheXbert correctly identifies this, CheXpert and T-auto misclassify the “opacities” as uncertain.
PA and lateral chest compared to ___ through ___: Pulmonary edema on ___ has almost entirely cleared, with a small perihilar residual and persistence of small bilateral pleural effusions. Moderate-to-severe cardiomegaly is longstanding.
Edema
Radiologist: Positive
CheXpert: Negative
T-auto: Negative
CheXbert: Positive
In this example, the edema has “almost entirely cleared,” which implies that it is still present. Both T-auto and CheXpert miss this subtlety, while CheXbert correctly classifies the condition as positive.
In comparison with study of ___, there has been removal of pleural fluid from the left hemithorax. No evidence of pneumothorax. Coalescent areas in the left upper and lower zones could well reflect regions of consolidation. The right lung is essentially clear. Right IJ central catheter extends to the lower portion of the SVC.
Consolidation
Radiologist: Uncertain
CheXpert: Positive
T-auto: Positive
CheXbert: Uncertain
CheXbert correctly detects that consolidation is uncertain, as indicated by the phrase “could well reflect”.
Table A5: This table contains examples where CheXbert correctly assigns a label while both T-auto and CheXpert misassign that label on the MIMIC-CXR test set. We include speculative reasoning as to why this may have happened.
Example (cont.) Labels (cont.) Reasoning (cont.)
No previous images. There is hyperexpansion of the lungs suggestive of chronic pulmonary disease. Prominence of engorged and ill-defined pulmonary vessels is consistent with the clinical diagnosis of pulmonary vascular congestion, though in the absence of previous images it is difficult to determine whether any this appearance could reflect underlying chronic pulmonary disease. The possibility of supervening consolidation would be impossible to exclude on this single study, especially without a lateral view. No evidence of pneumothorax.
Consolidation
Radiologist: Uncertain
CheXpert: Positive
T-auto: Positive
CheXbert: Uncertain
CheXbert correctly detects uncertainty for consolidation indicated by the word “possibility”. Both T-auto and CheXpert misclassify consolidation.
1. Left suprahilar opacity and fiducial seeds are again seen, although appears slightly less prominent/small in size, although as mentioned on the prior study, could be further evaluated by chest CT or PET-CT. 2. Right hilum appears slightly more prominent as compared to the prior study, which may be due to patient positioning, although increased right hilar lymphadenopathy is not excluded.
Enlarged Cardiomediast.
Radiologist: Positive
CheXpert: Blank
T-auto: Blank
CheXbert: Positive
The right hilum appearing more prominent is an indicator of enlarged cardiomediastinum, which is clinically understood. If the hilum is growing, then the entire mediastinum is growing. Although both CheXpert and T-auto mislabeled this report impression, CheXbert successfully labeled it positive for enlarged cardiomediastinum.
Removal of dialysis catheter with no evidence of pneumothorax. Heart is mildly enlarged and is accompanied by vascular engorgement and new septal lines consistent with interstitial edema. Small pleural effusions have\n increased in size in the interval.
Cardiomegaly
Radiologist: Positive
CheXpert: Blank
T-auto: Blank
CheXbert: Positive
Due to a limitation of its ruleset for mention detection, CheXpert only looks at “the heart” or “heart size” but not “heart” independently when checking for mentions of cardiomegaly. However, CheXbert recognizes mentions of cardiomegaly implied by phrases like “heart is mildly enlarged”.
As compared to the previous radiograph, there is no relevant change. Large fluid or pneumothorax on the right with air-fluid level in the posterior aspect of the lung. Massive generalized right-sided pleural thickening with slight decrease of the right hemithorax. Fibrotic changes of the lung parenchyma. On the left, there is no abnormality of the pleura or lung parenchyma. The left aspect of the heart border is unremarkable.
Cardiomegaly
Radiologist: Negative
CheXpert: Positive
T-auto: Positive
CheXbert: Negative
The phrase “left aspect of the heart border is unremarkable” is a relatively nonstandard way of saying that there is no evidence of cardiomegaly. CheXbert correctly labels cardiomegaly as negative, while CheXpert and T-auto mislabel cardiomegaly as positive.
Example (cont.) Labels (cont.) Reasoning (cont.)
… A lordotic view might be definitive. Lungs are otherwise clear of focal opacities. There is no pleural effusion or evidence of central adenopathy. Cement and fusions are present in two lower thoracic vertebral bodies, with only minimal loss of height, unchanged since ___. Findings were posted to the online record of critical radiology findings for direct notification of the referring physician, at the time of this dictation.
Lung Opacity
Radiologist: Negative
CheXpert: Positive
T-auto: Positive
CheXbert: Negative
CheXbert correctly detects a negation of lung opacity indicated by the phrase “clear of focal opacities”.
The patient has received a new nasogastric tube. The tube is coiled in the oropharynx and does not reach the esophagus. The stomach is moderately distended and filled with gas. Known left carotid stent. The pre-existing signs indicative of interstitial lung edema have decreased. No evidence of complications, notably no pneumothorax.
Lung Opacity
Radiologist: Blank
CheXpert: Positive
T-auto: Positive
CheXbert: Blank
“Interstitial lung” is a phrase used by the CheXpert ruleset to detect mentions of lung opacity. So, while CheXpert and T-auto mark lung opacity as positive, CheXbert correctly marks it blank (edema is being described instead).
AP chest compared to ___ through ___ at 2:01 p.m.: Previous pulmonary edema is not recurred. There is no pneumothorax or pleural effusion. Heart is not enlarged. Right PIC line ends in the upper SVC. THE STUDY AND THE REPORT WERE REVIEWED BY THE STAFF RADIOLOGIST.
Edema
Radiologist: Negative
CheXpert: Positive
T-auto: Positive
CheXbert: Negative
CheXbert correctly detects the negation described by the statement that “edema is not recurred,” which both CheXpert and T-auto mislabel.
AP chest compared to ___ and ___: Moderately severe pulmonary edema has cleared from much of the lungs since ___ at 5:57 p.m.
Edema
Radiologist: Positive
CheXpert: Negative
T-auto: Negative
CheXbert: Positive
Since “edema has cleared from much of the lungs,” the report implies that edema is still present. CheXbert correctly marks this phrase as positive, while CheXpert and T-auto both mislabel edema as negative.
Sternotomy wires are unchanged. The heart and mediastinal contours are within normal limits and stable. There has been interval decrease in a left-sided pleural effusion with some persisting left basilar atelectasis. The right lung is clear. A line between the posterior aspects of the left third and fourth rib space is more compatible with a skin fold rather than the visceral pleura of the lung, so pneumothorax is not favored. However, given the recent instrumentation, if growing clinical concern for pneumothorax exists, short-interval followup may be considered.
Pneumothorax
Radiologist: Uncertain
CheXpert: Positive
T-auto: Positive
CheXbert: Uncertain
CheXbert correctly detects uncertainty in the phrase “pneumothorax is not favored,” which is missed by both CheXpert and T-auto.
Original Report Backtranslation Changes
1. marked cardiomegaly with a configuration that raises concern for a pericardial effusion. possible mild edema.

2. healed left-sided rib fractures.
1. pronounced cardiomegaly with a configuration that raises concerns about a pericardial effusion. possible mild edema.

2. healed left-sided rib fractures.
“marked” is changed to the synonym “pronounced”, and “raises concern for” is rephrased as “raises concerns about”.
1. redemonstration of right side pleural effusion and bibasilar atelectasis unchanged from comparison.

2. redemonstration of multiple right-sided rib fractures.
1. redemonstration of the pleural effusion of the right side and the bibasilar atelectasia unchanged compared to the comparison.

2. redemonstration of several rib fractures of the right side.
“right side pleural effusion” is rephrased as “pleural effusion of the right side”, “unchanged from comparison” is rephrased to “compared to the comparison” and “multiple right-sided rib fractures” is rephrased as “several rib fractures of the right side”.

However, “atelectasis” is incorrectly changed to “atelectasia”.
1. single ap portable semiupright view of the chest demonstrates no change in medical support devices.

2. persistent dense retrocardiac opacity and small to moderate left pleural effusion present. right perihilar opacity appears resolved.

3. stable cardiomediastinal silhouette. no pulmonary edema.

4. multilevel degenerative changes of the spine.
1. single ap portable semi-upright view of the breast showing no change in medical aids.

2. persistent dense retrocardiac opacity and small to moderate left pleural effusion presented. right perihilar opacity appears resolved.

3. stable cardiomediastinal silhouette. no pulmonary edema.

4. multi-level degenerative changes of the spine.
“semiupright” becomes “semi-upright”, “medical support devices” is changed to “medical aids”, “present” is changed to “presented” and “multilevel” is changed to “multi-level”.

However, “chest” is incorrectly changed to “breast”.
Table A6: This table contains examples of additional data samples generated using backtranslation on radiologist-annotated reports from the CheXpert manual set. Augmenting our relatively small set of radiologist-annotated reports with backtranslation proved useful in improving performance of our labeler on the MIMIC-CXR test set.
Original Report (cont.) Backtranslation (cont.) Changes (cont.)
1. single frontal view of the chest demonstrates a surgical drain projecting over the neck, a tracheostomy tube, a feeding tube which extends below the diaphragm and beyond the inferior margin of the film. cutaneous staples project over the left clavicle, and surgical clips are seen within the left neck. no evidence of pneumothorax.

2. a dense retrocardiac opacity may represent atelectasis versus consolidation.,small bilateral pleural effusions are present. A convex opacity at the right paratracheal region is of uncertain significance; recommend upright pa and lateral for further evaluation when the patient is able.

3. the cardiomediastinal silhouette and pulmonary vasculature are unremarkable.
1. a single frontal view of the breast shows a surgical drain extending over the neck, a tracheostolic tube, a feeding tube extending under the diaphragm and over the lower edge of the film. skin clamps protrude over the left collarbone, and surgical clips are visible in the left cervical area. no indication of pneumothorax.

2. dense retrocardiac opacity may represent ateltasia versus consolidation. small bilateral pleural effusions are present. convex opacity in the right paratracheal area is of uncertain importance; recommend upright pa and lateral for further assessment if the patient is able to do so.

3. the cardiastinal silhouette and pulmonary vasculature are unobtrusive.
“demonstrates a surgical drain projecting over” rephrased to “shows a surgical drain extending over”, “a feeding tube which extends below the diaphragm and beyond the inferior margin of the film” rephrased to “a feeding tube extending under the diaphragm and over the lower edge of the film”, “surgical clips are seen within the left neck” changed to the semantically equivalent “surgical clips are visible in the left cervical area”, “region is of uncertain significance” rephrased as “area is of uncertain importance”, “further evaluation when the patient is able” is rephrased as “further assessment if the patient is able to do so”, and “pulmonary vasculature are unremarkable” is changed to the semantically close “pulmonary vasculature are unobtrusive”.

However, “chest” is incorrectly changed to “breast”, “tracheostomy tube” is incorrectly changed to “tracheostolic tube”, “cutaneous staples project over the left clavicle” is changed to the semantically similar “skin clamps protrude over the left collarbone” (though “skin clamps” is suboptimal), “atelectasis” is incorrectly changed to “ateltasia”, and “cardiomediastinal” is incorrectly changed to “cardiastinal”.
Original Report (cont.) Backtranslation (cont.) Changes (cont.)
1. single ap view of the chest demonstrates hyperinflation of the lungs.

2. there are prominent interstitial opacities which are stable. there is a residual tiny left apical pneumothorax without interval change.

3. cardiomediastinal silhouette is stable.

4. there is nonvisualization of the left costophrenic angle limiting its evaluation and if concerned, repeat study can be performed.
1. a single view of the breast shows hyperinflation of the lungs.

2. there are prominent interstitial opacities that are stable. there is a remaining tiny left apical pneumothorax without interval change.

3. the cardiomediastinal silhouette is stable

4. there is no visualization of the left costophrine angle that restricts its assessment, and if affected, a repeat study can be conducted.
“demonstrates hyperinflation” is rephrased as “shows hyperinflation”, “residual” is changed to the synonym “remaining”, and “angle limiting its evaluation and if concerned, repeat study can be performed” is rephrased to “angle that restricts its assessment, and if affected, a repeat study can be conducted”. The replacement of “concerned” with “affected” appears suboptimal.

However, “ap” is incorrectly removed from the phrase “single ap view of the chest”, “chest” is incorrectly changed to “breast”, and “costophrenic angle” is incorrectly changed to “costophrine angle”.