1 Introduction and Related Work
X-rays are among the most prevalent imaging modalities in medical diagnosis. Consequently, most deep learning applications in medical imaging detect anomalies on X-ray images (Wang et al., 2017; Yao et al., 2017, 2018, 2019). However, expert radiologists agree that X-ray is one of the least specific imaging modalities for clinical diagnosis when compared with alternatives such as MRI and CT (Smith et al., 2016). As a result, X-ray radiology reports inherently express a high degree of uncertainty.
Nevertheless, most traditional natural language processing (NLP) or rule-based systems extract labels from reports as dichotomous outputs for the presence (1) or absence (0) of abnormalities, with no mechanism to express the associated degree of uncertainty (Olatunji et al., 2019; Hassanpour and Langlotz, 2016; Attaluri et al., 2018). At best, tools like NegBio (Peng et al., 2018), the CheXpert labeller (Rajpurkar et al., 2017) and cTAKES (Savova et al., 2010) output a third class representing "uncertain". When these hard targets are used downstream to train image models, they force the models to make a definitive prediction on every case regardless of the confidence expressed in the original radiology reports, which may lead to sub-optimal performance (Hinton et al., 2015).
Inspired by Hinton et al. (2015), we address this issue by training sentence-based and report-based Long Short-Term Memory networks (LSTMs) to augment the discrete labels generated by the rule-based system with a continuous score, which may in turn be interpreted as the model's uncertainty or confidence. We do so without sacrificing sensitivity or specificity.
In particular, an in-house rule-based NLP tool first classifies each report as either normal or abnormal. Given these discrete binary labels, LSTMs are then trained to reproduce them. As a by-product of training, the continuous predictions from the LSTMs capture the confidence and uncertainty of each binary prediction.
For training, we use a private dataset covering 6 body regions (abdomen, chest, spine, upper extremity, lower extremity and head/neck), a total of about 900,000 reports. For testing, we use two datasets, one public and one private. The public dataset from OpenI consists of 7,468 chest X-ray reports along with their ground-truth labels (Demner-Fushman et al., 2012), while the private dataset consists of 2,185 reports hand-labelled by 3 expert radiologists.
2.2 NLP labeling
We extract labels from the reports using domain-specific rule-based NLP tools. We developed the NLP tool in 3 steps. (1) Extraction: we extract findings in the report using NIH's MetaMap (Aronson and Lang, 2010), adding further heuristics to improve sensitivity and specificity. (2) Negation detection: we craft negation rules based on the output of Stanford's CoreNLP dependency parser (Manning et al., 2014). (3) Classification: we craft rules to filter findings based on the negation detection results and return a global label (normal/abnormal) for each report and for each report sentence.
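The actual tool operates on MetaMap findings and CoreNLP dependency parses; as a highly simplified illustration of steps (2)–(3), the sketch below uses hypothetical surface-level negation patterns and an invented `classify_report` helper rather than the real rules.

```python
# Simplified, hypothetical sketch of negation detection (step 2) and
# report-level classification (step 3). The real system uses MetaMap
# findings and CoreNLP dependency parses, not surface patterns.
import re

NEGATION_PATTERNS = [
    r"\bno\b",
    r"\bwithout\b",
    r"\bnegative for\b",
    r"\bunremarkable\b",
]

def sentence_is_normal(sentence: str) -> bool:
    """Treat a sentence as normal if a negation cue is present."""
    s = sentence.lower()
    return any(re.search(p, s) for p in NEGATION_PATTERNS)

def classify_report(sentences: list[str]) -> str:
    """Global label: abnormal if any sentence carries an unnegated finding."""
    return "normal" if all(sentence_is_normal(s) for s in sentences) else "abnormal"

report = [
    "No acute cardiopulmonary abnormality.",
    "Mild degenerative changes of the thoracic spine.",
]
print(classify_report(report))  # second sentence is unnegated -> "abnormal"
```

In the real pipeline, negation scope is resolved on the dependency parse rather than with keyword matching, which avoids false negations such as "no change in the large effusion".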
2.3 LSTM training
With each sentence and its binary label, we train a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) from scratch in Keras with the TensorFlow backend. The embedding layer is a matrix of 100 (embedding dimension) by 22,000 (vocabulary size), followed by a 1D spatial dropout layer with rate 0.2. This is followed by a BiLSTM layer with 256 hidden units and recurrent dropout of 0.4 for regularization. A dense layer with sigmoid activation then outputs the model predictions. Training minimizes the binary cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, beta1 of 0.9, beta2 of 0.999, and epsilon of 1e-07. We train on 8 Tesla V100 GPUs with a minibatch size of 32 over 20 epochs, with early stopping at a patience of 5. At test time, we ensemble the predictions for the sentences in a report by taking the maximum (max-pooling) and compare this against the ground-truth report labels.
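The architecture above can be sketched in `tf.keras` as follows. Hyperparameters come from the text; the variable-length input handling and the placeholder sentence scores in the max-pooling step are assumptions for illustration.

```python
# Sketch of the described BiLSTM sentence classifier in tf.keras.
# Hyperparameters follow the text; input handling is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 22_000  # vocabulary size (from the text)
EMBED_DIM = 100      # embedding dimension (from the text)

model = models.Sequential([
    layers.Input(shape=(None,)),                         # variable-length token ids
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    layers.SpatialDropout1D(0.2),                        # 1D spatial dropout
    layers.Bidirectional(layers.LSTM(256, recurrent_dropout=0.4)),
    layers.Dense(1, activation="sigmoid"),               # continuous score in (0, 1)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07
    ),
    loss="binary_crossentropy",
)

# Test-time ensembling: max-pool the per-sentence scores of one report.
# The scores below are placeholder values, not real model output.
sentence_scores = [0.12, 0.87, 0.40]
report_score = max(sentence_scores)  # 0.87 -> report predicted abnormal
```

Max-pooling mirrors the labelling logic: a report is abnormal if any of its sentences describes an unnegated finding, so the most abnormal sentence determines the report score.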
The first example above (top left) shows how dichotomous labels can mislead model predictions. In the second (middle left) and third (bottom left) reports, however, the uncertainty expressed in the report is retained despite the binary training targets: the model produces soft targets rather than hard binary ones, providing a confidence score otherwise unavailable to downstream image models.
4 Discussion: major implications of using soft labels
Soft targets as a regularizer
Soft targets effectively scale the learning gradients, resulting in smoother updates to model weights, and increasing the models’ robustness to labeling noise (Hinton et al., 2015).
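The gradient-scaling effect can be seen directly from the loss: for binary cross entropy on a sigmoid output, the gradient with respect to the logit is simply prediction minus target, so a soft target close to the prediction yields a smaller update than a hard 0/1 target. A minimal numerical illustration (values are invented):

```python
# Minimal illustration of how soft targets scale learning gradients.
# For binary cross entropy with a sigmoid output p = sigmoid(z),
# d(loss)/dz = p - t, so the gradient shrinks as the target t moves
# from a hard 0/1 value toward the model's prediction p.
def bce_grad_wrt_logit(p: float, t: float) -> float:
    """Gradient of binary cross entropy w.r.t. the pre-sigmoid logit."""
    return p - t

p = 0.7                                  # model's predicted probability
hard = abs(bce_grad_wrt_logit(p, 1.0))   # hard target: "abnormal"
soft = abs(bce_grad_wrt_logit(p, 0.8))   # soft target expressing uncertainty
print(hard, soft)                        # ~0.30 vs ~0.10: smoother update
```

When a label is noisy (e.g. an uncertain report forced to a hard 1), the soft target damps the erroneous gradient, which is the regularizing effect described by Hinton et al. (2015).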
With a continuous prediction, sensitivity and specificity can be adjusted based on specific use cases. For instance, a good triage model would select a threshold that focuses on achieving high sensitivity instead of specificity.
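Operationally, this amounts to sweeping the decision threshold on held-out continuous predictions until the desired operating point is reached. A hedged sketch, with toy scores and labels standing in for real model output and ground truth:

```python
# Hypothetical sketch of threshold tuning on held-out data. The scores
# and labels are toy values; in practice they would be the LSTM's
# continuous predictions and the ground-truth report labels.
def sensitivity_specificity(scores, labels, threshold):
    """Compute (sensitivity, specificity) at a given decision threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0,    0,    1,    0,    1,    1]

# A triage use case might lower the threshold until sensitivity hits a target:
for thr in (0.5, 0.3):
    sens, spec = sensitivity_specificity(scores, labels, thr)
    print(f"threshold={thr}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy example, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.67 to 1.00, the trade-off a triage model would prefer; a binary labeller offers no such knob.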
Unlike binary predictions, uncertainty estimates naturally enable case prioritization in a clinical environment. Abnormal cases with high confidence may be reviewed in a more timely fashion. On the other hand, cases with high uncertainty may be diverted to more experienced clinicians for better diagnosis.
The inherent complexity of rule-based systems also makes them less efficient at test time than model inference. For context, our rule-based system takes an average of 254 seconds to process 1,000 reports using 80 2.5GHz CPU cores. By contrast, it took about 239 seconds to label the 7,468 OpenI reports using 8 Tesla V100 GPUs. On our dataset of 900,000 reports, that amounts to a roughly 8x speedup.
- Aronson and Lang (2010) Alan R Aronson and François-Michel Lang. An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236, 2010.
- Attaluri et al. (2018) Nithya Attaluri, Ahmed Nasir, Carolynne Powe, Harold Racz, Ben Covington, Li Yao, Jordan Prosky, Eric Poblenz, Tobi Olatunji, and Kevin Lyman. Efficient and accurate abnormality mining from radiology reports with customized false positive reduction. arXiv preprint arXiv:1810.00967, 2018.
- Demner-Fushman et al. (2012) Dina Demner-Fushman, Sameer Antani, Matthew Simpson, and George R Thoma. Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering, 6(2):168–177, 2012.
- Hassanpour and Langlotz (2016) Saeed Hassanpour and Curtis P Langlotz. Information extraction from multi-institutional radiology reports. Artificial intelligence in medicine, 66:29–39, 2016.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.
- Olatunji et al. (2019) Tobi Olatunji, Li Yao, Ben Covington, Alexander Rhodes, and Anthony Upton. Caveats in generating medical imaging labels from radiology reports. arXiv preprint arXiv:1905.02283, 2019.
- Peng et al. (2018) Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings, 2017:188, 2018.
- Rajpurkar et al. (2017) Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P. Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
- Savova et al. (2010) Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513, 2010.
- Smith et al. (2016) Brandon J. Smith, Grant S. Buchanan, and Franklin D. Shuler. A comparison of imaging modalities for the diagnosis of osteomyelitis. 2016.
- Wang et al. (2017) Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, et al. Clinical information extraction applications: a literature review. Journal of biomedical informatics, 2017.
- Yao et al. (2017) Li Yao, Eric Poblenz, Dmitry Dagunts, Ben Covington, Devon Bernard, and Kevin Lyman. Learning to diagnose from scratch by exploiting dependencies among labels, 2017.
- Yao et al. (2018) Li Yao, Jordan Prosky, Eric Poblenz, Ben Covington, and Kevin Lyman. Weakly supervised medical diagnosis and localization from multiple resolutions, 2018.
- Yao et al. (2019) Li Yao, Jordan Prosky, Ben Covington, and Kevin Lyman. A strong baseline for domain adaptation and generalization in medical imaging, 2019.