Learning to estimate label uncertainty for automatic radiology report parsing

10/01/2019 ∙ by Tobi Olatunji, et al. ∙ Enlitic, Inc.

Bootstrapping labels from radiology reports has become the scalable alternative for providing inexpensive ground truth for medical imaging. Because of the domain-specific nature of the task, state-of-the-art report labeling tools are predominantly rule-based. These tools, however, typically yield a binary 0 or 1 prediction that indicates the absence or presence of abnormalities. These hard targets are then used as ground truth to train downstream image models, forcing the models to express a high degree of certainty even in cases where specificity is low. This could negatively impact the statistical efficiency of image models. We address this issue by training a Bidirectional Long Short-Term Memory network to augment heuristic-based discrete labels of X-ray reports from all body regions, achieving performance comparable to or better than domain-specific NLP while adding uncertainty estimates that enable finer downstream image model training.




1 Introduction and Related Work

X-rays are among the most prevalent imaging modalities in medical diagnosis. Consequently, most deep learning medical imaging applications detect anomalies on X-ray images (Wang et al., 2017; Yao et al., 2017, 2018, 2019). However, expert radiologists agree that X-ray is one of the least specific imaging modalities for clinical diagnosis when compared with modalities such as MRI and CT (Smith et al., 2016). As a result, X-ray radiology reports inherently express a high degree of uncertainty.

Nevertheless, most traditional natural language processing (NLP) or rule-based systems that extract labels from reports yield dichotomous output for the presence (1) or absence (0) of abnormalities, without mechanisms to express the associated degree of uncertainty (Olatunji et al., 2019; Hassanpour and Langlotz, 2016; Attaluri et al., 2018). At best, tools like NegBio (Peng et al., 2018), the CheXpert labeller (Rajpurkar et al., 2017) and cTAKES (Savova et al., 2010) output a third class representing "uncertainty". When these hard targets are used downstream to train image models, they force the models to make a definitive prediction on all cases regardless of the confidence expressed in the original radiology reports. This may lead to sub-optimal performance (Hinton et al., 2015).

Inspired by Hinton et al. (2015), we address this issue by training sentence-based and report-based Long Short-Term Memory networks (LSTMs) to augment the discrete labels generated by the rule-based system with a continuous score, which in turn may be interpreted as the model’s uncertainty or confidence. We do so without sacrificing sensitivity or specificity.

In particular, an in-house rule-based NLP tool first classifies each report as either normal or abnormal. Given these discrete binary labels, LSTMs are then trained to reproduce them. As a by-product of training, the continuous predictions from the LSTMs may be used to capture the confidence and uncertainty of a binary prediction.

2 Experiments

2.1 Datasets

For training, we use a private dataset that covers 6 body regions (abdomen, chest, spine, upper extremity, lower extremity and head/neck), totaling about 900,000 reports. For testing, we use two datasets, one public and one private. The public dataset from OpenI consists of 7,468 chest X-ray reports along with their ground truth labels (Demner-Fushman et al., 2012), while the private dataset has 2,185 reports hand-labelled by 3 expert radiologists.

2.2 NLP labeling

We extract labels from the reports using a domain-specific rule-based NLP tool, which we developed in 3 steps. (1) Extraction: we extract findings in the report using NIH’s MetaMap (Aronson and Lang, 2010), adding further heuristics to improve sensitivity and specificity. (2) Negation detection: we craft negation rules based on the output of Stanford’s CoreNLP dependency parser (Manning et al., 2014). (3) Classification: we craft rules to filter findings based on negation detection results and return a global label (normal/abnormal) for each report and for each report sentence.
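The classification step (3) can be illustrated with a minimal sketch. The paper does not publish the actual rules, so the `Finding` type, the function names and the "any non-negated finding means abnormal" rule below are all assumptions made for illustration; extraction and negation detection are taken as already done upstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """Hypothetical container for one extracted finding."""
    term: str      # finding text, as produced by an extraction step
    negated: bool  # True if a negation rule fired for this finding


def label_sentence(findings: List[Finding]) -> int:
    """Return 1 (abnormal) if any finding survives negation filtering, else 0."""
    return int(any(not f.negated for f in findings))


def label_report(sentences: List[List[Finding]]) -> int:
    """Return 1 (abnormal) if any sentence is abnormal, else 0 (normal)."""
    return int(any(label_sentence(s) for s in sentences))


# Two-sentence toy report: a negated finding, then an asserted one.
report = [
    [Finding("pneumothorax", negated=True)],
    [Finding("pleural effusion", negated=False)],
]
sentence_labels = [label_sentence(s) for s in report]
report_label = label_report(report)
```

In this toy report the first sentence is labelled normal and the second abnormal, so the global report label is abnormal.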

2.3 LSTM training

With each sentence and its binary label, we train a Bidirectional LSTM (Hochreiter and Schmidhuber, 1997) from scratch in Keras using the TensorFlow backend. The embedding layer is a matrix of 100 (embedding dimension) by 22,000 (vocabulary size), followed by a 1D spatial dropout layer with a rate of 0.2. This is followed by a BiLSTM layer with 256 hidden units and a recurrent dropout of 0.4 for regularization. A dense layer with sigmoid activation then outputs the model predictions. Training minimizes the binary cross-entropy loss using adaptive moment estimation (Adam; Kingma and Ba, 2014) as the optimizer, with an initial learning rate of 0.001, beta1 of 0.9, beta2 of 0.999, and epsilon of 1e-07. We train on 8 Tesla V100 GPUs with a minibatch size of 32 samples over 20 epochs, using a patience of 5 for early stopping. At test time, we ensemble the predictions for each sentence in a report by taking the maximum (max pooling) and compare this against the ground truth report labels.
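The test-time ensembling described above reduces to taking the maximum over per-sentence scores. The sketch below assumes the sentence scores are sigmoid outputs in [0, 1]; the function names and the 0.5 decision threshold are illustrative choices, not values given in the paper.

```python
from typing import List


def report_score(sentence_scores: List[float]) -> float:
    """Max-pool per-sentence abnormality scores into one report-level score."""
    if not sentence_scores:
        raise ValueError("a report needs at least one scored sentence")
    return max(sentence_scores)


def report_label(sentence_scores: List[float], threshold: float = 0.5) -> int:
    """Binarize the pooled score; the 0.5 default is an assumed threshold."""
    return int(report_score(sentence_scores) >= threshold)


# Hypothetical sigmoid outputs for a three-sentence report.
scores = [0.03, 0.08, 0.71]
pooled = report_score(scores)
label = report_label(scores)
```

Max pooling means a single confidently abnormal sentence is enough to flag the whole report, which mirrors the rule-based convention that one positive finding makes a report abnormal.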

3 Results

Figure 1: Left: Examples of uncertainty in reports. Right: BiLSTM Performance compared with Rule-based NLP single operating point on Public (OpenI) and Private Datasets

The first example above (left top) shows that dichotomous labels sometimes mislead the model prediction. However, the second (left middle) and third (left bottom) examples show reports whose uncertainty is retained despite the binary targets. Rather than hard binary targets, the model produces uncertainty estimates (soft targets) that serve as a confidence score otherwise unavailable to downstream image models.

4 Discussion: major implications of using soft labels

Soft target as a regularizer

Soft targets effectively scale the learning gradients, resulting in smoother updates to model weights, and increasing the models’ robustness to labeling noise (Hinton et al., 2015).
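One way to read the gradient-scaling claim: for a sigmoid output trained with binary cross-entropy, the gradient of the loss with respect to the logit z is sigmoid(z) - t, a standard identity. A soft target t that sits closer to the current prediction therefore yields a smaller update than a hard target of 0 or 1. The sketch below only illustrates that identity; the target values are made up.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad_wrt_logit(z: float, target: float) -> float:
    """Gradient of binary cross-entropy with respect to the logit z."""
    return sigmoid(z) - target


z = 0.0                                  # model currently predicts 0.5
grad_hard = bce_grad_wrt_logit(z, 1.0)   # hard label "abnormal"
grad_soft = bce_grad_wrt_logit(z, 0.7)   # hypothetical soft label
```

Here the hard target produces a gradient of magnitude 0.5, while the soft target produces one of about 0.2, so the same example pulls the weights less strongly when the label itself is uncertain.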

Application-specific thresholding

With a continuous prediction, sensitivity and specificity can be adjusted based on specific use cases. For instance, a good triage model would select a threshold that favors high sensitivity over specificity.
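As a rough illustration of threshold selection, the sketch below picks the highest threshold that still reaches a target sensitivity, which keeps specificity as high as possible under that constraint. The scores, labels and function names are invented for the example, and the code assumes both classes are present in the labels.

```python
from typing import List, Tuple


def sensitivity_specificity(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(
    scores: List[float], labels: List[int], target: float
) -> float:
    """Return the highest observed score usable as a threshold with sensitivity >= target."""
    for t in sorted(set(scores), reverse=True):
        sens, _ = sensitivity_specificity(scores, labels, t)
        if sens >= target:
            return t
    raise ValueError("target sensitivity is not reachable")


# Hypothetical validation scores and ground-truth labels.
scores = [0.95, 0.80, 0.60, 0.45, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 0, 1, 0, 0]

default_point = sensitivity_specificity(scores, labels, 0.5)
triage_threshold = threshold_for_sensitivity(scores, labels, target=1.0)
triage_point = sensitivity_specificity(scores, labels, triage_threshold)
```

On this toy data, lowering the threshold from 0.5 to 0.40 raises sensitivity from 2/3 to 1.0 at the cost of specificity dropping from 0.75 to 0.5, which is the trade a triage setting would typically accept.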

Study prioritization

Unlike binary predictions, uncertainty estimates naturally enable case prioritization in a clinical environment. Abnormal cases with high confidence may be reviewed in a more timely fashion. On the other hand, cases with high uncertainty may be diverted to more experienced clinicians for better diagnosis.
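A minimal sketch of how such prioritization might look, assuming each study carries a report-level abnormality score in [0, 1]. The queue names and the 0.2 / 0.8 cut-offs are hypothetical and would need to be set per deployment.

```python
from typing import Dict, List


def route(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Assign a study to a queue based on how confident the score is."""
    if score >= high:
        return "urgent_review"   # confidently abnormal: review promptly
    if score <= low:
        return "routine"         # confidently normal
    return "senior_review"       # uncertain: divert to experienced clinicians


def build_worklist(cases: Dict[str, float]) -> List[str]:
    """Order study IDs by descending abnormality score."""
    return sorted(cases, key=lambda study_id: cases[study_id], reverse=True)


# Hypothetical study IDs and scores.
cases = {"study_a": 0.97, "study_b": 0.55, "study_c": 0.04, "study_d": 0.88}
worklist = build_worklist(cases)
queues = {study_id: route(score) for study_id, score in cases.items()}
```

With hard 0/1 labels neither the ordering nor the middle "uncertain" queue would be available, since all abnormal cases would tie.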

Labeling Efficiency

The inherent complexity of rule-based systems makes them less efficient at test time when compared with model predictions. For context, our rule-based system takes an average of 254 seconds to process 1,000 reports using 80 2.5GHz CPU cores. By contrast, it took about 239 seconds to label 7,486 OpenI reports using 8 Tesla V100 GPUs. On our dataset of 900,000 reports, that could be a 10x speedup.
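The timings quoted above can be extrapolated to the full dataset with simple arithmetic. The sketch assumes both systems scale linearly with the number of reports and ignores that the two runs used different hardware, so it is only a back-of-the-envelope check.

```python
# Per-report cost derived from the timings quoted in the text.
rule_based_sec_per_report = 254 / 1000   # 254 s per 1,000 reports (80 CPU cores)
bilstm_sec_per_report = 239 / 7486       # 239 s for 7,486 reports (8 V100 GPUs)

n_reports = 900_000  # size of the training corpus

# Linear extrapolation to the full corpus, in hours.
rule_based_hours = rule_based_sec_per_report * n_reports / 3600
bilstm_hours = bilstm_sec_per_report * n_reports / 3600

# Ratio of per-report costs.
speedup = rule_based_sec_per_report / bilstm_sec_per_report
```

Under these assumptions the rule-based system would need about 63.5 hours and the BiLSTM about 8 hours, a ratio close to 8x, which is the same order of magnitude as the rough 10x figure.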