This work studies the generalization performance of current chest X-ray prediction models when trained and tested on X-ray image datasets from different institutions that were annotated by different clinicians or labelling tools. By doing so, we aim to provide supporting evidence for which tasks are reliable and consistent across multiple datasets. Indeed, there appear to be limits to the performance of systems designed to replicate humans, which is consistent with the evidence that human radiologists often disagree with each other. Recent research has discussed these generalization issues Pooch2019; Yao2019op; Baltruschat2019, but it is not clear exactly what causes the problem. We enumerate some possibilities:
Errors in labelling as discussed by oakden-raynerExploringLargeScale2019 and Majkowska2019, in part due to automatic labellers.
Discrepancy between the radiologist’s vs clinician’s vs automatic labeller’s understanding of a radiology report Brady2012.
Bias in clinical practice between doctors and their clinics Busby2018 or limitations in objectivity Cockshott1983; Garland1949.
Interobserver variability Moncada2011. This can be related to medical culture, language, textbooks, or politics, and possibly even to differing concepts (e.g. what “football” refers to in the USA vs. the rest of the world).
Formally, we have pairs of X-ray images, $x$, and corresponding task labels, $y$, drawn from some joint distribution $p(x, y)$ for a given population. Our learning methods estimate $p(y|x)$, but may not generalize well when the joint distribution changes due to, for example, different X-ray machines or variable patient characteristics between different populations. There are several different cases that can give rise to variations in $p(x, y)$, and we will use the terminology of moreno-torresUnifyingViewDataset2012 to describe them. Approaches for generalizing medical image models (e.g. Pooch2019) have assumed $p(y|x)$ to be constant and have concentrated on covariate shift (where $p(x)$ varies) and prior probability shift (where $p(y)$ varies). We present evidence that $p(y|x)$ is not consistent and that what is considered the “ground truth” is subjective; this is concept shift in the terminology of moreno-torresUnifyingViewDataset2012. This forces us to consider $p(y|x, d)$, where the dataset $d$ conditions the prediction. Our experiments suggest that this conditioning is not only related to bias from the population but is due to other factors. This presents a new challenge to overcome when developing diagnostic systems as, under the current formulation, it may be impossible to train a system that will generalize.
To address this issue Majkowska2019 relabeled a subset of the NIH dataset images for 4 labels using 3 raters. On these images their raters disagreed with each other up to 10% of the time for the label “Airspace opacity” and 6% of the time for “Nodule/mass” (we calculate these statistics from the published file individual_readers.csv; any image without unanimous agreement between the 3 raters is counted as a disagreement). When creating the MIMIC-CXR dataset, Johnson2019 used two different automatic label extraction methods. Between these methods the largest disagreement was 0.6% for “Fracture” (when only considering positive and negative labels) or 2.6% for “Cardiomegaly” (when including uncertain and no prediction as well). They also evaluated a subset of the radiology reports against a board-certified radiologist, finding a lowest agreement of 0.462 F1 for “Enlarged Cardiomediastinum”, which can possibly be explained by uncertainty about what cardio-thoracic ratio (CTR) is clinically relevant Zaman2007.
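The unanimous-agreement statistic described above can be sketched as follows. The record layout is hypothetical for illustration; the published individual_readers.csv uses its own column names.

```python
# Sketch: an image counts as a disagreement for a label when the raters
# are not unanimous. Record layout here is hypothetical.
from collections import defaultdict

def disagreement_rate(records):
    """records: list of (image_id, rater_id, vote) tuples for one label,
    where vote is 0/1. Returns the fraction of images without unanimous votes."""
    votes = defaultdict(list)
    for image_id, rater_id, vote in records:
        votes[image_id].append(vote)
    disagreements = sum(1 for v in votes.values() if len(set(v)) > 1)
    return disagreements / len(votes)

records = [
    ("img1", "r1", 1), ("img1", "r2", 1), ("img1", "r3", 0),  # not unanimous
    ("img2", "r1", 0), ("img2", "r2", 0), ("img2", "r3", 0),  # unanimous
]
print(disagreement_rate(records))  # 0.5
```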
These studies indicate that automatic labelling tools are consistent with each other and that the issue is likely related to the well-known problem of interobserver variability. As a step toward mitigating this problem, we focus on studying its impact on current deep learning approaches.
Our approach: In this work we analyze models trained on four of the largest public datasets, utilizing over 200k unique chest X-rays after filtering for one PA view per patient. A study like this is needed as these systems are being built and evaluated now Cohen2019; Qin2019; Baltruschat2019; Hwang2019; Rubin2018; Yao2019op; Putha2018. This work is further motivated by the use of these models on populations very different from their training population, such as in Qin2019, where systems such as qXR (developed in India) are applied to images from Nepal and Cameroon.
There are many issues that could prevent a model from generalizing. For example: the model may overfit to artifacts of the training data Zech2018, concepts can vary between the training labels and the external data, the training data may not be a representative sample of the external data, and the model could be learning very superficial image statistics Jo2017.
The paper is structured into three sections: performance, agreement, and representation. The performance section §LABEL:sec:perf studies the performance of models trained on one dataset and evaluated on the others. The agreement section §LABEL:sec:agree studies how much the predictions of models trained on one dataset agree with those of models trained on other datasets for the same task. Finally, the representation section §LABEL:sec:rep studies how the representations in the neural networks differ between the models. All code is made available online and the data is publicly available.
We use the following datasets: NIH aka Chest X-ray14 WangNIH2017, PC aka PadChest Bustos2019, CheX aka CheXpert Irvin2019, MIMIC-CXR Johnson2019, OpenI Demner-Fushman2016, Google Majkowska2019, and Kaggle aka the RSNA Pneumonia Detection Challenge (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge). Full details of the data are located in Appendix §LABEL:sec:datadetails_apdx.
All datasets are manually mapped to 18 common labels. Code detailing the exact mapping is provided online. We release a framework to load these datasets in a canonical way for further experimentation.
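As an illustration of this mapping step, a minimal sketch follows. The label names and synonyms shown are hypothetical; the real 18-label mapping is defined in the released code.

```python
# Hypothetical sketch of mapping dataset-specific label names onto a shared
# vocabulary. The synonym tables below are illustrative, not the real mapping.
COMMON_LABELS = ["Cardiomegaly", "Fracture", "Nodule/Mass"]  # subset for illustration

# per-dataset source label -> canonical name (illustrative)
NIH_MAP = {"Cardiomegaly": "Cardiomegaly", "Nodule": "Nodule/Mass", "Mass": "Nodule/Mass"}

def to_canonical(dataset_labels, mapping):
    """Return a {canonical_label: 0/1} dict; labels a dataset lacks stay absent."""
    out = {}
    for src, value in dataset_labels.items():
        if src in mapping:
            canon = mapping[src]
            out[canon] = max(out.get(canon, 0), value)  # any positive synonym -> positive
    return out

print(to_canonical({"Nodule": 1, "Mass": 0, "Cardiomegaly": 0}, NIH_MAP))
# {'Nodule/Mass': 1, 'Cardiomegaly': 0}
```

Keeping absent labels out of the output (rather than filling them with 0) is what later allows the loss to skip labels a dataset never annotated.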
We resize the images to a fixed square resolution, utilizing a center crop if the aspect ratio is uneven, and scale the pixel values to a fixed range for training.
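A sketch of this preprocessing is below. The exact resolution and value range did not survive in this text, so the 224×224 output and [-1024, 1024] range used here are assumptions for illustration; a real pipeline would also use proper interpolation rather than index sampling.

```python
import numpy as np

def preprocess(img, size=224, out_min=-1024.0, out_max=1024.0):
    """img: 2D array with values in [0, 255]. Center-crop to a square,
    resize, then rescale values. size and [out_min, out_max] are assumed."""
    h, w = img.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    img = img[top:top + side, left:left + side].astype(np.float32)
    # nearest-neighbour resize via index sampling (illustrative only)
    idx = np.arange(size) * side // size
    img = img[np.ix_(idx, idx)]
    # map [0, 255] onto [out_min, out_max]
    return img / 255.0 * (out_max - out_min) + out_min

x = preprocess(np.random.randint(0, 256, (300, 400), dtype=np.uint8))
print(x.shape)  # (224, 224)
```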
DenseNets Huang2017 have been shown to be the best architecture for X-ray prediction models Rajpurkar2017. We evaluated ResNets and ShuffleNets but they achieved similar performance. Training followed standard practice from similar work such as Rajpurkar2017. To account for the fact that only some labels are present in the recent (2019+) datasets, the loss is computed only over the available labels and the other outputs are ignored.
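The masked loss can be sketched as follows, assuming missing labels are marked NaN: binary cross-entropy is averaged only over the labels a dataset actually provides.

```python
import numpy as np

def masked_bce(probs, targets):
    """probs, targets: arrays of shape (batch, tasks); targets may contain
    NaN for labels absent from a dataset, which are excluded from the loss."""
    mask = ~np.isnan(targets)
    p, t = probs[mask], targets[mask]
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

probs = np.array([[0.9, 0.2, 0.5]])
targets = np.array([[1.0, 0.0, np.nan]])  # third label absent from this dataset
loss = masked_bce(probs, targets)  # averaged over the two labelled tasks only
print(round(float(loss), 4))  # 0.1643
```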
Due to label imbalance, underrepresented tasks receive less focus from the loss function. To alleviate this, the weight for each task is balanced based on the frequency of that task in the dataset. Each task $t$ is given a weight based on the following formula, where $c_t$ is the count of positive samples for task $t$.
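The weighting formula itself did not survive extraction; the sketch below shows one plausible inverse-frequency scheme consistent with the description, and is an assumption rather than the paper's exact formula.

```python
# Assumed inverse-frequency weighting (not necessarily the paper's exact
# formula): c_t is the count of positive samples for task t, n the dataset
# size. Frequent tasks are down-weighted so all tasks contribute more evenly.
def task_weights(counts, n):
    return {task: (n - c) / n for task, c in counts.items()}

weights = task_weights({"Cardiomegaly": 5000, "Fracture": 500}, n=10000)
print(weights)  # rarer tasks receive weights closer to 1
```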
To calibrate the outputs of the models so that they can be compared, a piecewise linear transformation (Eq. 1) is applied. The transformation is chosen so that the best operating point corresponds to 0.5. For each disease, we computed the optimal operating point by maximizing the difference (true positive rate − false positive rate). This corresponds to the threshold which maximizes the informedness of the classifier powersEvaluationPrecisionRecall2011. This is computed with respect to the test set being evaluated, so each model is given the best operating point it can have. With this we remove miscalibration as a cause of generalization error.
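This calibration can be sketched as follows: find the threshold maximizing informedness (TPR − FPR, Youden's J), then piecewise-linearly remap scores so that this threshold lands at 0.5. The function names are illustrative, not from the released code.

```python
import numpy as np

def best_operating_point(scores, labels):
    """Threshold maximizing informedness = TPR - FPR (Youden's J)."""
    pos, neg = labels == 1, labels == 0
    best_t, best_j = 0.5, -1.0
    for t in np.unique(scores):
        tpr = np.mean(scores[pos] >= t)
        fpr = np.mean(scores[neg] >= t)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

def calibrate(scores, op):
    """Piecewise linear map: [0, op] -> [0, 0.5] and [op, 1] -> [0.5, 1]."""
    return np.where(scores < op,
                    0.5 * scores / op,
                    0.5 + 0.5 * (scores - op) / (1 - op))

scores = np.array([0.1, 0.2, 0.35, 0.4, 0.8, 0.9])
labels = np.array([0,   0,   0,    1,   1,   1])
op = best_operating_point(scores, labels)  # 0.4 for this toy data
print(calibrate(scores, op))               # the operating point maps to 0.5
```

Because the operating point is fit on the test set being evaluated, cross-dataset performance gaps that survive this step cannot be attributed to threshold miscalibration.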