Tremendous progress has been achieved in predictive analytics for medical imaging. With the advent of powerful machine-learning approaches such as deep learning, staggering improvements in predictive accuracy have been demonstrated for applications such as computer-aided diagnosis [esteva2017dermatologist] or assisting radiotherapy planning and monitoring of disease progression via automatic contouring of anatomical structures [menze2015multimodal]. However, two of the main obstacles for translating these successes to more applications and into wider clinical practice remain: data scarcity, concerning the limited availability of high-quality training data required for building predictive models; and data mismatch, whereby a model trained in a lab environment may fail to generalize to real-world clinical data.
Let us illustrate with a hypothetical scenario how these obstacles may arise in practice and pose real threats to the success of research projects. Suppose a team of academic radiologists is excited about the opportunities artificial intelligence seems to offer for their discipline. In a recent study, the clinical team was able to demonstrate the effectiveness of using human interpretation of magnetic resonance imaging (MRI) for diagnosis of prostate cancer, yielding higher sensitivity and specificity than a conventional diagnostic test, as confirmed via ground-truth labels from histopathology. Motivated by these results, the team decides to approach a machine-learning (ML) research lab with the idea of developing a tool for automated, MRI-based diagnosis of prostate cancer. Because reading MRI requires advanced training and experience, they hope such a system may facilitate widespread adoption of MRI as a novel, accurate, and cost-effective tool for early diagnosis, especially in locations with lower availability of the required human expertise.
The clinicians still have access to their previous study data, and are confident this may be used for ML development. Unfortunately, the sample size is small—there are insufficient pairs of images and diagnosis labels to train a state-of-the-art deep learning image classification method. However, the clinicians have access to large amounts of (unlabelled) routine MRI scans. The ML researchers are hopeful they can additionally leverage this data in a so-called semi-supervised learning strategy. After a pilot phase of method development, the team is planning to evaluate their system in a large multi-centre study.
What are the chances of success for their project, and how could a causal analysis help them to identify potential issues in advance? Regarding the limited availability of annotated data, here the team may be lucky in successfully exploiting the unlabelled data thanks to the anticausal direction between images and confirmed diagnosis labels (as we will discuss later in more detail). However, a major obstacle arises due to a mismatch between the retrospective study data and the prospective multi-centre data due to specific inclusion criteria in the previous study (selection bias), varying patient populations (e.g. changes in demographics), and prevalence of disease (e.g. due to environmental factors). While researchers are generally aware of the adverse effects of such differences in aspects of the data, they may be unaware that causal reasoning provides tools for laying out any underlying assumptions about the data generating process in a clear and transparent fashion, such that any issues can be more easily identified beforehand and possibly resolved by employing suitable data collection, annotation, and machine-learning strategies.
In this article we discuss how causal considerations in medical imaging can shed new light on the above challenges and help in finding appropriate solutions. In particular, we illustrate how the causal structure of a task can have profound, and sometimes surprising, consequences on the soundness of the employed machine-learning approach and resulting analysis. We highlight that being aware of causal relationships, and related issues such as dataset shift and selection bias, allows for systematic reasoning about what strategies to prefer or to avoid. Here, the language of causal diagrams provides explicit means to specify assumptions, enabling transparent scrutiny of their plausibility and validity [bareinboim2016datafusion]. It is in fact a natural way of defining the relationships between variables of interest, because it reflects the expert’s knowledge of the biological and logistical processes involved in the generation and collection of data, and has been successfully applied for building models for decision-making in healthcare, for example [lucas2004bayesnets, cypko2017validation]. In addition, we provide in the appendix a gentle introduction to the relevant causal-theoretic concepts and notes on creating and interpreting causal diagrams. We hope our work can serve as a practical guide and inspire new directions for research in medical imaging.
0.1 Predictive analytics in medical imaging.
The focus of this article is on predictive modelling: given an image $X$, train a model to predict some given annotation $Y$ by fitting a statistical model with a suitable objective function. This formulation encompasses a variety of common medical image analysis tasks, such as semantic segmentation (i.e. contouring of structures of interest), disease classification, outcome prediction, and many more.
In this context, it is worth clarifying some terminology regarding the data that is used for development and after deployment, in order to avoid confusion of some terms that are sometimes used differently in clinical and machine-learning communities. We refer to an annotated dataset with pairs $(X, Y)$ as the development data, which is used to train and test a predictive model in a lab environment. In ML workflows, the development data is typically split into a training, a validation and a hold-out test set. The training set is used to learn the model parameters (e.g. the weights in a convolutional neural network), while the validation set is used during training to monitor the learning progress and avoid overfitting to the training set. The test set is used only after training is completed, in order to quantify the performance of the model on ‘unseen’ data. However, as the development data is often re-used during iterative development cycles, it is well known that information can leak from the test set into training, hence performance reported on the test set can become unrealistic over time [dwork2015holdout], which poses a major problem for regulators.
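As an illustration of the three-way split described above, a minimal sketch using only the Python standard library follows. The function name `split_development_data`, the 70/15/15 proportions, the fixed seed and the toy `(image_id, label)` pairs are assumptions made for this example; the text does not prescribe any of them.

```python
import random


def split_development_data(pairs, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle (image, annotation) pairs and split them into three disjoint sets."""
    if not 0.0 <= val_frac + test_frac < 1.0:
        raise ValueError("val_frac + test_frac must lie in [0, 1)")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = int(round(test_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    test = shuffled[:n_test]                   # held out until training is finished
    val = shuffled[n_test:n_test + n_val]      # monitors learning progress
    train = shuffled[n_test + n_val:]          # used to fit the model parameters
    return train, val, test


# Hypothetical development data: 100 (image_id, label) pairs.
pairs = [(f"scan_{i:03d}", i % 2) for i in range(100)]
train, val, test = split_development_data(pairs)
print(len(train), len(val), len(test))
```

Fixing the split once and touching the test portion only at the end is what guards against the leakage mentioned above; re-drawing the split in every development cycle would defeat that purpose.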
Importantly, the assumption that the performance of a trained model on the development test set is representative of the performance on new clinical data after deployment in varying environments is often violated due to differences in data characteristics, as discussed earlier. Furthermore, contrary to the development test set, the real-world test data after deployment will not come with ground-truth annotations, and performance is thus difficult (or impossible) to assess. It is therefore absolutely critical to be able to clearly formalize and communicate the underlying assumptions regarding the data generating processes in the lab and real-world environments, which in turn can help anticipate and mitigate failure modes of the predictive system.
0.2 Challenges in medical imaging.
One of the notorious challenges in medical image analysis is the scarcity of labelled data, in great part due to the high costs of acquiring expert annotations or expensive lab tests, e.g. to confirm initial diagnosis. The techniques often used to circumvent this shortage have markedly different properties under the lens of causality. First, we will discuss semi-supervised learning (SSL): the attempt to improve predictive performance by additionally exploiting more abundant unlabelled data. We will follow with an analysis of data augmentation, a powerful paradigm for artificially boosting the amount of labelled data.
In addition, the recurrent issue of mismatch between data distributions, typically between training and test sets or development and deployment environments, tends to hurt the generalizability of learned models. In the generic case with unconstrained disparities, any form of learning from the training set is arguably pointless, as the test-time performance can be arbitrarily poor. Nonetheless, causal reasoning enables us to recognize special situations in which direct generalization is possible, and to devise principled strategies to mitigate estimation biases. In particular, two distinct mechanisms of distributional mismatch can be identified: dataset shift and sample selection bias. Learning about their differences is helpful for diagnosing when such situations arise in practice.
Before diving into the details of these challenges, however—illustrated with cartoon examples in Fig. 1—the causal properties of the core predictive task must be analysed. In particular, one must pay close attention to the relationship between the inputs and targets of the model.
0.3 Causality in medical imaging.
Given the specification of the input images, $X$, and the prediction targets, $Y$, it is imperative to determine which is the cause and which is the effect. Using the categorization in Ref. [scholkopf2012causal], we wish to establish whether a task is
causal: estimate $P(Y \mid X)$, when $X \to Y$ (predict effect from cause); or
anticausal: estimate $P(Y \mid X)$, when $Y \to X$ (predict cause from effect).
The answer is crucial to all further causal analysis of the problem, and has a strong impact on the applicability of semi-supervised learning [chapelle2006ssl, scholkopf2013ssl] (discussed later) and on whether generative or discriminative models should be preferred [blobaum2015discriminative].
Recall the definitions of cause and effect (see Methods): if the annotation could have been different by digitally editing the image beforehand, then one can conclude that the image causes the annotation. For example, manual segmentation masks are drawn over the image by visual inspection and would evidently be influenced by certain pixel changes. On the other hand, a pathology lab result would be unaffected by such manipulations. Images and targets may alternatively be confounded, i.e. descend from a common cause. This relationship is often treated similarly to the anticausal case [scholkopf2012causal].
| Attribute | Examples |
|---|---|
| **Causal direction** (predict effect from cause or cause from effect) | |
| field of application | diagnosis / screening / prognosis / exploratory research |
| task category | segmentation / classification / regression / detection |
| annotation method | manual / (semi-)automatic / clinical tests; annotation policy |
| nature of annotations | image-wide label / pixel-wise segmentation / spatial coordinates |
| annotation reliability | image noise, acquisition artefacts, low contrast; user or software errors; signal-to-noise ratio, inter- and intra-observer variability |
| **Data mismatch** (comparing development vs. deployment environments) | |
| cohort characteristics | healthy volunteers / patients; demographics, medical records |
| subject selection | routine / specific condition or treatment / specific age range; quality control |
| acquisition conditions | single- / multi-site; modality; device; vendor; protocol |
| train-test split | random / stratified / balanced |
| annotation process | (see above) |
It is generally possible to discern causal structures only when we are aware of the acquired data’s background details, as meta-information plays a fundamental role in understanding the data generation and collection processes. Based on a comprehensive ontology of medical imaging meta-information [maierhein2018rankings], we have compiled in Table 1 a list of attributes that are meaningful for characterizing the predictive causal direction and detecting signs of dataset mismatch. Let us further illustrate this discussion with two practical examples, depicted in Fig. 2. Their descriptions will mention some concepts related to dataset mismatch that will be discussed in detail later on.
0.4 Skin lesion classification example.
Assume a set of dermoscopic images ($X$) is collected along with histopathology diagnosis for melanoma following biopsy ($Y$). Here, $Y$ is a gold-standard proxy for the true presence of skin cancer, and as such can be considered as a cause of the visual appearance of the lesion, $X$. This task is therefore anticausal (note the arrow directions in Fig. 2).
Routine dermoscopic examination of pigmented skin lesions typically results in a ‘benign’, ‘suspicious’, or ‘malignant’ label. Prediction of such labels would instead be causal, as they are obtained visually and could be affected if the images were digitally manipulated. Moreover, we know that patients are referred for biopsy only if dermoscopy raises suspicions. As inclusion in this study is case-dependent, a dataset with ground truth biopsy labels suffers from sample selection bias, and is thus not representative of the overall distribution of pigmented skin lesions.
0.5 Brain tumour segmentation example.
Structural brain MRI scans ($X$) are acquired for a cohort of glioma patients, after which a team of radiologists performs manual contouring of each lesion ($Y$). This annotation is done by visual inspection and evidently depends on image content, resolution, and contrast, for example, whereas manually editing the segmentation masks would have no effect on the images. These considerations allow us to conclude this is a case of causal prediction ($X \to Y$). Here it might also be natural to assume the radiologists were aware of the diagnosis (e.g. specific cancer subtype and stage), in which case we could include an additional arrow from ‘disease’ to ‘segmentation’. This would however not alter the fact that the segmentations are a consequence of the images (and diagnoses), thus the task remains causal. Regardless, notice how any model trained on this data will be learning to replicate this particular manual annotation process, rather than to predict a ‘true’ underlying anatomical layout.
In addition, suppose our dataset was collected and annotated for research purposes, employing a high-resolution 3 T MRI scanner and containing a majority of older patients, and that the trained predictive model is to be deployed for clinical use with conventional 1.5 T scanners. This is a clear case of dataset shift: firstly, because the images are expected to be of different quality (acquisition shift); and secondly, because the different age distribution in the target population entails variations in brain size and appearance, and in the prevalences of various types of tumour (population shift).
For the two examples above, establishing the causal direction between images and prediction targets seemed reasonably straightforward. This is not always the case, and arguably in many settings identifying whether the relationship is causal or anticausal can be non-trivial, particularly if crucial meta-information is missing. Consider the case when prediction targets are extracted from radiology reports. At first, one may conclude that the report reflects purely the radiologist’s reading of a medical image, hence image causes report. However, the radiologist’s conclusions might be based on additional information, potentially even more important than the findings in the images, such as blood tests or other diagnostic test results. This instance highlights the importance of modelling the full data generating process and of gathering the right information to make an informed decision about the causal relationships underlying the data.
0.6 Tackling data scarcity via semi-supervision.
Semi-supervised learning (SSL) aims to leverage readily available unlabelled data in the hope of producing a better predictive model than is possible using only the scarce annotated data. Given this ambitious goal, it is perhaps unsurprising that strong requirements need to be met. Namely, the distribution of inputs needs to carry relevant information about the prediction task—otherwise it would be pointless to collect additional unlabelled data. This idea is typically articulated in terms of specific assumptions about the data which can be intuitively summarised as follows [chapelle2006ssl]: similar inputs (images in our case) are likely to have similar labels and will naturally group into clusters with high density in the input feature space. Lower density regions in that space in-between clusters are assumed to be ideal candidates for fitting decision boundaries of predictive models. In this context, considering large amounts of unlabelled data together with the scarce labelled data may reveal such low density regions and may lead to better decision boundaries than using labelled data alone.
Note how this idea insinuates an interplay between the distribution of inputs, $P(X)$, and the label conditional, $P(Y \mid X)$. Now recall that, by independence of cause and mechanism (see Methods), if the prediction task is causal ($X \to Y$), then $P(X)$ is uninformative with respect to $P(Y \mid X)$, and SSL is theoretically futile in this case [chapelle2006ssl, scholkopf2013ssl]. Since typical semantic segmentation tasks are causal, as illustrated in our brain tumour example, there is likely very little hope that semantic segmentation can fundamentally benefit from unlabelled data, which may relate to recent concerns raised in the literature [oliver2018evaluation]. Conversely, if $Y \to X$, as for skin lesions, then these distributions may be dependent, and semi-supervision has a chance of success [scholkopf2013ssl]. We conjecture that, in practice, anticausal problems are more likely than causal ones to comply with the SSL assumptions outlined above, as observed e.g. among the datasets analysed in Ref. [blobaum2015discriminative].
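The argument can be made explicit by writing the joint distribution in its causal factorization for each direction. The display below is only a restatement of the principle invoked above, with $X$ denoting the images and $Y$ the prediction targets; the labels "cause" and "mechanism" are annotations added for this sketch.

```latex
\begin{align*}
\text{causal } (X \to Y):\quad
  & P(X, Y) = \underbrace{P(X)}_{\text{cause}}\;
              \underbrace{P(Y \mid X)}_{\text{mechanism}}, \\
\text{anticausal } (Y \to X):\quad
  & P(X, Y) = \underbrace{P(Y)}_{\text{cause}}\;
              \underbrace{P(X \mid Y)}_{\text{mechanism}},
  \qquad P(X) = \sum_{y} P(Y = y)\, P(X \mid Y = y).
\end{align*}
```

In the causal case, the two factors are assumed to be independent modules, so additional samples from $P(X)$ carry no information about $P(Y \mid X)$. In the anticausal case, $P(X)$ is a mixture of the class-conditional appearance models, which is the kind of cluster structure that unlabelled data may reveal.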
That is not to say that SSL is completely useless for causal tasks, as there can be practical algorithmic benefits. Under certain conditions, unlabelled data can be shown to have a regularizing effect, potentially boosting the accuracy of an imperfect model by lowering its variance [cozman2006risks], and may reduce the amount of labelled data required to achieve a given performance level [singh2008unlabeled, balcan2006pac]. Further work is needed to empirically validate these gains in causal and anticausal scenarios.
A recent comprehensive empirical study [oliver2018evaluation] reported that properly tuned purely supervised models and models pre-trained on related labelled datasets (i.e. transfer learning) are often competitive with or outperform their semi-supervised counterparts. It also demonstrated that SSL can hurt classification performance under target shift (discussed later as prevalence shift) between labelled and unlabelled sets. This suggests that practitioners willing to apply SSL should be cautious of potential target distribution mismatch between labelled and unlabelled sets, e.g. unequal proportions of cases and controls or presence of different pathologies.
0.7 Tackling data scarcity via data augmentation.
Data augmentation refers to the practice of systematically applying random, controlled perturbations to the data in order to produce additional plausible data points. This now ubiquitous technique aims to improve the robustness of trained models to realistic variations one expects to find in the test environment, and has met tremendous practical success across a wide variety of tasks. Notably, we can distinguish between augmentations encouraging invariance and equivariance.
Many tasks require predictions to be insensitive to certain types of variation. Examples include image intensity augmentations, such as histogram manipulations or addition of noise, and spatial augmentations (e.g. affine or elastic transformations) for image-level tasks (e.g. regression or classification, as in the skin lesion example). As these augmentations apply uniformly to all inputs $x$ without changing the targets $y$, their benefits stem from a refined understanding of the conditional $P(X \mid Y)$, while contributing no new information about $P(Y)$.
For other tasks, such as segmentation or localization, predictions must change similarly to the inputs, e.g. a spatial transformation applied to an image $x$ (such as mirroring, affine or elastic deformations) should be likewise applied to the target $y$ (e.g. spatial coordinates or segmentation masks, as in the brain tumour example). Information is gained about the joint distribution $P(X, Y)$ via its shared spatial structure, related to e.g. anatomy and acquisition conditions.
In contrast with SSL, data augmentation produces additional $(x, y)$ pairs, thereby providing more information about the joint distribution, $P(X, Y)$. Its compound effect on the joint $P(X, Y)$ rather than only on the marginal $P(X)$ corroborates its suitability for both causal and anticausal tasks, without the theoretical impediments of semi-supervised learning for causal prediction.
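The distinction between invariance and equivariance can be sketched in a few lines of NumPy. The function names `augment_invariant` and `augment_equivariant`, the noise level, the choice of flips and 90-degree rotations, and the toy 8×8 image with a thresholded mask are all assumptions made for illustration; realistic pipelines would typically use affine or elastic transformations.

```python
import numpy as np


def augment_invariant(image, label, rng, noise_std=0.05):
    """Image-level task: perturb the image, keep the label unchanged."""
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    if rng.random() < 0.5:
        noisy = np.flip(noisy, axis=1)  # horizontal mirroring
    return noisy, label


def augment_equivariant(image, mask, rng):
    """Pixel-level task: apply the same spatial transform to image and mask."""
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    return np.rot90(image, k), np.rot90(mask, k)


rng = np.random.default_rng(0)
image = rng.random((8, 8))
mask = (image > 0.7).astype(np.uint8)  # toy "segmentation" tied to pixel values

aug_image, aug_label = augment_invariant(image, 1, rng)
aug_image2, aug_mask2 = augment_equivariant(image, mask, rng)
```

In both cases the output is a new input-target pair: the first leaves the target untouched, the second keeps image and mask spatially aligned by transforming them together.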
An emerging line of research aims to exploit unlabelled data for learning realistic transformations for data augmentation [zhao2019augmentation, chaitanya2019augmentation]. This direction has the potential to deliver the promises of semi-supervised learning while improving over the reliable framework of standard data augmentation.
0.8 Data mismatch due to dataset shift.
Dataset shift is any situation in which the training and test data distributions disagree due to exogenous factors, e.g. dissimilar cohorts or inconsistent acquisition processes. As before, let $X$ be the input images and $Y$ be the prediction targets. We use an indicator variable $D$ for whether we are considering the training ($D = 0$) or the test domain ($D = 1$):
$$P_{\mathrm{train}}(X, Y) := P(X, Y \mid D = 0) \quad\text{and}\quad P_{\mathrm{test}}(X, Y) := P(X, Y \mid D = 1).$$
For simplicity, in the following exposition we will refer only to disparities between training and test domains. This definition can however extend to differences between the development datasets (training and test data) and the target population (after deployment), when the latter is not well represented by the variability in the test data.
Moreover, when analysing dataset shift, it is helpful to conceptualize an additional variable $Z$, representing the unobserved physical reality of the subject’s anatomy. We then interpret the acquired images $X$ as imperfect and potentially domain-dependent measurements of $Z$, i.e. $Z \to X$.
Switching between domains may produce variations in the conditional relationships between $X$, $Y$, and $Z$ or in some of their marginal distributions. Based on the predictive causal direction and on which factors of the joint distribution change or are invariant across domains, dataset shift can be classified into a variety of familiar configurations. Here we formulate the concepts of ‘population shift’, ‘annotation shift’, ‘prevalence shift’, ‘manifestation shift’, and ‘acquisition shift’. These terms correspond roughly to particular dataset shift scenarios studied in general machine-learning literature, namely ‘covariate shift’, ‘concept shift’, ‘target shift’, ‘conditional shift’, and ‘domain shift’, respectively [quinonerocandela2009shift]. However, we believe it is beneficial to propose specific nomenclature that is more vividly suggestive of the phenomena encountered in medical imaging. By also explicitly accounting for the unobserved anatomy, the proposed characterization is more specific and enables distinguishing cases that would otherwise be conflated, such as population or manifestation shift versus acquisition shift. The basic structures are summarized in Fig. 3 in the form of selection diagrams (causal diagrams augmented with domain indicators) [bareinboim2016datafusion], and some examples are listed in Table 2. We hope this may empower researchers in our field to more clearly communicate dataset shift issues and to more easily assess the applicability of various solutions.
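| Type | Direction | Change | Examples of differences |
|---|---|---|---|
| Population shift | causal | $P_{\mathrm{train}}(Z) \neq P_{\mathrm{test}}(Z)$ | ages, sexes, diets, habits, ethnicities, genetics |
| Annotation shift | causal | $P_{\mathrm{train}}(Y \mid X) \neq P_{\mathrm{test}}(Y \mid X)$ | annotation policy, annotator experience |
| Prevalence shift | anticausal | $P_{\mathrm{train}}(Y) \neq P_{\mathrm{test}}(Y)$ | case-control balance, target selection |
| Manifestation shift | anticausal | $P_{\mathrm{train}}(Z \mid Y) \neq P_{\mathrm{test}}(Z \mid Y)$ | anatomical manifestation of the target disease or trait |
| Acquisition shift | either | $P_{\mathrm{train}}(X \mid Z) \neq P_{\mathrm{test}}(X \mid Z)$ | scanner, resolution, contrast, modality, protocol |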
For causal prediction, we name population shift the case wherein only intrinsic characteristics (e.g. demographics) of the populations under study differ, i.e. $P_{\mathrm{train}}(Z) \neq P_{\mathrm{test}}(Z)$. Fortunately, this case is directly transportable, i.e. a predictor estimated in one domain is equally valid in the other [pearl2014validity]. An underfitted model (‘too simple’) may however introduce spurious dependencies, for which importance reweighting with $p_{\mathrm{test}}(x) / p_{\mathrm{train}}(x)$ is a common mitigation strategy [storkey2009transfer, zhang2015multisource]. Clearly, learning in this scenario makes sense only if the variability in the training data covers the support of the test distribution [quinonerocandela2009shift]; in other words, there are no guarantees about extrapolation performance to modes of variation that are missing from the training environment.
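The importance reweighting mentioned above can be sketched as follows. For readability, a one-dimensional covariate (patient age) stands in for the image, and both densities are assumed to be known Gaussians; in practice the density ratio is unknown and must itself be estimated. All names and numbers here (`gaussian_pdf`, the cohort means, the per-sample losses) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as a stand-in for p(x)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def p_train(age):
    return gaussian_pdf(age, mean=70.0, std=10.0)  # older training cohort (assumed)


def p_test(age):
    return gaussian_pdf(age, mean=55.0, std=10.0)  # younger test cohort (assumed)


def importance_weights(xs):
    """Density ratio p_test(x) / p_train(x) for each training sample."""
    return [p_test(x) / p_train(x) for x in xs]


def weighted_mean_loss(losses, weights):
    """Self-normalized importance-weighted average of per-sample losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


ages = [50.0, 60.0, 70.0, 80.0]
losses = [0.9, 0.5, 0.3, 0.2]  # made-up per-sample training losses
weights = importance_weights(ages)
print(weights)
print(weighted_mean_loss(losses, weights))
```

Training samples that resemble the test population receive larger weights, so the reweighted loss approximates the loss that would be measured in the test domain. The weights are only well defined where `p_train` is non-zero, which mirrors the support condition stated above.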
Under prevalence shift (for anticausal tasks), the differences between datasets relate to class balance: $P_{\mathrm{train}}(Y) \neq P_{\mathrm{test}}(Y)$. This can arise for example from different predispositions in the training and test populations, or from variations in environmental factors. If the test class distribution $P_{\mathrm{test}}(Y)$ is known a priori (e.g. from an epidemiological study), generative models may reuse the estimated appearance model ($P_{\mathrm{train}}(X \mid Y)$) in Bayes’ rule, and, for discriminative models, instances can be weighted by $P_{\mathrm{test}}(Y = y) / P_{\mathrm{train}}(Y = y)$ to correct the bias in estimating the training loss. Alternatively, more elaborate solutions based on the marginal $P_{\mathrm{test}}(X)$ are possible [storkey2009transfer, zhang2013target].
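A minimal sketch of both corrections follows, assuming the test prevalence is known. `class_weights` computes the per-class loss weights for discriminative training. `adjust_posterior` is a closely related shortcut rather than the generative approach itself: it applies Bayes' rule with the shared appearance model to re-calibrate the output of an already trained classifier. Function names and all numbers are hypothetical.

```python
def class_weights(prior_train, prior_test):
    """Per-class loss weights P_test(y) / P_train(y) for discriminative training."""
    return [q_te / q_tr for q_tr, q_te in zip(prior_train, prior_test)]


def adjust_posterior(posterior_train, prior_train, prior_test):
    """Re-weight a classifier's posterior P_train(y|x) to a new class prior.

    Under prevalence shift P(x|y) is shared across domains, hence
    P_test(y|x) is proportional to P_train(y|x) * P_test(y) / P_train(y).
    """
    ratios = class_weights(prior_train, prior_test)
    unnormalized = [p * r for p, r in zip(posterior_train, ratios)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical numbers: balanced case-control training set, 5% test prevalence.
prior_train = [0.5, 0.5]    # [healthy, disease]
prior_test = [0.95, 0.05]
posterior = [0.2, 0.8]      # output of a classifier trained on the balanced data

adjusted = adjust_posterior(posterior, prior_train, prior_test)
print(class_weights(prior_train, prior_test))
print(adjusted)
```

With these numbers, a prediction that looks confidently positive on the balanced development data becomes much less so once the low test prevalence is taken into account.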
Cases of annotation shift involve changes in class definitions, i.e. the same datum would tend to be labelled differently in each domain ($P_{\mathrm{train}}(Y \mid X) \neq P_{\mathrm{test}}(Y \mid X)$). For example, it is not implausible that some health centres involved in an international project could be operating slightly distinct annotation policies or grading scales, or employing annotators with varying levels of expertise (e.g. senior radiologists vs. trainees). Without explicit assumptions on the mechanism behind such changes, models trained to estimate $P_{\mathrm{train}}(Y \mid X)$ evidently cannot be expected to perform sensibly in the test environment, and no clear solution can be devised [morenotorres2012unifying]. A tedious and time-consuming calibration of labels or (partial) re-annotation may be required to correct for annotation shift.
Another challenging scenario is that of manifestation shift, under which the way anticausal prediction targets (e.g. disease status) physically manifest in the anatomy changes between domains. In other words, $P_{\mathrm{train}}(Z \mid Y) \neq P_{\mathrm{test}}(Z \mid Y)$. As with annotation shift, this cannot be corrected without strong parametric assumptions on the nature of these differences.
We lastly discuss acquisition shift, resulting from the use of different scanners or imaging protocols, which is one of the most notorious and well-studied sources of dataset shift in medical imaging [glocker2019multisite]. Typical pipelines for alleviating this issue involve spatial alignment (normally via rigid registration and resampling to a common resolution) and intensity normalization. In addition, the increasingly active research area of domain adaptation investigates data harmonization by means of more complex transformations, such as extracting domain-invariant representations [ganin2016domain, kamnitsas2017unsupervised] or translating between imaging modalities [frangi2018simulation] (e.g. synthesizing MRI volumes from CT scans [huo2019synsegnet]).
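As a small illustration of the intensity normalization step, the sketch below applies z-score normalization, which is one common choice among several and is not singled out by the text. The function name `zscore_normalize`, the optional foreground `mask`, and the two simulated "scanners" differing only by gain and offset are assumptions made for this example.

```python
import numpy as np


def zscore_normalize(volume, mask=None, eps=1e-8):
    """Shift and scale intensities to zero mean and unit variance inside `mask`."""
    volume = np.asarray(volume, dtype=np.float64)
    region = volume[mask] if mask is not None else volume
    return (volume - region.mean()) / (region.std() + eps)


rng = np.random.default_rng(0)
# Two hypothetical scanners imaging the same content with different gain/offset.
content = rng.random((4, 16, 16))
scan_a = 100.0 * content + 20.0
scan_b = 350.0 * content - 5.0

norm_a = zscore_normalize(scan_a)
norm_b = zscore_normalize(scan_b)
print(np.abs(norm_a - norm_b).max())
```

This removes only affine intensity differences between acquisitions; differences in resolution, contrast mechanisms or artefacts call for the more complex harmonization methods cited above.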
0.9 Data mismatch due to sample selection bias.
A fundamentally different process that also results in systematic data mismatch is sample selection. It is defined as the scenario wherein the training and test cohorts come from the same population, though each training sample is measured ($S = 1$) or rejected ($S = 0$) according to some selection process $S$ that may be subject-dependent:
$$P_{\mathrm{train}}(X, Y) := P(X, Y \mid S = 1) \quad\text{and}\quad P_{\mathrm{test}}(X, Y) := P(X, Y).$$
The main difference to standard dataset shift is the data-dependent selection mechanism (Table 3), as opposed to external causes of distributional changes (Fig. 3). In other words, the indicator variables in sample selection concern alterations in the data-gathering process rather than in the data-generating process [zhang2015multisource].
| Type | Causation | Examples of selection processes | Resulting bias |
|---|---|---|---|
| Random | none | uniform subsampling, randomized trial | none |
| Image | $X \to S$ | visual phenotype selection (e.g. anatomical traits, lesions) | population shift |
| | | image quality control (QC; e.g. noise, low contrast, artefacts) | acquisition shift |
| Target | $Y \to S$ | hospital admission, filtering by disease, annotation QC, learning strategies (e.g. class balancing, patch selection) | prevalence shift |
| Joint | $X \to S \leftarrow Y$ | combination of the above (e.g. curated benchmark dataset) | spurious association |
Completely random selection simply corresponds to uniform subsampling, i.e. when the training data can be assumed to faithfully represent the target population ($P_{\mathrm{train}}(X, Y) \equiv P_{\mathrm{test}}(X, Y)$). Since the analysis will incur no bias, the selection variable $S$ can safely be ignored. We conjecture this will rarely be the case in practice, as preferential data collection is generally unavoidable without explicit safeguards and careful experimental design.
Selection can be affected by the appearance of each image in two different manners. We can select subjects based on anatomical features, viewing the image $X$ as a proxy for the anatomy $Z$, which has similar implications to population shift. Alternatively, selection criteria may relate to image quality (e.g. excluding scans with noise, poor contrast, or artefacts), which is akin to acquisition shift [morenotorres2012unifying]. If selection is purely image-based ($X \to S$), we may exploit the conditional independence $S \perp\!\!\!\perp Y \mid X$, which implies that the predictive relation is directly recoverable [bareinboim2014selection], i.e. $P_{\mathrm{train}}(Y \mid X) \equiv P(Y \mid X, S = 1) = P(Y \mid X)$. In a learning scenario, however, the objective function would still be biased, and methods for mitigating the corresponding cases of dataset shift can be employed.
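The recoverability claim for purely image-based selection can be checked empirically on a toy model. In the simulation below, a discrete feature stands in for the image $X$, the target $Y$ depends on $X$, and the selection indicator $S$ depends on $X$ only; all probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy causal model X -> Y with image-based selection X -> S (values are made up).
x = rng.integers(0, 3, size=n)               # discrete image feature
p_y_given_x = np.array([0.1, 0.4, 0.8])      # P(Y=1 | X=x)
y = rng.random(n) < p_y_given_x[x]
p_s_given_x = np.array([0.9, 0.5, 0.1])      # P(S=1 | X=x): depends on X only
s = rng.random(n) < p_s_given_x[x]

for value in range(3):
    full = y[x == value].mean()              # estimate of P(Y=1 | X=value)
    selected = y[(x == value) & s].mean()    # estimate of P(Y=1 | X=value, S=1)
    print(value, round(float(full), 3), round(float(selected), 3))

# The marginals, in contrast, are distorted by the selection.
print(np.bincount(x) / n, np.bincount(x[s]) / s.sum())
print(y.mean(), y[s].mean())
```

The conditional estimates should agree up to sampling noise with and without selection, whereas the marginal distributions of the feature and of the target change noticeably. The latter is why a training objective averaged over the selected samples remains biased.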
When selection is solely target-dependent ($Y \to S$), we have $P_{\mathrm{train}}(X \mid Y) = P_{\mathrm{test}}(X \mid Y)$, and it can be treated as prevalence shift. This will typically result from factors like hospital admission, recruitment or selection criteria in clinical trials, or annotation quality control. Notably, machine-learning practitioners should be wary that it can also arise as a side-effect of certain training strategies, such as class re-balancing or image patch selection for segmentation (e.g. picking only patches containing lesion pixels).
Sample selection can additionally introduce spurious associations when the selection variable $S$ is a common effect of $X$ and $Y$ (or of causes of $X$ and $Y$): implicitly conditioning on $S$ unblocks an undesired causal path between $X$ and $Y$ (see Methods). This is the classic situation called selection bias [hernan2004structural] (cf. Berkson’s paradox [pearl2009causality]), and recovery is more difficult without assumptions on the exact selection mechanism. In general, it requires controlling for additional variables to eliminate the indirect influence of $X$ on $Y$ via conditioning on the collider $S$ [bareinboim2014selection, bareinboim2016datafusion].
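The effect of conditioning on a collider can be reproduced with a few lines of simulation. In this sketch, two scalar variables standing in for $X$ and $Y$ are generated independently, and a sample is selected only when their sum exceeds a threshold; the Gaussian variables and the threshold value are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X and Y are generated independently: there is no causal link between them.
x = rng.normal(size=n)
y = rng.normal(size=n)

# Selection is a common effect of both, X -> S <- Y (threshold is hypothetical).
s = (x + y) > 1.0

corr_all = np.corrcoef(x, y)[0, 1]
corr_selected = np.corrcoef(x[s], y[s])[0, 1]
print(f"correlation in full population: {corr_all:+.3f}")
print(f"correlation among selected:     {corr_selected:+.3f}")
```

Although the two variables are independent in the full population, a clear negative correlation appears among the selected samples. A model trained on such data could exploit this association even though it does not exist in the target population.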
This paper provides a fresh perspective on key challenges in machine learning for medical imaging using the powerful framework of causal reasoning. Not only do our causal considerations shed new light on the vital issues of data scarcity and data mismatch in a unifying approach, but the presented analysis can hopefully serve as a guide to develop new solutions. Perhaps surprisingly, causal theory also suggests that the common task of semantic segmentation may not fundamentally benefit from unannotated images via semi-supervision. This possibly controversial conclusion may prompt empirical research into validating the feasibility and practical limitations of this approach.