In the last few years, deep learning techniques have permeated the field of medical image processing [1, 2]. Beyond the automation of existing radiological tasks (e.g. segmentation, detection, disease grading and classification), deep learning has been applied to a diverse set of “data enhancement” problems. Data enhancement aims to improve the quality, the information content, or the quantity of medical images available for research and clinics by transforming images from one domain to another. Previous research has shown the efficacy of data enhancement in different forms such as super-resolution [7, 8, 9], image synthesis [10, 11], denoising [12, 13], data harmonisation across scanners and protocols [14, 15], reconstruction [16, 17, 18, 19, 20, 21, 22], registration [23, 24] and quality control [25, 26]. These advances have the potential not only to enhance the quality and efficiency of radiological care, but also to facilitate scientific discoveries in medical research through increased volume and content of usable data.
However, most efforts in the development of data enhancement techniques have focused on improving the accuracy of deep learning algorithms, with little consideration of risk management. Blindly trusting the output of a given machine learning tool risks undetected failures, e.g. spurious features and removal of structures. In medical applications, images inform scientific conclusions in research, and diagnostic, prognostic and interventional decisions in clinics. Therefore, translation of current proofs of principle to such safety-critical applications demands mechanisms for quantifying the risks of failures, i.e. quantification of uncertainty/confidence and explanation of its source.
Predictive failures of deep learning systems, by and large, occur for two reasons: i) the task itself is inherently ambiguous or ii) the learned model is not adequate to describe the data [29, 32, 33, 34], as illustrated in Fig. 1. The former stems from intrinsic uncertainty (also known as aleatoric or statistical uncertainty), which describes ambiguity in the underlying data generating process (e.g. presence of stochasticity such as measurement noise and the intrinsically ill-posed nature of the problem), and cannot be alleviated by increasing available training data or model complexity. The latter is characterised by model uncertainty, which describes ambiguity in model specification. (Model uncertainty is a subclass of epistemic uncertainty, which encompasses types of uncertainty that arise from lack of knowledge.) Model uncertainty arises from a) parameter uncertainty: ambiguity in fitting the model to the target mapping due to limited training data, or b) model bias: errors due to insufficient flexibility of the model class (e.g. fitting a linear model to a sinusoidal process). These types of uncertainty can be reduced by collecting more data or specifying a different class of models. With the expressivity of deep neural networks, which are known to be universal approximators if sufficiently large, one might reasonably assume that the model bias is small enough to be discounted. Under this assumption, intrinsic and parameter uncertainty (Fig. 1) fully characterise the predictive failures of deep learning models. Therefore, accurate estimation of these uncertainties is needed and would potentially allow practitioners to understand better the limits of the models, flag doubtful predictions, and highlight test cases that are not well represented in the training data.
In this work, we introduce methods for capturing components of uncertainty in medical image enhancement systems based on deep learning. We propose to model intrinsic uncertainty through an input-dependent (heteroscedastic) noise model and parameter uncertainty through variational dropout. We then combine and propagate these two “source” uncertainties into a spatial map of predictive uncertainty over the output image, which can be used to assess the output reliability on a subject-specific and voxel-wise basis. Lastly, we propose a method to propagate the predictive uncertainty to arbitrary derived quantities of the output images, such as scalar indices that are commonly used for subsequent analysis, and decompose it into distinct components which separately quantify the contributions of intrinsic and parameter uncertainty. This paper demonstrates the benefits of these ideas for enhancing system safety within the context of Image Quality Transfer (IQT) [38, 39, 40, 41], a data-enhancement framework for propagating information from rare or expensive high quality images to lower quality but more readily available images. We focus on the application of IQT to super-resolution of diffusion magnetic resonance imaging (dMRI) scans, and evaluate the utility of uncertainty quantification in terms of three aspects: i) performance on unseen datasets; ii) safety assessment of system output; iii) explainability of failures. For two different types of diffusion signal representations, we evaluate the effects of uncertainty modelling on generalisation by measuring the predictive accuracy on unseen test subjects in the Human Connectome Project (HCP) dataset and the Lifespan dataset. We additionally test the value of improved predictive performance in a downstream tractography application.
We then test the capability of the predictive uncertainty map to indicate predictive errors and thus to detect potential failures on images of both healthy subjects and those in which pathologies unseen in the training data arise, specifically from glioma and multiple-sclerosis (MS) patients. Lastly, we perform the decomposition of predictive uncertainty on HCP subjects with benign abnormalities, and assess its potential value in gaining high-level interpretations of predictive performance.
2 Related Works
This section provides a review of related works under several different themes. We first review the development of learning-based image enhancement methods in medical imaging applications. We then discuss the recent advances made to model and quantify uncertainty in such image enhancement problems. Lastly, we describe the existing strands of research in uncertainty modelling for other medical imaging problems and fields of applications.
Various forms of image enhancement can be cast as image transformation problems where the input image from one domain is mapped to an output image from another domain. Numerous recent methods have proposed to perform image transformation tasks as supervised regression of low quality against high quality image content. Alexander et al. proposed Image Quality Transfer (IQT), a general framework for supervised quality enhancement of medical images. They demonstrated the efficacy of their method through a random forest (RF) implementation of super-resolution (SR) of brain diffusion tensor images and estimation of advanced microstructure parameter maps from sparse measurements. More recently, deep learning, typically in the form of convolutional neural networks (CNNs), has shown additional promise in this kind of task. For example, Oktay et al. proposed a CNN model to upsample a stack of 2D MRI cardiac volumes in the through-plane direction, where the SR mapping is learnt from 3D cardiac volumes of nearly isotropic voxels. This work was later extended with the addition of a global anatomical prior based on an auto-encoder. Zhao et al. proposed a solution to the same SR problem for brains that utilises the high frequency information in in-plane slices to super-resolve in the through-plane direction without requiring external training data. In addition, a range of different CNN architectures have been considered for SR of other modalities and anatomical structures such as structural MRI of brains, retinal fundus images and computed tomography (CT) scans of the chest. Another problem of growing interest is image synthesis, which aims to synthesise an image of a different modality given the input image. Nie et al. employed a conditional generative adversarial network to synthesise CT from MRI with fine texture details, whilst Wolterink et al. extended this idea using a CycleGAN to leverage the abundance of unpaired training sets of CT and MR scans. A variant of CNN has also been applied to predict 7T images from 3T MRI, where both contrast and resolution are enhanced. Another notable application is the harmonisation of diffusion MRIs [14, 15, 41, 52], where images acquired with different scanners or magnetic field strengths are mapped to a common reference image space to allow for joint analysis.
Despite this advancement, all of these methods commit to a single prediction and lack a mechanism to communicate uncertainty in the output image. In medical applications where images can ultimately inform life-and-death decisions, quantifying the reliability of the output is crucial. Tanno et al. aimed to address this problem for supervised image enhancement for the first time by proposing a Bayesian variant of random forests to quantify uncertainty over predicted high-resolution MRI. They showed that the uncertainty measure correlates well with the accuracy and can highlight abnormality not represented in the training data. In our preliminary work, we made an initial attempt to extend this approach with a probabilistic deep-learning formulation, and showed that modelling different components of uncertainty (intrinsic and parameter uncertainty) allows one to build a more generalisable model and quantify predictive confidence. Kendall et al. concurrently investigated the same problem in computer vision, suggesting its utility for safety-critical applications such as self-driving cars. More recently, Hu et al. extended these works in the context of medical image segmentation and proposed a mechanism to learn the intrinsic uncertainty in a supervised manner when multiple labels are available. Dalca et al. proposed a CNN-based probabilistic model for diffeomorphic image registration with a learning algorithm based on variational inference, and demonstrated state-of-the-art registration accuracy on established benchmarks while providing estimates of registration uncertainty. An alternative approach is ensembling, where the variance of the predictions of multiple networks is used to quantify the predictive uncertainty. Schlemper et al. proposed a novel combination of the cascaded CNN architecture and compressive sensing, equipped with a variant of ensemble techniques, which enabled robust reconstruction of highly undersampled cardiovascular diffusion MR images, and quantification of reconstruction uncertainty. Bragman et al. studied the value of uncertainty modelling for multi-task learning in the context of MR-only radiotherapy treatment planning, where the synthetic CT image and the segmentation of organs at risk are simultaneously predicted from the input MRI image.
We should also note that, although not the focus of this work, research on uncertainty modelling in deep learning techniques extends to other medical image processing tasks beyond data enhancement, such as segmentation, detection and classification. For example, Nair et al. demonstrated for lesion segmentation of multiple sclerosis that voxel-wise uncertainty metrics can be used for quality control; by filtering out predictions with high uncertainty, the model could achieve higher lesion detection accuracy. A concurrent work by Eaton-Rosen et al. showed for the task of brain tumour segmentation that the Monte Carlo (MC) sample variance from dropout can be calibrated to provide meaningful error bars over estimates of tumour volumes. Similarly, ways to turn voxel-wise uncertainty scores into structure-wise uncertainty metrics have been introduced for the brain parcellation task, and shown to be valuable in performing more reliable group analysis. The uncertainty metric based on MC dropout has also shown promise in disease grading of retinal fundus images [62, 63], and more recently an extension based on test-time augmentation has been introduced. An alternative approach is to train a model to predict the uncertainty score directly, which has been shown to be more effective when opinions from multiple experts are available for each image. Koh et al. and Baumgartner et al. proposed methods to generate a set of diverse and plausible segmentation proposals on a given image, capturing more realistically the high inter-reader annotation variability that is commonly observed in medical image segmentation tasks. Lastly, [68, 69] demonstrated for the classification of mammograms and cardiac ultrasound images, respectively, that modelling the uncertainty and biases of individual annotators enables robust learning from noisy labels in the presence of large disagreement.
However, within the context of medical image enhancement, these lines of research performed only limited validation of the quality and utility of uncertainty modelling. In this work, we formalise and extend the preliminary ideas in Tanno et al. and provide a comprehensive set of experiments to evaluate the proposed uncertainty modelling techniques on a diverse set of datasets, which vary in demographics, scanner types, acquisition protocols or pathology. Moreover, with rare exception, none of the previous methods model different components of uncertainty, namely intrinsic and parameter uncertainty. Our method accounts for both, and provides evidence that this improves performance thanks to different regularisation effects. In addition, we propose a method to decompose predictive uncertainty over an arbitrary function of the output image (e.g. morphological measurements) into its sources, in order to provide a high-level explanation of model performance on the given input.
3 Methods
This section describes the methods for modelling different components of uncertainty that arise in data enhancement. Firstly, we provide an overview of Image Quality Transfer (IQT), which formulates data enhancement as a supervised learning problem. Secondly, using the IQT framework, we introduce methods to model intrinsic and parameter uncertainty separately, focusing on the application of super-resolution. We then combine the two approaches and estimate the overall uncertainty over prediction (predictive uncertainty) by approximating the variance of the predictive distribution (eq. (9)). Lastly, we propose a method for decomposing predictive uncertainty into its sources (intrinsic and parameter uncertainty) in an attempt to provide quantifiable explanations for the confidence on model output (eq. (13)).
3.1 Background: Image Quality Transfer
Alexander et al. proposed Image Quality Transfer (IQT), the first supervised learning based framework for data enhancement of medical images, and here we survey its general formulation, which forms the testing ground of this work. IQT performs data enhancement via regression of low quality against high quality image content. In order to overcome the memory demands of processing 3-dimensional medical images, along with other subsequent work such as [70, 7, 51, 44], IQT assumes factorisability over local neighbourhoods (also called patches) and models the conditional distribution of the high-quality image $\mathbf{Y}$ given the corresponding low-quality input $\mathbf{X}$ as:
$$p(\mathbf{Y}|\mathbf{X}) = \prod_{i \in S} p(\mathbf{y}_i|\mathbf{x}_i) \quad (1)$$
where $\{\mathbf{y}_i\}_{i \in S}$ is a set of disjoint high-quality subvolumes with $S$ denoting the set of their indices, which together constitute the whole image $\mathbf{Y}$, while $\{\mathbf{x}_i\}_{i \in S}$ is a set of potentially overlapping low-quality subvolumes, each of which contains and is spatially larger than the corresponding $\mathbf{y}_i$, as illustrated in Fig. 2. Here we assume that each local neighbourhood is a cubic sub-volume. The locality assumption reduces the problem of learning $p(\mathbf{Y}|\mathbf{X})$ to the much less memory intensive problem of learning $p(\mathbf{y}|\mathbf{x})$. In other words, IQT formulates the data enhancement task as a patch-wise regression where an input low-quality image $\mathbf{X}$ is split into smaller overlapping sub-volumes $\{\mathbf{x}_i\}_{i \in S}$ and the corresponding non-overlapping high-quality sub-volumes $\{\mathbf{y}_i\}_{i \in S}$ are independently predicted according to the patch regressor $p(\mathbf{y}|\mathbf{x})$. The final prediction for the 3D high-quality volume $\mathbf{Y}$ is constructed by tessellating the output patches $\{\mathbf{y}_i\}_{i \in S}$.
The original implementation of IQT [38, 40, 39] employed a variant of random forests (RFs) to model $p(\mathbf{y}|\mathbf{x})$, while more recent approaches [70, 7, 51, 44] use variants of convolutional neural networks (CNNs). Either way, the machine learning algorithm is trained on pairs of high-quality and low-quality patches extracted from a set of image volumes, and is used to perform the data-enhancement task of interest. Typically, such patch pairs are synthesised by down-sampling a collection of high quality images to approximate their counterparts in a particular low-quality scenario [38, 7]. In this work, we focus on the task of super-resolution (SR), where the spatial resolution of $\mathbf{Y}$ is higher than that of the input image $\mathbf{X}$.
3.2 Baseline Super-Resolution Model: 3D-ESPCN
As the baseline architecture for modelling $p(\mathbf{y}|\mathbf{x})$, we adapt the efficient subpixel-shifted convolutional network (ESPCN) to 3D data. ESPCN is a recently proposed method with the capacity to perform real-time per-frame SR of videos while retaining high accuracy on 2D natural images. We base our model on this architecture for its simplicity and computational performance. Most CNN-based SR techniques first up-sample a low-resolution input image (e.g. through bilinear interpolation, deconvolution [7, 73, 74], etc.) and then refine the high-resolution estimate through a series of convolutions. These methods suffer from the fact that the up-sampling can be a lossy process and that refinement in the high-resolution space has a higher computational cost than in the low-resolution space. By contrast, ESPCN performs convolutions in the low-resolution space and upsamples afterwards. The reduced resolution of feature maps dramatically decreases the computational and memory costs, a saving that is more pronounced when processing 3D data.
More specifically, the ESPCN is a fully convolutional network with a special shuffling operation on the output, which identifies individual feature channel dimensions with spatial locations in the high-resolution output. Fig. 3 shows a 2D illustration of an example ESPCN when the fully convolutional part of the network consists of 3 convolutional layers, each followed by a ReLU, and the final layer has $cr^2$ feature maps, where $r$ is the upsampling rate and $c$ is the number of channels in the output image (e.g. $c = 6$ in the case of DT images). The shuffling operation takes the feature maps of shape $h \times w \times cr^2$ and remaps pixels from different channels into different spatial locations in the high-resolution output, producing an $rh \times rw \times c$ image, where $h$ and $w$ denote the height and width of the pre-shuffling feature maps. This shuffling operation in 3D is given by $\mathcal{S}(F)_{i,j,k,c} = F_{\lfloor i/r \rfloor, \lfloor j/r \rfloor, \lfloor k/r \rfloor,\; r^3 c + \mathrm{mod}(i,r) + r\,\mathrm{mod}(j,r) + r^2\,\mathrm{mod}(k,r)}$ (with zero-based indices), where $F$ is the pre-shuffled feature maps. The combined effect of the last convolution and shuffling is effectively a learned interpolation, and an efficient implementation of a deconvolution layer where the kernel size is divisible by the size of the stride. Therefore, it is less susceptible to the checker-board like artifacts commonly observed with deconvolution operations.
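The shuffling step can be illustrated with a short NumPy sketch. This is only a minimal illustration under stated assumptions: the function name `shuffle_3d`, the channels-last layout and the particular channel ordering (channel index $r^3 c$ plus offsets with strides $1$, $r$, $r^2$ along the three spatial axes, zero-based) are choices made here for concreteness and are not confirmed details of the original implementation.

```python
import numpy as np

def shuffle_3d(feat, r):
    """Rearrange (D, H, W, C*r^3) feature maps into a (D*r, H*r, W*r, C) volume.

    Assumed convention (zero-based indices):
    out[i, j, k, c] = feat[i // r, j // r, k // r,
                           r**3 * c + i % r + r * (j % r) + r**2 * (k % r)]
    """
    d, h, w, ch = feat.shape
    assert ch % r**3 == 0, "channel count must be a multiple of r^3"
    c = ch // r**3
    # Split the channel axis into (c, k-offset, j-offset, i-offset);
    # the last axis varies fastest, so it carries the stride-1 (i) offset.
    x = feat.reshape(d, h, w, c, r, r, r)
    # Interleave each offset with its spatial axis: (d, ri, h, rj, w, rk, c).
    x = x.transpose(0, 6, 1, 5, 2, 4, 3)
    return x.reshape(d * r, h * r, w * r, c)
```

The reshape and transpose only reorder memory; no arithmetic is applied to the features, which is why the operation adds negligible cost on top of the low-resolution convolutions.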
At test time, the prediction of the higher resolution volume is performed through a shift-and-stitch operation. The network takes each subvolume $\mathbf{x}$ in a low-resolution image, and predicts the corresponding high-resolution sub-volume $\mathbf{y}$. By tessellating the predictions from appropriately shifted inputs $\mathbf{x}$, the whole high-resolution volume is reconstructed. With convolutions being local operations, each output voxel is only inferred from a local region in the input volume, and the spatial extent of this local connectivity is referred to as the receptive field. For a given input subvolume, the network increases the resolution of the central voxel of each receptive field, e.g. the central output voxels are estimated from the corresponding receptive field in the input volume, as coloured yellow in Fig. 3.
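The patch-wise prediction and tessellation described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `nearest_neighbour_regressor` is a hypothetical stand-in for the trained network, and the parameter names `core` (side of the region whose resolution is increased), `margin` (extra context on each side, playing the role of the receptive field) and `r` (upsampling rate) are assumptions introduced here.

```python
import numpy as np

def nearest_neighbour_regressor(patch, margin, r):
    """Hypothetical stand-in for the trained network: crop the margin, repeat voxels."""
    core = patch[margin:-margin, margin:-margin, margin:-margin] if margin > 0 else patch
    return core.repeat(r, axis=0).repeat(r, axis=1).repeat(r, axis=2)

def shift_and_stitch(volume, regressor, core=4, margin=2, r=2):
    """Super-resolve `volume` patch by patch and tessellate the disjoint outputs."""
    d, h, w = volume.shape
    assert d % core == 0 and h % core == 0 and w % core == 0
    # Pad so that border patches also have a full margin of context.
    padded = np.pad(volume, margin, mode="edge")
    out = np.zeros((d * r, h * r, w * r), dtype=volume.dtype)
    for z in range(0, d, core):
        for y in range(0, h, core):
            for x in range(0, w, core):
                # Overlapping input patch: core region plus margin on every side.
                patch = padded[z:z + core + 2 * margin,
                               y:y + core + 2 * margin,
                               x:x + core + 2 * margin]
                # Disjoint output patch at the matching high-resolution location.
                out[z * r:(z + core) * r,
                    y * r:(y + core) * r,
                    x * r:(x + core) * r] = regressor(patch, margin, r)
    return out
```

Neighbouring input patches overlap by `2 * margin` voxels while the output patches tile the high-resolution volume without overlap, mirroring the overlapping inputs and disjoint outputs of the IQT formulation.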
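Given training pairs of high-resolution and low-resolution patches $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, we optimise the network parameters by minimising the sum of per-pixel mean-squared-error (MSE) between the ground truth $\mathbf{y}$ and the predicted high-resolution patch $\mu_\theta(\mathbf{x})$ over the training set. Here $\theta$ denotes all network parameters. This is equivalent to minimising the negative log likelihood (NLL) under the Gaussian noise model $p(\mathbf{y}|\mathbf{x}, \theta) = \mathcal{N}(\mathbf{y}; \mu_\theta(\mathbf{x}), \sigma^2 I)$ with fixed isotropic variance $\sigma^2$.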
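The equivalence between the MSE objective and the Gaussian NLL can be made explicit with a short derivation. The notation is assumed here for illustration: $\mu_\theta(\mathbf{x})$ is the network output, $\sigma^2$ the fixed variance and $d$ the number of output components per patch.

```latex
\begin{align*}
-\log \mathcal{N}\big(\mathbf{y};\, \mu_\theta(\mathbf{x}),\, \sigma^2 I\big)
  &= \frac{1}{2\sigma^2}\,\big\|\mathbf{y} - \mu_\theta(\mathbf{x})\big\|_2^2
     + \frac{d}{2}\log\big(2\pi\sigma^2\big), \\
\arg\min_\theta \sum_{i=1}^{N} -\log p(\mathbf{y}_i \mid \mathbf{x}_i, \theta)
  &= \arg\min_\theta \sum_{i=1}^{N} \big\|\mathbf{y}_i - \mu_\theta(\mathbf{x}_i)\big\|_2^2 .
\end{align*}
```

Because $\sigma^2$ is fixed, the second term is constant in $\theta$ and the first is a positive multiple of the squared error, so both objectives share the same minimiser.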
3.3 Intrinsic Uncertainty and Heteroscedastic Noise Model
Intrinsic uncertainty quantifies the inherent ambiguity of the underlying problem that is irreducible with data, as illustrated in Fig. 1(i). Here we capture intrinsic uncertainty by estimating the variance of the target conditional distribution $p(\mathbf{y}|\mathbf{x})$. In medical images, intrinsic uncertainty is often spatially and channel-wise varying. For example, super-resolution could be fundamentally harder on some anatomical structures than others due to signal variability. It may also be the case that some channels of the image volume contain more complex, non-linear and noisy signals than other channels, e.g. higher order terms in diffusion signal representations. To capture such potential variation of intrinsic uncertainty, we model $p(\mathbf{y}|\mathbf{x})$ as a Gaussian distribution with input-dependent variance:
$$p(\mathbf{y}|\mathbf{x}, \theta_1, \theta_2) = \mathcal{N}\big(\mathbf{y};\, \mu_{\theta_1}(\mathbf{x}),\, \Sigma_{\theta_2}(\mathbf{x})\big) \quad (2)$$
where the mean $\mu_{\theta_1}(\mathbf{x})$ and the covariance $\Sigma_{\theta_2}(\mathbf{x})$ are functions of the input $\mathbf{x}$ and are modelled by two separate 3D-ESPCNs (as shown in Fig. 4), which we refer to as the “mean network” and the “covariance network”, parametrised by $\theta_1$ and $\theta_2$, respectively. We note that the input patch $\mathbf{x}$ varies spatially, which makes the estimated variance spatially varying and different for respective channels. Fig. 4 shows a 2D illustration of our 3D architecture. For each low-resolution input patch $\mathbf{x}$, we use the output $\mu_{\theta_1}(\mathbf{x})$ of the mean network at the top as the final estimate of the high-resolution ground truth $\mathbf{y}$, whilst the diagonal elements of the covariance $\Sigma_{\theta_2}(\mathbf{x})$ quantify the corresponding intrinsic uncertainty over individual components in $\mathbf{y}$ and over different channels. Lastly, we note that this is a specific instance of a broad class of models, called heteroscedastic noise models [77, 36], where the variance is a function of the value of the input. In contrast, the baseline 3D-ESPCN can be viewed as an example of homoscedastic noise models with $p(\mathbf{y}|\mathbf{x}, \theta) = \mathcal{N}(\mathbf{y}; \mu_\theta(\mathbf{x}), \sigma^2 I)$, i.e. with constant variance across all spatial locations and image channels, which is highly unrealistic in most medical images.
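We jointly optimise the parameters $\theta = \{\theta_1, \theta_2\}$ of the mean network and the covariance network by minimising the negative log-likelihood (NLL) over the training pairs $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$:
$$\mathcal{L}_\theta(\mathcal{D}) = \mathcal{M}_\theta(\mathcal{D}) + \mathcal{H}_\theta(\mathcal{D}) + c \quad (3)$$
where $c$ is a constant and the remaining terms are given by
$$\mathcal{M}_\theta(\mathcal{D}) = \frac{1}{2N}\sum_{i=1}^{N} \big(\mathbf{y}_i - \mu_{\theta_1}(\mathbf{x}_i)\big)^{T}\, \Sigma_{\theta_2}^{-1}(\mathbf{x}_i)\, \big(\mathbf{y}_i - \mu_{\theta_1}(\mathbf{x}_i)\big), \qquad \mathcal{H}_\theta(\mathcal{D}) = \frac{1}{2N}\sum_{i=1}^{N} \log\det \Sigma_{\theta_2}(\mathbf{x}_i).$$
Here $\mathcal{M}_\theta(\mathcal{D})$ denotes (half) the mean squared Mahalanobis distance with respect to the predictive distribution $p(\mathbf{y}|\mathbf{x}, \theta_1, \theta_2)$. For simplicity, in this work we assume diagonality of the covariance matrix $\Sigma_{\theta_2}(\mathbf{x})$. This means that the Mahalanobis distance term equates to the sum of MSEs across all pixels and channels in the output, weighted by the inverse of the corresponding variance (estimated intrinsic uncertainty). (In the case of full covariance, $\mathcal{M}_\theta(\mathcal{D})$ becomes the MSE in the basis of principal components, weighted by the inverse of the corresponding eigenvalues.) This term naturally encourages assigning high uncertainty to regions with higher MSEs, robustifying the training to noisy labels and outliers. On the other hand, $\mathcal{H}_\theta(\mathcal{D})$ represents the mean differential entropy (up to an additive constant) and discourages the spread of $\Sigma_{\theta_2}(\mathbf{x})$ from growing too large. We note that the covariance network is used to modulate the training of the mean network and to quantify intrinsic uncertainty during inference, while only the mean network generates the final prediction, so a single 3D-ESPCN suffices to perform super-resolution.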
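A minimal NumPy sketch of the heteroscedastic loss under the diagonal-covariance assumption is given below. It is illustrative only: the function name `heteroscedastic_nll` is hypothetical, and parametrising the covariance network output as a log-variance is an assumption made here (a common way of keeping variances positive), not a confirmed detail of the original model.

```python
import numpy as np

def heteroscedastic_nll(y, mean, log_var):
    """Gaussian NLL with diagonal, input-dependent variance (additive constant dropped).

    y, mean, log_var: arrays of shape (N, d), where N is the number of patches
    and d the number of output components per patch.
    Returns (nll, mahalanobis, log_det) averaged over the N patches.
    """
    sq_err = (y - mean) ** 2
    # Squared errors weighted by the inverse of the predicted variance.
    mahalanobis = np.mean(np.sum(sq_err * np.exp(-log_var), axis=1))
    # Log-determinant of a diagonal covariance is the sum of log-variances.
    log_det = np.mean(np.sum(log_var, axis=1))
    return 0.5 * mahalanobis + 0.5 * log_det, mahalanobis, log_det
```

For a fixed residual, this loss is minimised when the predicted variance equals the squared residual: a variance that is too small inflates the weighted error term, whereas one that is too large is penalised by the log-determinant term.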
3.4 Parameter Uncertainty and Variational Dropout
Parameter uncertainty signifies the ambiguity in selecting the parameters of the model that best describe the training data, as illustrated in Fig. 1(ii). The limitation of the previously introduced 3D-ESPCN baseline (Sec. 3.2) and its heteroscedastic extension (Sec. 3.3) is their reliance on a single estimate of network parameters. In many medical imaging problems, the amount of training data is modest; in such cases, this point estimate approach increases the risk of overfitting.
We combat this problem with a Bayesian approach. Specifically, instead of resorting to a single network of fixed parameters, we consider the (posterior) distribution over all the possible settings of network parameters given training data, $p(\theta|\mathcal{D})$. This probability density encapsulates the parameter uncertainty, with its spread of mass describing the ambiguity in selecting the most appropriate models to explain the training data. However, in practice, the posterior is intractable due to the difficulty in computing the normalisation constant. We therefore propose to approximate $p(\theta|\mathcal{D})$ with a simpler distribution $q_\phi(\theta)$. Specifically, we adapt a technique called variational dropout to convolution operations from its original version introduced for feedforward NNs.
Binary dropout is a popular choice of method for approximating posterior distributions, with demonstrated utility in medical imaging applications [62, 70, 63, 61, 59, 58, 57]. However, hyper-parameters (dropout rates) typically need to be pre-set before training, requiring inefficient cross-validation and thus substantially constraining the flexibility of the approximate distribution family (often a fixed dropout rate per layer). This limitation motivates us to use variational dropout, which extends this approach with a way to learn the dropout rate from data for every single weight in the network and theoretically enables a more effective approximation of the posterior distribution. Another established class of methods is stochastic gradient Markov chain Monte Carlo (SG-MCMC) [80, 81, 82, 83]. However, in this work, we do not consider SG-MCMC methods because, although unbiased, they remain computationally inefficient due to the requirement of evaluating an ensemble of models for posterior computation, and are slow to converge for high-dimensional problems.
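Variational dropout employs a form of variational inference to approximate the posterior $p(\theta|\mathcal{D})$ by a member of a tractable family of distributions $q_\phi(\theta) = \prod_{ij} \mathcal{N}(\theta_{ij};\, \eta_{ij},\, \alpha_{ij}\eta_{ij}^2)$ parametrised by $\phi = \{\eta_{ij}, \alpha_{ij}\}$, such that the Kullback-Leibler (KL) divergence $\mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D}))$ is minimised. Here, $\theta_{ij}$ denotes an individual element in the convolution filters of CNNs as a random variable with parameters $\alpha_{ij}$ (dropout rate) and $\eta_{ij}$ (mean), and the posterior over the set of all weights is effectively approximated with a product of univariate Gaussian distributions. In practice, introducing a prior $p(\theta)$ and applying Bayes’ rule allow us to rewrite the minimisation of the KL divergence as maximisation of the quantity known as the evidence lower bound (ELBO). Here during training, we learn the variational parameters $\phi$ by minimising the negative ELBO (to be consistent with the NLL cost function in eq. (3)):
$$\mathcal{L}(\phi) = -\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathbf{y}|\mathbf{x}, \theta)\big] + \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big) \quad (6)$$
An accurate approximation of the KL term for a log-uniform prior has been proposed in prior work and is employed here. On the other hand, the first term (referred to as the reconstruction term) cannot be computed exactly, thus we employ the following MC approximation by drawing $S$ samples of network parameters from the approximate posterior:
$$\mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathbf{y}|\mathbf{x}, \theta)\big] \approx \frac{1}{S}\sum_{s=1}^{S} \log p\big(\mathbf{y}|\mathbf{x}, \theta^{(s)}\big), \qquad \theta^{(s)} \sim q_\phi(\theta) \quad (7)$$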
Adapting the local reparametrisation trick to a convolution operation, we derive the implementation of posterior sampling such that the variance of gradients over each mini-batch is low (the proof given for feedforward networks generalises to convolutions). In practice, this amounts to replacing each standard convolution kernel with a “Bayesian” convolution, which proceeds as follows. Firstly, we define two separate convolution kernels: $\eta \in \mathbb{R}^{c \times k \times k \times k}$ (“mean” kernels) and $\alpha \odot \eta^2$ (“variance” kernels), where $\odot$ denotes the element-wise multiplication, $c$ is the number of input channels and $k$ is the kernel width. Input feature maps $F_{\mathrm{in}}$ and their elementwise squared values $F_{\mathrm{in}}^2$ are convolved by the respective kernels to compute the “mean” and “variance” of the output feature maps, $\mu_{\mathrm{out}} = F_{\mathrm{in}} * \eta$ and $\sigma^2_{\mathrm{out}} = F_{\mathrm{in}}^2 * (\alpha \odot \eta^2)$. Lastly, the final output feature maps $F_{\mathrm{out}}$ are computed by drawing a sample from $\mathcal{N}(\mu_{\mathrm{out}}, \sigma^2_{\mathrm{out}})$, i.e. computing the following quantity:
$$F_{\mathrm{out}} = \mu_{\mathrm{out}} + \sigma_{\mathrm{out}} \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \quad (8)$$
Every forward pass (i.e. computation of each $\log p(\mathbf{y}|\mathbf{x}, \theta^{(s)})$) with variational dropout is thus performed via a sequence of Bayesian convolutions. Since the injected Gaussian noise $\epsilon$ is independent of the variational parameters $\phi$, the approximate reconstruction term in eq. 7 is differentiable with respect to them.
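The Bayesian convolution can be sketched in NumPy as below. This is a hedged, simplified illustration: it uses 2D rather than 3D convolutions for brevity, the helper names `conv2d_valid` and `bayesian_conv2d` are hypothetical, and storing the per-weight dropout parameter as `log_alpha` is an assumption made here for numerical convenience.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, kernel):
    """Valid cross-correlation. x: (C_in, H, W); kernel: (C_out, C_in, k, k)."""
    k = kernel.shape[-1]
    # windows has shape (C_in, H - k + 1, W - k + 1, k, k).
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, kernel)

def bayesian_conv2d(x, eta, log_alpha, rng):
    """One stochastic forward pass using the local reparametrisation trick.

    eta:       "mean" kernels, shape (C_out, C_in, k, k)
    log_alpha: per-weight log dropout parameter, same shape as eta
    """
    # Mean of the output features: convolve the input with the mean kernels.
    mean = conv2d_valid(x, eta)
    # Variance of the output features: convolve the squared input
    # with the "variance" kernels alpha * eta^2.
    var = conv2d_valid(x ** 2, np.exp(log_alpha) * eta ** 2)
    # Sample the output features directly instead of sampling every weight.
    eps = rng.standard_normal(mean.shape)
    return mean + np.sqrt(var) * eps
```

Sampling in feature space rather than weight space draws one noise value per output activation, which is what keeps the gradient variance low; the noise `eps` does not depend on `eta` or `log_alpha`, so gradients can flow through `mean` and `var`.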
3.5 Joint Modelling of Intrinsic and Parameter Uncertainty
We now describe how to combine the methods for modelling intrinsic and parameter uncertainty. Operationally, we take the dual architecture (Fig. 4) used to model intrinsic uncertainty, and apply variational dropout to every convolution layer in it. The intrinsic uncertainty is modelled in the heteroscedastic Gaussian model while the parameter uncertainty is captured in the approximate posterior obtained from variational dropout.
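At test time, for each low-resolution input subvolume $\mathbf{x}$, we would like to compute the predictive distribution $p(\mathbf{y}|\mathbf{x}, \mathcal{D})$ over the high-resolution output $\mathbf{y}$. We approximate this quantity by taking the “average” of all possible network predictions from all settings of the parameters $\theta = \{\theta_1, \theta_2\}$, weighted by the associated approximate posterior distribution $q_\phi(\theta)$. More formally, we need to compute the integral below:
$$q^*_\phi(\mathbf{y}|\mathbf{x}) \triangleq \int \mathcal{N}\big(\mathbf{y};\, \mu_{\theta_1}(\mathbf{x}),\, \Sigma_{\theta_2}(\mathbf{x})\big)\, q_\phi(\theta)\, d\theta \;\approx\; \int p(\mathbf{y}|\mathbf{x}, \theta)\, p(\theta|\mathcal{D})\, d\theta \;=\; p(\mathbf{y}|\mathbf{x}, \mathcal{D}) \quad (9)$$
where the last expression represents the true predictive distribution $p(\mathbf{y}|\mathbf{x}, \mathcal{D})$, which is estimated by our model $q^*_\phi(\mathbf{y}|\mathbf{x})$. However, in practice, the integral cannot be evaluated in closed form because the likelihood $p(\mathbf{y}|\mathbf{x}, \theta)$ is a highly non-linear function of the input $\mathbf{x}$, as given in eq. 2. At test time, we therefore estimate, for each input $\mathbf{x}$, the mean and covariance of the approximate predictive distribution $q^*_\phi(\mathbf{y}|\mathbf{x})$ with the Monte Carlo estimators:
$$\hat{\mu}_{\mathbf{y}|\mathbf{x}} = \frac{1}{T}\sum_{t=1}^{T} \mu_{\theta_1^t}(\mathbf{x}), \qquad \hat{\Sigma}_{\mathbf{y}|\mathbf{x}} = \frac{1}{T}\sum_{t=1}^{T} \Big(\Sigma_{\theta_2^t}(\mathbf{x}) + \mu_{\theta_1^t}(\mathbf{x})\,\mu_{\theta_1^t}(\mathbf{x})^{T}\Big) - \hat{\mu}_{\mathbf{y}|\mathbf{x}}\,\hat{\mu}_{\mathbf{y}|\mathbf{x}}^{T} \quad (12)$$
where $\{(\theta_1^t, \theta_2^t)\}_{t=1}^{T}$ are samples of the network parameters (i.e. convolution kernels) drawn from the approximate posterior $q_\phi(\theta)$. In other words, the inference performs $T$ stochastic forward passes at test time by injecting noise into features according to eq. 8, and amalgamates the corresponding network outputs to compute the sample mean $\hat{\mu}_{\mathbf{y}|\mathbf{x}}$ and sample covariance $\hat{\Sigma}_{\mathbf{y}|\mathbf{x}}$. We use the sample mean as the final prediction of a high-resolution output patch and use the diagonal elements of the sample covariance to quantify the corresponding uncertainty, which we refer to as the predictive mean and predictive uncertainty, respectively.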
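The Monte Carlo estimation of the predictive mean and predictive uncertainty can be sketched as follows, restricted to the diagonal of the covariance. The function name `predictive_moments` and the array layout (one row per stochastic forward pass) are assumptions introduced for illustration.

```python
import numpy as np

def predictive_moments(means, variances):
    """Combine T stochastic forward passes into a predictive mean and variance.

    means:     (T, ...) outputs of the mean network, one per posterior sample
    variances: (T, ...) diagonal outputs of the covariance network
    """
    pred_mean = means.mean(axis=0)
    # Second moment of the mixture of Gaussians minus the squared mean.
    pred_var = (variances + means ** 2).mean(axis=0) - pred_mean ** 2
    return pred_mean, pred_var
```

The returned variance equals the average predicted variance plus the spread of the predicted means across passes, so it is large either when the covariance network reports high noise or when the sampled networks disagree with one another.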
3.6 Uncertainty Decomposition and Propagation
Predictive uncertainty arises from the combination of two source effects, namely intrinsic and parameter uncertainty, for which we have previously introduced methods for estimation. Lastly, we introduce a method based on variance decomposition for disentangling these effects and quantifying their contributions separately in predictive uncertainty. We consider this decomposition problem in the presence of an arbitrary transformation of the output variable $\mathbf{y}$.
The users of super-resolution algorithms are often interested in the quantities that are derived from the predicted high-resolution images, rather than the images themselves. For example, quantities such as the principal direction (first eigenvector of the DT), mean diffusivity (MD) and fractional anisotropy (FA) are typically calculated from diffusion tensor images (DTIs) and used in the downstream analysis. We therefore consider a generic function $g$ (assumed here to be a measurable function with well-defined expectation and variance) which transforms the high-resolution multi-channel data $\mathbf{y}$ to a quantity of interest, e.g. MD and FA maps, and propose a way to propagate the predictive uncertainty over $\mathbf{y}$ to the transformed domain (i.e. compute the variance of $g(\mathbf{y})$) and decompose it into the “intrinsic” and “parameter” components. Specifically, by using the law of total variance, we perform the following decomposition:
$$\mathbb{V}_{q^*_\phi(\mathbf{y}|\mathbf{x})}\big[g(\mathbf{y})\big] = \Delta_m\big(g(\mathbf{y})\big) + \Delta_i\big(g(\mathbf{y})\big) \quad (13)$$
where the respective component terms are given by:
$$\Delta_m\big(g(\mathbf{y})\big) \triangleq \mathbb{V}_{q^*_\phi(\mathbf{y}|\mathbf{x})}\big[g(\mathbf{y})\big] - \mathbb{E}_{q_\phi(\theta)}\Big[\mathbb{V}_{p(\mathbf{y}|\mathbf{x},\theta)}\big[g(\mathbf{y})\big]\Big] = \mathbb{V}_{q_\phi(\theta)}\Big[\mathbb{E}_{p(\mathbf{y}|\mathbf{x},\theta)}\big[g(\mathbf{y})\big]\Big], \qquad \Delta_i\big(g(\mathbf{y})\big) \triangleq \mathbb{E}_{q_\phi(\theta)}\Big[\mathbb{V}_{p(\mathbf{y}|\mathbf{x},\theta)}\big[g(\mathbf{y})\big]\Big].$$
We refer to the components $\Delta_m$ and $\Delta_i$ as “propagated” parameter and intrinsic uncertainty. Intuitively, the first term quantifies the difference in variance between the cases where we have variable parameters and fixed parameters. In other words, this quantifies how much predictive uncertainty on the derived quantity arises, on average, from the variability in parameters. The second term, on the other hand, quantifies the average variance of the model prediction when the parameters are fixed, which signifies the model-independent uncertainty due to data, i.e. intrinsic uncertainty. Assuming that the considered neural network is identifiable and sufficiently complex to capture the underlying data generating process, as the amount of training data increases, the posterior tends to a Dirac delta function and thus the first term diminishes to zero while the second term remains. (We note that a neural network is, in general, not identifiable, i.e. there exists more than a single set of parameters that captures the same target distribution $p(\mathbf{y}|\mathbf{x})$. In such cases, the posterior distribution does not collapse to a single Dirac delta function with an infinite amount of observations; it rather converges to a mixture over all sets of network parameters $\theta^*$ such that $p(\mathbf{y}|\mathbf{x}, \theta^*) = p(\mathbf{y}|\mathbf{x})$. However, the expectation $\mathbb{E}_{p(\mathbf{y}|\mathbf{x},\theta^*)}[g(\mathbf{y})]$ is the same for all such $\theta^*$ and thus the propagated parameter uncertainty still converges to zero.) A similar variance decomposition technique has been employed to understand how the variation in cell signals of interest (e.g. gene expression) in a bio-chemical network is caused by the fluctuations of other environmental variables (e.g. transcription rate and biological noise). In our case, we employ the variance decomposition technique to separate the effects of network parameters from the intrinsic uncertainty in the prediction of $g(\mathbf{y})$.
We first consider a special case where the transform is an identity map. Assuming the likelihood is modelled by a Gaussian distribution with heteroscedastic noise, we can show that the parameter and intrinsic uncertainty are given by
which can be approximated by the components of the MC variance estimator in eq. (12):
where are drawn from the approximate posterior .
More generally, when the transform is complicated, MC sampling provides an alternative implementation. Given samples of model parameters and, for each of them, samples of the high-resolution output, we estimate both the propagated parameter and intrinsic uncertainty as follows:
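The nested sampling scheme described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the callables `sample_params`, `sample_output` and `transform`, and the default sample counts, are hypothetical names and values introduced here. The second function covers the special case of an identity transform with a Gaussian likelihood, where only the per-sample predicted means and variances are needed.

```python
import numpy as np


def propagate_uncertainty(sample_params, sample_output, transform, x,
                          n_params=50, n_outputs=20, rng=None):
    """Nested Monte Carlo estimate of propagated parameter / intrinsic variance.

    sample_params(rng)       -> one draw of model parameters (hypothetical)
    sample_output(x, p, rng) -> one draw of the output given input x and p
    transform(y)             -> derived quantity, e.g. an MD or FA map
    """
    rng = np.random.default_rng() if rng is None else rng
    cond_means, cond_vars = [], []
    for _ in range(n_params):
        params = sample_params(rng)
        g = np.stack([transform(sample_output(x, params, rng))
                      for _ in range(n_outputs)])
        cond_means.append(g.mean(axis=0))  # estimate of E[g(y) | x, params]
        cond_vars.append(g.var(axis=0))    # estimate of Var[g(y) | x, params]
    parameter_var = np.stack(cond_means).var(axis=0)  # variance of conditional means
    intrinsic_var = np.stack(cond_vars).mean(axis=0)  # mean of conditional variances
    return parameter_var, intrinsic_var


def gaussian_identity_case(means, variances):
    """Identity transform with Gaussian likelihood.

    means, variances: arrays of shape (n_params, ...) holding the predicted
    mean and variance for each sampled set of parameters.
    """
    return means.var(axis=0), variances.mean(axis=0)
```

With a finite number of inner samples, the variance of the conditional means also picks up a small share of the intrinsic variance (roughly the intrinsic variance divided by `n_outputs`), so the split is only approximate for small sample counts.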
4 Experiments and Results
In this section, we evaluate the proposed uncertainty modelling techniques for super-resolution of diffusion MR images. Firstly, we quantitatively compare the reconstruction performance of our probabilistic CNN models against the relevant baselines in two different types of diffusion signal representations. Secondly, we study the real-world utility of the technique in downstream tractography applications. Thirdly, we evaluate the value of predictive uncertainty as a reliability metric of output images on multiple datasets of both healthy subjects and those with unseen pathological structures such as brain tumour (Glioma) and multiple sclerosis (MS).
We make use of the following four diffusion MRI datasets to evaluate different benefits of the proposed technique:
Human Connectome Project dataset: we use the diffusion MRI data from the WU-Minn HCP (release Q3)  as the source of the training datasets. The dataset enjoys very high image resolution, signal levels and coverage of the measurement space, enabled by the combination of custom imaging, reconstruction innovations and a lengthy acquisition protocol . Each subject’s data set contains diffusion weighted images (DWIs) of voxel size of which have nominal and the three high-angular-resolution-diffusion-imaging (HARDI) shells of 90 directions have nominal b-values of , and (see  for the full acquisition details). The data are preprocessed by correcting distortions including susceptibility-induced, eddy currents and motion as outlined in .
Lifespan dataset: this dataset (available online at http://lifespan.humanconnectome.org) contains subjects of a much wider age range ( years) than the main HCP cohorts ( years), and is acquired with a shortened version of the main HCP protocol with lower resolution ( mm isotropic voxels) and only two HARDI shells, with and . However, we also note that the protocol still leverages the special features of the HCP scanners, providing images of substantially better quality than standard sequences. We utilise this out-of-training-distribution dataset to assess the robustness of our techniques to domain shifts.
Prisma dataset: two healthy male adults (29 and 33 years old respectively) were scanned twice at different image resolutions using the clinical 3T Siemens Prisma scanner in FMRIB, Oxford. Both datasets contain diffusion MRI data with 21 images and three 90-direction HARDI shells, b-values of 1000, 2000, and , each for two resolutions, 2.50 mm and 1.35 mm isotropic voxels (see  for full acquisition details). In addition, each of these datasets also includes a standard 3D T1-weighted MPRAGE (1 mm isotropic resolution). The Prisma scanner is less powerful than the bespoke HCP scanner and cannot achieve sufficient signal at mm resolution, but the mm data provides a pseudo ground-truth for IQT resolution enhancement of the 2.5 mm data.
Pathology dataset: we use two separate datasets which consist of images of brain tumour (Glioma) and multiple sclerosis (MS) patients, respectively. The data of each subject with glioma contains DWIs with while the measurement of each MS patient is of . Both datasets have isotropic voxel size , which is closer to the image resolution of commonplace clinical scanners. We use these datasets to assess the behaviour of predictive uncertainty on images with pathological features that are not represented in the training dataset.
In all the experiments, super-resolution is performed on diffusion parameter maps derived from the DWIs in the above datasets. In particular, we consider two diffusion MRI models, namely the diffusion tensor (DT) model and Mean Apparent Propagator (MAP) MRI, where the former is the simplest and most standard diffusion parameter map, and the latter is a high-order generalisation of the former with the capacity to characterise signals from more complex tissue structures (e.g. fibre crossing regions), a requirement for successful tractography applications. We compute both of these diffusion parameter maps using the publicly available implementation at https://github.com/ucl-mig/iqt.
We fit the DT model to the combination of images and HARDI shell for the HCP and Lifespan datasets, and shell for the brain tumour dataset. In all cases, weighted linear least squares are employed for the fitting, taking into account the spatially varying b-values and gradient directions in the HCP dataset. On the other hand, in the case of MAP-MRI, coefficients of basis functions up to order are estimated via (unweighted) least squares to all three shells of the HCP, Lifespan and Prisma datasets. As noted in , the choice of scale parameters (see ) mm empirically minimises the fitting error in the HCP dataset, and is used for all datasets.
Training datasets in all experiments are constructed by artificially downsampling very high-resolution images in the HCP dataset. In particular, we employ the following downsampling procedure: (i) the raw DWIs of selected subjects are blurred by applying the mean filter of size independently over channels with denoting the upsampling rate; (ii) the DT or MAP parameters are computed for every voxel; (iii) the spatial resolution of the resultant parameter maps is reduced by taking every pixels. A coupled library of low-resolution and high-resolution patches is then constructed by associating each patch in the downsampled DTI/MAP-MRI with the corresponding patch in the ground truth DTI or MAP-MRI. In this case, we ensure that each low-resolution patch is centrally and entirely contained within the corresponding high-resolution patch (as illustrated by the yellow and orange squares in Fig. 3). We then randomly select a pre-set number of patches from each subject in the training pool to create a training dataset as detailed in Table 1. In addition to the subjects used in the prior work [38, 39, 33], we randomly select additional subjects from the HCP cohort and include them in the training subject pool. Patches are standardised channel-wise by subtracting the mean of foreground pixel intensities of the corresponding subject and dividing by its standard deviation. Moreover, since MAP-MRI datasets contain outliers due to model fitting, in large enough quantity to influence the training of the baseline 3D-ESPCN model, we remove them by clipping the voxel intensity values of the respective channels separately at and percentiles computed over all the foreground voxels in the whole training dataset.
|Data||Size of input||Size of output||No. pairs per subject||No. subjects|
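The downsampling and normalisation steps described above can be sketched in NumPy as follows. This is an illustration under several assumptions: volumes are stored as `(D, H, W, C)` arrays, the mean filter is cubic with edge padding, and the function names and percentile arguments are hypothetical. The model-fitting step (ii), which sits between blurring and subsampling in the procedure above, is omitted here.

```python
import numpy as np


def mean_filter_3d(volume, r):
    """Blur a (D, H, W, C) volume with an r x r x r mean filter, per channel."""
    before, after = (r - 1) // 2, r // 2
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(volume, [(before, after)] * 3 + [(0, 0)], mode="edge")
    d, h, w = volume.shape[:3]
    out = np.zeros(volume.shape, dtype=float)
    for i in range(r):
        for j in range(r):
            for k in range(r):
                out += padded[i:i + d, j:j + h, k:k + w]
    return out / r ** 3


def subsample(volume, r):
    """Keep every r-th voxel along each spatial axis."""
    return volume[::r, ::r, ::r]


def standardise(volume, mask):
    """Channel-wise standardisation using foreground voxels only."""
    fg = volume[mask]  # shape (n_foreground, C)
    return (volume - fg.mean(axis=0)) / fg.std(axis=0)


def clip_outliers(volume, mask, lo_pct, hi_pct):
    """Clip each channel at the given foreground percentiles."""
    lo, hi = np.percentile(volume[mask], [lo_pct, hi_pct], axis=0)
    return np.clip(volume, lo, hi)
```

In the text the clipping percentiles are computed over the whole training dataset rather than per volume; the per-volume version above is kept only for brevity.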
4.2 Network Architectures and Training
For the training of all CNN models, we minimised the associated loss function using Adam  for epochs with initial learning rate of and , with minibatches of size . We hold out 50% of training patch pairs as a validation set. The best performing model was selected based on the mean-squared-error (MSE) on the validation set.
For the super-resolution of DTIs, as in , we use a minimal architecture for the baseline 3D-ESPCN, consisting of three convolutional layers with filters where is upsampling rate and is the number of channels in DTIs. As illustrated in Fig. 3, the dimensions of convolution filters are chosen so that each low-resolution receptive field patch maps to a high-resolution patch, which mirrors competing random forest based methods [38, 39] for a fair comparison. On the other hand, for MAP-MRI, which is a more complex image modality with channels, we employ a deeper model with 6 convolution layers prior to the shuffling operation, which expands the receptive field on each high-resolution patch to input low-resolution patch. Every convolution layer is followed by a ReLU non-linearity except the last one in the architecture, and batch-normalisation is additionally employed for MAP-MRI super-resolution between each convolution layer and ReLU non-linearity.
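The shuffling operation at the end of the 3D-ESPCN rearranges channels into space. A minimal NumPy sketch is given below; the `(D, H, W, C)` layout and the ordering of sub-voxel offsets within the channel axis are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np


def shuffle_3d(features, r):
    """Rearrange (D, H, W, C * r**3) features into (D*r, H*r, W*r, C)."""
    d, h, w, cr3 = features.shape
    c = cr3 // r ** 3
    # Split the channel axis into three sub-voxel offsets and the output channels.
    x = features.reshape(d, h, w, r, r, r, c)
    # Interleave each spatial axis with its sub-voxel offset.
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(d * r, h * r, w * r, c)
```

Because all convolutions operate at low resolution and the upsampling is a pure rearrangement, the cost of the network is dominated by low-resolution feature maps.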
The mean and variance networks in the heteroscedastic noise model introduced in Sec. 3.3 are implemented as two separate baseline 3D-ESPCNs with the architectures specified above for DTIs and MAP-MRIs. Positivity of the variance is enforced by passing the output through a softplus function.
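As a rough illustration of how a softplus-constrained variance enters a heteroscedastic Gaussian loss, consider the NumPy sketch below. It assumes a diagonal (per-voxel, per-channel) variance and drops constant terms; the function names and the small `eps` floor are hypothetical and not taken from the released code.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return np.logaddexp(0.0, x)


def heteroscedastic_nll(target, mean, raw_var, eps=1e-6):
    """Gaussian negative log-likelihood with input-dependent variance.

    raw_var is the unconstrained output of the variance network.
    """
    var = softplus(raw_var) + eps
    # Squared residuals are down-weighted where the predicted variance is large,
    # while the log term penalises inflating the variance everywhere.
    return 0.5 * np.mean(np.log(var) + (target - mean) ** 2 / var)
```

For a large residual, increasing the predicted variance lowers the loss, which is the attenuation mechanism that makes this type of model less sensitive to outliers.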
For variational dropout, we considered two flavours: Var.(I) optimises per-weight dropout rates, and Var.(II) optimises per-filter dropout rates. More formally, the “drop-out rate” in the approximate posterior is different for every element in each convolution kernel in the former, while in the latter a common rate is shared across each kernel. In preliminary analysis, we found that the number of samples per data point for estimating the reconstruction term (eq. 7) can be set to so long as the batch size is sensibly large ().
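The difference between the two flavours can be illustrated by how a noisy kernel is sampled. The sketch below assumes the usual Gaussian multiplicative-noise form of variational dropout, in which each weight has mean `theta` and variance `alpha * theta**2`; the kernel layout `(kd, kh, kw, c_in, c_out)` and the function name are assumptions for illustration.

```python
import numpy as np


def sample_kernel(theta, log_alpha, rng):
    """Draw one kernel from q(w) = N(theta, alpha * theta**2).

    theta:     kernel means, shape (kd, kh, kw, c_in, c_out)
    log_alpha: log noise variance; same shape as theta for per-weight
               rates, or shape (c_out,) for one shared rate per filter.
    """
    alpha = np.exp(log_alpha)
    eps = rng.standard_normal(theta.shape)
    # Broadcasting over the last axis shares alpha across each filter.
    return theta * (1.0 + np.sqrt(alpha) * eps)
```

In both cases `log_alpha` would be a trainable quantity; only its shape changes between the per-weight and per-filter variants.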
We also note the default training with binary and Gaussian dropout also employs  along with other MC variational inference methods for neural networks such as [85, 37, 95]. Variational dropout is applied to both the baseline and heteroscedastic models without changing the architectures. For both binary and Gaussian dropout modes, we incorporate the dropout operations of fixed rate in every convolution layer of the baseline 3D-ESPCN architecture.
All models are trained on simulated datasets generated from 16 HCP subjects as detailed in Sec. 4.1. We also retrained the random forest models employed in [39, 40] on equivalent datasets. It takes under mins to train a single network on DTI/MAP-MRI data on a single TITAN X GPU. All models are implemented in the TensorFlow framework and the code will be released at https://github.com/rtanno21609/UncertaintyNeuroimageEnhancement.
4.3 Quantitative Evaluation of Super-resolution Performance
We evaluate the prediction performance of our models for super-resolution of DTI and MAP-MRI on two datasets, HCP and Lifespan, as detailed in Sec. 4.1. The first dataset contains 16 unseen subjects from the same HCP cohort used for training, while the second one consists of subjects from the HCP Lifespan dataset. The latter tests generalisability, as they are acquired with a different protocol at lower resolution (1.5 mm isotropic), and contain subjects of a different age range (45-75 years) to the original HCP data (22-36 years). We perform upsampling in all spatial directions. The reconstruction quality is measured with root-mean-squared-error (RMSE), peak-signal-to-noise-ratio (PSNR) and mean-structural-similarity (MSSIM) on two separate regions: i) “interior”, the set of patches contained entirely within the brain mask; ii) “exterior”, the set of patches containing some brain and some background voxels, as shown in Fig. 6. We make this distinction because the current state-of-the-art methods based on random forests (RFs), such as IQT-RF and BIQT-RF, are only trained on patches from the interior region and require a separate procedure on the brain boundary. In addition, the estimation problem is quite different in boundary regions, but remains valuable, particularly for applications such as tractography where seed or target regions are often on the cortical surface of the brain. We only present the RMSE results, but the derived conclusions remain the same for the other two metrics. Aside from the interpolation techniques, for each method an ensemble of models is trained on different training sets (generated by randomly extracting patch pairs from the common HCP training subjects) and, for each model, the average error metric over the test subjects is first calculated. The mean and standard deviation of these average errors are then computed across the model ensemble and reported in Table 2 and Table 3.
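The reporting protocol described above (per-model average over test subjects, then mean and standard deviation across the ensemble) can be sketched as follows. The voxel-wise region mask and both function names are assumptions for illustration; in the text the interior and exterior regions are defined in terms of patches.

```python
import numpy as np


def rmse(pred, truth, mask):
    """Root-mean-squared error restricted to the voxels selected by mask."""
    diff = (pred - truth)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def ensemble_summary(per_model_errors):
    """per_model_errors[m][s] is the error of model m on test subject s.

    Returns the mean and standard deviation, across models, of each
    model's average error over the test subjects.
    """
    per_model_mean = np.asarray(per_model_errors, dtype=float).mean(axis=1)
    return per_model_mean.mean(), per_model_mean.std()
```

Calling `rmse` once with an interior mask and once with an exterior mask yields the two columns reported per dataset.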
Table 2 shows that our baseline achieves reduction in RMSE for the super-resolution of DTIs on the HCP dataset on the interior/exterior regions with respect to the best published method, BIQT-RF. While the standard deviations are higher, the improvements are more pronounced in MAP-MRI super-resolution, reducing the average RMSEs by and on the interior and exterior regions. We note that IQT-RF and BIQT-RF are only trained on interior patches, and super-resolution on boundary patches requires a separate ad hoc procedure. Despite including exterior patches in training our model, which complicates the learning task, the baseline CNN outperforms the RF methods on both regions. We see similar improvements in the out-of-distribution Lifespan dataset.
Reconstruction is faster than the RF baselines; the 3D-ESPCN is capable of estimating the whole high-resolution DTI/MAP-MRI under 10/60 seconds on a CPU and second(s) on a GPU. On the other hand, BIQT-RF takes mins with 8 trees on both DTIs and MAP-MRIs. The fully convolutional architecture of the model enables it to process input patches of a different size from that of the training inputs, and we achieve faster reconstruction by using larger input patches of dimension where is the number of channels. We also note that the reconstruction time of the variational dropout based models increases by a factor of the number of MC samples used at test time, although it is possible, with more memory, to leverage GPU parallelisation by making multiple copies of each input patch and treating them as a mini-batch. On the other hand, the heteroscedastic CNN enjoys the same inference speed as the baseline since only the mean network is used for reconstruction (the variance network is only employed to quantify the estimated intrinsic uncertainty).
|Models||HCP (interior)||HCP (exterior)||Life (interior)||Life (exterior)|
|+ Binary Dropout ()|
|+ Gaussian Dropout ()|
|+ Variational Dropout (I)|
|+ Variational Dropout (II)|
|+ Hetero. + Variational Dropout (I)|
|+ Hetero. + Variational Dropout (II)|
|Models||HCP (interior)||HCP (exterior)||Life (interior)||Life (exterior)|
|+ Binary Dropout ()|
|+ Gaussian Dropout ()|
|+ Variational Dropout (I)|
|+ Variational Dropout (II)|
|+ Hetero + Variational Dropout (I)|
|+ Hetero + Variational Dropout (II)|
|3D-ESPCN(without outlier removal)|
|+ Hetero + Variational Dropout (I)|
|+ Hetero + Variational Dropout (II)|
Table 2 shows that, on both HCP and Lifespan data, modelling both intrinsic and parameter uncertainty (i.e. Hetero. + Variational Dropout (I), (II)) achieves the best reconstruction accuracy in DTI super-resolution. We observe that modelling intrinsic uncertainty with the heteroscedastic network on its own further reduces the average RMSE of the baseline 3D-ESPCN on the interior region with high statistical significance (). However, its performance on the exterior region is poorer than that of the baseline. On the other hand, using MC weight samples, we see that modelling parameter uncertainty with variational dropout (see Variational Dropout (I)) performs best on both datasets on the exterior region. The combination of the heteroscedastic model and variational dropout (i.e. Hetero. + Variational Dropout (I) or (II)) leads to the top-2 performance on both datasets on the interior region and reduces errors on the exterior to a level comparable to or better than the baseline.
Similarly, Table 3 shows that the best performance in MAP-MRI super-resolution comes from the combined models (i.e. Hetero. + Variational Dropout (I) and (II)). We observe that, as with the DTI case, modelling intrinsic uncertainty through the heteroscedastic network improves the reconstruction accuracy on the interior region, whilst the errors on the exterior are increased with respect to the baseline 3D-ESPCN. Moreover, the improvement is pronounced when the outliers due to model fitting errors are not removed from the training data. In this case, we see that the reconstruction accuracy of 3D-ESPCN decreases dramatically, whilst in contrast it is only marginally compromised when equipped with the heteroscedastic noise model, displaying robustness to outliers. Lastly, we note that the top-2 accuracies are consistently achieved by the joint modelling of intrinsic and parameter uncertainty (i.e. Hetero. + Variational Dropout (I) and (II)) on both the interior and exterior regions on both HCP and Lifespan datasets.
The performance difference of the heteroscedastic network between the interior and the exterior regions stems from the loss function. The Mahalanobis term in eq. (5) imposes a larger penalty on the regions with smaller intrinsic uncertainty. The network therefore allocates less of its resources towards the regions with higher uncertainty (e.g. boundary regions), where the statistical mapping from the low-resolution to the high-resolution space is more ambiguous, and the model is biased towards fitting the regions with lower uncertainty. However, we note that the performance of the heteroscedastic network is still considerably better than the standard interpolation and RF-based methods. By augmenting the model with variational dropout, the exterior error of the heteroscedastic model is dramatically reduced, indicating its regularisation effect against overfitting to low-uncertainty areas. We also observe a concomitant performance improvement on the interior regions on both datasets, which additionally shows the benefits of such regularisation even in low-uncertainty areas.
Both Table 2 and Table 3 show that the use of variational dropout attains lower errors than the models with fixed dropout probabilities , namely, Binary and Gaussian dropout . Different instances of both dropout models are trained for a range of by linearly increasing on the interval with increment , and the test errors for the configurations with smallest RMSE on the validation set are reported in Table 2 and Table 3. As with variational dropout models, MC samples are used for inference. In all cases, two variants of variational dropout (I) and (II) outperform the networks with the best binary or Gaussian dropout models, showing the benefits of learning dropout probabilities rather than fixing them in advance.
4.4 Tractography with MAP-MRI
Reconstruction accuracy does not necessarily reflect real-world utility. We thus further assessed the benefits of super-resolution with a tractography experiment on the Prisma dataset, which contains two DWIs of the same subject at two different image resolutions (1.35 mm and 2.5 mm isotropic voxels), as detailed in Sec. 4.1. An ensemble of the best performing CNN (3D-ESPCN + Hetero. + Variational Dropout (I)) is used to super-resolve the MAP-MRI coefficients derived from the low-resolution DWIs, and the ensemble predictions are aggregated into the final output by taking the average estimate weighted by the inverse of the estimated intrinsic uncertainty. Lastly, the high-resolution multi-shell DWIs are obtained from this super-resolved MAP volume. Specifically, the Spherical Mean Technique (SMT) is used to fit a microscopic tensor model to the predicted dataset. The voxel-by-voxel estimated model parameters inform the spatially varying fibre response function that is used to recover the fibre orientation distribution through spherical deconvolution. Afterwards, we perform probabilistic tractography with the fibre pathways randomly seeded in the brain. In a similar fashion, we also generate high-resolution datasets by using IQT-RF and linear interpolation.
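The aggregation of ensemble predictions by inverse-uncertainty weighting can be sketched as below. This is an illustration only: it assumes the weights are the reciprocals of the per-voxel intrinsic variances predicted by each ensemble member, and the function name and `eps` guard are hypothetical.

```python
import numpy as np


def inverse_variance_average(means, variances, eps=1e-12):
    """Combine ensemble predictions, weighting each by 1 / variance.

    means, variances: arrays of shape (n_models, ...) with the predicted
    mean and intrinsic variance of every ensemble member.
    """
    weights = 1.0 / (variances + eps)
    return (weights * means).sum(axis=0) / weights.sum(axis=0)
```

Members that are locally more confident therefore contribute more to the final estimate at that voxel.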
Fig. 7 shows that IQT via our best performing CNN makes a tangible difference in downstream tractography. In the top row, tractography on the low-resolution data produces a false-positive tract under the corpus callosum (yellow arrow), which tractography at high resolution avoids. Reconstructed high-resolution images from IQT-RF and CNN predictions avoid the false positive better than linear interpolation. Note that we do not expect to reproduce the high-resolution tractography map exactly, as the high-resolution and low-resolution images are not aligned exactly and the high-resolution image and the prediction have different resolutions (1.35 mm vs. 1.25 mm). The bottom row shows sharper recovery of small gyral white matter pathways (green arrow) at high resolution than at low resolution, resulting from a reduced partial volume effect. CNN reconstruction produces a sharper pathway than IQT-RF and linear interpolation, more closely reflecting the high-resolution tractography.
4.5 Uncertainty Quantification
In this section, we investigate the value of uncertainty modelling in enhancing the safety of super-resolution systems beyond reduced reconstruction errors. Firstly, in Sec. 4.5.1, we study the utility of the predictive uncertainty map as a proxy measure of reconstruction accuracy on healthy test subjects from both HCP and Lifespan datasets. Secondly, in Sec. 4.5.2, we look into the behaviour of uncertainty maps in the presence of abnormal features that are not present in the training data.
4.5.1 Healthy Test Subjects
We employ the most performant CNN model (3D-ESPCN + Hetero. + Variational Dropout(I)) to generate the high-resolution predictions of mean diffusivity (MD) and fractional anisotropy (FA), and their associated predictive uncertainty maps. Here we draw samples of high-resolution DTI predictions for each subject from the predictive distribution , and then the FA and MD maps of each prediction are computed. The sample mean and standard deviation are then calculated from these samples to generate the final estimates of high-resolution MD/FA maps and their corresponding predictive uncertainty.
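The computation of MD and FA maps from sampled DTI predictions, followed by the sample mean and standard deviation, can be sketched as below. The sketch assumes each diffusion tensor is stored as a full symmetric 3 x 3 matrix and uses the standard eigenvalue-based definitions of MD and FA; the function names are hypothetical.

```python
import numpy as np


def md_fa(tensors):
    """tensors: (..., 3, 3) symmetric diffusion tensors -> (MD, FA)."""
    evals = np.linalg.eigvalsh(tensors)  # shape (..., 3)
    md = evals.mean(axis=-1)
    num = np.sqrt(((evals - md[..., None]) ** 2).sum(axis=-1))
    den = np.sqrt((evals ** 2).sum(axis=-1))
    fa = np.sqrt(1.5) * num / np.maximum(den, 1e-12)
    return md, fa


def summarise_samples(tensor_samples):
    """tensor_samples: (n_samples, ..., 3, 3) draws from the predictive distribution.

    Returns ((MD mean, MD std), (FA mean, FA std)) over the sample axis.
    """
    md, fa = md_fa(tensor_samples)
    return (md.mean(axis=0), md.std(axis=0)), (fa.mean(axis=0), fa.std(axis=0))
```

The standard deviations returned by `summarise_samples` play the role of the predictive uncertainty maps over MD and FA.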
Fig. 8 displays high correspondence between the error (RMSE) maps and the predictive uncertainty on both FA and MD of an HCP test subject. This demonstrates the potential utility of the uncertainty map as a surrogate measure of prediction accuracy. In particular, the MD uncertainty map captures subtle variations within the white matter and the cerebrospinal fluid (CSF) at the centre. Also, in accordance with the low reconstruction accuracy, high predictive uncertainty is observed in the CSF in MD. This is expected since the CSF is essentially free water with low signal-to-noise-ratio (SNR) and is also affected by biological noise such as cardiac pulsations. The reconstruction errors are high in FA prediction on the bottom-right quarter of the brain boundary, close to the skull, which is also reflected in the uncertainty map.
Fig. 9 tests the utility of the predictive uncertainty map in discriminating potential predictive failures in the predicted high-resolution MD map. We define ground-truth “safe” voxels as the ones with reconstruction error (RMSE) smaller than a fixed value, and the task is to separate them from the remaining ground-truth “risky” voxels by thresholding on their predictive uncertainty values. The threshold for defining safe voxels is set to , such that the risky voxels mostly concentrate on the outer-boundary and the CSF regions (which account for of all voxels under consideration). Here the positive class is defined as “safe” while the negative class is defined as “risky”. Fig. 9 (a) shows the corresponding receiver operating characteristic (ROC) curve of this binary classification task, which plots the true-positive-rate (TPR) against the false-positive-rate (FPR) computed based on all the voxels in the 16 HCP training subjects. In this case, TPR describes the percentage of correctly detected safe voxels out of all the safe ones, while FPR is defined as the percentage of risky voxels that are wrongly classified as safe out of all the risky voxels. We then select the best threshold by maximising the F1 score, and use this to classify the voxels in each predicted high-resolution MD map into “safe” and “risky” ones for all subjects in the test HCP dataset and the Lifespan dataset. Fig. 9 (b) shows the inter-subject average of the TPR and FPR on both datasets. While on average TPR slightly worsens compared to the results on the training subjects, FPR improves in both cases; notably, this uncertainty-based classification is able to correctly identify 96% of risky predictions on unseen subjects from the out-of-training-distribution dataset, namely Lifespan, which differs in demographics and underlying acquisition. Fig. 9 (c) compares the classification results to the pre-defined “ground truth” on one of the Lifespan subjects, which illustrates that the generated “warning” aggressively flags potentially risky voxels at the cost of thresholding out some of the safe ones.
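The threshold selection described above can be sketched in NumPy as follows, with “safe” as the positive class. The function names, the candidate grid and the error tolerance argument are hypothetical and introduced only for illustration.

```python
import numpy as np


def rates(uncertainty, error, err_tol, unc_thresh):
    """TPR, FPR and F1 when voxels with low uncertainty are predicted safe."""
    safe_true = error < err_tol            # ground-truth "safe" voxels
    safe_pred = uncertainty < unc_thresh   # predicted "safe" voxels
    tp = np.sum(safe_pred & safe_true)
    fp = np.sum(safe_pred & ~safe_true)
    fn = np.sum(~safe_pred & safe_true)
    tn = np.sum(~safe_pred & ~safe_true)
    tpr = tp / max(tp + fn, 1)
    fpr = fp / max(fp + tn, 1)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    return tpr, fpr, f1


def best_threshold(uncertainty, error, err_tol, candidates):
    """Pick the uncertainty threshold that maximises F1 on the given voxels."""
    scores = [rates(uncertainty, error, err_tol, t)[2] for t in candidates]
    return candidates[int(np.argmax(scores))]
```

A threshold chosen with `best_threshold` on one set of subjects can then be passed to `rates` on held-out subjects to obtain the TPR and FPR on unseen data.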
4.5.2 Unseen Abnormalities and Uncertainty Decomposition
We separately visualise the propagated intrinsic and parameter uncertainty over the predicted high-resolution MD map on images of subjects with a variety of different unseen abnormal structures, such as benign cysts, tumours (Glioma) and focal lesions caused by multiple sclerosis (MS). We emphasise here that these images have been acquired with a range of different protocols. Specifically, benign cysts in the HCP datasets represent abnormalities in images acquired with the same protocol as the training data, while tumours and MS lesions are examples of pathologies present in out-of-distribution imaging protocols. In all cases, we use the SR network, Hetero. + Variational Dropout (I), trained on healthy subjects from the HCP dataset. For each of different sets of parameters sampled from the posterior distribution , we draw samples of high-resolution DTIs from the likelihood, , compute the corresponding MD, and approximate the two constituents of predictive uncertainty with the MC estimators given in eq. (20) and (21).
Fig. 10 shows the reconstruction accuracy along with the components of predictive uncertainty over the high-resolution MD map of an HCP test subject, which contains a benign abnormality (a small posterior midline arachnoid cyst). The error (RMSE) and propagated intrinsic uncertainty are plotted on the same scale whereas the propagated parameter uncertainty is plotted on 1/5 of the scale for clear visualisation. In this case, the predictive uncertainty is dominated by the intrinsic component. In particular, low propagated intrinsic uncertainty is observed in the interior of the cyst relative to its boundary, in accordance with the high accuracy in the region. This is expected as the interior structure of a cyst is highly homogeneous with low variance in signals and the super-resolution task should therefore be relatively straightforward. On the other hand, the component of parameter uncertainty is high on the interior structure, which also makes sense as such homogeneous features are underrepresented in the training data of healthy subjects. This example illustrates how decoupling the effects of intrinsic and parameter uncertainty potentially allows one to make sense of the predictive performance.
Fig. 11 visualises the uncertainty components generated by the same CNN model trained on datasets of varying size. We see that the propagated parameter uncertainty diminishes as the training set size increases, while the propagated intrinsic uncertainty stays more or less constant. This result is indeed what is expected as described in Fig. 1; the specification of network weights becomes more confident, i.e. the variance of the posterior distribution decreases as the amount of training data increases, while the effect of intrinsic uncertainty is irreducible with the amount of data. On the other hand, when the standard binary or Gaussian dropout was employed instead of variational dropout, we observed that the effect of parameter uncertainty stayed more or less constant with the size of training data. This may be a consequence of the posterior variance being largely determined by the prespecified dropout rates, which in turn results in a more static variance of the predictive distribution.
We further validate our method on clinical images with previously unseen pathologies. We note that the pathology data contain images acquired with standard clinical protocols with voxel size slightly smaller than that of the training low-resolution images and lower signal-to-noise ratio.
Fig. 12 shows that pathological areas not represented in the training set are flagged as highly uncertain. Although the ground truth is not available in this case, the uncertainty can be quantified instead to flag potential low-accuracy areas. Fig. 12 (a) shows that the propagated parameter uncertainty highlights the tumour core and speckle-like artefacts in the input image, which are not represented in the training data. On the other hand, the intrinsic uncertainty component is high on the whole region of pathology covering both the tumour core and its surrounding edema. Fig. 12 (b) shows that high parameter uncertainty is assigned to a large part of focal lesions in MS, while the intrinsic uncertainty is mostly prevalent around the boundaries between anatomical structures and CSF. We also observe that the super-resolution sharpens the original image without introducing noticeable artefacts; in particular, for the brain tumour image, some of the partial volume effects are cleared.
5 Discussion and Conclusion
We introduce a probabilistic deep learning (DL) framework for quantifying three types of uncertainties that arise in data-enhancement applications, and demonstrate its potential benefits in improving the safety of such systems towards practical deployment. The framework models intrinsic uncertainty through a heteroscedastic noise model and parameter uncertainty through approximate Bayesian inference in the form of variational dropout, and finally integrates the two to quantify predictive uncertainty over the system output. Experiments focus on the super-resolution application of image quality transfer (IQT) and study several desirable properties of such a framework, which are lacking in the existing body of data enhancement methods based on deterministic DL models.
Firstly, results on a range of applications and datasets show that modelling uncertainty improves overall prediction performance. Tables 2 and 3 show that modelling the combination of both intrinsic and parameter uncertainty achieves state-of-the-art accuracy on super-resolution of DTIs and MAP-MRI coefficients in both the HCP test dataset and the Lifespan dataset, improving on the present best methods based on random forests (IQT-RF and BIQT-RF) and on interpolation, the standard method to estimate sub-voxel information used in clinical visualisation software. In particular, results on the Lifespan dataset, which differs from the training data in age range and acquisition protocol, indicate the better generalisability of our method. In addition, Fig. 7 shows that such a combined model also benefits downstream tractography in comparison with the previous methods, illustrating the potential utility of the method for downstream connectivity analysis. Such improvement in the predictive performance arises from the regularisation effects imparted by the modelling of the respective uncertainty components. Specifically, modelling intrinsic uncertainty through the heteroscedastic network improves robustness to outliers, while modelling parameter uncertainty via variational dropout defends against overfitting. For example, Table 3 shows that the predictive performance of the 3D-ESPCN + Hetero. model is only marginally compromised even when the outliers are not removed from the training data, while the baseline 3D-ESPCN results in much poorer performance. This can be ascribed to the ability of the variance network in the 3D-ESPCN + Hetero. architecture to attenuate the effects of outliers by assigning small weights (i.e. high uncertainty) in the weighted MSE loss function as shown in eq. (5).
However, this loss attenuation mechanism can also encourage the network to overfit to low-uncertainty regions, potentially focusing less on ambiguous yet important parts of the data. We indeed observe in Table 3 that the heteroscedastic network performs considerably worse than the baseline 3D-ESPCN on the exterior regions while the reverse is observed on the interior part. Such overfitting to low-uncertainty interior regions is alleviated by modelling parameter uncertainty with variational dropout, as evidenced by the dramatic error reduction in the exterior region on both HCP and Lifespan datasets.
Secondly, experiments on images of healthy and pathological brains have demonstrated the utility of predictive uncertainty as a reliability metric for output images. Fig. 8 illustrates the strong correspondence between the maps of predictive uncertainty and the reconstruction quality (voxel-wise RMSE) in downstream derived quantities such as FA and MD maps. In addition, Fig. 12 shows that this uncertainty measure also highlights pathological structures not observed in the training data. We have also tested the utility of predictive uncertainty in discriminating voxels with sufficiently low RMSEs in the predicted high-resolution MD maps. As shown in Fig. 9, the optimal threshold selected on the HCP training dataset is capable of detecting over of non-reliable predictions (voxels with RMSE above a certain threshold), not only on unseen subjects in the same HCP cohort but also on subjects from the out-of-sample Lifespan dataset, which are statistically disparate from the training distribution (e.g. different age range and acquisition protocol). Together, these results demonstrate the utility of the predictive uncertainty map as a means to quantify output safety, and provide a subject-specific alternative to standard population-group reliability metrics (e.g. mean reconstruction accuracy in a held-out cohort of subjects). Such conventional group statistics can be misleading in practice; for instance, the information that a super-resolution algorithm is reliable of the time on a dataset of subjects may not accurately represent the performance on a new unseen individual if that person is not well represented in the cohort (e.g. pathology, different scanners, etc.). In contrast, predictive uncertainty provides a metric of reliability tailored to each individual at hand.
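The threshold-based detection of non-reliable voxels described above could be sketched as follows. The selection criterion is an assumption: here the threshold maximises the F1 score on training voxels, which may differ from the criterion used for Fig. 9. All function names and values are hypothetical.

```python
def select_threshold(uncertainty, rmse, rmse_limit):
    """Pick the uncertainty threshold that best separates voxels whose
    RMSE exceeds rmse_limit (non-reliable) from the rest, by maximising
    the F1 score over candidate thresholds taken from the data."""
    unreliable = [e > rmse_limit for e in rmse]
    best_t, best_f1 = None, -1.0
    for t in sorted(set(uncertainty)):
        flagged = [u >= t for u in uncertainty]
        tp = sum(f and r for f, r in zip(flagged, unreliable))
        fp = sum(f and not r for f, r in zip(flagged, unreliable))
        fn = sum((not f) and r for f, r in zip(flagged, unreliable))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def detection_rate(uncertainty, rmse, rmse_limit, threshold):
    """Fraction of non-reliable voxels that the threshold flags."""
    unreliable = [u for u, e in zip(uncertainty, rmse) if e > rmse_limit]
    if not unreliable:
        return float("nan")
    return sum(u >= threshold for u in unreliable) / len(unreliable)

# Hypothetical per-voxel values standing in for a training set...
train_unc = [0.1, 0.2, 0.3, 0.8, 0.9, 0.25]
train_rmse = [0.01, 0.02, 0.02, 0.09, 0.12, 0.03]
threshold = select_threshold(train_unc, train_rmse, rmse_limit=0.05)

# ...and for an unseen test set.
test_unc = [0.15, 0.85, 0.7, 0.2]
test_rmse = [0.02, 0.10, 0.04, 0.01]
rate = detection_rate(test_unc, test_rmse, 0.05, threshold)
```

The threshold is fixed on the training voxels and then applied unchanged to the unseen voxels, mirroring the in-sample and out-of-sample evaluation described in the text.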
Thirdly, our preliminary experiments show that decomposing the predictive uncertainty into the effects of intrinsic and parameter uncertainty provides a layer of explanation for the performance of the considered deep learning methods. Fig. 10 shows that the low reconstruction error in the centre of the benign cyst can be explained by the dominant intrinsic uncertainty, which indicates the inherent simplicity of the super-resolution task in such a homogeneous region, whilst the unfamiliarity of such a structure in the healthy training dataset is reflected in the high parameter uncertainty. Assuming that the estimates of the decomposed uncertainty components are sufficiently accurate, we could act on them to further improve the overall safety of the system. Imagine a scenario where the reconstruction error is consistently high on certain image structures. If the parameter uncertainty is high but the intrinsic uncertainty is low, this indicates that collecting more training data would be beneficial. On the other hand, if the parameter uncertainty is low and the intrinsic uncertainty is high, we would need to regard such errors as inevitable, and either abstain from predictions to ensure safety or account for them appropriately in subsequent analysis.
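As an illustration of how such a decomposition might be computed and acted upon, the sketch below assumes a law-of-total-variance split over several stochastic forward passes (e.g. with different dropout masks): the average predicted variance stands in for intrinsic uncertainty, and the variance of the predicted means stands in for parameter uncertainty. The decision rule, its threshold and all names are hypothetical.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(means, variances):
    """Split predictive variance for one voxel using T stochastic passes.

    means[t], variances[t]: mean and variance predicted on pass t.
    Returns (intrinsic, parameter):
      intrinsic = average predicted variance,
      parameter = variance of the predicted means across passes.
    """
    intrinsic = fmean(variances)
    parameter = pvariance(means)
    return intrinsic, parameter

def suggest_action(intrinsic, parameter, limit):
    """Toy decision rule mirroring the scenario in the text."""
    if parameter > limit and intrinsic <= limit:
        return "collect more training data"
    if intrinsic > limit and parameter <= limit:
        return "abstain or propagate the uncertainty downstream"
    return "no single dominant source"

# Hypothetical voxel: passes disagree on the mean but each predicts low noise.
intrinsic, parameter = decompose_uncertainty(
    means=[0.2, 0.8, 0.5, 1.1], variances=[0.01, 0.02, 0.01, 0.02]
)
action = suggest_action(intrinsic, parameter, limit=0.05)
```

Such a rule is only as trustworthy as the underlying uncertainty estimates, which is why the text conditions it on the decomposed components being sufficiently accurate.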
The proposed methods for estimating intrinsic and parameter uncertainty, however, make several simplifying assumptions about the forms of the likelihood model and the posterior distribution over network parameters. Firstly, the likelihood model takes the form of a Gaussian distribution with a diagonal covariance matrix. This means that the likelihood model is not able to capture multi-modality of the predictive distribution, i.e. the presence of multiple different solutions. While the full predictive distribution (eq. (9)) is not necessarily unimodal in theory due to the integration with the posterior distribution, we observe in practice that the drawn samples are not very diverse. Future work should explore the benefits of employing more complex forms of likelihood functions such as mixture models [101, 66], diversity losses [102, 103, 104] and more powerful density estimators [105, 106, 107, 108, 66]. In addition, the diagonality of the covariance matrices means that the output pixels are assumed to be statistically independent given the input. Although the predicted images display high inter-pixel consistency, modelling the correlations between neighbouring pixels may further improve the reconstruction quality. Analogous to the likelihood function, variational dropout, which is used in this work, approximates the posteriors by Gaussian distributions with diagonal covariance, imposing restrictive assumptions of unimodality and statistical independence between neural network weights. More recent advances in Bayesian deep learning research [110, 111, 112, 113, 114, 115] could be used to enhance the quality of parameter uncertainty estimation by allowing the model to capture multi-modality and statistical dependencies between parameters.
An important future challenge is the clinical validation of predictive uncertainty as a reliability metric for output images. To this end, we need to design a more clinically meaningful definition of success and failure of the data enhancement algorithm at hand. Despite the high accuracy in distinguishing between predictive failures and successes attained with our method (Fig. 9), our definition of reconstruction quality, namely voxel-wise RMSE, does not necessarily represent the real utility of the output image. One possible approach would be to have clinical experts label the potential failures in the super-resolved images, be it for a targeted application (e.g. diagnosis of some neurological conditions) or for general usage in clinical practice. A more economical alternative, which does not require extra label acquisition, is to define prediction success in terms of downstream measurements of interest, i.e. functions of the output images, such as morphometric measurements of anatomical or pathological structures (e.g. volumes). The propagation method (eq. (13)) introduced in Sec. 3.6 can be utilised to quantify uncertainty components in the space of the target measurement. Measuring the correlation between such propagated uncertainty estimates and the corresponding errors would be a useful indicator of how well the uncertainty measure reflects the accuracy of the chosen measurement. Lastly, our initial results on the brain tumour dataset motivate a larger-scale quantitative validation of uncertainty estimates in the presence of pathology. Future work must examine the effect of including patient data in the training set on the estimates of the uncertainty components.
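The validation idea above, correlating propagated uncertainty with errors in a downstream measurement, could be prototyped along the following lines. This is a sketch under stated assumptions rather than the propagation method of eq. (13): it uses simple Monte Carlo propagation (the measurement is applied to each sampled output), a stand-in measurement (mean intensity), and hypothetical data and function names.

```python
import math
from statistics import fmean, pstdev

def propagate(samples, measure):
    """Apply a downstream measurement to each sampled output image and
    return the mean and standard deviation of the resulting values."""
    values = [measure(s) for s in samples]
    return fmean(values), pstdev(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_intensity(image):
    """Stand-in measurement: mean over a flattened image."""
    return fmean(image)

# Hypothetical data: for each case, a few sampled outputs and a reference.
cases = [
    ([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]], [1.0, 1.0]),
    ([[2.0, 2.2], [2.4, 2.6], [1.6, 1.8]], [2.5, 2.5]),
    ([[3.0, 3.0], [4.0, 4.0], [2.0, 2.0]], [4.0, 4.0]),
]
uncertainties, errors = [], []
for samples, reference in cases:
    estimate, spread = propagate(samples, mean_intensity)
    uncertainties.append(spread)
    errors.append(abs(estimate - mean_intensity(reference)))

r = pearson(uncertainties, errors)
```

A correlation close to one on real data would suggest that the propagated uncertainty tracks the accuracy of the chosen measurement; a morphometric measurement such as a structure volume could replace `mean_intensity`.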
There are many ways in which uncertainty information could be utilised by radiologists or other users of data enhancement algorithms. First, predictive uncertainty can be used to decide when to abstain from predictions in high-risk regions of images (e.g. anomalies, out-of-distribution examples or inherently ambiguous features). For example, the original low-resolution input image can be augmented by overlaying the high-resolution prediction only in locations with sufficiently low uncertainty, before it is presented to clinicians. As demonstrated by Fig. 9 in the context of super-resolution, such uncertainty-based quality control of predictions is potentially an effective means to maintain high accuracy of output images and also to safeguard against hallucination or removal of structures. Second, the uncertainty information could be used for active learning to decide which images should be labelled and included in the training set to maximally improve the model performance. Prior works [117, 118] define the acquisition function so as to select examples with high parameter uncertainty, and achieve promising results in classification and segmentation tasks. In particular, these methods are able to construct a compact and effective training dataset, and consequently improve the prediction accuracy while reducing the training time. The same idea could be naturally extended to data enhancement problems, which are typically formulated as multivariate regression tasks. For example, in the case of IQT, we could simulate a library of low-resolution and high-resolution image pairs from a large public dataset (e.g. HCP), and incrementally expand the training data by adding more examples from such a library. We should note, however, that in many data enhancement applications, obtaining a new “label” may require an extra acquisition, possibly with a different scanner or modality, which may be logistically challenging. Third, another important application is transfer learning, where uncertainty information could be used to leverage knowledge from different but related domains or tasks. In many data enhancement applications, the test distribution can deviate considerably from the training distribution. For example, the algorithm might be trained on a synthetic dataset or on images acquired from a scanner that is very different from the one used in the hospital where one plans to deploy the model. Therefore, a mechanism to adapt performance within a specific environment (e.g. based on the local patient population), possibly in an online fashion [121, 122], is in demand. Recent work has shown that the Bayesian formalism provides a natural framework for using uncertainty to account for the differences and commonalities between distributions, and thereby to guide information transfer in continual learning [123, 124] or few-shot learning [125, 126] settings. Exploring the benefits of these ideas in the context of medical image enhancement remains future work.
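The first two uses above, uncertainty-gated overlay and uncertainty-driven selection of training examples, can be sketched as follows. The sketch assumes that the low-resolution input has already been interpolated onto the high-resolution grid and that images are flattened into lists; the function names, thresholds and values are hypothetical, and the acquisition rule is a simple top-k ranking rather than the rules of the cited works.

```python
def gated_overlay(upsampled_input, prediction, uncertainty, threshold):
    """Keep the predicted value only where uncertainty is below threshold;
    elsewhere fall back to the (interpolated) input value."""
    return [
        p if u < threshold else x
        for x, p, u in zip(upsampled_input, prediction, uncertainty)
    ]

def select_for_training(parameter_uncertainty, budget):
    """Return indices of the `budget` candidates with the highest
    parameter uncertainty (a simple acquisition rule)."""
    ranked = sorted(
        range(len(parameter_uncertainty)),
        key=lambda i: parameter_uncertainty[i],
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical flattened voxels.
blended = gated_overlay(
    upsampled_input=[0.5, 0.5, 0.5],
    prediction=[0.6, 0.9, 0.4],
    uncertainty=[0.01, 0.30, 0.02],
    threshold=0.1,
)

# Hypothetical per-candidate parameter uncertainty for a library of pairs.
chosen = select_for_training([0.05, 0.40, 0.10, 0.35], budget=2)
```

In the toy example the second voxel falls back to the input value because its uncertainty exceeds the threshold, and the two candidates with the highest parameter uncertainty are selected for inclusion in the training set.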
The proposed framework for uncertainty quantification is formulated for multivariate regression in its general form, and is thus naturally applicable to many other image enhancement challenges, such as: rapid image acquisition techniques, e.g. compressed sensing, MR fingerprinting [127, 128] or sparse reconstruction [19, 18]; denoising and dealiasing [21, 129]; image synthesis tasks, e.g. estimating T2-weighted images from T1 [130, 131, 132], estimating CT images from MRI [133, 57, 48], and generating a high-field scan from a low-field scan; and data harmonisation [134, 14, 15], which aims to learn mappings among imaging protocols to reduce confounds in multicentre studies. Our results on image quality transfer illustrate the potential of uncertainty modelling techniques to improve the safety of these applications, not only by improving the predictive accuracy, but also by providing a mechanism to quantify risks and safeguard against potential malfunction.
Data and code availability statement
The Human Connectome Project dataset (release Q3) and the Lifespan dataset are publicly available. The Prisma data are available upon request. The glioma and multiple sclerosis datasets are part of ongoing studies at the Humanitas Research Hospital, Italy and the Institute of Neurology at UCL, UK, respectively, and we are bound by the policies of the data providers. The code will be released at https://github.com/rtanno21609/SaferNeuroimageEnhancement upon publication.
We would like to thank Felix Bragmann at Babylon Health, Zach Eaton-Rosen at UCL/KCL and Stefano Blumberg at UCL for their valuable feedback. We would also like to thank Samuel Hurley, who helped with the Prisma acquisitions at FMRIB, University of Oxford. The tumour data were acquired as part of a clinical research project led by Alberto Bizzi, MD at the Humanitas Research Hospital in Milan, Italy. We are also grateful to Mark S. Graham at Visulytix and Gary Zhang at UCL for connecting us with Alberto Bizzi. The multiple sclerosis (MS) data were acquired as part of a study at UCL Institute of Neurology, funded by the MS Society UK and the UCL Hospitals Biomedical Research Centre (PIs: David Miller and Declan Chard). The HCP data were provided by the WU-Minn Consortium (PIs: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by NIH and Washington University.
EU Horizon 2020 grant CDS-QuaMRI 634541-2 and EPSRC grants R014019, R006032, N018702 and M020533 support DCA’s work on this topic. FG has received funding under the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 634541 and from the EPSRC (R006032/1, M020533/1). RT was supported by a Microsoft scholarship.
-  D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 2017.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
-  K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation,” Medical Image Analysis, vol. 36, pp. 61–78, 2017.
-  H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, “A new 2.5 d representation for lymph node detection using random sets of deep convolutional neural network observations,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2014, pp. 520–527.
-  T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, and A. Campilho, “Classification of breast cancer histology images using convolutional neural networks,” PloS one, vol. 12, no. 6, p. e0177544, 2017.
Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
-  O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O’Regan, and D. Rueckert, “Multi-input cardiac image super-resolution using convolutional neural networks,” in MICCAI. Springer, 2016.
-  Y. Chen, F. Shi, A. G. Christodoulou, Y. Xie, Z. Zhou, and D. Li, “Efficient and accurate mri super-resolution using a generative adversarial network and 3d multi-level densely connected network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 91–99.
-  D. Ravì, A. B. Szczotka, S. P. Pereira, and T. Vercauteren, “Adversarial training with cycle consistency for unsupervised super-resolution in endomicroscopy,” Medical image analysis, vol. 53, pp. 123–131, 2019.
-  D. Nie, X. Cao, Y. Gao, L. Wang, and D. Shen, “Estimating ct image from mri data using 3d fully convolutional networks,” in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 170–178.
-  E. Kang, J. Min, and J. C. Ye, “A deep convolutional neural network using directional wavelets for low-dose x-ray ct reconstruction,” Medical physics, vol. 44, no. 10, 2017.
-  A. Benou, R. Veksler, A. Friedman, and T. R. Raviv, “Ensemble of expert deep neural networks for spatio-temporal denoising of contrast-enhanced mri sequences,” Medical image analysis, vol. 42, pp. 145–159, 2017.
-  H. Chen, Y. Zhang, W. Zhang, P. Liao, K. Li, J. Zhou, and G. Wang, “Low-dose ct via convolutional neural network,” Biomedical optics express, vol. 8, no. 2, pp. 679–694, 2017.
-  S. C. Karayumak, M. Kubicki, and Y. Rathi, “Harmonizing diffusion mri data across magnetic field strengths,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 116–124.
-  C. M. Tax, F. Grussu, E. Kaden, L. Ning, U. Rudrapatna, J. Evans, S. St-Jean, A. Leemans, S. Koppers, D. Merhof et al., “Cross-scanner and cross-protocol diffusion mri data harmonisation: A benchmark database and evaluation of algorithms,” NeuroImage, 2019.
-  J. Sun, H. Li, Z. Xu et al., “Deep admm-net for compressive sensing mri,” in Advances in neural information processing systems, 2016, pp. 10–18.
-  K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
-  K. Hammernik, T. Klatzer, E. Kobler, M. P. Recht, D. K. Sodickson, T. Pock, and F. Knoll, “Learning a variational network for reconstruction of accelerated mri data,” Magnetic resonance in medicine, vol. 79, no. 6, pp. 3055–3071, 2018.
-  J. Schlemper, J. Caballero, J. V. Hajnal, A. N. Price, and D. Rueckert, “A deep cascade of convolutional neural networks for dynamic mr image reconstruction,” IEEE transactions on medical imaging, vol. 37, no. 2, pp. 491–503, 2018.
-  B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, p. 487, 2018.
-  G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo et al., “Dagan: Deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1310–1321, 2018.
-  Y. H. Yoon, S. Khan, J. Huh, and J. C. Ye, “Efficient b-mode ultrasound image reconstruction from sub-sampled rf data using deep learning,” IEEE transactions on medical imaging, vol. 38, no. 2, pp. 325–336, 2019.
-  H. Sokooti, B. de Vos, F. Berendsen, B. P. Lelieveldt, I. Išgum, and M. Staring, “Nonrigid image registration using multi-scale 3d convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 232–239.
-  G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “An unsupervised learning model for deformable medical image registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9252–9260.
-  L. Wu, J.-Z. Cheng, S. Li, B. Lei, T. Wang, and D. Ni, “Fuiqa: Fetal ultrasound image quality assessment with deep convolutional networks,” IEEE transactions on cybernetics, vol. 47, no. 5, pp. 1336–1349, 2017.
-  S. J. Esses, X. Lu, T. Zhao, K. Shanbhogue, B. Dane, M. Bruno, and H. Chandarana, “Automated image quality evaluation of t2-weighted liver mri utilizing deep learning architecture,” Journal of Magnetic Resonance Imaging, vol. 47, no. 3, pp. 723–728, 2018.
-  J. P. Cohen, M. Luck, and S. Honari, “Distribution matching losses can hallucinate features in medical image translation,” arXiv preprint arXiv:1805.08841, 2018.
-  E. Begoli, T. Bhattacharya, and D. Kusnezov, “The need for uncertainty quantification in machine-assisted medical decision making,” Nature Machine Intelligence, vol. 1, no. 1, p. 20, 2019.
-  S. C. Hora, “Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management,” Reliability Engineering & System Safety, vol. 54, no. 2-3, pp. 217–223, 1996.
-  H. Wang, D. M. Levi, and S. A. Klein, “Intrinsic uncertainty and integration efficiency in bisection acuity,” Vision research, vol. 36, no. 5, pp. 717–739, 1996.
-  D. Draper, “Assessment and propagation of model uncertainty,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 45–70, 1995.
-  A. Der Kiureghian and O. Ditlevsen, “Aleatory or epistemic? does it matter?” Structural Safety, vol. 31, no. 2, pp. 105–112, 2009.
-  R. Tanno, D. E. Worrall, A. Ghosh, E. Kaden, S. N. Sotiropoulos, A. Criminisi, and D. C. Alexander, “Bayesian image quality transfer with cnns: Exploring uncertainty in dmri super-resolution,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 611–619.
-  A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems, 2017, pp. 5580–5590.
-  G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of control, signals and systems, vol. 2, no. 4, pp. 303–314, 1989.
-  D. A. Nix and A. S. Weigend, “Estimating the mean and variance of the target probability distribution,” in IEEE WCCI, vol. 1. IEEE, 1994, pp. 55–60.
-  D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in NIPS, 2015, pp. 2575–2583.
-  D. C. Alexander et al., “Image quality transfer via random forest regression: applications in diffusion MRI,” in MICCAI. Springer, 2014, pp. 225–232.
-  R. Tanno, A. Ghosh, F. Grussu, E. Kaden, A. Criminisi, and D. C. Alexander, “Bayesian image quality transfer,” in MICCAI. Springer, 2016, pp. 265–273.
-  D. C. Alexander, D. Zikic, A. Ghosh, R. Tanno, V. Wottschel, J. Zhang, E. Kaden, T. B. Dyrby, S. N. Sotiropoulos, H. Zhang et al., “Image quality transfer and applications in diffusion mri,” Neuroimage, vol. 152, pp. 283–298, 2017.
-  S. B. Blumberg, R. Tanno, I. Kokkinos, and D. C. Alexander, “Deeper image quality transfer: Training low-memory neural networks for 3d images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 118–125.
-  S. N. Sotiropoulos et al., “Advances in diffusion MRI acquisition and processing in the human connectome project,” Neuroimage, vol. 80, pp. 125–143, 2013.
-  M. P. Harms, L. H. Somerville, B. M. Ances, J. Andersson, D. M. Barch, M. Bastiani, S. Y. Bookheimer, T. B. Brown, R. L. Buckner, G. C. Burgess et al., “Extending the human connectome project across ages: Imaging protocols for the lifespan development and aging projects,” NeuroImage, vol. 183, pp. 972–984, 2018.
-  O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. de Marvao, T. Dawes, D. P. O’Regan et al., “Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation,” IEEE transactions on medical imaging, vol. 37, no. 2, pp. 384–395, 2018.
-  C. Zhao, A. Carass, B. E. Dewey, J. Woo, J. Oh, P. A. Calabresi, D. S. Reich, P. Sati, D. L. Pham, and J. L. Prince, “A deep learning based anti-aliasing self super-resolution algorithm for mri,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 100–108.
-  D. Mahapatra, B. Bozorgtabar, S. Hewavitharanage, and R. Garnavi, “Image super resolution using generative adversarial networks and local saliency maps for retinal image analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 382–390.
-  H. Yu, D. Liu, H. Shi, H. Yu, Z. Wang, X. Wang, B. Cross, M. Bramler, and T. S. Huang, “Computed tomography super-resolution using convolutional neural networks,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 3944–3948.
-  D. Nie, R. Trullo, J. Lian, L. Wang, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, “Medical image synthesis with deep convolutional adversarial networks,” IEEE Transactions on Biomedical Engineering, 2018.
-  J. M. Wolterink, A. M. Dinkla, M. H. Savenije, P. R. Seevinck, C. A. van den Berg, and I. Išgum, “Deep mr to ct synthesis using unpaired data,” in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2017, pp. 14–23.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
-  K. Bahrami, F. Shi, I. Rekik, and D. Shen, “Convolutional neural network for reconstruction of 7t-like images from 3t mri using appearance and anatomical features,” in MICCAI DLDLM workshop. Springer, 2016, pp. 39–47.
-  S. B. Blumberg, M. Palombo, C. S. Khoo, C. Tax, R. Tanno, and D. C. Alexander, “Multi-stage prediction networks for data harmonization,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019.
-  H. Shi, D. Worrall, B. Veeling, H. Huisman, and M. Welling, “Supervised uncertainty quantification for segmentation with multiple annotations,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019.
-  A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, “Unsupervised learning for fast probabilistic diffeomorphic registration,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
-  B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Advances in Neural Information Processing Systems, 2017, pp. 6402–6413.
-  J. Schlemper, G. Yang, P. Ferreira, A. Scott, L.-A. McGill, Z. Khalique, M. Gorodezky, M. Roehl, J. Keegan, D. Pennell et al., “Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac mri,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
-  F. J. Bragman, R. Tanno, Z. Eaton-Rosen, W. Li, D. J. Hawkes, S. Ourselin, D. C. Alexander, J. R. McClelland, and M. J. Cardoso, “Uncertainty in multitask learning: joint representations for probabilistic mr-only radiotherapy planning,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
-  T. Nair, D. Precup, D. L. Arnold, and T. Arbel, “Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 655–663.
-  Z. Eaton-Rosen, F. Bragman, S. Bisdas, S. Ourselin, and M. J. Cardoso, “Towards safe deep learning: accurately quantifying biomarker uncertainty in neural network predictions,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 691–699.
-  Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Insights and applications,” in Deep Learning Workshop, ICML, 2015.
-  A. G. Roy, S. Conjeti, N. Navab, C. Wachinger, A. D. N. Initiative et al., “Bayesian quicknat: Model uncertainty in deep whole-brain segmentation for structure-wise quality control,” NeuroImage, 2019.
-  D. E. Worrall, C. M. Wilson, and G. J. Brostow, “Automated retinopathy of prematurity case detection with convolutional neural networks,” in MICCAI DLDLM Workshop. Springer, 2016, pp. 68–76.
-  C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl, “Leveraging uncertainty information from deep neural networks for disease detection,” Scientific reports, vol. 7, no. 1, p. 17816, 2017.
-  M. S. Ayhan and P. Berens, “Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks,” 2018.
-  M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, S. Mullainathan, and J. M. Kleinberg, “Direct uncertainty prediction for medical second opinions,” 2018.
-  S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger, “A probabilistic u-net for segmentation of ambiguous images,” in Advances in Neural Information Processing Systems, 2018, pp. 6965–6975.
-  C. F. Baumgartner, K. C. Tezcan, K. Chaitanya, A. M. Hötker, U. J. Muehlematter, K. Schawkat, A. S. Becker, O. Donati, and E. Konukoglu, “PHiSeg: Capturing uncertainty in medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019.
-  V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1297–1322, 2010.
-  R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman, “Learning from noisy labels by regularized estimation of annotator confusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  X. Yang, R. Kwitt, and M. Niethammer, “Fast predictive image registration,” in MICCAI DLDLM Workshop. Springer, 2016, pp. 48–57.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016, pp. 1874–1883.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE PAMI, vol. 38, no. 2, pp. 295–307, 2016.
-  S. McDonagh, B. Hou, K. Kamnitsas, O. Oktay, A. Alansary, and B. Kainz, “Context-sensitive super-resolution for fast fetal magnetic resonance imaging,” arXiv preprint arXiv:1703.00035, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV. Springer, 2016, pp. 694–711.
-  M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in ICCV. IEEE, 2011, pp. 2018–2025.
-  A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
-  C. R. Rao, “Estimation of heteroscedastic variances in linear models,” Journal of the American Statistical Association, vol. 65, no. 329, pp. 161–172, 1970.
-  D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  R. M. Neal, “Bayesian learning via stochastic dynamics,” in Advances in neural information processing systems, 1993, pp. 475–482.
-  M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 681–688.
-  T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient hamiltonian monte carlo,” in International Conference on Machine Learning, 2014, pp. 1683–1691.
-  Y.-A. Ma, T. Chen, and E. Fox, “A complete recipe for stochastic gradient mcmc,” in Advances in Neural Information Processing Systems, 2015, pp. 2917–2925.
-  D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” arXiv preprint arXiv:1701.05369, 2017.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations, 2014.
-  N. A. Weiss, A course in probability. Addison-Wesley, 2006.
-  C. G. Bowsher and P. S. Swain, “Identifying sources of variation and the flow of information in biochemical networks,” Proceedings of the National Academy of Sciences, 2012.
-  D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium et al., “The wu-minn human connectome project: an overview,” Neuroimage, vol. 80, pp. 62–79, 2013.
-  M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni et al., “The minimal preprocessing pipelines for the human connectome project,” Neuroimage, vol. 80, pp. 105–124, 2013.
-  M. Figini, M. Riva, M. Graham, G. M. Castelli, B. Fernandes, M. Grimaldi, G. Baselli, F. Pessina, L. Bello, H. Zhang et al., “Prediction of isocitrate dehydrogenase genotype in brain gliomas with mri: single-shell versus multishell diffusion models,” Radiology, vol. 289, no. 3, pp. 788–796, 2018.
-  P. J. Basser, J. Mattiello, and D. LeBihan, “Mr diffusion tensor spectroscopy and imaging,” Biophysical journal, vol. 66, no. 1, pp. 259–267, 1994.
-  E. Özarslan, C. G. Koay, T. M. Shepherd, M. E. Komlosh, M. O. İrfanoğlu, C. Pierpaoli, and P. J. Basser, “Mean apparent propagator (map) mri: a novel diffusion imaging method for mapping tissue microstructure,” NeuroImage, vol. 78, pp. 16–32, 2013.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  Y. Gal, J. Hron, and A. Kendall, “Concrete dropout,” in Advances in Neural Information Processing Systems, 2017, pp. 3581–3590.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans Image Proc, vol. 13, no. 4, 2004.
-  E. Kaden, F. Kruggel, and D. C. Alexander, “Quantitative mapping of the per-axon diffusion coefficients in brain white matter,” Magnetic resonance in medicine, vol. 75, no. 4, pp. 1752–1763, 2016.
-  J. Tournier, F. Calamante, and A. Connelly, “Improved probabilistic streamlines tractography by 2nd order integration over fibre orientation distributions,” in ISMRM, 2010, p. 1670.
-  J.-D. Tournier, F. Calamante, and A. Connelly, “MRtrix: diffusion tractography in crossing fiber regions,” International Journal of Imaging Systems and Technology, vol. 22, no. 1, pp. 53–66, 2012.
-  C. M. Bishop, “Mixture density networks,” Citeseer, Tech. Rep., 1994.
-  A. Guzman-Rivera, D. Batra, and P. Kohli, “Multiple choice learning: Learning to produce multiple structured outputs,” in Advances in Neural Information Processing Systems, 2012, pp. 1799–1807.
-  D. Bouchacourt, P. K. Mudigonda, and S. Nowozin, “Disco nets: Dissimilarity coefficients networks,” in Advances in Neural Information Processing Systems, 2016, pp. 352–360.
-  H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 35–51.
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 172–189.
-  D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
-  G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autoregressive flow for density estimation,” in Advances in Neural Information Processing Systems, 2017, pp. 2338–2347.
-  A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2642–2651.
-  S. Chandra and I. Kokkinos, “Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs,” in European Conference on Computer Vision. Springer, 2016, pp. 402–418.
-  C. Louizos and M. Welling, “Structured and efficient variational deep learning with matrix Gaussian posteriors,” in International Conference on Machine Learning, 2016, pp. 1708–1716.
-  C. Oh, K. Adamczewski, and M. Park, “Radial and directional posteriors for Bayesian neural networks,” arXiv preprint arXiv:1902.02603, 2019.
-  D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville, “Bayesian hypernetworks,” arXiv preprint arXiv:1710.04759, 2017.
-  R. Zhang, C. Li, J. Zhang, C. Chen, and A. G. Wilson, “Cyclical stochastic gradient MCMC for Bayesian deep learning,” arXiv preprint arXiv:1902.03932, 2019.
-  N. Pawlowski, A. Brock, M. C. Lee, M. Rajchl, and B. Glocker, “Implicit weight uncertainty in neural networks,” arXiv preprint arXiv:1711.01297, 2017.
-  C. Louizos and M. Welling, “Multiplicative normalizing flows for variational Bayesian neural networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2218–2227.
-  B. Settles, “Active learning literature survey,” University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
-  Y. Gal, R. Islam, and Z. Ghahramani, “Deep Bayesian active learning with image data,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 1183–1192.
-  M. Gorriz, A. Carlier, E. Faure, and X. Giro-i Nieto, “Cost-effective active learning for melanoma segmentation,” arXiv preprint arXiv:1711.09168, 2017.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
-  K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert et al., “Unsupervised domain adaptation in brain lesion segmentation with adversarial networks,” in International conference on information processing in medical imaging. Springer, 2017, pp. 597–609.
-  N. Karani, K. Chaitanya, C. Baumgartner, and E. Konukoglu, “A lifelong learning approach to brain MR segmentation across scanners and protocols,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 476–484.
-  C. Baweja, B. Glocker, and K. Kamnitsas, “Towards continual learning in medical imaging,” arXiv preprint arXiv:1811.02496, 2018.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
-  C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, “Variational continual learning,” in International Conference on Learning Representations, 2018.
-  C. Finn, K. Xu, and S. Levine, “Probabilistic model-agnostic meta-learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9516–9527.
-  J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn, “Bayesian model-agnostic meta-learning,” in Advances in Neural Information Processing Systems, 2018, pp. 7332–7342.
-  D. Ma, V. Gulani, N. Seiberlich, K. Liu, J. L. Sunshine, J. L. Duerk, and M. A. Griswold, “Magnetic resonance fingerprinting,” Nature, vol. 495, no. 7440, p. 187, 2013.
-  O. Cohen, B. Zhu, and M. S. Rosen, “MR fingerprinting deep reconstruction network (DRONE),” Magnetic Resonance in Medicine, vol. 80, no. 3, pp. 885–894, 2018.
-  Y. Han, J. Yoo, H. H. Kim, H. J. Shin, K. Sung, and J. C. Ye, “Deep learning with domain adaptation for accelerated projection-reconstruction MR,” Magnetic Resonance in Medicine, vol. 80, no. 3, pp. 1189–1205, 2018.
-  F. Rousseau, “Brain hallucination,” in ECCV. Springer, 2008, pp. 497–508.
-  D. H. Ye, D. Zikic, B. Glocker, A. Criminisi, and E. Konukoglu, “Modality propagation: coherent synthesis of subject-specific scans with data-driven regularization,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 606–613.
-  A. Jog, A. Carass, S. Roy, D. L. Pham, and J. L. Prince, “MR image synthesis by contrast learning on neighborhood ensembles,” Medical image analysis, vol. 24, no. 1, pp. 63–76, 2015.
-  N. Burgos, M. J. Cardoso, F. Guerreiro, C. Veiga, M. Modat, J. McClelland, A.-C. Knopf, S. Punwani, D. Atkinson, S. R. Arridge et al., “Robust CT synthesis for radiotherapy planning: Application to the head and neck region,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 476–484.
-  H. Mirzaalian, L. Ning, P. Savadjiev, O. Pasternak, S. Bouix, O. Michailovich, G. Grant, C. Marx, R. A. Morey, L. Flashman et al., “Inter-site and inter-scanner diffusion MRI data harmonization,” NeuroImage, vol. 135, pp. 311–323, 2016.