Unsupervised Knowledge-Transfer for Learned Image Reconstruction

07/06/2021 ∙ Riccardo Barbano et al. ∙ UCL ∙ University of Oulu

Deep learning-based image reconstruction approaches have demonstrated impressive empirical performance in many imaging modalities. These approaches generally require a large amount of high-quality training data, which is often not available. To circumvent this issue, we develop a novel unsupervised knowledge-transfer paradigm for learned iterative reconstruction within a Bayesian framework. The proposed approach learns an iterative reconstruction network in two phases. The first phase trains a reconstruction network with a set of ordered pairs comprising ground truth images and measurement data. The second phase fine-tunes the pretrained network to the measurement data without supervision. Furthermore, the framework delivers uncertainty information over the reconstructed image. We present extensive experimental results on low-dose and sparse-view computed tomography, showing that the proposed framework significantly improves reconstruction quality not only visually, but also quantitatively in terms of PSNR and SSIM, and is competitive with several state-of-the-art supervised and unsupervised reconstruction techniques.


I Introduction

In the past few years, deep learning-based image reconstruction techniques have demonstrated remarkable empirical results, often outperforming more conventional methods [arridge2019solving, ongie2020deep]. A prominent class among existing approaches is deep unrolled methods [GregorLecun:2010, Monga:2021], which encompass learned iterative methods that replace components of well-established optimisation schemes (e.g. gradient descent [putzky2017recurrent, adler2017solving, hauptmann2018model], primal-dual methods [adler2018learned], or alternating direction method of multipliers [sun2016deep]) with deep neural networks (DNNs).

The medical imaging community has embraced deep unrolled optimisation schemes as a powerful tool to improve reconstruction quality and speed, with supervised learning rapidly becoming a workhorse in several imaging applications [WangYe:2020]. In supervised learning, a parametric model “learns” how to reconstruct images using a reference training dataset, which consists of ordered pairs of ground truth images and measurement data. This is different from more classical reconstruction techniques (e.g. variational methods [engl1996regularization, ItoJin:2015]), which typically rely on only a single noisy measurement. In contrast, learned reconstruction methods mostly require a large amount of ordered pairs of measurement data and (approximate) ground truth images, which are often of limited availability in the vast majority of medical imaging applications.

Reconstruction methods that learn in a scarce-data regime often fail to generalise to instances which belong to different data distributions [bickel2008learning, quinonero2009dataset]. Moreover, even small deviations from the training data distribution can potentially lead to severe reconstruction artefacts. This can be further exacerbated by the presence of structural changes such as rare pathologies. These “shifts” in data distribution can significantly degrade the performance of learned reconstruction methods [antun2020instabilities]. To make matters worse, such forms of deviation from the training data are ubiquitous in medical imaging, owing to factors such as changes in acquisition protocols. For example, in magnetic resonance imaging (MRI), these include factors such as echo time, repetition time, flip angle, and inherent hardware variations in the scanner used [KaraniKonukoglu:2021]; in computed tomography (CT), they include the choice of view angles, acquisition time per view, and source-target separation.

There is therefore an imperative need to develop learning-based methods for image reconstruction that do not rely on a large amount of high-quality ordered pairs of training data. In this work we develop a novel unsupervised knowledge-transfer strategy to transfer acquired “reconstructive knowledge” across different datasets using the Bayesian framework. The proposed framework falls into the class of deep unrolled methods, with the training process comprising two phases. The first phase is supervised and pretrains a reconstructor (a DNN) on data pairs of ground truth images and corresponding measurement data (either simulated or experimentally collected). The second phase is unsupervised and, at test-time, fine-tunes the reconstructor learned in the first phase on clinically-realistic measurement data, using a novel regularised Bayesian loss. Extensive numerical experiments with low-dose and sparse-view CT indicate that the proposed approach is competitive with state-of-the-art methods both quantitatively (in terms of PSNR and SSIM) and qualitatively, and that adaptation can significantly boost performance. To the best of our knowledge, this is the first work to propose Bayesian unsupervised knowledge-transfer for test-time adaptation of a learned iterative reconstruction method. Furthermore, the use of the Bayesian framework allows us to capture the predictive uncertainty of the reconstructions. Overall, our framework has the following distinct features: (i) adapting to unseen measurement data without the need for supervision (i.e. ground truth images); (ii) leveraging reconstructive properties learned in the supervised phase for feature representation; (iii) providing uncertainty estimates on reconstructed images.

The rest of the paper is structured as follows. In Section II, we survey related work. In Section III, we describe the setting, and discuss deep unrolled methods for image reconstruction and Bayesian deep learning. In Section IV, we develop the proposed two-phase unsupervised knowledge-transfer paradigm. In Section V, we present experimental results for low-dose and sparse-view CT, including several benchmarks. In Section VI, we discuss the results obtained with the two-phase learning paradigm, and in Section VII we give concluding remarks.

II Related Work

The lack of (a sufficient amount of) reference training data has only recently motivated the development of image reconstruction approaches that do not require ground truth images. Below we categorise these approaches into two main groups: test-time adaptation, and unsupervised approaches.

Test-time adaptation studies problems arising from learning under differing training and testing distributions. It often consists of fine-tuning a pretrained DNN for a single datum at a time, or for a small set of test instances. In [HanYooYe:2018, DarCukur:2020] this paradigm is used for MRI reconstruction, where reconstructive properties acquired by a network pretrained on a task with abundant data are transferred to a different task where data is scarce. Zhang et al. [ZhangWang:2020] propose to fine-tune the weights of a pretrained convolutional neural network (CNN) for each instance in the test dataset, by minimising an unsupervised data-fidelity term based on the forward model. Likewise, Sun et al. [SunWangHardt:2020] propose to adapt only a part of a CNN according to a self-supervised loss defined on the given test image. Gilton et al. [gilton2021model] adapt a pretrained image reconstruction network to reconstruct images from a perturbed forward model using only a small collection of measurements. Analogous to these studies, our approach conducts test-time adaptation, but within a Bayesian framework.

Unsupervised approaches operate with only measurements, and no ground truth image data at any stage of the training. Deep image prior (DIP) is a prominent member of this group, which achieves sample-specific performance using a DNN to describe the mapping from latent variables to high-quality images [Ulyanov:2018deepimage]. During inference, the network architecture acts as a regulariser for the reconstruction [DittmerKluthMaass:2020]. Despite its strong performance, DIP suffers from slow convergence (often requiring thousands of iterations) and the need for well-timed early stopping. The latter issue has motivated the use of an additional stabiliser (e.g. total variation) [baguer2020computed]. Other popular unsupervised methods build upon the Noise2Noise framework [lehtinen2018noise2noise], which conducts image denoising by training only on pairs of noisy images that correspond to the same ground truth image. Thereafter, Batson et al. [batson2019noise2self] demonstrated the feasibility of the framework for denoising using a self-prediction loss on a single noisy image, instead of pairs of noisy images. More recently, Hendriksen et al. [hendriksen2020noise2inverse, lagerwerf2020noise2filter] proposed a method that performs blind image denoising on reconstructed images. This class of methods operates in a post-processing manner, and thus differs substantially from the proposed unsupervised knowledge-transfer framework.

III Methods

III-A Problem Setup

In image reconstruction, we aim to recover an image $x \in X$ from a corrupted measurement $y \in Y$, where $X$ and $Y$ are suitable vector spaces. The process for acquiring $y$ is assumed to be modelled by a forward operator $A: X \to Y$ and additive noise $\delta \in Y$. In this paper, we make the simplifying assumption that $A$ is linear, leading to

$y = A x + \delta.$

In deep learning, reconstructing the original image from the corresponding measurement can be phrased as the problem of finding a DNN $F_\theta$ satisfying the pseudo-inverse property $F_\theta(y) \approx x$. The network $F_\theta$ is a mapping parametrised by a parameter vector $\theta$ of a possibly high dimension, and the task of training is to find a configuration of the parameters that yields an optimal reconstruction. In supervised learning this is achieved using a set of input-output pairs $\{(x_i, y_i)\}_{i=1}^N$ of ground truth images and the corresponding measurement data. Training then amounts to minimising a suitable loss:

$\min_\theta \sum_{i=1}^{N} \ell\big(F_\theta(y_i), x_i\big),$   (1)

where $\ell$ measures the discrepancy between the network output $F_\theta(y_i)$ and the corresponding ground truth image $x_i$.

Supervised learning is a predominant paradigm in deep learning-based image reconstruction techniques [arridge2019solving, ongie2020deep]. In order to deliver competitive performance, it requires many ordered pairs $(x_i, y_i)$, which are unavailable in many medical imaging applications [WangYe:2020]. Further, supervised models have been observed to exhibit poor generalisation in the presence of a small distributional shift [antun2020instabilities].

III-B Deep Unrolled Methods

Unrolling is a popular paradigm for constructing a network for medical image reconstruction. It is based on mimicking well-established iterative algorithms in optimisation [GregorLecun:2010, Monga:2021]. Namely, unrolling methods use an iterative procedure to reconstruct an image from the measurement $y$ by combining analytical model components (e.g. the forward map $A$ and its adjoint $A^*$) with data-driven components. This allows one to integrate the physics of the data acquisition process into the design of the network, which can help mitigate issues due to limited data size, as well as enabling the development of high-performance networks from reasonably sized training sets [Monga:2021]. In this work we consider unrolled gradient methods. Specifically, given an initial guess $x_0$ (e.g. the filtered back-projection (FBP) estimate in X-ray CT reconstruction), we compute iterates in the form of residual updates:

$x_{k+1} = \mathcal{P}\big[x_k + \lambda\, \Delta x_k\big], \quad k = 0, \ldots, K-1.$   (2)

Here $K$ is the total number of iterations, $\Delta x_k$ is the residual update computed by a DNN, $\mathcal{P}$ denotes the projection onto a convex feasibility constraint set (e.g. non-negativity), and $\lambda$ is a weighting parameter. To formulate a gradient-like learned iterative scheme, we absorb the residual update and the projection operator into the network architecture, and supply the network with the model information in the form of the gradient of the data-fidelity term $\frac{1}{2}\|A x - y\|^2$:

$\nabla_x \tfrac{1}{2}\|A x_k - y\|^2 = A^*(A x_k - y).$   (3)

We then write the $k$-th unrolled iteration as:

$x_{k+1} = G_{\theta_k}\big(x_k, A^*(A x_k - y)\big),$   (4)

where $G_{\theta_k}$ is the network used at the $k$-th iteration, and $\theta_k$ is the corresponding weight vector. This iterative process can be written as one network $F_\theta$ with weights $\theta = (\theta_0, \ldots, \theta_{K-1})$, consisting of sub-networks $G_{\theta_k}$. Note that between each sub-network one needs to compute the gradient $A^*(A x_k - y)$ to evaluate (4). The overall iterative process can be written as

$\hat{x} = F_\theta(y) = (G_{\theta_{K-1}} \circ \cdots \circ G_{\theta_0})(x_0),$

where $\hat{x}$ is the final reconstruction. In practice, the parameters of sub-networks are often shared [adler2017solving], i.e. $\theta_0 = \cdots = \theta_{K-1}$, which reduces the total number of trainable parameters and facilitates the training process.
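To make the unrolled scheme concrete, the following is a minimal PyTorch sketch of the shared-weight iteration (4). The toy blur standing in for the forward map $A$, the small CNN standing in for $G_\theta$, and all parameter values are illustrative assumptions, not the paper's implementation (which uses the CT projector and the architecture of Fig. 1).

import torch
import torch.nn as nn
import torch.nn.functional as F

class UnrolledGradientNet(nn.Module):
    # Sketch of x_{k+1} = P[x_k + G_theta(x_k, A^*(A x_k - y))], eq. (4),
    # with the weights of G_theta shared across the K iterations.
    def __init__(self, fwd, adj, K=5):
        super().__init__()
        self.fwd, self.adj, self.K = fwd, adj, K
        self.G = nn.Sequential(  # small stand-in for the residual-update CNN of Fig. 1
            nn.Conv2d(2, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, y, x0):
        x = x0
        for _ in range(self.K):
            grad = self.adj(self.fwd(x) - y)          # data-fidelity gradient, eq. (3)
            x = x + self.G(torch.cat([x, grad], 1))   # learned residual update
            x = x.clamp(min=0.0)                      # projection onto non-negativity
        return x

# toy usage: a symmetric blur is its own adjoint and stands in for A
kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)
fwd = lambda x: F.conv2d(x, kernel, padding=1)
net = UnrolledGradientNet(fwd, adj=fwd)
x0 = torch.rand(1, 1, 64, 64)                         # e.g. an FBP estimate
print(net(fwd(x0), x0).shape)                         # torch.Size([1, 1, 64, 64])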

III-C Bayesian Iterative Gradient Networks

In this work the network is learned in a Bayesian framework, as this provides principled mechanisms for knowledge integration and uncertainty quantification [gal2016uncertainty]. The weights $\theta$ of a Bayesian neural network (BNN) are treated as random variables. By placing a prior distribution $p(\theta)$ over $\theta$, and combining it with a likelihood function using Bayes’ formula, we obtain a posterior distribution $p(\theta \mid \mathcal{D})$ over $\theta$, given the data $\mathcal{D}$, which is the Bayesian solution of the learning task.

The posterior is often computationally intractable [graves2011practical, blundell2015weight, gal2016uncertainty]. To circumvent this issue, we adopt variational inference (VI), which is a widely used approximate inference scheme [jordan1999introduction] that employs the Kullback–Leibler (KL) divergence to construct an approximation. Recall that the KL divergence from $q$ to $p$ is defined by [kullback1951information]

$D_{\mathrm{KL}}(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, \mathrm{d}\theta.$

VI looks for an easy-to-compute approximate posterior $q_\phi(\theta)$ parametrised by variational parameters $\phi$. The approximation is most commonly taken from the variational family $\mathcal{Q}$ consisting of products of independent Gaussians:

$q_\phi(\theta) = \prod_{i=1}^{d} \mathcal{N}(\theta_i \mid \mu_i, \sigma_i^2),$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, $\phi = \{(\mu_i, \sigma_i^2)\}_{i=1}^{d}$ are the variational parameters, and $d$ is the total number of weights in $\theta$. VI constructs an approximation within $\mathcal{Q}$ by minimising

$\min_{q_\phi \in \mathcal{Q}}\ D_{\mathrm{KL}}\big(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D})\big).$   (5)

Once an optimal approximate posterior $q_{\phi^*}$ has been learned, the posterior of $x$ for a query measurement $y$ is given by

$p(x \mid y, \mathcal{D}) \approx \int p(x \mid y, \theta)\, q_{\phi^*}(\theta)\, \mathrm{d}\theta.$

An estimate of $p(x \mid y, \mathcal{D})$ can be obtained via Monte Carlo sampling:

$p(x \mid y, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(x \mid y, \theta^{(t)}),$

with $\theta^{(t)} \sim q_{\phi^*}(\theta)$ (i.e. samples from $q_{\phi^*}$) and $T$ being the number of Monte Carlo samples taken [gal2016uncertainty, (3.16)].
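As a minimal illustration of this predictive estimate, the sketch below draws weight samples from a factorised Gaussian approximate posterior and averages the resulting predictions. The single BayesLinear layer and its softplus parametrisation of the standard deviation are assumptions made for the sketch, not the BConv layers used in BDGD.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    # Mean-field Gaussian layer: each weight carries variational parameters
    # (mu, rho), with sigma = softplus(rho) keeping the std positive.
    def __init__(self, n_in, n_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)  # one sample theta^(t) ~ q_phi
        return x @ w.t()

layer = BayesLinear(8, 2)
x = torch.randn(1, 8)
T = 10                                               # number of Monte Carlo samples
preds = torch.stack([layer(x) for _ in range(T)])    # T stochastic forward passes
print(preds.mean(0), preds.var(0))                   # predictive mean and spread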

We now combine BNNs with the unrolled method in (4), as first presented in [Barbano:2020, barbano2020quantifying]. Therein it was termed Bayesian deep gradient descent (BDGD), a name we also adopt below, but the blocks are now trained end-to-end, instead of training one block at a time as in [barbano2020quantifying]. With the weights shared across the sub-networks, the overall iterative process reads $\hat{x} = (G_\theta \circ \cdots \circ G_\theta)(x_0)$, with the density $q_\phi(\theta)$ shared across the iterates.

Fig. 1: The architecture of $G_\theta$ is a downscaled version of a residual U-Net [ronneberger2015u] with two scales of 64 and 128 channels. Each box corresponds to a multi-channel feature map, with the number of channels indicated inside. The inputs (i.e. the current iterate, the data-fidelity gradient, and the feedback term) go through a contractive path of repeated applications of two Bayesian convolutional layers (BConv), each followed by group normalisation (GN) [wu2018group] and leaky ReLU (LR), with a maxpool operation in between; the maxpool halves the spatial resolution, resulting in a coarser scale. The expansive path consists of a transposed convolution (ConvT) with stride length 2, which doubles the spatial resolution of the feature map. The resulting feature map is then concatenated with the feature map from the contracting path, and further processed through a convolutional pipeline. The architecture then bifurcates into two identical convolutional pipelines with feature maps reduced to a single channel. The output of the first pipeline is added as a residual update to the input iterate, and projected onto the non-negative set to produce a new iterate. The second output is the feedback term. Both terms are recursively fed back until the final iteration is reached. The arrows denote different operations; those marked with the symbol “x2” indicate that the operation in question is repeated twice.

In BDGD, the architecture of $G_\theta$ is a downscaled version of a residual U-Net [ronneberger2015u] (cf. Fig. 1), and adopts a multi-scale encoder-decoder structure consisting of a contractive (i.e. encoding) and an expansive (i.e. decoding) component, whose weights are denoted by $\theta_E$ and $\theta_D$, respectively. Note that the training of fully Bayesian models is often nontrivial, and consequently, the performance of the resulting networks is often inferior to that of non-Bayesian networks [Osawa:2019]. To make our approach competitive with non-Bayesian methods, while retaining the benefits of Bayesian modelling, we follow the strategy of “being Bayesian only a little bit” [daxberger2020expressive, barbano2020quantifying]. Specifically, we use VI only on the weights of the encoder, which can be interpreted as choosing an approximate posterior $q_\phi(\theta_E)$ for the encoder weights $\theta_E$. The weights of the decoder remain point estimates (i.e. deterministic). This reduces the number of trainable parameters and hence facilitates the training process, while maintaining the Bayesian nature of the learning algorithm.

IV Two-Phase Learning

To address the challenges associated with the lack of supervised training data, we develop a novel unsupervised knowledge-transfer (UKT) strategy for the learned gradient method, which consists of two learning phases.

The first phase is supervised, and employs a given training dataset $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N}$, where each pair consists of a ground truth image and the corresponding (noisy) measurement datum, and can be either simulated or experimentally collected. The purpose of the first phase is to pretrain a reconstruction network to assist the unsupervised phase. This step has two goals: (i) identifying a sensible region for the network weights; (ii) learning robust representations that are not prone to overfitting. Ideally, this phase would mimic the target reconstruction problem in terms of the geometry of image acquisition, the noise distribution, etc. This would allow the network to learn adequate inductive biases and specific priors, in order to enable successful unsupervised learning.

The second phase is unsupervised, and has access to a dataset $\mathcal{D}_u$ which consists of only a few measurements (e.g. clinically-realistic CT sinograms). Moreover, the distribution of the measurement data in $\mathcal{D}_u$ differs from that in $\mathcal{D}_s$. The goal of this phase is to fine-tune the parameters of the reconstruction network, i.e. the variational parameters of the distribution over the encoder weights $\theta_E$ and the decoder weights $\theta_D$, so that the fine-tuned network performs well on the data $\mathcal{D}_u$. This is achieved by devising a novel loss function, using the Bayesian framework, and initialising the parameters of the reconstruction network to the optimal configuration found in the first phase. Through this phase, we address the need for adaptivity due to a distributional shift of the data.

Below we describe the details of the two phases.

IV-A Pretraining via Supervised Learning

In this phase we have access to a training dataset $\mathcal{D}_s$ of ordered pairs, and we employ the BDGD framework, described in Section III-C, to find the optimal distribution $q_{\phi^*}(\theta_E)$ that approximates the true posterior $p(\theta_E \mid \mathcal{D}_s)$ and the optimal decoder parameter $\theta_D^*$. To construct the posterior, the prior over the encoder weights is set to the standard Gaussian $\mathcal{N}(0, I)$. The likelihood is set to

$p(x \mid y, \theta) = \mathcal{N}\big(x \mid \hat{x},\ \operatorname{diag}(\hat{\sigma}^2)\big),$   (6)

with $\hat{x} = \hat{x}(y, x_0; \theta)$ and $\hat{\sigma}^2 = \hat{\sigma}^2(y, x_0; \theta)$, following the heteroscedastic noise model [nix1994estimating]. The outputs $\hat{x}$ and $\hat{\sigma}^2$, along with the feedback term, will be discussed below. Here $x_0$ denotes the initial guess for the learned gradient method for the training pair $(x, y)$. For example, in CT reconstruction, it is customarily taken to be the FBP estimate. Up to an additive constant, we can write:

$-\log p(x \mid y, \theta) = \frac{1}{2} \sum_{j} \Big( \log \hat{\sigma}_j^2 + \frac{(x_j - \hat{x}_j)^2}{\hat{\sigma}_j^2} \Big).$

The minimisation of the KL divergence in (5) can be recast as the minimisation of the following loss over $(\phi, \theta_D)$:

$\mathcal{L}(\phi, \theta_D) = -\sum_{i=1}^{N} \mathbb{E}_{q_\phi(\theta_E)}\big[\log p(x_i \mid y_i, \theta)\big] + \tau\, D_{\mathrm{KL}}\big(q_\phi(\theta_E) \,\|\, p(\theta_E)\big),$

where $\tau$ is a regularisation parameter. This loss coincides with the negative value of the evidence lower bound (ELBO) in VI (when $\tau = 1$). Note that the KL term affects only the encoder weights (since the decoder weights are treated deterministically). To compute the gradient of the loss we use the local reparametrisation trick [kingma2015variational], which exploits the deterministic dependence of the ELBO on $\phi$.
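A minimal sketch of this loss is given below, assuming the network outputs a mean and a per-pixel log-variance; the helper names are ours, and the closed-form KL applies to the factorised Gaussian posterior with standard Gaussian prior used here.

import torch

def gaussian_nll(x, x_hat, log_var):
    # heteroscedastic Gaussian negative log-likelihood (up to a constant):
    # the network predicts both the mean x_hat and the log-variance log_var
    return 0.5 * (log_var + (x - x_hat) ** 2 / log_var.exp()).sum()

def kl_to_standard_normal(mu, sigma):
    # closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over the encoder weights
    return (0.5 * (sigma ** 2 + mu ** 2 - 1.0) - sigma.log()).sum()

def supervised_loss(x, x_hat, log_var, mu, sigma, tau=1.0):
    # negative ELBO of the first phase: data term plus tau-weighted KL
    return gaussian_nll(x, x_hat, log_var) + tau * kl_to_standard_normal(mu, sigma)

With tau set to 1 this recovers the negative ELBO stated above.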

BDGD provides a natural means to quantify not only the predictive uncertainty associated with a given reconstruction, but also the sources from which the predictive uncertainty arises. Uncertainty is typically categorised into aleatoric and epistemic uncertainty [hora1996aleatory, ayyub2006uncertainty, Kendall:2017, BarbanoArridgeJinTanno:2021]. Epistemic uncertainty arises from the over-parameterisation of the network (i.e. the number of network weights exceeds the size of the training data), and is described by the posterior over the weights [blundell2015weight, Kendall:2017]. Aleatoric uncertainty is, instead, caused by the randomness in the data generation process. To account for this, we employ a heteroscedastic noise model [nix1994estimating] in (6), which sets the likelihood to be a Gaussian distribution, with both its mean and variance predicted by the network. Accordingly, we adjust the network architecture by bifurcating the decoder output. At each iteration, the first output is used to update the estimate, whilst the second, intermediate output, which embodies a form of “information transmission”, is fed back to the next iteration; at the final iteration, it provides the estimate of the variance of the likelihood.

Following [depeweg2018decomposition], we can decompose the (entry-wise) predictive variance into aleatoric and epistemic uncertainties using the law of total variance:

$\operatorname{Var}[x \mid y, \mathcal{D}] = \underbrace{\mathbb{E}_{q_{\phi^*}}\big[\hat{\sigma}^2\big]}_{\text{aleatoric}} + \underbrace{\operatorname{Var}_{q_{\phi^*}}\big[\hat{x}\big]}_{\text{epistemic}}.$

Denoting the initial guesses for the mean and the variance for a query datum $y$ by $x_0$ and $\sigma_0^2$, and abbreviating $\hat{x}(y, x_0; \theta^{(t)})$ as $\hat{x}^{(t)}$ and $\hat{\sigma}^2(y, \sigma_0^2; \theta^{(t)})$ as $\hat{\sigma}^{2,(t)}$, we estimate the two terms by Monte Carlo samples as:

$\mathbb{E}_{q_{\phi^*}}[\hat{\sigma}^2] \approx \frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}^{2,(t)}, \qquad \operatorname{Var}_{q_{\phi^*}}[\hat{x}] \approx \frac{1}{T}\sum_{t=1}^{T} \big(\hat{x}^{(t)}\big)^2 - \Big(\frac{1}{T}\sum_{t=1}^{T} \hat{x}^{(t)}\Big)^2,$

where all the operations are understood entry-wise.
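In code, this decomposition reduces to a few stochastic forward passes, as in the sketch below; it assumes a model that returns the predicted mean and variance for each posterior weight sample.

import torch

@torch.no_grad()
def decompose_uncertainty(model, y, x0, T=10):
    # law of total variance: aleatoric = mean of the predicted variances,
    # epistemic = variance of the predicted means, both entry-wise
    means, variances = [], []
    for _ in range(T):
        m, v = model(y, x0)                  # one posterior sample theta^(t) ~ q_phi
        means.append(m)
        variances.append(v)
    means = torch.stack(means)
    variances = torch.stack(variances)
    aleatoric = variances.mean(0)            # E_q[sigma_hat^2]
    epistemic = means.var(0, unbiased=False) # Var_q[x_hat]
    return aleatoric, epistemic              # their sum is the predictive variance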

IV-B Unsupervised Knowledge-Transfer

In this phase we integrate the knowledge learned in the first stage for new imaging data for which we do not have access to the ground truth images. Note that the knowledge of the trained network (on the supervised data $\mathcal{D}_s$) is encoded in the distribution $q_{\phi^*}(\theta_E)$ and in the optimal deterministic weights $\theta_D^*$. The goal of the second phase is to approximate the true posterior $p(\theta_E \mid \mathcal{D}_u, \mathcal{D}_s)$ and to find the updated optimal decoder weights, given the measurement data $\mathcal{D}_u$ (for which we do not have the ground truth images) and the supervised data $\mathcal{D}_s$ from the first phase. This can be achieved as follows. By Bayes’ formula, the posterior distribution is given by

$p(\theta_E \mid \mathcal{D}_u, \mathcal{D}_s) = \frac{p(\mathcal{D}_u \mid \theta_E)\, p(\theta_E \mid \mathcal{D}_s)}{p(\mathcal{D}_u \mid \mathcal{D}_s)}.$

Here $p(\mathcal{D}_u \mid \theta_E)$ is the likelihood at test-time, and the normalising constant $p(\mathcal{D}_u \mid \mathcal{D}_s)$ is the marginal likelihood of the observed data $\mathcal{D}_u$. We approximate $p(\theta_E \mid \mathcal{D}_s)$ by the estimated optimal posterior $q_{\phi^*}(\theta_E)$, which is learned in the first phase, thus encapsulating the “proxy” knowledge we have acquired from the simulated dataset $\mathcal{D}_s$. An approximation to the true posterior (for the new data $\mathcal{D}_u$) can be obtained using VI:

$\min_{q_{\tilde{\phi}} \in \mathcal{Q}}\ \mathcal{L}(\tilde{\phi}, \theta_D),$

where the objective function is given by

$\mathcal{L}(\tilde{\phi}, \theta_D) = -\sum_{y \in \mathcal{D}_u} \mathbb{E}_{q_{\tilde{\phi}}(\theta_E)}\big[\log p(y \mid \theta)\big] + D_{\mathrm{KL}}\big(q_{\tilde{\phi}}(\theta_E) \,\|\, q_{\phi^*}(\theta_E)\big).$   (7)

By construction, the approximate posterior $q_{\phi^*}$ over the supervised data is used as the prior in the second phase. The likelihood for a measurement $y \in \mathcal{D}_u$ is set to

$p(y \mid \theta) = \mathcal{N}\big(y \mid A\hat{x}(y, x_0; \theta),\ \sigma^2 I\big).$

Let $\hat{x} = \hat{x}(y, x_0; \theta)$ and $\bar{x} = \mathbb{E}_{q_{\tilde{\phi}}}[\hat{x}]$. The standard bias-variance decomposition then implies that the quadratic data-fidelity term can be decomposed as:

$\mathbb{E}_{q_{\tilde{\phi}}}\big[\|A\hat{x} - y\|^2\big] = \|A\bar{x} - y\|^2 + \mathbb{E}_{q_{\tilde{\phi}}}\big[\|A(\hat{x} - \bar{x})\|^2\big].$

In practice, the variance term $\mathbb{E}_{q_{\tilde{\phi}}}[\|A(\hat{x} - \bar{x})\|^2] = \operatorname{tr}\big(A \operatorname{Cov}_{q_{\tilde{\phi}}}[\hat{x}]\, A^*\big)$ can be estimated efficiently using randomised trace estimators [bujanovic2021norm]. Computing the optimal variational parameters and the optimal decoder parameter by minimising the negative value of the ELBO proceeds exactly as in the supervised phase, but with the key changes outlined above.
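As an illustration of the randomised estimator, the sketch below implements a generic Hutchinson-type trace estimate for a matrix-free operator; the probe count and the interface are assumptions, and the variance term above corresponds to the operator $v \mapsto A \operatorname{Cov}[\hat{x}] A^* v$.

import torch

def hutchinson_trace(op, dim, n_probes=16):
    # tr(M) ~ (1/S) sum_s v_s^T M v_s with Rademacher probes v_s (+/-1 entries);
    # `op` applies M to a vector without forming M explicitly
    est = torch.tensor(0.0)
    for _ in range(n_probes):
        v = (torch.randint(0, 2, (dim,)) * 2 - 1).float()
        est = est + torch.dot(v, op(v))
    return est / n_probes

# toy check: the estimate roughly agrees with the exact trace of a random SPD matrix
M = torch.randn(50, 50)
M = M @ M.t()
print(hutchinson_trace(lambda v: M @ v, 50, n_probes=1000), torch.trace(M))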

In addition to enforcing data-fidelity, we also include a regularisation term in the loss in (7). This incorporates prior knowledge of the expected reconstructed images by penalising unlikely or undesirable solutions:

$\mathcal{R}(x) = \gamma\, \mathrm{TV}(x),$

where $\mathrm{TV}(x)$ is the total variation seminorm $\mathrm{TV}(x) = \|\nabla x\|_1$, and $\gamma$ is the regularisation parameter. TV is widely used in image reconstruction, due to its edge-preserving property [RudinOsherFatemi:1992], and has also been applied to learned reconstructions [baguer2020computed, Cascarano:2020]. In summary, the loss at the second phase reads:

$\mathcal{L}(\tilde{\phi}, \theta_D) = \|A\bar{x} - y\|^2 + \mathbb{E}_{q_{\tilde{\phi}}}\big[\|A(\hat{x} - \bar{x})\|^2\big] + \gamma\, \mathrm{TV}(\bar{x}) + D_{\mathrm{KL}}\big(q_{\tilde{\phi}}(\theta_E) \,\|\, q_{\phi^*}(\theta_E)\big).$   (8)

The first term in (8) enforces data-fidelity, which encourages the learned network to be close to the right-inverse of $A$, i.e. the action of the forward map on the output of the network is close to the measurement data $y$. The second term controls the growth of the variance, and along with the first term arises naturally when performing VI (with a Gaussian likelihood). The third term, the TV regulariser, plays a crucial role in stabilising the learning process. The fourth term keeps the posterior close to the posterior obtained during the supervised phase. These properties together give rise to a highly flexible unsupervised knowledge-transfer paradigm, which can be directly extended to streaming data.
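The sketch below assembles the loss (8) from precomputed ingredients, using an anisotropic TV for simplicity; the weighting value, input names, and the assumption that the data-fidelity, variance, and KL terms are computed elsewhere are all ours.

import torch

def tv_seminorm(x):
    # anisotropic total variation of an image batch (B, 1, H, W):
    # sum of absolute finite differences along both axes
    dh = (x[..., :, 1:] - x[..., :, :-1]).abs().sum()
    dv = (x[..., 1:, :] - x[..., :-1, :]).abs().sum()
    return dh + dv

def ukt_loss(y, Ax_mean, var_term, x_mean, kl_term, gamma=1e-4):
    # second-phase loss (8): data-fidelity on the mean reconstruction,
    # variance control, TV regularisation, and KL to the first-phase posterior
    fidelity = ((Ax_mean - y) ** 2).sum()
    return fidelity + var_term + gamma * tv_seminorm(x_mean) + kl_term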

V Experiments and Results

V-A Datasets and (Noisy) Data Generation

We use the following two datasets in the experiments.

Fig. 2: Representative ground truth images from Ellipses (left) and LoDoFanB (right) datasets. The window is set to a Hounsfield unit (HU) range [-1000, 400].

V-A1 Ellipses Dataset

The Ellipses dataset consists of random phantoms of overlapping ellipses, and is commonly used for inverse problems in imaging [adler2017solving]. The background intensity is fixed, the intensity of each ellipse is drawn at random, and the intensities are added up in regions where multiple ellipses overlap. The phantoms all have the same size; see Fig. 2 for a representative phantom. The training set contains pairs of phantoms and sinograms, while the test set consists of 128 pairs. This dataset is used for the training of all the methods that involve supervised training.

V-A2 LoDoFanB Dataset

This (clinically realistic) dataset consists of human chest CT scans, taken from [leuschner2019lodopab], in which the (original) slices from the LIDC/IDRI Database [armato2004lung] have been pre-processed; see Fig. 2 for a representative slice. This dataset is used in the unsupervised phase, where we assume to know only the sinograms. The ground truth images are used only to evaluate the performance of the studied methods.

For the forward map $A$, taken to be the Radon transform, we employ a two-dimensional fan-beam geometry, with a dense set of view angles for the low-dose CT case and a reduced set of view angles for the sparse-view CT case. The source-to-axis and axis-to-detector distances are fixed. For both datasets, we apply a corruption process given by

$\tilde{y} \sim \mathrm{Pois}\big(N_0 \exp(-\mu\, A x)\big),$

where $N_0$ is the mean number of photons per pixel (fixed at a value corresponding to low-dose CT), and $\mu$ is the attenuation coefficient. Following [adler2018learned], we linearise the forward model by applying the post-log transform $y = -\log(\tilde{y}/N_0)/\mu$. We can use a quadratic data-fidelity term since post-log measurements of low-dose CT approximately follow a Gaussian distribution [wang2006noise, li2004nonlinear].
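For reference, a sketch of this corruption-plus-linearisation pipeline is given below; the photon count and attenuation coefficient in the toy usage are placeholders, not the experimental values.

import torch

def simulate_low_dose(sinogram, n_photons, mu=1.0):
    # Poisson photon counts around N0 * exp(-mu * A x), followed by the
    # post-log linearisation, so the output is approximately A x plus noise
    counts = torch.poisson(n_photons * torch.exp(-mu * sinogram))
    counts = counts.clamp(min=1.0)               # guard against log(0)
    return -torch.log(counts / n_photons) / mu

y_clean = torch.rand(1, 200, 400)                # toy sinogram (angles x detector bins)
y_noisy = simulate_low_dose(y_clean, n_photons=4096.0)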

                              Low-Dose CT                    Sparse-View CT
Methods                       Ellipses      LoDoFanB        Ellipses      LoDoFanB      Parameters   Runtime
Unsupervised
  FBP                         28.50/0.844   33.01/0.842     26.74/0.718   29.10/0.594   1            38ms/7ms
  TV                          33.41/0.878   36.55/0.869     30.98/0.869   34.74/0.834   1            20s/10s
  DIP+TV                      34.53/0.957   39.32/0.896     32.02/0.931   36.80/0.866   —            20min/18min
Supervised
  FBP+U-Net                   36.63/0.946   33.14/0.852     31.75/0.900   24.99/0.673   —            — ms
  LGD                         40.09/0.985   32.67/0.849     33.52/0.951   33.64/0.812   —            — ms/— ms
  LPD                         43.25/0.993   33.49/0.868     34.35/0.960   31.79/0.807   —            180ms/55ms
  BDGD                        42.70/0.993   35.26/0.865     34.97/0.969   31.11/0.704   —            7s/6s
  BDGD+UKT                    —             38.29/0.898     —             35.71/0.854   —            7s/6s

TABLE I: Comparison of reconstruction methods for the Ellipses and LoDoFanB datasets by average PSNR (dB)/SSIM. All supervised methods are trained on the Ellipses dataset. The learned models are then tested on the LoDoFanB dataset. For each method, the approximate runtime for both low-dose and sparse-view CT, and the number of learnable parameters, are also indicated.

V-B Benchmarks

We compare the proposed BDGD+UKT with the following supervised and unsupervised methods.

V-B1 Unsupervised Methods

These include FBP (using a Hann filter with a low-pass cut-off of 0.6), (isotropic) TV regularisation [RudinOsherFatemi:1992], and DIP+TV [baguer2020computed].

V-B2 Supervised Methods

These include U-Net based post-processing (FBP+U-Net) [chen2017low]; two learned iterative schemes, the learned gradient descent (LGD) [adler2017odl] and the learned primal-dual (LPD) method [adler2018learned]; and BDGD (i.e. without UKT). U-Net is widely used for post-processing (e.g. denoising and artefact removal), including of FBP estimates [jin2017deep], and our implementation follows [baguer2020computed], using a slightly down-scaled version of the standard U-Net. LGD and LPD are widely used, with the latter often regarded as the gold standard for supervised deep tomographic reconstruction. BDGD exhibits competitive performance while being a Bayesian method [Barbano:2020, barbano2020quantifying].

All supervised methods are first trained on the Ellipses dataset, and then tested on both the Ellipses and LoDoFanB datasets. The learned models are not adapted to the LoDoFanB dataset, but perform reconstruction directly on a given LoDoFanB sinogram.

V-C Implementation

The methods were implemented in PyTorch, and trained on a GeForce GTX 1080 Titan GPU. All operator-related components (e.g. the forward operator, the adjoint, and FBP) are implemented using the Operator Discretisation Library [adler2017odl] with the astra_gpu backend [van2015astra].

For the unsupervised methods (FBP, TV, DIP+TV), the hyperparameters (the frequency scaling in FBP and the regularisation parameter in TV and DIP+TV) are selected to maximise the PSNR on a subset of 5 images. DIP+TV adopts the U-Net architecture proposed in [baguer2020computed] (accessible in the DIVal library [leuschner2019]), i.e. a 5-scale U-Net without skip connections for the Ellipses dataset, and a 6-scale U-Net with skip connections only at the last two scales for the LoDoFanB dataset. For both architectures, the number of channels is set to 128 at every scale. In Table I, we report the number of parameters used for the LoDoFanB dataset.

All learned reconstruction methods were trained until convergence on the Ellipses dataset. FBP+U-Net implements a down-sized U-Net architecture with 4 scales and skip connections at each scale. LGD is implemented as in [adler2018learned], where the weights of the reconstructor are not shared across the iterates, and the number of unrolled iterations is fixed. LPD follows the implementation in [adler2018learned]. We train FBP+U-Net, LGD and LPD by minimising the loss in (1) using the Adam optimiser and a learning rate schedule according to cosine annealing [loshchilov2016sgdr]. BDGD uses a multi-scale convolutional architecture (cf. Fig. 1) with a fixed number of unrolled iterations. Furthermore, the UKT phase is initialised with the parameters obtained at the end of the supervised training on the Ellipses dataset. 10 Monte Carlo samples are used to reconstruct each image and to compute uncertainty estimates. The implementation will be made public on GitHub.
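A minimal sketch of this optimiser set-up, using PyTorch's built-in cosine annealing schedule [loshchilov2016sgdr]; the model, learning rate, and epoch count are placeholders, not the paper's values.

import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)      # stand-in for the reconstructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
for epoch in range(100):
    # ... compute the loss in (1) on a batch and call loss.backward() here ...
    opt.step()
    sched.step()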

V-D Runtime

Table I reports the approximate runtime for all the methods under consideration. All learned iterative methods (i.e. LGD, LPD, BDGD) require multiple calls of the forward operator $A$, and are thus slower at test time than methods that do not (e.g. FBP+U-Net). In addition, BDGD and BDGD+UKT use 10 Monte Carlo samples to obtain a single reconstruction, leading to a slightly longer reconstruction time of approximately 7s per image. However, all learned methods are found to be faster than TV reconstruction. Meanwhile, DIP+TV is much slower than TV: it takes approximately 20 minutes to reconstruct a single instance of the LoDoFanB dataset.

V-E Results

In Table I we report PSNR and SSIM values for the Ellipses and LoDoFanB datasets. We observe that unsupervised methods give higher PSNR/SSIM values on the LoDoFanB dataset than on the Ellipses dataset, and that the converse is true for supervised methods. Moreover, TV and DIP+TV outperform supervised reconstruction methods in both the low-dose and the sparse-view settings on the LoDoFanB dataset.

The results for BDGD+UKT and BDGD indicate that adapting the weights on the LoDoFanB dataset allows us to achieve a noticeable improvement in reconstruction quality in both low-dose and sparse-view settings. Note also that BDGD+UKT outperforms all supervised reconstruction methods, while performing on par with DIP+TV.

Example reconstructed images are shown in Figs. 3 and 4, for the low-dose and the sparse-view case, respectively. We observe that BDGD+UKT significantly reduces background noise in the reconstructions, while faithfully capturing finer details, particularly in the low-dose case. Overall, DIP+TV and BDGD+UKT produce reconstructions with similar properties. However, DIP+TV, LPD and BDGD+UKT tend to suffer from slight over-smoothing. Meanwhile, TV reconstruction suffers from patchy artefacts, its well-known drawback [BrediesHoller:2020], and retains background noise.

The sparse-view setting in Fig. 4 is more challenging, and the reconstructions are susceptible to streak artefacts, which are especially pronounced in the FBP reconstruction but still discernible in the reconstructions with other methods. Nonetheless, the best performing methods (DIP+TV and BDGD+UKT) achieve an excellent compromise between smoothing and the removal of streak artefacts. Surprisingly, FBP+U-Net “hallucinates” a bone-like structure in the reconstruction, probably induced by the pretraining on the Ellipses dataset, clearly indicating the risk of performing learned reconstructions on data that have undergone a distributional shift.

BDGD+UKT also provides useful uncertainty information on the reconstructions. In Fig. 5, we present the uncertainties along with pixel-wise errors for both the low-dose and sparse-view cases. In both cases, epistemic uncertainty dominates within the (overall) predictive uncertainty, which largely concentrates around the edges (i.e. the reconstruction of sharp edges exhibits a higher degree of uncertainty). Further, the overall shape closely resembles the pixel-wise error, indicating that the uncertainty estimate can potentially be used as an error indicator, concurring with existing empirical findings [tanno2017bayesian].

[Figure 3 panels, each shown with a zoomed inset: ground truth Human Chest CT; FBP (PSNR: 33.97 dB, SSIM: 0.892); TV (PSNR: 37.71 dB, SSIM: 0.912); DIP+TV (PSNR: 43.16 dB, SSIM: 0.971); FBP+U-Net (PSNR: 34.21 dB, SSIM: 0.911); LGD (PSNR: 34.68 dB, SSIM: 0.927); LPD (PSNR: 34.68 dB, SSIM: 0.942); BDGD (PSNR: 37.25 dB, SSIM: 0.933); BDGD+UKT (PSNR: 41.31 dB, SSIM: 0.965).]

Fig. 3: Low-dose human chest CT reconstruction from the LoDoFanB dataset, along with a zoomed region indicated by a small square. The window is set to a HU range of [-1000, 400].

[Figure 4 panels, each shown with a zoomed inset: ground truth Human Chest CT; FBP (PSNR: 30.34 dB, SSIM: 0.677); TV (PSNR: 35.81 dB, SSIM: 0.895); DIP+TV (PSNR: 39.48 dB, SSIM: 0.947); FBP+U-Net (PSNR: 26.61 dB, SSIM: 0.728); LGD (PSNR: 35.13 dB, SSIM: 0.888); LPD (PSNR: 32.95 dB, SSIM: 0.883); BDGD (PSNR: 32.03 dB, SSIM: 0.762); BDGD+UKT (PSNR: 37.49 dB, SSIM: 0.929).]

Fig. 4: Sparse-view human chest CT reconstruction from the LoDoFanB dataset, along with a zoomed region indicated by a small square. The window is set to a HU range of [-1000, 400].

[Figure 5 panels, for low-dose CT (top row) and sparse-view CT (bottom row): Error, Uncertainty, Aleatoric, Epistemic.]

Fig. 5: The pixel-wise reconstruction error, the (max-min normalised) predictive uncertainty, and its decomposition into the aleatoric and epistemic constituent components, for low-dose and sparse-view CT, obtained by BDGD+UKT.

VI Discussion

The experimental results in Table I have several implications for image reconstruction. First, they show that while supervised learned methods (FBP+U-Net, LGD, and LPD) can deliver impressive results when trained and tested on imaging datasets of identical distributions, they fail to carry this performance over to data from a different distribution. Specifically, on the Ellipses dataset they vastly outperform the traditional FBP and TV, but on the LoDoFanB dataset the difference between the learned methods and FBP nearly vanishes (particularly in the low-dose case), and standard TV actually performs better. This behaviour might be due to a form of bias-variance trade-off, where training with a large training set improves performance in the supervised case, but has a negative effect on the generalisation property; it results in a loss of flexibility, and underwhelming performance, for images of a different type. Thus, adjusting the training regimen, or further adapting the network weights to data from a different distribution, is beneficial for improving the reconstruction quality.

Overall, the results show that Bayesian neural networks with VI can deliver strong performance that is competitive with deterministic reconstruction networks. This can first be observed on the Ellipses dataset, where BDGD performs on par with or slightly better than all the unsupervised and supervised methods under consideration, in agreement with previous experimental findings [Barbano:2020, barbano2020quantifying]. The results also show the potential of the Bayesian UKT framework for medical image reconstruction in the more challenging setting where ground truth images are unavailable. Namely, adapting the model through the described framework allows us to achieve a significant performance boost on the LoDoFanB dataset. Moreover, BDGD+UKT shows roughly the same performance as DIP+TV, while being significantly faster in terms of runtime, cf. Table I. Indeed, all the learned methods are faster than TV and DIP+TV reconstructions.

The experimental results indicate that UKT shows great promise in the unsupervised setting. The results clearly show the need to adapt data-driven approaches to changes in the data, its distribution and size, and to incorporate the insights observed in the supervised data to regularly update the reconstruction model [parisi2019continual, mundt2020wholistic]. Though only conducted on labelling tasks, recent studies show that transfer learning through pretraining gives good results when the difference between the data distributions is small [yang2020transfer]. Moreover, one needs to ensure that pretraining does not result in overfitting to the data from the first task. Both requirements seem to be satisfied in the studied setting: on the one hand, the two tasks share the forward model and differ only in the distribution of the ground truth images; on the other hand, the network still manages to improve its performance through the adaptation stage. Further investigation is needed to examine how the performance of a reconstruction network changes with the size and type of the pretraining dataset.

The use of a fully Bayesian treatment for learned medical image reconstruction methods is still largely under development, due to training challenges [BarbanoArridgeJinTanno:2021]. The proposed BDGD+UKT is very promising in that: (i) it is easy to train, due to the adoption of the strategy of “being Bayesian only a little bit”; (ii) the performance of the obtained point estimates is competitive with benchmark methods; (iii) it also delivers predictive uncertainty. In particular, as in the prior study (with a different method) [tanno2017bayesian], the numerical results indicate that the predictive uncertainty can be used as an error indicator.

VII Conclusion

In this work we have presented a novel two-phase learning framework, termed UKT, to address the lack of a sufficiently large amount of paired training data in learned image reconstruction. The framework consists of two learning phases, both within a Bayesian framework: it first pretrains a learned iterative reconstructor on (simulated) ordered pairs, and then, at test-time, fine-tunes the network weights to realise sample-wise adaptation using only noisy measurements. Extensive experiments on low-dose and sparse-view CT reconstructions show that the approach is very promising: it achieves competitive performance with several state-of-the-art supervised and unsupervised approaches, both qualitatively and quantitatively.

References