On instabilities of deep learning in image reconstruction - Does AI come at a cost?

02/14/2019 ∙ by Vegard Antun, et al. ∙ University of Cambridge

Deep learning, due to its unprecedented success in tasks such as image classification, has emerged as a new tool in image reconstruction with potential to change the field. In this paper we demonstrate a crucial phenomenon: deep learning typically yields unstable methods for image reconstruction. The instabilities usually occur in several forms: (1) tiny, almost undetectable perturbations, both in the image and sampling domain, may result in severe artefacts in the reconstruction, (2) a small structural change, for example a tumour, may not be captured in the reconstructed image and (3) (a counterintuitive type of instability) more samples may yield poorer performance. Our new stability test, with accompanying algorithms and easy-to-use software, detects these instability phenomena. The test is aimed at researchers who wish to test their networks for instabilities, and at government agencies, such as the Food and Drug Administration (FDA), that wish to secure safe use of deep learning methods.




1 Overview

The Supplementary Information (SI) contains the additional material on neural networks needed to understand and reproduce the experiments in the paper. In particular, the SI describes the variety of architectures and training sets used in the various experiments. The neural networks considered are:

  • AUTOMAP [Rosen_Nature]: The AUTOMAP neural network we test is for low resolution single coil MRI with 60% subsampling. The paper [Rosen_Nature] mentions 40% subsampling, but this apparent discrepancy is simply due to different interpretations of the word subsampling. We use the traditional meaning in sampling theory, where x% subsampling means that the total number of samples used is x% of full sampling. The actual network used in our experiment was trained by the authors of [Rosen_Nature] and provided through private communication. The details of the architecture and training data are summarised in §2.1.

  • DAGAN [DAGAN]: This network is for medium resolution single coil MRI with 20% subsampling. The network weights are not available online; however, complete instructions on how to reproduce the network used in [DAGAN] are accessible, and the advice from the authors of [DAGAN] (through private communication) was to follow these guidelines to obtain the network. Thus, based on these instructions, we have retrained a network that reproduces the results in [DAGAN]. The details of the training data and architecture are summarised in §2.2.

  • Deep MRI [DeepMRI2d]: This neural network is for medium resolution single coil MRI with 33% subsampling. The network used in our experiments was trained by the authors of [DeepMRI2d] and can be found online; we summarise the details of the training data and architecture in §2.3.

  • Ell 50 [jin17]: Ell 50 is a network for CT or any Radon-transform-based inverse problem. The number 50 refers to the number of lines used in the sampling of the sinogram. The network was trained by the authors of [jin17] and can be obtained online; all the details can be found in §2.4.

  • Med 50 [jin17]: Med 50 has exactly the same architecture as Ell 50 and is also used for CT; however, it is trained on a different dataset. The network was trained by the authors of [jin17], and the network weights were obtained through private communication. The details are summarised in §2.4.

  • MRI-VN [Hammernik18]: This network is for medium to high resolution parallel MRI with 15 coil elements and 15% subsampling. In order to show a variety of subsampling ratios, we have trained this network with a smaller subsampling percentage than the authors of [Hammernik18] originally used in their paper (25% and 33%). As we already have networks for 33% and 20%, we wanted a test at an even lower subsampling rate. All remaining parameters are kept as suggested in the code provided by the authors of [Hammernik18], except for the subsampling ratio and the batch size (due to memory limitations). All the details are documented in §2.5.

2 Deep learning and neural networks for inverse problems

The goal of deep learning in inverse problems is to learn a neural network taking the measurements y = Ax (where A is the matrix representing the sampling modality) as input and producing an approximation to x as its output. Many of the networks considered in this work are not applied directly to the measurements y, but attempt to take advantage of knowledge of the structure of the forward operator A by first applying a transformation B to the measurements; in particular, one computes By. The transformation B represents the adjoint operator A* when the forward operator is defined in the Fourier domain, and a discretised filtered back projection (FBP) operator in the case of Radon measurements.
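As a concrete illustration, the mapping from an image to subsampled Fourier measurements and back to the zero-filled adjoint reconstruction can be sketched in a few lines of numpy (the mask, image, and subsampling rate below are illustrative, not those of any network in this work):

```python
import numpy as np

def forward(x, mask):
    """Subsampled Fourier forward operator A: keep only the k-space entries where mask is True."""
    return np.fft.fft2(x, norm="ortho") * mask

def adjoint(y, mask):
    """Adjoint A*: zero-fill the unsampled k-space entries and invert the Fourier transform."""
    return np.fft.ifft2(y * mask, norm="ortho")

# Illustrative example: a random image and a random ~30% sampling mask.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
mask = rng.random((64, 64)) < 0.3
y = forward(x, mask)        # measurements y = Ax
x_adj = adjoint(y, mask)    # zero-filled reconstruction A*y
```

With a full mask the composition A*A is the identity, which is a quick sanity check on the pair of operators.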

2.1 Automap

2.1.1 Network architecture

The AUTOMAP network [Rosen_Nature] is proposed for image reconstruction from Radon measurements, spatial non-Cartesian Fourier sampling, misaligned Fourier sampling and undersampled Cartesian Fourier samples. In this work we have tested the network trained for image reconstruction from undersampled Cartesian Fourier samples. In contrast with the other networks considered in this work, the AUTOMAP network provides a direct mapping of the Fourier measurements to the image domain without applying the adjoint operator as a first step.

The authors of [Rosen_Nature] have not made their code publicly available, and the weights from their paper [Rosen_Nature] had not been stored. However, they kindly agreed to retrain their network for us and save the weights. The network architecture they trained deviates slightly in some of the activation functions reported in their paper [Rosen_Nature]; however, the network was trained on the same data and sampling pattern. Below we describe the network architecture we received. The training parameters and data are reported as in the paper [Rosen_Nature].

The input of the AUTOMAP network, as described in [Rosen_Nature] and in Figure 10, is a complex image of k-space measurements. In the case of subsampling, one may interpret this image as being zero padded in the coordinates that are not sampled. In the actual implementation the input is represented by the complex measurement data, corresponding to 60% of full sampling in this experiment. These data are reshaped into a vector with real entries before being fed into the network. The first two layers of the network are fully connected matrices. The first fully connected layer is followed by a hyperbolic tangent activation function. The second fully connected layer is followed by a layer which subtracts the mean from the output of the second layer. The output is then reshaped into an image.
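The reshaping of complex measurement data into a real input vector can be sketched as follows (the exact ordering of real and imaginary parts used by AUTOMAP is an assumption here; only the principle of stacking is illustrated):

```python
import numpy as np

def to_real_vector(y):
    """Stack the real and imaginary parts of complex measurements into one real vector.
    The ordering (all real parts first, then all imaginary parts) is an illustrative choice."""
    return np.concatenate([y.real.ravel(), y.imag.ravel()])

# Illustrative 2x2 block of complex measurements.
y = np.array([[1 + 2j, 3 - 1j], [0 + 0j, -2 + 5j]])
v = to_real_vector(y)   # real vector of length 2 * y.size
```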

Next follow two convolutional layers, each with 64 feature maps. The first convolutional layer is followed by a hyperbolic tangent function, while the second is followed by a rectified linear unit (ReLU). Finally, the output layer deconvolves the 64 feature maps provided by the second convolutional layer. The output of the network is a matrix representing the image magnitudes.

Figure 10: The AUTOMAP architecture (figure from [Rosen_Nature]).

2.1.2 Training parameters

The loss function used for training consisted of two terms. The first is the squared ℓ2-norm of the difference between the ground truth magnitude image and the magnitude image predicted by the network. The second is an ℓ1-penalty on the outputs of the activations following the second convolutional layer. The total loss is the sum of these two terms. The network is trained using the RMSProp algorithm (see for example http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, as referred to in [Rosen_Nature]) with minibatch size 100 and the learning rate, momentum, and decay reported in [Rosen_Nature]. The number of training epochs is 100.
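For reference, a single RMSProp update can be sketched as follows (the hyperparameter defaults below are illustrative, not the values used in [Rosen_Nature]):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-4, decay=0.9, eps=1e-8):
    """One RMSProp update: scale the gradient by a running average of its square.
    lr, decay, and eps are illustrative defaults."""
    cache = decay * cache + (1 - decay) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```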

2.1.3 Training data

The training dataset consists of images taken from 131 different subjects from the MGH-USC HCP public dataset [Fan16] (https://db.humanconnectome.org/). For each image, the central region was cropped and subsampled to the network's input resolution. Before training, the images were preprocessed by normalizing the entire dataset to a constant value defined by the maximum intensity of the dataset. Fourier data were obtained by subsampling the Cartesian k-space using a Poisson-disk sampling pattern with 60% undersampling [Uecker15].

In order to increase the network's robustness against translations, the following data augmentation scheme was applied. New images were created from each image in the training dataset by tiling together four reflections of the original image. The image so obtained was then cropped at a random location. The routines used to implement the AUTOMAP network were written in TensorFlow.
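The tiling-and-cropping augmentation described above can be sketched as follows (function names and image sizes are illustrative):

```python
import numpy as np

def tile_reflections(img):
    """Tile four reflections of an image into a 2x2 mosaic."""
    top = np.hstack([img, np.fliplr(img)])
    return np.vstack([top, np.flipud(top)])

def random_crop(img, size, rng):
    """Crop a size x size patch at a random location."""
    i = rng.integers(0, img.shape[0] - size + 1)
    j = rng.integers(0, img.shape[1] - size + 1)
    return img[i:i + size, j:j + size]

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
augmented = random_crop(tile_reflections(img), 64, rng)
```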


2.2 Dagan

2.2.1 Network architecture

The DAGAN network was introduced in [DAGAN] to recover images from Fourier samples, with particular emphasis on MRI reconstruction applications. The DAGAN network assumes measurements y = Ax, where A is a subsampled discrete Fourier transform. The input of the network is the noisy magnitude image obtained by direct inversion of the zero-filled Fourier data, i.e., |A*y|.

The recovery algorithm presented in [DAGAN] is based on a conditional generative adversarial network (GAN) model, which consists of a generator network, used for the image reconstruction, and a discriminator network, measuring the quality of the reconstructed image. The generator network adopted in [DAGAN] has a U-net structure, similar to that used in [jin17], and its objective is to produce the recovered image. In [DAGAN] the authors propose two almost identical architectures and train them with different loss functions. Below we describe their “refinement” architecture trained with what is referred to as the Pixel-Frequency-Perceptual-GAN-Refinement loss in the paper. The refined version is our choice because this architecture and training performed best both in the paper and in our tests. The network was not made publicly available, so, based on advice from the authors of [DAGAN], we trained the network ourselves.


Figure 11: The DAGAN generator architecture, built from an initial Conv2D layer with stride (2,2), Conv2D stride (2,2) + BN + lReLU blocks, DeConv2D stride (2,2) + BN + ReLU blocks, and a final Conv2D stride (1,1) + BN + ramp block, with the number of filters varying across layers. Here lReLU is the leaky ReLU function.

The architecture of the generator network, reported in Figure 11, contains 8 convolutional layers and 8 deconvolutional layers, each followed by batch normalization (BN) [Ioffe15]. The batch normalization layers after the convolutional layers are followed by leaky ReLU (lReLU) activations, while the batch normalization layers after the deconvolutions are followed by ReLU activations. The generator network also contains skip connections, i.e., connections that copy the output of a layer directly to the input of a layer further down the hierarchy. The skip connections are used to concatenate mirrored layers (see Figure 11). The convolutional and deconvolutional layers all use the same filter size and stride, and the number of filters in each convolutional/deconvolutional layer increases/decreases according to Figure 11.

The last deconvolutional layer is followed by a hyperbolic tangent activation function. A global skip connection, adding the input of the network to the output of the hyperbolic tangent function, is then followed by a ramp function clipping the output values of the image to a fixed intensity range. This last skip connection means that the network is actually approximating the residual error between the network input and the image of interest.
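The effect of the global skip connection followed by the ramp can be sketched as follows (the clipping range [-1, 1] is an assumption for illustration; the paper's exact range is not restated here):

```python
import numpy as np

def dagan_output(x_in, pre_tanh):
    """Global skip connection: add the network input to the tanh output,
    then clip with a ramp. The [-1, 1] range is an illustrative assumption."""
    return np.clip(x_in + np.tanh(pre_tanh), -1.0, 1.0)
```

Because the input is added back, the layers before the tanh only need to model the residual between the input and the target image.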

The generator network is trained jointly with a discriminator network, which aims to distinguish between the output of the generator network and ground truth images. For further information on this network, we refer to [DAGAN].

2.2.2 Training parameters

The loss function used to train the DAGAN network consists of four different terms. First, an image-domain mean squared error (MSE) loss, which accounts for the distance between the output of the generator network and the ground truth image. Second, a frequency-domain MSE loss, which enforces consistency between the output of the generator network in the frequency domain and the acquired Fourier measurements. Third, a perceptual loss term, computed using a pretrained VGG-16 network described in [VGG]. In particular, the VGG-16 network was trained on the ImageNet dataset (http://www.image-net.org/), and the output of its conv4 layer was used to compute the loss term, by considering the ℓ2-norm of the difference between the VGG-16 outputs corresponding to the ground truth image and to the generator network output. Finally, the fourth term is computed using a cross-entropy loss on the output of the discriminator network. The total loss is a weighted sum of these four terms.

We used the same weights for the four loss terms as in [DAGAN]. The generator and the discriminator network were jointly trained by alternating gradient optimization. In particular, the Adam [Kingma14] optimizer was adopted, with the initial learning rate, momentum, and minibatch size reported in [DAGAN]. The learning rate was halved every 5 epochs. We applied the same early stopping rule as given in their implementation (https://github.com/nebulaV/DAGAN), which is based on measuring the loss gap between the training set and the validation set. We used the same early stopping number as in their implementation, which determined the total number of training epochs.

2.2.3 Training data

The DAGAN network was trained using data from a MICCAI 2013 grand challenge dataset (http://masiweb.vuse.vanderbilt.edu/workshop2013/index.php/). We removed all images from the dataset where less than 10% of the pixel values were non-black. In total we therefore used 15912 images for training and 4977 images for validation. The dataset consisted of MR images of different brain tissues.

The following data augmentation techniques were used to increase the amount of training data: image flipping, rotation, shifting, brightness adjustment, zooming, and elastic distortion [Simard03].

The discrete Fourier transforms of the training images were subsampled using 1D Gaussian masks, i.e., masks containing vertical lines of data in k-space, randomly located according to a Gaussian distribution over the image. In our tests we trained a network to do recovery from 20% subsampling. The code used to implement DAGAN was written using the TensorLayer (http://tensorlayer.readthedocs.io) wrapper and TensorFlow, and was made publicly available by the authors of [DAGAN].
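A 1D Gaussian sampling mask of this kind can be generated as in the following sketch (the standard deviation of the Gaussian and the grid size are illustrative choices, not those of [DAGAN]):

```python
import numpy as np

def gaussian_1d_mask(n_lines, n_keep, rng):
    """Select k-space lines at random, biased toward the centre by a Gaussian density.
    The standard deviation n_lines / 6 is an illustrative choice."""
    centre = n_lines / 2.0
    w = np.exp(-0.5 * ((np.arange(n_lines) - centre) / (n_lines / 6.0)) ** 2)
    w /= w.sum()
    chosen = rng.choice(n_lines, size=n_keep, replace=False, p=w)
    mask = np.zeros(n_lines, dtype=bool)
    mask[chosen] = True
    return mask

rng = np.random.default_rng(0)
mask = gaussian_1d_mask(256, 51, rng)   # keep ~20% of the vertical lines
```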

2.3 DeepMRINet

2.3.1 Network architecture

The Deep MRI net is used to recover images from their subsampled Fourier measurements. Its architecture is built up as a cascade of neural networks, whose input is the blurry image obtained by direct inversion of the zero-filled measurements, i.e., by applying the adjoint A* to y. The networks in the cascade are convolutional neural networks (CNNs), whose layers consist of trainable convolutional operators and biases, each followed by a ReLU activation function. These networks are then tied together and interleaved with data consistency (DC) layers, whose objective is to promote consistency between the reconstructed images and the Fourier measurements. A DC layer replaces each Fourier coefficient of the reconstruction at a sampled index with a weighted average, governed by a noise parameter λ, of that coefficient and the corresponding acquired measurement; here the Fourier operator and the set Ω of indices of the measurements acquired in k-space define the layer. We point out that in the limit λ → ∞, the DC layer simply replaces the sampled Fourier coefficients with the acquired measurements and leaves the remaining coefficients unchanged.

In practice, a DC layer performs a weighted average of the Fourier coefficients of the image obtained as the output of a CNN in the cascade and the true samples. The parameter λ can either be trained or kept fixed. In [DeepMRI2d] it is not specified whether λ is learned or not; however, from the code (https://github.com/js3611/Deep-MRI-Reconstruction) it is clear that λ is kept fixed.
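A minimal numpy sketch of such a DC layer, with lam playing the role of the noise parameter λ in the weighted average, might look as follows (hard replacement of the sampled coefficients is recovered in the limit lam → ∞):

```python
import numpy as np

def data_consistency(x, y, mask, lam):
    """Data consistency layer: replace the Fourier coefficients of x at the sampled
    indices (mask) by a weighted average with the acquired measurements y.
    lam = np.inf reproduces hard replacement."""
    k = np.fft.fft2(x, norm="ortho")
    if np.isinf(lam):
        k[mask] = y[mask]
    else:
        k[mask] = (k[mask] + lam * y[mask]) / (1 + lam)
    return np.fft.ifft2(k, norm="ortho")
```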

Figure 12: The DeepMRINet architecture (figure from [DeepMRI2d]).

The complete network can now be written as a cascade of CNNs interleaved with DC layers, and its architecture is reported in Figure 12. In particular, the architecture used to produce the results in [DeepMRI2d], and those reported in this paper, contains 5 CNNs interleaved with 5 DC layers. Each CNN contains 5 convolutional layers, all with the same kernel size and stride. The first 4 layers use 64 filters and are followed by a ReLU activation function. The fifth convolutional layer in each CNN contains 2 filters, representing the real and imaginary parts of the image. This fifth layer is not followed by any activation function; however, its output is added to the input of the CNN using a skip connection.

2.3.2 Training parameters

In our experiments we used a pre-trained network that was trained (and published online) by the authors of [DeepMRI2d], with training parameters documented in the paper [DeepMRI2d]. The DeepMRINet was trained using a loss function with two terms: a term computing the mean squared error (MSE) between the true (complex-valued) image and the predicted (complex-valued) image, and a penalty on the norm of the weights. The loss function was the weighted sum of these two terms.

The network weights were initialized using He initialization [he2015delving], and the Adam [Kingma14] optimizer was used for training. This optimizer takes as input a learning rate (step size) and two exponential decay parameters related to momentum terms; we refer to [Kingma14] for further explanation of these parameters. The network was trained with the learning rate and decay parameters reported in [DeepMRI2d] and batch size equal to 10.
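For reference, a single Adam update with bias-corrected moment estimates can be sketched as follows (the defaults below are the common ones from [Kingma14], not necessarily those used to train the DeepMRINet):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count).
    m and v are running estimates of the first and second moments of the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```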

2.3.3 Training data

The DeepMRINet was trained using data from five subjects from the MR dataset used in [Caballero14], which consists of 10 fully sampled short-axis cardiac cine scans. Each of these scans was preprocessed, using the SENSE [SENSE] software, into temporal (complex-valued) frames. Synthetic MRI measurements were then obtained by retrospectively sampling the reconstructed images in k-space according to Cartesian undersampling masks. During training, a fixed undersampling rate of 33% was used, but different undersampling masks were randomly generated in order to allow the network to recover images from measurements obtained with different undersampling masks. In particular, training images were fully sampled along the frequency-encoding direction but undersampled in the phase-encoding direction, according to the scheme described in [Jung07] (center frequencies were always included in the subsampling patterns).

To prevent overfitting, data augmentation was performed by including rigid transformations of the considered images in the training datasets.

The code used to implement the DeepMRINet was written in Python using the Theano (http://deeplearning.net/software/theano) and Lasagne (https://lasagne.readthedocs.io/en/latest) libraries.

2.4 FBPConvNet – The Ell 50 and Med 50 networks

The Ell 50 and Med 50 networks were proposed in [jin17] under the name FBPConvNet. The networks are trained to reconstruct images from Radon measurements. The networks have identical architecture and are trained using the same algorithm, with the same set of hyperparameters. The only difference between the training of the two networks is the dataset they have been trained on. Below, we describe the architecture and the training procedure of both networks, and then describe the datasets for the two networks in separate sections.

2.4.1 Network architecture

Figure 13: The Ell 50 and Med 50 architecture (figure from [jin17]).

The Ell 50 and Med 50 networks are trained to reconstruct an image from measurements y = Ax, where A is a Radon sampling operator (we used MatLab's radon command to represent this operator) sampling 50 uniformly spaced radial lines. Rather than learning a mapping from y to x directly, the networks take advantage of a discretised filtered back projection (FBP) operator (we used MatLab's iradon command with linear interpolation and a ‘Ram-Lak’ filter to represent this operator), as described in the methods section, to obtain a noisy approximation to x. The FBP operator can be seen as a non-learnable first layer in the network.

The networks contain several convolutional and deconvolutional layers, all of which (except the last) are followed by a batch normalization (BN) layer and a ReLU activation function. The (de)convolutional layers use a fixed filter size and stride and a varying number of filters. We will not describe the full architecture in detail, as it can be seen in Figure 13, with the relevant numbers of filters, skip connections, max-poolings, and concatenations. We do, however, point out that the network applies a final global skip connection, so that rather than learning a mapping from the FBP image to the ground truth, the network tries to learn the “noise”, i.e., the difference between the two.

2.4.2 Training parameters

The network weights were provided by the authors of [jin17] and were obtained using the training procedure described in their paper. The loss function used to train the networks is the norm of the difference between the network output and the ground truth, and the networks are trained using the stochastic gradient descent algorithm with momentum. The learning rate decreases over the course of training, the momentum is fixed, and the minibatch size is equal to 1 (see [jin17] for the precise values). During training, gradients are clipped to a fixed interval to prevent divergence of the cost function. The networks are trained for 101 epochs, and the code used to implement the networks is written in MatLab using the library MatConvNet (http://www.vlfeat.org/matconvnet).
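A single SGD-with-momentum step with elementwise gradient clipping can be sketched as follows (the momentum and clipping values are illustrative, not those of [jin17]):

```python
import numpy as np

def sgd_momentum_clipped(w, vel, grad, lr, momentum=0.9, clip=0.01):
    """SGD-with-momentum step with elementwise gradient clipping to [-clip, clip].
    momentum and clip are illustrative defaults."""
    g = np.clip(grad, -clip, clip)
    vel = momentum * vel - lr * g
    return w + vel, vel
```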

2.4.3 Ell 50 – Training data

The Ell 50 network is trained on filtered back projections of 475 synthetic sinograms containing the Radon transform of ellipses of random intensity, size, and location. The dynamic range of the back-projected images is adjusted so that the image values are contained in a fixed interval. The Radon transform of an ellipse has an analytic formula, and this formula was used to create sinograms of such images using 1,000 uniformly spaced lines (views). Measurement data are obtained by retaining 50 radial lines out of the 1,000 views. The ground truth images were obtained by applying filtered back projection to the fully sampled sinograms (i.e., all 1,000 radial lines). This approach is motivated by the fact that in applications one never has access to the underlying actual ground truth image. Data augmentation is also applied to the training data, by considering horizontal and vertical mirroring of the original images.

2.4.4 Med 50 – Training data

Med 50 is trained on synthetic data obtained from 475 real in-vivo CT images from the Low-dose Grand Challenge competition database provided by the Mayo Clinic. The sinograms used for this training were synthetically generated from high-quality CT images using MatLab's radon command. The same approach as for the Ell 50 network was used: sinograms with 1,000 views were generated and used to produce the ground truth, while the network was trained from 50 of these views.

2.5 MRI Variational Network (MRI-VN)

2.5.1 Network architecture

The MRI Variational Network (MRI-VN) presented in [Hammernik18] is designed to reconstruct images from undersampled MRI data, sampled using 15 coil elements. Thus, we use the parallel MRI sampling operator as described in the methods section.

The network structure is inspired by the unfolding of a variational minimization problem including a fidelity term and a regularization term defined according to the Fields of Experts model [Roth09]. In particular, each iteration of the corresponding Landweber method [Landweber1951] corresponds to a layer of the resulting neural network. More specifically, the implementation considered in this work consists of a fixed number of layers/iterations, each applying a gradient step composed of a data-fidelity part and a learned regularization part, where the initial complex image is obtained by applying the adjoint operator to the measurements. We will describe each of the remaining components of this network separately.

We start by noticing that the images (stacked as vectors in this simplified description) are complex-valued, and can therefore be described by their real and imaginary components. We will alternate between the two representations.

The regularization operator applies learnable convolutional operators to the real and imaginary components, each with a fixed number of filters (channels), filter size, and stride. We will comment on the number of filters later.

The non-linear activation function in the network applies, for each filter/channel, a weighted sum of Gaussian radial basis functions pointwise to each component in that channel. The weights of this sum are learnt during the training phase. The nodes (centres) of the basis functions are non-learnable and distributed in an equidistant manner on a symmetric interval of fixed width, commented on below. The width of the basis functions is also non-learnable and fixed.
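Such an activation, built as a weighted sum of Gaussian radial basis functions with fixed equidistant nodes, can be sketched as follows (the number of nodes, the interval, and the width are illustrative choices):

```python
import numpy as np

def rbf_activation(x, weights, nodes, sigma):
    """Learned activation: weighted sum of Gaussian radial basis functions,
    applied pointwise. weights are learnable; nodes and sigma are fixed."""
    diffs = x[..., None] - nodes          # broadcast over the basis functions
    return np.sum(weights * np.exp(-diffs**2 / (2 * sigma**2)), axis=-1)

nodes = np.linspace(-1.0, 1.0, 31)        # equidistant, non-learnable nodes
weights = np.zeros(31)
weights[15] = 1.0                         # illustrative "learned" weights
y = rbf_activation(np.array([0.0, 1.0]), weights, nodes, sigma=0.1)
```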

The data-fidelity operator combines the real and imaginary components into a complex image (via the imaginary unit and the transposes of the corresponding stacking operators), applies the sampling operator and its adjoint, and scales the result by a learnable scalar. The remaining operations should be clear from Equation (13).

During training, each of the filters in the convolutional operators was restricted to have zero mean and unit Euclidean norm. This was done to avoid a scaling ambiguity with the activation weights.
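The zero-mean, unit-norm restriction amounts to the following projection (a minimal sketch; the filter size is illustrative):

```python
import numpy as np

def project_filter(f):
    """Project a filter onto the set of zero-mean, unit-Euclidean-norm filters.
    Subtracting the mean first means the normalization preserves the zero mean."""
    f = f - f.mean()
    n = np.linalg.norm(f)
    return f / n if n > 0 else f

f = project_filter(np.random.default_rng(0).standard_normal((11, 11)))
```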

To reproduce this network, we use the code published by the authors of [Hammernik18] (https://github.com/VLOGroup/mri-variationalnetwork). Parts of this code use slightly different parameters than those in the original paper. In particular, the number of filters and the interval parameter of the activation functions differ between the paper and the code. The change of the interval parameter is motivated by another change, also made in the published implementation, namely the scaling of the k-space values. In [Hammernik18] each k-space volume (with all its slices) was normalized by a single factor, whereas in the code each k-space slice is scaled individually. This change was made to make the implementation more streamlined. Whenever there has been a conflict between the two sources, we have chosen the version found in the code.

2.5.2 Training parameters

The MRI-VN network is trained using a norm of the reconstruction error as loss function. In particular, since MRI reconstructions are typically assessed through magnitude images, the error is evaluated by comparing smoothed versions of the magnitude images, where a small smoothing parameter avoids the non-differentiability of the magnitude at zero. The network parameters that minimize the loss function are determined using the inertial incremental proximal gradient (IIPG) optimizer (see [Hammernik18, Kobler17] for details). Optimization is performed for a fixed number of epochs with a fixed step size. Training data are arranged into minibatches of size 5. In the original paper the batch size was set to 10, but due to memory limitations we had to adjust this.

2.5.3 Training data

The authors in [Hammernik18] considered 5 datasets for different types of parallel MR imaging protocols, and trained one VN for each dataset. In this work, we have trained a VN for one of these protocols, namely Coronal Spin Density weighted with Fat Suppression. The training data consisted of knee images from 10 patients. From each patient we used 20 slices of the knee images, making up a total of 200 training images. The raw k-space data for each slice consisted of 15 k-space images, one per coil element, each with a precomputed sensitivity map. The sensitivity maps were computed by the authors of [Hammernik18], using ESPIRiT [espirit].

The raw data was obtained using a clinical 3T system (Siemens Magnetom Skyra) with an off-the-shelf 15-element knee coil. The raw data was subsampled retrospectively by zeroing out 85% of the k-space data. In [Hammernik18], both a regular sampling scheme and a variable-density pattern as proposed in [Lustig07] are tested. In our work, we used a regular sampling scheme, where the first 28 central k-space lines were sampled and the remaining lines were placed equidistantly in k-space. No data augmentation was used.
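A sampling mask of this type, with a fully sampled centre block and equidistant outer lines, can be generated as in the following sketch (the total number of k-space lines and the total number of kept lines are illustrative):

```python
import numpy as np

def regular_mask(n_lines, n_centre, n_keep):
    """Sampling mask: fully sample n_centre central lines,
    then place the remaining n_keep - n_centre lines equidistantly outside the centre."""
    mask = np.zeros(n_lines, dtype=bool)
    start = (n_lines - n_centre) // 2
    mask[start:start + n_centre] = True
    outer = np.where(~mask)[0]
    extra = n_keep - n_centre
    if extra > 0:
        idx = np.linspace(0, len(outer) - 1, extra).round().astype(int)
        mask[outer[idx]] = True
    return mask

mask = regular_mask(368, 28, 55)   # keep ~15% of 368 lines (sizes are illustrative)
```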

The code was implemented in Python with a custom-made version of TensorFlow (https://github.com/VLOGroup/tensorflow-icg), which is partly implemented in C++/CUDA with cuDNN support. All the code and data have been made available online by the authors of [Hammernik18].