1 Introduction
G. Yang and J. Schlemper/D. Rueckert and A. Maier share second/last coauthorship.
Compressed sensing-based Magnetic Resonance Imaging (CS-MRI) is a promising paradigm that accelerates MRI acquisition by reconstructing images from only a fraction of the normally required k-space measurements. Traditionally, sparsity-based methods and their data-driven variants such as dictionary learning [10] have been popular due to their mathematically robust formulation for perfect reconstruction. However, these methods are limited in acceleration factor and also suffer from high computational complexity. More recently, several deep learning-based architectures have been proposed as an attractive alternative for CS-MRI. The advantages of these techniques are their computational efficiency, which enables real-time application, and their ability to learn powerful priors directly from the data, which allows higher acceleration rates. The most widely adopted deep learning approach is to perform an end-to-end reconstruction using multi-scale encoding-decoding architectures [8, 16]. Alternative approaches carry out the reconstruction in an iterative manner [17], conceptually extending traditional optimization algorithms. Most previous studies focus on exploring the network architecture; however, the optimal loss function to train the network remains an open question.
Recently, as an alternative to the commonly used MSE loss, adversarial [2] and perceptual losses [5] have been proposed for CS-MRI [16]. As these loss functions are designed to improve the visual quality of the reconstructed images, we refer to them as visual loss functions in the following. So far, approaches using visual loss functions still rely on an additional MSE loss for successful training of the network. Directly combining all these losses in a joint optimization leads to a suboptimal training process, resulting in reconstructions with lower peak signal-to-noise ratio (PSNR) values. In this work, we propose a two-stage architecture that avoids this problem by separating the reconstruction task from the task of refining the visual quality. Our contributions are the following: (1) we show that the proposed refinement architecture improves the visual quality of reconstructions without substantially compromising PSNR, and (2) we introduce the semantic interpretability score as a new metric to evaluate reconstruction performance, and show that our approach outperforms competing methods on it.
2 Background
Deep Learning-based CS-MRI Reconstruction. Let $x \in \mathbb{C}^N$ denote a complex-valued MR image of size $N$ to be reconstructed, and let $y \in \mathbb{C}^M$ ($M \ll N$) represent undersampled k-space measurements obtained by $y = F_u x + \varepsilon$, where $F_u$ is the undersampling Fourier encoding operator and $\varepsilon$ is complex Gaussian noise. The linear inversion $x_u = F_u^H y$, also called zero-filled reconstruction, is fundamentally ill-posed and generates an aliased image due to violation of the Nyquist-Shannon sampling theorem. Therefore, it is necessary to add prior knowledge into the reconstruction to constrain the solution space, traditionally formulated as the following optimization problem:
$$\min_{x} \; \mathcal{R}(x) + \lambda \, \| y - F_u x \|_2^2 \qquad (1)$$
Here, $\mathcal{R}$ expresses a regularization term on $x$ (e.g. $\ell_0$/$\ell_1$ norm for CS-MRI), and $\lambda$ is a hyperparameter reflecting the noise level. In deep learning approaches, one learns the inversion mapping directly from the data. However, rather than learning the mapping from the Fourier domain directly to the image domain, it is common to formulate this problem as dealiasing the zero-filled reconstructions in the image domain [14, 16]. Let $\mathcal{D}$ be our training dataset of pairs $(x_u, x)$ and $\hat{x}_{\mathrm{rec}} = f_{\mathrm{rec}}(x_u)$ be the image generated by the reconstruction network $f_{\mathrm{rec}}$. Given $\mathcal{D}$, the network is trained by minimizing the empirical risk $\mathcal{L}(f_{\mathrm{rec}}) = \mathbb{E}_{(x_u, x) \sim \mathcal{D}} \, d(\hat{x}_{\mathrm{rec}}, x)$, where $d$ is a distance function measuring the dissimilarities between the reference fully-sampled image and the reconstruction.
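The measurement model and the zero-filled reconstruction described above can be illustrated with a minimal NumPy sketch. It assumes a noise-free, single-coil Cartesian acquisition in which the undersampling operator is an orthonormal 2D FFT followed by a binary mask; the function names `undersample` and `zero_filled`, the toy image, and the toy mask are illustrative assumptions and not part of the original work.

```python
import numpy as np

def undersample(x, mask):
    """Noise-free forward model: orthonormal 2D FFT, then keep only sampled entries."""
    k_full = np.fft.fft2(x, norm="ortho")
    return k_full * mask

def zero_filled(y):
    """Zero-filled reconstruction: inverse FFT with unsampled entries left at zero."""
    return np.fft.ifft2(y, norm="ortho")

rng = np.random.default_rng(0)
# Toy complex-valued "image" (assumption: random data stands in for an MR slice).
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

# Toy mask keeping every second k-space line (assumption, not the paper's pattern).
mask = np.zeros((8, 8))
mask[::2, :] = 1.0

y = undersample(x, mask)   # undersampled k-space measurements
x_u = zero_filled(y)       # aliased zero-filled image
```

Because the mask discards half of the k-space lines, `x_u` differs from `x`, while re-applying the forward model to `x_u` reproduces the measurements exactly; this is the sense in which the zero-filled image is consistent with the data yet aliased.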
For the choice of the reconstruction network, most previous approaches [8, 16] relied on an encoder-decoder structure (e.g. U-Net [11]), but our preliminary experiments showed that these architectures performed subpar in terms of PSNR.
Instead, we use the architecture proposed in [14], as it performed well even for high undersampling rates.
This network consists of consecutive dealiasing blocks, each containing convolutional layers.
Each dealiasing block takes an aliased image $x^{(i)}$ as the input and outputs the dealiased image $x^{(i+1)}$, with $i$ indexing the blocks and $x^{(0)}$ being the zero-filled reconstruction.
Interleaved between the dealiasing blocks are data consistency (DC) layers, which enforce that the reconstruction is consistent with the acquired k-space measurements by replacing frequencies of the intermediate image with frequencies retained from the sampling process.
This process can be seen as an unrolled iterative reconstruction where dealiasing blocks and DC layers perform the role of the regularization step and data fidelity step, respectively [14].
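The data consistency step can be sketched as follows. This is a minimal illustration assuming noise-free measurements, so that sampled frequencies are replaced outright rather than blended; the function name `data_consistency` and the toy data are assumptions made for this example and do not reproduce the implementation of [14].

```python
import numpy as np

def data_consistency(x_pred, y_meas, mask):
    """Replace the predicted k-space values at sampled locations by the measured ones."""
    k_pred = np.fft.fft2(x_pred, norm="ortho")
    # Keep measured frequencies where the mask is set, predicted ones elsewhere.
    k_dc = np.where(mask > 0, y_meas, k_pred)
    return np.fft.ifft2(k_dc, norm="ortho")

rng = np.random.default_rng(0)
x_true = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
mask = np.zeros((8, 8))
mask[::2, :] = 1.0                                   # toy sampling pattern
y_meas = np.fft.fft2(x_true, norm="ortho") * mask    # acquired measurements

# Stand-in for the output of a dealiasing block (random, for illustration only).
x_pred = rng.standard_normal((8, 8)) + 0j
x_dc = data_consistency(x_pred, y_meas, mask)
```

After this step the image agrees with the acquired data at every sampled frequency, while the unsampled frequencies still come from the network prediction.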
Loss Functions for Reconstruction.
In deep learning-based approaches to inverse problems, such as MR reconstruction and single image super-resolution, a frequently used loss function [8, 17] is the MSE loss $\mathcal{L}_{\mathrm{MSE}} = \| \hat{x} - x \|_2^2$, where $\hat{x}$ denotes the reconstruction and $x$ the fully-sampled reference image. Though networks trained with the MSE criterion can achieve high PSNR, the results often lack high frequency image details [1]. Perceptual loss functions [5] are an alternative to the MSE loss. They minimize the distance to the target image in some feature space. A common perceptual loss is the VGG loss $\mathcal{L}_{\mathrm{VGG}} = \| f_{\mathrm{VGG}}(\hat{x}) - f_{\mathrm{VGG}}(x) \|_2^2$, where $f_{\mathrm{VGG}}$ denotes VGG feature maps [15].
Another choice is an adversarial loss based on Generative Adversarial Networks (GANs) [3, 2]. A discriminator and a generator network are set up to compete against each other such that the discriminator is trained to differentiate between real and generated samples, whereas the generator is encouraged to deceive the discriminator by producing more realistic samples. In our case, the discriminator learns to differentiate between fully-sampled and reconstructed images, and the reconstruction network, playing the role of the generator, reacts by changing the reconstructions to be more similar to the fully-sampled images. The discriminator loss is then given by $\mathcal{L}_{\mathrm{D}} = -\log D(x) - \log(1 - D(\hat{x}))$, where $D$ denotes the discriminator. During training, the reconstruction network minimizes $\mathcal{L}_{\mathrm{adv}} = -\log D(\hat{x})$, which has the effect of pulling the reconstructed images closer towards the distribution of the training data.
Perceptual losses are known to increase textural details [7], but also to introduce high frequency artifacts [2], whereas adversarial losses can produce realistic, high frequency details [7]. As perceptual and adversarial losses complement each other, it is sensible to combine them into a single visual loss $\mathcal{L}_{\mathrm{vis}}$. For MR reconstruction, previous attempts [16] further combined adversarial and/or perceptual loss with the MSE loss to stabilize the training. This simultaneous optimization yields acceptable solutions, though typically with low PSNR. We argue this is because the different training objectives compete with each other, leading to the network ultimately converging to a suboptimal local minimum.
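To make the loss terms discussed in this subsection concrete, the sketch below implements an MSE loss together with the standard binary cross-entropy GAN objective of [3], using the commonly applied non-saturating generator term. Whether the original work uses exactly this form of the adversarial loss is an assumption; the function names and the `eps` stabilizer are illustrative, and the discriminator outputs are taken to be probabilities in (0, 1).

```python
import numpy as np

def mse_loss(x_hat, x):
    """Mean squared error between reconstruction and reference (works for complex arrays)."""
    return float(np.mean(np.abs(x_hat - x) ** 2))

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy loss: real images should score 1, reconstructions should score 0."""
    return float(np.mean(-np.log(d_real + eps) - np.log(1.0 - d_fake + eps)))

def generator_adv_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: small when the discriminator is fooled."""
    return float(np.mean(-np.log(d_fake + eps)))

# Example discriminator outputs for a batch of real and reconstructed images
# (made-up numbers for illustration).
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.3])
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_adv_loss(d_fake)
```

The generator term decreases as the discriminator assigns higher probability to the reconstructions, which is the mechanism that pulls reconstructions towards the distribution of fully-sampled images.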
3 Method
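The observation above motivates our approach: instead of directly training a reconstruction network with all loss functions jointly, we use a two-stage procedure, detailed in Figure 1. In the first stage, the reconstruction network $f_{\mathrm{rec}}$ is trained with the MSE loss $\mathcal{L}_{\mathrm{MSE}}$. In the second stage, we fix the reconstruction network and train a visual refinement network $f_{\mathrm{ref}}$ on top of $f_{\mathrm{rec}}$ by optimizing the visual loss $\mathcal{L}_{\mathrm{vis}}$. The final reconstruction is then given by $\hat{x} = \hat{x}_{\mathrm{rec}} + f_{\mathrm{ref}}(\hat{x}_{\mathrm{rec}})$, where $\hat{x}_{\mathrm{rec}}$ is the output of $f_{\mathrm{rec}}$, i.e. $f_{\mathrm{ref}}$ learns an additive mapping which refines the base reconstruction. In this setup, discriminator and VGG network still receive the full reconstruction $\hat{x}$ as input.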
The decoupling of the refinement step from the reconstruction task has several benefits. The discriminator begins training by seeing reasonably good reconstructions, which avoids overfitting it to suboptimal solutions during the training process. Furthermore, compared to training from scratch, the optimization is easier as it starts closer to the global optimum. Finally, the visual refinement step always starts out from the best possible MSE solution achievable with the reconstruction network $f_{\mathrm{rec}}$, whereas this guarantee is not given when jointly training with the MSE loss $\mathcal{L}_{\mathrm{MSE}}$ and the visual loss $\mathcal{L}_{\mathrm{vis}}$.
The choice of the architecture for the visual refinement network is flexible, and in this work we use a U-Net architecture. Within the refinement network $f_{\mathrm{ref}}$, we gate the output of the network by a trainable scalar $\alpha$, which improves the adversarial training dynamics during the early stages of training. If we initialize $\alpha = 0$, the discriminator receives the unmodified base reconstruction $\hat{x}_{\mathrm{rec}}$, and the gradient signal to $f_{\mathrm{ref}}$ is forced to zero. This allows the discriminator to initially only learn on clean reconstructions from the reconstruction network, untainted by the randomly initialized output of $f_{\mathrm{ref}}$. For the refinement network, the impact of less useful gradients is reduced while the discriminator has not yet learned to correctly differentiate between the ground truth (i.e. fully-sampled data) and the reconstructions. We also scale $\hat{x}_{\mathrm{rec}}$ to a fixed range before using it as $f_{\mathrm{ref}}$’s input and then scale back to the original range after adding the refinement. In accordance with our goal of reaching high PSNR values, we constrain the output of $f_{\mathrm{ref}}$ (before gating with $\alpha$) with an $\ell_1$ penalty $\mathcal{L}_{\mathrm{pen}}$. This guides $f_{\mathrm{ref}}$ to learn the minimal sparse transformation needed to fulfill the visual loss, i.e. to change the MSE-optimal solution only in areas important for visual quality. In practice, this means that our approach yields higher PSNR values compared to joint training, as we show in Section 4.
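The gated additive refinement described above can be sketched in a few lines. The example assumes that inputs are min-max scaled to [0, 1] (the range actually used is not recoverable from the text, so this choice is hypothetical) and uses a trivial stand-in function in place of a trained U-Net; the names `refine` and `toy_net` are illustrative only.

```python
import numpy as np

def refine(x_rec, f_ref, alpha):
    """Add a gated refinement to a base reconstruction and return it with the l1 penalty."""
    lo, hi = x_rec.min(), x_rec.max()
    scale = hi - lo if hi > lo else 1.0
    x_norm = (x_rec - lo) / scale               # scale input (assumed range: [0, 1])
    residual = f_ref(x_norm)                    # ungated output of the refinement network
    x_hat = (x_norm + alpha * residual) * scale + lo   # add refinement, scale back
    penalty = float(np.mean(np.abs(residual)))  # l1 penalty on the ungated output
    return x_hat, penalty

x_rec = np.array([[0.0, 2.0], [4.0, 8.0]])      # toy base reconstruction
toy_net = lambda z: np.ones_like(z)             # stand-in for a trained network

x_hat0, pen0 = refine(x_rec, toy_net, alpha=0.0)  # gate closed: output equals x_rec
x_hat1, pen1 = refine(x_rec, toy_net, alpha=0.5)  # gate half open
```

With the gate initialized at zero, the output is exactly the base reconstruction regardless of what the refinement network produces, which is why the discriminator initially sees only clean reconstructions.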
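We also utilize a couple of techniques known to stabilize the adversarial learning process. For the discriminator network, we use one-sided label smoothing [12] and an experience replay buffer [9] of fixed size, from which samples are drawn with a fixed probability. For the refinement network, we add a feature matching loss [12] $\mathcal{L}_{\mathrm{fm}}$, computed over the feature maps of the discriminator. The total loss for the refinement network $f_{\mathrm{ref}}$ is given by
$$\mathcal{L}_{\mathrm{ref}} = \mathcal{L}_{\mathrm{VGG}} + \beta_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}} + \beta_{\mathrm{fm}} \, \mathcal{L}_{\mathrm{fm}} + \gamma \, \mathcal{L}_{\mathrm{pen}} \qquad (2)$$
with $\gamma$ being the penalty strength, and $\beta_{\mathrm{adv}}$, $\beta_{\mathrm{fm}}$ constants set in the first iteration of training such that the two adversarial loss terms are assigned the same initial importance as $\mathcal{L}_{\mathrm{VGG}}$.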
The penalty strength is important for training speed and stability.
Choosing it based on the loss magnitudes in the first training iteration gave us sufficiently good results.
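The two discriminator-side stabilization techniques mentioned above, one-sided label smoothing and an experience replay buffer, can be sketched as follows. The capacity, replay probability, smoothing value, and the random-replacement policy used here are hypothetical placeholders chosen for illustration; they are not the settings of the original work.

```python
import random

class ReplayBuffer:
    """Stores past generator outputs and occasionally replays them to the discriminator."""

    def __init__(self, capacity, p_replay, seed=0):
        self.capacity = capacity      # hypothetical buffer size
        self.p_replay = p_replay      # hypothetical probability of drawing an old sample
        self.items = []
        self.rng = random.Random(seed)

    def sample(self, new_item):
        """Return new_item, or with probability p_replay an older stored item."""
        out = new_item
        if self.items and self.rng.random() < self.p_replay:
            out = self.rng.choice(self.items)
        # Store the new item, overwriting a random slot once the buffer is full
        # (the replacement policy is an assumption).
        if len(self.items) < self.capacity:
            self.items.append(new_item)
        else:
            self.items[self.rng.randrange(self.capacity)] = new_item
        return out

def smoothed_real_labels(n, smoothing):
    """One-sided label smoothing: soften only the targets for real images."""
    return [1.0 - smoothing] * n

buffer = ReplayBuffer(capacity=3, p_replay=0.5)
shown = [buffer.sample(f"recon_{i}") for i in range(10)]
labels = smoothed_real_labels(4, smoothing=0.1)   # 0.1 is a placeholder value
```

Replaying older reconstructions keeps the discriminator from adapting only to the generator's most recent outputs, while smoothing the real labels discourages overconfident discriminator predictions.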
Semantic interpretability score. The most commonly used metrics to evaluate reconstruction quality are PSNR and the structural similarity index (SSIM). It has been shown that those two metrics do not necessarily correspond to visual quality for human observers, as demonstrated e.g. by the human observer studies in [7, 1]. Therefore, PSNR and SSIM alone are not sufficient in the evaluation of image reconstructions. This raises the question of how to evaluate reconstruction quality taking human perception into account. One possibility is to let domain experts (e.g. clinicians and MRI physicists) rate the reconstructions and average the results to form a mean opinion score (MOS). Obtaining opinion scores from expert observers is costly, and hence cannot be used during the development of new models. However, if expert-provided segmentation labels are available, we can design a metric indicating how visible the segmented objects are in the reconstructed images, in the following referred to as semantic interpretability score (SIS). This metric is motivated by the Inception score [12] in GANs, which indicates how well an Inception network can identify objects in generated images.
SIS is defined as the mean Dice overlap between the ground truth segmentation and the segmentation predicted by a pretrained segmentation network from the reconstructed images. The scores are normalized by the average Dice score on the ground truth images to obtain a measure of segmentation performance relative to the lower error bound. We only consider images in which at least one instance of the object class is present, and ignore the background class. We argue that if a pretrained network is able to produce better segmentations, the regions of interest are better visible (e.g. have clearly defined boundaries) in the images. Implementing SIS requires a segmentation network trained on the same distribution of images as the reconstruction dataset. In practice, the segmentation network is trained on the fully-sampled images used for training the reconstruction method. We trained an off-the-shelf U-Net architecture to segment the left atrium, achieving a Dice score of 0.796 on the ground truth images.
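The definition of SIS above translates into a short computation once the segmentation network's predictions are available. The sketch below assumes binary masks for a single object class and takes the predicted segmentations as given inputs; the function names `dice` and `sis` and the toy masks are illustrative assumptions.

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

def sis(pred_on_recon, pred_on_gt, gt_masks):
    """Mean Dice on reconstructions, normalized by mean Dice on ground truth images.

    Only slices whose ground truth mask contains the object are considered.
    """
    keep = [i for i, g in enumerate(gt_masks) if g.any()]
    dice_recon = np.mean([dice(pred_on_recon[i], gt_masks[i]) for i in keep])
    dice_gt = np.mean([dice(pred_on_gt[i], gt_masks[i]) for i in keep])
    return float(dice_recon / dice_gt)

# Toy example: two slices, the second of which does not contain the object.
gt = [np.array([[1, 1], [0, 0]]), np.zeros((2, 2), dtype=int)]
seg_on_gt = [np.array([[1, 1], [0, 0]]), np.zeros((2, 2), dtype=int)]
seg_on_recon = [np.array([[1, 0], [0, 0]]), np.ones((2, 2), dtype=int)]

score = sis(seg_on_recon, seg_on_gt, gt)
```

In this toy case the segmentation on the reconstruction recovers one of the two object pixels, giving a Dice of 2/3 on the only slice that counts, and the normalization leaves the score at 2/3 because the segmentation on the ground truth image is perfect.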
4 Experiments
Datasets.
We evaluated our method on 3D late gadolinium enhanced cardiac MRI datasets acquired in 37 patients.
We split the 2D axial slices of the 3D volumes into 1248 training images, 312 validation images, and 364 testing images of size pixels.
For training, we generated random 1D Gaussian masks keeping 12.5% of raw k-space data, which corresponded to an 8× speedup.
During testing, we randomly generated a mask for each slice, which we kept the same for all evaluated methods.
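One way to generate such a random 1D Gaussian mask is sketched below. It assumes a centred k-space layout in which full lines are sampled with a probability that decays with distance from the centre line; the function name `gaussian_1d_mask` and the width parameter `sigma_fraction` are hypothetical and not taken from the original work.

```python
import numpy as np

def gaussian_1d_mask(n_lines, n_cols, keep_fraction=0.125, sigma_fraction=0.15, rng=None):
    """Binary mask that keeps whole k-space lines, drawn from a Gaussian density."""
    rng = np.random.default_rng() if rng is None else rng
    n_keep = int(round(keep_fraction * n_lines))
    centre = (n_lines - 1) / 2.0
    idx = np.arange(n_lines)
    # Gaussian sampling density over line indices, peaked at the k-space centre.
    pdf = np.exp(-0.5 * ((idx - centre) / (sigma_fraction * n_lines)) ** 2)
    pdf /= pdf.sum()
    lines = rng.choice(n_lines, size=n_keep, replace=False, p=pdf)
    mask = np.zeros((n_lines, n_cols))
    mask[lines, :] = 1.0
    return mask

mask = gaussian_1d_mask(64, 64, rng=np.random.default_rng(0))
acceleration = 1.0 / mask.mean()   # 8.0 for a 12.5% sampling rate
```

Keeping 8 of 64 lines yields a sampling rate of 12.5% and hence an 8-fold acceleration. Note that the mask is defined for a centred spectrum, so it would need an `np.fft.ifftshift` before being applied to the output of an unshifted FFT.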
Training Details and Parameters. For the reconstruction network, we used dealiasing blocks, and convolutional layers with 32 filters of size 3 × 3. For the refinement network, we used a U-Net with 32, 64, 128 encoding filters and 64, 32 decoding filters of size 4 × 4, batch normalization and leaky ReLU with slope 0.1. The discriminator used a PatchGAN [4] architecture with 64, 128, 256, 512, 1024, 1024 filters of size 4 × 4, and channel-wise dropout after the last 3 layers. The VGG loss used the final convolutional feature maps of a VGG-19 network pretrained on ImageNet. The reconstruction network was trained for 1500 epochs with batch size 20, the refinement network for 200 epochs with batch size 5, both using the Adam optimizer [6] with learning rate 0.0002. We found that the training is sensitive to the network’s initialization. Thus, we chose orthogonal initialization [13] for the refinement network and Gaussian initialization for the discriminator.
Evaluation Metrics. We use PSNR and SIS as evaluation metrics. To further evaluate our approach and assess how useful SIS is as a proxy for visual quality, we also asked a domain expert to rate all reconstructed images in which the left atrium anatomy and the atrial scars are visible. The rating ranges from 1 (poor) to 4 (very good), and is based on the overall image quality, the visibility of the atrial scar and the occurrence of artifacts. To obtain an unbiased rating, the expert was shown all images from all methods in randomized order.
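For reference, PSNR can be computed as in the sketch below. It assumes that the metric is evaluated on magnitude images and that the peak value is the maximum of the reference image; the text does not state which convention was used, so both choices, as well as the function name `psnr`, are assumptions.

```python
import numpy as np

def psnr(x_hat, x, data_range=None):
    """Peak signal-to-noise ratio in dB between reconstruction x_hat and reference x."""
    x_hat = np.abs(x_hat)
    x = np.abs(x)
    if data_range is None:
        data_range = x.max()        # assumed peak: maximum of the reference image
    mse = np.mean((x_hat - x) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

reference = np.array([[0.0, 1.0], [1.0, 0.0]])
recon = reference + 0.1             # uniform error of 0.1 on a unit-range image
value = psnr(recon, reference)      # MSE = 0.01, i.e. approximately 20 dB
```

Since PSNR is a monotone function of the MSE, a network trained purely with an MSE loss directly optimizes this metric.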
Results. We compared our approach against three other reconstruction methods: RecNet (https://github.com/js3611/DeepMRIReconstruction) [14] (i.e. the proposed approach without refinement step), DAGAN (https://github.com/nebulaV/DAGAN) [16] using both adversarial and perceptual loss, and DLMRI (http://www.ifp.illinois.edu/~yoram/DLMRILab/DLMRI.html) [10], a dictionary learning based method. No data augmentation was used for any of the methods.
We show the results of our evaluation in Table 1, and a sample reconstruction in Figure 2. RecNet performed best in terms of PSNR, which is expected as its training objective directly corresponds to this metric, but its reconstructions were oversmoothed. DLMRI had the lowest MOS, with its reconstructions showing heavy oil paint artifacts. DAGAN, combining MSE loss with a visual loss function without any further precautions, suffered from low PSNR. While its reconstructions also looked sharp, they were noisy and often displayed aliasing artifacts, which was reflected in a lower MOS compared to our method. Our proposed approach achieved a significantly higher mean opinion score than all other methods (significance determined by a two-sided paired Wilcoxon signed-rank test), while still maintaining high PSNR. Reconstructions obtained by our method appeared sharper with better contrast. Moreover, our method achieved the highest SIS, close to the segmentation performance on the ground truth data, which indicated that the segmented objects were clearly visible in the reconstructed images.
These results further demonstrate that PSNR alone is a subpar indicator for reconstruction quality, making our SIS a useful supplement to it. For our method, SIS agreed with the quality score given by the expert user. Somewhat surprisingly, the SIS of DLMRI is slightly higher than that of RecNet and DAGAN, although DLMRI has the worst MOS. We conjecture this is because, although DLMRI reconstructed images lack textural details, areas belonging to the same organ have similar intensity values, which helps the segmentation task. While scoring through an expert user is thus still the safest way to evaluate reconstructions, we believe that in conjunction with PSNR, SIS is a helpful tool to quickly judge image quality during the development of new models.
5 Conclusion
In this work, we highlighted the inadequacy of previously proposed deep learning-based CS-MRI methods using MSE loss functions in direct combination with visual loss functions. We improved on them by proposing a new refinement approach, which incorporates both loss functions in a harmonious way to improve the training stability. We demonstrated that our method can produce high quality reconstructions with large undersampling factors, while keeping higher PSNR values compared to other state-of-the-art methods. We also showed that the reconstruction obtained by our method can provide the best segmentation of the ROIs among all compared methods.
References
 [1] Dahl, R., et al.: Pixel Recursive Super Resolution. ICCV pp. 5449–5458 (2017)
 [2] Dosovitskiy, A., Brox, T.: Generating Images with Perceptual Similarity Metrics based on Deep Networks. In: NIPS (2016)
 [3] Goodfellow, I.J., et al.: Generative Adversarial Nets. In: NIPS (2014)

 [4] Isola, P., et al.: Image-to-Image Translation with Conditional Adversarial Networks. IEEE CVPR pp. 5967–5976 (2017)
 [5] Johnson, J., et al.: Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In: ECCV (2016)
 [6] Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR) (2015)
 [7] Ledig, C., et al.: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. IEEE CVPR pp. 105–114 (2017)
 [8] Lee, D., et al.: Deep Residual Learning for Compressed Sensing MRI. IEEE 14th International Symposium on Biomedical Imaging pp. 15–18 (2017)
 [9] Pfau, D., Vinyals, O.: Connecting Generative Adversarial Networks and Actor-Critic Methods. arXiv preprint arXiv:1610.01945 (2016)
 [10] Ravishankar, S., Bresler, Y.: MR Image Reconstruction From Highly Undersampled k-Space Data by Dictionary Learning. IEEE TMI 30, 1028–1041 (2011)
 [11] Ronneberger, O., et al.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. pp. 234–241 (2015)
 [12] Salimans, T., et al.: Improved Techniques for Training GANs. In: NIPS (2016)
 [13] Saxe, A.M., et al.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR (2014)

 [14] Schlemper, J., et al.: A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction. IEEE TMI (2017)
 [15] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: ICLR (2015)
 [16] Yang, G., et al.: DAGAN: Deep De-Aliasing Generative Adversarial Networks for Fast Compressed Sensing MRI Reconstruction. IEEE TMI (2018)
 [17] Yang, Y., et al.: Deep ADMM-Net for Compressive Sensing MRI. In: NIPS (2016)
Appendix A
The following images show more samples for 8-fold undersampling. For each of the seven patients of the test set, a random slice showing the left atrium was selected. The contour of the predicted segmentation of the left atrium is shown in yellow, the contour of the ground truth segmentation in red.