
Fast MRI Reconstruction: How Powerful Transformers Are?

by   Jiahao Huang, et al.

Magnetic resonance imaging (MRI) is a widely used non-radiative and non-invasive method for clinical interrogation of organ structures and metabolism, with an inherently long scanning time. Methods by k-space undersampling and deep learning based reconstruction have been popularised to accelerate the scanning process. This work focuses on investigating how powerful transformers are for fast MRI by exploiting and comparing different novel network architectures. In particular, a generative adversarial network (GAN) based Swin transformer (ST-GAN) was introduced for the fast MRI reconstruction. To further preserve the edge and texture information, edge enhanced GAN based Swin transformer (EES-GAN) and texture enhanced GAN based Swin transformer (TES-GAN) were also developed, where a dual-discriminator GAN structure was applied. We compared our proposed GAN based transformers, the standalone Swin transformer and other convolutional neural network based GAN models in terms of the evaluation metrics PSNR, SSIM and FID. We showed that transformers work well for the MRI reconstruction from different undersampling conditions. The utilisation of GAN's adversarial structure improves the quality of images reconstructed when undersampled for 30



I Introduction

Magnetic resonance imaging (MRI) is a widely used non-radiative and non-invasive method for interrogation of organ structures and metabolism [1]. However, a fully sampled MRI with high spatial resolution may require a long time to acquire [21]. Despite endeavours in parallel imaging and compressive sensing, traditional fast MRI methods suffered from a limited acceleration factor and a prolonged iterative procedure [18].

Recently, convolutional neural network (CNN) based models [17, 13, 9] were developed for the post-acquisition reconstruction of undersampled MRI. These models leverage their hierarchical structures to establish the latent sparse correlations, in both k-space and image space, between the undersampled and fully sampled MR images.

More recently, models with expanded receptive fields [12], namely transformers, started gaining attention for their unique advantages through their sequence-to-sequence model design [16] and adaptive self-attention setting [11]. Instead of spanning receptive fields across the whole image, a variant of the transformer with moving receptive field windows of reduced sizes was proposed, namely the Shifted windows (Swin) transformer [8], as shown in Fig. 1 (A). Such a design improved the adaptability of transformer based models while greatly reducing the computational complexity.

Fig. 1: (A) The schematic diagrams of receptive fields for 2D convolution (Conv2D), multi-head self-attention (MSA) in vanilla transformer, windows based multi-head self-attention (W-MSA) and shifted windows based multi-head self-attention (SW-MSA) in Shifted windows (Swin) transformer. Red box: Receptive fields; Green box: Pixels; Blue box: Patches. (B) The architecture of the proposed single-discriminator GAN based Swin transformer (ST-GAN), dual-discriminator edge enhanced GAN based Swin transformer (EES-GAN) and dual-discriminator texture enhanced GAN based Swin transformer (TES-GAN).
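The window mechanism described above can be illustrated with a minimal NumPy sketch. It only shows how a feature map is split into local windows (as in W-MSA) and how a cyclic shift by half a window changes the grouping (as in SW-MSA); the attention computation itself and the attention masks used for shifted windows are omitted. The function names and the toy 8 x 8 map are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    H and W are assumed to be multiples of window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (nH, nW, ws, ws, C)
    return x.reshape(-1, window_size * window_size, C)


def shifted_window_partition(x, window_size):
    """Cyclically shift the map by half a window, then partition it."""
    shift = window_size // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window_size)


# Toy feature map: 8 x 8 positions, 1 channel, values equal to the flat index.
feature_map = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
windows = window_partition(feature_map, window_size=4)
shifted_windows = shifted_window_partition(feature_map, window_size=4)
print(windows.shape, shifted_windows.shape)
```

In this toy case, self-attention inside a window involves 16 tokens instead of the 64 tokens of the whole map, which is where the reduction in computational complexity comes from.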

In this work, we focus on investigating how powerful transformers are for fast MRI by exploiting and comparing different novel architectures. In particular, the Swin transformer based GAN (ST-GAN) model (Fig. 1 (B)) is proposed for the fast MRI reconstruction, with a Swin transformer based generator and a discriminator for holistic MR image reconstruction. Besides, inspired by the dual-discriminator GAN structure for edge and texture preservation, edge enhanced GAN based Swin transformer (EES-GAN) and texture enhanced GAN based Swin transformer (TES-GAN) are also developed, to further exploit the combination of the transformers and GAN structure for MRI reconstruction. We compare these novel transformer based GANs and the standalone Swin transformer with other CNN based GANs and zero-filled baselines.

II Method

II-A Formulation

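MRI reconstruction aims to recover latent images $x$ from the undersampled k-space measurements $y$. Traditionally, the reconstruction problem can be converted into an optimisation problem as follows

$$\min_{x} \frac{1}{2} \| \mathcal{F}_u x - y \|_2^2 + \lambda R(x),$$

where $\mathcal{F}_u$ is the undersampled Fourier encoding operator, $R(x)$ is the regularisation term balanced by $\lambda$, and $\| \cdot \|_2$ denotes the $\ell_2$ norm.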
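As a rough illustration of the quantities in this formulation, the sketch below simulates undersampled k-space measurements from a synthetic image, forms the zero-filled reconstruction, and evaluates the data-fidelity part of the objective. Everything here is an assumption made for illustration: the image is random, the 1D column mask is drawn uniformly at random (the paper uses Gaussian 1D masks), and the helper names are hypothetical.

```python
import numpy as np


def undersample_kspace(image, mask):
    """Apply a binary mask to the 2D Fourier transform of an image."""
    kspace = np.fft.fft2(image, norm="ortho")
    return kspace * mask


def zero_filled_reconstruction(kspace_u):
    """Inverse FFT of masked k-space; unsampled entries stay zero."""
    return np.abs(np.fft.ifft2(kspace_u, norm="ortho"))


def data_fidelity(x, y, mask):
    """0.5 * || F_u x - y ||_2^2 for a masked Fourier operator F_u."""
    residual = mask * np.fft.fft2(x, norm="ortho") - y
    return 0.5 * np.sum(np.abs(residual) ** 2)


rng = np.random.default_rng(0)
image = rng.random((64, 64))            # synthetic stand-in for an MR image

# Assumed 1D mask: keep roughly 30% of the k-space columns at random.
mask_1d = rng.random(64) < 0.3
mask_1d[0] = True                       # always keep the zero-frequency column
mask = np.tile(mask_1d, (64, 1))

y = undersample_kspace(image, mask)     # undersampled measurements
x_zf = zero_filled_reconstruction(y)    # zero-filled (ZF) baseline image

print("sampled fraction:", mask.mean())
print("data fidelity of the true image:", data_fidelity(image, y, mask))
```

The true image has zero data-fidelity cost by construction, which is why a regularisation term (or a learned prior) is needed to pick a plausible image among the many that agree with the sampled k-space lines.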

With its superior ability of feature extraction, the CNN has been applied to MRI reconstruction to reduce the long reconstruction time of traditional methods, which can be formulated by

$$\hat{x}_u = f_{\mathrm{CNN}}(x_u \mid \theta),$$

where $f_{\mathrm{CNN}}$ is the CNN with parameters $\theta$, trained to map undersampled MR images $x_u$ to reconstructed MR images $\hat{x}_u$.

II-B Network Architecture

II-B1 Swin Transformer Based Generator

Fig. 2: (A) Structure of the Swin transformer based generator. IM: input module; OM: output module. (B) Structure of the residual Swin transformer blocks (RSTBs). STL: Swin transformer layer; Conv2D: 2D convolutional layer. (C) Loss functions applied. $\mathcal{L}_{\mathrm{pixel}}$: Pixel-wise Charbonnier loss; $\mathcal{L}_{\mathrm{freq}}$: Frequency Charbonnier loss; $\mathcal{L}_{\mathrm{VGG}}$: Perceptual VGG L1 loss; $\mathcal{L}_{\mathrm{adv}}(G, D_1)$: Adversarial loss from the discriminator for holistic image reconstruction (in ST-GAN, EES-GAN and TES-GAN); $\mathcal{L}_{\mathrm{adv}}(G, D_2)$: Adversarial loss from the discriminator for edge and texture information preservation (in EES-GAN and TES-GAN).

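As Fig. 2 (A) shows, a Swin transformer based generator [7, 4], which consists of an input module (IM), a cascade of residual Swin transformer blocks (RSTBs) and an output module (OM), is applied in our proposed ST-GAN, EES-GAN and TES-GAN. A residual connection is applied between the input and output to stabilise the training process as follows:

$$\hat{x}_u = G(x_u) = H_{\mathrm{OM}}(H_{\mathrm{RSTBs}}(H_{\mathrm{IM}}(x_u))) + x_u,$$

where $x_u$ and $\hat{x}_u$ are the undersampled input and the reconstructed output, and $H_{\mathrm{IM}}$, $H_{\mathrm{RSTBs}}$ and $H_{\mathrm{OM}}$ denote the IM, the cascade of RSTBs and the OM, respectively.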


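The IM is a Conv2D at the beginning of the network for shallow feature extraction, which maps the image space $\mathbb{R}^{H \times W \times C_{\mathrm{in}}}$ to a high-dimensional feature space $\mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the image height and width, $C_{\mathrm{in}}$ is the number of image channels and $C$ is the number of feature channels. The channel number is enlarged for the subsequent transformer modules.

The OM is a Conv2D placed at the end, mapping from the high-dimensional feature space $\mathbb{R}^{H \times W \times C}$ back to the output image space $\mathbb{R}^{H \times W \times C_{\mathrm{in}}}$.

As Fig. 2 (B) shows, the RSTB is composed of a series of Swin transformer layers (STLs) and a Conv2D with a residual connection. Patch embedding and patch unembedding operations are placed before the first STL and after the Conv2D to convert the feature map between $\mathbb{R}^{H \times W \times C}$ and $\mathbb{R}^{HW \times C}$.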
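The data flow of the generator can be sketched structurally as follows. This is a shape-level illustration only: the input module, the residual blocks and the output module are replaced by trivial stand-in functions (assumptions, not the authors' layers), and patch embedding is assumed to be a plain reshape between a spatial map of shape (H, W, C) and a token sequence of shape (H*W, C).

```python
import numpy as np


def patch_embed(feature_map):
    """Flatten an (H, W, C) feature map into an (H*W, C) token sequence."""
    H, W, C = feature_map.shape
    return feature_map.reshape(H * W, C)


def patch_unembed(tokens, height, width):
    """Restore an (H*W, C) token sequence to an (H, W, C) feature map."""
    return tokens.reshape(height, width, tokens.shape[-1])


def generator(x_u, input_module, rstb_stack, output_module):
    """Global residual structure: OM(RSTBs(IM(x_u))) + x_u."""
    features = input_module(x_u)
    for block in rstb_stack:
        features = block(features)
    return output_module(features) + x_u


# Stand-in modules (hypothetical): channel expansion, identity blocks, channel mean.
input_module = lambda x: np.repeat(x, 4, axis=-1)          # 1 -> 4 channels
rstb_stack = [lambda f: f.copy() for _ in range(6)]        # 6 placeholder blocks
output_module = lambda f: f.mean(axis=-1, keepdims=True)   # 4 -> 1 channel

x_u = np.random.default_rng(0).random((8, 8, 1))
x_hat = generator(x_u, input_module, rstb_stack, output_module)

tokens = patch_embed(input_module(x_u))
print(tokens.shape, patch_unembed(tokens, 8, 8).shape, x_hat.shape)
```

Because of the global residual connection, the stacked blocks only need to learn a correction to the undersampled input rather than the whole image.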

II-B2 Edge and Texture Enhanced GAN Structure

As Fig. 1 (B) shows, in our proposed ST-GAN, a standard two-player GAN structure, i.e., one generator and one discriminator, is applied. The only discriminator is a U-Net based discriminator [14] for holistic image reconstruction, which aims to distinguish reconstructed MR images $\hat{x}_u$ from ground truth MR images $x$.

For the proposed EES-GAN and TES-GAN, dual-discriminator GAN structures were utilised to train the generator. Similar to ST-GAN, the discriminator for holistic reconstruction takes the reconstructed MR images $\hat{x}_u$ and the ground truth MR images $x$ as input. In the proposed EES-GAN, an additional U-Net based discriminator for edge information preservation takes as input the edge information of both $\hat{x}_u$ and $x$, extracted by the Sobel operator. In the proposed TES-GAN, an additional U-Net based discriminator for texture information preservation is applied, whose inputs are the texture information of both $\hat{x}_u$ and $x$, extracted by the Gabor operator.
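To make the edge branch more concrete, here is a minimal NumPy sketch of Sobel edge extraction of the kind that could feed the second discriminator in EES-GAN. It uses the standard 3 x 3 Sobel kernels and combines the horizontal and vertical responses into a gradient magnitude; how the authors combine the two responses, and how they pad the image, is not stated in the text, so both choices here are assumptions. The Gabor texture branch of TES-GAN is not covered.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter2d(image, kernel):
    """Valid-mode 2D cross-correlation of a 2D image with a small kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def sobel_edges(image):
    """Gradient magnitude from the two Sobel responses (assumed combination)."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


# Toy image with a vertical step edge between columns 3 and 4.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
edges = sobel_edges(step)
print(edges.shape)
print(edges[0])
```

In a dual-discriminator setting, the same operator would be applied to both the reconstructed and the ground truth image before passing them to the second discriminator.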

II-C Loss Function

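As shown in Fig. 2 (C), the loss function consists of a pixel-wise loss $\mathcal{L}_{\mathrm{pixel}}$, a frequency loss $\mathcal{L}_{\mathrm{freq}}$, a perceptual VGG loss $\mathcal{L}_{\mathrm{VGG}}$ and a total adversarial loss $\mathcal{L}_{\mathrm{adv}}$.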

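The pixel-wise loss $\mathcal{L}_{\mathrm{pixel}}$ and the frequency loss $\mathcal{L}_{\mathrm{freq}}$ are defined as

$$\mathcal{L}_{\mathrm{pixel}} = \sqrt{\| x - \hat{x}_u \|_2^2 + \epsilon^2}, \qquad \mathcal{L}_{\mathrm{freq}} = \sqrt{\| \mathcal{F} x - \mathcal{F} \hat{x}_u \|_2^2 + \epsilon^2},$$

where $x$ and $\hat{x}_u$ are the ground truth and reconstructed MR images and $\mathcal{F}$ is the Fourier transform. The Charbonnier loss [6] is utilised for its superior robustness to outliers, and $\epsilon$ is a small constant set empirically.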
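A minimal NumPy sketch of Charbonnier-style pixel-wise and frequency losses is given below. It assumes the form sqrt(||a - b||_2^2 + eps^2), applied to the images for the pixel-wise term and to their 2D Fourier transforms for the frequency term. The value eps = 1e-3 is a placeholder chosen for this illustration only and is not the value used by the authors; the FFT normalisation is likewise an assumption.

```python
import numpy as np


def charbonnier(a, b, eps):
    """Charbonnier-style distance: sqrt(||a - b||_2^2 + eps^2)."""
    diff = a - b
    return np.sqrt(np.sum(np.abs(diff) ** 2) + eps ** 2)


def pixel_loss(x, x_hat, eps=1e-3):
    """Charbonnier distance between two images in image space."""
    return charbonnier(x, x_hat, eps)


def frequency_loss(x, x_hat, eps=1e-3):
    """Charbonnier distance between the 2D Fourier transforms of two images."""
    return charbonnier(np.fft.fft2(x), np.fft.fft2(x_hat), eps)


x = np.zeros((4, 4))        # toy ground truth
x_hat = np.ones((4, 4))     # toy reconstruction

print("pixel loss:", pixel_loss(x, x_hat))
print("frequency loss:", frequency_loss(x, x_hat))
print("loss for identical images:", pixel_loss(x, x))
```

The constant eps keeps the loss differentiable at zero error, so for identical images the loss equals eps rather than zero.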

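The perceptual VGG loss measures the distance between the high-dimensional feature mappings of two images produced by a pre-trained VGG network, and is defined as

$$\mathcal{L}_{\mathrm{VGG}} = \| f_{\mathrm{VGG}}(x) - f_{\mathrm{VGG}}(\hat{x}_u) \|_1,$$

where $f_{\mathrm{VGG}}$ denotes the pre-trained VGG network and $\| \cdot \|_1$ is the $\ell_1$ norm.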

For ST-GAN, the total adversarial loss is the adversarial loss $\mathcal{L}_{\mathrm{adv}}(G, D_1)$, where $G$ and $D_1$ denote the generator and the discriminator for holistic image reconstruction.

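For EES-GAN and TES-GAN, the total adversarial loss is defined as

$$\mathcal{L}_{\mathrm{adv}} = \eta_1 \mathcal{L}_{\mathrm{adv}}(G, D_1) + \eta_2 \mathcal{L}_{\mathrm{adv}}(G, D_2),$$

where $G$, $D_1$ and $D_2$ denote the generator, the discriminator for holistic image reconstruction and the discriminator for edge and texture information preservation, respectively. $D_2$ receives $T(x)$ and $T(\hat{x}_u)$, where $T(\cdot)$ denotes the Sobel operator in EES-GAN and the Gabor operator in TES-GAN. $\eta_1$ and $\eta_2$ are the coefficients that balance the two terms.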

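The total loss can be formulated by

$$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{pixel}} + \beta \mathcal{L}_{\mathrm{freq}} + \gamma \mathcal{L}_{\mathrm{VGG}} + \eta \mathcal{L}_{\mathrm{adv}},$$

where $\alpha$, $\beta$, $\gamma$ and $\eta$ are the coefficients that balance each term.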

III Experiments and Results

III-A Dataset

Our proposed methods were trained and tested on the Calgary Campinas dataset [15]. The dataset contains 15360 2D slices of 12-channel T1-weighted brain MR images, which were divided into training, validation and testing sets in a ratio of 5:2:3 (corresponding to 7680, 3072 and 4608 slices, respectively).
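The slice counts follow directly from the 5:2:3 ratio, as the short check below shows. The helper name is hypothetical, and the sketch only reproduces the arithmetic; it does not say how the slices were assigned to each subset.

```python
def split_counts(total, ratio):
    """Divide `total` items according to integer ratio parts."""
    parts = sum(ratio)
    counts = [total * r // parts for r in ratio]
    counts[-1] += total - sum(counts)  # give any rounding remainder to the last subset
    return counts


train, val, test = split_counts(15360, (5, 2, 3))
print(train, val, test)  # 7680 3072 4608
```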

III-B Implementation Details

The proposed ST-GAN, EES-GAN and TES-GAN were trained on two NVIDIA RTX 3090 GPUs with 24GB GPU RAM and tested on an NVIDIA RTX 3090 GPU or an Intel Core i9-10980XE CPU. We applied 6 RSTBs and 6 STLs in each RSTB in the generator, and the patch and channel number were set to 96 and 180, respectively. The coefficients of the pixel-wise, frequency, perceptual VGG and adversarial terms in the total loss function, $\alpha$, $\beta$, $\gamma$ and $\eta$, were set to 15, 0.1, 0.0025 and 0.1, respectively. For the EES-GAN and TES-GAN, the two adversarial coefficients $\eta_1$ and $\eta_2$ were both set to 0.05.
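The weighting of the loss terms can be sketched as a plain weighted sum. The mapping of the four reported values (15, 0.1, 0.0025, 0.1) to the pixel-wise, frequency, perceptual VGG and adversarial terms is assumed from the order in which the terms are listed, and the function names and the example loss values are hypothetical.

```python
def dual_adversarial_loss(adv_holistic, adv_edge_or_texture, weights=(0.05, 0.05)):
    """Weighted sum of the two discriminator losses (EES-GAN / TES-GAN)."""
    w1, w2 = weights
    return w1 * adv_holistic + w2 * adv_edge_or_texture


def total_loss(pixel, frequency, vgg, adversarial, weights=(15.0, 0.1, 0.0025, 0.1)):
    """Weighted sum of the four loss terms with the reported coefficients."""
    w_pixel, w_freq, w_vgg, w_adv = weights
    return w_pixel * pixel + w_freq * frequency + w_vgg * vgg + w_adv * adversarial


# Example with made-up loss values of 1.0 for every term.
print(total_loss(1.0, 1.0, 1.0, 1.0))
print(dual_adversarial_loss(1.0, 1.0))
```

With these coefficients the pixel-wise term dominates the objective, while the perceptual and adversarial terms act as comparatively small corrections.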

III-C Evaluation Methods

In the experiment section, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and Fréchet Inception Distance (FID) [2] were applied for the evaluation of different methods. PSNR is a shallow pixel-scale evaluation metric that represents the ratio between the maximum signal power and the noise power of two images. SSIM is a shallow perception based evaluation metric measuring the structural similarity between two images. FID calculates the Fréchet distance between two image sets using a pre-trained Inception V3 network, measuring the similarity between the two sets.

PSNR and SSIM are not sufficient for measuring the visual perceptual quality of images, since visual perception is subject to more latent relationships in a higher dimension [22]. FID correlates well with perceptual quality and is therefore more appropriate here.
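For reference, PSNR can be computed in a few lines. The sketch below uses the common definition 10 * log10(data_range^2 / MSE) and assumes images scaled to [0, 1]; the intensity range and any masking used in the authors' evaluation are not stated, so treat these as assumptions.

```python
import numpy as np


def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


reference = np.zeros((16, 16))
reconstruction = np.full((16, 16), 0.1)   # constant error of 0.1 per pixel
print(psnr(reference, reconstruction))    # approximately 20 dB
```

Because PSNR depends only on the mean squared pixel error, a blurry but pixel-wise close image can score higher than a sharper one with slightly displaced textures, which is the limitation discussed above.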

III-D Comparisons with Other Methods

In this experiment, our proposed ST-GAN, EES-GAN and TES-GAN were compared with other MRI reconstruction methods, including CNN based GAN methods, i.e., DAGAN [19] and PIDD-GAN [3], and a Swin transformer based method, i.e., SwinMR [4], using Gaussian 1D 10% (G1D10%) and 30% (G1D30%) k-space undersampling masks.

Fig. 3 displays the samples of the ground truth (GT), undersampled zero-filled images (ZF) and the reconstructed MR images with G1D10% and G1D30% masks. Fig. 4 and TABLE I show the quantitative results of reconstruction by different methods.

Fig. 3: Comparison results of different models (DAGAN, PIDD-GAN, SwinMR, ST-GAN, EES-GAN and TES-GAN) against the fully sampled ground truth MR images (GT), with reference to the undersampled zero-filled MR images (ZF), under different sampling rates of (A) 10% and (B) 30% by Gaussian 1D (G1D) undersampling masks.

Fig. 4: Boxplots illustrating the distributions of Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) of the results from different models against the fully sampled ground truth (GT), with reference to the pre-reconstruction undersampled zero-filled MR images (ZF) using the Gaussian 1D 10% mask. A paired t-test has been performed between methods, and the difference in distribution between any two groups is significant. (The boxplots mark the interquartile range, the 1% and 99% confidence interval, the maximum and minimum, the mean and the median.)

Method     SSIM            PSNR            FID      Training time   Inference time
ZF         0.749 (0.018)   22.81 (0.73)    156.38   -               -
DAGAN      0.782 (0.018)   24.95 (0.73)    56.04    *               0.003
PIDD-GAN   0.859 (0.020)   26.83 (0.87)    17.55    *               0.006
SwinMR     0.876 (0.022)   27.43 (1.12)    27.66    59.269          0.388
ST-GAN     0.848 (0.025)   26.54 (1.11)    12.67    85.921          0.388
EES-GAN    0.851 (0.024)   26.64 (1.13)    12.95    114.028         0.388
TES-GAN    0.854 (0.023)   26.73 (1.11)    11.83    114.703         0.388

TABLE I: Structural Similarity Index Measure (SSIM), Peak Signal to Noise Ratio (PSNR) and Fréchet Inception Distance (FID) metrics, and the times for training and inference, for different models, with reference to the pre-reconstruction undersampled zero-filled images (ZF) using the Gaussian 1D 10% mask. (The best values are the SSIM and PSNR of SwinMR and the FID of TES-GAN. * indicates training with different batch sizes, for which a fair comparison cannot be presented.)

The adversarially trained transformers, i.e., ST-GAN, EES-GAN and TES-GAN, can produce richer edge and texture details in the reconstructed images, particularly for G1D10% zero-filled images (Fig. 3). However, these enhanced texture details could be wrongly represented (Fig. 3 (B)). Fig. 3 also shows that the extra discriminator in the dual-discriminator GANs (EES-GAN and TES-GAN) improved the perceptual quality of the reconstruction, but only marginally.

SwinMR achieved the highest SSIM and PSNR, whereas the SSIMs and PSNRs of the results reconstructed by all transformer based GANs (ST-GAN, EES-GAN and TES-GAN) fell slightly behind (Fig. 4). However, when assessed by the visual perceptual metric FID, TES-GAN achieved the best score, followed by the other transformer based GANs, while the non-GAN model SwinMR had a higher FID than all three (TABLE I).

Training time in TABLE I refers to the model training time for 100 steps on two NVIDIA 3090 GPUs with 24GB GPU RAM. Inference time refers to the model inference time for one image on an NVIDIA 3090 GPU with 24GB GPU RAM. The dual-discriminator GAN based transformers, i.e., EES-GAN and TES-GAN, have the longest training time, followed by the single-discriminator GAN based transformer ST-GAN.

IV Discussion

In this work, we have assessed the performance of transformer models for fast MRI reconstruction. The standalone Swin transformer model (SwinMR) and its GAN based variants, i.e., ST-GAN (standard GAN), EES-GAN and TES-GAN (dual-discriminator structure with edge and texture enhancement), have been evaluated against other CNN based GANs (DAGAN and PIDD-GAN). Experimental results have shown that the best score on every metric was achieved by a transformer model: SwinMR obtained the highest SSIM and PSNR, outperforming the CNN based methods, and the transformer based GANs obtained the lowest FID (TABLE I).

To further explore the capabilities of transformers for MRI reconstruction, we coupled the Swin transformer with a U-Net based adversarial discriminator, forming ST-GAN. To further enhance the edges and textures, an additional U-Net discriminator, using either the edge information from the Sobel operator or the texture information from the Gabor operator, was appended to ST-GAN to form EES-GAN or TES-GAN, respectively.

We initially hypothesised that the adversarial training in ST-GAN, EES-GAN and TES-GAN could further improve the quality of reconstruction beyond SwinMR. The reconstruction quality was then assessed by the PSNR and SSIM metrics, which depict pixel-to-pixel and structural similarities, and by the FID score, which reflects the visual perceptual experience in a higher dimension. Although richer textures and edges seemed to have been restored by these transformer based GANs (Fig. 3 (B)), with lower FID scores (TABLE I) and a much lower converged perceptual VGG loss in training (Fig. 5), these models exhibited lower PSNR and SSIM scores than the standalone SwinMR (Fig. 4). This means that although the models may give a better visual perceptual experience in terms of texture (illustrated by FID and the converged perceptual VGG loss in training), the reconstructed images differ more from the ground truth when compared pixel by pixel (illustrated by PSNR and SSIM).

This discrepancy may introduce flaws in the reconstructed images. Although all GAN generated images in Fig. 3 (B) seemed clear, a comparison with the ground truth shows that some hallucinated brain structural textures were added to the images. This may suggest that these GAN based reconstruction methods could have less specificity when reconstructing brain MRI, particularly when the acquisition is greatly undersampled at only a 10% rate (Fig. 3 (B)). Previous studies have shown that, unlike the $\ell_1$ or $\ell_2$ loss, which focuses on the reconstruction of global low-frequency structures, the adversarial loss in GANs focuses on generating high-frequency details [5, 20, 10]. This further explains the clear yet incorrect high-frequency textures generated when the MRI is greatly undersampled (e.g., using G1D10%).

Fig. 5: Model learning curves for SwinMR, ST-GAN, EES-GAN and TES-GAN with respect to (A) Pixel-wise L1 loss, (B) Frequency L1 loss, and (C) Perceptual VGG loss.

In terms of computational complexity, the additional edge/texture enhancing adversarial discriminators in EES-GAN and TES-GAN greatly increased the training times while leaving the inference times unchanged (TABLE I), and brought little improvement in terms of the quantitative metrics (TABLE I and Fig. 4) and the converged training losses (Fig. 5). Therefore, overall we recommend ST-GAN for undersampled MRI reconstruction with G1D30%, because the model enhances the textures while maintaining its specificity. At lower sampling rates (e.g., 10%), these GAN based transformer models may introduce misleading structural changes in the images, so SwinMR is a better choice.

V Conclusions

Our study has explored the potential of transformers and their GAN variants in fast MRI. We have proposed the standard GAN coupled transformer ST-GAN and the dual-discriminator GAN based transformers EES-GAN and TES-GAN for edge and texture enhancement. We have assessed their performance by the shallow metrics PSNR and SSIM and by the higher-dimensional visual perceptual metric FID. Upon comparison, we can recommend ST-GAN for the reconstruction of undersampled MRI with sampling rates higher than 30%, due to its capability of edge and texture enhancement. At lower sampling rates (e.g., 10%), these GAN based transformer models may give misleading texture enhancement with a lower specificity, so SwinMR is a better choice. In summary, transformer based models have shown great performance in fast MRI, and we can envisage further development for clinically specific problems and combination with prior information of MR physics.


  • [1] Y. Chen, C. Schönlieb, P. Liò, T. Leiner, P. L. Dragotti, G. Wang, D. Rueckert, D. Firmin, and G. Yang (2022) AI-based reconstruction for fast MRI–a systematic review and meta-analysis. Proceedings of the IEEE, arXiv preprint arXiv:2112.12744.
  • [2] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30.
  • [3] J. Huang, W. Ding, J. Lv, J. Yang, H. Dong, J. Del Ser, J. Xia, T. Ren, S. Wong, and G. Yang (2021) Edge-enhanced dual discriminator generative adversarial network for fast MRI with parallel imaging using multi-view information. Applied Intelligence, arXiv preprint arXiv:2112.05758.
  • [4] J. Huang, Y. Fang, Y. Wu, H. Wu, Z. Gao, Y. Li, J. Del Ser, J. Xia, and G. Yang (2022) Swin transformer for fast MRI. arXiv preprint arXiv:2201.03230.
  • [5] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.
  • [6] W. Lai, J. Huang, N. Ahuja, and M. Yang (2019) Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (11), pp. 2599–2613.
  • [7] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte (2021) SwinIR: image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844.
  • [8] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
  • [9] J. Lv, G. Li, X. Tong, W. Chen, J. Huang, C. Wang, and G. Yang (2021) Transfer learning enhanced generative adversarial networks for multi-channel MRI reconstruction. Computers in Biology and Medicine, pp. 104504.
  • [10] J. Lv, C. Wang, and G. Yang (2021) PIC-GAN: a parallel imaging coupled generative adversarial network for accelerated multi-channel MRI reconstruction. Diagnostics 11 (1), pp. 61.
  • [11] C. Matsoukas, J. F. Haslum, M. Söderberg, and K. Smith (2021) Is it time to replace CNNs with transformers for medical images?. arXiv preprint arXiv:2108.09038.
  • [12] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In International Conference on Machine Learning, pp. 4055–4064.
  • [13] J. Schlemper, G. Yang, P. Ferreira, A. Scott, L. McGill, Z. Khalique, M. Gorodezky, M. Roehl, J. Keegan, D. Pennell, et al. (2018) Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 295–303.
  • [14] E. Schönfeld, B. Schiele, and A. Khoreva (2020) A U-Net based discriminator for generative adversarial networks. arXiv preprint arXiv:2002.12655.
  • [15] R. Souza, O. Lucena, J. Garrafa, D. Gobbi, M. Saluzzi, S. Appenzeller, L. Rittner, R. Frayne, and R. Lotufo (2018) An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement. NeuroImage 170, pp. 482–494.
  • [16] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
  • [17] S. Wang, Z. Su, L. Ying, X. Peng, S. Zhu, F. Liang, D. Feng, and D. Liang (2016) Accelerating magnetic resonance imaging via deep learning. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 514–517.
  • [18] G. Yang, J. Lv, Y. Chen, J. Huang, and J. Zhu (2021) Generative adversarial networks (GAN) powered fast magnetic resonance imaging–mini review, comparison and perspectives. arXiv preprint arXiv:2105.01800.
  • [19] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, and D. Firmin (2018) DAGAN: deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Transactions on Medical Imaging 37, pp. 1310–1321.
  • [20] Z. Yuan, M. Jiang, Y. Wang, B. Wei, Y. Li, P. Wang, W. Menpes-Smith, Z. Niu, and G. Yang (2020) SARA-GAN: self-attention and relative average discriminator based generative adversarial networks for fast compressed sensing MRI reconstruction. Frontiers in Neuroinformatics 14.
  • [21] J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno, et al. (2018) fastMRI: an open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839.
  • [22] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.