Magnetic resonance imaging (MRI) is a widely used non-ionising and non-invasive method for interrogating organ structure and metabolism. However, fully sampled MRI at high spatial resolution can require a long acquisition time. Despite endeavours in parallel imaging and compressed sensing, traditional fast MRI methods suffer from limited acceleration factors and prolonged iterative reconstruction procedures.
Recently, convolutional neural network (CNN) based models [17, 13, 9] were developed for the post-acquisition reconstruction of undersampled MRI, leveraging their hierarchical structures to establish the latent sparse correlations, in both k-space and image space, between undersampled and fully sampled MR images.
More recently, models with expanded receptive fields, namely transformers, have been gaining attention for their unique advantages: a sequence-to-sequence model design and an adaptive self-attention mechanism. Instead of spanning receptive fields across the whole image, a transformer variant with shifted receptive-field windows of reduced size, the Shifted windows (Swin) transformer, was proposed, as shown in Fig. 1 (A). This design improved the adaptability of transformer based models while greatly reducing their computational complexity.
In this work, we investigate how powerful transformers are for fast MRI by exploiting and comparing different novel architectures. In particular, a Swin transformer based GAN (ST-GAN) model (Fig. 1 (B)) is proposed for fast MRI reconstruction, with a Swin transformer based generator and a discriminator for holistic MR image reconstruction. In addition, inspired by the dual-discriminator GAN structure for edge and texture preservation, an edge enhanced Swin transformer based GAN (EES-GAN) and a texture enhanced Swin transformer based GAN (TES-GAN) are also developed, to further exploit the combination of transformers and GAN structures for MRI reconstruction. We compare these novel transformer based GANs and the standalone Swin transformer against other CNN based GANs and zero-filled baselines.
MRI reconstruction aims to recover latent images from undersampled k-space measurements. Traditionally, the reconstruction problem can be converted into an optimisation problem as follows:

$\hat{x} = \arg\min_{x} \frac{1}{2} \| \mathcal{F}_u x - y \|_2^2 + \lambda \mathcal{R}(x),$

where $x$ is the latent image, $y$ is the undersampled k-space measurement, $\mathcal{F}_u$ is the undersampled Fourier transform, $\mathcal{R}(x)$ is the regularisation term balanced by $\lambda$, and $\| \cdot \|_2$ denotes the $\ell_2$ norm.
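To make the data-consistency term of this optimisation concrete, the following NumPy sketch (illustrative, not from the paper) simulates 1D Cartesian undersampling of k-space and the zero-filled reconstruction used as a baseline later; the image, mask density and sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))   # stand-in for a fully sampled image
k = np.fft.fft2(x)                  # fully sampled k-space

# keep roughly 30% of phase-encode lines (columns), always keep the centre
keep = rng.random(64) < 0.3
keep[28:36] = True
mask = np.zeros((64, 64))
mask[:, keep] = 1.0

k_u = k * mask                      # undersampled measurements y = M F x
x_zf = np.fft.ifft2(k_u)            # zero-filled estimate (aliased image)

# the data-consistency term || M F x - y ||_2 vanishes for the
# zero-filled estimate, but the image itself is corrupted by aliasing
residual = np.linalg.norm(mask * np.fft.fft2(x_zf) - k_u)
```

The regulariser in the optimisation problem is exactly what is needed to pick a plausible image out of the many that satisfy data consistency.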
With its superior feature extraction ability, the CNN has been applied to MRI reconstruction to reduce the long reconstruction time of traditional methods, which can be formulated as

$\hat{x} = f_{\mathrm{CNN}}(x_u \mid \theta),$

where $f_{\mathrm{CNN}}$ is the CNN with parameters $\theta$, trained to map undersampled MR images $x_u$ to reconstructed MR images $\hat{x}$.
II-B Network Architecture
II-B1 Swin Transformer Based Generator
The generator, which consists of an input module (IM), a cascade of residual Swin transformer blocks (RSTBs) and an output module (OM), is applied in our proposed ST-GAN, EES-GAN and TES-GAN. A residual connection is applied between the input and the output to stabilise the training process, as follows:

$\hat{x} = \mathrm{OM}(\mathrm{RSTBs}(\mathrm{IM}(x_u))) + x_u,$

where $x_u$ is the undersampled input image and $\hat{x}$ is the reconstruction.
The IM is a Conv2D at the beginning of the network for shallow feature extraction, mapping the input image space to a high-dimensional feature space. The channel dimension is enlarged for the follow-up transformer modules.
The OM is a Conv2D placed at the end of the network, mapping from the high-dimensional feature space back to the output image space.
As Fig. 2 (A) shows, the RSTB is composed of a series of Swin transformer layers (STLs) and a Conv2D, with a residual connection. Patch embedding and patch unembedding operations are placed before the first STL and after the Conv2D, respectively, to convert the feature map between its image and token-sequence representations.
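As a concrete illustration of this block layout, here is a hedged PyTorch sketch of one RSTB: patch embedding, a stack of transformer layers, a Conv2D, and the residual connection. For brevity, standard global multi-head self-attention stands in for the shifted-window attention of a real STL, and all sizes are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn

class STL(nn.Module):
    """Simplified stand-in for a Swin transformer layer (global attention,
    not windowed): pre-norm attention followed by a pre-norm MLP."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, x):                     # x: (B, N, C) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class RSTB(nn.Module):
    """Residual Swin transformer block: embed -> STLs -> Conv2D -> unembed,
    with a residual connection around the whole block."""
    def __init__(self, dim, depth=2):
        super().__init__()
        self.layers = nn.Sequential(*[STL(dim) for _ in range(depth)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):                     # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)      # patch embedding: (B, H*W, C)
        t = self.layers(t)
        f = t.transpose(1, 2).reshape(b, c, h, w)  # patch unembedding
        return x + self.conv(f)               # residual connection

feat = torch.randn(1, 32, 16, 16)
out = RSTB(32)(feat)                          # shape preserved: (1, 32, 16, 16)
```

The residual path lets each block refine the feature map rather than replace it, which eases optimisation when several RSTBs are cascaded.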
II-B2 Edge and Texture Enhanced GAN Structure
As Fig. 1 (B) shows, our proposed ST-GAN applies the standard two-player GAN structure, i.e., one generator and one discriminator. The only discriminator is a U-Net based discriminator for holistic image reconstruction, which aims to distinguish reconstructed MR images from ground truth MR images.
For the proposed EES-GAN and TES-GAN, dual-discriminator GAN structures are utilised to train the generator. As in ST-GAN, the discriminator for holistic reconstruction takes the reconstructed MR images and the ground truth MR images as input. In the proposed EES-GAN, an additional U-Net based discriminator for edge information preservation takes as input the edge maps of both images, extracted by the Sobel operator. In the proposed TES-GAN, an additional U-Net based discriminator for texture information preservation is applied, whose inputs are the texture maps of both images, extracted by the Gabor operator.
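To illustrate what the edge discriminator of EES-GAN sees, the following NumPy sketch computes the Sobel gradient magnitude of an image; the step-edge test image is a hypothetical stand-in, not data from the paper.

```python
import numpy as np

def sobel_edges(img):
    """Sobel gradient magnitude via explicit 3x3 correlation (edge-padded)."""
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    gx = np.zeros(img.shape, dtype=float)
    gy = np.zeros(img.shape, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()       # horizontal gradient
            gy[i, j] = (win * ky).sum()       # vertical gradient
    return np.hypot(gx, gy)                   # gradient magnitude

img = np.zeros((16, 16))
img[:, 8:] = 1.0                              # vertical step edge
edges = sobel_edges(img)                      # response concentrates at the step
```

Feeding such edge maps (rather than raw images) to the second discriminator forces the generator to match edge statistics explicitly; TES-GAN does the same with Gabor texture responses instead.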
II-C Loss Function
As Fig. 2 (C) shows, the loss function consists of a pixel-wise loss $\mathcal{L}_{\mathrm{pixel}}$, a frequency loss $\mathcal{L}_{\mathrm{freq}}$, a perceptual VGG loss $\mathcal{L}_{\mathrm{VGG}}$ and a total adversarial loss $\mathcal{L}_{\mathrm{adv}}$.
The pixel-wise loss and the frequency loss are defined as

$\mathcal{L}_{\mathrm{pixel}} = \ell_{\mathrm{Char}}(x, \hat{x}), \quad \mathcal{L}_{\mathrm{freq}} = \ell_{\mathrm{Char}}(\mathcal{F} x, \mathcal{F} \hat{x}),$

where $\mathcal{F}$ denotes the Fourier transform, and the Charbonnier loss

$\ell_{\mathrm{Char}}(a, b) = \sqrt{\| a - b \|_2^2 + \epsilon^2}$

is utilised for its superior robustness to outliers, with $\epsilon$ set empirically to a small positive constant.
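A minimal NumPy sketch of a per-pixel variant of the Charbonnier loss follows; the value of eps here is an illustrative choice, not the paper's empirical setting.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Per-pixel Charbonnier loss: behaves like L1 for large errors
    but stays smooth and differentiable near zero."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

a = np.array([0.0, 1.0])
b = np.array([0.0, 0.0])
loss = charbonnier(a, b)   # roughly mean(|a - b|) for errors >> eps
```

The smoothness near zero is what makes it more robust in training than a plain L1 loss, while avoiding the over-smoothing tendency of L2.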
The perceptual VGG loss measures the distance between two images in the high-dimensional feature space of a pre-trained VGG network, and is defined as

$\mathcal{L}_{\mathrm{VGG}} = \| f_{\mathrm{VGG}}(x) - f_{\mathrm{VGG}}(\hat{x}) \|_2,$

where $f_{\mathrm{VGG}}$ denotes the pre-trained VGG network and $\| \cdot \|_2$ is the $\ell_2$ norm.
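The idea behind a perceptual loss can be sketched without the pre-trained weights: compare images in a feature space rather than pixel space. Here a fixed random projection is a deliberately crude stand-in for the VGG feature extractor; everything in this snippet is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)) / 8.0       # stand-in "feature extractor"

def perceptual_dist(a, b):
    # l2 distance between feature representations, not between pixels
    return np.linalg.norm(W @ a - W @ b)

a = rng.standard_normal(64)
b = a + 0.01 * rng.standard_normal(64)        # perceptually close to a
c = rng.standard_normal(64)                   # unrelated signal
d_close = perceptual_dist(a, b)
d_far = perceptual_dist(a, c)
```

With a real pre-trained VGG, the feature space additionally encodes semantic structure, which is why this loss tracks perceived quality better than pixel-wise losses.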
For ST-GAN, the total adversarial loss is defined as

$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{x_u}[\log(1 - D(G(x_u)))],$

where $G$ and $D$ denote the generator and the discriminator for holistic image reconstruction, respectively.
For EES-GAN and TES-GAN, the total adversarial loss is defined as

$\mathcal{L}_{\mathrm{adv}} = \eta_1 \big( \mathbb{E}_{x}[\log D_I(x)] + \mathbb{E}_{x_u}[\log(1 - D_I(G(x_u)))] \big) + \eta_2 \big( \mathbb{E}_{x}[\log D_E(\phi(x))] + \mathbb{E}_{x_u}[\log(1 - D_E(\phi(G(x_u))))] \big),$

where $G$, $D_I$ and $D_E$ denote the generator, the discriminator for holistic image reconstruction and the discriminator for edge or texture information preservation, respectively. $\phi$ denotes the Sobel operator in EES-GAN and the Gabor operator in TES-GAN, and $\eta_1$ and $\eta_2$ are the coefficients that balance the two terms.
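Numerically, the dual-discriminator objective is just a weighted sum of two standard GAN log-losses, one per discriminator. The discriminator probabilities below are illustrative placeholders, not model outputs; the 0.05/0.05 weights follow the implementation details reported later.

```python
import numpy as np

def adv_term(d_real, d_fake):
    """log D(real) + log(1 - D(fake)) for one discriminator."""
    return np.log(d_real) + np.log(1.0 - d_fake)

l_holistic = adv_term(0.9, 0.2)   # holistic image discriminator
l_edge = adv_term(0.8, 0.3)       # edge (or texture) discriminator
l_adv = 0.05 * l_holistic + 0.05 * l_edge
```

Each term is maximised by the discriminator and minimised (through the fake-sample part) by the generator, so the edge/texture discriminator adds a second, feature-specific adversarial pressure on the generator.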
The total loss can be formulated as

$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{pixel}} + \beta \mathcal{L}_{\mathrm{freq}} + \gamma \mathcal{L}_{\mathrm{VGG}} + \delta \mathcal{L}_{\mathrm{adv}},$

where $\alpha$, $\beta$, $\gamma$ and $\delta$ are the coefficients that balance each term.
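Putting the pieces together, the total training loss is a weighted sum of the four terms; the default weights here are the coefficients reported in the implementation details (15, 0.1, 0.0025, 0.1), while the input loss values are illustrative placeholders.

```python
def total_loss(l_pixel, l_freq, l_vgg, l_adv,
               alpha=15.0, beta=0.1, gamma=0.0025, delta=0.1):
    """Weighted sum of pixel, frequency, perceptual and adversarial losses."""
    return alpha * l_pixel + beta * l_freq + gamma * l_vgg + delta * l_adv

loss = total_loss(0.02, 0.05, 1.3, 0.7)
```

The dominant weight on the pixel term keeps the reconstruction anchored to the ground truth, while the much smaller perceptual and adversarial weights only nudge the texture statistics.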
III Experiments and Results

III-A Dataset
Our proposed methods were trained and tested on the Calgary Campinas dataset. The dataset contains 15360 2D slices of 12-channel T1-weighted brain MR images, which were divided into training, validation and testing sets in a ratio of 5:2:3 (7680, 3072 and 4608 slices, respectively).
III-B Implementation Details
The proposed ST-GAN, EES-GAN and TES-GAN were trained on two NVIDIA RTX 3090 GPUs with 24GB GPU RAM each and tested on an NVIDIA RTX 3090 GPU or an Intel Core i9-10980XE CPU. We applied 6 RSTBs, each containing 6 STLs, in the generator, and the patch size and channel number were set to 96 and 180, respectively. The coefficients of the pixel-wise, frequency, perceptual VGG and adversarial terms in the total loss function were set to 15, 0.1, 0.0025 and 0.1, respectively. For EES-GAN and TES-GAN, the two coefficients balancing the holistic and the edge or texture adversarial terms were both set to 0.05.
III-C Evaluation Methods
In the experiment section, the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and Fréchet Inception Distance (FID) were applied for the evaluation of the different methods. PSNR is a shallow pixel-scale metric presenting the ratio between the maximum signal power and the noise power of two images. SSIM is a shallow perceptual metric measuring the structural similarity between two images. FID measures the similarity between two image sets by calculating the Fréchet distance between their features extracted by a pre-trained Inception V3 network.
PSNR and SSIM are not sufficient for measuring the visual perceptual quality of images, since the visual perceptual experience is subject to more latent relationships in a higher dimension, with which FID correlates well, making it more appropriate here.
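For reference, PSNR can be computed in a few lines; this minimal sketch assumes images normalised to a data range of 1.0.

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """PSNR in dB: ratio of the maximum signal power to the MSE."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.linspace(0.0, 1.0, 100)
noisy = ref + 0.01          # uniform error of 0.01 -> MSE of 1e-4
val = psnr(ref, noisy)      # 10 * log10(1 / 1e-4) = 40 dB
```

Because PSNR is a pure function of the MSE, two reconstructions with identical PSNR can look very different perceptually, which is precisely why FID is also reported.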
III-D Comparisons with Other Methods
In this experiment, our proposed ST-GAN, EES-GAN and TES-GAN were compared with other MRI reconstruction methods, including CNN based GAN methods, i.e., DAGAN and PIDD-GAN, and a Swin transformer based method, i.e., SwinMR, using Gaussian 1D 10% (G1D10%) and 30% (G1D30%) k-space undersampling masks.
Fig. 3 displays samples of the ground truth (GT), undersampled zero-filled images (ZF) and the reconstructed MR images with G1D10% and G1D30% masks. Fig. 4 and TABLE I show the quantitative results of the reconstructions by the different methods.
| Method | SSIM (SD) | PSNR (SD) | FID | Training Time | Inference Time |
|---|---|---|---|---|---|
| ZF | 0.749 (0.018) | 22.81 (0.73) | 156.38 | – | – |
| DAGAN | 0.782 (0.018) | 24.95 (0.73) | 56.04 | * | 0.003 |
| PIDD-GAN | 0.859 (0.020) | 26.83 (0.87) | 17.55 | * | 0.006 |
| SwinMR | 0.876 (0.022) | 27.43 (1.12) | 27.66 | 59.269 | 0.388 |
| ST-GAN | 0.848 (0.025) | 26.54 (1.11) | 12.67 | 85.921 | 0.388 |
| EES-GAN | 0.851 (0.024) | 26.64 (1.13) | 12.95 | 114.028 | 0.388 |
| TES-GAN | 0.854 (0.023) | 26.73 (1.11) | 11.83 | 114.703 | 0.388 |
The adversarially trained transformers, i.e., ST-GAN, EES-GAN and TES-GAN, can produce richer edge and texture details in the reconstructed images, particularly for G1D10% zero-filled images (Fig. 3). However, these enhanced texture details can be wrongly represented (Fig. 3 (B)). Fig. 3 also shows that the extra discriminator in the dual-discriminator GANs (EES-GAN and TES-GAN) only marginally improved the perceptual quality of the reconstruction.
SwinMR achieved the highest SSIM and PSNR, whereas the SSIMs and PSNRs of the results reconstructed by all the transformer based GANs (ST-GAN, EES-GAN, TES-GAN) fell slightly behind (Fig. 4). However, when assessed by the visual perceptual metric FID, TES-GAN achieved the best score, followed by the other transformer based GANs and the non-GAN model SwinMR.
The training time in TABLE I refers to the model training time for 100 steps on two NVIDIA 3090 GPUs with 24GB GPU RAM, and the inference time refers to the model inference time for one image on an NVIDIA 3090 GPU with 24GB GPU RAM. The dual-discriminator GAN based transformers, i.e., EES-GAN and TES-GAN, had the longest training times, followed by the single-discriminator GAN based transformer ST-GAN.
In this work, we have assessed the performance of transformer models for fast MRI reconstruction. The standalone Swin transformer model (SwinMR) and its GAN based variants, i.e., ST-GAN (standard GAN structure) and EES-GAN and TES-GAN (dual-discriminator structures with edge and texture enhancement), have been evaluated against other CNN based GANs (DAGAN and PIDD-GAN). Experimental results have shown that the transformer based models achieved the best overall performance, with SwinMR outperforming the CNN based methods in SSIM and PSNR.
To further explore the capabilities of transformers for MRI reconstruction, we coupled the Swin transformer with a U-Net based adversarial discriminator, forming ST-GAN. To further enhance edges and textures, an additional U-Net discriminator using edge information extracted by the Sobel operator or texture information extracted by the Gabor operator was appended to ST-GAN, forming EES-GAN or TES-GAN, respectively.
We initially hypothesised that the adversarial training in ST-GAN, EES-GAN and TES-GAN could further improve the reconstruction quality beyond SwinMR. The reconstruction quality was then assessed by the PSNR and SSIM metrics, depicting pixel-to-pixel and structural similarities, and by the FID score for the higher-dimensional visual perceptual experience. Although richer textures and edges seemed to have been restored by these transformer based GANs (Fig. 3 (B)), with lower FID scores (TABLE I) and much lower converged perceptual VGG losses in training (Fig. 5), these models exhibited lower PSNR and SSIM scores than the standalone SwinMR (Fig. 4). This means that although the models may give a better visual perceptual experience in terms of texture (illustrated by FID and the converged perceptual VGG loss in training), the reconstructed images exhibited differences when compared pixel-to-pixel (illustrated by PSNR and SSIM).
Such a paradox may manifest as flaws in the reconstructed images. Although all the GAN generated images in Fig. 3 (B) seemed clear, comparison with the ground truth reveals hallucinated brain structural textures added to the images. This suggests that these GAN based reconstruction methods could have lower specificity when reconstructing brain MRI, particularly when the acquisition is greatly undersampled at only a 10% sampling rate (Fig. 3 (B)). Previous studies have shown that, different from the $\ell_1$ or $\ell_2$ loss, which focuses on the reconstruction of global low-frequency structures, the adversarial loss in GANs focuses on generating high-frequency details [5, 20, 10]. This further explains the clear yet incorrect high-frequency textures generated when the MRI is greatly undersampled (e.g., using G1D10%).
In terms of computational complexity, the additional edge/texture enhancing adversarial discriminators in EES-GAN and TES-GAN greatly increased the training times while keeping similar inference times (TABLE I), yet brought little improvement in the quantitative metrics (TABLE I and Fig. 4) or in the converged training losses (Fig. 5). Therefore, overall we recommend ST-GAN for undersampled MRI reconstruction at G1D30%, because the model enhances textures while maintaining its specificity. At lower sampling rates (e.g., 10%), these GAN based transformer models may introduce misleading structural changes in the images, where SwinMR is a better choice.
Our study has explored the potential of transformers and their GAN variants for fast MRI. We have proposed the standard GAN coupled transformer ST-GAN and the dual-discriminator GAN based transformers EES-GAN and TES-GAN for edge and texture enhancement. We have assessed their performance with the shallow metrics PSNR and SSIM and the higher-dimensional visual perceptual metric FID. Upon comparison, we recommend ST-GAN for the reconstruction of undersampled MRI with sampling rates higher than 30%, due to its capability of edge and texture enhancement. At lower sampling rates (e.g., 10%), these GAN based transformer models may give misleading texture enhancement with a lower specificity, where SwinMR is a better choice. In summary, transformer based models have shown great performance in fast MRI, and we envisage further development for clinically specific problems and combination with prior information from MR physics.
- (2022) AI-based reconstruction for fast MRI – a systematic review and meta-analysis. Proceedings of the IEEE, arXiv preprint arXiv:2112.12744.
- (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30.
- (2021) Edge-enhanced dual discriminator generative adversarial network for fast MRI with parallel imaging using multi-view information. Applied Intelligence, arXiv preprint arXiv:2112.05758.
- (2022) Swin transformer for fast MRI. arXiv preprint arXiv:2201.03230.
- (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.
- Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (11), pp. 2599–2613.
- SwinIR: image restoration using Swin transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844.
- (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
- (2021) Transfer learning enhanced generative adversarial networks for multi-channel MRI reconstruction. Computers in Biology and Medicine, pp. 104504.
- (2021) PIC-GAN: a parallel imaging coupled generative adversarial network for accelerated multi-channel MRI reconstruction. Diagnostics 11 (1), pp. 61.
- (2021) Is it time to replace CNNs with transformers for medical images? arXiv preprint arXiv:2108.09038.
- International Conference on Machine Learning, pp. 4055–4064.
- Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 295–303.
- (2020) A U-Net based discriminator for generative adversarial networks. arXiv preprint arXiv:2002.12655.
- (2018) An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement. NeuroImage 170, pp. 482–494.
- (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
- (2016) Accelerating magnetic resonance imaging via deep learning. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 514–517.
- (2021) Generative adversarial networks (GAN) powered fast magnetic resonance imaging – mini review, comparison and perspectives. arXiv preprint arXiv:2105.01800.
- (2018) DAGAN: deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Transactions on Medical Imaging 37, pp. 1310–1321.
- (2020) SARA-GAN: self-attention and relative average discriminator based generative adversarial networks for fast compressed sensing MRI reconstruction. Frontiers in Neuroinformatics 14.
- (2018) fastMRI: an open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839.
- The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.