In this paper, we apply elements of Meta-Learning to Generative Adversarial Networks (GANs) to tackle Super-Resolution in medical images, specifically brain MRI images. We are the first to apply such a scale-free Super-Resolution technique to these images. Super-Resolution is the task of increasing the resolution of images. It can help radiological centres in rural areas, which may lack high-fidelity instruments, achieve diagnostic results comparable to those of their better-equipped counterparts in the city. The importance of Super-Resolution is growing as cross-modality analysis requires combining different types of information (such as PET and NMR scans) of varying resolution.
Traditionally, Super-Resolution has been done using interpolation, such as bicubic interpolation, but recent attempts use deep learning methods to extract high-level information from data, which can supply the additional information needed to increase the resolution of the image. Current deep learning techniques for medical images rely heavily on Generative Adversarial Networks (GANs) because they use a different loss function that yields high perceptual quality, which lets them generate realistic and sharper images [GANsinmedicalimaging]. For example, mDCSRN [GANsinmedicalimaging], Lesion-focussed GAN [GANsforSR_1] and ESRGAN [GANsforSR_2] are all GAN-based solutions tackling Super-Resolution for medical images. Most of them are based on SRGAN, a GAN-based Super-Resolution network developed by Ledig et al. [DBLP:journals/corr/LedigTHCATTWS16]. We therefore use SRGAN as the foundation to which we apply our modification.
Despite their better performance, almost all of these networks rely on the sub-pixel convolution layer introduced by Shi et al. [DBLP:journals/corr/ShiCHTABRW16], which ties a particular upscaling factor to a particular network architecture. This incurs high storage and energy costs for medical professionals who may wish to conduct Super-Resolution at different scaling factors, since a separate network is needed for each factor. In this paper, we combine the Meta-Upscale Module introduced by Hu et al. [DBLP:journals/corr/abs-1903-00875] with SRGAN to create a novel network lovingly termed Meta-SRGAN. This breaks the constraint imposed by Shi et al.'s layer: Meta-SRGAN can tackle Super-Resolution at any scale, even non-integer ones, and hence can reduce storage and energy costs and lay the foundations for real-time Super-Resolution. We first show that Meta-SRGAN outperforms the baseline of bicubic interpolation on the BraTS dataset [BraTSCite1, BraTSCite2, BraTSCite3], and also show that Meta-SRGAN performs similarly to SRGAN while being able to super-resolve images at arbitrary scales. We also compare the memory footprint and show that Meta-SRGAN is 98 smaller than EDSR, a state-of-the-art Super-Resolution technique, yet achieves similar performance.
The rest of the paper is organised as follows. Background introduces Super-Resolution formally and illustrates how Hu et al. reframe the problem. We then introduce the architecture of Meta-SRGAN in Methods. Experiments elaborates on dataset preparation and training details. Finally, we show the outcomes of the experiments in Results.
Deep learning-based Super-Resolution techniques are generally supervised learning algorithms. They analyse relationships between the low-resolution (LR) image and its corresponding high-resolution (HR) image, and use this relationship to obtain the super-resolved (SR) image. This SR image is then evaluated against the HR image to see how well the algorithm is performing. Any algorithm that is capable of extracting these relationships can be used to do Super-Resolution. Dong et al. [DBLP:journals/corr/DongLHT15] showed that any such algorithm can be thought of as a convolutional neural network (CNN). Figure 1 shows a typical high-level recipe for a Super-Resolution task.
1.1.2 Meta-Upscale Module
The Upsampling module in the network is often the efficient sub-pixel convolution layer proposed by Shi et al. [DBLP:journals/corr/ShiCHTABRW16]. Instead of explicitly enlarging feature maps, it expands the channels of the output features to store the extra points needed to increase resolution. It then rearranges these points to obtain the super-resolved output image. Almost every network tackling Super-Resolution uses this Upsampling module. An example is EDSR [DBLP:journals/corr/LimSKNL17], a convolutional neural network tackling Super-Resolution. Hu et al. then proposed replacing the Upsampling module with a Meta-Upscale Module [DBLP:journals/corr/abs-1903-00875]. This Meta-Upscale Module contains a Weight Prediction Network and is an example of Meta-Learning, as it generalises a network to tackle more than one upscaling factor. Essentially, given a few training examples (a small number of images from some scales), we train a network to tackle arbitrary scale factors. The Weight Prediction Network predicts weights for different upscaling factors. This extra layer of abstraction enables the underlying neural network to generalise across tasks. The framework proposed by Hu et al. for enabling arbitrary scale Super-Resolution is illustrated in Figure 2.
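To make the channel-to-space rearrangement concrete, the following is a minimal sketch in plain Python (nested lists, no external libraries). The function name `pixel_shuffle` and the channel ordering are assumptions chosen for illustration, not code from Shi et al. Note that the integer factor `r` is baked into the number of input channels, which is why this layer ties a network to a single upscaling factor.

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in = len(features)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(features[0]), len(features[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # Input channel that stores the sub-pixel at offset (y % r, x % r).
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


# Four 1x2 feature maps become one 2x4 image for r = 2.
lr_features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
sr = pixel_shuffle(lr_features, 2)
```

Changing the scale here would require a different number of input channels, and therefore a different preceding layer.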
Hu et al. reframe Super-Resolution as a matrix multiplication problem:
$$I^{SR} = W I^{LR} + b$$
where $I^{SR}$ is the super-resolved image, $W$ is some matrix of weights we learn, $I^{LR}$ is the input low-resolution image, and $b$ is some constant. This may seem like an oversimplification and in many ways it is (for example, it does not take into account how pixels close to each other tend to exert more influence on each other), but it gives a rather intuitive understanding of why a separate weight prediction network works for arbitrary scale Super-Resolution. Consider $I^{LR}$ as an $n \times 1$ vector, and $W$ as a $2n \times n$ matrix. The output $I^{SR}$ is therefore a $2n \times 1$ vector. In some sense we have doubled the size of $I^{LR}$. We can think of $W$ as providing an upscaling of $I^{LR}$. The idea proposed by Hu et al. is to predict $W$ for every upscaling factor. This is done by passing the input dimensions to an external function to create the shape of $W$, before using a weight prediction network to predict the values of $W$, as in the equation below:
$$W = \varphi(I^{LR}; \theta)$$
where $W$ is the matrix of weights, $\varphi$ is the weight prediction network with parameters $\theta$, and $I^{LR}$ is the input low-resolution image.
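The matrix view can be illustrated with a small sketch in plain Python. Here the weight matrix is not predicted by a network; as a loudly labelled assumption, the hypothetical helper `nearest_weights` builds fixed nearest-neighbour weights so that the only point shown is how the shape of the weight matrix, and hence the output size, follows from the requested scale, including a non-integer one.

```python
def nearest_weights(n_in, scale):
    """Build an (n_out x n_in) matrix whose rows copy the nearest input entry.

    Hypothetical stand-in for the weights a weight prediction network would output.
    """
    n_out = int(n_in * scale)
    weights = []
    for i in range(n_out):
        src = min(int(i / scale), n_in - 1)
        weights.append([1.0 if j == src else 0.0 for j in range(n_in)])
    return weights


def upscale(x, scale, b=0.0):
    """Compute W x + b, with W chosen according to the requested scale."""
    weights = nearest_weights(len(x), scale)
    return [sum(w * v for w, v in zip(row, x)) + b for row in weights]


lr = [10.0, 20.0]
doubled = upscale(lr, 2)      # W is 4 x 2
stretched = upscale(lr, 1.5)  # W is 3 x 2, a non-integer scale
```

A learned weight prediction network would replace `nearest_weights` while keeping the same dependence of the matrix shape on the scale.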
The architecture of the Meta-Upscale Module is shown in Figure 3. It includes a weight prediction network that outputs weights, which are then matrix-multiplied with the input features to obtain the super-resolved image.
The Weight Prediction Network consists of three layers, and is trained alongside the main network. Of significant importance is the shape of the input into the Weight Prediction Network. It must contain information relating to the shape of the high-resolution image, as that determines the size of the weight matrix that is multiplied with the input features fed into the Module. Additional details can be found in [DBLP:journals/corr/abs-1903-00875].
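As a rough sketch of what the input to the Weight Prediction Network might look like, the snippet below builds one row per high-resolution pixel. It assumes the formulation described by Hu et al., in which each row holds the fractional offsets of that pixel relative to the low-resolution grid together with the inverse scale; the function name `position_matrix` is hypothetical, and the network that would consume these rows is omitted.

```python
import math


def position_matrix(lr_h, lr_w, scale):
    """Return the HR shape and one (offset_i, offset_j, 1/scale) row per HR pixel."""
    hr_h, hr_w = math.floor(lr_h * scale), math.floor(lr_w * scale)
    rows = []
    for i in range(hr_h):
        for j in range(hr_w):
            # Fractional position of HR pixel (i, j) within its source LR pixel.
            off_i = i / scale - math.floor(i / scale)
            off_j = j / scale - math.floor(j / scale)
            rows.append((off_i, off_j, 1.0 / scale))
    return (hr_h, hr_w), rows


shape_x2, rows_x2 = position_matrix(2, 2, 2.0)
shape_x15, rows_x15 = position_matrix(2, 2, 1.5)
```

The number of rows equals the number of high-resolution pixels, which is how the shape of the high-resolution image enters the module.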
We used Bicubic Interpolation as a baseline, and implemented EDSR as a state-of-the-art technique against which we compare Meta-SRGAN. We also wanted to investigate how using the Meta-Upscale Module in a GAN affects its performance, so we implemented SRGAN too. EDSR and SRGAN were trained on x2 upsampling tasks, and were trained on DIV2K for 160k updates before transferring their learning onto the BraTS dataset for an additional 160k updates. EDSR was trained with 256 features, 32 residual blocks, and a residual scaling of 0.1 (see [DBLP:journals/corr/LimSKNL17] for more information). EDSR was trained using L1 Loss, whilst SRGAN was trained using the same combination of losses used to train Meta-SRGAN.
To reap the benefits of both GANs and meta-learning, we combine Ledig et al.’s SRGAN with Hu et al.’s Meta-Upscale Module to obtain a new architecture called Meta-SRGAN. The nature of a GAN enables the generator to generate realistic images. The ability to predict weights for different upscaling factors enables the network to tackle arbitrary scales. Combined, they result in a network that is both hugely powerful and highly generalisable.
There are two networks being trained in a Generative Adversarial Network. The generator generates an image (in this case it super-resolves an image), which is then fed into the discriminator. The discriminator discriminates between a real and a generated image (in this case it determines whether the image presented is the ground-truth high-resolution image or not). This feedback from the discriminator is then used to further train the generator. The generator and discriminator play a game to outdo each other: the generator tries to improve its ability to generate images that can fool the discriminator, whilst the discriminator tries to improve its ability to discern real images from generated images. This adversarial learning enables us to generate more realistic and detailed images. The architectures of the generator and discriminator are shown in Figures 4 and 5 respectively.
The generator produces super-resolved images; the discriminator takes each image and classifies it as either a generated image or a real image.
refers to the ReLU activation layer. The number 64 refers to the number of kernels, and can be modified like any hyperparameter.
Residual Blocks are used in the generator to stabilise training by incorporating a skip connection that adds each block's input back onto its output. Parametric ReLU was used as it was empirically found to stabilise training [DBLP:journals/corr/LedigTHCATTWS16].
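The two ingredients mentioned here can be sketched on a toy vector in plain Python. This is only an illustration of the skip connection and of Parametric ReLU; the `transform` argument is a hypothetical stand-in for the convolutional layers inside a real block, and the slope `a` would be learned rather than fixed.

```python
def prelu(x, a):
    """Parametric ReLU: identity for positive inputs, slope `a` for negative ones."""
    return x if x > 0 else a * x


def residual_block(x, transform):
    """Add the block's input back onto the transformed output (skip connection)."""
    return [xi + ti for xi, ti in zip(x, transform(x))]


def toy_transform(values):
    # Hypothetical stand-in for the convolution layers inside a block.
    return [prelu(0.5 * v, a=0.25) for v in values]


block_output = residual_block([2.0, -4.0], toy_transform)
```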
The discriminator consists of several Leaky ReLU layers and an average pooling layer. These were empirically found to stabilise training [DBLP:journals/corr/LedigTHCATTWS16]. A sigmoid layer was used to facilitate binary classification.
2.2.3 L1 Loss
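L1 Loss is the typical loss function used by other networks; it calculates pixel-wise errors. We incorporate it to train the generator in Meta-SRGAN. We denote an input image by $I$, and a neural network and its parameters by $G_{\theta_G}$. Letting $i, j$ be just indices, $I^{LR}$ be the low-resolution input image and $I^{HR}$ be the high-resolution target image, $w, h$ be the dimensions of an image, and $\|\cdot\|_F$ be the Frobenius norm (which for a single pixel value reduces to the absolute value), L1 Loss is defined as
$$\mathcal{L}_{1} = \frac{1}{wh} \sum_{i=1}^{w} \sum_{j=1}^{h} \left\| G_{\theta_G}(I^{LR})_{i,j} - I^{HR}_{i,j} \right\|_F$$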
L1 Loss minimises the mean absolute error between pixels and helps Meta-SRGAN match its output as closely as possible to the high-resolution target.
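As a small worked example of the mean absolute error described above, the following plain-Python sketch computes it for two single-channel images stored as nested lists. The function name `l1_loss` and the toy pixel values are illustrative assumptions.

```python
def l1_loss(sr, hr):
    """Mean absolute error between two equally sized single-channel images."""
    h, w = len(hr), len(hr[0])
    total = sum(abs(sr[i][j] - hr[i][j]) for i in range(h) for j in range(w))
    return total / (h * w)


sr_image = [[0.0, 2.0], [4.0, 6.0]]
hr_image = [[1.0, 2.0], [4.0, 8.0]]
loss_value = l1_loss(sr_image, hr_image)  # (1 + 0 + 0 + 2) / 4
```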
2.2.4 Adversarial Loss
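The generator tries to minimise the following function whilst the discriminator tries to maximise it. $D(I^{HR})$ is the discriminator’s output that the high-resolution image is high-resolution, $G_{\theta_G}(I^{LR})$ is the super-resolved image produced by the generator, and $D(G_{\theta_G}(I^{LR}))$ is the discriminator’s output that the super-resolved image is high-resolution.
$$\mathcal{L}_{adv} = \mathbb{E}_{I^{HR}}\left[\log D(I^{HR})\right] + \mathbb{E}_{I^{LR}}\left[\log\left(1 - D(G_{\theta_G}(I^{LR}))\right)\right]$$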
The above function is termed Adversarial Loss, and it allows Meta-SRGAN to achieve realistic super-resolved images.
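The opposing goals of the two networks can be checked numerically with a short sketch. It assumes the standard GAN value function (log-probability on real images plus log of one minus the probability on generated images), averaged over a batch of discriminator outputs; the function name and the sample probabilities are illustrative.

```python
import math


def adversarial_value(d_real, d_fake, eps=1e-12):
    """Average log D(real) plus average log(1 - D(fake)) over a batch."""
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator is confident and correct: high value (good for the discriminator).
sharp_discriminator = adversarial_value([0.9, 0.9], [0.1, 0.1])
# Generator fools the discriminator: low value (good for the generator).
fooled_discriminator = adversarial_value([0.9, 0.9], [0.9, 0.9])
```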
2.2.5 Perceptual Loss
Meta-SRGAN also incorporates Perceptual Loss, a type of loss introduced by Johnson et al. to improve performance on Super-Resolution tasks [JohnsonAL16]. The idea is to encourage the generated image to have feature representations, as computed by a separate network, similar to those of the target image. We use a 19-layer VGG network [simonyan2014deep] pretrained on the ImageNet dataset [RussakovskyDSKSMHKKBBF14], and we extract the features before the last layer of the network. Letting $I^{SR}$ and $I^{HR}$ be the super-resolved and high-resolution images respectively, $\phi$ be the network that extracts their feature representations, and assuming that the feature representations are of shape $c \times h_\phi \times w_\phi$, we define the Perceptual Loss as the (squared, normalised) Euclidean distance between the feature representations:
$$\mathcal{L}_{perc} = \frac{1}{c\, h_\phi w_\phi} \left\| \phi(I^{SR}) - \phi(I^{HR}) \right\|_2^2$$
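To illustrate why comparing images in feature space differs from comparing pixels, the sketch below uses a deliberately tiny, hypothetical feature extractor (horizontal pixel differences) in place of the VGG network, and assumes the loss is the mean squared difference between feature maps. None of the names below come from the original implementation.

```python
def toy_features(image):
    """Hypothetical stand-in for VGG features: horizontal pixel differences."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in image]


def perceptual_loss(sr, hr, extract=toy_features):
    """Mean squared difference between the feature maps of two images."""
    f_sr, f_hr = extract(sr), extract(hr)
    diffs = [(a - b) ** 2 for ra, rb in zip(f_sr, f_hr) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


hr_img = [[0.0, 1.0, 3.0], [2.0, 2.0, 5.0]]
brighter = [[v + 1.0 for v in row] for row in hr_img]  # same structure, offset
flat = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]              # structure removed

loss_brighter = perceptual_loss(brighter, hr_img)
loss_flat = perceptual_loss(flat, hr_img)
```

Under this toy extractor a uniformly brighter image incurs no loss, while an image with its edges removed is penalised, which is the kind of structure-aware behaviour a feature-space loss is meant to provide.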
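In the following experiments we combined the loss functions as a weighted sum,
$$\mathcal{L} = \lambda_{1} \mathcal{L}_{1} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{perc} \mathcal{L}_{perc}$$
where $\mathcal{L}_{1}$, $\mathcal{L}_{adv}$ and $\mathcal{L}_{perc}$ are the L1, Adversarial and Perceptual Losses and the $\lambda$ terms are scalar weights, and used it to train Meta-SRGAN.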
We performed experiments on the Multimodal Brain Tumor Segmentation (BraTS) dataset [BraTSCite1, BraTSCite2, BraTSCite3]. We first trained Meta-SRGAN on the DIV2K dataset [Agustsson_2017_CVPR_Workshops] - a dataset curated for Super-Resolution tasks, and then transferred that learning onto the BraTS dataset.
3.1 Datasets and Data Preparation
To demonstrate how well Meta-SRGAN performs on medical images, we trained and tested Meta-SRGAN on the Multimodal Brain Tumor Segmentation (BraTS) dataset [BraTSCite1, BraTSCite2, BraTSCite3]. This dataset contains several versions, including t1, t1ce, and t2 brain MRI scans, and serves as a good proxy for medical images. We picked the 2D t1ce version as it was two dimensional and was simple to work with. There were training images and
validation images. Training images were normalised using a mean and standard deviation ofand respectively before they were passed to the network to be trained on. These values were calculated using pixel values from the training images, after the maximal informational crop was applied.
The BraTS dataset is tricky to deal with because every image has sparse information. Only about of each image contained useful information (i.e. the brain); the rest was black. To tackle this problem, we cropped each image to the region with the most pixel information. This corresponded to crops of the brain. From this crop, we then randomly sampled by patches to provide sufficient coverage per image. For an upscaling factor of 2, the low-resolution input image (LR) was of size by whilst the high-resolution target image (HR) and super-resolved output image (SR) were of size by . This workflow is summarised in Figure 6.
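The cropping and patch-sampling workflow can be sketched as follows in plain Python. As an assumption, "most pixel information" is interpreted here as the fixed-size window containing the largest number of non-zero pixels; the function names, window sizes, and the toy image are all hypothetical, and the brute-force search is meant for clarity rather than speed.

```python
import random


def max_information_crop(image, crop_h, crop_w):
    """Return the crop_h x crop_w window with the most non-zero pixels."""
    h, w = len(image), len(image[0])
    best_count, best_pos = -1, (0, 0)
    for top in range(h - crop_h + 1):
        for left in range(w - crop_w + 1):
            count = sum(
                1
                for i in range(top, top + crop_h)
                for j in range(left, left + crop_w)
                if image[i][j] > 0
            )
            if count > best_count:
                best_count, best_pos = count, (top, left)
    top, left = best_pos
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_patch(crop, size, rng):
    """Sample a size x size patch uniformly at random from within the crop."""
    top = rng.randrange(len(crop) - size + 1)
    left = rng.randrange(len(crop[0]) - size + 1)
    return [row[left:left + size] for row in crop[top:top + size]]


# Mostly black toy image with a bright 2x2 region.
toy_image = [
    [0, 0, 0, 0],
    [0, 0, 5, 7],
    [0, 0, 6, 8],
    [0, 0, 0, 0],
]
crop = max_information_crop(toy_image, 2, 2)
patch = random_patch(crop, 1, random.Random(0))
```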
3.2 Training Parameters and Experiment Settings
We first trained Meta-SRGAN on the DIV2K dataset for 360k updates, with a mini-batch size of 8. We then did transfer learning and trained Meta-SRGAN on the BraTS dataset for an additional 360k updates. We trained Meta-SRGAN on a range of scales from 1.1 to 4.0 (with 0.1 increments). Higher upscaling factors were not chosen as that would have made the input image too small to be useful. Each mini-batch of images was associated with a scale, and each scale was chosen uniformly at random from 1.1 to 4.0. We optimised the network using a combination of L1 Loss, Adversarial Loss, and Perceptual Loss as described in the previous section. We used an Adam optimiser with a learning rate of on each of the generator and discriminator, and decreased the learning rate by every 60k updates. Pixel values were clamped to between 0 and 255 to give the generator an edge. The network was trained on an NVIDIA TITAN X GPU and took three days. We calculated the PSNR and SSIM scores only on the luminance channel of the maximal information crops.
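The per-mini-batch scale sampling and the stepped learning-rate schedule can be sketched as below. The grid of scales and the 60k-update step come from the description above; the base learning rate and decay factor passed to `step_lr` are placeholder values chosen purely for illustration, not the values used in our experiments, and the function names are hypothetical.

```python
import random

# Scales 1.1, 1.2, ..., 4.0 in 0.1 increments.
SCALES = [k / 10 for k in range(11, 41)]


def sample_scale(rng):
    """Pick one upscaling factor uniformly at random for a mini-batch."""
    return rng.choice(SCALES)


def step_lr(base_lr, decay, update, step=60_000):
    """Multiply the learning rate by `decay` once every `step` updates."""
    return base_lr * decay ** (update // step)


rng = random.Random(0)
batch_scale = sample_scale(rng)
# Placeholder values: base_lr=1.0 and decay=0.5 are illustrative only.
lr_after_two_steps = step_lr(1.0, 0.5, 120_000)
```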
The results of our experiments are tabulated in Table 1. We see that Meta-SRGAN clearly outperformed the baseline of Bicubic Interpolation, which suggests that the inclusion of the Meta-Upscale Module did not hinder the network’s performance. This is further reinforced by the fact that Meta-SRGAN achieved similar performance on the x2 upsampling task to SRGAN, the architecture it was based on. It also had the added benefit of fewer parameters than SRGAN. Visual quality is indicated by a proxy metric, the Structural Similarity Index Metric (SSIM), which measures the perceived similarity between two images. The high SSIM scores achieved by the GANs bode well for brain MRI imaging, which demands accurate images. We also see that Meta-SRGAN has the lowest number of parameters, which means its memory footprint is small (98 smaller than EDSR) and suggests that it can be deployed easily.
Table 1: x2 Upsampling Task, BraTS Testing Set. Columns report PSNR (dB) and SSIM for each method.
Sampled images are shown in Figure 7. The higher SSIM scores of SRGAN and Meta-SRGAN compared with the baseline of Bicubic Interpolation highlight that they generate perceptually better images than the baseline. The fact that Meta-SRGAN achieves scores comparable to EDSR despite being trained on a whole range of scales, and with considerably fewer parameters, attests not just to the power of adversarial training, but also to the robustness of the Meta-Upscale Module.
Meta-SRGAN was also able to produce a sequence of images corresponding to different upscaling factors. This is shown in Figure 8. The relatively consistent PSNR and SSIM scores indicate that performance across different scales is not compromised. All these results suggest that we have a memory-efficient network capable of generating super-resolved brain MRI images of high visual quality at any scale.
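For reference, PSNR can be computed from the mean squared error between the super-resolved and high-resolution images. The sketch below uses the standard definition with a peak value of 255 on single-channel nested lists; the function name and toy images are illustrative, and SSIM is omitted because it needs local window statistics.

```python
import math


def psnr(sr, hr, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    n = len(hr) * len(hr[0])
    mse = sum((a - b) ** 2 for rs, rh in zip(sr, hr) for a, b in zip(rs, rh)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


target = [[0.0, 255.0], [255.0, 0.0]]
one_wrong = [[255.0, 255.0], [255.0, 0.0]]  # a single pixel off by the full range
score = psnr(one_wrong, target)
```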
With better visual quality than the baseline, lower memory footprint, and ability to tackle any upscaling factor with negligible loss in performance, Meta-SRGAN clearly can deliver comparable performance to state-of-the-art Super-Resolution techniques.
The implications of the results on the BraTS dataset are three-fold. The low memory footprint of Meta-SRGAN enables it to be wrapped as a helper tool to aid medical tasks such as MRI and endoscope videography. The ability to super-resolve images at any arbitrary scale allows one to easily extend this to real-time zooming and enhancing of an image, which will be useful in medical screening and surgical monitoring. The fact that its performance is not hindered across multiple upscaling factors affords it the flexibility to be used in other tasks such as segmentation, de-noising, and registration. Although we only tested it on brain MRI images, Meta-SRGAN can easily be extended to datasets of other modalities such as Ultrasound and CT scans. Its low memory footprint, high visual quality, and generalisability suggest that Meta-SRGAN can become a new foundation on which other architectures can be built to enhance performance on medical images.
We have built a network that combines the ability to generate images of high visual quality with the ability to tackle arbitrary scales, and are the first to show that this does not compromise performance or memory footprint for brain MRI images. Unlike other state-of-the-art methods, our method works at arbitrary scales, so only a single network is required to perform Super-Resolution at any upscaling factor. In future work we hope to enhance the performance of Meta-SRGAN and apply it to other modalities.
Jin Zhu’s PhD research is funded by China Scholarship Council (grant No. 201708060173), whilst Pietro Lio’ is supported by the GO-DS21 EU grant proposal.