Whole-body magnetic resonance imaging (wbMRI) is an essential part of well-established cancer screening protocols [Villani:2016]. These protocols were shown to improve early detection of cancer for both adult [attariwala2013whole] and pediatric [greer2017pediatric] patients. Machine learning methods have been successfully applied in staging adult cancer patients from wbMRIs [lavdas2019machine]. The same task is much more challenging for pediatric patients due to i) varying bone signals during growth, ii) the movement and limited compliance of young children during imaging, and iii) the rarity of positive cases. The lack of training data suggests the need for alternatives to standard CNN-based [10.1371/journal.pmed.1002699] approaches or augmentation-based detectors.
Generative models, such as generative adversarial networks (GANs), have shown promise for anomaly detection in numerous medical imaging applications [yi2019generative]. Given the need for an automated pediatric wbMRI cancer screening tool, we set out to study different generative models for the primary task of generating pediatric wbMRIs. We limited our study to evaluating the generation of images. The quality of the generated images can be seen as a measure of how well the model has captured the underlying data distribution, which is essential to the eventual downstream task of cancer screening (anomaly detection). We applied these models to 360 wbMRI slices from The Hospital for Sick Children in Toronto. We trained multiple GAN architectures and used the Fréchet Inception Distance (FID), the Domain Fréchet Distance (DFD), and blind tests with a radiologist to evaluate the image quality of each model.
We demonstrate that StyleGAN2 generates the best quality images and that DFD is a promising metric for comparing image quality. We also present preliminary results on the task of anomaly detection. Our analyses characterize the use of generative models for medical image generation and potential downstream tasks such as anomaly (cancer) detection, contributing to the much-needed advances in pediatric medical imaging.
Our dataset comprises 90 de-identified healthy patients from a pediatric hospital, including males and females aged 4 to 18. Four anatomically similar middle slices were selected from each volume. Each slice was preprocessed using N4ITK bias field correction [tustison2010n4itk], contrast-limited adaptive histogram equalization [kaur2016mri], and noise reduction [senthilkumaran2014histogram]. We cropped and padded the images to a uniform size of 800×256 to register the position of different patients.
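The crop-and-pad step can be illustrated with a minimal sketch. The text does not state how the crop window or the padding was chosen, so the centered crop, the zero padding, and the reading of the target size as 800 rows by 256 columns are assumptions made here for illustration; `crop_or_pad` is a hypothetical helper name.

```python
import numpy as np


def crop_or_pad(image, target_shape=(800, 256)):
    """Center-crop or zero-pad a 2D slice to target_shape (rows, cols).

    Assumption: centered cropping and zero padding; the paper does not
    specify either choice.
    """
    out = np.zeros(target_shape, dtype=image.dtype)
    src, dst = [], []
    for size, target in zip(image.shape, target_shape):
        if size >= target:
            # Axis is too large: keep the central `target` pixels.
            start = (size - target) // 2
            src.append(slice(start, start + target))
            dst.append(slice(0, target))
        else:
            # Axis is too small: place the slice in the middle of the canvas.
            start = (target - size) // 2
            src.append(slice(0, size))
            dst.append(slice(start, start + size))
    out[tuple(dst)] = image[tuple(src)]
    return out
```

Each axis is handled independently, so a slice that is too tall but too narrow is cropped vertically and padded horizontally in a single call.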
We trained (Appendix A) four different generative models: DCGAN [radford2015unsupervised], StyleGAN [DBLP:journals/corr/abs-1812-04948], StyleGAN with progressive training (PGStyleGAN) [DBLP:journals/corr/abs-1710-10196], and StyleGAN2 [karras2019analyzing]. For evaluation, we measured the FID [DBLP:journals/corr/HeuselRUNKH17] and the DFD in the feature space of a Variational Autoencoder (VAE) trained on the same dataset, following [1803.07474]. For the blind tests with our radiology fellow, we randomly chose 10 real images and 10 generated images from each model. We then showed the radiologist the images in random order and asked them to classify each image as real or fake (generated).
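FID and DFD share the same underlying quantity, the Fréchet distance between two Gaussians fitted to real and generated features; they differ only in the feature extractor (Inception v3 for FID, the VAE for DFD). The sketch below shows that shared computation with NumPy. It assumes the feature arrays have already been extracted, and `frechet_distance` is an illustrative name rather than the implementation used in our experiments.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 (cov_a cov_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # The product of two PSD matrices has real, non-negative eigenvalues,
    # so the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Computing the trace term from eigenvalues avoids a dense matrix square root; clipping guards against small negative values caused by floating-point error.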
Finally, we performed anomaly detection using a GAN trained on healthy images [1703.05921]. Given a query image, we find the closest generated image and subtract the two images to highlight areas of high disease probability (fig:example2). Cancer tumours are simulated by drawing a set of circles with varying pixel intensities and radii around a point on the image. In future work, we plan to acquire and use real cancer images instead of simulated tumours. We compared the accuracy of our anomaly detection to watershed segmentation [mustaqeem2012efficient], which is traditionally used in low-data settings because its performance does not depend on the amount of data. Watershed segmentation is not the state of the art in classical image segmentation, but we selected it because it is commonly used, fast, and requires few resources.
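The tumour simulation and the subtraction step can be sketched as follows. The number of circles, the offset range, the radius range, and the intensity range are hypothetical parameters chosen for illustration, and the closest generated image is assumed to be given; the search for it, as described in [1703.05921], is not reproduced here.

```python
import numpy as np


def add_simulated_lesion(image, center, rng, n_circles=5, max_offset=6,
                         radius_range=(3, 10), intensity_range=(0.6, 1.0)):
    """Draw filled circles of varying radius and intensity around `center`.

    All default parameter values are illustrative assumptions.
    Returns the modified image and a boolean mask of the lesion pixels.
    """
    out = image.astype(np.float32).copy()
    rows, cols = np.ogrid[:out.shape[0], :out.shape[1]]
    mask = np.zeros(out.shape, dtype=bool)
    for _ in range(n_circles):
        # Jitter each circle's centre around the chosen point.
        cy = center[0] + rng.integers(-max_offset, max_offset + 1)
        cx = center[1] + rng.integers(-max_offset, max_offset + 1)
        radius = rng.integers(radius_range[0], radius_range[1] + 1)
        disk = (rows - cy) ** 2 + (cols - cx) ** 2 <= radius ** 2
        out[disk] = rng.uniform(*intensity_range)
        mask |= disk
    return out, mask


def residual_map(query, closest_generated):
    """Absolute pixel-wise difference between a query and its closest sample."""
    return np.abs(query.astype(np.float32) - closest_generated.astype(np.float32))
```

Large values in the residual map mark regions that the healthy-image generator could not reproduce, which is what the method treats as likely anomalies.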
Generated Image Quality.
fig:example shows that samples from StyleGAN2 have the highest visual quality, which is supported by the radiologist's classification error rate in tab:example. For all of the chosen architectures, the radiologist was able to identify most generated images as fake, most commonly because of model-generated artifacts that would not be present in real images. Furthermore, we observed that StyleGAN2 generates more diverse samples and suffers less from mode collapse than the other approaches.
Domain Fréchet Distance Metric.
We observed that the FID metric is inconsistent with the visual quality of samples in this domain: StyleGAN2 produces the best-looking samples and should therefore have the lowest FID, but it does not (see tab:example and fig:example). We hypothesize that this is because our wbMRI images are very different from the natural images used to train Inception v3. The DFD, computed in the VAE feature space, correctly captures the order of model performance on the same dataset.
fig:example2A shows a proof-of-concept of the anomaly detection method proposed by [1703.05921] for wbMRI. Since the GAN is only trained using healthy images, by finding the closest image in the generative distribution, we can highlight anomalous areas in a diseased query. We demonstrate that our GAN outperforms the classic watershed segmentation in fig:example2B.
In this paper, we demonstrate that state-of-the-art GANs are able to generate the pediatric wbMRIs needed to enable automated cancer detection. In particular, samples generated using the StyleGAN2 architecture had high enough visual fidelity that our radiologist classified some of them as real. We also demonstrate that the FID metric used in the GAN literature is inappropriate for this domain and that DFD is a promising alternative. Finally, we show a downstream task of anomaly detection, using the GAN trained on healthy images to detect cancerous lesions, which may mitigate the need for scarce examples of wbMRIs with cancer.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and The Mark Foundation for Cancer Research. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/#partners).
Appendix A GAN Training Settings
|Model||Batch Size||Instance Noise steps|
* Batch size at each progressive growth step between 1 and 6, respectively.
** After the complete growing of the last layer.
In all models, a similar architecture skeleton is used to upsample the noise vector (of size 512) to the image resolution. The generator first maps the noise to 512 low-resolution feature maps with a fully connected layer. Next, 5 convolutional blocks, each consisting of two convolutional layers (with stride 1) and an upsampling layer (bilinear interpolation) in between, double the width and height of the feature maps. The number of feature maps is also halved in the last 3 blocks. The result is passed to one last convolutional layer to obtain the grayscale image. The discriminator is almost a mirror of the generator: it obtains intermediate feature maps of the same dimensions with similar convolutional blocks, but downsamples the width and height with a convolutional stride of 2. Two fully connected layers of size 512 and an output layer are added at the end. The remaining hyperparameters and training details are inherited from the original StyleGAN paper.
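The feature-map sizes implied by this skeleton can be traced with a short sketch. It assumes an output resolution of 800 by 256 pixels (the uniform slice size used for the dataset) and infers the starting size by dividing by 2 for each of the 5 blocks; both are inferences from the description rather than stated values, and `generator_shapes` is a hypothetical helper.

```python
def generator_shapes(image_hw=(800, 256), n_blocks=5, base_channels=512,
                     halving_blocks=3):
    """Trace (channels, height, width) through the upsampling blocks.

    Assumption: the starting feature-map size is the output size divided
    by 2**n_blocks, since every block doubles height and width.
    """
    factor = 2 ** n_blocks
    if image_hw[0] % factor or image_hw[1] % factor:
        raise ValueError("image size must be divisible by 2**n_blocks")
    channels = base_channels
    height, width = image_hw[0] // factor, image_hw[1] // factor
    shapes = [(channels, height, width)]
    for block in range(n_blocks):
        height, width = height * 2, width * 2
        # Only the last `halving_blocks` blocks halve the channel count.
        if block >= n_blocks - halving_blocks:
            channels //= 2
        shapes.append((channels, height, width))
    return shapes


for shape in generator_shapes():
    print(shape)
```

Under these assumptions the fully connected layer would produce 512 maps of size 25 by 8, and the last block would output 64 maps at 800 by 256 before the final convolution.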
For the training of DCGAN, StyleGAN, and PGStyleGAN, Gaussian noise is independently added to each pixel in both real and fake images, and its magnitude is linearly reduced to 0 over the number of steps indicated in Table 2. In the training of all models, a latent dimension of 512 is used to sample Gaussian noise.
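A linear instance-noise schedule of this kind can be written in a few lines. The initial standard deviation `sigma_start` is a placeholder parameter, and the function name is hypothetical.

```python
def instance_noise_sigma(step, total_steps, sigma_start):
    """Noise standard deviation at `step`, decayed linearly to 0.

    `sigma_start` is a placeholder for the initial standard deviation.
    """
    if total_steps <= 0:
        return 0.0
    remaining = max(0.0, 1.0 - step / total_steps)
    return sigma_start * remaining
```

At each training step, noise with this standard deviation would be added to both the real and the generated batch before they are passed to the discriminator; after `total_steps` the schedule returns 0 and the inputs are left unchanged.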