The evident improvement in global cancer survival over recent decades is arguably attributable not only to health care reforms, but also to advances in clinical research (e.g., targeted therapy based on molecular markers) and diagnostic imaging technology (e.g., whole-body magnetic resonance imaging (MRI) messiou2019guidelines and positron emission tomography–computed tomography (PET-CT) arnold2019progress). Nonetheless, cancers still figure among the leading causes of morbidity and mortality worldwide ferlay2015cancer, with an estimated 9.6 million cancer-related deaths in 2018 WHO2018Cancer. The most frequent causes of cancer death worldwide in 2018 were lung (1.76 million), colorectal (0.86 million), stomach (0.78 million), liver (0.78 million), and breast cancer (0.63 million) WHO2018Cancer. These figures are likely to continue increasing as a consequence of the ageing and growth of the world population jemal2011global.
A large proportion of the global cancer burden could be prevented through early detection and treatment jemal2011global. For example, early detection offers the possibility to treat a tumour before it acquires critical combinations of genetic alterations (e.g., metastasis with evasion of apoptosis hanahan2000hallmarks). After evolving from a single neoplastic cell, typically following a Gompertzian norton1976predicting growth pattern, solid tumours become detectable by medical imaging modalities only at an approximate size of cells frangioni2008new (in vitro studies reported theoretical detection limits for human cancer cell lines using PET; in clinical settings, the detection limit is larger and depends, among others, on background radiation, cancer cell line, and cancer type fischer2006few). To detect and diagnose tumours, radiologists inspect, normally by visual assessment, medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), ultrasound (US), x-ray mammography (MMG), and PET frangioni2008new; itri2018fundamentals; mccreadie2009eight.
Medical imaging data evaluation is time demanding and therefore costly. In addition, data from new acquisition technologies (e.g., digital breast tomosynthesis swiecicki2021generative) continuously become available, and studies generally show an extensive increase in analysable imaging volumes mcdonald2015effects. Moreover, diagnostic quality in radiology varies and depends strongly on the personal experience, skills, and invested time of the data examiner itri2018fundamentals; elmore1994variability; woo2020intervention. Hence, to decrease cost and increase quality, automated or semi-automated diagnostic tools can assist radiologists in the decision-making process. Such diagnostic tools comprise traditional machine learning, but also recent deep learning methods, which promise immense potential for detection performance improvement in radiology.
The rapid increase in graphics processing unit (GPU) processing power has allowed training deep learning algorithms such as convolutional neural networks (CNNs) Fukushima1980; lecun1989backpropagation on large image datasets, achieving impressive results in computer vision cireaan2012multi; krizhevsky2012imagenet and cancer imaging cirecsan2013mitosis. In particular, the success of AlexNet in the 2012 ImageNet challenge krizhevsky2012imagenet triggered an increased adoption of deep neural networks for a multitude of problems in numerous fields and domains, including medical imaging, as reviewed in shen2017deep; 9363915; litjens2017survey. Despite the increased use of medical imaging in clinical practice, the public availability of medical imaging data remains limited mcdonald2015effects. This represents a key impediment for the training, research, and use of deep learning algorithms in radiology and oncology. Clinical centres refrain from sharing such data for ethical, legal, technical, and financial (e.g., costly annotation) reasons bi2019artificial.
Such cancer imaging data is necessary not only to train deep learning models, but also to provide them with sufficient learning opportunity to acquire robustness and generalisation capabilities. We define robustness as the property of a predictive model to remain accurate despite variations in the input data (e.g., noise levels, resolution, contrast, etc.). We refer to a model's generalisation capability as its property of preserving predictive accuracy on new data from unseen sites, hospitals, scanners, etc. Both of these properties are particularly desirable in cancer imaging, considering the frequent presence of biased or unbalanced data with sparse or noisy labels (alongside tumour manifestation heterogeneity, and multi-centre, multi-organ, multi-modality, multi-scanner, and multi-vendor data). Both robustness and generalisation are essential to demonstrate the trustworthiness of a deep learning model for usage in a clinical setting, where every edge case needs to be detected and a false negative can cost the life of a patient.
We hypothesise that the variety of data needed to train robust and well-generalising deep learning models for cancer imaging can be largely synthetically generated using Generative Adversarial Networks (GANs) goodfellow2014generative. The unsupervised adversarial learning scheme in GANs is based on a generator that generates synthetic (alias "fake") samples of a target distribution, trying to fool a discriminator, which classifies these samples as either real or fake. Various papers have reviewed GANs in the medical imaging domain yi2019generative; kazeminia2020gans; tschuchnig2020generative; sorin2020creating; lan2020generative; singh2020medical, but they focused on a general presentation of the main methods and possible applications. In cancer imaging, however, there are specificities and challenges that call for specific implementations and solutions based on GANs, including:
the small size and complexity of cancerous lesions;
the high heterogeneity between tumours, within as well as between patients and cancer types;
the difficulty to annotate, delineate, and label cancer imaging studies at large scale;
the high data imbalance, in particular between healthy and pathological subjects or between benign and malignant cases;
the difficulty to gather large consented datasets from highly vulnerable patients undergoing demanding care plans.
Hence, the present paper contributes a unique perspective and comprehensive analysis of GANs attempting to address the specific challenges of the cancer imaging domain. To the authors' best knowledge, this is the first survey that exclusively focuses on GANs in cancer imaging. In this context, we define cancer imaging as the entirety of approaches for research, diagnosis, and treatment of cancer based on medical images. Our survey comprehensively analyses cancer imaging GAN applications with a focus on radiology modalities. As presented in Figure 1, we recognise that non-radiology modalities are also widely used in cancer imaging. For this reason, we do not restrict the scope of our survey to radiology, but also analyse relevant GAN publications in these other modalities, including histopathology and cytopathology (e.g., in section 4.5) and dermatology (e.g., in sections 4.3.2 and 4.4).
Lastly, our survey uncovers and highlights promising GAN research directions that can facilitate the sustainable adoption of AI in clinical oncology and radiology.
In this paper, we start by introducing the GAN methodology, before highlighting extensions of the GAN framework that are relevant for cancer imaging. Next, we structure the paper around the challenges of cancer imaging, where each section corresponds to a challenge category. For each category, we provide descriptions and examples for various corresponding cancer imaging challenges and sub-challenges. After introducing each challenge, we analyse the literature to learn how GANs have addressed it in the past and highlight prospective avenues of future research to point out unexploited potential of GANs in cancer imaging. Figure 3 depicts an overview of the structure of our paper containing the mapping of cancer imaging challenges to solutions based on adversarial training techniques.
2 Review Methodology
Our review comprises two comprehensive literature screening processes. The first screening process surveyed the current challenges in the field of cancer imaging with a focus on radiology imaging modalities. After screening and gaining a deepened understanding of AI-specific and general cancer imaging challenges, we grouped these challenges for further analysis into the following five categories.
Data scarcity and usability challenges (section 4.1); discussing dataset shifts, class imbalance, fairness, generalisation, domain adaptation and the evaluation of synthetic data.
Data access and privacy challenges (section 4.2); comprising patient data sharing under privacy constraints, security risks, and adversarial attacks.
Data annotation and segmentation challenges (section 4.3.2); discussing costly human annotation, high inter- and intra-observer variability, and the consistency of extracted quantitative features.
Detection and diagnosis challenges (section 4.4); analysing the challenges of high diagnostic error rates among radiologists, early detection, and detection model robustness.
Treatment and monitoring challenges (section 4.5); covering treatment planning, tumour profiling, and treatment response monitoring.
The second screening process comprised a generic followed by a specific literature search to find papers that apply adversarial learning (i.e., GANs) to cancer imaging. In the generic literature search, broad search queries such as "Cancer Imaging GAN", "Tumour GANs", or "Nodule Generative Adversarial Networks" were used to recall a high number of papers. The specific search focused on answering key questions of interest to the aforesaid challenges, using queries such as "Carcinoma Domain Adaptation Adversarial", "Skin Melanoma Detection GAN", "Brain Glioma Segmentation GAN", or "Cancer Treatment Planning GAN".
In section 4, we map the papers that apply GANs to cancer imaging (second screening) to the surveyed cancer imaging challenges (first screening). The mapping of GAN-related papers to challenge categories facilitates analysing the extent to which existing GAN solutions solve the current cancer imaging challenges and helps to identify gaps and further potential for GANs in this field. The mapping is based on the evaluation criteria used in the GAN-related papers and on the relevance of the reported results to the corresponding section. For example, if a GAN generates synthetic data that is used to train and improve a tumour detection model, then this paper is assigned to the detection and diagnosis challenge section 4.4. If a paper describes a GAN that improves a segmentation model, then this paper is assigned to the segmentation and annotation challenge section 4.3.2, and so forth.
To gather the literature (i.e., first, papers describing cancer imaging challenges; second, papers proposing GAN solutions), we searched medical imaging, computer science, and clinical conference proceedings and journals, but also freely on the web using the search engines Google, Google Scholar, and PubMed. After retrieving all papers with a title related to the subject, their abstracts were read to filter out non-relevant papers. A full-text analysis was done for the remaining papers to determine whether they were to be included in our manuscript. We analysed the reference sections of the included papers to find additional relevant literature, which also underwent filtering and full-text screening. Applying this screening process, we reviewed and included a total of 163 GAN cancer imaging publications, comprising both peer-reviewed articles and conference papers, but also relevant preprints from arXiv and bioRxiv.
Details about these 163 GAN cancer imaging publications can be found in tables 2-6. The distribution of these publications across challenge category, year, modality, and anatomy is outlined in Figure 2.
3 GANs and Extensions
3.1 Introducing the Theoretical Underpinnings of GANs
Generative Adversarial Networks (GANs) goodfellow2014generative are a type of generative model with a differentiable generator network goodfellow2016deep. GANs are formalised as a minimax two-player game, where the generator network (G) competes against an adversary network called the discriminator (D). As visualised in Figure 4, given a random noise distribution $p_z$, G generates samples $G(z)$ that D classifies as either real (drawn from the training data distribution $p_{data}$) or fake (drawn from the generator distribution $p_g$). A sample $x$ is drawn from either $p_{data}$ or $p_g$ with a probability of 50%. D outputs a value $D(x) \in [0, 1]$ indicating the probability that $x$ is a real training example rather than one of G's fake samples goodfellow2016deep. As defined by Goodfellow et al goodfellow2014generative, the task of the discriminator can be characterised as binary classification (CLF) of samples. Hence, the discriminator can be trained using binary cross-entropy, resulting in the following loss function:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

D's training objective is to minimise $\mathcal{L}_D$ (or, equivalently, maximise $-\mathcal{L}_D$), while the goal of the generator is the opposite, resulting in the value function of a two-player zero-sum game between D and G:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

In theory, at convergence, the generator's samples become indistinguishable from the real training data ($p_g = p_{data}$) and the discriminator outputs $D(x) = \frac{1}{2}$ for any given sample goodfellow2016deep. As this is a state where neither D nor G can improve its objective by changing only its own strategy, it represents a Nash equilibrium farnia2020gans; nash1950equilibrium. In practice, achieving convergence for this or related adversarial training schemes remains an open research problem kodali2017convergence; mescheder2018training; farnia2020gans.
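To make the two objectives concrete, the following minimal sketch (plain Python, with batches given as lists of discriminator output probabilities) computes the binary cross-entropy discriminator loss and the minimax generator term:

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy discriminator loss: D should output high
    probabilities for real samples (d_real) and low ones for fakes (d_fake)."""
    loss_real = -sum(math.log(p) for p in d_real) / len(d_real)
    loss_fake = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return loss_real + loss_fake

def generator_loss(d_fake):
    """Minimax generator term E[log(1 - D(G(z)))], which G minimises.
    In practice the non-saturating variant -E[log D(G(z))] is often
    preferred for stronger gradients early in training."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# At the theoretical equilibrium D outputs 1/2 for every sample,
# so the discriminator loss equals 2*log(2).
equilibrium_d_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

In a full training loop, the two losses are minimised in alternating gradient steps on D's and G's parameters, respectively.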
3.2 Extensions of the Vanilla GAN Methodology
Numerous extensions of GANs have been shown to generate synthetic images with high realism karras2017progressive; karras2019style; karras2020analyzing; chan2020pi and under flexible conditions mirza2014conditional; odena2017conditional; park2018mc. GANs have been successfully applied to generate high-dimensional data such as images and, more recently, have also been proposed to generate discrete data hjelm2017boundary. Apart from image generation, GANs have also widely been proposed and applied for paired and unpaired image-to-image translation, domain adaptation, data augmentation, image inpainting, image perturbation, super-resolution, and image registration and reconstruction yi2019generative; kazeminia2020gans; wang2019generative.
Table 1 introduces a selection of common GAN extensions found to be frequently applied to cancer imaging. The key characteristics of each of these GAN extensions are described in the following paragraphs.
3.2.1 Noise-to-Image GAN Extensions
As depicted in blue in Figure 4, cGAN adds a discrete label as conditional information to the original GAN architecture; the label is provided as input to both the generator and the discriminator to generate class-conditional samples mirza2014conditional.
AC-GAN feeds the class label only to the generator while the discriminator is tasked with correctly classifying both the class label and whether the supplied image is real or fake odena2017conditional.
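As a minimal sketch of such label conditioning (with hypothetical layer, embedding, and image sizes, and flattened images), the class label can be embedded and concatenated with the noise vector before the first generator layer; a cGAN discriminator would receive the same label alongside the image:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal cGAN-style generator: the discrete class label is embedded
    and concatenated with the noise vector (hypothetical sizes)."""
    def __init__(self, noise_dim=64, n_classes=2, embed_dim=8, out_pixels=32 * 32):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_pixels),
            nn.Tanh(),  # pixel intensities in [-1, 1]
        )

    def forward(self, z, labels):
        cond = torch.cat([z, self.embed(labels)], dim=1)
        return self.net(cond)
```

For AC-GAN, the discriminator would instead omit the label input and add a classification head trained with a cross-entropy loss on the label.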
LSGAN strives to overcome the vanishing gradient problem in GANs by replacing the binary sigmoid cross-entropy loss function with a least squares loss function mao2017squares.
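A least-squares sketch of these objectives (plain Python over batches of raw discriminator outputs, with the commonly used targets a = 0 for fakes and b = c = 1 for reals):

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss: push outputs on real samples
    towards b and outputs on fake samples towards a."""
    real = sum((x - b) ** 2 for x in d_real) / (2 * len(d_real))
    fake = sum((x - a) ** 2 for x in d_fake) / (2 * len(d_fake))
    return real + fake

def lsgan_g_loss(d_fake, c=1.0):
    """Least-squares generator loss: push discriminator outputs on
    fake samples towards c."""
    return sum((x - c) ** 2 for x in d_fake) / (2 * len(d_fake))
```

Unlike the sigmoid cross-entropy loss, these quadratic terms penalise confidently classified fakes in proportion to their distance from the decision boundary, which mitigates vanishing gradients.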
WGAN is motivated by mathematical rationale and based on the Wasserstein-1 distance (alias "earth mover's distance" or "Kantorovich distance") between two distributions. WGAN extends the theoretical formalisation and optimisation objective of the vanilla GAN to better approximate the distribution of the real data. By applying an alternative loss function (i.e., the Wasserstein loss), the discriminator (alias "critic") maximises, and the generator minimises, the difference between the critic's scores for generated and real samples. An important benefit of WGAN is the empirically observed correlation of the loss with sample quality, which helps to interpret WGAN training progress and convergence arjovsky2017wasserstein.
In WGAN, the weights of the critic are clipped, meaning they have to lie within a compact space $[-c, c]$. This is needed to ensure that the critic is constrained to the space of 1-Lipschitz functions. With clipped weights, however, the critic is biased towards learning simpler functions and prone to exploding or vanishing gradients if the clipping threshold is not tuned with care gulrajani2017improved; arjovsky2017wasserstein.
In WGAN-GP, the weight clipping constraint is replaced with a gradient penalty. The gradient penalty on the critic is a tractable, soft version of the following notion: by constraining the norm of the gradients of a differentiable function to be at most 1 everywhere, the function (i.e., the critic) would fulfil the 1-Lipschitz criterion without the need for weight clipping. Compared, among others, to WGAN, WGAN-GP was shown to have improved training stability (i.e., across many different GAN architectures), training speed, and sample quality gulrajani2017improved.
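The penalty is evaluated at random interpolates between real and generated samples. A PyTorch sketch for 2D inputs of shape (batch, features) follows; for image tensors, the gradient norm would be taken over all non-batch dimensions:

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP term E[(||grad_x critic(x_hat)||_2 - 1)^2], evaluated at
    random interpolates x_hat between real and fake samples."""
    eps = torch.rand(real.size(0), 1)          # per-sample mixing coefficient
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,                     # allow backprop through the penalty
    )[0]
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

The returned penalty is added to the Wasserstein critic loss, weighted by a coefficient (10 in the WGAN-GP paper).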
DCGAN generates realistic samples using a convolutional network architecture with batch normalization ioffe2015batch for both generator and discriminator, and progressively increases the spatial dimensions in the layers of the generator using transposed convolutions (alias "fractionally-strided convolutions") radford2015unsupervised.
PGGAN is trained with the loss and configurations introduced in WGAN-GP. It starts by generating images at a low pixel resolution, but progressively adds new layers to the generator and discriminator during training, resulting in increased pixel resolution and finer image details. It is suggested that, after early convergence of the initial low-resolution layers, the additionally introduced layers force the network to only refine the learned representations by increasingly smaller-scale effects and features karras2017progressive.
In SRGAN, the generator transforms a low-resolution (LR) image into a high-resolution (HR, alias "super-resolution") image, while the discriminator learns to distinguish between real high-resolution images and fake super-resolution images. Apart from an adversarial loss, a perceptual loss called "content loss" measures how well the generator represents higher-level image features. This content loss is computed as the Euclidean distance between feature representations of the reconstructed image and the reference image, based on feature maps of a pretrained 19-layer VGG simonyan2014very network ledig2017photo.
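The content loss can be sketched as the mean squared distance between feature maps. The stand-in feature extractor below is an assumption for illustration; SRGAN itself uses feature maps of a pretrained VGG-19 (e.g., available via torchvision):

```python
import torch
import torch.nn.functional as nnf

def content_loss(feature_extractor, sr_image, hr_image):
    """Euclidean (MSE) distance between feature representations of the
    super-resolved image and the high-resolution reference."""
    with torch.no_grad():                      # no gradients through the target
        target_features = feature_extractor(hr_image)
    return nnf.mse_loss(feature_extractor(sr_image), target_features)
```

In SRGAN, this term is combined with the adversarial loss to form the overall perceptual objective.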
| Publication | Input G | Input D | Losses | Task |
| --- | --- | --- | --- | --- |
| Noise to Image | | | | |
| GAN (2014) goodfellow2014generative | Noise | Image | Binary cross-entropy based adversarial loss ($\mathcal{L}_{adv}$) | Image synthesis |
| conditional GAN (cGAN, 2014) mirza2014conditional | Noise & label | Image & label | $\mathcal{L}_{adv}$ | Conditional image synthesis |
| Auxiliary Classifier GAN (AC-GAN, 2017) odena2017conditional | Noise & label | Image | $\mathcal{L}_{adv}$ & cross-entropy loss (label classification) | Conditional image synthesis, classification |
| Deep Convolutional GAN (DCGAN, 2015) radford2015unsupervised | Noise | Image | $\mathcal{L}_{adv}$ | Image synthesis |
| Wasserstein GAN (WGAN, 2017) arjovsky2017wasserstein | Noise | Image | Wasserstein loss ($\mathcal{L}_{W}$) | Image synthesis |
| WGAN Gradient Penalty (WGAN GP, 2017) gulrajani2017improved | Noise | Image | $\mathcal{L}_{W}$ with GP | Image synthesis |
| Least Squares GAN (LSGAN, 2017) mao2017squares | Noise | Image | Least squares loss | Image synthesis |
| Progressively Growing GAN (PGGAN, 2017) karras2017progressive | Noise | Image | $\mathcal{L}_{W}$ with GP gulrajani2017improved | Image synthesis |
| Image to Image | | | | |
| Super-Resolution GAN (SRGAN, 2017) ledig2017photo | Image (LR) | Image (HR) | $\mathcal{L}_{adv}$ & content loss (based on VGG simonyan2014very features) | Super-resolution |
| CycleGAN (2017) zhu2017unpaired | Source image | Target image | $\mathcal{L}_{adv}$ & cycle consistency loss & identity loss | Unpaired image-to-image translation |
| pix2pix (2017) isola2017image | Source image | Concatenated source and target images | $\mathcal{L}_{adv}$ & reconstruction loss (i.e. L1) | Paired image-to-image translation |
| SPatially-Adaptive (DE)normalization (SPADE, 2019) park2019semantic | Noise or encoded source image & segmentation map | Concatenated target image and segmentation map | Hinge & perceptual & feature matching losses wang2018high | Paired image-to-image translation |
3.2.2 Image-to-Image GAN Extensions
In image-to-image translation, a mapping is learned from one image distribution to another. For example, images from one domain can be transformed to resemble images from another domain via a mapping function implemented by a GAN generator.
CycleGAN achieves realistic unpaired image-to-image translation using two generators ($G: X \rightarrow Y$ and $F: Y \rightarrow X$), each with a traditional adversarial loss, and an additional cycle-consistency loss. Unpaired image-to-image translation transforms images from a domain $X$ to another domain $Y$ in the absence of paired training data, i.e. corresponding image pairs for both domains. In CycleGAN, an input image $x$ from domain $X$ is translated by generator $G$ to resemble a sample from domain $Y$. Next, the sample is translated back from domain $Y$ to domain $X$ by generator $F$. The cycle consistency loss enforces that $F(G(x)) \approx x$ (forward cycle consistency) and that $G(F(y)) \approx y$ (backward cycle consistency) zhu2017unpaired.
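The cycle-consistency term can be sketched as an L1 reconstruction penalty in both directions (PyTorch; g_xy and f_yx stand for the two generators, and lam is the weighting coefficient, 10 by default in the paper):

```python
import torch
import torch.nn.functional as nnf

def cycle_consistency_loss(g_xy, f_yx, real_x, real_y, lam=10.0):
    """lam * (||F(G(x)) - x||_1 + ||G(F(y)) - y||_1): translating an image
    to the other domain and back should recover the original image."""
    forward_cycle = nnf.l1_loss(f_yx(g_xy(real_x)), real_x)
    backward_cycle = nnf.l1_loss(g_xy(f_yx(real_y)), real_y)
    return lam * (forward_cycle + backward_cycle)
```

This term is added to the two adversarial losses; without it, the mapping between unpaired domains would be heavily under-constrained.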
Both pix2pix and SPADE are used for paired image-to-image translation, where corresponding image pairs for both domains $X$ and $Y$ are available. pix2pix (alias "condGAN") is a conditional adversarial network that adapts the U-Net architecture ronneberger2015u (which reduces information loss in latent space compression via skip connections between corresponding layers, e.g., first to last, in the encoder and decoder) for the generator, to facilitate encoding a conditional input image into a latent representation before decoding it back into an output image. pix2pix uses an L1 loss to enforce low-level (alias "low frequency") image reconstruction and a patch-based discriminator ("PatchGAN") to enforce high-level (alias "high frequency") image reconstruction, which the authors suggest interpreting as a texture/style loss. Note that the input to the PatchGAN discriminator is a concatenation (see the concatenation of real_A and fake_B before computing the loss in the discriminator backward pass in the authors' pix2pix implementation) of the original image (i.e. the generator's input image, e.g. a segmentation map) and the real/generated image (i.e. the generator's output image) isola2017image.
In SPADE, the generator architecture does not rely on an encoder for downsampling, but instead uses a conditional normalisation method during upsampling: a segmentation mask, as conditional input to the SPADE generator, is provided to each of its upsampling layers via spatially-adaptive residual blocks. These blocks embed the mask and apply two two-layer convolutions to the embedded mask to obtain two tensors with spatial dimensions. These two tensors are multiplied with and added to the activations of each upsampling layer prior to its activation function. The authors demonstrate that this type of normalisation achieves better fidelity and preservation of semantic information in comparison to other normalisation methods commonly applied in neural networks (e.g., Batch Normalization). SPADE adopts the multi-scale discriminators and loss functions of pix2pixHD wang2018high, comprising a hinge loss (i.e. as a substitute for the adversarial loss), a perceptual loss, and a feature matching loss park2019semantic.
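A minimal sketch of one such spatially-adaptive normalisation step follows (hypothetical channel sizes and a simplified, non-residual block; the actual SPADE blocks are residual and use additional machinery such as spectral normalisation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class SPADENorm(nn.Module):
    """Parameter-free batch norm modulated by a segmentation mask: the mask
    is embedded by a shared convolution, from which per-pixel scale (gamma)
    and bias (beta) tensors are predicted and applied to the activations."""
    def __init__(self, channels, mask_channels, hidden=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.to_gamma = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x, mask):
        # Resize the mask to the current upsampling layer's resolution.
        mask = nnf.interpolate(mask, size=x.shape[2:], mode="nearest")
        h = self.shared(mask)
        # Multiply/add the predicted tensors onto the normalised activations.
        return self.bn(x) * (1 + self.to_gamma(h)) + self.to_beta(h)
```

Because the scale and bias vary per pixel according to the mask, semantic layout information is injected at every resolution rather than being washed out by unconditional normalisation.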
3.2.3 GAN Network Architectures and Adversarial Loss
For further methodological detail on the aforementioned GAN methods, loss functions, and architectures, we point the interested reader to the GAN methods review by Wang et al wang2019generative. Due to the image processing capabilities of CNNs lecun1989backpropagation, the above-mentioned GAN architectures generally rely on CNN layers internally. Recently, TransGAN jiang2021transgan was proposed, which diverges from the CNN design pattern by using Transformer neural networks vaswani2017attention as the backbone of both generator and discriminator. Due to TransGAN's promising performance on computer vision tasks, we encourage future studies to investigate the potential of transformer-based GANs for applications in medical and cancer imaging.
Multiple deep learning architectures apply the adversarial loss proposed in goodfellow2014generative together with other loss functions (e.g., segmentation loss functions) for tasks other than image generation (e.g., image segmentation). This adversarial loss is useful for learning features or representations that are invariant to some part of the training data. For instance, adversarial learning can be used to discriminate a domain in order to learn domain-invariant representations ganin2015unsupervised, as has been successfully demonstrated for medical images kamnitsas2017unsupervised. In the scope of our GAN survey, we include and consider all relevant cancer imaging papers that apply or build upon the adversarial learning scheme defined in goodfellow2014generative.
4 Cancer Imaging Challenges Addressed by GANs
In this section we follow the structure presented in Figure 3, where we categorise cancer imaging challenges into five categories consisting of data scarcity and usability (4.1), data access and privacy (4.2), data annotation and segmentation (4.3.2), detection and diagnosis (4.4), and treatment and monitoring (4.5). In each subsection, we group and analyse respective cancer imaging challenges and discuss the potential and the limitations of corresponding GAN solutions. In this regard, we also identify and highlight key needs to be addressed by researchers in the field of cancer imaging GANs towards solving the surveyed cancer imaging challenges. We provide respective tables 2-6 for each subsection 4.1-4.5 containing relevant information (publication, method, dataset, modality, task, highlights) for all of the reviewed cancer imaging GAN solutions.
4.1 Data Scarcity and Usability Challenges
4.1.1 Challenging Dataset Sizes and Shifts
Although data repositories such as The Cancer Imaging Archive (TCIA) clark2013cancer have made a wealth of cancer imaging data available for research, the demand is still far from satisfied. As a result, data augmentation techniques are widely used to artificially enlarge the existing datasets, traditionally including simple spatial (e.g., flipping, rotation) or intensity transformations (e.g., noise insertion) of the true data. GANs have shown promise as a more advanced augmentation technique and have already seen use in medical and cancer imaging han2018ganBRATS; yi2019generative.
Aside from the issue of lacking sizeable data, data scarcity often forces studies to be constrained to small-scale, single-centre datasets. The resulting findings and models are likely not to generalise well due to diverging distributions between the (synthetic) datasets seen in training and those seen in testing or after deployment, a phenomenon known as dataset shift quionero2009dataset (more concretely, a case of covariate shift quionero2009dataset; shimodaira2000improving, defined by a change of distribution within the independent variables between two datasets). An example of this in clinical practice are cases where training data is preselected from specific patient sub-populations (e.g., only high-risk patients), resulting in bias and limited generalisability to the broad patient population troyanskaya2020artificial; bi2019artificial.
From a causality perspective, dataset shift can be split into several distinct scenarios castro2020causality:
Population shift, caused by differences in age, sex, ethnicity, etc.;
Acquisition shift, caused by differences in scanners, resolution, contrast, etc.;
Annotation shift, caused by differences in annotation policy, annotator experience, segmentation protocols, etc.;
Prevalence shift, caused by differences in disease prevalence in the population, often resulting from artificial sampling of data;
Manifestation shift, caused by differences in how the disease manifests.
GANs may inadvertently introduce such types of dataset shift (e.g., due to mode collapse goodfellow2014generative), but it has been shown that this shift can be studied, measured, and avoided santurkar2018classification; arora2018gans. GANs can be a sophisticated tool for data augmentation or curation diaz2021data, and by calibrating the type of shift introduced, they have the potential to turn it into an advantage, generating diverse training data that can help models generalise better to unseen target domains. The research line studying this problem is called domain generalisation, and it has presented promising results for harnessing adversarial models towards the learning of domain-invariant features zhou2021domain. GANs have been used in various ways in this context, using multi-source data to generalise to unseen targets rahman2019multi; li2018domain, or in an unsupervised manner using adaptive data augmentation that iteratively appends adversarial examples volpi2018generalizing. As indicated in Figure 3(a), the domain generalisation research line has recently been extended to cancer imaging lafarge2019learning; chen2021generative.
In the following, further cancer imaging challenges in the realm of data scarcity and usability are described and related GAN solutions are referenced. Given these challenges and solutions, we derive a workflow for clinical adoption of (synthetic) cancer imaging data, which is illustrated in Figure 5.
4.1.2 Imbalanced Data and Fairness
Apart from the rise of data-hungry deep learning solutions and the need to cover different organs and data acquisition modalities, a major problem that arises from data scarcity is that of imbalance, i.e. the overrepresentation of certain types of data over others bi2019artificial. In its more common form, imbalance of diagnostic labels can hurt a model's specificity or sensitivity, as a prior bias from the data distribution may be learned. The Lung Screening Study (LSS) Feasibility Phase exemplifies the common class imbalance in cancer imaging data: 325 (20.5%) suspicious lung nodules were detected in the first low-dose CT screenings of 1586 participants, of which only 30 (1.89%) were lung cancers gohagan2004baseline; gohagan2005final; national2011national. This problem directly translates to multi-task classification (CLF), with imbalance between different types of cancer leading to worse sensitivity on the underrepresented categories yu2013recognition. It is important to note that solving the imbalance with augmentation techniques introduces bias, as the prior distribution is manipulated, causing prevalence shift. As such, the test set should preserve the population statistics. Aside from imbalance of labels, more insidious forms of imbalance, such as that of the race adamson2018machineRACEdisparity or gender larrazabal2020gender of patients, are easily omitted in studies. This leads to fairness problems in real-world applications, as underrepresenting such categories in the training set will hurt performance on these categories in the real world (population shift) li2021estimating. Because of their potential to generate synthetic data, GANs are a promising solution to the aforementioned problems and have already been thoroughly explored in computer vision sampath2021surveyIMBALANCE; mullick2019generative.
Concretely, the discriminator and generator can be conditioned on underrepresented labels, forcing the generator to create images for a specific class, as indicated in Figure 3(d). The class can be something as simple as "malignant" or "benign", or a more complex score for risk assessment of a tumour, such as the BI-RADS scoring system for breast tumours liberman2002breast. Many lesions classifiable by complex scoring systems such as RADS reporting are rare and, hence, effective conditional data augmentation is needed to improve the recognition of such lesions by ML detection models kazuhiro2018generative. GANs have already been used to adjust label distributions in imbalanced cancer imaging datasets, e.g. by generating underrepresented grades in a risk assessment scoring system for prostate cancer hu2018prostategan. A further promising applicable method is to enrich the data using a related domain as proxy input addepalli2020degan. Towards the goal of a more diverse distribution of data with respect to gender and race, similar principles can be applied. For instance, Li et al li2021estimating proposed an adversarial training scheme to improve fairness in the classification of skin lesions for underrepresented groups (age, sex, skin tone) by learning a neutral representation using an adversarial bias discrimination loss. Fairness-imposing GANs can also generate synthetic data with a preference for underrepresented groups, so that models may ingest a more balanced dataset, improving demographic parity without excluding data from the training pipeline. Such models have been trained on computer vision tasks sattigeri2018fairnessGAN; wang2019balanced; zhang2018mitigating; xu2018fairgan; beutel2017data, but corresponding research on medical and cancer imaging, denoted by Figure 3(c), has been limited li2021estimating; ghorbani2020dermgan.
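A simple way to use such a class-conditional generator for rebalancing is to synthesise just enough minority-class samples to reach parity with the majority class. The sketch below uses this naive upsampling-to-parity heuristic with hypothetical per-class counts; as noted above, the test set must still preserve the true prevalence:

```python
def synthetic_samples_needed(class_counts):
    """Per-class number of class-conditional synthetic samples required to
    bring every class up to the size of the majority class."""
    target = max(class_counts.values())
    return {label: target - count for label, count in class_counts.items()}

# Hypothetical nodule dataset: benign cases vastly outnumber malignant ones.
plan = synthetic_samples_needed({"benign": 295, "malignant": 30})
# → generate 265 "malignant" images with the conditional GAN, none for "benign".
```

More refined schemes would cap the synthetic-to-real ratio per class or tune it on a validation set, since purely synthetic minority classes can amplify generator artefacts.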
4.1.3 Cross-modal Data Generation
In cancer, multiple acquisition modalities are enlisted in clinical practice kim2016predictive; chen2017direct; barbaro2017potential; chang2020multi; chang2020synthetic; thus, automated diagnostic models should ideally learn to interpret various modalities, or learn a shared representation of these modalities. Conditional GANs offer the possibility to generate one modality from another, alleviating the need to actually perform the potentially more harmful screenings (i.e. high-dose CT, PET) that expose patients to radiation, or that require invasive contrast agents such as intravenous iodine-based contrast media (ICM) in CT haubold2021contrast, gadolinium-based contrast agents in MRI zhao2020tripartite (in Table 5), or radioactive tracers in PET wang20183dMRItoPET; zhao2020study. Furthermore, extending the acquisition modalities used in a given task can also enhance the performance and generalisability of AI models, allowing them to learn shared representations among these imaging modalities bi2019artificial; hosny2018artificial. Towards this goal, multiple GAN domain-adaptation solutions have been proposed to generate CT from MRI wolterink2017deep; kearney2020attention; tanner2018generative; kaiser2019mriMRItoCT; nie2017medical; kazemifar2020dosimetricMRItoCT; prokopenko2019unpairedMRItoCT, PET from MRI wang20183dMRItoPET, PET from CT ben2017virtual; bi2017synthesisCTtoPET (in Table 5), and CT from PET, as in Armanious et al armanious2020medgan, where PET denoising and MR motion correction are also demonstrated. If not indicated otherwise, these image-to-image translation studies are outlined in Table 2. Because of its complexity, clinical cancer diagnosis is based not only on imaging but also on non-imaging data (genomic, molecular, clinical, radiological, demographic, etc.).
In cases where this data is readily available, it can serve as conditional input to GANs towards the generation of images with the corresponding phenotype-genotype mapping, as is also elaborated in regard to tumour profiling for treatment in section 4.5.1. A multimodal cGAN was recently developed, conditioned on both images and gene expression code xu2020correlation; however, research along this line is otherwise limited.
4.1.4 Feature Hallucinations in Synthetic Data
As displayed in Figure 6 and denoted in Figure 3(b), conditional GANs can unintentionally hallucinate non-existent artifacts into a patient image (intentional feature injection or removal is discussed in section 4.2.5). This is particularly likely to occur in cross-modal data generation, especially, but not exclusively, if the underlying dataset is imbalanced. For instance, Cohen et al cohen2018distribution describe GAN image feature hallucinations embodied by added and removed brain tumours in cranial MRI. The authors tested the relationship between the ratio of tumour images in the GAN target distribution and the ratio of images diagnosed with tumours by a classifier. The classifier was trained on the GAN-generated target dataset, but tested on a balanced holdout test set. It was thereby shown that the generator of CycleGAN effectively learned to hide source-domain image features in target-domain images, which arguably helped it to fool its discriminator. Paired image-to-image translation with pix2pix isola2017image was more stable, but some hallucinations were still shown to have likely occurred. A cause for this can be a biased discriminator that has learned to discriminate specific image features (e.g., tumours) that are more present in one domain. Cohen et al cohen2018distribution; cohen2018cure and Wolterink et al wolterink2018generative warn that models that map source to target images have an incentive to add/remove features during translation if the feature distribution in the target domain is distinct from the feature distribution in the source domain (for example, if one domain contains mainly healthy images, while the other contains mainly pathological images).
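One simple way to surface such behaviour, in the spirit of the experiment described above, is to compare a feature classifier's verdicts before and after translation; the sketch below (hypothetical `hallucination_rates`) counts the fraction of images where a feature was added or removed over a batch.

```python
def hallucination_rates(source_has_feature, translated_has_feature):
    """Fraction of images in which a feature (e.g., a tumour) was added
    or removed during image-to-image translation, as judged by a feature
    classifier applied to the source and translated images."""
    added = removed = 0
    for before, after in zip(source_has_feature, translated_has_feature):
        if after and not before:
            added += 1
        if before and not after:
            removed += 1
    n = len(source_has_feature)
    return added / n, removed / n
```

Non-zero rates on a held-out set are a red flag that the translation model is hiding or inventing clinically relevant features.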
Domain adaptation with unpaired image-to-image translation GANs such as CycleGAN has become increasingly popular in cancer imaging wolterink2017deep; tanner2018generative; modanwal2019normalization; fossen2020synthesizing; zhao2020study; hognon2019standardization; mathew2020augmenting; kearney2020attention; peng2020magnetic; jiang2018tumor; sandfort2019data. As described, these unsupervised methods are hallucination-prone and can thus put patients at risk when used in clinical settings. More research is needed on how to robustly avoid, or detect and eliminate, hallucinations in generated data. To this end, we highlight the potential of investigating feature-preserving image translation techniques and methods for evaluating whether features have been accurately translated. For instance, in the presence of feature masks or annotations, an additional local reconstruction loss function can be introduced in GANs that enforces feature translation in specific image areas.
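A minimal sketch of such a local reconstruction term, assuming binary feature masks are available (hypothetical `masked_l1_loss`, NumPy): the global L1 term is augmented with an extra, heavily weighted L1 term restricted to the annotated region.

```python
import numpy as np

def masked_l1_loss(generated, target, mask, weight=10.0):
    """Global L1 reconstruction loss plus an extra weighted L1 term
    restricted to annotated feature regions (mask == 1), so that altered
    features inside the mask are penalised more strongly."""
    global_l1 = np.mean(np.abs(generated - target))
    m = mask.astype(bool)
    local_l1 = np.mean(np.abs(generated[m] - target[m])) if m.any() else 0.0
    return global_l1 + weight * local_l1
```

In a full GAN this term would be added to the adversarial loss; the `weight` hyperparameter trades off overall realism against feature fidelity.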
4.1.5 Data Curation and Harmonisation
Aside from the limited availability of cancer imaging datasets, a major problem is that the available ones are often not readily usable and require further curation hosny2018artificial. Curation includes dataset formatting, normalising, structuring, de-identification, quality assessment and other methods that facilitate subsequent data processing steps, one of which is the ingestion of the data into AI models diaz2021data. In the past, GANs have been proposed for the curation of data labelling, segmentation and annotation of images (details in section 4.3.2) and the de-identification of facial features, EHRs, etc (details in section 4.2). Particular to cancer imaging datasets, and of significant importance, is the correction of artifacts (patient motion, metallic objects, chemical shifts and others caused by the image processing pipeline pusey1986magnetic; nehmeh2002effect), which run the risk of confusing models with spurious information. Towards the principled removal of artifacts, several GAN solutions have been proposed vu2020generativeARTIFACTREMOVAL; koike2020deepARTIFACTREMOVAL. Notably, for the reconstruction of compressed data (e.g., compressed sensing MRI mardani2017deep), Yang et al yang2018dagan proposed DAGAN, which is based on U-Net ronneberger2015u, reduces aliasing artifacts, and faithfully preserves texture, boundaries and edges (of brain tumours) in the reconstructed images. Kim et al kim2018improving feed down-sampled high-resolution brain tumour MRI into a GAN framework similar to pix2pix to reconstruct high-resolution images with different contrast. The authors highlight the possible acceleration of MR image acquisition while retaining high-resolution images in multiple contrasts, which is necessary for further clinical decision-making.
As relevant to the context of data quality curation, GANs have also been proposed for image super-resolution in cancer imaging (e.g., for lung nodule detection gu2020medsrgan, abdominal CT you2019ct, and breast histopathology shahidi2021breast).
Beyond the lack of curation, a problem particular to multi-centre studies is that of inconsistent curation between data derived at different centres. These discontinuities arise from different scanners, segmentation protocols, demographics, etc., and can cause significant problems for subsequent ML algorithms, which may overfit or bias towards one configuration over another (i.e. acquisition and annotation shifts). GANs have the potential to contribute in this domain as well by bringing the distributions of images across different centres closer together. In this context, recent works by Li et al li2021normalization and Wei et al wei2020using used GAN-based volumetric normalisation to reduce the variability of heterogeneous 3D chest CT scans of different slice thickness and dose levels. The authors showed that features in subsequent radiomics analysis exhibit increased alignment. Other works in this domain include a framework that could standardise heterogeneous datasets with a single reference image, which obtained promising results on an MRI dataset hognon2019standardization, and GANs that learn bidirectional mappings between different vendors to normalise dynamic contrast enhanced (DCE) breast MRI modanwal2019normalization. An interesting research direction to be explored in the future is synthetic multi-centre data generation using GANs, simulating the distribution of various scanners/centres.
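As a toy illustration of the harmonisation goal (not of the GAN mechanics), the sketch below aligns a radiomics feature's per-centre mean and standard deviation to a reference centre (hypothetical `harmonise_to_reference`); GAN-based approaches instead target the full image distribution.

```python
import numpy as np

def harmonise_to_reference(features_by_centre, ref):
    """Shift and scale each centre's values of one radiomics feature so
    their mean and standard deviation match the reference centre's
    (a simple location-scale alignment)."""
    ref_x = np.asarray(features_by_centre[ref], dtype=float)
    mu_r, sd_r = ref_x.mean(), ref_x.std()
    out = {}
    for centre, values in features_by_centre.items():
        x = np.asarray(values, dtype=float)
        out[centre] = (x - x.mean()) / x.std() * sd_r + mu_r
    return out
```

After alignment, the inter-centre spread of feature means collapses, which is one way to quantify the "increased alignment" reported in the radiomics studies above.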
4.1.6 Synthetic Data Assessment
As indicated in Figure 3(e), a condition of paramount importance is proper evaluation of GAN-generated or GAN-curated data. This evaluation is to verify that synthetic data is usable for a desired downstream task (e.g., segmentation, classification) and/or indistinguishable from real data while ensuring that no private information is leaked. GANs are commonly evaluated based on fidelity (realism of generated samples) and diversity (variation of generated samples compared to real samples) borji2021pros. Different quantitative measures exist to assess GANs based on the fidelity and diversity of its generated synthetic medical images yi2019generative; borji2021pros.
Visual Turing tests (otherwise referred to as Visual Assessment, Mean Opinion Score (MOS) Test, and sometimes used interchangeably with In-Silico Clinical Trials) are arguably the most reliable approach: clinical experts are presented with samples from real and generated data and are tasked with identifying which are generated. Korkinof et al korkinof2020perceived showed that their PGGAN karras2017progressive-generated 1280x1024 mammograms were indistinguishable from real ones for the majority of participants, including trained breast radiologists. Similar visual Turing tests were successfully conducted in the case of skin disease ghorbani2020dermgan, super-resolution of CT you2019ct, brain MRI kazuhiro2018generative; han2018ganBRATS, lung cancer CT scans chuquicusma2018fool, and histopathology images levine2020synthesis. For instance, Chuquicusma et al chuquicusma2018fool trained a DCGAN radford2015unsupervised on the LIDC-IDRI dataset armato2011lung to generate 2D (56x56 pixel) pulmonary lung nodule scans that were realistic enough to deceive two radiologists with 11 and 4 years of experience. In contrast to computer vision techniques, where synthetic data can often be easily evaluated by any non-expert, the requirement of clinical experts makes visual Turing tests in this domain much more costly. Furthermore, a lack of scalability and consistency in medical judgement needs to be taken into account as well brennan1992statistical: visual Turing tests should ideally engage a range of experts to address inter-observer variation in the assessments, and repeat the assessment with the same observer at intervals of days or weeks to address intra-observer variation. These problems are further magnified by the shortage of radiology experts mahajan2020auditSHORTAGE; rimmer2017radiologistSHORTAGE, which brings up the necessity for supplementary metrics that can automate the evaluation of generative models.
Such metrics allow for preliminary evaluation and can enable research to progress without the logistical hurdle of enlisting experts.
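Inter-observer variation in the visual Turing tests above can be reported with chance-corrected agreement statistics; a minimal sketch of Cohen's kappa for two raters' binary real/synthetic judgements (hypothetical `cohens_kappa`, 1 = judged real, 0 = judged synthetic):

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' binary judgements.
    1.0 is perfect agreement; 0.0 is agreement at chance level."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa1 = sum(rater_a) / n                     # rater A's rate of "real"
    pb1 = sum(rater_b) / n                     # rater B's rate of "real"
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)
```

Reporting kappa alongside raw accuracy makes Turing-test results comparable across studies with different numbers of raters and class ratios.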
Furthermore, in cases where the sole purpose of the generated data is to improve a downstream task, i.e. classification or segmentation, the prediction success of the downstream task is the metric of interest. The latter can reasonably be prioritised over other metrics, provided that the underlying reasons why the synthetic data alters downstream task performance are examined and clarified (for example, synthetic data may balance imbalanced datasets, reduce overfitting on limited training data, or improve model robustness to better capture domain shifts in the test dataset).
Image Quality Assessment Metrics
Wang et al wang2004imageAUTOMATICEVALUATION have thoroughly investigated image quality assessment metrics. The most commonly applied metrics include the structural similarity index measure (SSIM) between generated image and reference image wang2004imageAUTOMATICEVALUATION, which predicts perceived quality by assessing structural information based on luminance, contrast, and structure; the mean squared error (MSE), computed by averaging the squared intensity differences between corresponding pixels of the generated image and the reference image; and the peak signal-to-noise ratio (PSNR), an adjustment of the MSE score commonly used to measure reconstruction quality in lossy compression. In a recent example that followed this framework of evaluation, synthetic brain MRI with tumours generated by the edge-aware Ea-GAN yu2019eaEVALUATION was assessed using three such metrics: PSNR, SSIM, and normalised mean squared error (NMSE). The authors integrated an end-to-end Sobel edge detector to create edge maps from real/synthetic images that are input into the discriminator in the dEa-GAN variant to enforce improved textural structure and object boundaries. Interestingly, aside from evaluating on the whole image, the authors reported evaluation results focused on the tumour regions, which were overall significantly lower than for the whole image. Other works that have evaluated their synthetic images in an automatic manner have focused primarily on the SSIM and PSNR metrics and include the generation of CT kearney2020attention; mathew2020augmenting and PET scans zhao2020study. While indicative of image quality, these similarity-based metrics might not generalise well to human judgement of image similarity, the latter depending on high-order image structure and context zhang2018unreasonable. Finding evaluation metrics that are strong correlates of human judgement of perceptual image similarity is a promising line of research. In the context of cancer and medical imaging, we highlight the need for evaluation metrics for synthetic images that correlate with the perceptual image similarity judged by medical experts. Apart from perceptual image similarity, further evaluation metrics in cancer and medical imaging are to be investigated that are able to estimate the diagnostic value of (synthetic) images and, in the presence of reference images, the proportion of diagnostic value between target and reference image.
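The MSE and PSNR definitions above can be sketched directly (NumPy; SSIM is omitted here, as it requires windowed local statistics):

```python
import numpy as np

def mse(x, y):
    """Mean squared intensity difference between two images."""
    return float(np.mean((x.astype(float) - y.astype(float)) ** 2))

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the
    reference image. Infinite for identical images."""
    m = mse(x, y)
    return float('inf') if m == 0 else 10 * np.log10(max_val ** 2 / m)
```

Note that `max_val` must match the image intensity range (255 for 8-bit images, or the dynamic range of the scan otherwise), which is one reason PSNR values are not directly comparable across modalities.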
GAN-specific Assessment Metrics
In recent years, the Inception score (IS) salimans2016improved and the Fréchet Inception distance (FID) heusel2017gans have emerged, offering a more sophisticated alternative for the assessment of synthetic data. The IS uses a classifier to generate a probability distribution of labels given a synthetic image. If the probability distribution is highly skewed, this indicates that a specific object is present in the image (resulting in a higher IS), while if it is uniform, the image contains a jumble of objects and is more likely to be nonsense (resulting in a lower IS). Not only is a low label entropy within an image desired, but also a high label entropy across images: the IS also assesses the variety of peaks in the probability distributions generated from the synthetic images, so that a higher variety is indicative of more diverse objects being generated by the GAN (resulting in a higher IS). The FID metric compares the distance between the synthetic image distribution and the real image distribution by comparing high-level features extracted from one of the layers of a classifier (e.g., Inception v3, as in IS). Both metrics have shown great promise in the evaluation of GAN-generated data; however, they come with several bias issues that need to be taken into account during evaluation chong2020effectively; devries2019evaluation; borji2019pros. As these metrics have not yet been used in cancer imaging, their applicability to GAN-synthesised cancer images remains to be investigated. In contrast to computer vision datasets containing diverse objects, medical imaging datasets commonly contain images of only one specific organ. We promote further research into how object-diversity-based methods such as the IS can be applied to medical and cancer imaging, which requires, among others, meaningful adjustments of the dataset-specific pretrained classification models (i.e. Inception v3) that IS and FID rely upon.
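The IS computation described above reduces to exponentiating the average KL divergence between per-image label distributions and their marginal; the sketch below (hypothetical `inception_score`) operates on a toy matrix of classifier probabilities, whereas real evaluations would use Inception v3 (or a domain-specific classifier) outputs.

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x KL( p(y|x) || p(y) ) ) for an (n_images, n_classes)
    matrix of classifier label probabilities over synthetic images.
    Confident, diverse predictions yield high scores; uniform ones
    yield the minimum score of 1."""
    p_yx = np.asarray(p_yx, dtype=float)
    p_y = p_yx.mean(axis=0)  # marginal label distribution over all images
    kl = np.sum(p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return float(np.exp(kl.mean()))
```

The degenerate cases make the behaviour concrete: fully uniform predictions give an IS of 1, while confident one-hot predictions spread evenly over n classes give an IS of n.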
Uncertainty Quantification as GAN Evaluation Metric?
A general problem facing the adoption of deep learning methods in clinical tasks is their inherent unreliability, exemplified by high prediction variation caused by minimal input variation (e.g., the one-pixel attack korpihalkola2020one). This is further exacerbated by the nontransparent decision-making process inside deep neural networks, which are thus often described as "black box models" bi2019artificial. Also, the performance of deep learning methods on out-of-domain datasets has been assessed as unreliable lim2019buildingTRUST. To eventually achieve beneficial clinical adoption and trust, examining and reporting the inherent uncertainty of these models for each prediction becomes a necessity. Besides classification, segmentation hu2020coarse; alshehhi2021quantification, etc., uncertainty estimation is applicable to models in the context of data generation as well lim2019buildingTRUST; abdar2020reviewUNCERTAINTY; hu2020coarse. Edupuganti et al edupuganti2019uncertainty studied a GAN architecture based on variational autoencoders (VAE) kingma2013auto on the task of MRI reconstruction, with emphasis on uncertainty studies. Due to their probabilistic nature, VAEs allow for a Monte Carlo sampling approach, which enables the quantification of pixel-variance and the generation of uncertainty maps. Furthermore, they used Stein's Unbiased Risk Estimator (SURE) stein1981estimation as a measure of uncertainty that serves as a surrogate of MSE even in the absence of ground truth. Their results indicated that adversarial losses introduce more uncertainty. Parallel to image reconstruction, uncertainty has also been studied in the context of brain tumours (glioma) in MRI enhancement tanno2021uncertainty. In this study, a probabilistic deep learning framework for model uncertainty quantification was proposed, decomposing the problem into two uncertainty types: intrinsic uncertainty (particular to image enhancement and pertaining to the one-to-many nature of the super-resolution mapping) and parameter uncertainty (a general challenge pertaining to the choice of the optimal model parameters). The overall model uncertainty, a combination of the two, was evaluated for image super-resolution. Through a series of systematic studies, the utility of this approach was highlighted, as it resulted in improved overall prediction performance of the evaluated models, even for out-of-distribution data. It was further shown that predictive uncertainty correlated highly with reconstruction error, which not only enabled spotting unrealistic synthetic images, but also highlights the potential of further exploring uncertainty as an evaluation metric for GAN-generated data. A further use case of interest for GAN evaluation via uncertainty estimation is the "adherence" to provided conditional inputs. As elaborated in 4.1.4 for image-to-image translation, conditional GANs are likely to introduce features that do not correspond to the conditional class label or source image.
After training a classification model on image features of interest (say, tumour vs non-tumour features), we can examine the classifier's prediction and estimated uncertainty for the generated images; the uncertainty can be estimated using methods such as Monte-Carlo Dropout gal2016dropout or Deep Ensembles lakshminarayanan2016simple. Given that the expected features in the generated images are known beforehand, the classifier's uncertainty about the presence of these features can be used to estimate not only image fidelity (e.g., image features are not generated realistically enough), but also "condition adherence" (e.g., expected image features are altered during generation).
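A minimal sketch of this Monte-Carlo scheme, assuming some stochastic predictor (e.g., a network with dropout kept active at inference time) is wrapped as a callable; names are hypothetical.

```python
import numpy as np

def mc_uncertainty(stochastic_predict, image, n_samples=50, seed=0):
    """Run several stochastic forward passes over the same input and
    report the mean prediction together with its standard deviation,
    which serves as a simple per-input uncertainty estimate."""
    rng = np.random.default_rng(seed)
    preds = np.array([stochastic_predict(image, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)
```

Applied to a feature classifier and a batch of GAN outputs, a high standard deviation on images that should clearly contain (or lack) the conditioned feature flags either low fidelity or poor condition adherence.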
Outlook on Clinical Adoption
Alongside GAN-specific and standard image assessment metrics, uncertainty-based evaluation schemes can further automate the analysis of generative models. To this end, the challenge of clinical validation for predictive uncertainty as a reliability metric for synthetic data assessment remains tanno2021uncertainty. In practice, building clinical trust in AI models is a non-trivial endeavour and will require rigorous performance monitoring and calibration especially in the early stages kelly2019key; duran2021afraid. This is particularly the case when CADe and CADx models are trained on entirely (or partially) synthetic data given that the data itself was not first assessed by clinicians. Until a certain level of trust is built in these pipelines, automatic metrics will be a preliminary evaluation step that is inevitably followed by diligent clinical evaluation for deployment. A research direction of interest in this context would be ”gatekeeper” GANs—i.e. GANs that simulate common data (and/or difficult edge cases) of the target hospital, on which deployment-ready candidate models (e.g., segmentation, classification, etc) are then tested to ensure they are sufficiently generalisable. If the candidate model performance on such test data satisfies a predefined threshold, it has passed this quality gate for clinical deployment.
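The gatekeeper quality gate described above reduces to a simple threshold rule; a sketch with hypothetical names (`passes_quality_gate`, per-case Dice scores on the synthetic test set):

```python
def passes_quality_gate(per_case_scores, threshold=0.85, min_fraction=0.95):
    """A candidate model passes deployment screening if at least
    `min_fraction` of its per-case scores (e.g., Dice coefficients on
    GAN-simulated target-hospital cases) reach `threshold`. Both
    parameters are illustrative and would be set clinically."""
    ok = sum(score >= threshold for score in per_case_scores)
    return ok / len(per_case_scores) >= min_fraction
```

A per-case criterion, rather than a single averaged score, keeps rare but severe failures on simulated edge cases from being masked by good average performance.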
| Reference | Model | Dataset | Modality | Task | Highlights |
| Imbalanced Data & Fairness |
| Hu et al (2018) hu2018prostategan | ProstateGAN | Private | Prostate MRI | Conditional synthesis | Gleason score (cancer grade) class conditions. |
| Ghorbani et al (2020) ghorbani2020dermgan | DermGAN | Private | Dermoscopy | Image synthesis | Adapted pix2pix, evaluated via Turing tests and rare skin condition CLF. |
| Li et al (2021) li2021estimating | Encoder | ISIC 2018 codella2018skin | Dermoscopy | Representation learning | Fair encoder with bias discriminator and skin lesion CLF. |
| Cross-Modal Data Generation |
| Wolterink et al (2017) wolterink2017deep | CycleGAN | Private | Cranial MRI/CT | Unpaired translation | First CNN for unpaired MR-to-CT translation. Evaluated via PSNR and MAE. |
| Ben-Cohen et al (2017) ben2017virtual | pix2pix | Private | Liver PET/CT | Paired translation | Paired CT-to-PET translation with focus on hepatic malignant tumours. |
| Nie et al (2017) nie2017medical | context-aware GAN | ADNI wyman2013standardization; weiner2017alzheimer | Cranial/pelvic MRI/CT | Paired translation | Supervised 3D GAN for MR-to-CT translation with "Auto-Context Model" (ACM). |
| Wang et al (2018) wang20183dMRItoPET | Locality Adaptive GAN (LA-GAN) | BrainWeb brainweb | Cranial MRI, PET phantom | Paired translation | 3D auto-context, synthesising PET from low-dose PET and multimodal MRI. |
| Tanner et al (2018) tanner2018generative | CycleGAN | VISCERAL jimenez2016cloud | Lung/abdominal MRI/CT | Image registration | Unsupervised MR-CT CycleGAN for registration. |
| Kaiser et al (2019) kaiser2019mriMRItoCT | pix2pix, context-aware GAN nie2017medical | RIRE rire1998 | Cranial MRI/CT | Paired translation | Detailed preprocessing description, MR-to-CT translation, comparison with U-Net. |
| Prokopenko et al (2019) prokopenko2019unpairedMRItoCT | DualGAN, SRGAN | CPTAC3 clark2013cancer; national2018radiology, Head-Neck-PET-CT vallieres9data; clark2013cancer | Cranial MRI/CT | Unpaired translation | DualGAN for unpaired MR-to-CT, visual Turing tests. |
| Zhao et al (2020) zhao2020study | S-CycleGAN | Private | Cranial low/full dose PET | Paired translation | Low (LDPET) to full dose (FDPET) translation with supervised loss for paired images. |
| Kearney et al (2020) kearney2020attention | VAE-enhanced A-CycleGAN | Private | Cranial MRI/CT | Unpaired translation | MR-to-CT evaluated via PSNR, SSIM, MAE. Superior to paired alternatives. |
| Kazemifar et al (2020) kazemifar2020dosimetricMRItoCT | context-aware GAN | Private | Cranial MRI/CT | Paired translation | Feasibility of generated CT from MRI for dose calculation for radiation treatment. |
| Armanious et al (2020) armanious2020medgan | MedGAN | Private | Liver PET/CT | Paired translation | CasNet architecture, PET-to-CT, MRI motion artifact correction, PET denoising. |
| Xu et al (2020) xu2020correlation | multi-conditional GAN | NSCLC zhou2018non | Lung CT, gene expression | Conditional synthesis | Image-gene data fusion/correlation; nodule generator input: background image, gene code. |
| Haubold et al (2021) haubold2021contrast | Pix2PixHD wang2018high | Private | Arterial phase CT | Paired translation | Low-to-full ICM CT (thorax, liver, abdomen), 50% reduction in intravenous ICM dose. |
| Cohen et al (2018) cohen2018distribution; cohen2018cure | CycleGAN, pix2pix | BRATS2013 menze2014multimodal | Cranial MRI | Paired/unpaired translation | Removed/added tumours during image translation can lead to misdiagnosis. |
| Yang et al (2018) yang2018dagan | DAGAN | MICCAI 2013 grand challenge dataset | Cranial MRI | Image reconstruction | Fast GAN compressed sensing MRI reconstruction outperformed conventional methods. |
| Kim et al (2018) kim2018improving | pix2pix-based | BRATS menze2014multimodal | Cranial MRI | Reconstruction/super-resolution | Information transfer between different contrast MRI, effective pretraining/fine-tuning. |
| Hognon et al (2019) hognon2019standardization | CycleGAN, pix2pix | BRATS menze2014multimodal, BrainWeb brainweb | Cranial MRI | Paired/unpaired translation, normalisation | CycleGAN translation to BrainWeb reference image, pix2pix back-translation to source. |
| Modanwal et al (2019) modanwal2019normalization | CycleGAN | Private | Breast MRI | Unpaired translation | Standardising DCE-MRI across scanners, anatomy-preserving mutual information loss. |
| You et al (2019) you2019ct | CycleGAN-based | Mayo Low Dose CT LowDoseCTGrandChallenge | Abdominal CT | Super-resolution | Joint constraints on Wasserstein loss for structural preservation. Evaluated by 3 radiologists. |
| Gu et al (2020) gu2020medsrgan | MedSRGAN | LUNA16 setio2017validation | MRI/thoracic CT | Super-resolution | Residual Whole Map Attention Network (RWMAN) in G. Evaluated by 5 radiologists. |
| Vu et al (2020) vu2020generativeARTIFACTREMOVAL | WGAN-GP-based | k-Wave toolbox treeby2010k | Photoacoustic CT (PACT) | Paired translation | U-Net and WGAN-GP based generator for artifact removal. Evaluated via SSIM, PSNR. |
| Koike et al (2020) koike2020deepARTIFACTREMOVAL | CycleGAN | Private | Head/neck CT | Unpaired translation | Metal artifact reduction via CT-to-CT translation, evaluated via radiotherapy dose accuracy. |
| Wei et al (2020) wei2020using | WGAN-GP-inspired | Private | Chest CT | Paired translation | CT normalisation of dose/slice thickness. Evaluated via radiomics feature variability. |
| Shahidi et al (2021) shahidi2021breast | WA-SRGAN | BreakHis benhammou2020breakhis, Camelyon litjens20181399 | Breast/lymph node histopathology | Super-resolution | Wide residual blocks, self-attention SRGAN for improved robustness and resolution. |
| Li et al (2021) li2021normalization | SingleGAN-based yu2018singlegan | Private | Spleen/colorectal CT | Unpaired translation | Multi-centre (4) CT normalisation. Evaluated via cross-centre radiomics feature variation. Short/long-term survivor CLF improvement. |
| Synthetic Data Assessment |
| Kazuhiro et al (2018) kazuhiro2018generative | DCGAN | Private | Cranial MRI | Image synthesis | Feasibility study for brain MRI synthesis evaluated by 7 radiologists. |
| Han et al (2018) han2018ganBRATS | DCGAN, WGAN | BRATS 2016 menze2014multimodal | Cranial MRI | Image synthesis | 128×128 brain MRI synthesis evaluated by one expert physician. |
| Chuquicusma et al (2018) chuquicusma2018fool | DCGAN | LIDC-IDRI armato2011lung | Thoracic CT | Image synthesis | Malignant/benign lung nodule ROI generation evaluated by two radiologists. |
| Yu et al (2019) yu2019eaEVALUATION | Ea-GAN | BRATS 2015 menze2014multimodal, IXI ixidataset | Cranial MRI | Paired translation | Loss based on edge maps of synthetic images. Evaluated via PSNR, NMSE, SSIM. |
| Korkinof et al (2020) korkinof2020perceived | PGGAN | Private | Full-field digital MMG | Image synthesis | 1280×1024 MMG synthesis. Evaluated by 55 radiologists. |
| Levine et al (2020) levine2020synthesis | PGGAN, VAE, ESRGAN | TCGA grossman2016toward, OVCARE archive | Ovarian histopathology | Image synthesis | 1024×1024 whole slide synthesis. Evaluated via FID and by 15 pathologists (9 certified). |
4.2 Data Access and Privacy Challenges
Access to sufficiently large and labelled data resources is the main constraint for the development of deep learning models for medical imaging tasks esteva2019guide. In cancer imaging, the practice of sharing validated data to aid the development of AI algorithms is restricted due to technical, ethical, and legal concerns bi2019artificial. The latter is exemplified by regulations such as the Health Insurance Portability and Accountability Act (HIPAA) hipaa2004 in the United States of America (USA) or the European Union's General Data Protection Regulation (GDPR) gdpr2016, with which the respective clinical centres must comply. While patient privacy preservation is both needed and beneficial, it can also limit data sharing initiatives and restrict the availability, size and usability of public cancer imaging datasets. Bi et al bi2019artificial assess the absence of such datasets as a noteworthy challenge for AI in cancer imaging.
4.2.1 Decentralised Data Generation
As AI systems are often developed and trained outside of medical institutions, prior approval to transfer data out of the respective data silos is required, adding significant hurdles to the logistics of setting up a training pipeline or rendering it entirely impossible. In addition, medical institutions often cannot guarantee a secured connection to systems deployed outside their centres hosny2018artificial, which further limits their options to share valuable training data.
One privacy-preserving approach solving this problem is federated learning mcmahan2017communication, where copies of an AI model are trained in a distributed fashion inside each clinical centre in parallel and are aggregated into a global model on a central server. This eliminates the need for sensitive patient data to leave any of the clinical centres Kaissis2020; sheller2020federated. However, it is to be noted that federated learning cannot guarantee full patient privacy. Hitaj et al hitaj2017deep demonstrated that any malicious user can train a GAN to violate the privacy of the other users in a federated learning system. While difficult to avoid, the risk of such GAN-based attacks can be minimised, e.g., by using a combination of selective parameter updates shokri2015privacy (sharing only a small selected part of the model parameters across centres) and the sparse vector technique, as shown by Li et al li2019privacy. The Sparse Vector Technique (SVT) lyu2016understanding is a Differential Privacy (DP) 10.1007/11787006_1 method that introduces noise into a deep learning model's gradients.
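The aggregation step of federated learning can be sketched as a sample-weighted average of locally trained parameters (the federated averaging principle of mcmahan2017communication; function name hypothetical, weights represented as flat NumPy arrays):

```python
import numpy as np

def federated_average(local_weights, n_samples):
    """Aggregate locally trained model weights into a global model,
    weighting each centre by its number of training samples. Only
    parameters leave the centres; patient data never does."""
    total = sum(n_samples)
    arrays = [np.asarray(w, dtype=float) for w in local_weights]
    return sum(w * (n / total) for w, n in zip(arrays, n_samples))
```

Selective parameter updates would correspond to averaging only a chosen subset of these weights, and SVT-style noise would be added to the local updates before they are shared.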
To solve the challenge of privacy assurance of clinical medical imaging data, a distributed GAN hardy2019md; xin2020private; guerraoui2020fegan; rasouli2020fedgan; zhang2021training can be trained on sensitive patient data to generate synthetic training data. The technical, legal, and ethical constraints for sharing such de-identified synthetic data are typically less restrictive than for real patient data. Such generated data can be used instead of the real patient data to train models on disease detection, segmentation, or prognosis.
For instance, Chang et al chang2020multi; chang2020synthetic proposed the Distributed Asynchronized Discriminator GAN (AsynDGAN), which consists of multiple discriminators deployed inside various medical centres and one central generator deployed outside the medical centres. The generator never needs to see the private patient data, as it learns by receiving the gradient updates of each of the discriminators. The discriminators are trained to differentiate images of their medical centre from synthetic images received from the central generator. After training AsynDGAN, its generator is used and evaluated based on its ability to provide a rich training set of images with which to successfully train a segmentation model. AsynDGAN is evaluated on MRI brain tumour segmentation and cell nuclei segmentation. Segmentation models trained only on AsynDGAN-generated data achieve a competitive performance compared to segmentation models trained on the entire dataset of real data. Notably, models trained on AsynDGAN-generated data outperform models trained on local data from only one of the medical centres. To the best of our knowledge, AsynDGAN is the only distributed GAN applied to cancer imaging to date. Therefore, we promote further research in this line to fully exploit the potential of privacy preservation using distributed GANs. As demonstrated in Figure 7 and suggested in Figure 3(f), for maximal privacy preservation we recommend exploring methods that combine privacy during training (e.g., federated GANs) with privacy after training (i.e. differentially private GANs), the latter being described in the following section.
4.2.2 Differentially-Private Data Generation
Shin et al shin2018medical train a GAN to generate brain tumour images and highlight the usefulness of their method for anonymisation, as their synthetic data cannot be attributed to a single patient but rather only to an instantiation of the training population. However, it is to be scrutinised whether such synthetic samples are indeed fully private, as, given a careful analysis of the GAN model and/or its generated samples, a risk of possible reconstruction of part of the GAN training data exists papernot2016semi. For example, Chen et al chen2020improved propose a GAN for model inversion (MI) attacks, which aim at reconstructing the training data from a target model’s parameters. A potential solution to avoid training data reconstruction is highlighted by Xie et al xie2018differentially, who propose the Differentially Private Generative Adversarial Network (DPGAN). In Differential Privacy (DP) 10.1007/11787006_1, the parameters (ε, δ) denote the privacy budget torfi2020differentially, where ε measures the privacy loss and δ represents the probability that a privacy loss larger than ε occurs. For example, if an identical model is trained twice, once on a training dataset and once on a marginally different (neighbouring) training dataset, it is (ε, δ)-DP if, for any possible set of outputs, the output probability under the first dataset differs from that under the second by at most a multiplicative factor e^ε, up to an additive slack δ. Hence, the smaller the parameters (ε, δ) for a given model, the less effect a single sample in the training data has on the model output. The smaller this effect, the stronger the confidence that the model does not reveal samples of its training data.
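Formally, the standard definition of (ε, δ)-DP 10.1007/11787006_1 for a randomised training mechanism $M$ and neighbouring datasets $d$, $d'$ (differing in a single record) reads:

```latex
% (\epsilon, \delta)-differential privacy (Dwork et al.):
\[
\Pr[M(d) \in S] \;\le\; e^{\epsilon}\,\Pr[M(d') \in S] + \delta
\qquad \text{for all } S \subseteq \mathrm{Range}(M).
\]
```

Setting $\delta = 0$ recovers pure ε-DP, in which the multiplicative bound must hold without any additive slack.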
Examples of GANs with Differential Privacy Guarantees
In DPGAN, noise is added to the model’s gradients during training to ensure training data privacy. Extending the concept of DPGAN, Jordon et al jordon2018pate train a GAN coined PATE-GAN based on the Private Aggregation of Teacher Ensembles (PATE) framework papernot2016semi; papernot2018scalable. In the PATE framework, a student model learns from various unpublished teacher models, each trained on a subset of the data. The student model can access neither an individual teacher model nor its training data. PATE-GAN consists of multiple teacher discriminators and a student discriminator that backpropagates its loss into the generator. This limits the effect of any individual sample in PATE-GAN’s training. In an (ε, δ)-DP setting, classification models trained on PATE-GAN’s synthetic data achieve competitive performance, e.g., on a non-imaging cervical cancer dataset fernandes2017transfer, compared to an upper-bound vanilla GAN baseline without DP.
On the same dataset, Torfi et al torfi2020differentially achieve competitive results using Rényi Differential Privacy and Convolutional Generative Adversarial Networks (RDP-CGAN) under an equally strong (ε, δ)-DP setting.
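The teacher-vote aggregation underlying the PATE framework can be sketched as follows (a toy binary-label version with an assumed Laplace noise scale; not the PATE-GAN implementation itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def pate_label(teacher_votes, gamma=0.5):
    """Noisy-max aggregation at the heart of PATE (Papernot et al.):
    the student only ever sees the argmax of Laplace-perturbed teacher
    vote counts, never an individual teacher model or its training data."""
    counts = np.bincount(teacher_votes, minlength=2).astype(float)
    counts += rng.laplace(scale=1.0 / gamma, size=counts.shape)
    return int(np.argmax(counts))

# 100 teachers, each trained on a disjoint data subset, vote on a query
votes = np.array([1] * 97 + [0] * 3)
label = pate_label(votes)  # the noise very rarely outweighs a 94-vote margin
```

Because the student only observes noisy aggregate counts, the influence of any single training record (held by one teacher) on the released label is bounded.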
For the generation of biomedical participant data in clinical trials, Beaulieu-Jones et al beaulieu2019privacy apply an AC-GAN under an (ε, δ)-DP setting based on Gaussian noise added to the AC-GAN’s gradients during training.
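The gradient perturbation underlying such DP training schemes can be sketched as follows (a simplified clip-and-noise step in the spirit of DP-SGD, Abadi et al.; parameter values are illustrative and do not correspond to any reported privacy budget):

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_sanitise(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """DP-SGD-style gradient sanitisation: clip each per-example gradient
    to bound its sensitivity, sum, add Gaussian noise scaled to that bound,
    and average. No single example -- e.g. an outlier patient record --
    can then dominate a model update."""
    clipped = []
    for g in per_example_grads:
        norm = max(np.linalg.norm(g), 1e-12)
        clipped.append(g * min(1.0, clip_norm / norm))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [rng.normal(size=4) * s for s in (0.5, 5.0, 50.0)]  # one large outlier
g_update = dp_sanitise(grads)
```

The clipping bounds each sample's contribution, and the noise scale relative to that bound is what determines the (ε, δ) guarantee via composition accounting.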
Bae et al bae2020anomigan propose AnomiGAN to anonymise private medical data via a degree of output randomness during inference. This randomness of the generator is achieved by randomly adding, for each layer, one of its separately stored training variances. Compared to a DP baseline in which Laplacian noise is added to the samples, AnomiGAN achieves competitive results on a non-imaging breast cancer dataset and a non-imaging prostate cancer dataset for all reported privacy parameter values.
Outlook on Synthetic Cancer Image Privacy
Despite the above efforts, DP in GANs has only been applied to non-imaging cancer data, which indicates research potential for extending these methods reliably to cancer imaging data. According to Stadler et al stadler2021synthetic, using synthetic data generated under DP can protect outliers in the original data from linkage attacks, but likely also reduces the statistical signal of these original data points, which can result in lower utility of the synthetic data. Apart from this privacy-utility tradeoff, it may not be readily controllable or predictable which original data features are preserved and which are omitted in the synthetic datasets stadler2021synthetic. In fields such as cancer imaging, where patient privacy is critical, desirable privacy-utility tradeoffs need to be defined and thoroughly evaluated to enable trust, shareability, and usefulness of synthetic data. Consensus is yet to be found as to how privacy preservation in GAN-generated data can be evaluated and verified in the research community and in clinical practice. Promising approaches include methods that define a privacy gain/loss for synthetic samples stadler2021synthetic; yoon2020anonymization. Yoon et al yoon2020anonymization, for instance, define and backpropagate an identifiability loss to the generator to synthesise anonymised electronic health records (EHRs). The identifiability loss is based on the notion that the minimum weighted Euclidean distance between the records of two different patients can serve as a desirable anonymisation target for synthetic data. Designing or extending reliable methods and metrics for standardised quantitative evaluation of patient privacy preservation in synthetic medical images is a line of research that we call attention to.
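A simplified, unweighted version of this identifiability notion can be sketched as follows (our own illustrative reading, not Yoon et al.'s exact formulation):

```python
import numpy as np

def identifiability(real, synth):
    """A real record counts as 're-identifiable' if some synthetic record
    lies closer to it than its nearest neighbour among the other real
    records. (The feature weighting of the original formulation is
    omitted here for simplicity.)"""
    hits = 0
    for i, r in enumerate(real):
        d_real = min(np.linalg.norm(r - o) for j, o in enumerate(real) if j != i)
        d_synth = min(np.linalg.norm(r - s) for s in synth)
        hits += d_synth < d_real
    return hits / len(real)

rng = np.random.default_rng(7)
real = rng.normal(size=(50, 5))
safe = rng.normal(size=(50, 5))    # independent draw from the same distribution
leaky = real + rng.normal(scale=0.01, size=real.shape)  # near-copies of real records

score_safe = identifiability(real, safe)
score_leaky = identifiability(real, leaky)
```

A generator that memorises training records (the `leaky` case) scores near 1.0, while an independent sample from the same distribution scores markedly lower, making the score usable as an anonymisation target.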
4.2.3 Obfuscation of Identifying Patient Features in Images
If the removal of all sensitive patient information from a cancer imaging dataset permits sharing that dataset, then GANs can be used to obfuscate such sensitive data. As indicated by Figure 3(g), GANs can learn to remove the features from the imaging data that could reveal a patient’s identity, e.g. by learning to apply image inpainting to pixel or voxel data of burned-in image annotations or of identifying body parts. Such identifying body parts could be the facial features of a patient, as was shown by Schwarz et al schwarz2019identification on the example of cranial MRI. Numerous studies exist where GANs accomplish facial feature de-identification on non-medical imaging modalities wu2018privacy; hukkelaas2019deepprivacy; li2019anonymousnet; maximov2020ciagan. For medical imaging modalities, GANs have yet to prove themselves as the tool of choice for anatomical and facial feature de-identification against common standards segonne2004hybrid; bischoff2007technique; schimke2011preserving; milchenko2013obscuring with solid baselines. These standards, however, have been shown to be susceptible to reconstruction by unpaired image-to-image GANs on MRI volumes, with high reversibility for blurred faces and partial reversibility for removed facial features abramian2019refacing. Van der Goten et al goten2021adversarial provide a first proof-of-concept for GAN-based facial feature de-identification in 3D cranial MRI. The generator of their conditional de-identification GAN (C-DeID-GAN) receives the brain mask, brain intensities, and a convex hull of the brain MRI as input and generates de-identified MRI slices. C-DeID-GAN generates the entire de-identified brain MRI scan and, hence, may not be able to guarantee that the generation process does not alter any of the original brain features.
A solution to this can be to only generate and replace the 2D MRI slices, or parts thereof, that contain non-pathological facial features, while retaining all other original 2D MRI slices. Presuming preservation of medically relevant features and robustness of de-identification, GAN-based approaches can allow for subsequent medical analysis, privacy-preserving data sharing, and provision of de-identified training data. Hence, we highlight the research potential of GANs for robust medical image de-identification, e.g. via image inpainting GANs that have already been successfully applied to other tasks in cancer imaging, such as synthetic lesion inpainting into mammograms wu2018conditional; becker2019injecting and lung CT scans mirsky2019ct. Also, GAN methods that are adjustable and trainable to remain quantifiably robust against adversarial image reconstruction are a research line of interest.
4.2.4 Identifying Patient Features in Latent Representations
In line with Figure 3(g), a further example of privacy-preserving methods are autoencoders (for example, adversarial autoencoders makhzani2015adversarial; creswell2018denoising, which adversarially learn latent space representations that match a chosen prior distribution) that learn patient identity-specific features and obfuscate such features when encoding input images into latent space representations. Such an identity-obfuscated representation can be used as input into further models (classification, segmentation, etc.) or decoded back into a de-identified image. Adversarial training has been shown to be effective for learning a privacy-preserving encoding function, where a discriminator tries to succeed at classifying the private attribute from the encoded data raval2017protecting; wu2018towards; yang2018learning; pittaluga2019learning. Apart from being trained via the backpropagated adversarial loss, the encoder needs at least one further utility training objective to learn to generate useful representations, such as denoising vincent2008extracting or classification of a second attribute (e.g., facial expressions chen2018vgan; oleszkiewicz2018siamese). Siamese Neural Networks bromley1994signature, such as the Siamese Generative Adversarial Privatizer (SGAP), have been used effectively for adversarial training of an identity-obfuscated representation encoder. In SGAP, two weight-sharing Siamese discriminators are trained using a distance-based loss function to learn to classify whether a pair of images belongs to the same person oleszkiewicz2018siamese. As visualised in Figure 8, Kim et al kim2019privacy follow a similar approach with the goal of de-identifying and segmenting brain MRI data. Two feature maps are encoded from a pair of MRI scans and fed into a Siamese discriminator that evaluates via binary classification whether the two feature maps are from the same patient.
The generated feature maps are also fed into a segmentation model that backpropagates a Dice loss sudre2017generalised to train the encoder. Figure 8 illustrates the scenario where, after training, the encoder is deployed in a trusted setting, e.g. in a clinical centre, and the segmentation model is deployed in an untrusted setting, e.g. outside the clinical centre at a third party. The encoder shares the identity-obfuscated feature maps with the external segmentation model without the need to transfer the sensitive patient data outside the clinical centre. This motivates further research into adversarial identity-obfuscated encoding methods, e.g. to allow sharing and usage of cancer imaging data representations and models across clinical centres.
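The distance-based Siamese objective at the core of such approaches can be sketched as follows (a generic contrastive loss in the style of Hadsell et al.; the exact losses used by SGAP and Kim et al. may differ):

```python
import numpy as np

def contrastive_loss(f1, f2, same_patient, margin=1.0):
    """Distance-based Siamese objective: pull feature maps of the same
    patient together, push different patients apart up to a margin.
    In the de-identification setting, the encoder is trained adversarially
    *against* this discriminator objective, so that same-patient pairs
    become indistinguishable in the obfuscated feature space."""
    d = np.linalg.norm(f1 - f2)
    if same_patient:
        return float(d ** 2)
    return float(max(0.0, margin - d) ** 2)
```

An identical pair incurs zero loss for the discriminator, so an encoder maximising this objective learns to scatter same-patient encodings, removing identity cues.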
4.2.5 Adversarial Attacks Putting Patients at Risk
Apart from preserving privacy, GANs can also be a threat to sensitive patient data when used to launch adversarial attacks. Such adversarial attacks can fool machines (i.e. deep learning models) as well as humans (e.g., radiologists). Consider an intrusive generator that maliciously alters MRI images that are subsequently ingested into Computer-Aided Detection (CADe) or Diagnosis (CADx) systems. Apart from generating malicious images, adversarial models can also remove from or add to existing images perturbations such as artifacts, evidence, or adversarial noise goodfellow2014explaining; kurakin2018adversarial; ma2021understanding. Tampering with medical images, e.g. by introducing anatomical or appearance variations chen2019intelligent, puts the patient at risk of being wrongfully provisioned with an incorrect diagnosis or treatment. The susceptibility of medical imaging deep learning systems to white-box and black-box adversarial attacks is reviewed finlayson2019adversarial; ma2021understanding and investigated paschali2018generalizability; mirsky2019ct; li2020anatomical in recent studies. Cancer imaging models show a high level of susceptibility joel2021adversarial; korpihalkola2020one; vatian2019impact. Besides imaging data, image quantification features such as radiomics features lambin2012radiomics are also commonly used in cancer imaging as input into CADe and CADx systems. Adversarial perturbations of radiomics features threaten patient safety and highlight the need for defence strategies against adversarial radiomics examples barucci2020adversarial.
Examples of GAN-based Tampering with Cancer Imaging Data
For instance, Mirsky et al mirsky2019ct added and removed evidence of cancer in lung CT scans. Of two identical deep 3D convolutional cGANs (based on pix2pix), one was used to inject and the other to remove multiple solitary pulmonary nodules indicating lung cancer. The GANs were trained on 888 CT scans from the Lung Image Database Consortium image collection (LIDC-IDRI) dataset armato2011lung and inpainted into an extracted cuboid-shaped region of interest. The trained GANs can be autonomously executed by malware and are capable of injecting nodules into standard CT scans that are realistic enough to deceive both radiologists and AI disease detection systems. Three radiologists with 2, 5, and 7 years of experience analysed 70 tampered and 30 authentic CT scans. Spending on average 10 minutes on each scan, the radiologists diagnosed 99% of the scans with added nodules as malignant and 94% of the scans with removed nodules as healthy. After disclosing the presence of the attack to the radiologists, the percentages dropped to 60% and 87%, respectively mirsky2019ct.
Becker et al becker2019injecting trained a CycleGAN zhu2017unpaired on 680 down-scaled mammograms from the Breast Cancer Digital Repository (BCDR) lopez2012bcdr and INbreast moreira2012inbreast datasets to learn suspicious features, and were able to remove them from or inject them into existing mammograms. They showed that their approach can fool radiologists at lower pixel dimensions (i.e. 256×256), demonstrating that alterations in patient images by a malicious attacker can remain undetected by clinicians, influence the diagnosis, and potentially harm the patient becker2019injecting.
Defending Against Adversarial Attacks
In regard to fooling diagnostic models, one measure to counter adversarial attacks is to increase model robustness against adversarial examples madry2017towards, as suggested by Figure 3(h). Increasing robustness has been shown to be effective for medical imaging segmentation models he2019non; park2020robustification, lung nodule detection models liu2020no; paul2020mitigating, skin cancer recognition models huq2020analysis; hirano2021universal, and classification of histopathology images of lymph node sections with metastatic tissue wetstein2020adversarial. Liu et al liu2020no provide model robustness by adding adversarial chest CT examples to the training data. These adversarial examples are composed of synthetic nodules that are generated by a 3D convolutional variational encoder trained in conjunction with a WGAN-GP gulrajani2017improved discriminator. To further enhance robustness, Projected Gradient Descent (PGD) madry2017towards is applied to find and protect against noise patterns for which the detector network is prone to produce over-confident false predictions liu2020no.
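PGD itself can be sketched on a toy differentiable classifier (a 2-feature logistic model standing in for a CADx network; all weights and inputs are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def pgd_attack(x, y, w, b, eps=0.3, alpha=0.05, steps=20):
    """Projected Gradient Descent attack (Madry et al.) on a logistic
    classifier p(y=1|x) = sigmoid(w.x + b): repeatedly step in the
    direction that increases the cross-entropy loss, then project the
    perturbed input back into the L-infinity ball of radius eps
    around the original input x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_adv + b)
        grad = (p - y) * w                        # d(cross-entropy)/dx
        x_adv = x_adv + alpha * np.sign(grad)     # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv

w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.5]), 1   # correctly classified as positive
x_adv = pgd_attack(x, y, w, b)   # bounded perturbation flips the prediction
```

Adversarial training amounts to generating such `x_adv` during training and optimising the model on them, which is the role PGD plays in the defence by Liu et al.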
Apart from being the adversary, GANs can also detect adversarial attacks and are thus applicable as a security counter-measure enabling attack anticipation, early warning, monitoring, and mitigation. Defense-GAN, for example, learns the distribution of non-tampered images and can generate an output close to an inference input image but free of adversarial modifications samangouei2018defense.
We highlight the research potential in adversarial attacks and examples, alongside prospective GAN-based detection and defence mechanisms, which, as elaborated, can highly impact the field of cancer imaging. Apart from the injection of entire tumours into images and the generation of adversarial radiomics examples, a further attack vector to consider in future studies is the perturbation of the specific imaging features within an image that are used to compute radiomics features.
|Publication||Method||Dataset||Modality||Task||Highlights|
|Chang et al (2020) chang2020multi; chang2020synthetic||AsynDGAN, PatchGAN isola2017image||BRATS 2018 menze2014multimodal; bakas2018identifying, Multi-Organ kumar2017dataset||Cranial MRI, nuclei images||Paired translation||Mask-to-image, central G gets distributed Ds’ gradients, segmentation trained on synthetic data only.|
|Xie et al (2018) xie2018differentially||DPGAN||MNIST lecun1998gradient, MIMIC-III johnson2016mimic||MNIST images, EHRs||Image synthesis||Noisy gradients during training ensure DP guarantee.|
|Jordon et al (2018) jordon2018pate||PATE-GAN||Cervical cancer fernandes2017transfer||[non-imaging] Medical records||Data synthesis||DP via PATE framework. G gradient from student D that learns from teacher Ds.|
|Beaulieu-Jones et al (2019) beaulieu2019privacy||AC-GAN||MIMIC-III johnson2016mimic||[non-imaging] EHRs, clinical trial data||Conditional synthesis||DP via Gaussian noise added to AC-GAN gradient. Treatment arm as conditional input.|
|Bae et al (2020) bae2020anomigan||AnomiGAN||UCI Breast blake1998uci & Prostate blake1998uci||[non-imaging] Cell nuclei tabular data||Conditional synthesis, classification||DP via training variances added to G’s layers in inference. Real data row as G’s condition.|
|Torfi et al (2020) torfi2020differentially||RDP-CGAN||Cervical cancer fernandes2017transfer, MIMIC-III johnson2016mimic||[non-imaging] Medical records, EHRs||Data synthesis||DP GAN based on Rényi divergence. Allows to track a DP loss.|
|Abramian et al (2019) abramian2019refacing||CycleGAN||IXI ixidataset||Cranial MRI||Unpaired translation||Reconstruction of blurring/removed faces in MRI shows privacy vulnerability.|
|Kim et al (2019) kim2019privacy||PrivacyNet||PPMI marek2011parkinson||Cranial MRI||Representation learning, segmentation||Segmenting de-identified representations learned via same-person CLF by Siamese Ds.|
|Van der Goten et al (2021) goten2021adversarial||C-DeID-GAN||ADNI wyman2013standardization; weiner2017alzheimer, OASIS-3 lamontagne2019oasis||Cranial MRI||Paired translation||Face de-id. Concatenated convex hull, brain mask & brain volumes as G & D inputs.|
|Adversarial Data Tampering|
|Mirsky et al (2019) mirsky2019ct||pix2pix-based CT-GAN||LIDC-IDRI armato2011lung||Lung CT||Image inpainting||Injected/removed lung nodules in CT fool radiologists and AI models.|
|Becker et al (2019) becker2019injecting||CycleGAN||BCDR lopez2012bcdr, INbreast moreira2012inbreast||Digital/Film MMG||Unpaired translation||Suspicious features can be learned and injected/removed from MMG.|
|Liu et al (2020) liu2020no||Variational Encoder, WGAN-GP||LUNA setio2017validation, NLST national2011national||Lung CT||Conditional synthesis||Robustness via adversarial data augmentation, reduce false positives in nodule detection|
4.3 Data Annotation and Segmentation Challenges
4.3.1 Annotation-Specific Issues in Cancer Imaging
Missing Annotations in Datasets
In cancer imaging, not only the availability of large datasets is rare, but also the availability of labels, annotations, and segmentation masks within such datasets. The generation and evaluation of such labels, annotations, and segmentation masks is a task for which trained health professionals (radiologists, pathologists) are needed to ensure validity and credibility hosny2018artificial; bi2019artificial. Nonetheless, radiologist annotations of large datasets can take years to generate bi2019artificial. The tasks of labelling and annotating (e.g., bounding boxes, segmentation masks, textual comments) cancer imaging data are, hence, expensive both in time and cost, especially considering the large amount of data needed to train deep learning models.
Intra/Inter-Observer Annotation Variability
This cancer imaging challenge is further exacerbated by the high intra- and inter-observer variability among both pathologists gilles2008pathologist; dimitriou2018principled; martin2018interobserver; klaver2020interobserver and radiologists elmore1994variability; hopper1996analysis; hadjiiski2012inter; teh2017inter; wilson2018inter; woo2020intervention; brady2017error in interpreting cancer images across imaging modalities, affected organs, and cancer types. Automated annotation processes based on deep learning models make it possible to produce reproducible and standardised results in each image analysis. In the most common case, where the annotations consist of a segmentation mask, reliably segmenting both tumour and non-tumour tissues is crucial for disease analysis, biopsy, and subsequent intervention and treatment hosny2018artificial; huynh2020artificial, the latter being further discussed in section 4.5. For example, automatic tumour segmentation models are useful in the context of radiotherapy treatment planning cuocolo2020machine.
Human Biases in Cancer Image Annotation
During routine tasks, such as medical image analysis, humans are prone to account for only a few of many relevant qualitative image features. On the contrary, the strength of GANs and deep learning models is the evaluation of large numbers of multi-dimensional image features alongside their (non-linear) inter-relationships and combined importance hosny2018artificial. Deep learning models are likely to react to unexpected and subtle patterns in the imaging data (e.g., anomalies, hidden comorbidities, etc.) that medical practitioners are prone to overlook, for instance due to one of multiple existing cognitive biases (e.g., anchoring bias, framing bias, availability bias) brady2017error or inattentional blindness drew2013invisible. Inattentional blindness occurs when radiologists (or pathologists) have so much of their attention drawn to a specific task, such as finding an expected pattern (e.g., a lung nodule) in the imaging data, that they become blind to other patterns in that data.
Implications of Low Segmentation Model Robustness
As for the common annotation task of segmentation mask delineation, automated segmentation models can minimise the risk of the aforesaid human biases. However, to date, segmentation models have difficulties when confronted with intricate segmentation problems, including domain shifts, rare diseases with limited sample size, or small lesion and metastasis segmentation. In this sense, the performance of many automated and semi-automated clinical segmentation models has been sub-optimal sharma2010automated. This emphasises the need for expensive manual verification of segmentation model results by human experts hosny2018artificial. The challenge of training automated models for difficult segmentation problems can be approached by applying unsupervised learning methods that learn discriminative features without explicit labels. Such methods include GANs and variational autoencoders kingma2013auto capable of automating robust segmentation hosny2018artificial.
In addition, segmented regions of interest (ROI) are commonly used to extract quantitative imaging features with diagnostic value, such as radiomics features. The latter are used to detect and monitor tumours (e.g., lymphoma kang2018diffusion), biomarkers, and tumour-specific phenotypic attributes lambin2012radiomics; parmar2015machine. The accuracy and success of such commonly applied diagnostic image feature quantification methods, hence, depend on accurate and robust ROI segmentations. Segmentation models need to provide reproducibility of extracted quantitative features and biomarkers bi2019artificial with reliably low variation across, among others, different scanners, CT slice thicknesses, and reconstruction kernels balagurunathan2014test; zhao2016reproducibility. To this end, we promote lines of research that use adversarial training schemes to target the robustification of segmentation models. Progress in this open research challenge can unlock trust, usability, and clinical adoption of biomarker quantification methods in clinical practice.
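Why robust ROI segmentation underpins reproducible feature quantification can be illustrated with a toy first-order feature extractor (simplified stand-ins for radiomics statistics; not a radiomics library):

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(loc=100.0, scale=10.0, size=(32, 32))  # synthetic intensities

def first_order_features(image, mask):
    """Toy first-order 'radiomics' features computed over a segmented ROI.
    Every statistic depends directly on which voxels the mask includes."""
    roi = image[mask.astype(bool)]
    return {"mean": float(roi.mean()),
            "energy": float(np.sum(roi ** 2)),
            "voxels": int(roi.size)}

mask = np.zeros((32, 32), dtype=int)
mask[10:20, 10:20] = 1
eroded = np.zeros_like(mask)
eroded[11:19, 11:19] = 1   # a slightly tighter delineation of the same lesion

f_full = first_order_features(image, mask)
f_eroded = first_order_features(image, eroded)
# Even a one-voxel difference in delineation shifts every extracted statistic
```

Since two plausible delineations of the same lesion yield different feature vectors, segmentation variability propagates directly into biomarker variability.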
4.3.2 GAN Cancer Image Segmentation Examples
Table 4 summarises the collection of segmentation publications that utilise such adversarial training approaches and GAN-based data synthesis for cancer imaging. In the following sections, we provide a summary of the commonly used techniques and trends in the GAN literature that address the challenges in cancer image segmentation.
Robust Quantitative Imaging Feature Extraction
For example, Xiao et al xiao2019radiomics addressed the challenge of robustification of segmentation models and reliable biomarker quantification. The authors provide radiomics features as conditional input to the discriminator of their adversarially trained liver tumour segmentation model xiao2019radiomics. Their learning procedure strives to inform the generator to create segmentations that are specifically suitable for subsequent radiomics feature computation. Apart from adversarially training segmentation models, we also highlight the research potential of adversarially training quantitative imaging feature extraction models (e.g., deep learning radiomics) for reliable application in multi-centre and multi-domain settings.
Synthetic Segmentation Model Training Data
By augmenting and varying the training data of segmentation models, it is possible to substantially decrease the amount of manually annotated images needed during training while maintaining performance foroozandeh2020synthesizing. A general pipeline of such GAN-based generative models is demonstrated in Figure 9(a) and mentioned in Figure 3(j).
Over the past few years, CycleGAN zhu2017unpaired based approaches have been widely used for synthetic data generation due to the possibility of using unpaired image sets in training, as compared to paired image translation methods like pix2pix isola2017image or SPADE park2019semantic. CycleGAN-based data augmentation has been shown to be useful for segmentation model training, in particular for generating images with different acquisition characteristics, such as contrast-enhanced MRI from non-contrast MRI wang2021contrast, for cross-modality image translation between modalities such as CT and MRI huo2018splenomegaly, and for domain adaptation tasks jiang2018tumor. The popularity of CycleGAN-based methods lies not only in image synthesis or domain adaptation, but also in the inclusion of simultaneous image segmentation in their pipeline lee2020study.
Although pix2pix methods require paired samples, pix2pix is also a widely used type of GAN in data augmentation for medical image segmentation (see Table 4). Several works on segmentation have demonstrated its effectiveness in generating synthetic medical images. By manipulating its input, the variability of the training dataset for image segmentation can be considerably increased in a controlled manner abhishek2019mask2lesion; oliveira2020implanting. Similarly, conditional GAN methods have also been used for controllable data augmentation to improve lesion segmentation oliveira2020controllable. Providing a condition as an input to generate a mask is particularly useful to specify the location, size, shape, and heterogeneity of the synthetic lesions. One of the recent examples, proposed in kim2020synthesis, demonstrates this in brain MRI tumour synthesis by conditioning the input on simplified, controllable concentric circles that specify lesion location and characteristics. A further method for data augmentation is the inpainting of generated lesions into healthy real images or into other synthetic images, as depicted by Figure 9(f). Overall, the described data augmentation techniques have been shown to improve the generalisability and performance of segmentation models by increasing both the number and the variability of training samples qasim2020red; foroozandeh2020synthesizing; lee2020study.
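The controlled-variation idea can be sketched with a toy mask-manipulation step (illustrative only; real pipelines condition on richer lesion properties such as size, shape, and heterogeneity):

```python
import numpy as np

def vary_mask(mask, shift=(2, 3), flip=True):
    """Toy controllable augmentation: derive a new binary lesion mask from
    an existing one by shifting and mirroring it. In a mask-to-image setup
    (pix2pix/SPADE-style), each such variant would condition the generator
    to synthesise a matching image, varying lesion location in a
    controlled manner."""
    new = np.roll(mask, shift, axis=(0, 1))
    return np.fliplr(new) if flip else new

mask = np.zeros((16, 16), dtype=int)
mask[4:8, 4:8] = 1          # original lesion delineation
variant = vary_mask(mask)   # same lesion footprint, new location
```

Each derived mask, paired with its GAN-synthesised image, becomes an additional annotated training sample at essentially zero annotation cost.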
Segmentation Models with Integrated Adversarial Loss
As stated in Figure 3(i), GANs can also be used as the algorithm that generates robust segmentation masks, where the generator is used as a segmentor and the discriminator scrutinises the segmentation masks given an input image. One intuition behind this approach is the detection and correction of higher-order inconsistencies between the ground truth segmentation maps and the ones created by the segmentor via adversarial learning luc2016semantic; hu2020coarse; cirillo2020vox2vox. This approach is demonstrated in Figure 9(b). Adding an adversarial loss when training a segmentation model has been shown to improve semantic segmentation accuracy hung2019adversarial; shi2020automatic. Using adversarial training, the similarity of a generated mask to the manual segmentation given an input image is taken into consideration by the discriminator, allowing a global assessment of segmentation quality. A unique way of incorporating the adversarial loss from the discriminator has recently been proposed in nie2020adversarial. In their work, the authors utilise a fully-convolutional network as discriminator, unlike counterparts that use binary, single-neuron output networks. In doing so, a dense confidence map is produced by the discriminator, which is further used to train the segmentor with an attention mechanism.
Overall, using an adversarial loss as an additional global segmentation assessment is likely to be a helpful further signal for segmentation models, in particular, for heterogeneously structured datasets of limited size kohl2017adversarial, as is common for cancer imaging datasets. We highlight potential further research in GAN-based segmentation models to learn to segment increasingly fine radiologic distinctions. These models can help to solve further cancer imaging challenges, for example, accurate differentiation between neoplasms and tissue response to injury in the regions surrounding a tumour after treatment bi2019artificial.
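The combined objective of such adversarially trained segmentors can be sketched as follows (a generic Dice-plus-adversarial loss; the weighting and the exact GAN loss vary across the cited works):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a predicted probability mask and a
    binary ground-truth mask (per-pixel overlap term)."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def segmentor_loss(pred, target, d_score, lam=0.1):
    """Sketch of the combined segmentor objective: a per-pixel term (Dice)
    plus a global adversarial term rewarding masks the discriminator
    judges as manual. d_score is D's probability that the predicted mask
    is a ground-truth one; lam weights the adversarial signal."""
    adv = -np.log(max(d_score, 1e-12))  # non-saturating GAN generator loss
    return dice_loss(pred, target) + lam * adv
```

The Dice term enforces local overlap with the annotation, while the adversarial term supplies the global shape-plausibility signal discussed above.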
Segmentation Models with Integrated Adversarial Domain Discrimination
Moreover, a similar adversarial loss can also be applied intrinsically to the segmentation model’s features, as illustrated in Figure 9(c). Such an approach can benefit domain adaptation and domain generalisation by enforcing the segmentation model to base its predictions on domain-invariant feature representations kamnitsas2017unsupervised.
4.3.3 Limitations and Future Prospects for Cancer Imaging Segmentation
As shown in Table 4, the applications of GANs in cancer image segmentation cover a variety of clinical requirements. Remarkable steps have been taken to advance this field of research over the past few years. However, the following limitations and future prospects can be considered for further investigation:
Although data augmentation using GANs can increase the number of training samples for segmentation, the variability of the synthetic data is limited to that of the training data. Hence, it may limit the potential for improving segmentation accuracy. Moreover, training a GAN that produces high sample variability requires a large dataset with equally high variability and, in most cases, with corresponding annotations. Considering the data scarcity challenge in the cancer imaging domain, this can be difficult to achieve.
In some cases, using GANs can be excessive, considering the difficulties related to the convergence of the competing generator and discriminator parts of GAN architectures. For example, the recently proposed SynthSeg model billot2020partial is based on Gaussian Mixture Models to generate images and train a contrast-agnostic segmentation model. Such approaches can be considered as an alternative that avoids common pitfalls of the GAN training process (e.g., mode collapse). However, this approach needs to be further investigated for cancer imaging tasks, where the heterogeneity of tumours is challenging.
A great potential of synthetic cancer images lies in generating common shareable datasets as benchmarks for automated segmentation methods bi2019artificial. Although such a benchmark dataset needs its own validation, it can be beneficial for testing the limits of automated methods with systematically controlled test cases. Such benchmark datasets can be generated by controlling the shape, location, size, and intensities of tumours, and can simulate diverse images of different domains that reflect the distributions of real institutions. To avoid learning patterns that are only present in synthetic datasets (e.g., checkerboard artifacts), a promising direction is to investigate metrics that measure the distance of such synthetic datasets to real-world datasets, as well as the generalisation and extrapolation capabilities of models trained on synthetic benchmarks to real-world data.
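One widely used family of such distance metrics compares the feature statistics of two image sets via the Fréchet distance between fitted Gaussians (the basis of the Fréchet Inception Distance). The sketch below assumes features have already been extracted by some encoder; the function name and the choice of metric are illustrative, not prescribed by the survey text.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians fitted to feature statistics,
    d^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).

    A sketch assuming features of real and synthetic images were already
    extracted and summarised by mean vectors and covariance matrices.
    """
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    cov2 = np.atleast_2d(np.asarray(cov2, dtype=float))
    diff = mu1 - mu2
    # matrix square root of cov1 @ cov2 via eigendecomposition
    eigvals, eigvecs = np.linalg.eig(cov1 @ cov2)
    sqrt_prod = (eigvecs * np.sqrt(np.maximum(eigvals.real, 0.0))) @ np.linalg.inv(eigvecs)
    return float((diff @ diff + np.trace(cov1 + cov2 - 2.0 * sqrt_prod).real))
```

A small distance indicates that the synthetic benchmark occupies a similar region of feature space as the real data; a large distance warns that models may learn synthetic-only patterns.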
|Publication||Method||Dataset||Modality||Task||Metric w/o GAN (Baseline)||Metric with GAN (Baseline+)||Highlights|
|Rezaei et al (2017) rezaei2017conditional||pix2pix, MarkovianGAN li2016precomputed||BRATS 2017 menze2014multimodal; bakas2018identifying; bakas2017advancing||MRI||Adversarial training||Dice: 0.80||n.a.||G generates masks that D tries to detect for high/low grade glioma (HGG/LGG) segmentation.|
|Mok et al (2018) mok2018learning||CB-GAN||BRATS menze2014multimodal||MRI||Data augmentation||Dice: 0.79||0.84||Coarse-to-fine G captures training set manifold, generates generic samples in HGG & LGG segmentation.|
|Yu et al (2018) yu20183d||pix2pix-based||BRATS menze2014multimodal||MRI||Data augmentation||Dice: 0.82||0.68||Cross-modal paired FLAIR to T1 translation, training tumour segmentation with T1+real/synthetic FLAIR.|
|Shin et al (2018) shin2018medical||pix2pix||BRATS menze2014multimodal||MRI||Data augmentation||Dice: 0.81||0.81||Training on synthetic, before fine-tuning on 10% of the real data.|
|Kim et al (2020) kim2020synthesis BrainTumor||GAN||BRATS bakas2018identifying; menze2014multimodal||MRI||Image inpainting, data augmentation||Dice: 0.57||0.59||Simplifying tumour features into concentric circle & grade mask to inpaint.|
|Hu et al (2020) hu2020coarse||UNet-based GAN segmenter||Private||CT, PET||Adversarial training||Dice: 0.69||0.71||Spatial context information & hierarchical features. Nasal-type lymphoma segmentation with uncertainty estimation.|
|Qasim et al (2020) qasim2020red||SPADE-based park2019semantic||BRATS bakas2018identifying, ISIC codella2019skin||MRI, dermoscopy||Cross-domain translation||Dice: n.a.||B:0.66 S:0.62||Brain and skin segmentation. Frozen segmenter as 3rd player in GAN to condition on local apart from global information.|
|Foroozandeh et al (2020) foroozandeh2020synthesizing||PGGAN, SPADE||BRATS menze2014multimodal||MRI||Data augmentation||Av. dice error(%): 16.80||16.18||Sequential noise-to-mask and paired mask-to-image translation to synthesise labelled tumour images.|
|Lee et al (2020) lee2020study||CycleGAN-based||Private||MRI||Adversarial training||n.a.||n.a.||Unpaired image-to-mask translation and vice versa. Comparison of UNet and ResNet CycleGAN backbones.|
|Cirillo et al (2020) cirillo2020vox2vox||vox2vox: 3D pix2pix||BRATS menze2014multimodal||MRI||Adversarial training||Dice: 0.87||0.93||3D adversarial training to enforce segmentation results to look realistic.|
|Han et al (2021) han2021deep||Symmetric adaptation network||BRATS menze2014multimodal||MRI||Cross-domain translation||Dice: 0.77||0.67||Simultaneous source/target cross-domain translation and segmentation. T2 to other sequences translation.|
|Xue et al (2018) xue2018segan||SegAN||BRATS menze2014multimodal||MRI||Adversarial training||Dice: n.a.||0.85||Paired image-to-mask. New multi-scale loss: L1 distance of D representations between GT- & prediction-masked input MRI.|
|Giacomello et al (2020) giacomello2020brain||SegAN-CAT||BRATS menze2014multimodal||MRI||Adversarial training||Dice: n.a.||0.859||Paired image-to-mask. Combined dice loss & multi-scale loss using concatenation on channel axis instead of masking.|
|Alshehhi et al (2021) alshehhi2021quantification||U-Net based GAN segmenter||BRATS menze2014multimodal||MRI||Adversarial training||Dice: n.a.||0.89||Paired image-to-mask translation. Uncertainty estimation using Bayesian active learning in brain tumour segmentation.|
|Singh et al (2018) singh2018conditional||pix2pix||DDSM heath2001digital||Film MMG||Adversarial training||Dice: 0.86||0.94||Adversarial loss to make automated segmentation close to manual masks for breast mass segmentation.|
|Caballo et al (2020) caballo2020deep||DCGAN radford2015unsupervised||Private||CT||Data augmentation||Dice: 0.70||0.93||GAN based data augmentation; Validated by matching extracted radiomics features.|
|Negi et al (2020) negi2020rda||GAN, WGAN-RDA-UNET||Private||Ultrasound||Adversarial training||Dice: 0.85||0.88||Outperforms state-of-the-art using Residual-Dilated-Attention-Gate-UNet and WGAN for lesion segmentation.|
|Huo et al (2018) huo2018splenomegaly||Conditional PatchGAN||Private||MRI||Adversarial training||Dice: n.a.||0.94||Adversarial loss as segmentation post-processing for spleen segmentation.|
|Chen et al (2019) chen2019liver||DC-FCN-based GAN segmenter||LiTS bilic2019liver||CT||Adversarial training||Dice: n.a.||0.684||Cascaded framework with densely connected adversarial training.|
|Sandfort et al (2019) sandfort2019data||CycleGAN||NIH prior2017public, Decathlon simpson2019large, DeepLesion yan2018deeplesion||CT||Cross-domain translation||Dice (od): 0.916, 0.101||0.932, 0.747||Contrast enhanced to non-enhanced translation. Image synthesis to improve out-of-distribution (od) segmentation.|
|Xiao et al (2019) xiao2019radiomics||Radiomics-guided GAN||Private||MRI||Adversarial training||Dice: n.a.||0.92||Radiomics-guided adversarial mechanism to map relationship between contrast and non-contrast images.|
|Oliveira (2020) oliveira2020implanting||pix2pix, SPADE||LiTS bilic2019liver||CT||Image inpainting||Dice: 0.58||0.61||Realistic lesion inpainting in CT slices to provide controllable set of training samples.|
|Chest and lungs|
|Jiang et al (2018) jiang2018tumor||CycleGAN-based||NSCLC prior2017public||CT, MRI||Cross-domain translation||Dice: n.a.||0.70||Tumour-aware loss for unsupervised cross-domain adaptation.|
|Jin et al (2018) jin2018ct||cGAN||LIDC armato2011lung||CT||Image inpainting||Dice: 0.96||0.99||Generated lung nodules to improve segmentation robustness; A novel multi-mask reconstruction loss.|
|Dong et al (2019) dong2019automatic||UNet-based GAN segmenter||AAPM Challenge yang2018autosegmentation||CT||Adversarial training||Dice: 0.97 (l), 0.83 (sc), 0.71 (o), 0.85 (h)||0.97, 0.90, 0.75, 0.87||Adversarial training to discriminate manual and automated segmentation of lungs, spinal cord, oesophagus, heart.|
|Tang et al (2019) tang2019xlsor XLSor||MUNIT huang2018multimodal||JSRT shiraishi2000development, Montgomery jaeger2013automatic||Chest X-Ray||Lung segmentation||Dice: 0.97||0.98||Unpaired normal-to-abnormal (pathological) translation. Synthetic data augmentation for lung segmentation.|
|Shi et al (2020) shi2020automatic||AUGAN||LIDC-IDRI armato2011lung||CT||Adversarial training||Dice: 0.82||0.85||A deep layer aggregation based on U-net++.|
|Munawar et al (2020) munawar2020segmentation||Unet-based GAN segmenter||JSRT shiraishi2000development, Montgomery jaeger2013automatic, Shenzhen jaeger2013automatic||Chest X-Ray||Adversarial training||Dice: 0.96||0.97||Adversarial training to discriminate manual and automated segmentation.|
|Kohl et al (2017) kohl2017adversarial||UNet-based GAN segmenter||Private||MRI||Adversarial training||Dice: 0.35||0.41||Adversarial loss to discriminate manual and automated segmentation.|
|Grall et al (2019) grall2019using Prostate cGAN||pix2pix||Private||MRI||Adversarial training||Dice: 0.67 (ADC), 0.74 (DWI), 0.67 (T2)||0.73, 0.79, 0.74||Paired prostate image-to-mask translation. Investigated the robustness of the pix2pix against noise.|
|Nie et al (2020) nie2020adversarial||GAN||Private, PROMISE12 litjens2014evaluation||MRI||Adversarial confidence learning||Dice: 88.25 (b), 90.11 (m), 86.67 (a)||89.52, 90.97, 88.20||Difficulty-aware mechanism to alleviate the effect of easy samples during training. b = base, m = middle, a = apex.|
|Zhang et al (2020) zhang2020semi||PGGAN||Private||CT||Data augmentation||Dice: 0.85||0.90||Semi-supervised training using both annotated and un-annotated data. Synthetic data augmentation using PGGAN.|
|Cem Birbiri et al (2020) cem2020investigating||pix2pix, CycleGAN||Private, PROMISE12 litjens2014evaluation||MRI||Data augmentation||Dice: 0.72||0.76||Compared the performance of pix2pix, U-Net, and CycleGAN.|
|Liu et al (2019) liu2019accurate||GAN, LAGAN||Private||CT||Adversarial training||Dice: 0.87||0.92||Automatic post-processing to refine the segmentation of deep networks.|
|Poorneshwaran et al (2019) poorneshwaran2019polyp||pix2pix||CVC-clinic vazquez2017benchmark||Endoscopy||Adversarial training||Dice: n.a.||0.88||Adversarial learning to make automatic segmentation close to manual.|
|Xie et al (2020) xie2020mi||MIGAN, CycleGAN||CVC-clinic vazquez2017benchmark, ETIS-Larib silva2014toward||Endoscopy||Cross-domain translation||Dice: 0.66||0.73||Content features and domain information decoupling and maximising the mutual information.|
|Quiros et al (2019) quiros2019pathologygan||PathologyGAN||Private||Histopathology||Data augmentation||n.a.||n.a.||Images generated from structured latent space, combining BigGAN brock2018large, StyleGAN karras2019style, and RAD jolicoeur2018relativistic|
|Pandey et al (2020) pandey2020image||GAN & cGAN||Kaggle ljosa2012annotated||Histopathology||Data augmentation||Dice: 0.79||0.83||Two-stage GAN to generate masks and conditioned synthetic images.|
|Chi et al (2018) chi2018controlled||pix2pix||ISBI ISIC codella2018skin||Dermoscopy||Data augmentation||Dice: 0.85||0.84||Similar performance replacing half of real with synthetic data. Colour labels as lesion specific characteristics.|
|Abhishek et al (2019) abhishek2019mask2lesion||pix2pix||ISBI ISIC codella2018skin||Dermoscopy||Data augmentation||Dice: 0.77||0.81||Generate new lesion images given any arbitrary mask.|
|Sarker et al (2019) sarker2019mobilegan||MobileGAN||ISBI ISIC codella2018skin||Dermoscopy||Adversarial training||Dice: n.a.||0.91||Lightweight and efficient GAN model with position and channel attention.|
|Zaman et al (2020) zaman2020generative||pix2pix||Private||Ultrasound||Data augmentation||Dice: n.a.||+0.25||Recommendations on standard data augmentation approaches.|
|Chaitanya et al (2021) chaitanya2021semi||cGAN, DCGAN-based D||ACDC bernard2018deep, Decathlon simpson2019large||MRI, CT||Data augmentation||Dice: 0.40||0.53||GAN data augmentation for intensity and shape variations.|
4.4 Detection and Diagnosis Challenges
4.4.1 Common Issues in Diagnosing Malignancies
Clinicians’ High Diagnostic Error Rates
Studies of radiological error report high diagnostic error rates (e.g., discordant interpretations in 31–37% of oncologic CT, and 13% major discrepancies in neuro CT and MRI) brady2017error. McCreadie et al mccreadie2009eight critically reviewed the radiology cases of the last 30 months in their clinical centre and found that, of 256 detected errors (62% CT, 12% ultrasound, 11% MRI, 9% radiography, 5% fluoroscopy) in 222 patients, 225 (88%) were attributable to poor image interpretation (14 false positives, 155 false negatives, 56 misclassifications). A recent literature review on diagnostic errors by Newman et al newman2021rate estimated a false negative rate (including both missed diagnoses, i.e., patient encounters at which the diagnosis might have been made but was not, and delayed diagnoses relative to the urgency of illness detection) of 22.5% for lung cancer, 8.9% for breast cancer, 9.6% for colorectal cancer, 2.4% for prostate cancer, and 13.6% for melanoma. These findings exemplify the uncomfortably high diagnostic and image interpretation error rates that persist in radiology despite decades of interventions and research itri2018fundamentals.
The Challenge of Reducing Clinicians’ High Workload
In some settings, radiologists must interpret one CT or MRI image every 3–4 seconds in an average 8-hour workday mcdonald2015effects. Automated CADe and CADx systems can provide a more balanced, quality-focused workload for radiologists, who can then focus on scrutinising automatically detected lesions (false positive reduction) and areas/patches with high predictive uncertainty (false negative reduction). A benefit of CADe/CADx deep learning models is their real-time inference and strong pattern recognition capabilities, which are not readily susceptible to cognitive bias (discussed in 4.3.1), environmental factors itri2018fundamentals, or inter-observer variability (discussed in 4.3.1).
Detection Model Performance on Critical Edge Cases
Challenging cancer imaging problems are the high intra- and inter-tumour heterogeneity bi2019artificial, the detection of small lesions and metastasis across the body (e.g., lymph node involvement and distant metastasis hosny2018artificial) and the accurate distinction between malignant and benign tumours (e.g., for detected lung nodules that seem similar on CT scans hosny2018artificial). Methods are needed to extend on and further increase the current performance of deep learning detection models bi2019artificial.
4.4.2 GAN Cancer Detection and Diagnosis Examples
As we detail in the following, the capability of unsupervised adversarial learning to improve malignancy detection has been demonstrated for multiple tumour types and imaging modalities. To this end, Table 5 summarises recent publications that utilise GANs and adversarial training for cancer detection, classification, and diagnosis.
Adversarial Anomaly and Outlier Detection Examples
Schlegl et al schlegl2017unsupervised captured imaging markers relevant for disease prediction using a deep convolutional GAN named AnoGAN. AnoGAN learns a manifold of normal anatomical variability, accompanied by a novel anomaly scoring scheme based on the mapping from image space to a latent space. While Schlegl et al validated their model on retinal optical coherence tomography images, their unsupervised anomaly detection approach is applicable to other domains including cancer detection, as indicated in Figure 3(l). Chen et al chen2018deep used a Variational Autoencoder GAN for unsupervised outlier detection on T1- and T2-weighted brain MRI images. Scans from healthy subjects were used to train the autoencoder model to learn the distribution of healthy images and detect pathological images as outliers. Creswell et al creswell2018denoising proposed a semi-supervised Denoising Adversarial Autoencoder (ssDAAE) to learn a representation from unlabelled skin lesion images. The semi-supervised part of their CNN-based architecture corresponds to malignancy classification of labelled skin lesions based on the encoded representations of the pretrained DAAE. As the amount of labelled data is smaller than the unlabelled data, the labelled data is used to fine-tune the classifier and encoder. In ssDAAE, not only the adversarial autoencoder's chosen prior distribution makhzani2015adversarial, but also the class label distribution is discriminated by a discriminator, the latter distinguishing between predicted continuous labels and real binary (malignant/benign) labels. Kuang et al kuang2020unsupervised applied unsupervised learning to distinguish between benign and malignant lung nodules. In their multi-discriminator GAN (MDGAN), several discriminators scrutinise the realness of generated lung nodule images. After GAN pretraining, an encoder is prepended to the generator in the end-to-end architecture to learn the feature distribution of benign pulmonary nodule images and map these features into the latent space. Benign and malignant lung nodules were scored similarly to the f-AnoGAN framework schlegl2019f, computing and combining an image reconstruction loss and a feature matching loss, the latter comparing the discriminators' feature representations between real and encoded-generated images from intermediate discriminator layers. As exemplified in Figure 9(g), the model yielded high anomaly scores on malignant images and low anomaly scores on benign images despite the limited dataset size.
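The combination of an image reconstruction residual with a discriminator feature-matching residual, in the spirit of f-AnoGAN schlegl2019f, can be sketched as follows. The arrays, the λ weighting, and the plain mean-squared residuals are illustrative assumptions; the cited works define their own weightings and feature layers.

```python
import numpy as np

def anomaly_score(x, x_rec, feat_x, feat_rec, lam=0.1):
    """Anomaly score combining an image-space reconstruction residual with a
    discriminator feature-matching residual (f-AnoGAN-style sketch).

    x / x_rec: query image and its encoder-generator reconstruction.
    feat_x / feat_rec: intermediate discriminator features of both images.
    A healthy image close to the learned normal manifold reconstructs well
    in both spaces and therefore receives a low score; anomalies score high.
    """
    x, x_rec = np.asarray(x, dtype=float), np.asarray(x_rec, dtype=float)
    feat_x, feat_rec = np.asarray(feat_x, dtype=float), np.asarray(feat_rec, dtype=float)
    rec_loss = np.mean((x - x_rec) ** 2)           # image-space residual
    feat_loss = np.mean((feat_x - feat_rec) ** 2)  # feature-space residual
    return rec_loss + lam * feat_loss
```

Thresholding this score then separates likely-benign from likely-malignant samples without any malignant training images.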
Benson et al benson2020gan used GANs trained on multi-modal MRI images as a 3-channel input (T1/T2-weighted, FLAIR, ADC MRI) for brain anomaly detection. The generative network was trained using only healthy images together with pseudo-random irregular masks. Despite a training dataset of only 20 subjects, the resulting model increased the anomaly detection rate.
Synthetic Detection Model Training Data
Among the GAN publications aiming to improve classification and detection performance, data augmentation is the most recurrent approach to balance, vary, and increase the detection model's training set, as suggested in Figure 3(k). For instance, in breast imaging, Wu et al wu2018conditional trained a class-conditional GAN to perform contextual in-filling to synthesise lesions in healthy scanned mammograms. Guan et al guan2019breast trained a GAN on the same dataset heath2001digital to generate synthetic patches with benign and malignant tumours. The generated synthetic patches had clear artifacts and did not match the original dataset distribution. Jendele et al jendele2019adversarial used a CycleGAN zhu2017unpaired and both film-scanned and digital mammograms to improve binary (malignant/benign) lesion detection via data augmentation. Detecting mammographically-occult breast cancers is another challenging topic addressed by GANs. For instance, Lee et al lee2020simulating exploited asymmetries between mammograms of the left and right breasts as signals for finding mammography-occult cancer. They trained a conditional GAN (pix2pix) to generate a healthy synthetic mammogram image of the contralateral breast (e.g., left breast) given the corresponding single-sided mammogram (e.g., right breast) as input. The authors showed that there is a higher similarity (MSE, 2D-correlation) between simulated-real (SR) mammogram pairs than between real-real (RR) mammogram pairs in the presence of mammography-occult cancer. Consequently, when distinguishing between healthy and mammography-occult mammograms, their classifier yielded a higher performance when trained with both RR and SR similarity as input (AUC = 0.67) than when trained only with RR pair similarity as input (AUC = 0.57).
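The two pairwise similarity measures mentioned above, MSE and 2-D (Pearson) correlation between registered mammogram pairs, can be sketched as follows; the function name and the small arrays standing in for images are illustrative.

```python
import numpy as np

def pair_similarity(img_a, img_b):
    """Similarity measures for comparing a mammogram with its real or simulated
    contralateral counterpart: mean squared error and the 2-D correlation
    coefficient (Pearson correlation over all pixels). Inputs are assumed to
    be registered images of equal shape.
    """
    a = np.asarray(img_a, dtype=float).ravel()
    b = np.asarray(img_b, dtype=float).ravel()
    mse = np.mean((a - b) ** 2)           # lower = more similar
    corr = np.corrcoef(a, b)[0, 1]        # closer to 1 = more similar
    return mse, corr
```

Feeding both the RR and SR similarities of a case into a downstream classifier is then a matter of concatenating these scalars into its feature vector.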
3-dimensional image synthesis with GANs has been shown, for instance, by Han et al han2019synthesizing, who proposed a 3D Multi-Conditional GAN (3DMCGAN) to generate realistic and diverse nodules placed naturally on lung CT images to boost sensitivity in 3D object detection. Bu et al bu20203d built a 3D conditional GAN based on pix2pix, where the input is a 3D volume of interest (VOI) cropped from a lung CT scan that contains a missing region in its centre. Both generator and discriminator contain squeeze-and-excitation hu2018squeeze residual he2016deep neural network (SE-ResNet) modules to improve the quality of the synthesised lung nodules. Another example based on lung CT images is the method by Nishio et al nishio2020attribute, where the proposed GAN model used masked 3D CT images and nodule size information to generate images.
As to multi-modal training data synthesis, van Tulder et al van2015does replaced missing sequences of multi-sequence MRI with synthetic data. The authors showed that if the synthetic data generation model is more flexible than the classification model, the synthetic data can provide features that the classifier has not extracted from the original data, which can improve performance. During colonoscopy, depth maps can enable navigation alongside aiding detection and size measurement of polyps. For this reason, Rau et al rau2019implicit demonstrated the synthesis of depth maps using a conditional GAN (pix2pix) with monocular endoscopic images as input, reporting promising results on synthetic, phantom, and real datasets. In breast cancer detection, Muramatsu et al muramatsu2020improving translated lesions from lung CT to breast MMG using CycleGAN, yielding a performance improvement in breast mass classification when training a classifier with the domain-translated generated samples.
4.4.3 Future Prospects for Cancer Detection and Diagnosis
Granular Class Distinctions for Synthetic Tumour Images
A further research opportunity lies in exploring a more fine-grained classification of tumours that characterises different subtypes and disease grades instead of a binary malignant-benign classification. Being able to robustly distinguish between disease subtypes with similar imaging phenotypes (e.g., glioblastoma versus primary central nervous system lymphoma kang2018diffusion) addresses the challenge of reducing diagnostic ambiguity bi2019artificial. GANs can be explored to augment training data with samples of specific tumour subtypes to improve the distinction capabilities of disease detection models. This can be achieved by training a detection model on data generated by various GANs, each trained on a different tumour subtype distribution. Another option worth exploring is to use the tumour subtype or the disease grade (e.g., the Gleason score for prostate cancer hu2018prostategan) as conditional input to the GAN to generate additional labelled synthetic training data.
Cancer Image Interpretation and Risk Estimation
Besides the detection of prospectively cancerous characteristics in medical scans, ensuring a high accuracy in the subsequent interpretation of these findings is a further challenge in cancer imaging. Improving the interpretation accuracy can reduce the number of unnecessary biopsies and harmful treatments (e.g., mastectomy, radiation therapy, chemotherapy) of indolent tumours bi2019artificial. For instance, the rate of overdiagnosis of non-clinically significant prostate cancer ranges widely from 1.7% up to a noteworthy 67% loeb2014overdiagnosis. To address this, detection models can be extended to provide risk and clinical significance estimations. For example, given both an input image and an array of risk factors (e.g., BRCA1/BRCA2 status for breast cancer li2017deep, comorbidity risks), a deep learning model can weight and evaluate a patient's risk based on learned associations between risk factors and input image features. The GAN framework is an example of this, where clinical, non-clinical, and imaging data can be combined, either as conditional input for image generation or as prediction targets. For instance, given an input image, an AC-GAN odena2017conditional; kapil2018deep can classify the risk as a continuous label (see Figure 9(e)) or, alternatively, a discriminator can be used to assess whether a risk estimate provided by a generator is realistic. Also, a generator can learn a function for transforming and normalising an input image given one or several conditional target risk factors or tumour characteristics (e.g., a specific mutation status, a present comorbidity) to generate labelled synthetic training data.
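The AC-GAN discriminator objective referenced above combines a realness (source) loss with an auxiliary classification loss. The following per-sample scalar sketch uses a discrete risk class for simplicity (the text also mentions continuous labels); all names and the scalar simplification are illustrative assumptions, not a full training implementation.

```python
import numpy as np

def acgan_discriminator_loss(real_fake_logit, is_real, class_probs, class_label):
    """Per-sample sketch of the AC-GAN discriminator objective: the
    discriminator jointly scores realness (source loss) and predicts an
    auxiliary label, here a discrete risk class (class loss).
    """
    # source loss: binary cross-entropy on the real/fake logit
    p_real = 1.0 / (1.0 + np.exp(-float(real_fake_logit)))  # sigmoid
    source_loss = -np.log(p_real if is_real else 1.0 - p_real)
    # class loss: cross-entropy of the predicted class distribution
    probs = np.asarray(class_probs, dtype=float)
    probs = probs / probs.sum()               # normalise to a distribution
    class_loss = -np.log(probs[class_label])  # negative log-likelihood of true class
    return source_loss + class_loss
```

The generator minimises the same class loss on its synthetic samples while maximising their realness score, which is what ties the auxiliary risk label to the generated images.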
|Publication||Method||Dataset||Modality||Task||Metric w/o GAN (Baseline)||Metric with GAN (Baseline+)||Highlights|
|Chen et al (2018) chen2018deep||VAE GAN||Cam-CAN shafto2014cambridge BRATS menze2014multimodal ATLAS liew2018large||MRI||Anomaly detection||AUC(%): 80.0||70.0||Comparison of unsupervised outlier detection methods.|
|Han et al (2018) han2020infinite||PGGAN||BRATS menze2014multimodal||MRI||Data augmentation||Accuracy(%): 90.06||91.08||PGGAN-based augmentation method of whole brain MRI.|
|Han et al (2018) han2019combining||PGGAN, SimGAN||BRATS menze2014multimodal||MRI||Data augmentation||Sensitivity(%): 93.67||97.48||Two-step GAN for noise-to-image and image-to-image data augmentation.|
|Benson et al (2018) benson2020gan||GAN||TCIA schmainda2018data||MRI||Anomaly detection||Accuracy(%): 73.48||74.96||Multi-modal MRI images as input to the GAN.|
|Han et al (2019) han1904h2019||GAN||Private||MRI||Data augmentation||Sensitivity(%): 67.0||77.0||Synthesis and detection of brain metastases in MRI with bounding boxes.|
|Sun et al (2020) sun2020adversarial||Fixed-Point GAN||BRATS menze2014multimodal||MRI||Anomaly detection||Sensitivity(%): n.a.||84.5||Fixed-point translation concept.|
|Siddiquee et al (2020) siddiquee2019learning||ANT-GAN||BRATS 2013 menze2014multimodal||MRI||Data augmentation||F1-Score(%): 89.6||91.7||Abnormal to normal image translation in cranial MRI, lesion detection.|
|Wu et al (2018) wu2018conditional Mammo-ciGAN||ciGAN||DDSM heath2001digital||Film MMG||Data augmentation||ROC AUC(%): 88.7||89.6||Synthetic lesion in-filling in healthy mammograms.|
|Guan et al (2019) guan2019breast||GAN||DDSM heath2001digital||Film MMG||Data augmentation||Accuracy(%): 73.48||74.96||Generate patches containing benign and malignant tumours.|
|Jendele et al (2019) jendele2019adversarial BreastGAN||CycleGAN||BCDR lopez2012bcdr, INbreast moreira2012inbreast, CBIS–DDSM lee2016curated||Digital/Film MMG||Data augmentation||ROC AUC(%): 83.50, F1(%): 62.53||-1.46, +1.28||Scanned & digital mammograms evaluated together for lesion detection.|
|Lee et al (2020) lee2020simulating||CGAN||Private||Digital MMG||Data augmentation||ROC AUC: 0.57||0.67||Synthesising contralateral mammograms.|
|Wu et al (2020) wu2020synthesizing||U-Net based GAN||OPTIMAM halling2020optimam||Digital MMG||Data augmentation||ROC AUC(%): 82.9||84.6||Removing/adding lesion into MMG for malignant/benign classification.|
|Alyafi et al (2020) alyafi2020dcgans||DCGAN||OPTIMAM halling2020optimam||Digital MMG||Data augmentation||F1-Score(%): n.a.||+0.09(%)||Synthesise breast mass patches with high diversity.|
|Desai et al (2020) desai2020breast||DCGAN||DDSM heath2001digital||Film MMG||Data augmentation||Accuracy(%): 78.23||87.0||Synthesise full view MMGs and used them in visual Turing test.|
|Muramatsu et al (2020) muramatsu2020improving||CycleGAN||DDSM heath2001digital||CT, Film MMG||Data augmentation||Accuracy(%): 79.2||81.4||CT lung nodule to MMG mass translation and vice versa.|
|Swiderski et al (2020) swiderski2021deep||AGAN||DDSM heath2001digital||Film MMG||Data augmentation||ROC AUC(%): 92.50||94.10||AutoencoderGAN (AGAN) augments data in normal abnormal classification.|
|Kansal et al (2020) kansal2020generative||DCGAN||Private||OCT||Data augmentation||Accuracy(%): 92.0||93.7||Synthetic Optical Coherence Tomography (OCT) images.|
|Shen et al (2021) shen2021mass||ciGAN||DDSM heath2001digital||Film MMG||Data augmentation||Detection rate(%): n.a.||+5.03||Generate labelled breast mass images for precise detection.|
|Pang et al (2021) pang2021semi||TripleGAN-based||Private||Ultrasound||Data augmentation||Sensitivity(%): 86.60||87.94||Semi-supervised GAN-based Radiomics model for mass CLF.|
|Frid-Adar et al (2018) frid2018gan||DCGAN, ACGAN||Private||CT||Data augmentation||Sensitivity(%): 78.6||85.7||Synthesis of high quality focal liver lesions from CT for lesion CLF.|
|Zhao et al (2020) zhao2020tripartite||Tripartite GAN||Private||MRI||Data augmentation||Accuracy(%): 79.2||89.4||Synthetic contrast-enhanced MRI tumour detection without contrast agents.|
|Doman et al (2020) doman2020lesion||DCGAN||JAMIT jamit, 3Dircadb 3dircadb||CT||Data augmentation||Detection rate: 0.65||0.95||Generate metastatic liver lesions in abdominal CT for improved cancer detection.|
|Kanayama et al (2019) kanayama2019gastric||DCGAN||Private||Endoscopy||Data augmentation||AP(%): 59.6||63.2||Synthesise lesion images for gastric cancer detection.|
|Shin et al (2018) shin2018abnormal||cGAN||CVC-CLINIC bernal2015wm, CVC-ClinicVideoDB angermann2017towards||Colonoscopy||Data augmentation||Precision(%): 81.9||85.3||Synthesise polyp images from normal colonoscopy images for polyp detection.|
|Rau et al (2019) rau2019implicit ColonoscopyDepth||pix2pix-based||Private||Colonoscopy||Data augmentation||Mean RMSE: 2.207||1.655||Transform monocular endoscopic images from two domains to depth maps.|
|Yu et al (2020) yu2020synthesis||CapGAN||BrainWeb phantom brainweb, Prostate MRI choyke2016data||MRI||Data augmentation||ROC AUC(%): 85.1||88.5||Synthesise prostate MRI using Capsule Network-Based DCGAN instead of CNN.|
|Krause et al (2020) krause2021deep||CGAN||TCGA, NLCS van1990large||Histopathology||Data augmentation||ROC AUC(%): 75.7||77.7||GANs to enhance genetic alteration detection in colorectal cancer histology.|
|Bissoto et al (2018) bissoto2018skin gan-skin-lesion||PGAN, DCGAN, pix2pix||Dermofit ballerini2013color, ISIC 2017 codella2018skin, IAD argenziano2002dermoscopy||Dermoscopy||Data augmentation||ROC AUC(%): 83.4||84.7||Comparative study of GANs for skin lesions generation|
|Creswell et al (2018) creswell2018denoising||ssDAAE||ISIC 2017 codella2018skin||Dermoscopy||Representation learning, classification||ROC AUC(%): 89.0||89.0||Adversarial autoencoder fine-tuned on few labelled lesion classification samples.|
|Baur et al (2018) baur2018melanogans||DCGAN, LAPGAN||ISIC 2017 codella2018skin||Dermoscopy||Data augmentation||Accuracy (%): n.a.||74.0||Comparative study, 256x256px skin lesions synthesis. LAPGAN acc=74.0%|
|Rashid et al (2019) rashid2019skin||GAN||ISIC 2017 codella2018skin||Dermoscopy||Data augmentation||Accuracy(%): 81.5||86.1||Boost CLF performance with GAN-based skin lesions augmentation.|
|Fossen-Romsaas et al (2020) fossen2020synthesizing||AC-GAN, CycleGAN||HAM10000 & BCN20000 tschandl2018ham10000; combalia2019bcn20000, ISIC 2017 codella2018skin||Dermoscopy||Data augmentation||Recall(%): 72.1||76.3||Realistic-looking, class-specific synthetic skin lesions.|
|Qin et al (2020) qin2020gan||Style-based GAN||ISIC 2017 codella2018skin||Dermoscopy||Data augmentation||Precision(%): 71.8||76.9||Style control & noise input tuning in G to synthesise high quality lesions for CLF.|
|Bi et al (2017) bi2017synthesisCTtoPET||M-GAN||Private||PET||Data augmentation||F1-Score(%): 66.38||63.84||Synthesise PET data via multi-channel GAN for tumour detection.|
|Salehinejad et al (2018) salehinejad2018generalization||DCGAN||Private||Chest X-Rays||Data augmentation||Accuracy(%): 70.87||92.10||Chest pathology CLF using synthetic data.|
|Zhao et al (2018) zhao2018synthetic||F(&)BGAN||LIDC-IDRI armato2011lung||CT||Data augmentation||Accuracy(%): 92.86||95.24||Forward GAN generates diverse images, Backward GAN improves their quality.|
|Onishi et al (2019) onishi2019automated||WGAN||Private||CT||Data augmentation||Accuracy(%): 63 (Benign), 82 (Malign)||67, 94||Synthesise pulmonary nodules on CT images for nodule CLF.|
|Gao et al (2019) gao2019augmenting 3DGANLungNodules||WGAN-GP||LIDC-IDRI armato2011lung||CT||Data augmentation||Sensitivity(%): 84.8||95.0||Synthesise lung nodule 3D data for nodule detection.|
|Han et al (2019) han2019synthesizing||3DMCGAN||LIDC-IDRI armato2011lung||CT||Data augmentation||CPM(%): 51.8||55.0||3D multi-conditional GAN (2 Ds) for misdiagnosis prevention in nodule detection.|
|Yang et al (2019) yang2019class||GAN||LIDC-IDRI armato2011lung||CT||Data augmentation||ROC AUC(%): 87.56||88.12||Class-aware 3D lung nodule synthesis for nodule CLF.|
|Wang et al (2020) wang2020class CA-MW-AS||CGAN||LIDC-IDRI armato2011lung||CT||Data augmentation||F1-Score(%): 85.88||89.03||Nodule synthesis conditioned on semantic features.|
|Kuang et al (2020) kuang2020unsupervised||Multi-D GAN||LIDC-IDRI armato2011lung||CT||Anomaly detection||Accuracy(%): 91.6||95.32||High anomaly scores on malignant images, low scores on benign.|
|Ghosal et al (2020) ghosal2020lung||WGAN-GP||LIDC-IDRI armato2011lung||CT||Data augmentation||ROC AUC(%): 95.0||97.0||Unsupervised AE & clustering augmented learning method for nodule CLF.|
|Sun et al (2020) sun2020classification||DCGAN||LIDC-IDRI armato2011lung||CT||Data augmentation||Accuracy(%): 93.8||94.5||Nodule CLF: Pre-training AlexNet krizhevsky2012imagenet on synthetic, fine-tuning on real.|
|Wang et al (2020) wang2020combination||pix2pix, PGWGAN, WGAN-GP||Private||CT||Data augmentation||Accuracy(%): 53.2||60.5||Augmented CNN for subcentimeter pulmonary adenocarcinoma CLF.|
|Bu et al (2020) bu20203d||3D CGAN||LUNA16 setio2017validation||CT||Data augmentation||Sensitivity(%): 97.81||98.57||Squeeze-and-excitation mechanism and residual learning for nodule detection.|
|Nishio et al (2020) nishio2020attribute||3D pix2pix||LUNA16 setio2017validation||CT||Data augmentation||Accuracy(%): 85||85||Nodule size CLF. Masked image + mask + nodule size conditioned paired translation.|
|Onishi et al (2020) onishi2020multiplanar||WGAN||Private||CT||Data augmentation||Specificity(%): 66.7||77.8||AlexNet pretrained on synthetic, fine-tuned on real nodules for malign/benign CLF.|
|Teramoto et al (2020) teramoto2020deep||PGGAN||Private||Cytopathology||Data augmentation||Accuracy(%): 81.0||85.3||Malignancy CLF: CNN pretrained on synthetic cytology images, fine-tuned on real.|
|Schlegl et al (2017) schlegl2017unsupervised||AnoGAN||Private||OCT||Anomaly detection||ROC AUC: 0.73||0.89||D representations trained on healthy retinal image patches to score abnormal patches.|
|Zhang et al (2018) zhang2018medical||DCGANs, WGAN, BEGANs||Private||OCT||Data augmentation||Accuracy(%): 95.67||98.83||CLF of thyroid/non-thyroid tissue. Comparative study for GAN data augmentation.|
|Chaudhari et al (2019) chaudhari2019data||MG-GAN||NCBI edgar2002gene||Expression microarray data||Data augmentation||Accuracy(%) P:71.43 L:68.21 B:69.8 C:67.59||93.6, 88.1, 90.3, 91.7||Prostate, Lung, Breast, Colon. Interesting for fusion with imaging data.|
|Liu et al (2019) liu2019wasserstein||WGAN-based||Private||Serum sample staging data||Data augmentation||Accuracy(%): 64.52||70.97||Synthetic training data for CLF of stages of Hepatocellular carcinoma.|
|Rubin et al (2019) rubin2019top||TOP-GAN||Private||Holographic microscopy||Data augmentation||AUC(%): 89.2||94.7||Pretrained D adapted to CLF of optical path delay maps of cancer cells (colon, skin).|
4.5 Treatment and Monitoring Challenges
After a tumour is detected and properly described, new challenges arise related to planning and execution of medical intervention. In this section we examine these challenges, in particular: tumour profiling and prognosis; challenges related to choice, response and discovery of treatments; as well as further disease monitoring. Table 6 provides an overview of the cancer imaging GANs that are applied to treatment and monitoring challenges, which are discussed in the following.
4.5.1 Disease Prognosis and Tumour Profiling
Challenges for Disease Prognosis
An accurate prognosis is crucial to plan suitable treatments for cancer patients. However, in specific cases, it can be more beneficial to actively monitor tumours instead of treating them bi2019artificial. Challenges in cancer prognosis include the differentiation between long-term and short-term survivors bi2019artificial, patient risk estimation considering the complex intra-tumour heterogeneity of the tumour microenvironment (TME) nearchou2021comparison, and the estimation of the probability of disease stages and tumour growth patterns, which can strongly affect outcome probabilities bi2019artificial. In this sense, GANs li2021normalization; kimoh2018genesbiomarkers and AI models in general cuocolo2020machine; dimitriou2018principled have shown potential in prognosis and survival prediction for oncology patients.
GAN Disease Prognosis Examples
Li et al li2021normalization (in Table 2) show that their GAN-based CT normalisation framework, which overcomes the domain shift between images from different centres, significantly improves the accuracy of classification between short-term and long-term survivors. Ahmed et al ahmed2021multi trained omicsGAN to translate between microRNA and mRNA expression data pairs; the framework could be readily enhanced to also translate between cancer imaging features and genetic information. The authors evaluate omicsGAN on breast and ovarian cancer datasets and report improved prediction signals for synthetic data tested via cancer outcome classification. Another non-imaging approach is provided by Kim et al kimoh2018genesbiomarkers, who apply a GAN for patient cancer prognosis prediction based on the identification of prognostic biomarker genes. They train their GAN on reconstructed human biology pathway data, which allows for highlighting genes relevant to cancer development, resulting in improved prognosis prediction accuracy. In regard to these non-imaging approaches, we promote future extensions that combine prognostic biomarker genes and -omics data with the phenotypic information present in cancer images into multi-modal prognosis models.
GAN Tumour Profiling Examples
Related to Figure 3(l), Vu et al vu2020unsupervised propose that conditional GANs (BenignGAN) can learn latent characteristics of tumour tissue that correlate with specific tumour grades. The authors show that BenignGAN, trained exclusively on benign tissue images, generates less realistic results when applied to malignant tumour tissue images at inference time. This allows for a quantitative measurement of the differences between the original and the generated image, whereby these differences can be interpreted as tumour grade.
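The difference-based grading idea can be sketched numerically: a generator trained only on benign tissue reconstructs benign-looking inputs well, so the residual between an input image and its generated counterpart serves as a quantitative grade proxy. The function below is an illustrative simplification; the paper's actual realism measure may differ.

```python
import numpy as np

def tumour_grade_score(original: np.ndarray, generated: np.ndarray) -> float:
    """Mean absolute residual between a tissue image and the output of a
    generator trained exclusively on benign tissue. Larger residuals
    indicate less 'benign-like' (i.e., potentially higher-grade) tissue.
    Illustrative stand-in for BenignGAN's realism-difference measurement."""
    return float(np.mean(np.abs(original.astype(float) - generated.astype(float))))

# Toy example: the benign-trained generator reproduces benign tissue closely
# but fails on malignant tissue, yielding a larger residual score.
benign = np.full((8, 8), 0.5)
malignant = np.full((8, 8), 0.9)
gen_output = np.full((8, 8), 0.52)  # generator output is always 'benign-like'

score_benign = tumour_grade_score(benign, gen_output)        # small residual
score_malignant = tumour_grade_score(malignant, gen_output)  # large residual
```

In this toy setup the malignant image receives a clearly higher score, mirroring how the generated-vs-original difference can be read as a grade signal.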
Kapil et al kapil2018deep explore AC-GAN odena2017conditional on digital pathology imagery for semi-supervised quantification of the Non-Small-Cell Lung Cancer biomarker programmed death ligand 1 (PD-L1). Their class-conditional generator receives a one-hot encoded PD-L1 label as input to generate a respective biopsy tissue image, while their discriminator receives the image and predicts both the PD-L1 label and whether the image is fake or real. The AC-GAN method compares favourably to other supervised and non-generative semi-supervised approaches, and also systematically yields high agreement with visual tumour proportion scoring (TPS), i.e., pathologists' visual estimation of the percentage of tumour cells showing PD-L1 staining.
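The AC-GAN objective described above combines a source term (real vs. generated) with an auxiliary class term (here, the PD-L1 label). A minimal numerical sketch of the discriminator's combined loss follows; the function names and toy inputs are illustrative, not taken from Kapil et al's implementation.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for a single source probability p and target y."""
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def cce(class_probs, label):
    """Categorical cross-entropy for the true class index `label`."""
    eps = 1e-12
    return -np.log(class_probs[label] + eps)

def acgan_d_loss(p_src_real, cls_probs_real, y_real,
                 p_src_fake, cls_probs_fake, y_fake):
    # L_S: correct-source log-likelihood (real vs. generated image)
    l_source = bce(p_src_real, 1.0) + bce(p_src_fake, 0.0)
    # L_C: correct-class log-likelihood (here: the PD-L1 label)
    l_class = cce(cls_probs_real, y_real) + cce(cls_probs_fake, y_fake)
    # The discriminator maximises L_S + L_C, i.e. minimises this combined
    # cross-entropy; the generator maximises L_C - L_S (Odena et al., 2017).
    return l_source + l_class
```

A discriminator that identifies both source and class correctly incurs a much lower combined loss than one that is uncertain on both terms.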
As for the analysis of the TME, Quiros et al quiros2021pathologygan propose PathologyGAN, which they train on breast and colorectal cancer tissue imagery. This allows for learning the most important tissue phenotype descriptions, and provides a continuous latent representation space, enabling quantification and profiling of differences and similarities between different tumours’ tissues. Quiros et al show that lesions encoded in an (adversarially trained) model’s latent space enable using vector distance measures to find similar lesions that are close in the latent space within large patient cohorts. We highlight the research potential in lesion latent space representations to assess inter-tumour heterogeneity. Also, the treatment strategies and successes of patients with a similar lesion can inform the decision-making process of selecting treatments for a lesion at hand, as denoted by Figure 3(m).
Outlook on Genotypic Tumour Profiling with Phenotypic Data
A further challenge is that targeted oncological therapies require genomic and immunological tumour profiling cuocolo2020machine and effective linking of tumour genotype and phenotype. Biopsies only allow analysis of the biopsied portion of the tumour's genotype, while also increasing patient risk due to the possibility of dislodging and seeding of neoplastic cells shyamala2014risk; parmar2015machine. A trade-off therefore exists between minimising the number of biopsies and maximising the biopsy-based information about a tumour's genotype; due to this trade-off and the high intra-tumour heterogeneity, available biopsy data likely only describes a subset of a tumour's clonal cell population. These reasons, and the fact that current methods are invasive, expensive, and time-consuming cuocolo2020machine, make genotypic tumour profiling an important issue to be addressed by AI cancer imaging methods. In particular, adversarial deep learning models are promising candidates to generate the non-biopsied portion of a tumour's genotype after being trained on paired genotype and radiology imaging data, i.e., imaging data on which the entire lesion is visible to allow learning correlations between phenotypic tumour manifestations and genotype signatures. We recommend future studies to explore this line of research, which is regarded as a key challenge for AI in cancer imaging bi2019artificial; parmar2015machine.
4.5.2 Treatment Planning and Response Prediction
Challenges for Cancer Treatment Predictions
A considerable number of malignancies and tumour stages have various possible treatment options and almost no head-to-head evidence comparing them. Consequently, oncologists need to subjectively select an approved therapy based on their individual experience and exposure troyanskaya2020artificial.
Furthermore, despite existing treatment response assessment frameworks in oncology, inter- and intra-observer variability regarding the choice and measurement of target lesions exists among oncologists and radiologists levy2008tool. To achieve consistency and accuracy in standardised treatment response reporting frameworks levy2008tool, AI and GAN methods can identify quantitative biomarkers (for example, characteristics and density variations of the parenchyma patterns on breast images bi2019artificial) from medical images in a reproducible manner useful for risk and treatment response predictions hosny2018artificial.
Apart from treatment response assessment, treatment response prediction is also challenging, particularly for cancer treatments such as immunotherapy bi2019artificial. In cancer immunogenomics, for instance, unsolved challenges comprise the integration of multi-modal data (e.g., radiomic and genomic biomarkers bi2019artificial), immunogenicity prediction for neoantigens, and the longitudinal non-invasive monitoring of the therapy response troyanskaya2020artificial. In regard to the sustainability of a therapy, inter- and intra-tumour heterogeneity (e.g., in size, shape, morphology, kinetics, texture, etiology) and potential sub-clone treatment survival complicate individual treatment prediction, selection, and response interpretation bi2019artificial.
GAN Treatment Effect Estimation Examples
In line with Figure 3(n), Yoon et al Yoon2018GANITEEO propose the conditional GAN framework "GANITE", where individual treatment effect prediction allows for accounting for unseen, counterfactual outcomes of treatment. GANITE consists of two GANs: first, a counterfactual GAN is trained on feature and treatment vectors along with the factual outcome data. Then, the trained generator's output is used to create a dataset on which the other GAN, called the ITE (Individual Treatment Effect) GAN, is trained. GANITE provides confidence intervals along with the prediction, while being readily scalable to any number of treatments. However, it does not allow for taking time, dosage, or other treatment parameters into account. MGANITE, proposed by Ge et al ge2020mcgantreatmentresponse, extends GANITE by introducing dosage quantification, and thus enables continuous and categorical treatment effect estimations. SCIGAN bica2020estimating also extends GANITE and predicts outcomes of continuous rather than one-time interventions; its authors further provide theoretical justification for GANs' success in learning counterfactual outcomes. As to the problem of individual treatment response prediction, we suggest that quantitative comparisons of GAN-generated expected post-treatment images with real post-treatment images can yield interesting insights for tumour interpretation. We encourage future work to explore generating such post-treatment tumour images given a treatment parameter and a pre-treatment tumour image as conditional inputs. With varying treatment parameters as input, it is to be investigated whether GANs can inform treatment selection by simulating various treatment scenarios prior to treatment allocation, or whether GANs can help to understand and evaluate treatment effects by generating counterfactual outcome images after treatment application.
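GANITE's two-stage structure can be sketched abstractly: stage one imputes the missing counterfactual outcomes so every record carries a full potential-outcome vector; stage two then learns individual treatment effects from the completed data. In the sketch below, a toy placeholder function stands in for the trained counterfactual generator (the real GANITE generators are neural networks trained adversarially), so only the data flow is illustrated.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def complete_outcomes(x, t, y_factual, g_cf, n_treatments=2):
    """Stage 1: impute counterfactual outcomes with a (placeholder)
    counterfactual generator g_cf, keeping the observed factual outcome."""
    y_tilde = g_cf(x, n_treatments)  # generator proposes all potential outcomes
    y_tilde[t] = y_factual           # the factual component is observed: keep it
    return y_tilde

def toy_generator(x, n_treatments):
    # Placeholder for the trained counterfactual generator's output.
    return x.sum() + rng.normal(0.0, 0.01, size=n_treatments)

# Stage 2 would train the ITE GAN on such completed records; here we only
# show the quantity it learns to predict for a two-treatment setting.
x = np.array([0.3, 0.7])             # patient features
y_tilde = complete_outcomes(x, t=1, y_factual=2.5, g_cf=toy_generator)
ite_estimate = y_tilde[1] - y_tilde[0]  # individual treatment effect y(1) - y(0)
```

Note that this sketch omits the adversarial training of both stages as well as GANITE's confidence-interval estimation.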
Goldsborough et al goldsborough2017cytogan presented an approach called CytoGAN, in which they synthesise fluorescence microscopy cell images using DCGAN, LSGAN, or WGAN. The discriminator's latent representations learnt during synthesis enable grouping together encoded cell images that have similar cellular reactions to treatment by chemicals of known classes (morphological profiling), an approach comparable to the one shown in Figure 9(g). Even though the authors reported that CytoGAN obtained inferior results (i.e., mechanism-of-action classification accuracy) compared to classical, widely applied methods such as CellProfiler singh2014cellprofiler, using GANs to group tumour cell representations to inform chemical cancer treatment allocation decisions is an interesting approach in the realm of treatment selection, development kadurin2017cornucopia; kadurin2017drugan and response prediction.
GAN Radiation Dose Planning Examples
As radiation therapy planning is labour-intensive and time-consuming, researchers have been spurred to pursue automated planning processes sharpeDOSEINTRO. As outlined in the following and suggested by Figure 3(o), the challenge of automated radiation therapy planning can be approached using GANs.
By framing radiation dose planning as an image colourisation problem, Mahmood et al mahmood2018automatedDOSE introduced an end-to-end GAN-based solution, which predicts 3D radiation dose distributions from CT without the requirement of hand-crafted features. They trained their model on Oropharyngeal cancer data along with three traditional ML models and a standard CNN as baselines. The authors trained a pix2pix isola2017image GAN on 2D CT imagery, and then fed the generated dose distributions to an inverse optimisation (IO) model babier2018inverse, in order to generate optimised plans. Their evaluation showed that their GAN plans outperformed the baseline methods in all clinical metrics.
Kazemifar et al kazemifar2020dosimetricMRItoCT (in Table 2) proposed a cGAN with a U-Net generator for paired MRI-to-CT translation. Using conventional dose calculation algorithms, the authors compared the dose computed for real CT and generated CT, where the latter showed high dosimetric accuracy. The study, hence, demonstrates the feasibility of synthetic CT for intensity-modulated proton therapy planning for brain tumour cases where only MRI scans are available.
Maspero et al maspero2018dose proposed a GAN-assisted approach to quicken the process of MR-based radiation dose planning, by using a pix2pix for generating synthetic CTs (sCTs) required for this task. They show that a conditional GAN trained on prostate cancer patient data can successfully generate sCTs of the entire pelvis.
A similar task has also been addressed by Peng et al peng2020magnetic. Their work compares two GAN approaches: one is based on pix2pix and the other on a CycleGAN zhu2017unpaired. The main difference between these two approaches was that pix2pix was trained using registered MR-CT pairs of images, whereas CycleGAN was trained on unregistered pairs. Ultimately, the authors report pix2pix to achieve results (i.e. mean absolute error) superior to CycleGAN, and highlight difficulties in generating high-density bony tissues using CycleGAN.
The recently introduced attention-aware DoseGAN kearney2020doseganDOSE overcomes the challenges of volumetric dose prediction in the presence of diverse patient anatomy. As illustrated in Figure 10, DoseGAN is based on a variation of the pix2pix architecture with a 3D encoder-decoder generator (L1 loss) and a patch-based patchGAN discriminator (adversarial loss). The generator was trained on concatenated CT, planning target volume (PTV) and organs at risk (OARs) data of prostate cancer patients, and the discriminator’s objective was to distinguish the real dose volumes from the generated ones. Both qualitatively and quantitatively, DoseGAN was able to synthesise more realistic volumetric doses compared to current alternative state-of-the-art methods.
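The generator objective in such pix2pix-style dose models combines the adversarial term from the patch discriminator with an L1 fidelity term against the planned dose. The numerical sketch below illustrates this combination; the weighting `lam` and the function names are illustrative, not DoseGAN's actual hyperparameters.

```python
import numpy as np

def dose_generator_loss(d_patch_probs: np.ndarray,
                        predicted_dose: np.ndarray,
                        planned_dose: np.ndarray,
                        lam: float = 100.0) -> float:
    """Pix2pix-style generator objective: fool the patch discriminator
    (adversarial term) while staying close to the ground-truth dose
    volume (L1 term). `lam` weights the L1 term, as in pix2pix."""
    eps = 1e-12
    # Non-saturating adversarial loss over the discriminator's patch outputs.
    adversarial = -float(np.mean(np.log(d_patch_probs + eps)))
    l1 = float(np.mean(np.abs(predicted_dose - planned_dose)))
    return adversarial + lam * l1

# A dose prediction that matches the plan and fools the patch discriminator
# incurs a far lower loss than an off-target, easily detected one.
loss_good = dose_generator_loss(np.array([0.9, 0.9]), np.zeros(8), np.zeros(8))
loss_poor = dose_generator_loss(np.array([0.1, 0.1]), np.ones(8), np.zeros(8))
```

The large L1 weight reflects pix2pix's design choice of letting the reconstruction term dominate while the adversarial term sharpens local realism.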
Murakami et al murakami2020doseprediction published another fully automated GAN-based approach to dose distribution prediction for Intensity-Modulated Radiation Therapy (IMRT) of prostate cancer. The novelty of their solution is that it does not require the tumour contour information, which is time-consuming to create, to successfully predict the dose based on a given CT dataset. Their approach consists of two pix2pix-based architectures, one trained on paired CT and radiation dose distribution images, and the other trained on paired structure images and radiation dose distribution images. From the generated radiation dose distribution images, the dosimetric parameters for the PTV and OARs are computed. The generated dosimetric parameters differed by only 1-3% on average from the ground-truth dosimetric parameters.
Koike et al koike2020deepARTIFACTREMOVAL proposed a CycleGAN for dose estimation on head and neck CT images with metal artifact removal via CT-to-CT image translation (as described in Table 2). Providing dose calculation that is consistent in the presence of metal artifacts for head and neck IMRT, their approach achieves performance similar to commercial metal artifact removal methods.
4.5.3 Disease Tracking and Monitoring
Challenges in Tracking and Modelling Tumour Progression
Tumour progression is challenging to model huang2020artificial and commonly requires rich, multi-modal longitudinal data sets. As cancerous cells acquire growth advantages through genetic mutation in a process arguably analogous to Darwinian evolution hanahan2000hallmarks, it is difficult to predict which of the many sub-clones in the TME will outgrow the others. A tumour lesion is, hence, constantly evolving in phenotype and genotype bi2019artificial and might acquire further dangerous mutations at any time. The TME's respective impact is exemplified by Dimitriou et al's dimitriou2018principled stage II colorectal cancer outcome classification, whose performance gain is likely attributable to the high prognostic value of the TME information in their training data.
In addition, concurrent conditions and alterations in the organ system surrounding a tumour, but also in distant organs, may not only remain undetected, but could also influence patient health and progression bi2019artificial. GANs can generate hypothetical comorbidity data, for example from EHR hwang2017adversarial; dashtban2020predicting, imaging data, or a combination thereof, to aid awareness, testing, finding, and analysis of complex disease and comorbidity patterns. A further difficulty for tumour progression modelling is the a priori unknown effect of treatment. Treatment effects may even remain partly unknown after treatment, for example in the case of radiation therapy, which can destroy normal tissue surrounding the tumour (e.g., radionecrosis) that then becomes difficult to characterise and distinguish from cancerous tissue verma2013differentiating, or after surgery, where it is challenging to quantify the volume of remaining tumour residuals bi2019artificial.
|Kim et al (2018) kimoh2018genesbiomarkers||GAN-based||TCGA tomczak2015TCGA, Reactome croft2014reactome; fabregat2017reactome||[non-imaging] multi-omics cancer data||Data synthesis||Biomarker gene identification for pancreas, breast, kidney, brain, and stomach cancer with GANs and PageRank.|
|Ahmed et al (2021) ahmed2021multi||omicsGAN||TCGA cancer2011integrated; ciriello2015comprehensive||[non-imaging] ovarian/ breast gene expression||Paired translation||microRNA to mRNA translation and vice versa. Synthetic data improves cancer outcome classification.|
|Kapil et al (2018) kapil2018deep||AC-GAN||Private||Lung histopathology||Classification||AC-GAN CLF of PD-L1 levels for lung tumour tissue images obtained via needle biopsies.|
|Vu et al (2019) vu2020unsupervised||BenignGAN||Private||Colorectal histopathology||Paired translation||Edge map-to-image. As trained on only benign, malignant images quantifiable via lower realism.|
|Quiros et al (2021) quiros2021pathologygan PathologyGAN||PathologyGAN||VGH/NKI beck2011nki_vgh, NCT kather_jakob_nikolas_2018_1214456||Breast/colorectal histopathology||Representation learning||Learning tissue phenotype descriptions and latent space representations of tumour histology image patches.|
|Treatment Response Prediction|
|Kadurin et al (2017) kadurin2017cornucopia; kadurin2017drugan||AAE-based druGAN||Pubchem BioAssay wang2014pubchem||[non-imaging] Molecular fingerprint data||Representation learning||AAE for anti-cancer agent drug discovery. AAE input/output: molecular fingerprints & log concentration vectors.|
|Goldsborough et al (2017) goldsborough2017cytogan||CytoGAN||BBBC021 ljosa2012bbbc021||Cytopathology||Representation learning||Grouping cells with similar treatment response via learned cell image latent representations.|
|Yoon et al (2018) Yoon2018GANITEEO||GANITE||USA 89-91 Twins almond2004twins||[non-imaging] individualized treatment effects||Conditional synthesis||cGANs for individual treatment effect prediction, including unseen counterfactual outcomes and confidence intervals.|
|Ge et al (2018) ge2020mcgantreatmentresponse||MGANITE||AML clinical trial kornblau2009aml_dataset||[non-imaging] individualized treatment effects||Conditional synthesis||GANITE extension introducing dosage quantification and continuous and categorical treatment effect estimation.|
|Bica et al (2020) bica2020estimating||SCIGAN||TCGA weinstein2013cancer, News schwab2020learning; johansson2016learning, MIMIC III johnson2016mimic||[non-imaging] individualized treatment effects||Conditional synthesis||GANITE extension introducing continuous interventions and theoretical explanation for GAN counterfactuals.|
|Radiation Dose Planning|
|Mahmood et al (2018) mahmood2018automatedDOSE||pix2pix-based||Private||Oropharyngeal CT||Paired translation||Translating CT to 3D dose distributions without requiring hand-crafted features.|
|Maspero et al (2018) maspero2018dose||pix2pix||Private||Prostate/rectal/cervical CT/MRI||Paired translation||MR-to-CT translation for MR-based radiation dose planning without CT acquisition.|
|Murakami et al (2018) murakami2020doseprediction||pix2pix||Private||Prostate CT||Paired translation||CT-to-radiation dose distribution image translation without time-consuming contour/organs at risk (OARs) data.|
|Peng et al (2020) peng2020magnetic||pix2pix, CycleGAN||Private||Nasopharyngeal CT/MRI||Unpaired/Paired translation||Comparison of pix2pix & CycleGAN-based generation of CT from MR for radiation dose planning.|
|Kearney et al (2020) kearney2020doseganDOSE||DoseGAN||Private||Prostate CT/PTV/OARs||Paired translation||Synthesis of volumetric dosimetry from CT+PTV+OARs even in the presence of diverse patient anatomy.|
|Disease Tracking & Monitoring|
|Kim et al (2019) kim2019hepaticcgan||CycleGAN||Private||Liver MRI/CT/dose||Unpaired translation||Pre-treatment MR+CT+dose translation to post-treatment MRI predicting hepatocellular carcinoma progression.|
|Elazab et al (2020) elazab2020gpgan||GP-GAN||Private, BRATS 2014 menze2014multimodal||Cranial MRI||Paired translation||3D U-Net G generating progression image from longitudinal MRI to predict glioma growth between time-steps.|
|Li et al (2020) li2020dc||DC-AL GAN, DCGAN||Private||Cranial MRI||Image synthesis||CLF uses D representations to distinguish pseudo- and true glioblastoma progression.|
GAN Tumour Progression Modelling Examples
Relating to Figure 3(p), GANs can not only diversify the training data, but can also be applied to simulate and explore disease progression scenarios elazab2020gpgan. For instance, Elazab et al elazab2020gpgan propose GP-GAN, which uses stacked 3D conditional GANs for growth prediction of glioma based on longitudinal MR images. The generator is based on the U-Net architecture ronneberger2015u and the segmented feature maps are used in the training process. Kim et al kim2019hepaticcgan trained a CycleGAN on concatenated pre-treatment MR, CT and dose images (i.e. resulting in one 3-channel image) of patients with hepatocellular carcinoma to generate follow-up enhanced MR images. This enables tumour image progression prediction after radiation treatment, whereby CycleGAN outperformed a vanilla GAN baseline.
Li et al’s li2020dc proposed deep convolutional (DC) radford2015unsupervised - AlexNet (AL) krizhevsky2012imagenet GAN (DC-AL GAN) is trained on longitudinal diffusion tensor imaging (DTI) data of pseudoprogression (PsP) and true tumour progression (TTP) in glioblastoma multiforme (GBM) patients. Both of these progression types can occur after standard treatment (pseudoprogression occurs in 20-30% of GBM patients li2020dc), and they are often difficult to differentiate due to similarities in shape and intensity. In DC-AL GAN, representations are extracted from various layers of its AlexNet discriminator, which is trained to discriminate between real and generated DTI images. These representations are then used to train a support vector machine (SVM) classifier to distinguish between PsP and TTP samples, achieving promising performance.
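The pipeline just described, adversarial training for representation learning followed by a conventional classifier on extracted discriminator features, can be sketched as follows. For self-containment, a nearest-centroid rule stands in for the SVM and the "discriminator features" are simulated toy vectors; in DC-AL GAN they come from intermediate AlexNet discriminator layers.

```python
import numpy as np

class CentroidClassifier:
    """Minimal stand-in for the SVM trained on discriminator features."""

    def fit(self, features, labels):
        self.classes_ = sorted(set(labels))
        # One centroid per class, averaged over that class's feature vectors.
        self.centroids_ = {
            c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in self.classes_
        }
        return self

    def predict(self, feature):
        # Assign the class whose centroid is nearest in feature space.
        return min(self.classes_,
                   key=lambda c: np.linalg.norm(feature - self.centroids_[c]))

# Simulated discriminator features for PsP and TTP training samples.
psp_feats = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]
ttp_feats = [np.array([0.1, 1.0]), np.array([0.2, 0.9])]
clf = CentroidClassifier().fit(psp_feats + ttp_feats, ["PsP"] * 2 + ["TTP"] * 2)
prediction = clf.predict(np.array([0.95, 0.15]))  # feature close to PsP centroid
```

The key design point survives the simplification: the adversarially trained discriminator supplies the features, while a lightweight downstream classifier makes the clinical PsP/TTP distinction.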
We recommend further studies to extend these first adversarial learning disease progression modelling approaches. One potential research direction is GANs that simulate environment- and tumour-dependent progression patterns based on conditional input data such as the tumour's gene expression data xu2020correlation or the time elapsed between the original image and the generated progression image (e.g., time passed between image acquisitions or since treatment exposure). To this end, unexpected changes of a tumour may be uncovered between time points, as may deviations from a tumour's biopsy-proven genotypic growth expectations, for example by comparing the original patient image after progression with the GAN-generated predicted image (or its latent representation) for time spans of interest.
5 Discussion and Future Perspectives
As presented in Figure 2(c), we have included 163 of the surveyed GAN publications in the timeframe from 2017 until March 7 2021. We observe that the number of cancer imaging GAN publications increased from 9 in 2017 to 64 in 2020, with a surprising slight drop between 2018 and 2019 (41 to 37). The final number of respective publications for 2021 is still pending. The trend towards publications that propose GANs to solve cancer imaging challenges demonstrates the considerable research attention that GANs have been receiving in this field. Following our literature review in Section 4, the need for further research in GANs seems not yet to be met. We were able to highlight various lines of research for GANs in oncology, radiology, and pathology that have received limited research attention or remain untapped research potential. These potentials indicate a continuation of the trend towards more GAN applications and standardised integration of GAN-generated synthetic data into medical image analysis pipelines and software solutions.
In regard to imaging modalities, we analyse in Figure 2(b) how much research attention each modality has received in terms of the number of corresponding publications. By far, MRI and CT are the most dominant modalities with 57 and 54 publications, respectively, followed by MMG (13), dermoscopy (12) and PET (6). The wide spread between MRI and CT and less investigated domains such as endoscopy (3), ultrasound (3), and tomosynthesis (0) is to be critically examined. Due to variations in the imaging data between these modalities (e.g., spatial resolutions, pixel dimensions, domain shifts), it cannot be readily assumed that a GAN application with desirable results in one modality will produce equally desirable results in another. For this reason, and with awareness of the clinical importance of MRI and CT, we suggest a more balanced application of GANs across modalities, including experiments on rare modalities, to demonstrate the clinical versatility and applicability of GAN-based solutions.
In comparison, the GAN-based solutions per anatomy are more evenly spread, but still show a clear trend towards brain, head, neck (47), lung, chest, thorax (33) and breast (25). We suspect these spreads are due to the availability of a few well-known, widely-used curated benchmark datasets menze2014multimodal; armato2011lung; heath2001digital; moreira2012inbreast, resulting in underexposure of organs and modalities with less publicly available data resources. Where possible, we recommend evaluating GANs on a range of different tasks and organs. This can avoid iterating towards non-transferable solutions tuned for specific datasets with limited generalisation capabilities. Such generalisation capabilities are critical for beneficial usage in clinical environments, where dynamic data processing requirements and dataset shifts (e.g., multi-vendor, multi-scanner, multi-modal, multi-organ, multi-centre) commonly exist.
Figure 2(a) displays the distribution of GAN publications across cancer imaging challenge categories that correspond to the subsections of Section 4. While the sections 4.4 detection and diagnosis (54), 4.3 data annotation and segmentation (44), and 4.1 data scarcity and usability (35) have received much research attention, the sections 4.5 treatment and monitoring (18) and 4.2 data access and privacy (12) contain substantially fewer GAN-related publications. This spread can be anticipated considering that classification and segmentation are popular computer vision problems and common objectives in publicly available medical imaging benchmark datasets. Early-detected cancerous cells have likely had less time to acquire malignant genetic mutations hanahan2000hallmarks; hanahan2011hallmarks than their later-detected counterparts, which, by then, might have acquired more treatment-resistant alterations and subclone cell populations. Hence, automated early detection, localisation, and diagnosis can provide high clinical impact via improved cancer treatment prospects, which likely drives the trend towards detection and segmentation related GAN publications.
Nonetheless, we also promote future work on the less researched open challenges in Section 4.2, where we describe the promising research potential of GANs in patient data privacy and security. We note that secure patient data is required for legal and ethical patient data sharing and usage, which, on the other hand, is required for successful training of state-of-the-art detection and segmentation models. As diagnosis is an intermediate step in the clinical workflow, we further encourage more research on GAN solutions extending to subsequent clinical workflow steps such as oncological treatment planning and disease monitoring as elaborated in Section 4.5.
In closing, we emphasise the versatility and the resulting modality-independent wide applicability of the unsupervised adversarial learning scheme of GANs. In this survey, we strive to consider and communicate this versatility by describing the wide variety of problems in the cancer imaging domain that can be approached with GANs. For example, we highlight GAN solutions that range from domain adaptation to patient privacy preserving distributed data synthesis, to adversarial segmentation mask discrimination, to multi-modal radiation dose estimation, amongst others.
Before reviewing and describing GAN solutions, we surveyed the literature to understand the current challenges in the field of cancer imaging with a focus on radiology, but without excluding non-radiology modalities common to cancer imaging. After screening and understanding the cancer imaging challenges, we grouped them into the challenge categories Data Scarcity and Usability, Data Access and Privacy, Data Annotation and Segmentation, Detection and Diagnosis, and Treatment and Monitoring. After categorisation, we surveyed the literature for GANs applied to the field of cancer imaging and assigned each relevant GAN publication to its respective cancer imaging challenge category. Finally, we provide a comprehensive analysis for each challenge and its assigned GAN publications to determine to what extent it has been and can be solved using GANs. To this end, we also highlight research potential for challenges where we were able to propose a GAN solution that has not yet been covered by the literature.
With this work, we strive to uncover and motivate promising lines of research in adversarial learning that we imagine to ultimately benefit the field of cancer imaging in clinical practice.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 952103.