Generative Adversarial Networks (GANs) have made remarkable progress since their introduction (Goodfellow et al.), producing high-resolution and detailed representations of diverse data. Recently, there has been increasing interest in applying GANs to specific tasks in digital pathology, including stain normalization (Ghazvinian Zanjani et al.), stain transformation (Rana et al., Xu et al.), and nuclei segmentation (Mahmood et al.). Together with deep learning-based classification frameworks (Esteva et al., Ardila et al.), these advances offer hope for better disease diagnosis than standard pathology (Niazi et al.). However, deep learning approaches lack interpretability, which is a major limiting factor in making a real impact in clinical practice. In this paper, we propose a holistic approach that uses GANs to learn representations of entire tissue architectures (i.e. colour, texture, spatial features of cancer and normal cells, and their interactions) encoded in histopathological images of tumours, and to map these representations to clinical characteristics. To this end, we present the following contributions:
We study how state-of-the-art evaluation metrics distinguish distribution changes when different types of cancer tissue and markers are introduced.
We propose a different feature space for GAN evaluation metrics, based on cell density and morphology in cancer tissue.
We show that these models capture pathologically meaningful representations and that, when evaluated by pathologists, generated tissue images are indistinguishable from real tissue images.
GANs (Goodfellow et al.) are generative models that are able to learn high-fidelity and diverse representations of a target distribution. This is done with a generator, $G$, that maps random noise, $z \sim p(z)$, to samples that resemble the target data, $x \sim p_{data}(x)$, and a discriminator, $D$, whose goal is to distinguish between real and generated samples. The goal of a GAN is to find the equilibrium of the min-max problem:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]. \quad (1)$$
Since their introduction, modeling distributions of images has become the mainstream application for GANs, first proposed by Radford et al. State-of-the-art GANs such as BigGAN (Brock et al.) and ProGAN (Karras et al.) have recently produced impressive high-resolution images. Techniques such as Spectral Normalization (Miyato et al.) and self-attention (Zhang et al.) have improved training, and BigGAN achieved high-diversity images on data sets like ImageNet (Deng et al.), with 14 million images and 20 thousand different classes.
At the same time, evaluating these models has been a challenging task. Many different metrics, such as the Inception Score (IS, Salimans et al.), Fréchet Inception Distance (FID, Heusel et al.), Maximum Mean Discrepancy (MMD, Gretton et al.), Kernel Inception Distance (KID, Binkowski et al.), and the 1-Nearest Neighbor classifier (1-NN, Lopez-Paz and Oquab), have been proposed to do so, and thorough empirical studies (Xu et al., Barratt and Sharma) have shed some light on the advantages and disadvantages of each of them. However, the selection of a feature space is crucial when using these metrics.
Given the high fidelity and diversity of the samples that GANs generate, these models are excellent candidates to learn complex latent representations of cancer tissue. Cancer is a disease with extensive heterogeneity, where malignant cells interact with immune cells, stromal cells, surrounding tissues, and blood vessels. Tissue micro-arrays (TMAs) collected from tumour biopsies are a common type of histopathological image used to study cancer tissue. Stainings on TMAs reveal the tumour micro-environment, allowing understanding of the disease. Currently, machine learning approaches have focused on building classifiers to achieve pathologist-level diagnosis (Esteva et al., Wei et al., Han et al.) and on assisting the decision process through computer-human interaction (Cai et al.).
For breast cancer, traditional computer vision approaches such as Beck et al. and Yuan et al. have identified correlations between morphological features of cells and patient survival. Based on these findings, we view Pathology GAN as an approach to learn clinically/pathologically meaningful representations within cancer tissue images. In this work, we demonstrate that Pathology GAN is a first step in applying unsupervised learning to help close the gap in understanding the connections between cancer tissue and molecular information.
3 Pathology GAN
We use BigGAN (Brock et al.) as a baseline and explore changes that empirically work best for our purpose. We follow the same architecture, employing Spectral Normalization in both generator and discriminator, a self-attention layer, and a projection discriminator, and we also use orthogonal initialization and regularization as described in the original paper.
We make use of the Relativistic Average Discriminator (Jolicoeur-Martineau, Lim and Ye) as the GAN objective. We find that this change makes the model converge faster and produce higher-quality images; images generated with the Hinge loss did not capture the morphological structure of the tissue, as we show in Figure 2. The discriminator and generator loss functions are formulated as in Equations 2 and 3, where $P$ is the distribution of real data, $Q$ is the distribution of the generated (fake) data, and $C(x)$ is the non-transformed discriminator output, or critic:

$$L_{Dis} = -\mathbb{E}_{x_r \sim P}\left[\log\left(\tilde{D}(x_r)\right)\right] - \mathbb{E}_{x_f \sim Q}\left[\log\left(1 - \tilde{D}(x_f)\right)\right], \quad (2)$$

$$L_{Gen} = -\mathbb{E}_{x_f \sim Q}\left[\log\left(\tilde{D}(x_f)\right)\right] - \mathbb{E}_{x_r \sim P}\left[\log\left(1 - \tilde{D}(x_r)\right)\right], \quad (3)$$

where $\tilde{D}(x_r) = \mathrm{sigmoid}\left(C(x_r) - \mathbb{E}_{x_f \sim Q}[C(x_f)]\right)$ and $\tilde{D}(x_f) = \mathrm{sigmoid}\left(C(x_f) - \mathbb{E}_{x_r \sim P}[C(x_r)]\right)$.
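For concreteness, the relativistic average losses described above can be sketched in a few lines of numpy. This is a minimal illustration of the loss computation only (function names are ours, not from the original implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_average_losses(c_real, c_fake, eps=1e-8):
    """Relativistic average GAN losses from raw critic outputs.

    c_real: critic outputs C(x_r) on real images, shape (batch,)
    c_fake: critic outputs C(x_f) on generated images, shape (batch,)
    """
    # Each critic score is compared against the mean critic score of
    # the opposite set: "is this real more realistic than the average fake?"
    d_real = sigmoid(c_real - np.mean(c_fake))
    d_fake = sigmoid(c_fake - np.mean(c_real))

    # The discriminator pushes d_real -> 1 and d_fake -> 0.
    loss_dis = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # The generator pushes the opposite direction.
    loss_gen = -np.mean(np.log(d_fake + eps)) - np.mean(np.log(1.0 - d_real + eps))
    return loss_dis, loss_gen
```

When the critic separates real from fake well, the discriminator loss is small and the generator loss is large, which matches the intended dynamics of the objective.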
We use the Adam optimizer (Kingma and Ba) with learning rates that differ from BigGAN: we use the same value of 0.0001 for both generator and discriminator, and the discriminator takes 5 steps for each generator step. We also use Conditional Batch Normalization in our models, but find that feeding the entire latent vector produces higher-quality images. Each model was trained on an NVIDIA Titan Xp 12 GB for approximately 72 hours.
To train our model, we use two H&E breast cancer databases, from the Netherlands Cancer Institute (NKI) cohort and the Vancouver General Hospital (VGH) cohort, with 248 and 328 patients respectively (Beck et al.). Each includes tissue micro-array (TMA) images along with clinical patient data such as survival time and estrogen-receptor (ER) status. The original TMA images all share the same resolution, and we split each image into smaller patches that overlap by 50%. We also perform data augmentation on these images: rotations, and vertical and horizontal flips. We filter out images in which the tissue covers less than 70% of the area. In total this yields a training set of 249K images and a test set of 62K.
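The patching, augmentation, and filtering pipeline described above could be sketched as follows. This is an illustrative numpy sketch, not the authors' code; the background threshold and the exact augmentation set are assumptions:

```python
import numpy as np

def extract_patches(image, patch_size, overlap=0.5):
    """Split an image (H, W, C) into square patches with fractional overlap."""
    stride = int(patch_size * (1.0 - overlap))
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

def augment(patch):
    """Rotations plus horizontal and vertical flips of one patch."""
    views = [patch]
    views += [np.rot90(patch, k) for k in (1, 2, 3)]
    views += [np.fliplr(patch), np.flipud(patch)]
    return views

def tissue_fraction(patch, background_threshold=220):
    """Fraction of pixels that are tissue rather than bright background.

    Patches would be kept only if this fraction is at least 0.7,
    matching the 70% tissue-coverage filter in the text.
    """
    return np.mean(patch.mean(axis=-1) < background_threshold)
```

A 50% overlap means the stride is half the patch size, so neighbouring patches share half their pixels; the filter then discards mostly-background patches before training.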
We use this data set to train three different models: an unconditional one, trained without any labels; one conditioned on estrogen-receptor (ER) status (positive or negative); and a final one conditioned on survival time. For the survival time model, we categorize the images into two groups: one with survival times greater than 5 years, and one with 5 years or less.
In Figure 3 we show how the generator learns low-dimensional representations of the tissue images, allowing linear interpolations in the latent space that translate into smooth transitions between images. Notably, we picked two ends of the image spectrum, a malignant tissue and a benign tissue. As we follow the transition, the cancer cell population decreases and the surrounding tissue architecture adapts to that change.
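The latent-space traversal behind such transitions is simple to express. A minimal sketch (the generator itself is omitted; only the interpolation of latent vectors is shown):

```python
import numpy as np

def linear_interpolation(z_start, z_end, n_steps=8):
    """Linearly interpolate between two latent vectors.

    Feeding each interpolated z through the generator yields a smooth
    visual transition, e.g. from malignant to benign tissue.
    """
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])
```

Smoothness of the decoded image sequence along this straight line is commonly taken as evidence that the generator has learned a meaningful, non-degenerate latent representation.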
4 GAN evaluation metrics for digital pathology
In this section, we investigate how relevant GAN evaluation metrics perform at distinguishing differences in cancer tissue distributions. We center our attention on metrics that are model-agnostic and work with a set of generated images. We focus on Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and the 1-Nearest Neighbor classifier (1-NN) as common metrics to evaluate GANs. We do not include the Inception Score and Mode Score because they do not compare to real data directly, they would require a classification network trained on survival times and estrogen-receptor (ER) status, and they have also shown lower performance when evaluating GANs (Barratt and Sharma, Xu et al.).
Xu et al. reported that the choice of feature space is critical for evaluation metrics, so we follow these results by using the 'pool_3' layer of an ImageNet-trained Inception-V1 network as a convolutional feature space.
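As a reference for how these metrics operate on a feature space: FID fits a Gaussian to each set of feature vectors and computes the Fréchet distance between the two Gaussians. A minimal numpy/scipy sketch (real implementations add numerical safeguards):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between two sets of feature vectors (n, d),
    modelling each set as a multivariate Gaussian.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        # sqrtm can introduce tiny imaginary components from numerical noise
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Because the distance is computed entirely in feature space, swapping the Inception features for another feature extractor (as we do later with cellular features) changes what differences the metric is sensitive to.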
We set up two experiments to test how the evaluation metrics capture:
Artificial contamination from different staining markers and cancer types.
Consistency when two sample distributions of the same database are compared.
4.1 Detecting changes in markers and cancer tissue features
We used multiple cancer types and markers to account for alterations of color and shape in the tissue. Markers highlight parts of the tissue with different colors, and cancer types have distinct tissue structures. Examples of these changes are displayed in Figure 4.
We constructed one reference image set with 5000 H&E breast cancer images from our data sets of NKI and VGH, and compared it against another set of 5000 H&E breast cancer images contaminated with other markers and cancer types. We used three types of marker-cancer combinations for contamination, all from the Stanford TMA Database (Marinelli et al. ): H&E - Bladder cancer, Cathepsin-L - Breast cancer, and CD137 - Lymph/Colon/Liver cancer.
Each set of images was constructed by randomly sampling from the respective marker-cancer type data set, which is done to minimize the overlap between the clean and contaminated sets.
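The construction of a contaminated set can be sketched as follows (an illustrative sketch; names and the use of Python's random module are ours):

```python
import random

def build_contaminated_set(clean_pool, contaminant_pool, n=5000,
                           contamination=0.1, seed=0):
    """Sample a set of n images where a given fraction comes from a
    contaminant marker/cancer-type pool and the rest from the clean
    H&E breast cancer pool. Sampling without replacement from each
    pool keeps the clean and contaminated sets from overlapping.
    """
    rng = random.Random(seed)
    n_bad = int(round(n * contamination))
    sample = (rng.sample(contaminant_pool, n_bad)
              + rng.sample(clean_pool, n - n_bad))
    rng.shuffle(sample)
    return sample
```

Sweeping `contamination` from 0 to 1 produces the sequence of increasingly contaminated sets that each metric is evaluated against.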
Figure 5 shows how (a) FID, (b) KID, and (c) 1-NN behave when the reference H&E breast cancer set is measured against contaminated H&E breast cancer sets at multiple contamination percentages. Marker types have a large impact due to color changes, and all metrics capture this except for 1-NN. Cathepsin-L highlights parts of the tissue with brown colors, and CD137 has a color similar to necrotic tissue on H&E breast cancer, but both remain far from the characteristic pink of H&E. Accordingly, H&E-Bladder achieves a better score in all metrics due to the color of the stain, again except for 1-NN. Differences in cancer tissue type are captured by all the metrics, which show a marker predominance, but we can see that within the H&E marker the differences between breast and bladder types are still captured.
In this experiment, we find that FID and KID respond gradually when distinguishing between markers and cancer tissue types, whereas 1-NN is not able to give a measure that clearly reflects these changes.
4.2 Reliability on evaluation metrics
Another evaluation we performed was to study which metrics are consistent when two independent sample distributions with the same contamination percentage are compared. To construct this test, for each contamination percentage, we create two independent sample sets of 5000 images and compare them against each other. Again, we constructed these image sets by randomly selecting images for each of the marker-cancer databases. We do this to ensure there are no overlapping images between the distributions.
In Figure 6 we show that (a) FID has stable performance compared to (b) KID, and especially to (c) 1-NN. Since we are comparing two sample distributions drawn from the same data set, the metrics should report a near-zero distance at every contamination rate. Only FID shows this consistently near-zero behavior across the different data sets.
Based on these two experiments, we argue that 1-NN does not clearly reflect changes in cancer types and markers, and that neither KID nor 1-NN gives a consistently reliable measure across different markers and cancer types. Therefore, we focus on FID as the most promising evaluation metric.
5 Pathology GANs Results
5.1 Image analysis of cellular heterogeneity as feature extraction.
We explore an alternative feature space that has a direct relation to pathology and cellular profiling. The motivation behind this approach is to ensure that our models capture meaningful and faithful representations of the tissue. We did not use this method in the two previous experiments because the tool works specifically with H&E breast cancer images.
The CRImage tool (Yuan et al. ) uses a SVM classifier to provide quantitative information about tumor cellular characteristics in cancer tissue. This approach allows us to gather pathologically meaningful features in the images, namely the number of cancer cells, the number of other types of cells (such as stromal or lymphocytes), and the ratio of tumorous cells per area. We use this information as an alternative feature space for the FID metric. Figure 7 displays an example of how the tool captures the different cells in the generated images, such as cancer cells, stromal cells, and lymphocytes.
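A minimal sketch of how per-image cellular statistics could be assembled into such a feature vector for FID. The exact features CRImage exports may differ; this mirrors only the quantities named above (cancer cell count, other-cell count, tumorous cells per area):

```python
import numpy as np

def cellular_feature_vector(n_cancer_cells, n_other_cells, tissue_area):
    """Assemble a per-image feature vector from cell-level statistics.

    The resulting (n_images, 4) matrix can replace Inception features
    as the input to the FID computation.
    """
    return np.array([
        n_cancer_cells,
        n_other_cells,
        # tumorous cells per unit area
        n_cancer_cells / max(tissue_area, 1e-8),
        # cancer cells as a fraction of all detected cells
        n_cancer_cells / max(n_cancer_cells + n_other_cells, 1),
    ], dtype=float)
```

Because these features are directly interpretable (counts, densities, ratios), a low FID in this space says that the generated tissue matches the real tissue in pathologically meaningful terms, not merely in convolutional texture statistics.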
We evaluate our three models with FID, generating 5000 fake images and randomly selecting 5000 real images. We use both approaches to feature space selection: the CRImage cell classifier and the convolutional features of an Inception-V1 network.
| Model | Inception FID | CRImage FID |
| Survival > 5 years | 13.23 ± 0.4 | 3.81 ± 1.8 |
| Survival ≤ 5 years | 27.30 ± 4.6 | 16.46 ± 0.40 |
Evaluation of different Pathology GANs. Means and standard deviations are computed over three different random initializations. The low FID scores in both feature spaces suggest consistent and accurate representations.
Table 1 shows that our models achieve an accurate characterization of the cancer tissue images. Using the Inception feature space, FID shows a stable representation for all models, with values comparable to ImageNet models such as BigGAN (Brock et al.) and SAGAN (Zhang et al.), which report FIDs of 7.4 and 18.65 respectively. Using the CRImage cellular information as the feature space, FID again shows representations close to real tissue.
5.3 Capturing differences between phenotype and Pathology GAN
In this section, we examine the differences between real sample distributions in the ER and survival time cases, do the same with our models, and compare whether our models capture the same within-phenotype differences. We present this data in two different ways:
In Figure 8, the Inception feature space, (a) and (c), shows similar trends between real and generated distributions for the ER and survival cases. Differences within phenotype distributions are highlighted by the CRImage feature space, especially in the ER case (b), where the differences between the real ER-positive, ER-negative, and joint distributions are followed by our models. For survival time (d), our model has a harder time reproducing data from the five-years-or-less group, but it is still able to find differences between this distribution, the greater-than-five-years distribution, and the joint one, which concurs with the results in Table 1.
With Figure 9, we display the densities of the real and generated distributions using two important factors of cancer tissue: cancer cell density per area and the ratio of cancer cells to all cells. We used the CRImage cell classifier to extract these from the cancer tissue images. We can see that our model's characterization follows the real distributions: ER-positive (a)-(e) and survival time greater than five years (c)-(g) show close representations. At the same time, ER-negative (b)-(f) and survival time of five years or less (d)-(h) show a larger difference between real and generated, which again follows both the evaluation metrics in Table 1 and the plots in Figure 8.
Our goal with this work is to show that the generative model is able to condition the cancer tissue images on the label, and in this way reproduce the real image distributions. This opens the door to using much richer clinical information about the cancer tissue sample, for instance molecular properties associated with cancer tissue.
5.4 Pathologists’ results
To demonstrate that the generated images can sustain the scrutiny of clinical examination, we asked two expert pathologists to take two different tests, set up as follows:
Test I: 5 Sets of 8 images - Pathologists were asked to find the only fake image in each set.
Test II: 10 Individual images - Pathologists were asked to rate all individual images from 1 to 5, where 5 meant the image appeared the most real.
In total, each pathologist classified 50 individual images and 25 sets of 8 images. We chose fake images in two ways: half were hand-selected, and the other half were the fake images with the smallest Euclidean distance to real images in the convolutional feature space (Inception-V1). All real images were randomly selected from among the three closest neighbors of the fake images.
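The distance-based half of this selection can be sketched as follows (an illustrative sketch; feature arrays are assumed to be n×d matrices of Inception-V1 features):

```python
import numpy as np

def closest_fakes(fake_feats, real_feats, k):
    """Return indices of the k fake images whose feature vectors lie
    closest (Euclidean) to any real image's feature vector."""
    # Pairwise squared distances between every fake and every real vector.
    d2 = ((fake_feats[:, None, :] - real_feats[None, :, :]) ** 2).sum(axis=-1)
    # Distance of each fake image to its nearest real neighbour.
    min_dist = np.sqrt(d2.min(axis=1))
    return np.argsort(min_dist)[:k]
```

Selecting fakes that are near real images in feature space deliberately makes the test harder than showing random generations, since these are the samples most likely to pass for real tissue.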
On Test I, pathologists 1 and 2 were able to find the fake image in only 2 and 3 of the 25 sets, respectively. This is indicative of the images' quality, because we argue that pathologists should find it easier to spot fake images among real ones, since having other references to compare against facilitates the task. Figure 10 shows Test II in terms of false positives vs. true positives, and we can see that the pathologists' classification is close to random. The pathologists mentioned that their usual procedure involves larger images with higher resolution, but that the generated fake images were of such quality that, at the size used in this work, they were not able to differentiate between real and fake tissue.
We presented a new approach to the use of machine learning in pathology, using GANs to learn clinically/pathologically meaningful representations of cancer tissue. We studied GAN evaluation metrics on cancer tissue images and showed that FID gives a more precise measure, and we also proposed a new feature space with a close relationship to pathology. Additionally, our results show that these models capture faithful representations of cancer tissue, and we demonstrate that pathologists are not able to reliably find differences between real and generated images.
This paper lays the foundation for further work along the following lines: increasing the resolution of the images to capture larger regions of cancer tissue, investigating how feature disentanglement could help uncover common features between cancer types, and mapping to molecular properties (e.g. genome and transcriptome) that dictate tissue phenotype. To this end, we are working towards massively scaling up our model to handle whole-slide images across multiple cancer types, with large patient cohorts such as The Cancer Genome Atlas.
We would like to thank Joanne Edwards and Elizabeth Mallon for helpful insights and discussions on this work. We would also like to acknowledge the support of EPSRC grant EP/R018634/1.
- Ardila et al.  Diego Ardila, Atilla P. Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J. Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, David P. Naidich, and Shravya Shetty. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, page 1, may 2019. ISSN 1078-8956.
- Barratt and Sharma  Shane Barratt and Rishi Sharma. A note on the Inception score. CoRR, abs/1801.01973, 2018. URL http://arxiv.org/abs/1801.01973.
- Beck et al.  A.H. Beck, A.R. Sangoi, and S Leung. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Science translational medicine, 3, 01 2011.
- Binkowski et al.  Mikolaj Binkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
- Brock et al.  Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. CoRR, abs/1809.11096, 2018. URL http://arxiv.org/abs/1809.11096.
- Cai et al.  Carrie J. Cai, Emily Reif, Narayan Hegde, Jason D. Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda B. Viégas, Gregory S. Corrado, Martin C. Stumpe, and Michael Terry. Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019, page 4, 2019. doi: 10.1145/3290605.3300234. URL https://doi.org/10.1145/3290605.3300234.
- Deng et al.  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Esteva et al.  Andre Esteva, Brett Kuprel, Roberto Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 01 2017. doi: 10.1038/nature21056.
- Ghazvinian Zanjani et al.  Farhad Ghazvinian Zanjani, Svitlana Zinger, Babak Ehteshami Bejnordi, Jeroen van der Laak, and Peter With. Stain normalization of histopathology images using generative adversarial networks. In 2018 IEEE 15th International Symposium on Biomedical Imaging, pages 573–577, 04 2018. doi: 10.1109/ISBI.2018.8363641.
- Goodfellow et al.  Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014. URL http://arxiv.org/abs/1406.2661.
- Gretton et al.  Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012. URL http://dl.acm.org/citation.cfm?id=2188410.
- Han et al.  Zhongyi Han, Benzheng Wei, Yuanjie Zheng, Yilong Yin, Kejian Li, and Shuo Li. Breast cancer multi-classification from histopathological images with structured deep learning model. Scientific Reports, 7, 06 2017. doi: 10.1038/s41598-017-04075-z.
- Heusel et al.  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6629–6640, 2017.
- Jolicoeur-Martineau  Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. CoRR, abs/1807.00734, 2018. URL http://arxiv.org/abs/1807.00734.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
- Lim and Ye  Jae Hyun Lim and Jong Chul Ye. Geometric GAN. CoRR, abs/1705.02894, 2017. URL http://arxiv.org/abs/1705.02894.
- Lopez-Paz and Oquab  David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=SJkXfE5xx.
- Mahmood et al.  Faisal Mahmood, Daniel Borders, Richard Chen, Gregory McKay, Kevan J Salimian, Alexander Baras, and Nicholas Durr. Deep adversarial training for multi-organ nuclei segmentation in histopathology images, 09 2018.
- Marinelli et al.  Robert J. Marinelli, Kelli Montgomery, Chih Long Liu, Nigam Shah, Wijan Prapong, Michael Nitzberg, Zachariah K Zachariah, Gavin Sherlock, Yasodha Natkunam, Robert B West, Matt van de Rijn, Patrick O Brown, and Catherine A Ball. The Stanford Tissue Microarray Database. Nucleic acids research, 36:D871–7, 02 2008. doi: 10.1093/nar/gkm861.
- Miyato et al.  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
- Niazi et al.  Muhammad Khalid Khan Niazi, Anil V Parwani, and Metin N Gurcan. Digital pathology and artificial intelligence. The Lancet Oncology, 20(5):e253–e261, May 2019. ISSN 1470-2045.
- Radford et al.  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06434.
- Rana et al.  Aman Rana, Gregory Yauney, Alarice Lowe, and Pratik Shah. Computational histological staining and destaining of prostate core biopsy RGB images with generative adversarial neural networks. In 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, 2018, pages 828–834, 2018. doi: 10.1109/ICMLA.2018.00133. URL https://doi.org/10.1109/ICMLA.2018.00133.
- Salimans et al.  Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016. URL http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.
- Wei et al.  Jason W. Wei, Laura J. Tafe, Yevgeniy A. Linnik, Louis J. Vaickus, Naofumi Tomita, and Saeed Hassanpour. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. CoRR, abs/1901.11489, 2019. URL http://arxiv.org/abs/1901.11489.
- Xu et al.  Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo, Yu Sun, Felix Wu, and Kilian Q. Weinberger. An empirical study on evaluation metrics of generative adversarial networks. CoRR, abs/1806.07755, 2018. URL http://arxiv.org/abs/1806.07755.
- Xu et al.  Zhaoyang Xu, Carlos Fernández Moro, Béla Bozóky, and Qianni Zhang. GAN-based Virtual Re-Staining: A Promising Solution for Whole Slide Image Analysis. Arxiv, jan 2019.
- Yuan et al.  Yinyin Yuan, Henrik Failmezger, Oscar M Rueda, Hamid Ali, Stefan Gräf, Suet-Feung Chin, Roland F Schwarz, Christina Curtis, Mark Dunning, Helen Bardwell, Nicola Johnson, Sarah Doyle, Gulisa Turashvili, Elena Provenzano, Samuel Aparicio, Carlos Caldas, and Florian Markowetz. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Science translational medicine, 4:157ra143, 10 2012. doi: 10.1126/scitranslmed.3004330.
- Zhang et al.  Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. CoRR, abs/1805.08318, 2018. URL http://arxiv.org/abs/1805.08318.