Pathology GAN: Learning deep representations of cancer tissue

07/04/2019 · by Adalberto Claudio Quiros, et al. · University of Glasgow

We apply Generative Adversarial Networks (GANs) to the domain of digital pathology. Current machine learning research for digital pathology focuses on diagnosis; we suggest a different approach and advocate that generative models could help to understand and find fundamental morphological characteristics of cancer tissue. In this paper, we develop a framework that allows GANs to capture key tissue features, and present a vision of how these could link cancer tissue and DNA in the future. To this end, we trained our model on breast cancer tissue from a medium-sized cohort of 576 patients, producing high fidelity images. We further study how a range of relevant GAN evaluation metrics perform on this task, and propose to evaluate synthetic images with clinically/pathologically meaningful features. Our results show that these models are able to capture key morphological characteristics that link with phenotype, such as survival time and Estrogen-receptor (ER) status. Using an Inception-V1 network as a feature extractor, our models achieve a Fréchet Inception Distance (FID) of 18.4. We find that using pathologically meaningful features with these metrics gives consistent performance, with an FID of 8.21. Furthermore, we asked two expert pathologists to distinguish our generated images from real ones, finding no significant difference between them.


1 Introduction

Figure 1: (a) Images from Pathology GAN trained on H&E breast cancer tissue. (b) Real images: Inception-V1 closest neighbors to the generated images above.

Generative Adversarial Networks (GANs) have seen remarkable progress since their introduction (Goodfellow et al. [2014]), producing high resolution and detailed representations of diverse data. Recently, there has been increasing interest in applying GANs to a range of specific tasks in digital pathology, including staining normalization (Ghazvinian Zanjani et al. [2018]), staining transformation (Rana et al. [2018], Xu et al. [2019]), and nuclei segmentation (Mahmood et al. [2018]). Together with deep learning-based classification frameworks (Esteva et al. [2017], Ardila et al. [2019]), these advances offer hope for better disease diagnosis than standard pathology (Niazi et al. [2019]). However, deep learning approaches lack interpretability, which is a major limiting factor in making a real impact in clinical practice. In this paper, we propose a holistic approach that uses GANs to learn representations of entire tissue architectures (i.e. colour, texture, spatial features of cancer and normal cells, and their interactions) encoded in histopathological images of tumours, and to map these representations to clinical characteristics. To this end, we present the following contributions:

  1. We propose Pathology GAN, a model that combines BigGAN (Brock et al. [2018]) and the Relativistic Average Discriminator (Jolicoeur-Martineau [2018]) to generate high fidelity cancer tissue images.

  2. We study how state-of-the-art evaluation metrics distinguish distribution changes when different types of cancer tissue and markers are introduced.

  3. We propose a different feature space for GAN evaluation metrics, based on cell density and morphology in cancer tissue.

  4. We show that these models capture pathologically meaningful representations and that, when evaluated by pathologists, generated tissue images are not distinct from real tissue images.

2 Background

GANs (Goodfellow et al. [2014]) are generative models that are able to learn high fidelity and diverse data representations from a target distribution. This is done with a generator, $G(z)$, that maps random noise, $z \sim p_z$, to samples that resemble the target data, $x \sim p_{data}$, and a discriminator, $D(x)$, whose goal is to distinguish between real and generated samples. The goal of a GAN is to find the equilibrium of the min-max problem:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (1)$$
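To make the objective concrete, here is a minimal PyTorch sketch of the losses implied by Equation 1 (our illustration, not the authors' implementation; `netD` and `netG` are assumed discriminator and generator modules, with `netD` returning raw logits, and the generator uses the common non-saturating variant):

```python
import torch
import torch.nn.functional as F

def gan_losses(netD, netG, real, z):
    """Sketch of the GAN objective in Eq. 1; netD outputs raw logits."""
    fake = netG(z)
    logits_real = netD(real)
    logits_fake = netD(fake.detach())          # detach: the D step must not update G
    # Discriminator: push D(x) -> 1 on real samples, D(G(z)) -> 0 on fakes.
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    # Generator: non-saturating form, push D(G(z)) -> 1.
    logits_gen = netD(fake)
    g_loss = F.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
    return d_loss, g_loss
```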

Since their introduction, modeling distributions of images has become the mainstream application of GANs, first proposed by Radford et al. [2016]. State-of-the-art models such as BigGAN (Brock et al. [2018]) and ProGAN (Karras et al. [2018]) have recently produced impressive high-resolution images, building on solutions such as Spectral Normalization GANs (Miyato et al. [2018]) and Self-Attention GANs (Zhang et al. [2018]); BigGAN in particular achieved high-diversity images on data sets like ImageNet (Deng et al. [2009]), with 14 million images and 20 thousand classes.

At the same time, evaluating these models has been a challenging task. Many different metrics, such as the Inception Score (IS, Salimans et al. [2016]), Fréchet Inception Distance (FID, Heusel et al. [2017]), Maximum Mean Discrepancy (MMD, Gretton et al. [2012]), Kernel Inception Distance (KID, Binkowski et al. [2018]), and the 1-Nearest Neighbor classifier (1-NN, Lopez-Paz and Oquab [2017]), have been proposed to do so, and thorough empirical studies (Xu et al. [2018], Barratt and Sharma [2018]) have shed some light on the advantages and disadvantages of each of them. However, the selection of a feature space is crucial for using these metrics.

Given the high fidelity and diversity of their generated samples, GANs are excellent candidates for learning complex latent representations of cancer tissue. Cancer is a disease with extensive heterogeneity, where malignant cells interact with immune cells, stromal cells, surrounding tissues, and blood vessels. Tissue micro-arrays (TMAs) collected from tumour biopsies are a common type of histopathological image used to study cancer tissue. Staining of TMAs reveals the tumour micro-environment, aiding understanding of the disease. Currently, machine learning approaches focus on building classifiers that achieve pathologist-level diagnosis (Esteva et al. [2017], Wei et al. [2019], Han et al. [2017]) and on assisting the decision process through computer-human interaction (Cai et al. [2019]).

For breast cancer, traditional computer vision approaches such as Beck et al. [2011] and Yuan et al. [2012] have identified correlations between morphological features of cells and patient survival. Based on these findings, we view Pathology GAN as an approach to learn clinically/pathologically meaningful representations within cancer tissue images. In this work we demonstrate that Pathology GAN is a first step in applying unsupervised learning to help close the gap in understanding the connections between cancer tissue and molecular information.

3 Pathology GAN

We use BigGAN (Brock et al. [2018]) as a baseline and explore changes that empirically worked best for our purpose. We follow the same architecture, employing Spectral Normalization in both the generator and discriminator, a self-attention layer, and a projection discriminator, and we also use orthogonal initialization and regularization as described in the original paper.

We make use of the Relativistic Average Discriminator (Jolicoeur-Martineau [2018]), where the discriminator's goal is to estimate the probability of the real data being more realistic than the fake, instead of following the Hinge loss (Lim and Ye [2017]) as the GAN objective. We find that this change makes the model converge faster and produce higher quality images; images generated with the Hinge loss did not capture the morphological structure of the tissue, as we show in Figure 2.

The discriminator and generator loss functions are formulated as in Equations 2 and 3, where $\mathbb{P}$ is the distribution of real data, $\mathbb{Q}$ is the distribution of fake data, and $C(x)$ is the non-transformed discriminator output, or critic:

$$L_{Dis} = -\mathbb{E}_{x_r \sim \mathbb{P}}[\log(\tilde{D}(x_r))] - \mathbb{E}_{x_f \sim \mathbb{Q}}[\log(1 - \tilde{D}(x_f))] \quad (2)$$

$$L_{Gen} = -\mathbb{E}_{x_f \sim \mathbb{Q}}[\log(\tilde{D}(x_f))] - \mathbb{E}_{x_r \sim \mathbb{P}}[\log(1 - \tilde{D}(x_r))] \quad (3)$$

where $\tilde{D}(x_r) = \mathrm{sigmoid}(C(x_r) - \mathbb{E}_{x_f \sim \mathbb{Q}}[C(x_f)])$ and $\tilde{D}(x_f) = \mathrm{sigmoid}(C(x_f) - \mathbb{E}_{x_r \sim \mathbb{P}}[C(x_r)])$.
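As a concrete reading of Equations 2 and 3, the following sketch computes both losses from raw critic outputs (our reconstruction of the standard relativistic average formulation, not the released code):

```python
import torch
import torch.nn.functional as F

def relativistic_average_losses(c_real, c_fake):
    """Eqs. 2-3: c_real, c_fake are critic outputs C(x) on real/fake batches."""
    # D~(x_r) = sigmoid(C(x_r) - E[C(x_f)]); D~(x_f) is the symmetric quantity.
    d_real = c_real - c_fake.mean()
    d_fake = c_fake - c_real.mean()
    # Eq. 2: real samples should look "more real" than the average fake, and vice versa.
    # BCE-with-logits gives -log(sigmoid(x)) for target 1 and -log(1 - sigmoid(x)) for target 0.
    loss_dis = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # Eq. 3: the generator flips the two roles.
    loss_gen = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
                + F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real)))
    return loss_dis, loss_gen
```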

We use the Adam optimizer (Kingma and Ba [2015]) and learning rates different from BigGAN's: the same value of 0.0001 for both generator and discriminator, with the discriminator taking 5 steps for each generator step. We also use Conditional Batch Normalization in our models, but we find that feeding the entire latent vector produced higher quality images. Each model was trained on an NVIDIA Titan Xp 12 GB for approximately 72 hours.
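The update schedule then looks roughly as follows (a sketch under assumed names `netD`, `netG`, `data_iter`, `z_dim`, and `total_steps`; only the 0.0001 learning rate and the 5:1 step ratio come from the text):

```python
import torch

opt_d = torch.optim.Adam(netD.parameters(), lr=1e-4)   # same lr for both networks
opt_g = torch.optim.Adam(netG.parameters(), lr=1e-4)

for step in range(total_steps):
    for _ in range(5):                                  # 5 discriminator steps per generator step
        real = next(data_iter)
        z = torch.randn(real.size(0), z_dim)
        loss_dis, _ = relativistic_average_losses(netD(real), netD(netG(z).detach()))
        opt_d.zero_grad(); loss_dis.backward(); opt_d.step()
    z = torch.randn(real.size(0), z_dim)
    _, loss_gen = relativistic_average_losses(netD(real), netD(netG(z)))
    opt_g.zero_grad(); loss_gen.backward(); opt_g.step()  # only G's parameters are updated here
```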

Figure 2: Images generated with the Relativistic Average Discriminator (a) and with the Hinge loss (b); using the Relativistic Average Discriminator produces fake images closer to the real ones.

To train our model, we use two H&E breast cancer databases, the Netherlands Cancer Institute (NKI) cohort and the Vancouver General Hospital (VGH) cohort, with 248 and 328 patients respectively (Beck et al. [2011]). Each includes tissue micro-array (TMA) images along with clinical patient data such as survival time and estrogen-receptor (ER) status. All the original TMA images share the same resolution; we split each of them into smaller patches that overlap by 50%. We also perform data augmentation on these images: rotations, and vertical and horizontal flips. We filter out patches in which tissue covers less than 70% of the area. In total this yields a training set of 249K images and a test set of 62K.
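A minimal sketch of this preprocessing follows (the 50% overlap and 70% tissue threshold are from the text; the patch size and the white-pixel tissue heuristic are illustrative assumptions):

```python
import numpy as np

def extract_patches(image, patch=224, overlap=0.5, min_tissue=0.7):
    """Split one TMA image (H x W x 3, uint8) into overlapping tissue-rich patches."""
    stride = int(patch * (1 - overlap))                 # 50% overlap -> stride of half a patch
    patches = []
    for y in range(0, image.shape[0] - patch + 1, stride):
        for x in range(0, image.shape[1] - patch + 1, stride):
            crop = image[y:y + patch, x:x + patch]
            tissue_frac = (crop.mean(axis=-1) < 220).mean()   # crude: non-white pixels ~ tissue
            if tissue_frac >= min_tissue:               # keep patches with >= 70% tissue
                patches.append(crop)
    return patches

def augment(patch):
    """Rotations plus vertical and horizontal flips, as described above."""
    return [np.rot90(patch, k) for k in range(4)] + [np.fliplr(patch), np.flipud(patch)]
```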

We use this data set to train three different models: an unconditional one, trained without any labels; one conditioned on Estrogen-receptor (ER) status (positive or negative); and a final one conditioned on survival time. For the survival time model, we categorize the images into two groups: one with survival times greater than 5 years, and one with 5 years or less.

In Figure 3 we show how the generator is able to learn low dimensional representations of the tissue images, allowing linear interpolations in the latent space that translate into smooth transitions between images. Notably, we picked two ends of the image spectrum, a malignant tissue and a benign tissue. As we follow the transition, the cancer cell population decreases and the surrounding tissue architecture adapts to that change.

Figure 3: (a) Linear interpolation in the latent space from a malignant tissue (more cancer cells) to a benign tissue (fewer cancer cells). (b) Real images: Inception network closest neighbors in feature space to the generated images above. The smooth transition shows that the generator has learned low dimensional representations of tissue images.
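Such interpolations need only a few lines once the generator is trained (a sketch, with `netG` and `z_dim` as before):

```python
import torch

z_a = torch.randn(1, z_dim)                 # latent vector decoding to malignant-looking tissue
z_b = torch.randn(1, z_dim)                 # latent vector decoding to benign-looking tissue
alphas = torch.linspace(0.0, 1.0, steps=8)
# Linear interpolation in latent space; decode each intermediate point.
frames = [netG((1 - a) * z_a + a * z_b) for a in alphas]
```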

4 GAN evaluation metrics for digital pathology

In this section, we investigate how relevant GAN evaluation metrics perform at distinguishing differences in cancer tissue distributions. We focus on metrics that are model agnostic and operate on a set of generated images: Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and the 1-Nearest Neighbor classifier (1-NN). We do not include the Inception Score or Mode Score because they do not compare to real data directly, they would require a classification network for survival time and estrogen-receptor (ER) status, and they have also shown lower performance when evaluating GANs (Barratt and Sharma [2018], Xu et al. [2018]).

Xu et al. [2018] reported that the choice of feature space is critical for evaluation metrics, so we follow their results by using the 'pool_3' layer of an ImageNet-trained Inception-V1 as a convolutional feature space.
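For reference, FID itself is the Fréchet distance between Gaussians fitted to the two feature sets, and can be computed from precomputed features in a few lines (a generic sketch, independent of any particular codebase):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two (n_samples x dim) feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):            # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```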

We set up two experiments to test how the evaluation metrics handle:

  • Artificial contamination with different staining markers and cancer types.

  • Consistency when two sample distributions from the same database are compared.

4.1 Detecting changes in markers and cancer tissue features

We used multiple cancer types and markers to account for alterations of color and shape in the tissue. Markers highlight parts of the tissue with different colors, and cancer types have distinct tissue structures. Examples of these changes are displayed in Figure 4.

We constructed one reference image set with 5000 H&E breast cancer images from our data sets of NKI and VGH, and compared it against another set of 5000 H&E breast cancer images contaminated with other markers and cancer types. We used three types of marker-cancer combinations for contamination, all from the Stanford TMA Database (Marinelli et al. [2008]): H&E - Bladder cancer, Cathepsin-L - Breast cancer, and CD137 - Lymph/Colon/Liver cancer.

Figure 4: Different cancer types and markers. (a) H&E Breast cancer, (b) H&E Bladder cancer, (c) Cathepsin-L Breast cancer, and (d) CD137 Bone marrow cancer. Note the different coloring per marker and the different tissue architecture per cancer type.

Each set of images was constructed by randomly sampling from the respective marker-cancer type data set, in order to minimize the overlap between the clean and contaminated sets.
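A sketch of this set construction (array names hypothetical; the pools are assumed to be disjoint image arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminated_set(clean_pool, contaminant_pool, n=5000, pct=0.2):
    """Sample n images, a fraction pct of which come from the contaminant pool."""
    n_bad = int(n * pct)
    clean = clean_pool[rng.choice(len(clean_pool), size=n - n_bad, replace=False)]
    bad = contaminant_pool[rng.choice(len(contaminant_pool), size=n_bad, replace=False)]
    return np.concatenate([clean, bad])
```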

Figure 5 shows how (a) FID, (b) KID, and (c) 1-NN behave when the reference H&E breast cancer set is measured against H&E breast cancer sets with different percentages of contamination. Marker types have a large impact due to the color change, and all metrics capture this except 1-NN. Cathepsin-L highlights parts of the tissue in brown, and CD137 has a color similar to necrotic tissue in H&E breast cancer, yet both remain far from the characteristic pink of H&E. Accordingly, H&E-Bladder obtains a better score in all metrics due to its staining color, again except for 1-NN. Differences in cancer tissue type are captured by all the metrics, which show a marker predominance, but under the H&E marker the differences between breast and bladder types are still captured.

In this experiment, we find that FID and KID respond gradually when distinguishing between markers and cancer tissue types, whereas 1-NN does not give a measure that clearly reflects these changes.

Figure 5: Distinguishing a set of H&E Breast cancer images against different contaminations of markers and cancer types. For a metric to be optimal, its value should decrease along with the contamination. (a) FID, (b) KID, (c) 1-NN. FID and KID gradually reflect changes in marker and tissue type; 1-NN does not provide a clear measure of the changes.

4.2 Reliability of evaluation metrics

Another evaluation we performed was to study which metrics are consistent when two independent sample distributions with the same contamination percentage are compared. For each contamination percentage, we create two independent sample sets of 5000 images and compare them against each other. Again, we constructed these image sets by randomly sampling from each of the marker-cancer databases, ensuring that there are no overlapping images between the distributions.

In Figure 6 we show that (a) FID has stable performance compared to (b) KID and especially (c) 1-NN. The metrics should show a close-to-zero distance for each of the contamination rates, since we are comparing two sample distributions from the same data set. Only FID shows a consistently close-to-zero behavior across different data sets when comparing the same tissue image distributions.

Based on these two experiments, we argue that 1-NN does not clearly represent changes in cancer types and markers, and that both KID and 1-NN fail to give a consistently reliable measure across different markers and cancer types. We therefore focus on FID as the most promising evaluation metric.

Figure 6: Consistency of metrics when two independent sets of images with the same contamination are compared. Consistent metrics should be close to zero for each of the contamination rates. (a) FID, (b) KID, and (c) 1-NN; FID is the only metric that shows a close-to-zero constant measure.

5 Pathology GANs Results

5.1 Image analysis of cellular heterogeneity as feature extraction

We explore an alternative feature space that has a direct relation to pathology and cellular profiling; the motivation behind this approach is to ensure that our models capture meaningful and faithful representations of the tissue. We did not use this method in the two previous experiments because the tool works specifically with H&E breast cancer images.

The CRImage tool (Yuan et al. [2012]) uses an SVM classifier to provide quantitative information about cellular characteristics in cancer tissue. This allows us to gather pathologically meaningful features from the images, namely the number of cancer cells, the number of other types of cells (such as stromal cells or lymphocytes), and the ratio of tumorous cells per area. We use this information as an alternative feature space for the FID metric. Figure 7 displays an example of how the tool identifies the different cells in generated images, such as cancer cells, stromal cells, and lymphocytes.

Figure 7: CRImage identifies different cell types in our generated images. Cancer cells are highlighted in green, while lymphocytes and stromal cells are highlighted in yellow.
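Because FID only needs a feature matrix, swapping in the CRImage features amounts to replacing the Inception activations with per-image cell statistics (a sketch; `crimage_features` is a hypothetical wrapper around the tool's output, returning e.g. [cancer cell count, other cell count, tumorous cells per area]):

```python
import numpy as np

# Hypothetical per-image CRImage summary vectors.
feats_real = np.stack([crimage_features(img) for img in real_images])
feats_fake = np.stack([crimage_features(img) for img in fake_images])
crimage_fid = fid(feats_real, feats_fake)   # same fid() as sketched in Section 4
```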

5.2 Results

We evaluate our three models with FID, generating 5000 fake images and randomly selecting 5000 real images. We use both approaches to feature space selection: the CRImage cell classifier and the convolutional features of an Inception-V1 network.

Model                 | Inception FID | CRImage FID
Unconditional         | 14.37 ± 2.5   | 7.17 ± 0.4
Survival > 5 years    | 13.23 ± 0.4   | 3.81 ± 1.8
Survival <= 5 years   | 27.30 ± 4.6   | 16.46 ± 0.40
ER Positive           | 16.50 ± 0.8   | 1.93 ± 1.0
ER Negative           | 20.88 ± 0.3   | 11.72 ± 0.7

Table 1: Evaluation of different Pathology GANs. Means and standard deviations are computed over three different random initializations. The low FID scores in both feature spaces suggest consistent and accurate representations.

Table 1 shows that our models achieve an accurate characterization of the cancer tissue images. Using the Inception feature space, FID shows stable representations across all models, with values comparable to those of ImageNet models such as BigGAN (Brock et al. [2018]) and SAGAN (Zhang et al. [2018]), which report FIDs of 7.4 and 18.65 respectively. Using the CRImage cellular information as feature space, FID again shows representations close to real tissue.

5.3 Capturing differences between phenotype and Pathology GAN

In this section, we examine the differences between real sample distributions in the ER and survival time cases, and we do the same with our models, comparing whether they capture the same differences within each phenotype. We present this data in two ways:

  • Measuring the distances between real distributions for both the ER and survival phenotypes, and doing the same for the generated distributions (Figure 8).

  • Displaying the distributions of morphological features of the cancer tissue for real and generated image distributions (Figure 9).

Figure 8: FID metrics in the Inception and CRImage feature spaces for the ER and survival time phenotypes. We display how the generated distributions for each of the possible cases compare to the real distributions, showing that our models are able to follow the differences between distributions within each phenotype.

In Figure 8, the Inception feature space ((a) and (c)) shows similar trends between real and generated distributions for the ER and survival cases. Differences within phenotype distributions are highlighted by the CRImage feature space, especially in the ER case (b), where the differences between the real ER positive, ER negative, and joint distributions are followed by our models. For survival time (d), our model has a harder time reproducing data from the five-years-or-less group, but it is still able to find differences between this distribution, the greater-than-five-years one, and the joint one, which concurs with the results in Table 1.

Figure 9: Distributions of morphological features for the ER and survival phenotypes in real and generated images; we use the cancer-cells/all-cells ratio and cancer cells per area as the two dimensions. Our generative model reproduces the real image distributions, especially the distribution tails of all the different phenotype classes.

In Figure 9, we display the density of the real and generated distributions using two important factors of cancer tissue: cancer cell density per area, and the ratio of cancer cells to all cells. We used the CRImage cell classifier to extract these from the cancer tissue images. Our model's characterization follows the real distributions: ER positive (a)-(e) and survival greater than five years (c)-(g) show close representations, while ER negative (b)-(f) and survival of five years or less (d)-(h) show a larger difference between real and generated, which again follows both the evaluation metrics in Table 1 and the plots in Figure 8.

Our goal with this work is to show that the generative model is able to condition the cancer tissue images on the label, and in this way reproduce the real image distributions. This opens the door to using much richer clinical information about the cancer tissue sample, for instance molecular properties associated with the cancer tissue.

5.4 Pathologists’ results

To demonstrate that the generated images can sustain the scrutiny of clinical examination, we asked two expert pathologists to take two different tests, set up as follows:

  • Test I: 5 Sets of 8 images - Pathologists were asked to find the only fake image in each set.

  • Test II: 10 individual images - Pathologists were asked to rate each image from 1 to 5, where 5 meant the image appeared the most real.

In total, each pathologist classified 50 individual images and 25 sets of 8 images. We chose the fake images in two ways: half of them were hand-selected, and the other half were the fake images with the smallest Euclidean distance to real images in the convolutional feature space (Inception-V1). All the real images were randomly selected from among the three closest neighbors of the fake images.
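The distance-based half of that selection can be sketched as follows (feature matrices assumed precomputed with Inception-V1; names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pairwise Euclidean distances between fake and real feature vectors.
d = np.linalg.norm(fake_feats[:, None, :] - real_feats[None, :, :], axis=-1)
# Fakes whose nearest real image is closest in feature space.
closest_fakes = np.argsort(d.min(axis=1))[:n_select]
# For each chosen fake, draw one real image from among its three nearest neighbours.
reals_shown = [rng.choice(np.argsort(d[i])[:3]) for i in closest_fakes]
```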

In Test I, pathologists 1 and 2 were able to find the fake image in only 2 and 3 of the 25 sets, respectively. This is indicative of the images' quality, since we argue that finding fake images amongst real ones should be the easier setting: having other references to compare against facilitates the task. Figure 10 shows Test II in terms of false positives vs. true positives; the pathologists' classification is close to random. The pathologists mentioned that the usual procedure is to work with larger, higher-resolution images, but that the generated images were of such quality that, at the size used in this work, they were not able to differentiate between real and fake tissue.

Figure 10: ROC curve of the pathologists' classification of individual images in Test II. Their performance is close to random when asked to classify between real and generated images.

6 Conclusion

We presented a new approach to the use of machine learning in pathology, using GANs to learn clinically/pathologically meaningful representations of cancer tissue. We studied GAN evaluation metrics on cancer tissue images, showed that FID gives the most precise measure, and proposed a new feature space with a close relationship to pathology. Additionally, our results show that these models capture faithful representations of cancer tissue, and we demonstrated that pathologists are not able to reliably distinguish real from generated images.

This paper lays the foundation for further work on the following ideas: increasing image resolution to capture larger regions of cancer tissue, investigating how feature disentanglement could help reveal common features across cancer types, and mapping to the molecular properties (e.g. genome and transcriptome) that dictate tissue phenotype. To this end, we are working towards massively scaling up our model to handle whole-slide images across multiple cancer types, with large patient cohorts such as The Cancer Genome Atlas.

Acknowledgments

We would like to thank Joanne Edwards and Elizabeth Mallon for helpful insights and discussions on this work. We would also like to acknowledge support from EPSRC grant EP/R018634/1.

References