We present Neural Image Compression (NIC), a method to reduce the size of gigapixel images by mapping them to a compact latent space using neural networks. We show that this compression allows us to train convolutional neural networks on histopathology whole-slide images end-to-end using weak image-level labels.READ FULL TEXT VIEW PDF
Gigapixel images are three-dimensional arrays composed of more than 1 billion pixels, common in fields like Computational Pathology  and Remote Sensing , and often associated with image-level labels. The fundamental challenge of gigapixel image analysis with weak image-level labels resides in the low signal-to-noise ratio present in these images, i.e. typically the signal consists of a subtle combination of high- and low-level patterns that are related to the image-level label, while billions of pixels behave as distracting noise. Furthermore, the nature and spatial distribution of the signal are both unknown, often referred to as the what and the where problems, respectively.
Researchers have addressed the challenge of gigapixel image analysis by making different assumptions about the signal, simplifying either the what or the where problem.
The most widespread simplification assumes that the signal is fully recognizable at a low level of abstraction, i.e. the image-level label has a patch-level representation. This simplification addresses the what problem by decomposing the gigapixel image into a set of patches that can be independently annotated. Typically, these patches are manually annotated to perform automatic detection or segmentation using a neural network, relegating the task of performing image-level prediction to a rule-based decision model about the patch-level predictions [3, 4, 5, 1]
. This assumption is not valid for image-level labels that do not have a known patch-level representation. Furthermore, patch-level annotation in gigapixel images is a tedious, time consuming and error-prone process, and limits what machine learning models can learn to the knowledge of human annotators.
Other researchers have assumed that the signal can exist at a low level of abstraction but it is not fully recognizable, i.e. the image-level label has a patch-level representation that is unknown to human annotators. Furthermore, the mere presence of these patches is enough evidence to make a prediction at image level, ignoring the spatial arrangement between patches, thus solving the where problem. Taking this assumption falls into the multiple-instance learning (MIL) framework, which reduces the gigapixel image analysis problem into detecting patches that contain the true signal while suppressing the noisy ones [6, 7, 8, 9, 10]. However, these methods can only take into account patterns present within individual patches, neglecting the potential relationships among them. More generally, MIL techniques cannot exploit patterns present in higher levels of abstraction since they ignore the spatial distribution among patches. This is also true for methods that aggregate patch-level information by means of spatial pooling [6, 11].
In this work, we do not make any assumptions about the nature or spatial distribution of the visual cues associated with image-level labels. We argue that convolutional neural networks (CNNs) are designed to solve the what and the where problems simultaneously , and propose a method to use them for gigapixel image analysis. Feeding CNNs directly with gigapixel images is computationally unfeasible. We propose a method to compress gigapixel images into a highly compact representation so that they can be used to train a CNN using a single GPU to predict any kind of image-level label. We achieve such a size reduction using Neural Image Compression (NIC), a technique that maps images from low-level pixel space to a higher-level latent space using neural networks.
Gigapixel NIC aims to reduce the size of a gigapixel image while retaining semantic information by shrinking its spatial dimensions and growing along the feature direction, see Fig. 1. The method works by, first, dividing the gigapixel image into a set of high-resolution patches. Second, compressing each high-resolution patch with a neural network, which we call the encoder, that maps every image into a low-dimensional embedding vector. Finally, placing each embedding into an array that keeps the original spatial arrangement intact so that neighbor embeddings in the array represent neighbor patches in the original image.
The idea of NIC has a biological inspiration. Human observers can describe complex visual patterns using only a few words without needing to describe every single pixel that they see. Similarly, the encoder can describe patches with low-dimensional embedding vectors, ignoring superfluous details. It is a powerful method that competes with classical approaches in terms of compression rate 
. Moreover, previous work in representation learning and transfer learning have demonstrated that neural networks excel at extracting features that can be exploited by other networks to solve a variety of other downstream tasks[14, 15, 16, 17]. This makes NIC an ideal candidate for reducing the size of gigapixel images before feeding a CNN.
In this work, we compare three of the most common encoding methods for unsupervised representation learning. First, autoencoders (AEs) have been proposed as a straightforward method to learn a compact representation of a given data manifold. AEs are neural networks that follow a particular encoder-bottleneck-decoder architecture. They aim to reconstruct input images by minimizing a reconstruction loss, e.g. the mean squared error (MSE). In particular, we consider the case of the variational autoencoder (VAE), a powerful modification of the original AE that includes a probabilistic approach . Second, we investigate a discriminative model based on contrastive training [16, 19, 20, 21]. This model senses the world via an encoding network that maps images to embedding vectors. By training this model to distinguish between pairs of images with same or different semantic information, we enforce the encoder to learn a compact representation of the input data. Third, we investigate adversarial feature learning [14, 15], a training framework based on Generative Adversarial Networks (GANs) . GANs have emerged as powerful generative models that map low-dimensional latent distributions into complex data. There is evidence that these latent spaces capture some of the high-level semantic information present in the data . However, standard GAN models do not support the reverse operation, i.e. mapping data to the latent space. The Bidirectional GAN (BiGAN) model proposes to learn this mapping by including an explicit encoding network in the training loop . Intuitively, the encoder benefits from all the high-level features already discovered by the generator fully automatically.
Without any loss of generality, we applied our method to two of the largest publicly available histopathology datasets to demonstrate its effectiveness in real-world applications, the Camelyon16 Challenge  and the TUPAC16 Challenge . These datasets consist of gigapixel images of human tissue acquired with brightfield microscopy at very high magnification, also known as whole-slide images (WSIs). Each WSI is associated with a single image-level label: the presence of tumor metastasis for Camelyon16, and the degree of tumor proliferation speed based on gene-expression profiling for TUPAC16.
An important benefit of using a CNN for gigapixel image analysis is that once trained, we can visualize the area of the input image where the CNN pays the most attention to, e.g. using gradient-weighted class-activation maps (grad-CAM) . These saliency maps provide an answer to the where problem by locating visual cues related to the image-level labels. Identifying visual evidence for CNN predictions is of utmost importance in the medical domain regarding algorithm interpretation and knowledge discovery. For the first time, we performed this saliency analysis on gigapixel images and compared the resulting maps with the patch-level annotations of an expert observer.
This work is an extension of our earlier conference paper . In the current version, we have included a number of important additions: two new datasets, an additional encoding method, the grad-CAM analysis, a new experiment at patch level, a new experiment at image level, a more thorough evaluation using cross-validation, and an independent test evaluation performed by a third-party.
Our contributions can be summarized as follows:
We propose Neural Image Compression (NIC), a method to reduce gigapixel images to highly-compact representations, suitable for training a CNN end-to-end to predict image-level labels using a single GPU and standard deep learning techniques.
We compare several encoding methods that map high-resolution image patches to low-dimensional embedding vectors based on different unsupervised learning techniques: reconstruction error minimization, contrastive training and adversarial feature learning.
We evaluate the proposed methodology using two publicly available breast cancer sets of gigapixel WSIs with two types of weak image-level labels: the presence of tumor metastasis in breast lymph nodes, and the degree of tumor proliferation speed.
We generate saliency maps representing the areas of the input gigapixel image where a trained CNN pays the most attention to, in order to discover and localize visual cues associated to the image-level labels.
Let us define as the gigapixel image that we aim to compress, e.g. a WSI, with rows, columns and three color channels (RGB). In order to compress into a more compact representation , we followed two steps. First, we divided into a set of high-resolution patches with , sampled from the -th row and -th column of an uniform grid of square patches of size using a stride of throughout . Second, we compressed each independently from each other, generating a set of low-dimensional embedding vectors of length at each spatial location on the grid: with .
We formulated the task of mapping high-entropy into low-entropy as an instance of an unsupervised representation learning problem, and parametrized this mapping function with a neural network so that . By sliding throughout all spatial locations, we compressed into with a total volume reduction of . More formally:
We investigated several unsupervised encoding strategies to learn . In all cases, we trained neural networks to solve an auxiliary task and learn as a by-product of the training process. Note that none of the studied methods required the use of manual annotations. Network architectures and training protocols are detailed in the Supplementary Material accompanying this paper.
Two networks are trained simultaneously, the encoder and the decoder . The task of is to map an input patch into a compact embedded representation , and the task of is to reconstruct from , producing . In this work, we used a more sophisticated version of AE, the variational autoencoder (VAE) . The encoder in the VAE model learns to describe
with an entire probability distribution, instead of a single vector, see Fig.2. More formally, outputs and
with and denoting element-wise multiplication.
We trained the VAE model by optimizing the following objective:
with as a scaling factor, and and as the parameters of and , respectively.
This procedure results in a continuous latent space where changes in the embedding vectors are proportional to changes in the input data, and vice-versa, retaining semantic knowledge more effectively than vanilla AE .
We assembled a training dataset composed of pairs of patches where each pair was associated with a binary label . Each label described whether the patches had been extracted from the same or different location in a given gigapixel image, with and , respectively. We trained a two-branch Siamese network  to solve this classification problem, see Fig. 3.
We applied heavy data augmentation on all patches as indicated in , i.e. rotation, color augmentation, brightness, contrast, zooming, elastic deformation and additive Gaussian noise. Due to the strong augmentation, patches from the same location looked substantially different in a highly non-linear fashion while keeping a similar overall structure (semantic), see examples in Fig. 3
. Since the classifier had access to the data via the encoder only, the encoder was forced to extract high-level features that could precisely characterize the data in order to solve the classification task. To further strengthen the difficulty of the task, we sampled 75% of thedifferent data points from non-overlapping adjacent locations, where most of the visual features were shared.
The BiGAN setup consists of three networks: a generator , a discriminator and an encoder , see Fig 4. maps a latent variable to generated images :
whereas maps images sampled from the true data distribution to embeddings :
During training, the three networks play a minimax game where the discriminator tries to distinguish between actual or generated image-embedding pairs, i.e. and respectively, while and try to fool by producing increasingly more realistic images and embeddings , i.e. closer to . More formally, we optimized the following objective function:
with , and representing the parameters of , and , respectively.
The authors of BiGAN demonstrate, theoretically and experimentally, that and learn an approximate inverse mapping function from each other, producing an encoding network that learns a powerful low-dimensional representation of the image world inherited from , suitable for downstream tasks such as supervised classification .
In this section, we propose a method to train a CNN to predict image-level labels directly from compressed gigapixel images. Furthermore, we analyze the location of visual cues associated with the image-level labels.
We consider a dataset of gigapixel images that were compressed into with , using Eq. 1. In order to train a standard CNN on a dataset like , we set the depth of the convolutional filters of the input layer to be equal to the code size used to compress the images.
We hypothesize that such a CNN can learn to detect highly discriminative features by exploiting two complementary sources of information from : (1) the global context encoded within the spatial arrangement of embedding vectors, and (2) the local high-resolution information encoded within the features of each embedding vector.
Note that in this setup, each compressed image constitutes a single training data point, despite its gigapixel nature. Most public datasets with gigapixel images and their respective image-level labels consist of a few hundred data points only [4, 3], increasing the risk of overfitting. Therefore, we took a number of measures to prevent this effect, enumerated below.
First, we extended the training dataset by taking spatial crops of size from , drastically increasing the total number and variability of the samples presented to the CNN . During training, we randomly sampled the location of the center pixel of these crops, whereas during testing, we selected
crops uniformly distributed along the spatial dimensions of, and averaged the predictions of the CNN across them . Without any loss of generality, we applied this method to histopathology WSIs. Because WSIs often contain large empty areas with no tissue, we detected the tissue regions  and sampled crops proportionally to the distance to background to accelerate the training, so that areas with higher tissue density were sampled more often. Similarly, test crops were sampled from locations where tissue was present.
The second measure that we took to prevent overfitting was performing simple augmentation at image level, i.e. 90-degree rotation and mirroring, encoding each image 8 times. This augmentation was carried out during testing as well, averaging the predictions of the CNN across them.
Finally, we designed a CNN architecture aimed at reducing the number of parameters present in the model. In particular, we set all convolutional layers to use depthwise separable convolutions, a type of convolution that reduces the number of parameters while maintaining a similar level of performance .
The problem of feature localization is of utmost relevance for gigapixel image analysis since often visual cues related to the image-level labels are sparse and positioned in arbitrary locations within the image. For the purpose of identifying the location of these visual cues, we applied the Gradient-weighted Class-Activation Map (Grad-CAM) algorithm  to our trained CNN.
Given a compressed gigapixel image , its associated image-level label and a trained CNN, Grad-CAM performs a forward pass over , producing a set of intermediate three-dimensional feature volumes , with and indicating the -th and -th convolutional layer and feature map, respectively. Subsequently, it computes the gradients of with respect to , for a fixed convolutional layer. It averages the gradients across the spatial dimensions and obtains a set of gradient coefficients indicating how relevant each feature map is for the desired output . Finally, it performs a weighted sum of the feature maps using the gradient coefficients :
We applied the visualization method to the first convolutional layer, i.e. , in order to maximize the heatmap resolution.
We conducted a series of experiments in order to evaluate the performance of gigapixel NIC. Due to the computationally expensive nature of these experiments, we did not tune the hyper-parameters of NIC. Instead, we selected fixed values using the following heuristics. We used, a common patch size used in the Computational Pathology literature , with a stride of the same size to perform non-overlapping patch sampling. We selected to obtain crops corresponding to typical sizes of gigapixel WSIs, i.e. pixels, and as done in the literature . Finally, we selected as the value that allowed us to perform our experiments using a single GPU. Network architectures and training protocols are detailed in the Supplementary Material accompanying this paper.
In this work, three cohorts from different sources were used for supervised and unsupervised training at patch and image level.
First, the Camelyon16  dataset is a publicly available multicenter cohort that consists of 400 sentinel lymph node hematoxylin and eosin (H&E) WSIs from breast cancer patients. Reference standard exists in two forms: (1) fine-grained annotations of metastatic lesions, and (2) image-level labels indicating the presence of tumor metastasis in each slide. We set aside 60 WSIs from the original training set to train encoders at patch level. The remaining WSIs together with the original test set, a total of 340 WSIs, were used to train and evaluate a classification model using image-level labels only.
Second, we used the TUPAC16  dataset, consisting of 492 H&E WSIs from invasive breast cancer patients. It is a publicly available cohort with WSIs from The Cancer Genome Atlas  where each WSI is associated with a tumor proliferation speed score, an objective measurement that takes into account the RNA expression of 11 proliferation-associated genes . We set aside 40 WSIs from this set to train encoders at patch level. The remaining WSIs, a total of 452 WSIs, were used to train and evaluate a regression model using image-level labels only. Additionally, 321 test WSIs with no public ground truth available were used to perform an independent evaluation.
Third, the Rectum dataset consisted of 74 H&E WSIs from rectal carcinoma patients . Manual annotations of 9 tissue classes were made by an expert, which included: (1) blood cells, (2) fatty tissue, (3) epithelium, (4) lymphocytes, (5) mucus, (6) muscle, (7) necrosis, (8) stroma, and (9) tumor. The slides were randomized and organized into ten equal partitions at patient level, using 5 for training, 1 for validation and 4 for testing purposes. This dataset was used to train and evaluate encoders at patch level only.
All WSIs in this study were preprocessed with a tissue-background segmentation algorithm  in order to exclude areas not containing tissue from the analysis. Furthermore, all images were analyzed at resolution.
We assembled a set of patch datasets to train and evaluate each of the encoding networks described in Sec. 2 using the set of images that we set aside from each cohort, i.e. 60 WSIs from Camelyon16, 40 WSIs from TUPAC16 and all image tiles from Rectum. Subsequently, we divided each of these collections of images into training, validation and test partitions.
First, we created the contrastive dataset by extracting an equal amount of patches from each source, i.e. Camelyon16, TUPAC16 and Rectum, and merged them together resulting in nd atch pairs for training and validation purposes, respectively, using the same partitions as indicated previously. Subsequently, we created the non-contrastive dataset by randomizing all individual patches within the contrastive dataset.
Second, we created the supervised-tumor dataset by extracting nd atches from the set of 60 WSIs from Camelyon16 for training, validation and testing purposes, respectively, using the same partitions as indicated previously. Notice that the patches in the test set did not undergo any augmentation. We used the fine-grained tumor annotations to sample a balanced distribution of tumor and non-tumor patches.
Finally, we created the supervised-tissue dataset by extracting nd atches from the set of images from Rectum for training, validation and testing purposes, respectively, using the same partitions as indicated previously. Notice that the patches in the test set did not undergo any augmentation. We used the fine-grained tissue-type annotations to sample a balanced distribution of patches among the 9 classes described before.
We trained the contrastive encoder with the contrastive dataset, and the VAE and BiGAN models with the non-contrastive dataset. No manual annotations were required in this process. We trained a supervised baseline encoder for breast tumor classification using the supervised-tumor dataset, and a supervised baseline encoder for rectum tissue classification using the supervised-tissue dataset.
We included an additional baseline that captured simple color information from the raw input by averaging the pixel intensity across spatial dimensions from input RGB patches. It provided a strong baseline for identifying slow changing patterns and color information, common in gigapixel images.
This entire training process resulted in 6 encoding networks used in subsequent experiments: the mean-RGB baseline, the VAE encoder, the contrastive encoder, the BiGAN encoder, the supervised-tumor baseline, and the supervised-tissue baseline.
Due to the lack of a common evaluation methodology for unsupervised representation learning, we compared the performance of these 6 encoders when used as fixed feature extractors for related supervised classification tasks. We defined two tasks: (1) discerning between tumor and non-tumor patches on the supervised-tumor dataset, named Task-1; and (2) performing 9-class tissue classification on the supervised-tissue dataset, named Task-2. For each task, we trained an MLP on top of each encoder with frozen weights, and reported the accuracy metric on each test set.
Results in Tab. I highlight several facts. First, VAE, Contrastive and BiGAN, displayed a higher performance than the lower baseline for both Task 1 and Task 2, stressing their ability to describe complex patterns beyond simple features related to color intensity. Second, the VAE encoder obtained a higher performance than the Contrastive one, particularly for Task 2. Third, the BiGAN encoder achieved the best performance among all the unsupervised methods, with a relatively large margin for the more complex Task 2 with respect to the runner-up VAE model. Furthermore, the BiGAN encoder obtained the best result for 5 out of 9 classes in Task 2, and it achieved the first or second best result for 8 out of 9 classes among the unsupervised models. Remarkably, BiGAN succeeded at classifying patches from challenging tissue classes such as blood cells and necrotic tissue.
In this experiment, we trained a CNN to perform binary classification on compressed gigapixel WSIs from the Camelyon16 cohort, identifying the presence of tumor metastasis using image-level labels only. Due to the limited amount of images in this cohort (340 WSIs), we divided the dataset into 4 equal-sized partitions and performed 4 rounds of cross-validation, using 2 partitions for training, 1 for validation and 1 for testing, rotating them in each round. We trained a different CNN classifier for each encoder, i.e. Mean-RGB, VAE, Contrastive, BiGAN, and the upper baseline supervised-tumor. We reported the area under the receiver operating characteristic (AUC) on three evaluation sets.
The first evaluation set, called All, concatenated all samples in each of the hold-out partitions. Notice that each hold-out partition was evaluated by a different CNN that had never seen the data. The second evaluation set, called Test, was the subset of All that matched the official test set of the Camelyon16 Challenge, so that we could compare our results with the public leaderboard. The third evaluation set, called Macro, used the same WSIs as in Test but considering as positive labels only those that presented a macro metastasis, i.e. a tumor lesion larger than . The macro labels were only available for the test set of Camelyon16. The Macro set was relevant to evaluate how the method performed with lesions visible at low resolution.
Results in Tab. II confirmed that the method presented in this work is an effective technique for gigapixel image analysis using image-level labels only. Regarding the All evaluation set, BiGAN achieved a remarkable performance of 0.716 AUC, with a relative difference with respect to the supervised baseline of only 6%. The Contrastive and VAE models also improved over the lower baseline, but obtained significantly lower performance scores compared to BiGAN. Regarding the Test set, we observed a lower performance of 0.674 AUC for the BiGAN encoder. Regarding the Macro set, the performance gap between the supervised baseline and the BiGAN encoder increased substantially from 0.095 to 0.184.
In this experiment, we trained a CNN to perform a regression task on compressed gigapixel WSIs from the TUPAC16 cohort, predicting the degree of tumor proliferation speed based on gene-expression profiling. We performed 4-fold cross-validation as in the previous experiment, and reported the Spearman correlation between the predicted and the true scores on two evaluation sets.
The first evaluation set, called All, concatenated all samples in each of the hold-out partitions. The second evaluation set, called Test, matched the test set used in the TUPAC16 Challenge, whose ground truth is not public. Using the encoder that obtained the highest performance, we evaluated each WSI in Test 4 times using each of the CNNs trained during cross-validation and submitted the average score per slide. Our predictions were independently evaluated by the challenge organizers, ensuring a fair and independent comparison with the state of the art.
The results in Tab. III showed that BiGAN achieved the highest performance with a 0.521 Spearman correlation. Remarkably, this score was superior to that of any other unsupervised or supervised encoder. In addition, we obtained a score of 0.557 on the TUPAC16 Challenge test set which was superior to the state-of-the-art for image-level regression with a score of 0.516. Notice that the first entry of the leaderboard used an additional set of manual annotations of mitotic figures, thus it cannot be compared with our setup.
We conducted a qualitative analysis on the trained CNNs to locate the spatial position of visual cues that were relevant to predict the image-level labels. We applied the Grad-CAM algorithm to the CNNs trained for both tasks at image level. For the tumor metastasis prediction task, we compared the saliency maps with fine-grained manual annotations. Figures 5 and 6 include the results for two samples. The results for the rest of WSIs can be found in the Supplementary Material. Notice that each WSI was evaluated by a CNN that had not seen the image (hold-out partition).
Fig. 5 shows that the Mean-RGB baseline model lacked the ability to focus on specific tissue regions, suggesting that it was unable to learn discriminative features from image-level labels. The VAE and Contrastive models exhibited a suboptimal behavior, scattering attention all over the image. Remarkably, the BiGAN model seemed to focus on tumor regions only, discarding not only fatty and empty tissue, but areas with healthy dense tissue as well. It showed a strong discriminative power to discern between tumor and non-tumor regions, even though the CNN had access to image-level labels only. For completeness, we also included the supervised-tumor baseline, that exhibited a similar behavior as BiGAN, solely focusing on tumor regions, as expected. Regarding Fig. 6, a similar trend than the one found in the previous task was observed for all encoders, with the BiGAN model focusing on very specific regions of the WSIs that seemed compatible with active tumor regions. In this case, the supervised-tumor baseline focused on irrelevant areas, in line with its low performance for this task.
Our experimental results confirmed the initial hypothesis that visual cues associated with weak image-level labels can be exploited by our method, integrating information from global structure and local high-resolution visual cues. Furthermore, we have shown that this methodology is flexible and completely label-agnostic, delivering relevant results in both classification and regression tasks, and emerging as a promising strategy to tackle the analysis of more challenging image-level labels that are closely related to patient outcome, e.g. overall survival and recurrence-free survival. Moreover, gigapixel NIC paves the way for leveraging existing computer vision algorithms that could not be applied in the gigapixel domain until now, e.g. image captioning (useful to generate written clinical reports), visual question answering, image retrieval (to find similar pathologies), anomaly detection and generative modeling[33, 34, 35, 36, 37].
A key assumption of our method was that high-resolution image patches could be represented by low-dimensional highly compressed embedding vectors. We analyzed several unsupervised strategies to achieve such a compression and found that the BiGAN encoder, trained using adversarial feature learning, was superior to all the other methods across all experiments. We believe that this relative improvement with respect to the VAE and Contrastive method is explained by intrinsic algorithmic differences among the methods. In particular, the VAE model relied on minimizing the MSE objective, which is a unimodal function that fails to capture high-level semantics and focuses on reconstructing low-level pixel information instead, wasting embedding capacity. On the other hand, the Contrastive encoder uses the embedding capacity more efficiently, but its performance is driven by the design of the hand-engineered contrastive task. Remarkably, the BiGAN model learns an encoder that inverts a complex mapping between the latent space and the image space, fully automatically. By doing so, the encoder benefits from all the high-level features and semantics already discovered by the generator, producing very effective discriminative embedding vectors.
We trained a CNN to predict the degree of breast tumor proliferation speed based on gene-expression profiling, a label associated with unknown visual cues. Remarkably, our method succeeded in finding and exploiting these patterns in order to predict expected tumor proliferation speed, surpassing the current state-of-the-art for image-level based methods. This result confirms our method as an effective solution to deal with gigapixel image-level labels with unknown associated visual cues. Moreover, our method could be used in future work to effectively mine datasets with thousands of gigapixel images , targeting other automatically generated labels from immunohistochemistry, genomics or proteomics, and recognizing visual patterns beyond the knowledge of human pathologists.
For the first time, we visualized the regions of a gigapixel image that a trained CNN attends to when predicting image-level labels, and compared how different encoding methods affect these regions. We discovered that only the CNNs trained with images compressed with the BiGAN encoder and the supervised-tumor baseline were able to attend to regions of the image where tumor cells were present. The fact that the BiGAN model simultaneously learned to (1) delimit metastatic lesions and (2) identify tumor features within the patch embeddings, validates our initial hypothesis that CNNs are an effective method for analyzing gigapixel images, since they can exploit both global and local context.
We targeted the presence of tumor metastasis in breast lymph nodes and, although we obtained encouraging results with the BiGAN setup by performing similarly to the supervised baseline, our best performance is still far away from that of the Camelyon16 leading methods. In particular, the limited performance of the supervised-tumor baseline reveals some of the limitations of our methodology. We explain this performance gap due to two factors. First, the majority of the images marked as positive contain tumor lesions comprised of only a few tumor cells, so-called micro-metastasis, becoming almost undetectable with the compression setup tested in this work. Second, the lack of training data. With only a few hundred training images our model easily falls in the regime of overfitting. Future work aims to improve image resolution within compressed image representations and sample efficiency.
We acknowledge several limitations of our method. First, we did not test the impact in performance of several method’s hyper-parameters due to restrictions in computational resources, which could have resulted in a reduced overall performance. In particular, we noticed that our method failed to detect small tumor lesions, possibly caused by a suboptimal selection of stride and patch size . Second, we observed a performance gap between the best unsupervised encoder and the supervised baseline for the patch-based classification benchmark, indicating that our encoding model extracts suboptimal features.
As future work, we propose to extend this method in multiple ways. First, investigating more sophisticated encoders in order to improve the low-dimensional representation of the image patches [16, 38, 39]. Second, incorporating attention mechanisms allowing the CNN to attend to relevant regions for the image-level labels with ease . Finally, training the encoder with image-level information using gradient checkpointing .
We showed that our method for gigapixel neural image compression was able to distill relevant information into compact image representations. The fact that we could train a CNN using these alternative learned representations opens the door to other potential methods that no longer consider gigapixel images as low-level pixel arrays, but operate in a higher level of abstraction. In this work, we showed examples of classification, regression and visualization performed in a latent space learned by a neural network. We believe that these positive results can pave the way for more advanced gigapixel applications performed in the latent space such as data augmentation, generative modeling, content retrieval, anomaly detection and image captioning.
This study was financed by a grant from the Radboud Institute of Health Sciences (RIHS), Nijmegen, The Netherlands. The authors would like to thank Dr. Mitko Veta for evaluating our predictions in the test set of the TUPAC dataset; and the developers of Keras, the open source tool that we used to run our deep learning experiments.
, “Weakly supervised learning for whole slide lung cancer image classification,” inMedical Imaging with Deep Learning, 2018.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
, “Unsupervised feature extraction by time-contrastive learning and nonlinear ica,” inAdvances in Neural Information Processing Systems, 2016, pp. 3765–3773.
, “Imagenet classification with deep convolutional neural networks,” inAdvances in neural information processing systems, 2012, pp. 1097–1105.
Two networks are trained simultaneously, the encoder and the decoder . The task of is to map an input patch to a compact embedded representation , and the task of is to reconstruct from , producing . In this work, we used a more sophisticated version of AE, the variational autoencoder (VAE) , with and .
The encoder in the VAE model learns to describe with an entire probability distribution, in particular, given an input , the encoder outputs and , two embeddings representing the mean and standard deviation of a normal distribution so that:
with and denoting element-wise multiplication.
The architecture of
consisted of 5 layers of strided convolutions with 32, 64, 128, 256 and 512 3x3 filters, batch normalization (BN) and leaky-ReLU activation (LRA); followed by a dense layer with 512 units, BN and LRA; and a linear dense layer withunits.
The architecture of the decoder consisted of a dense layer with 8192 units, BN and LRA, eventually reshaped to ; followed by 5 upsampling layers, each composed of a pair of nearest-neighbor upsampling and a convolutional operation , with 256, 128, 64, 32 and 16 3x3 filters, BN and LRA; finalized with a convolutional layer with 3 3x3 filters and tanh activation.
We trained the VAE model by optimizing the following objective:
with representing a single data sample, a sample from , a scaling factor, and and as the parameters of and , respectively. Notice that we optimized and to minimize both the reconstruction error between the input and output data distributions, and the KL divergence between the embedding distribution and the normal distribution with .
using stochastic gradient descent with Adam optimization and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting fromvery time the validation loss plateaued until . Finally, we selected the encoder corresponding to the VAE model with the lowest validation loss.
We assembled a training dataset composed of pairs of patches with where each pair was associated with a binary label , and . In order to solve this binary classification task, we trained a two-branch Siamese network  called . Both input branches shared weights and consisted of the same encoding architecture as the VAE model. After concatenation of the resulting embedding vectors, a MLP followed consisting of a dense layer with 256 units, BN and LRA; finalized by a single sigmoid unit.
We minimized the binary cross-entropy loss using stochastic gradient descent with Adam optimization and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting from very time the validation classification accuracy plateaued until . Finally, we selected the encoder corresponding to the with the highest validation classification accuracy.
We trained a BiGAN setup consisting of three networks: a generator , a discriminator and an encoder . mapped a latent variable drawn from a normal distribution into artificial images :
whereas mapped images sampled from the true data distribution into embeddings :
During training, the three networks played a minimax game where the discriminator tried to distinguish between actual and artificial image-embedding pairs, i.e. and respectively, while and tried to fool by producing increasingly more realistic images and embeddings , i.e. closer to . We used and .
Given the difficulty of training a stable BiGAN model, we downsampled by a factor of 2 before feeding it to the model. The architecture of the encoder consisted of 4 layers of strided convolutions with 128 3x3 filters, BN and LRA; followed by a linear dense layer with units.
The architecture of the generator consisted of a dense layer with 1024 units, BN and LRA, eventually reshaped to ; followed by 4 upsampling layers, each composed of a pair of nearest-neighbor upsampling and a convolutional operation , with 128 3x3 filters, BN and LRA; finalized with a convolutional layer with 3 3x3 filters and tanh activation.
The discriminator had two inputs, a low-dimensional vector and an image. The image was fed through a network with an architecture equal to but different weights, and the resulting embedding vector concatenated to the input latent variable. This concatenation layer was followed by two dense layers with 1024 units, LRA and dropout (0.5 factor); finalized with a sigmoid unit.
We optimized the following objective function:
with , and representing the parameters of , and , respectively.
We minimized using stochastic gradient descent with Adam optimization, 64-sample mini-batch, and fixed learning rate of or a total of pochs. Finally, we selected the encoder
corresponding to the last epoch.
We extracted the embedding by averaging the pixel intensity across spatial dimensions from input RGB patches :
with indexing the three RGB color channels, and and indexing the two spatial dimensions.
We trained an encoder identical to the one used in the VAE model, followed by a dense layer with 256 units, BN and LRA; and finalized by a single sigmoid unit.
We minimized the binary cross-entropy loss using stochastic gradient descent with Adam optimization and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting from very time the validation classification accuracy plateaued until . Finally, we selected the encoder corresponding to the model with the highest validation classification accuracy.
We trained an encoder
identical to the one used in the VAE model, followed by a dense layer with 256 units, BN and LRA; and finalized by a softmax layer with nine units.
We minimized the categorical cross-entropy loss using stochastic gradient descent with Adam optimization and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting from very time the validation classification accuracy plateaued until . Finally, we selected the encoder corresponding to the model with the highest validation classification accuracy.
On top of each encoder with frozen weights, we trained an MLP consisting of a dense layer with 256 units, BN and LRA; followed by either a single sigmoid unit or a softmax layer with nine units, respectively for each classification task.
We minimized the cross-entropy loss using stochastic gradient descent with Adam optimization and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting from very time the validation classification accuracy plateaued until . Finally, we selected the model with the highest validation classification accuracy.
We designed a CNN architecture consisting of 8 layers of strided depthwise separable convolutions  with 128 3x3 filters, BN, LRA, feature-wise 20% dropout, L2 regularization with oefficient, and stride of 2 except for the 7-th and 8-th layers with no stride; followed by a dense layer with 128 units, BN and LRA; and a final output unit, with linear or sigmoid activation for regression or classification tasks, respectively.
We trained the CNN using stochastic gradient descent with Adam optimization and 16-sample mini-batch, decreasing the learning rate by a factor of 10 starting from very time the validation metric plateaued until . We minimized MSE for regression, and maximized binary cross-entropy for binary classification.
Given a compressed gigapixel image , its associated image-level label and a trained CNN , we performed a forward pass over , producing a set of intermediate three-dimensional feature volumes , with and indicating the -th and -th convolutional layer and feature map, respectively. Additionally, we computed the gradients of the feature volume with respect to the class output , for a fixed convolutional layer. We averaged the gradients across the spatial dimensions and obtained a set of gradient coefficients indicating how relevant each feature map was for the desired output . Finally, we performed a weighted sum of the feature maps using the gradient coefficients :
What we obtained was a two-dimensional heatmap that highlighted the regions of that were more relevant for to predict . In order to maximize the resolution of the generated heatmap, we selected .
Grad-CAM heatmaps for Camelyon16 and TUPAC16 can be found here: https://drive.google.com/drive/folders/16E-06rFbGab6-pXfjpo9vXjBVyUQKUuc