Learning Representations with Contrastive Self-Supervised Learning for Histopathology Applications

by Karin Stacke et al.
Linköping University

Unsupervised learning has made substantial progress over the last few years, especially by means of contrastive self-supervised learning. The dominating dataset for benchmarking self-supervised learning has been ImageNet, for which recent methods are approaching the performance achieved by fully supervised training. The ImageNet dataset is however largely object-centric, and it is not clear yet what potential those methods have on widely different datasets and tasks that are not object-centric, such as in digital pathology. While self-supervised learning has started to be explored within this area with encouraging results, there is reason to look closer at how this setting differs from natural images and ImageNet. In this paper we make an in-depth analysis of contrastive learning for histopathology, pin-pointing how the contrastive objective will behave differently due to the characteristics of histopathology data. We bring forward a number of considerations, such as view generation for the contrastive objective and hyper-parameter tuning. In a large battery of experiments, we analyze how the downstream performance in tissue classification will be affected by these considerations. The results point to how contrastive learning can reduce the annotation effort within digital pathology, but that the specific dataset characteristics need to be considered. To take full advantage of the contrastive learning objective, different calibrations of view generation and hyper-parameters are required. Our results pave the way for realizing the full potential of self-supervised learning for histopathology applications.



Code Repositories


This is the official code for the paper Learning Representations with Contrastive Self-Supervised Learning for Histopathology Applications.


1 Introduction

Deep learning has in the last decade shown great potential for medical image-analysis applications (Litjens et al., 2017). However, the transition from research results to clinically deployed applications is slow. One of the main bottlenecks is the lack of high-quality labeled data needed for training models with high accuracy and robustness, as annotations are cumbersome to acquire and rely on medical expertise (Stadler et al., 2021). An active research field has therefore focused on reducing the dependence on labeled data. This can, for example, be accomplished through transfer learning (Yosinski et al., 2014), where pre-training on a widely different dataset, such as ImageNet (Deng et al., 2009), can reduce the amount of training data needed in a targeted downstream medical imaging application (Truong et al., 2021). However, as ImageNet contains natural images, there is a large discrepancy between the source and target domains in terms of colors, intensities, contrasts, image features, class distribution, etc. It has been shown that a closer resemblance between the source and target datasets is preferable (Cui et al., 2018; Cole et al., 2021; Li et al., 2020), and the usefulness of ImageNet for pre-training in medical imaging has been questioned (Raghu et al., 2019). In addition, ImageNet pre-training has recently also been questioned because of the biased nature of the dataset, intensifying the need to find alternative pre-training methods that are better tailored for the target application (Birhane and Prabhu, 2021; Yang et al., 2021).

Self-supervised learning (SSL) has recently emerged as a viable technique for creating pre-trained models without the need for large, annotated datasets. Instead, pre-training is performed on unlabeled data by means of a proxy objective, for which the labels can be automatically generated. The objective is formulated such that the model learns a general understanding of image content. It can include predicting image rotation (Gidaris et al., 2018), solving jigsaw puzzles (Noroozi and Favaro, 2016), or re-coloring gray-scale images (Larsson et al., 2017), to mention a few. One family of SSL methods is based on a contrastive training objective, formulated to attract the representations of two positive views while simultaneously repelling the representations of negative views. This training strategy has shown great promise in the last few years, and includes methods such as CPC (van den Oord et al., 2019), SimCLR (Chen et al., 2020), CMC (Tian et al., 2020a), and MoCo (He et al., 2020). However, successful results have primarily been presented on ImageNet, despite the above-mentioned need for moving away from this dataset, and it is still unclear how well the results generalize to datasets with different characteristics.

In this paper, we investigate how SimCLR (Chen et al., 2020) can be extended to learn representations for histopathology applications. We take a holistic approach, comparing how differences between the ImageNet dataset and histopathology data influence the SSL objective, as well as pin-pointing how the different components of the objective contribute to the learning outcome. We show that the heuristics that have been demonstrated to work well for natural images do not necessarily apply in histopathology scenarios. Through a rigorous experimental study, which includes three different pathology datasets, we connect the analysis to the actual learning outcome, i.e., how well contrastive pre-training contributes to the target downstream application of tissue classification. Our main motive is to clarify how contrastive SSL for histopathology cannot be considered under the same assumptions as for natural image data. Our results lead to a number of important conclusions. For example, we show that:

  • In pathology, SimCLR pre-training gives substantial benefits if used correctly.

  • Different types of positive/negative views are optimal for contrastive SSL in histopathology compared to natural images, and the optimal views can be dataset dependent even within the pathology domain.

  • Parameter tuning, such as the batch size used for SSL, does not have the same influence as for natural images, due to the differences in data characteristics.

  • Pre-training data aligned with the target pathology sub-domain is better suited than more diverse pathology data.

We conclude with an outlook on what needs to be considered for further improving contrastive SSL in histopathology, where we emphasize how the differences in data characteristics require a significantly different approach to formulating the contrastive learning objective. We believe that this work is important for broadening the understanding of self-supervised methods, and how the intrinsic properties of the data affect the representations. Our hope is that this will be a stepping stone towards pre-trained models better tailored for histopathology applications.

2 Related Work

A large body of literature has been devoted to unsupervised and self-supervised learning. For self-supervised learning, multiple creative methods have been presented for defining proxy objectives and performing self-labelling. These include, but are not limited to, colorization of grayscale images 

(Zhang et al., 2016; Larsson et al., 2017), solving of jigsaw puzzles (Noroozi and Favaro, 2016), and prediction of rotation (Gidaris et al., 2018).

Within self-supervised learning, significant attention has recently been given to a specific family of methods that performs instance discrimination (Dosovitskiy et al., 2014; Wu et al., 2018) through contrastive learning with multiple views. Bachman et al. (2019) presented a contrastive self-supervised method (AMDIM) based on creating multiple views using augmentation. van den Oord et al. (2019) presented the InfoNCE objective, and showed that minimizing it maximizes a lower bound on the mutual information between views. Building on these works, contrastive self-supervised methods such as CMC (Tian et al., 2020a), MoCo(v2) (He et al., 2020; Chen et al., 2020), and SimCLR (Chen et al., 2020) have recently shown improved results on ImageNet benchmarks, closing the gap between supervised and unsupervised training. Falcon and Cho (2020) showed that many of these methods (such as AMDIM, CPC, and SimCLR) are special cases of a general framework for contrastive SSL. As a continuation of this development, methods such as BYOL (Grill et al., 2020) and SwAV (Caron et al., 2020) have been presented, expanding the concept to either avoid contrastive negatives or to perform cluster assignments instead of instance discrimination. In this work, we use SimCLR as a representative method of the contrastive self-supervised methodology. However, we focus on the integral components of contrastive learning that are also shared by other contrastive methods. This means that the results most likely also generalize to other state-of-the-art methods for contrastive SSL.

Self-supervised methods have also been applied to histopathology. One approach has been to apply methods constructed for natural images without modification (Lu et al., 2019; Dehaene et al., 2020; Ciga et al., 2020). The generality of these results has, however, been questioned; for example, CPC has been shown to perform sub-optimally for histology images (Stacke et al., 2020). This has motivated the development of methods tailored for histology data, such as incorporating the spatial information of patches (Gildenblat and Klaiman, 2020; Li et al., 2021), using augmentations based on stain separation (Yang et al., 2021), or using transformer architectures to capture global information (Wang et al., 2021). However, the contrastive objective has not been thoroughly investigated to fully understand what (if any) aspects of it generalize to domains other than natural images. Moreover, modifications to the objective have not been convincingly motivated. In this work, we take a holistic approach to both the contrastive objective and how the unique properties of histology data comply with the methodology.

Some works give a more rigorous theoretical background to contrastive methods, such as Arora et al. (2019), Tsai et al. (2021), Wu et al. (2020), and Tschannen et al. (2020). However, much of the success of the previously mentioned methods is derived from heuristics that are still left to be explained theoretically. It is not clear how well the performance shown in one domain transfers to new and different ones. As Torralba and Efros (2011) pointed out some time ago, all datasets, ImageNet included, encompass specific biases that may be inherited by a model trained on the data. Purushwalkam and Gupta (2020), for example, argued that the object-centric nature of ImageNet is the reason why SSL methods based on heavy scale augmentations perform well, but that this approach does not work for object recognition tasks. Cole et al. (2021) showed that contrastive learning is less suited for tasks requiring more fine-grained details, and that pre-training on out-of-domain data gives little benefit. Therefore, we have reason to look closely at how contrastive learning methods transfer to the domain of histopathology.

3 Background

This section gives an overview of contrastive multi-view learning as well as view generation, with the goal of giving a conceptual description of how different design choices affect the learned representation.

3.1 Contrastive learning

The general idea of contrastive learning is that an anchor data point (sometimes referred to as query) together with a positive data point (key) form a set of positive views of the same object. The goal is to map these views to a shared representation space, such that the representations contain underlying information (features) shared between them, while simultaneously discarding other (nuisance) information. A positive view pair could therefore share the information of depicting the same object, but may differ in view angle, lighting conditions, or occlusion level.

For images, this shared information is high-dimensional, which makes its estimation challenging. The views are therefore encoded to a more compact representation using a non-linear encoder $f$, $z = f(v)$, such that the mutual information between the view representations $z_1$ and $z_2$ is maximized. The mutual information can be maximized by minimizing a contrastive loss function (van den Oord et al., 2019), defined to assign high similarity to positive pairs of data $(x, x^+)$ and low similarity to all others $(x, x^-)$ (denoted "negatives"). A popular such loss function is the InfoNCE loss, defined as

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[ \log \frac{\exp\left(\mathrm{sim}(z, z^+)\right)}{\exp\left(\mathrm{sim}(z, z^+)\right) + \sum_{j=1}^{N} \exp\left(\mathrm{sim}(z, z_j^-)\right)} \right]$$

where $z$, $z^+$, and $z_j^-$ are the encoded representations of the anchor, positive, and negatives, and $\mathrm{sim}(\cdot,\cdot)$ is a similarity (score) function.

The choice of positives can be made either in a supervised way, where coupled data is collected (such as multiple stainings of the same tissue sample), or in a self-supervised manner, where the views are automatically generated. One popular approach of the latter kind is to create two views from the same data point $x$ by applying random transformations $t, t' \sim \mathcal{T}$, such that two views of the data sample are created, $v_1 = t(x)$ and $v_2 = t'(x)$. Negative samples are typically taken as randomly selected samples from the training data. View generation, that is, how the views are chosen, has a direct impact on what features the model will learn.
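To make the loss concrete, below is a minimal numpy sketch of the InfoNCE computation for a single anchor. All names are hypothetical, and cosine similarity with a temperature is assumed as the score function; the experiments in this paper use SimCLR's closely related NT-Xent variant, described in Section 4.3.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE for one anchor: cross-entropy where the positive view is
    the "correct class" among the negatives (minimal sketch, not reference code)."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    anchor, positive, negatives = unit(anchor), unit(positive), unit(negatives)
    pos = (anchor @ positive) / temperature      # similarity to the positive view
    negs = (negatives @ anchor) / temperature    # similarities to the negatives
    logits = np.concatenate(([pos], negs))
    return -pos + np.log(np.exp(logits).sum())   # softmax cross-entropy, positive as target

rng = np.random.default_rng(0)
z = rng.normal(size=8)
negatives = rng.normal(size=(16, 8))
loss_shared = info_nce(z, z, negatives)              # views share everything
loss_unrelated = info_nce(z, rng.normal(size=8), negatives)
```

As expected, the loss is lower when the two views share information (here, when they are identical) than when the positive is unrelated to the anchor, which is exactly what the objective rewards.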

3.2 View Generation

Due to the way the contrastive objective is formulated, the choices of how positive and negative views are selected will largely impact the learned representation. If they are chosen correctly, the learned representation will separate data in a way that is useful for the target downstream task. Incorrectly chosen, the model may learn nuisance features that result in poor separation of the data with respect to the downstream task. Optimally selecting positives and negatives thus depends on the intended downstream task, since it needs to take into account what is considered task-relevant and task-irrelevant (the notions "task-irrelevant" and "nuisance" features will be used interchangeably) (Tian et al., 2020b). For example, color may be considered a nuisance variable in the downstream task of tumor classification in H&E slides and should therefore not be shared between positives, but may be an important feature if the downstream task is scoring of immunohistochemical staining. As the SSL is task-agnostic, that is, the downstream task is unknown with regard to the self-supervised objective, view generation is critical for controlling what features the model learns, such that they are tailored to the downstream task.

Figure 1 shows the relationship between shared mutual information (the SSL objective) and view generation. Positive views (top) can be selected/created such that 1) no task-relevant features are shared, so the model will use only task-irrelevant features to solve the self-supervised objective, 2) only task-relevant features are shared, so the model learns (a subset of) these, or 3) both task-relevant and irrelevant features are shared, increasing the risk of the model learning so-called shortcut features (Geirhos et al., 2020), which are often low-level features (such as color). If the views are created with augmentations (which is what we will consider in this work), these cases correspond to 1) too strong, 2) appropriately calibrated, or 3) too weak transformations. To achieve optimal performance on the downstream task, the model should learn the minimally sufficient solution (Tsai et al., 2021), such that two positive views share as much task-relevant information as possible, and as little task-irrelevant information as possible (middle column, Figure 1).

As highlighted by Arora et al. (2019), the choice of negatives (which generally are randomly selected from the mini-batch) is also important for the learning outcome. In the bottom row of Figure 1, the relationship between an anchor and negative is shown. If no information is shared, the model does not have to learn any task-relevant features as any feature may solve the pre-training objective. If too much information is shared, the model will not learn task-relevant features as these cannot be used to distinguish between positive and negatives. This is typically the case when negatives belong to the same (latent) class as the positive, making them so called false negatives. The ideal case is when the shared information is composed such that it contains all nuisance information, but no task related information (middle column).

Figure 1: How view generation is done will affect the learning outcome. Circles represent all learnable features for one data sample, where shaded areas denote task-relevant features. Between the anchor ($x$) and the positive ($x^+$), which belong to the same class, sharing of task-relevant features should be maximized, while sharing of task-irrelevant features is simultaneously minimized. Between the anchor ($x$) and the negative ($x^-$), which does not belong to the same class, the opposite should hold.

To further illustrate the relationship between the different views and the self-supervised objective, an example is shown in Figure 2. In this example, view generation resulted in some features shared between the anchor and positive views, of which a subset are task-relevant. Some task-relevant features are, however, not shared, existing in only one of the views. In addition, some features (both task-relevant and irrelevant) are also shared by a negative view. This means that out of all available features, some exist in one, two or all of the views. During model training, the model learns to represent each data point such that the contrastive objective is fulfilled: the anchor-positive views are attracted, and the anchor-negative views are repelled. The attracting features are found in the intersection between the anchor and positive but not in the negative. The repelling features are found in the negative, but not in the anchor. As discussed in the previous section, the region of attracting features should therefore contain primarily task-relevant features, and the repelling region should only contain task-irrelevant features. It should, however, be noted that there is no guarantee that the model will learn all features in these regions, but only the subset of features that is enough to solve the contrastive objective, as observed by Tian et al. (2020b). In the end, as the contrastive objective is task-agnostic and completely relies on this distinction of features over the training dataset, the degree of task relevance of the learned representation will depend on how the views were generated.

Figure 2: View generation results in shared features between anchor, positive and negative views. Due to the contrastive objective, only a subset of all available features are learned by the model. These features are either attracting features or repelling features, where view generation is the only means of controlling to what extent these features are task-relevant.

4 Method

The goal of this paper is to better understand how contrastive self-supervised learning (SSL) can be used for clinical applications where labeled data is scarce. We do this by evaluating different pre-training methods for the target downstream application, i.e., classification with varying amounts of labeled data. In doing so, it is necessary to systematically analyze and understand the impact of different pre-training methods and training strategies, and how this relates to the type of data used and the target task. In this section, we present the datasets, training details and evaluation metrics.

4.1 Experiment design

By using SimCLR as a representative method for contrastive learning, we build our investigation on a series of experiments where we vary the parameters and methodologies relevant to the analysis.

  • Training SimCLR models on in-domain and out-of-domain histology data. It is of key interest to understand how pre-training on different histopathology sources, and with different augmentations, affects the learning outcome. The results from this analysis will form the basis of our discussion on self-supervised learning for histopathology data.

  • Comparing domain-specific SimCLR with ImageNet pre-training and no pre-training. Previous works often rely on transfer learning from ImageNet pre-training. Motivated by the strong differences between pathology data and ImageNet data in terms of, e.g., image content, number of classes, and overall composition, we systematically evaluate whether domain-specific pre-training using SimCLR is more beneficial in this context, and if so, why.

  • Evaluation of different amounts of supervised data for downstream-task training. Pre-trained models are evaluated both with respect to linear and fine-tuning performance with varying amounts of supervised training data.

  • Batch size and learning rate impact. Tuning of hyper-parameters such as batch size and learning rate has been shown to play an important role in contrastive learning on ImageNet. This experiment explores the corresponding parameter tuning for histopathology data.

  • Evaluation of performance during training. Training dynamics presents important information on model robustness, the optimization, and overall performance. This experiment investigates downstream task training and how the resulting models evolve over time.

These experiments, conducted with multiple datasets, form the basis for a detailed evaluation of contrastive self-supervised learning in general, and SimCLR in particular, in the context of histopathology.

4.2 Datasets

(a) ImageNet
(b) Camelyon16
(c) AIDA-LNSK
(d) Multidata
Figure 3: Example images from the datasets.

For this study, three different histopathology datasets were used: one from sentinel breast lymph node tissue, one from skin tissue, and one consisting of mixed tissue types extracted as a subset of 60 different datasets. ImageNet is used as a reference dataset. Example images are shown in Figure 3, with further details given below and in Appendix A.

ImageNet ILSVRC2012 (Deng et al., 2009): a dataset constructed by searching the internet using keywords listed in the WordNet database. A subset of the total dataset, approximately 1.2M images, was labeled as belonging to one of 1000 classes using crowd sourcing. Despite the aim of being a representative dataset covering a wide range of object categories, the collection technique and annotation strategy have resulted in distinct characteristics and biases in the data, which models trained on the dataset may inherit (Torralba and Efros, 2011). SSL methods developed and tested on this dataset are therefore also likely to adhere to some inherent characteristics of the data (Cole et al., 2021). Using ImageNet pre-trained weights for transfer learning is a common approach in many medical image applications, which motivates us to use it as a baseline method. Pre-trained models (trained with supervision) were accessed through the Pytorch library (https://pytorch.org/vision/stable/models.html).


Camelyon16 (Litjens et al., 2018): 399 H&E-stained whole-slide images (WSIs) of sentinel lymph node tissue, annotated for breast cancer metastases. This dataset was sampled into smaller patches twice, to construct one dataset used for self-supervised training and one for supervised training. For unsupervised training, the WSIs were sampled without annotation guidance, i.e., no tissue annotations were used to guide sampling from the 270 slides selected as training slides in the official split. Patches were sampled non-overlapping with a patch size of 256x256 pixels at a resolution of 0.5 microns per pixel (mpp) (approximately 20x). A maximum of 1000 samples was chosen per slide, resulting in a dataset consisting of slightly less than 270k images.

For supervised training, a downstream task was formulated as a binary tumor classification task. The patches were sampled in accordance with the PatchCamelyon (Veeling et al., 2018) dataset, which is a pre-defined probabilistic sampling of the Camelyon16 dataset using the pixel annotations, resulting in a class balance between tumor and non-tumor labels. The pre-sampled PatchCamelyon dataset is sampled at 10x with a patch size of 96x96 pixels; in this study, the data was resampled to match the unsupervised dataset (at the same coordinates as the original dataset). In line with the PatchCamelyon dataset, training/validation/test samples were taken from 216/54/129 slides respectively. In addition, subsets (possibly overlapping) of the supervised training dataset were selected by taking all patches from 10, 20, 50, or 100 random slides, respectively (the smaller subsets are subsets of the larger ones). This was repeated five times to create five folds for each subset size. For more details about PatchCamelyon, see Veeling et al. (2018). Pre-training SimCLR models using the unsupervised dataset is therefore considered in-domain pre-training, as the same slides are re-sampled and used for training the supervised downstream task.
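The nested-subset construction above can be sketched as follows. Slide identifiers are placeholders; only the 216 training slides and the 10/20/50/100 subset sizes come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
training_slides = np.arange(216)  # the 216 training slides mentioned above

# Five folds; within each fold, the smaller subsets are subsets of the
# larger ones, mirroring the 10/20/50/100-slide construction. All patches
# from the selected slides would then form the supervised training subset.
folds = []
for _ in range(5):
    order = rng.permutation(training_slides)
    folds.append({k: set(order[:k]) for k in (10, 20, 50, 100)})
```

Because each fold draws a fresh permutation, the folds may overlap with each other, while the nesting within a fold is guaranteed by construction.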


AIDA-LNSK (Lindman et al., 2019): a dataset containing 96 WSIs of skin tissue from 71 unique patients. The data was split into train, validation, and test sets on patient level, with 50, 6, and 15 patients respectively, resulting in 65, 8, and 23 WSIs per set. In analogy with Camelyon16, AIDA-LNSK was sampled to create two datasets: one for downstream-task training, and a corresponding in-domain dataset for pre-training.

For unsupervised training, patches were extracted from tissue regions of the slides in the training set (65 slides), identified by Otsu thresholding at 5x WSI magnification. From these regions, the data was sampled without overlap, resulting in an unsupervised dataset of approximately 270k patches, roughly the same size as the unsupervised Camelyon16 dataset. All patches were extracted at 0.5 mpp resolution with a size of 256x256 pixels.
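The tissue-versus-background step could, under the stated setup, look like this minimal Otsu implementation. This is a sketch for illustration; in practice a library routine such as scikit-image's `threshold_otsu` would typically be used, and the synthetic image here only stands in for a downsampled slide.

```python
import numpy as np

def otsu_threshold(gray):
    """Minimal Otsu's method for an 8-bit grayscale image: pick the
    threshold that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    w0 = np.cumsum(p)                      # weight of the "below threshold" class
    m = np.cumsum(p * np.arange(256))      # cumulative mean
    mt = m[-1]                             # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mt * w0 - m) ** 2 / (w0 * (1 - w0))
    var_between = np.nan_to_num(var_between)  # degenerate thresholds contribute 0
    return int(np.argmax(var_between))

# synthetic stand-in: bright glass background vs. darker stained tissue
rng = np.random.default_rng(0)
background = rng.normal(220, 5, size=5000).clip(0, 255)
tissue = rng.normal(120, 10, size=5000).clip(0, 255)
img = np.concatenate([background, tissue]).astype(np.uint8).reshape(100, 100)
t = otsu_threshold(img)  # threshold falls between the two intensity modes
```

Pixels below the threshold would be treated as tissue, and only patches inside those regions are sampled.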

From AIDA-LNSK, a downstream task was constructed as a five-class tissue classification task, using the available pixel-level annotations. The five classes were formed as four classes representing healthy tissue types (dermis, epidermis, subcutaneous tissue, and skin appendage structure) and one class representing "abnormal" tissue (containing different types of cancer, inflammation, scarring, and so on). The slides from the above-mentioned split were sampled (same size and resolution as the unsupervised dataset) such that each class in the supervised training dataset included at least 75,000 samples, resulting in approximately 320k patches. The training set was thereafter subdivided by randomly selecting all patches from 10, 20, and 50 slides from the original 65, where smaller subsets are true subsets of larger ones. This was repeated 5 times, such that for each subset size, 5 (possibly overlapping) datasets were created. The validation and test sets were sampled from the respective slides in a class-balanced way. For more information about the data collection and annotations, please see Stadler et al. (2021).


Multidata: Ciga et al. (2020) constructed a multi-data dataset, consisting of samples from 60 publicly available datasets, originating from multiple tissue types. This pre-sampled dataset was created without supervision and is used in this study for self-supervised training only. Patches were extracted with a size of 224x224 pixels at the maximum available resolution per dataset, resulting in a variation of resolution between patches (0.25–0.5 mpp). In this study, we use a 1% subset of this data, provided by the authors, consisting of 40k patches. In relation to the downstream tasks of breast tumor classification and skin tissue classification, this data is considered out-of-domain.

4.3 Training

For all experiments, the ResNet50 (He et al., 2016) model architecture was used. As the self-supervised method, SimCLR was evaluated; unless stated otherwise, the same training setup was used as in Chen et al. (2020).

The SimCLR objective is to minimize the NT-Xent loss. For a positive pair $(z_i, z_j)$ this is defined as

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

where $(z_i, z_j)$ is a positive pair and $(z_i, z_k)$, $k \neq i, j$, are the negative ones, with the similarity function defined as $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ (cosine similarity). The temperature scaling, $\tau$, was set to 0.5 for all experiments.
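A minimal numpy sketch of this loss is shown below. The pairing convention (rows 2k and 2k+1 of the batch are the two views of sample k) is an assumption for illustration; this is not the reference SimCLR implementation.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent over a batch of 2N embeddings, assuming rows 2k and 2k+1
    are the two views of sample k (minimal sketch)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via dot products
    sim = (z @ z.T) / temperature
    n2 = sim.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # a view is never its own negative
    pos_idx = np.arange(n2) ^ 1                       # partner index: 1, 0, 3, 2, ...
    log_prob = sim[np.arange(n2), pos_idx] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
loss_matched = nt_xent(np.repeat(x, 2, axis=0))  # each pair is two identical "views"
loss_random = nt_xent(rng.normal(size=(8, 8)))   # pairs that share no information
```

When the paired views are perfectly aligned the loss is low; for unrelated "pairs" it is higher, reflecting the attraction/repulsion structure of the objective.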

In SimCLR, augmentations are used to create the positive views from the same anchor sample. Henceforth, "original" augmentations will refer to the augmentations defined in the SimCLR paper (randomly applied resize-and-crop, horizontal flip, color jittering, and Gaussian blur), with the modification that when training with histopathology data, vertical flip and random 90-degree rotations were added (due to the rotation invariance of histopathology data). For examples, please see Appendix B.
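The pathology-specific additions can be sketched as follows. This hypothetical minimal version covers only the flip/rotation part; the full pipeline also includes crop/resize, color jitter, and blur.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Flip/rotation part of view generation for a histopathology patch
    (an H, W, C array). These transformations are label-preserving
    because tissue has no canonical orientation."""
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=0)  # vertical flip (pathology-specific addition)
    patch = np.rot90(patch, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    return patch

img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
view1, view2 = augment(img), augment(img)  # two positive views of the same patch
```

Since flips and 90-degree rotations only permute pixels, each view keeps the exact pixel content of the original patch, only rearranged.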

Following commonly used protocol, the self-supervised pre-training was done once (due to computational and time constraints). The resulting representation is evaluated primarily using a linear classifier on top of the frozen weights, but also using fine-tuning (a linear classifier on top, without freezing any weights). The former method is a way of evaluating the quality of the pre-trained representations with regard to the target data and objective, while the latter is a more realistic use-case of the trained weights. For supervised training, each run was repeated 5 times with different seeds, and when subsets of the training data are used, also with different folds. All results are reported as patch-wise accuracy on class-balanced test sets.
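The linear-evaluation protocol can be illustrated with a toy probe on frozen features. Everything here is hypothetical: random, well-separated features stand in for the frozen encoder output, and plain gradient descent replaces the actual optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe(features, labels, n_classes, lr=0.1, epochs=200):
    """Linear evaluation: train only a softmax head (W, b) on fixed
    features; the encoder producing the features is never updated."""
    n, d = features.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                      # softmax cross-entropy gradient
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# stand-in "frozen features" from two well-separated classes
feats = np.vstack([rng.normal(0, 1, (50, 4)) + 3, rng.normal(0, 1, (50, 4)) - 3])
labs = np.array([0] * 50 + [1] * 50)
W, b = linear_probe(feats, labs, n_classes=2)
acc = np.mean(np.argmax(feats @ W + b, axis=1) == labs)
```

If the frozen features already separate the classes well, the linear head reaches high accuracy, which is exactly what linear evaluation is designed to measure; fine-tuning differs only in that the encoder weights would also receive gradients.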

All training was conducted on 4 NVIDIA V100 or NVIDIA A100 GPUs. For SimCLR training, an effective batch size of 1024 was used for 200 epochs (training time approximately 24 hours). LARS (You et al., 2017) was used as optimizer, with the learning rate regulated by a cosine annealing scheduler.

For linear evaluation, models were trained in a supervised manner for 20 epochs using the Adam optimizer with an initial learning rate of 0.01. For fine-tuning, models were trained for 50 epochs; for breast, the Adam optimizer with weight decay was used, and for skin, the SGD optimizer with Nesterov momentum (momentum parameter 0.9). In addition, models were trained in a supervised manner from random initialization ("from scratch"), using the Adam optimizer with learning rate 0.001 for 50 epochs. Common to all supervised training was the use of a cosine annealing scheduler to reduce the learning rate, and weighted sampling to mitigate effects of class imbalance. Augmentations applied during supervised training consisted of: random resize-crop with scale variance between 0.95–1.0, color jittering, and rotation/flip.
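The shared learning-rate schedule can be written as a simple function of the epoch. This is a minimal sketch of cosine annealing as described above; no warmup or restarts are assumed.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine annealing from base_lr down to zero over total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# e.g. the 50-epoch from-scratch runs with Adam starting at lr = 0.001
schedule = [cosine_lr(epoch, 50, 0.001) for epoch in range(50)]
```

The schedule starts at the base learning rate, decreases monotonically, and approaches zero at the final step, which smooths convergence at the end of training.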

5 Results

Below follow detailed descriptions and results from the experiments outlined in Section 4.1.

5.1 Positive-view generation by augmentation

Augmentations                              Breast   Skin
+ {Gaussian blur}                          +1.26    +0.37
+ {Scale}                                  +3.94    +0.64
+ {Gaussian blur, Scale} (SimCLR Orig.)    +0.20    +1.25
+ {Scale, Grid Distort}                    +2.44    +1.42
+ {Scale, Grid Distort, Shuffle}           +1.72    +1.39
+ {Grid Shuffle}                           +0.37    +2.83
+ {Grid Distort, Shuffle}                  +1.60    +3.79
Table 1: Relative improvement (percentage points) over using base augmentations only (rotate, flip, color jitter). SimCLR models trained for 50 epochs; linear evaluation on the supervised training set from 50 slides.

We here investigate the effect of different augmentations, and their ability to isolate the task-relevant features in order to improve the correlation between the SSL objective and the downstream performance.

SimCLR models were trained on the unsupervised datasets from either breast or skin, and evaluated after 50 epochs on in-domain data from 50 slides, respectively. Eight different augmentation combinations were evaluated in terms of the relative improvement over the Base augmentations (flip, rotate, color jitter, and very low scale variance). As Chen et al. (2020) found large scale variance (0.2–1.0) together with Gaussian blur beneficial for ImageNet, these augmentations were evaluated both together and individually. Furthermore, two additional augmentations were evaluated, Grid Distort and Grid Shuffle. These were chosen as transformations that preserve the label of histopathology patches, but add perturbations to the composition of the cells. The results are shown in Table 1. Further details and examples of the augmentations can be found in Appendix B.
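To illustrate the Grid Shuffle idea (the exact transforms used are described in Appendix B; this is a minimal numpy sketch with an assumed grid size), a patch is split into a grid of tiles whose positions are permuted. Cell-level content is preserved while the spatial composition is perturbed, which is why the label of the patch survives the transformation:

```python
import numpy as np

def grid_shuffle(image: np.ndarray, grid: int = 4, rng=None) -> np.ndarray:
    """Split an HxWxC image into grid x grid tiles and permute the tiles.

    Minimal sketch of a label-preserving perturbation for histopathology
    patches; H and W are assumed divisible by `grid`.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(grid) for c in range(grid)]
    order = rng.permutation(len(tiles))
    # Reassemble the permuted tiles row by row.
    rows = [np.concatenate([tiles[i] for i in order[r * grid:(r + 1) * grid]],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)
```

Note that the transform only rearranges pixels: the output has the same shape and the same multiset of pixel values as the input, so intensity-based statistics are unchanged.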

Choosing optimal augmentations for histopathology data depends on the dataset and downstream task

Looking at Table 1, choosing the appropriate augmentations for view generation makes it possible to boost performance by 3.94 and 3.79 percentage points for breast and skin, respectively. However, it appears that there is no common set of augmentations which is optimal for both datasets. Furthermore, using the same augmentations that were presented in Chen et al. (2020) as optimal for ImageNet gives sub-optimal performance for histology data. Gaussian blur was found to be of negligible value, and scale was only substantially beneficial for breast data, not for skin. The best set of augmentations for breast data was Base + Scale, while for skin, Base + Grid Distort + Shuffle gave the highest performance.

Thus, different sets of augmentations are optimal for different datasets and different downstream tasks. This is not surprising, as the relevant information in the data depends both on the inherent features of the dataset and on what the downstream task is. Finding task- and data-specific augmentations is therefore needed.

Figure 4: Patch-level performance of linear trainings of different pre-training strategies, at varying supervised training size (number of slides). The linear models are trained on representations learned by: ImageNet supervised pre-training (gray, solid), or using SimCLR pre-training with different augmentations (dashed/dotted), and on different datasets (in-domain pathology data: red, out-of-domain pathology data: blue). The reference (black, solid) is the full ResNet50 model trained in a supervised way on the (subset) training data. Left: breast data, right: skin data. Note different x-axes between subplots.

5.2 Downstream Performance

In this section, different pre-training strategies are evaluated. By using a randomly initialized model trained in a supervised way as reference, we want to understand the gain of using a pre-trained model depending on the size of the supervised data. In Figure 4, the result of linear evaluation (frozen pre-trained weights) for breast and skin is shown, where the supervised reference is shown in black (solid line). For each tissue type, five different pre-trained models are evaluated: either ImageNet Supervised (gray, solid) or four different configurations of SimCLR, where color denotes dataset (in-domain in red, out-of-domain in blue) and markers denote the augmentations applied (original SimCLR as dashed, best set from Table 1 as dotted). Table 2 shows the fine-tuning results, comparing the pre-training method giving the best performance in the linear evaluation with ImageNet pre-training and no pre-training (random initialization). From the results in the figure and table, we can draw a number of conclusions.

Data    Pre-training                              Supervised size (# slides)
Breast  None
        ImageNet Supervised
        SimCLR Breast, Base + Scale
Skin    None
        ImageNet Supervised
        SimCLR Skin, Base + Distort + Shuffle

Table 2: Fine-tuning performance (patch-level accuracy, %) using no pre-training (random initialized weights), ImageNet pre-training or SimCLR in-domain training.

In-domain pre-training boosts performance, especially in low-supervised data scenarios.

The evaluation of linear training in Figure 4 shows that the best linear separation is given by pre-training using SimCLR on in-domain data with custom augmentations (red, dotted), exceeding ImageNet pre-training (gray, solid). These results are echoed in the fine-tuning case, as shown in Table 2. Notably, for smaller supervised training datasets (fewer than 65 slides), pre-training gives a substantial boost. When significantly more supervised training data is available (100 slides or more), the gain of using pre-trained weights is diminished. This is especially clear for breast data when fine-tuning (Table 2), where no pre-training on the full supervised dataset (216 slides) gave performance similar to initialization from either ImageNet or SimCLR pre-trained weights. For skin, the full supervised dataset (65 slides) is still small enough to make the use of pre-trained weights a good idea.

Optimal dataset and view generation depend on downstream task.

For breast data, we see in Figure 4 that the two in-domain models with different augmentations (red) gave similar performance, while for skin, different augmentations gave a larger difference. Using a custom set of augmentations compared to the original SimCLR gave a significant boost in performance (dotted vs. dashed). Furthermore, using Multidata, a pre-training dataset of out-of-domain pathology data (blue), gave the poorest performance for breast data, independent of augmentation, while for skin, Multidata with the original augmentations was on par with or just slightly worse than the best in-domain model (blue, dashed). This corroborates the theory discussed in Section 3.2: the features learned during pre-training are highly dependent on the data and on how view generation was performed (i.e., what augmentations were applied), and their usefulness and relevance are dependent on the downstream task.

Increasing diversity of pathology data is not beneficial per se.

Both tissue types were evaluated on the Multidata dataset, with two different sets of augmentations each. These augmentations were chosen either as a general approach (SimCLR original) or as dataset-specific augmentations. The Multidata dataset is smaller than the others, but has much larger diversity as it contains samples from a wide range of publicly available datasets. This reduces the risk of false negatives, and could potentially create more diverse sets of features. Looking at the linear performance in Figure 4, we see that the same model trained on Multidata with original augmentations (blue, dashed) performed poorly on the downstream task of breast tumor detection in sentinel lymph node tissue, but gave good results in skin tissue classification. That task-relevant features were extracted for skin, but not for breast, could be explained either through a lack of features in the dataset needed for good tumor detection in lymph node tissue, or through sub-optimal view generation to isolate them. Collecting diverse datasets does not per se guarantee a generalizable model. Having diverse datasets may increase the chance of containing the needed, relevant features, but whether those features are learned depends on the view generation.

(a) Breast
(b) Skin
Figure 5: Training loss and downstream accuracy evaluated for two different models, for breast and skin data respectively. The loss continues to decrease throughout training, while the performance on the downstream application, evaluated at epochs {10, 20, 50, 100, 200}, is almost constant after the initial epochs.

5.3 Effects of hyper-parameters

Large batch sizes and long training times have been shown to play an important role in contrastive self-supervised learning for ImageNet (Chen et al., 2020). Here, we investigate whether this also holds true for histology data.
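As background for these hyper-parameter experiments, the NT-Xent objective that SimCLR optimizes can be written in a few lines of numpy. This is a simplified reference implementation, not the training code used in the paper; it makes explicit that every other sample in the batch acts as a negative, which is why batch size is a natural hyper-parameter to probe:

```python
import numpy as np

def nt_xent_loss(z: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss over 2N embeddings; rows i and i+N form positive pairs.

    All other 2N-2 rows in the batch act as negatives for each anchor,
    so larger batches give more (and potentially more false) negatives.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity from the softmax
    # Index of each row's positive partner: (i, i+N) and (i+N, i).
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(n2), pos]))
```

The loss is the mean cross-entropy of identifying the positive among all in-batch candidates; it decreases as positive pairs become more similar relative to the negatives.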

Longer training does not improve performance.

Figure 5 shows evaluation of linear performance at intervals during training, for two models respectively for breast and skin. Despite continued reduction in training loss (the model is still learning to solve the SSL objective), the performance on the downstream task is changing very little after the first 10 epochs. This indicates that the view generation fails to isolate task-relevant features, making the model rely on task-irrelevant features to solve the SSL objective (scenario shown in Figure 2).

Large batch sizes are not needed.

The motivation for large batch sizes is that they form a better approximation of the true dataset distribution, so that the separation of positives and negatives better reflects that distribution. Figure 6 shows varying batch size for a model trained on breast data with the LARS optimizer. As long as the learning rate is updated according to the batch size (approximately following the relationship in Chen et al. (2020)), a larger batch size does not give better performance. In fact, a batch size of 2048 gave overall lower performance than 256 (which fits on a single GPU).
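The scaling relationship referred to above is elided in the text; assuming the common linear scaling rule from the SimCLR paper (learning rate = 0.3 × BatchSize/256), it can be expressed as a one-line helper. Treat the constants as assumptions when adapting to other setups:

```python
def scaled_lr(batch_size: int, base_lr: float = 0.3,
              base_batch: int = 256) -> float:
    """Linear learning-rate scaling: lr grows proportionally with batch size.

    base_lr = 0.3 per 256 samples follows the SimCLR recipe
    (Chen et al., 2020); the constants are assumptions here.
    """
    return base_lr * batch_size / base_batch
```

For example, batch size 2048 maps to a learning rate of 2.4 under this rule, which also explains the Figure 6 observation that reusing a learning rate of 2.4 with batch size 256 fails to converge: the rate is eight times larger than the rule prescribes for that batch size.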

Figure 6: Performance for SimCLR trained with original augmentations for 50 epochs on breast data, using varying batch size and learning rate. Using learning rate of 2.4 with batch size 256 did not converge.

6 Discussion

From the results in Section 5, we can make some interesting observations. Primarily, with correctly selected augmentations, in-domain contrastive SSL is beneficial as pre-training, especially in low-data regimes. In addition, experiments show that large batch sizes and long training times may not be needed to create pre-trained models, making model-creation more accessible. However, the results also raise concerns, which are discussed below.

6.1 Consequences of different dataset characteristics

Dataset: ImageNet. Characteristics: 1000 classes; good class balance; good variance and isolation. Consequence: many classes, good balance: reduced risk of false negatives. High variance and good isolation: easier view generation.
Dataset: Camelyon16. Characteristics: 2 classes; low class balance; low variance; low/medium isolation. Consequence: few classes, poor balance: higher risk of false negatives. Low variance with low/medium isolation: harder view generation.
Dataset: AIDA-LNSK. Characteristics: 5 classes; medium class balance; low variance; low/medium isolation. Consequence: few classes, medium balance: slightly higher risk of false negatives. Low variance with low/medium isolation: harder view generation.
Table 3: Dataset characteristics and the consequences they have for learning with contrastive self-supervised methods.

The results presented in Section 5 show that the heuristics derived as optimal for ImageNet do not transfer to the other datasets. This suggests that the method is tightly coupled with the dataset. We can identify a few important dataset characteristics that affect the learning outcome, listed in Table 3. In the table, ImageNet is compared with the histopathology datasets of breast sentinel lymph node tissue (Camelyon16) and skin tissue (AIDA-LNSK) (presented in Section 4.2) with respect to these characteristics, and the discussion is expanded below.

Number of classes and class balance affect the risk of false negatives.

When negatives are chosen as random samples from the dataset, the distribution of classes in the dataset affects the risk of getting a high number of false negatives. With many classes and perfect balance between them, the risk of drawing negatives belonging to the same class as the anchor data point is low. In the case of ImageNet, with 1000 classes, drawing 1024 (or even up to 4096) random samples from the mini-batch as negatives, the likelihood of false negatives is low. Compare this with the histology datasets, with 2 and 5 classes respectively, where the poorer class balance further increases the risk of false negatives for the larger classes. In addition, ImageNet was collected in a supervised manner. For both histopathology data sources, the datasets used for SSL were sampled without knowledge of class labels, making the class distribution depend only on the natural occurrence of each class (in contrast to stratified sampling).
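The risk described above can be made concrete with a small illustrative model (not from the paper): if both the anchor's class and a random negative's class are drawn from the same class prior p, the probability that the negative shares the anchor's class is the collision probability Σ_c p_c². The class priors below are hypothetical stand-ins for ImageNet-like and Camelyon16-like distributions:

```python
def false_negative_rate(class_priors) -> float:
    """Probability that a randomly drawn negative shares the anchor's class.

    Anchor and negative classes are both drawn from the same prior, giving
    the collision probability sum_c p_c^2 (illustrative model only).
    """
    assert abs(sum(class_priors) - 1.0) < 1e-9, "priors must sum to 1"
    return sum(p * p for p in class_priors)

# ImageNet-like: 1000 balanced classes -> collision probability 0.001.
imagenet_like = false_negative_rate([1 / 1000] * 1000)

# Camelyon16-like: 2 classes with heavy imbalance (90/10 assumed here)
# -> collision probability 0.82, i.e. most random negatives are false.
camelyon_like = false_negative_rate([0.9, 0.1])
```

Under these assumed priors, a random in-batch negative is a false negative roughly one time in a thousand for the ImageNet-like case, but in the large majority of draws for the two-class imbalanced case.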

Diversity in/across classes and downstream-task isolation makes view generation easier.

As contrastive SSL aims to do instance discrimination, having data samples that are distinct makes the objective easier. Variance across classes further helps the model to learn features that separate the classes. The nature of the datasets allows ImageNet to have higher variance both within and across classes, as attributes such as viewpoint, background, lighting, occlusion, etc., can vary in a natural dataset. Even within subgroups of labels, such as dogs, there is larger variation due to more diversity of color, texture, shape and size, all of which, to a large extent, are constant between histopathology classes. In the ImageNet case, the augmentations applied can make use of the known variance, making the model invariant to it. The subtleties of the differences between classes in histology data make it more challenging to find effective augmentations.

Downstream-target isolation relates to how isolated the downstream targets are on average in one image. ImageNet contains many object-centered images, containing only the class object, while both histopathology datasets have images that contain multiple classes (tissues and/or cell types). With the downstream task of object classification, having multiple objects present in the images will add noise to the wanted target signal. Augmentations such as scale may in cases with low downstream-target isolation create a positive pair depicting two separate objects, instead of the same object in different scales.

6.2 How to do contrastive learning for histology?

From what we have seen so far, the success of contrastive self-supervised learning (SSL) methods on ImageNet has been highly dependent on intrinsic properties of the data. The intricate interplay between method and data raises questions on both how to adapt the method to better accommodate the data, and how to better assemble datasets that fit the method.

Current positive-view generation is not sufficient.

From the results shown in Table 1 we saw that a tailored set of augmentations gives substantial improvements in downstream performance. However, we also saw in Figure 5 that the representations learned are, to a large extent, based on nuisance variables. As optimal augmentations depend on both dataset and task, finding a general approach that applies to all datasets and all tasks may not be feasible. The augmentations found in this study were sub-optimal even when tailored with a specific dataset and task in mind. Creating augmentations that are strong enough to retain only label information and remove everything else is not trivial, and may require extensive domain knowledge. This is further exacerbated by the fact that pathologists generally are not used to describing diagnostic criteria in terms of features suitable for formulation as image transformations. Moreover, heavily tailored augmentations would be a step towards feature engineering, something deep learning is trying to move away from. A different direction would be to learn what augmentations to apply, as presented by Tamkin et al. (2021). However, contrary to what Tamkin et al. (2021) suggest, the conclusion from our results is that these augmentations need to be optimized for the downstream task, not the SSL objective. A semi-self-supervised approach could therefore be an interesting future research direction.

Many false negatives give conflicting signals.

Datasets with few classes and/or large class imbalance suffer a large risk of introducing false negatives when negatives are taken as random samples from the mini-batch. This risk also increases with larger batch sizes, as the number of drawn negatives grows relative to the number of classes. This could be one explanation for why there is little benefit of increased batch size for histopathology data (Figure 6). Having a large portion of false negatives has consequences for the learning outcome, as it prevents the model from using class-specific features to discriminate between positives and negatives (as highlighted in Figure 2). We can further investigate this by looking at the cosine distance between samples in a mini-batch (1024 samples) of an SSL model trained on breast data. A significant number of negative samples have high or very high cosine similarity with the anchor data point, an occurrence not seen for ImageNet data. Out of all negatives, a substantial fraction of the anchor-negative pairs had a similarity higher than 0.9, whereas the corresponding fraction for ImageNet is far lower. Minimizing the number of false negatives is an important part of getting better performance (Chuang et al., 2020).
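The mini-batch diagnostic described above can be sketched as follows. The embeddings here are synthetic stand-ins; the thresholds and fractions reported in the paper come from the actual trained models:

```python
import numpy as np

def high_similarity_fraction(embeddings: np.ndarray,
                             threshold: float = 0.9) -> float:
    """Fraction of anchor-negative pairs with cosine similarity > threshold.

    Every off-diagonal pair in the batch is treated as an anchor-negative
    pair, mimicking how SimCLR draws negatives from the mini-batch.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    mask = ~np.eye(len(z), dtype=bool)  # exclude self-pairs
    return float((sim[mask] > threshold).mean())
```

For clone-like embeddings (small perturbations of one vector, as when patches depict near-identical cells) this fraction approaches 1, while for well-spread embeddings in a high-dimensional space it approaches 0.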

Intrinsic properties of histology datasets may be incompatible with current methods.

As discussed in Section 6.1, intrinsic properties of the datasets make positive-view generation challenging (low inter- and intra-class variance, and lower target-object isolation in individual images), and increase the risk of false negatives (due to the low number of classes and poor class distribution). Even if these problems could be addressed with new techniques, such as better view generation and true-negative sampling, questions regarding the suitability of SSL for these types of datasets remain. The SimCLR objective optimizes towards instance discrimination. This approach is intuitive when we have a dataset where the intra-class variance consists of multiple ways of depicting the same object. Being able to separate each of these instances helps give a wider distribution of the possible appearances of the object in question. In histopathology, datasets are constructed as smaller patches from whole-slide images, where the intra-class variance consists of images showing multiple cells, and where the cells are more or less clones of each other. A different approach to constructing these datasets may be needed, such that the downstream-target isolation becomes higher. This is indeed challenging. Taking smaller patches, depicting as little as individual cells, suffers even more from false negatives, and macro structures may be hard to learn. Taking larger patches or reducing resolution would include larger structures, but may include multiple tissues at once, reducing downstream-target isolation further.

If you understand your data, then contrastive self-supervised learning can still be useful.

Despite the above-mentioned limitations of contrastive SSL used on histology data, the current setup may still bring value for specific histology applications. For example, the reduced need for large batch sizes and long training times makes in-domain pre-training using contrastive SSL accessible to a larger community. Furthermore, by keeping the dataset characteristics from Table 3 in mind, risk factors for the dataset in question may be identified early. By understanding the inter- and intra-class variance of the (latent) classes of the dataset, augmentations may be formulated that are better tailored to the specific downstream application in mind, compared to naively using those optimized for ImageNet. The risk of false negatives might be possible to mitigate during data collection, for example by controlling the resolution at which the patches are sampled. Together with the results in Table 2, which show that in-domain SimCLR pre-training does boost performance, this means contrastive self-supervised learning can indeed be a way to reduce the need for labeled data in histopathology applications.

6.3 Limitations and Future work

This paper aims to evaluate if and how contrastive self-supervised methods can be used to reduce the needed amount of labeled data for the target histopathology application. The scope of the study was limited to three datasets and two classification tasks, with SimCLR chosen as a representative method among contrastive self-supervised methods. Restricted by the challenges and limitations of pre-training models for one downstream task, we did not evaluate the generalization of the SSL models by evaluating one pre-trained model on multiple downstream tasks, with one exception (SimCLR Multidata with original augmentations was used as a pre-trained model for both downstream tasks). Despite the restricted scope, we believe that the results may give guidance when applied to an extended domain.

The results from this study show that contrastive self-supervised methods have the potential, if applied correctly, to reduce the need for labeled target data. However, they also show that the method is still sub-optimal with respect to the specific data characteristics of histopathology. There is therefore room for improvement, but the challenges of creating informative positives and reducing false negatives are not trivial to solve. Creating informative positives may be easier with a deepened understanding of what features should be considered task-relevant for a given downstream task. The problem of false negatives could potentially be solved by selecting negatives in a non-random way, potentially taking a semi-supervised approach. We hope that this work will inspire future research that takes a holistic approach, considering the interplay between dataset and method.

7 Conclusions

In this paper, we have evaluated contrastive self-supervised learning on histopathology data. Effective contrastive self-supervised learning with respect to a particular downstream task requires two criteria to be fulfilled: that the shared information between the positive views is high, and that the false-negative rate is low. Our study shows that both these criteria are challenging to fulfill for histopathology applications, due to the characteristics of the datasets. Furthermore, we have shown that the explicit and implicit heuristics used for ImageNet do not necessarily apply in the domain of histopathology. We conclude that SSL for histopathology cannot be considered and used under the same assumptions as for natural images, and that in-depth understanding of the data is essential for training self-supervised models for histopathology applications.

This work was supported by the Wallenberg AI and Autonomous Systems and Software Program (WASP-AI), the research environment ELLIIT, AIDA Vinnova grant 2017-02447, and Linköping University Center for Industrial Information Technology (CENIIT). We would also like to thank our colleague Jesper Molin (Sectra) for comments on the manuscript.

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

We declare that we do not have any conflicts of interest.


  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 5628–5637. External Links: Link Cited by: §2, §3.2.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 15509–15519. External Links: Link Cited by: §2.
  • A. Birhane and V. U. Prabhu (2021) Large Image Datasets: a Pyrrhic Win for Computer Vision?. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1537–1547. Cited by: §1.
  • A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin (2020) Albumentations: fast and Flexible Image Augmentations. Information 11 (2), pp. 125. External Links: Document Cited by: Appendix B.
  • M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, External Links: Link Cited by: §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, pp. 1597–1607. External Links: Link Cited by: §1, §1, §2, §4.3, §5.1, §5.1, §5.3, §5.3.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved Baselines with Momentum Contrastive Learning. CoRR abs/2003.04297. External Links: 2003.04297 Cited by: §2.
  • C. Chuang, J. Robinson, Y. Lin, A. Torralba, and S. Jegelka (2020) Debiased contrastive learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, External Links: Link Cited by: §6.2.
  • O. Ciga, A. L. Martel, and T. Xu (2020) Self supervised contrastive learning for digital histopathology. CoRR abs/2011.13971. External Links: 2011.13971 Cited by: §2, §4.2.
  • E. Cole, X. Yang, K. Wilber, O. Mac Aodha, and S. Belongie (2021) When Does Contrastive Visual Representation Learning Work?. CoRR abs/2105.05837. External Links: 2105.05837 Cited by: §1, §2, §4.2.
  • Y. Cui, Y. Song, C. Sun, A. Howard, and S. J. Belongie (2018) Large scale fine-grained categorization and domain-specific transfer learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4109–4118. External Links: Link, Document Cited by: §1.
  • O. Dehaene, A. Camara, O. Moindrot, A. de Lavergne, and P. Courtiol (2020) Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology. CoRR abs/2012.03583. External Links: 2012.03583 Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Document Cited by: §1, §4.2.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In Advances in Neural Information Processing Systems, Vol. 27. Cited by: §2.
  • W. Falcon and K. Cho (2020) A framework for contrastive self-supervised learning and designing a new approach. CoRR abs/2009.00104. External Links: 2009.00104 Cited by: §2.
  • R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020) Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence 2 (11), pp. 665–673. External Links: 2004.07780, ISSN 2522-5839, Document Cited by: §3.2.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1, §2.
  • J. Gildenblat and E. Klaiman (2020) Self-Supervised Similarity Learning for Digital Pathology. CoRR abs/1905.08139. External Links: 1905.08139 Cited by: §2.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, External Links: Link Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735. External Links: ISSN 2575-7075, Document Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: ISSN 1063-6919, Document Cited by: §4.3.
  • G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a Proxy Task for Visual Understanding. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 840–849. External Links: ISSN 1063-6919, Document Cited by: §1, §2.
  • H. Li, P. Chaudhari, H. Yang, M. Lam, A. Ravichandran, R. Bhotika, and S. Soatto (2020) Rethinking the hyperparameters for fine-tuning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. External Links: Link Cited by: §1.
  • J. Li, T. Lin, and Y. Xu (2021) SSLP: spatial Guided Self-supervised Learning on Pathological Images. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Lecture Notes in Computer Science, Cham, pp. 3–12. External Links: Document, ISBN 978-3-030-87196-3 Cited by: §2.
  • K. Lindman, F. R. Jerónimo, M. Lindvall, and C. Bivik Stadler (2019) Skin data from the Visual Sweden project DROID. External Links: Document Cited by: §4.2.
  • G. Litjens, P. Bandi, B. Ehteshami Bejnordi, O. Geessink, M. Balkenhol, P. Bult, A. Halilovic, M. Hermsen, R. van de Loo, R. Vogels, Q. F. Manson, N. Stathonikos, A. Baidoshvili, P. van Diest, C. Wauters, M. van Dijk, and J. van der Laak (2018) 1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset. GigaScience 7 (6). External Links: Document Cited by: §4.2.
  • G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez (2017) A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42, pp. 60–88. External Links: 1702.05747, ISSN 13618415, Document Cited by: §1.
  • M. Y. Lu, R. J. Chen, J. Wang, D. Dillon, and F. Mahmood (2019) Semi-Supervised Histology Classification using Deep Multiple Instance Learning and Contrastive Predictive Coding. CoRR abs/1910.10825. External Links: 1910.10825 Cited by: §2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, pp. 69–84. External Links: Link, Document Cited by: §1, §2.
  • S. Purushwalkam and A. Gupta (2020) Demystifying contrastive self-supervised learning: invariances, augmentations and dataset biases. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, External Links: Link Cited by: §2.
  • M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio (2019) Transfusion: understanding Transfer Learning with Applications to Medical Imaging. CoRR abs/1902.07208. External Links: 1902.07208 Cited by: §1.
  • K. Stacke, C. Lundström, J. Unger, and G. Eilertsen (2020) Evaluation of Contrastive Predictive Coding for Histopathology Applications. In Proceedings of the Machine Learning for Health NeurIPS Workshop, pp. 328–340. External Links: ISSN 2640-3498 Cited by: §2.
  • C. B. Stadler, M. Lindvall, C. Lundström, A. Bodén, K. Lindman, J. Rose, D. Treanor, J. Blomma, K. Stacke, N. Pinchaud, M. Hedlund, F. Landgren, M. Woisetschläger, and D. Forsberg (2021) Proactive Construction of an Annotated Imaging Database for Artificial Intelligence Training. Journal of Digital Imaging 34 (1), pp. 105–115. External Links: ISSN 1618-727X, Document Cited by: §1, §4.2.
  • A. Tamkin, M. Wu, and N. D. Goodman (2021) Viewmaker networks: learning views for unsupervised representation learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §6.2.
  • Y. Tian, D. Krishnan, and P. Isola (2020a) Contrastive multiview coding. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, pp. 776–794. External Links: Link, Document Cited by: §1, §2.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020b) What makes for good views for contrastive learning?. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, External Links: Link Cited by: §3.2, §3.2.
  • A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pp. 1521–1528. External Links: Link, Document Cited by: §2, §4.2.
  • T. Truong, S. Mohammadi, and M. Lenga (2021) How Transferable Are Self-supervised Features in Medical Image Classification Tasks?. arXiv:2108.10048 [cs]. External Links: 2108.10048 Cited by: §1.
  • Y. H. Tsai, Y. Wu, R. Salakhutdinov, and L. Morency (2021) Self-supervised learning from a multi-view perspective. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §2, §3.2.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §2.
  • A. van den Oord, Y. Li, and O. Vinyals (2019) Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748. External Links: 1807.03748 Cited by: §1, §2, §3.1.
  • B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling (2018) Rotation equivariant cnns for digital pathology. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2018 - 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II, pp. 210–218. External Links: Link, Document Cited by: §4.2.
  • X. Wang, S. Yang, J. Zhang, M. Wang, J. Zhang, J. Huang, W. Yang, and X. Han (2021) TransPath: transformer-Based Self-supervised Learning for Histopathological Image Classification. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Lecture Notes in Computer Science, Cham, pp. 186–195. External Links: Document, ISBN 978-3-030-87237-3 Cited by: §2.
  • M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman (2020) On Mutual Information in Contrastive Learning for Visual Representations. CoRR abs/2005.13149. External Links: 2005.13149 Cited by: §2.
  • Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. CoRR abs/1805.01978. External Links: 1805.01978 Cited by: §2.
  • K. Yang, J. Yau, L. Fei-Fei, J. Deng, and O. Russakovsky (2021) A Study of Face Obfuscation in ImageNet. CoRR abs/2103.06191. External Links: 2103.06191 Cited by: §1.
  • P. Yang, Z. Hong, X. Yin, C. Zhu, and R. Jiang (2021) Self-supervised Visual Representation Learning for Histopathological Images. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Lecture Notes in Computer Science, Cham, pp. 47–57. External Links: Document, ISBN 978-3-030-87196-3 Cited by: §2.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3320–3328. Cited by: §1.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Large Batch Training of Convolutional Networks. arXiv:1708.03888 [cs]. External Links: 1708.03888 Cited by: §4.3.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pp. 649–666. External Links: Link, Document Cited by: §2.

Appendix A Datasets

In Table 4 the number of slides and patches is given for each dataset. All patches were extracted at a resolution of 0.5 microns per pixel with a size of 256x256 pixels. The unsupervised datasets were sampled without overlap in a uniform grid. For breast, at most 1000 samples per slide were randomly selected from the slides in the training set, resulting in approximately 270k patches. Similarly for skin, the unsupervised training set consists of samples selected without using labels, also resulting in approximately 270k patches. Potential sampling points were found by extracting samples from a random, uniform grid within tissue regions. From these candidates, 1000 samples per slide were extracted at random.
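The uniform-grid sampling described above can be illustrated with a minimal sketch. The function name, the nested-list tissue mask, and the centre-pixel tissue test are hypothetical simplifications (the paper does not specify its tissue-detection criterion); only the non-overlapping grid and the at-most-1000-per-slide random selection follow the text.

```python
import random

def sample_patch_coordinates(tissue_mask, patch_size=256, max_per_slide=1000, seed=0):
    """Place a non-overlapping uniform grid over the slide, keep grid cells
    whose centre pixel falls inside the tissue mask, then randomly select
    at most `max_per_slide` of the candidate coordinates."""
    h, w = len(tissue_mask), len(tissue_mask[0])
    candidates = [
        (x, y)
        for y in range(0, h - patch_size + 1, patch_size)  # non-overlapping rows
        for x in range(0, w - patch_size + 1, patch_size)  # non-overlapping cols
        if tissue_mask[y + patch_size // 2][x + patch_size // 2]  # centre in tissue
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(max_per_slide, len(candidates)))
```

For a 1024x1024 slide fully covered by tissue, this yields the 16 grid positions at multiples of 256; slides with more than 1000 candidate positions would be subsampled to 1000.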

The supervised training, validation, and test sets for breast follow the patch coordinates of PatchCamelyon, but were resampled at the above-mentioned resolution and size. For skin, the supervised datasets were constructed as follows. All slides were sampled in a uniform grid with 50% overlap, resulting in a total of 1.3 million candidate patches. From these candidates, a subset was selected for each dataset according to the following criteria. For the training set, the large class imbalance was mitigated by selecting patches such that the total number of patches per class was min(75 000, all available), resulting in approx. 320k patches. For validation, 700 patches from each class were randomly selected from the slides in the validation set. Similarly, 3700 patches from each class were randomly selected from the slides in the test set. The validation and test sets are therefore class balanced. There is no patient overlap between the supervised datasets.
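The per-class capping rule for the skin training set, min(75 000, all available) patches per class, can be sketched as follows. The function name and the (patch_id, label) representation are hypothetical; only the capping and random within-class selection come from the text.

```python
import random
from collections import defaultdict

def balance_by_class(patches, cap=75_000, seed=0):
    """Mitigate class imbalance by keeping, for each class,
    min(cap, all available) patches chosen at random.
    `patches` is a list of (patch_id, class_label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patch_id, label in patches:
        by_class[label].append(patch_id)
    selected = []
    for label, ids in by_class.items():
        # classes below the cap keep all their patches
        selected.extend((pid, label) for pid in rng.sample(ids, min(cap, len(ids))))
    return selected
```

A class with fewer patches than the cap is kept in full, so only the over-represented classes are subsampled.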

                          Training                                        Validation   Test
Unsupervised
  Breast    # WSI         270                                             N/A          N/A
            # patches     265048                                          N/A          N/A
  Skin      # WSI         65                                              N/A          N/A
            # patches     271675                                          N/A          N/A
Supervised
  Breast    # WSI         216 / 100 / 50 / 20 / 10                        54           129
            # patches     262144 / 120000 / 60000 / 25000 / 10000         32768        32768
  Skin      # WSI         65 / 50 / 20 / 10                               8            23
            # patches     317243 / 235000 / 95000 / 50000                 3500         18500

Table 4: Number of whole-slide images (WSIs) and patches for breast and skin data.

Appendix B Augmentations

The augmentations were applied using Pytorch Transforms (https://pytorch.org/vision/stable/transforms.html) or Albumentations (Buslaev et al., 2020). The implementation of Gaussian blur was taken from https://github.com/facebookresearch/moco. Examples are shown in Figure 7, and implementation details in Table 5.

Figure 7: Examples of augmentations
Transformation               Params                                              Probability
Base
  Random Crop                size: 224x224                                       1.0
  Flip                       -                                                   0.5
  Rotation (fixed 90 deg)    -                                                   0.5
  Color Jitter               brightness: 0.8, contrast: 0.8,
                             saturation: 0.8, hue: 0.2
SimCLR original
  Scale                      scale: {0.2, 0.95}–1.0                              1.0
  Gaussian Blur              sigma: 0.1–2.0                                      {0, 0.5}
  Grid Distortion            num_steps: 9, distort_limit: 0.2, border_mode: 2    {0, 0.5}
  (Grid) Shuffle             grid: (3,3)                                         {0, 0.5}

Table 5: List of applied augmentations, with corresponding parameters and probability of application.
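The "Base" augmentation set of Table 5 can be illustrated framework-free. The actual pipeline used Pytorch Transforms and Albumentations as noted above; the sketch below only reproduces the random 224x224 crop (p=1.0), flip (p=0.5), and fixed 90-degree rotation (p=0.5) on an image represented as nested lists, with hypothetical function names.

```python
import random

def random_crop(img, size=224, rng=random):
    """Crop a random size x size window from an H x W image (nested lists)."""
    h, w = len(img), len(img[0])
    y = rng.randrange(h - size + 1)
    x = rng.randrange(w - size + 1)
    return [row[x:x + size] for row in img[y:y + size]]

def base_augment(img, rng=random):
    """Base pipeline of Table 5: random crop (p=1.0),
    flip (p=0.5), fixed 90-degree rotation (p=0.5)."""
    img = random_crop(img, 224, rng)
    if rng.random() < 0.5:                       # horizontal flip
        img = [row[::-1] for row in img]
    if rng.random() < 0.5:                       # rotate 90 degrees clockwise
        img = [list(r) for r in zip(*img[::-1])]
    return img
```

In the contrastive setting, calling `base_augment` twice on the same patch with independent randomness produces the two views of a positive pair.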

Appendix C Results

Tables 6 and 7 show the results presented in Figure 4 in numerical form.

                                 Supervised size (#slides)
Model                            10     20     50     100    216
Accuracy (%)
  Supervised
  ImageNet Supervised
  SimCLR Breast, Original
  SimCLR Breast, Base + Scale
  Multidata, Original
  Multidata, Base + Scale
AUC
  Supervised
  ImageNet Supervised
  SimCLR Breast, Original
  SimCLR Breast, Base + Scale
  Multidata, Original
  Multidata, Base + Scale

Table 6: Linear performance for patch-level PatchCamelyon, shown as accuracy (%) and patch-level AUC.

                                           Supervised size (#slides)
Model                                      10     20     50     65
Accuracy (%)
  Supervised
  ImageNet Supervised
  SimCLR Skin, Original
  SimCLR Skin, Base + GridDist + Shuffle
  Multidata, Original
  Multidata, Base + GridDist + Shuffle

Table 7: Linear performance (patch-level accuracy (%)), AIDA-LNSK.