Official code for "Self-Supervised driven Consistency Training for Annotation Efficient Histopathology Image Analysis"
Training a neural network with a large labeled dataset is still a dominant paradigm in computational histopathology. However, obtaining such exhaustive manual annotations is often expensive, laborious, and prone to inter- and intra-observer variability. While recent self-supervised and semi-supervised methods can alleviate this need by learning unsupervised feature representations, they still struggle to generalize well to downstream tasks when the number of labeled instances is small. In this work, we overcome this challenge by leveraging both task-agnostic and task-specific unlabeled data based on two novel strategies: i) a self-supervised pretext task that harnesses the underlying multi-resolution contextual cues in histology whole-slide images to learn a powerful supervisory signal for unsupervised representation learning; ii) a new teacher-student semi-supervised consistency paradigm that learns to effectively transfer the pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data. We carry out extensive validation experiments on three histopathology benchmark datasets across two classification and one regression-based tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. Under limited-label data, the proposed method yields tangible improvements, performing close to or even outperforming other state-of-the-art self-supervised and supervised baselines. Furthermore, we empirically show that the idea of bootstrapping the self-supervised pretrained features is an effective way to improve the task-specific semi-supervised learning on standard benchmarks. Code and pretrained models will be made available at: https://github.com/srinidhiPY/SSL_CR_Histo
Deep neural network models have achieved tremendous success in obtaining state-of-the-art performance on various histology image analysis tasks, ranging from disease grading and cancer classification to outcome prediction (Srinidhi et al., 2021; Bera et al., 2019; Litjens et al., 2017). The main success of these methods is attributed to the availability of large-scale open datasets with clean manual annotations. However, collecting such a large corpus of labeled data is often expensive, laborious and requires skillful domain expertise, notably in the histopathology domain (Madabhushi and Lee, 2016). Recently, self-supervised and semi-supervised approaches have become increasingly popular for alleviating the annotation burden by leveraging readily available unlabeled data, so that models can be trained with limited supervision. These methods have recently demonstrated promising results on various computer vision (Jing and Tian, 2020; Laine and Aila, 2016; Sohn et al., 2020) and medical image analysis tasks (Chen et al., 2019; Tellez et al., 2019b; Li et al., 2020d). In this paper, we focus on the self-supervised driven semi-supervised learning paradigm for histology image analysis by efficiently exploiting the underlying information present in unlabeled data, both in task-agnostic and task-specific ways.
The existing plethora of self-supervised learning (SSL) methods can be viewed as defining a surrogate task, i.e., a pretext task, which is formulated using only unlabeled examples and which requires a high-level semantic understanding of the image to solve (Jing and Tian, 2020). The neural network model trained to solve this pretext task often learns useful visual representations that can be transferred to any downstream task to solve the task-specific problem. On the other hand, another important stream of work is based on semi-supervised learning (SmSL), which seeks to learn from both labeled and unlabeled examples, with limited manual annotations (Chapelle et al., 2010). Among SmSL methods, the most recent and popular stream of approaches is based on consistency regularization (Laine and Aila, 2016; Sajjadi et al., 2016) and pseudo-labeling (Lee, 2013; Sohn et al., 2020). The consistency enforcing strategy aims to constrain network predictions to be invariant to input or model weight perturbations, such as adding noise to the input data through different image augmentations (Xie et al., 2019), network dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016). The main idea is that the model should predict similar labels for both the input image and a perturbed (augmented) version of the same image. Approaches of this kind include temporal ensembling (Laine and Aila, 2016), mean teacher (Tarvainen and Valpola, 2017) and virtual adversarial training (Miyato et al., 2018). Alternatively, pseudo-labeling imputes artificial (pseudo) labels for unlabeled data from the class predictions of a model that is trained using labeled data alone (Sohn et al., 2020). The success of these SmSL approaches is attributed to the fact that these models implicitly learn to fit decision boundaries by grouping similar images to share similar labels, forming high-density clusters in the input feature space.
Despite significant advancements among SSL and SmSL approaches, they still suffer from some major limitations. Several SSL methods assume that optimizing the pretext objective task will invariably yield suitable downstream representations for the target task. However, many recent studies (Zoph et al., 2020; Yan et al., 2020; Goyal et al., 2019) have shown that SSL methods overfit to the pretraining objective and may not generalize well to the downstream task. On the other hand, methods based on SmSL approaches generally struggle to learn effectively when the number of labeled instances is small and the labels are noisy (Rebuffi et al., 2020). This is a typical scenario in histopathology, where the number of manually labeled annotations is small and labels are often noisy (Shi et al., 2020). Furthermore, when the ratio of labeled and unlabeled samples is highly imbalanced, models trained solely based on the consistency strategy have very low accuracy and higher entropy, which prevents them from achieving high-confidence scores (i.e., pseudo labels) on unlabeled data (Kim et al., 2020).
To address these shortcomings, several recent studies explored the feasibility of integrating the merits of both SSL and SmSL approaches to efficiently exploit the limited available labeled target data with abundant unlabeled data, to enhance the performance on downstream tasks (Zhai et al., 2019; Rebuffi et al., 2020). These approaches first aim to initialize a good latent representation of the data by formulating a pretext objective in a task-agnostic way, without using any labels. Later, these pretrained representations are effectively transferred to the downstream tasks by reinitializing these features via SmSL approach in a task-specific way. The idea of bootstrapping features trained via SSL algorithm has been shown to improve on an SmSL approach by preventing overfitting on the target domain (Zhai et al., 2019).
In this paper, we take inspiration from the above observations, and propose a novel self-supervised driven semi-supervised learning framework for histopathology image analysis, which harnesses the unlabeled data both in a task-agnostic and task-specific manner. To this end, we first present a simple yet effective self-supervised pretext task, namely, Resolution Sequence Prediction (RSP), which leverages the multi-resolution contextual information present in the pyramidal nature of histology whole-slide images (WSIs). Our design choice is inspired by the way a pathologist searches for cancerous regions in a WSI. Typically, a pathologist zooms in and out of each region, where the tissue is examined at high to low resolution to obtain the details of individual cells and their surroundings. In this work, we show that exploiting such meaningful multi-resolution contextual information provides a powerful surrogate supervisory signal for unsupervised representation learning. Second, we further develop a ‘teacher-student’ semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks. Our approach can be viewed as a knowledge distillation method (Hinton et al., 2015), where the self-supervised teacher model learns to generate pseudo labels for the task-specific unlabeled data, which forces the student model to make predictions consistent with the teacher model. We experimentally show that initializing the student model with the SSL pretrained teacher model achieves robustness against noisy input data (i.e., noise is injected through various kinds of domain-specific augmentations) and helps the student learn faster than the teacher in practice. Our whole framework is trained in an end-to-end manner to seamlessly integrate the information present in labeled and unlabeled data both in task-specific and task-agnostic ways.
The major contributions of this paper are:
We propose a novel self-supervised pretext task for generating unsupervised visual representations via predicting the resolution sequence ordering in the pyramidal structure of histology WSI.
We present a new ‘teacher-student’ semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data.
We extensively validate our method on three benchmark datasets across two classification and one regression based histopathology tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. The proposed self-supervised method, along with consistency training, is shown to improve the performance on all three datasets, especially in the less annotated data regime.
The paper is organized as follows: we first briefly introduce the related work in Section 2. In Section 3, we present the details of our proposed methodology. Datasets and experimental results are described in Section 4. Finally, we discuss our key findings and the limitations of our work in Section 6, followed by the conclusion in Section 7.
In this section, for brevity, we review only the recent developments in self-supervised and semi-supervised representation learning literature that are closely relevant to our work.
Self-supervised (SSL) representation learning has recently gained momentum in many medical image analysis tasks for reducing the manual annotation burden. These approaches aim to construct different types of auxiliary pretext tasks, where the supervisory signals are generated from the data itself. Pretraining a convolutional neural network (CNN) to solve these pretext tasks results in useful feature representations that can be used to initialize a subsequent CNN on data with limited labels. The design of the pretext task is often based on domain-specific knowledge like image context restoration (Chen et al., 2019), anatomical position prediction (Bai et al., 2019), 3D distance prediction (Spitzer et al., 2018), Rubik’s cube recovery (Zhuang et al., 2019) and image intrinsic spatial offset prediction (Blendowski et al., 2019). For instance, Chen et al. (2019) proposed an image context restoration task for 2D fetal ultrasound image classification, CT abdominal multi-organ localization, and tumor segmentation in brain MR images. Blendowski et al. (2019) extend the context prediction task to 3D by designing image-intrinsic spatial offset relations to learn pretrained features. Similarly, Zhuang et al. (2019) extend the SSL approach to 3D volumetric medical images by solving a Rubik’s cube recovery task for brain hemorrhage classification and tumor segmentation. Bai et al. (2019) proposed to learn anatomical position prediction as a supervisory signal for cardiac MR image segmentation. Spitzer et al. (2018) designed a pretext task based on 3D distance prediction between two patches sampled from the same subject for segmenting brain areas as the target task. Many such pretext tasks are designed based on ad-hoc heuristics, limiting the generalizability of learned representations.
An alternative stream of approach is based on generative modeling (such as VAE (Kingma and Welling, 2013), GAN-based models (Dumoulin et al., 2016; Donahue et al., 2016) and other variants of it), which implicitly learn representations by minimizing the reconstruction loss in the pixel space. Compared with the discriminative ones, generative approaches are overly focused on pixel-level details, thus, limiting their ability to model complex structures present in an image. Recently, a new family of discriminative methods is proposed based on contrastive learning, which learns to enforce similarities in the latent space between similar/dissimilar pairs (He et al., 2020; Oord et al., 2018). In such methods, similarity is defined through maximising mutual information (Oord et al., 2018) or via different data transformations (Chen et al., 2020). For example, Lu et al. (2019) combined attention based multiple instance learning with contrastive predictive coding for weakly supervised histology classification. Chaitanya et al. (2020) extended the contrastive learning approach to segmentation of volumetric medical images by utilizing domain and problem-specific cues for efficient segmentation in three MRI datasets. Finally, Li et al. (2020c) proposed a patient feature-based softmax embedding to learn multi-modal SSL representations for diagnosing retinal disease.
The existing semi-supervised learning (SmSL) techniques can be broadly categorized into three groups: i) adversarial training-based (Zhang et al., 2017; Diaz-Pinto et al., 2019; Quiros et al., 2019); ii) graph-based (Shi et al., 2020; Javed et al., 2020; Aviles-Rivero et al., 2019); and iii) consistency-based (Li et al., 2020a; Zhou et al., 2020; Li et al., 2020d; Su et al., 2019; Liu et al., 2020) approaches. Adversarial training based SmSL approaches learn a generative and a discriminative model simultaneously by forcing the discriminator to output class labels, instead of estimating the input probability distribution as in a normal generative adversarial network (GAN). For example, Zhang et al. (2017) proposed a segmentation and evaluation network, where the segmentation network is encouraged to obtain segmentation masks for unlabeled images, while the evaluation network is forced to distinguish the segmentation results from annotated masks by assigning different scores. Diaz-Pinto et al. (2019) proposed a GAN framework for retinal image synthesis by utilizing both labeled and unlabeled data for training a glaucoma classifier. Meanwhile, Quiros et al. (2019) generated pathologically meaningful representations to synthesize high-fidelity H&E breast cancer tissue images that resemble real tissue. On the other hand, graph-based methods construct a graph that establishes a semantic relationship between neighboring samples and utilize the transduction of the graph to assign labels to unlabeled data via label propagation. As a typical example, Aviles-Rivero et al. (2019) proposed a graph-based SSL model for chest X-ray classification, where the pseudo labels for unlabeled data are generated using label propagation. In histology, Javed et al. (2020) introduced a graph-based community detection algorithm for identifying seven tissue phenotypes in WSIs. In more recent work, Shi et al. (2020) utilized a graph-based self-ensembling approach that creates an ensemble target for each label prediction using an exponential moving average (EMA) and minimizes the distance between the label prediction and its ensemble target via a consistency cost. Such self-ensembling based approaches are shown to be robust to noisy labels compared to other graph-based methods.
The most recent line of work in SmSL is based on consistency regularization, which enforces the consistency of predictions under random perturbations such as data augmentations (French et al., 2017), stochastic regularization (Laine and Aila, 2016; Sajjadi et al., 2016), and adversarial perturbation (Miyato et al., 2018). More recently, Tarvainen and Valpola (2017) proposed the mean teacher (MT) framework, which averages the model weights instead of taking an EMA of the label predictions to enhance the quality of consistency targets. These strategies were recently extended to several medical image analysis tasks. For instance, Li et al. (2020d) introduced a transformation-consistent self-ensembling model for segmentation on three medical image datasets. Several extensions to MT have also been explored by enforcing prediction consistency in a region-based (Zhou et al., 2020), relation-based (Liu et al., 2020; Su et al., 2019) or cross-domain based (Li et al., 2020a) manner, under various domain-specific perturbations.
An overview of the proposed self-supervised driven consistency training approach is illustrated in Fig. 1. Our framework consists of three main stages: i) We pretrain a self-supervised model on an unlabeled set to obtain task-agnostic feature representations. ii) We fine-tune the SSL model on a limited amount of labeled data to obtain the task-specific features. iii) We further aim to improve the downstream performance on the target task by using both labeled and unlabeled data in a task-specific semi-supervised manner. Both teacher and student networks are initialized with the fine-tuned model for consistency training on the target task. The main objective is to optimize the student network, which learns to minimize the supervised loss on the labeled set and the consistency loss on the unlabeled set. During consistency training, the teacher network is trained to predict the pseudo-label on a weakly augmented unlabeled image. A student network then tries to match this pseudo-label by making its prediction on a strongly augmented version of the same unlabeled image. We update only the student network weights during training while keeping the teacher network weights frozen, and we make the student the new teacher after every epoch and iterate until convergence.
We start by describing the self-supervised representation learning framework based on three different pretext categories (Jing and Tian, 2020), namely: context-based, generative-based and contrastive-based methods. The pretext tasks are designed to solve the task-agnostic problem in a self-supervised manner, where the class labels to train the network are generated automatically from the data itself. These pretrained representations can be transferred to multiple downstream tasks by fine-tuning a network on the limited labeled training examples in a task-specific way. In our work, we first pretrain a convolutional neural network (convNet) on an unlabeled pretraining set to obtain generalized feature representations in a task-agnostic manner.
Let us denote the pretraining set as $\mathcal{D}_p = \{x_i\}_{i=1}^{N}$, consisting of $N$ unlabeled training samples. In histopathology, the input $x_i \in \mathbb{R}^{H \times W \times 3}$ denotes the RGB image patch sampled from a gigapixel WSI, with height ($H$) and width ($W$); and $y_i$ is the class label for $x_i$, with $y_i \in \{1, \dots, C\}$ for classification or $y_i \in \mathbb{R}$ for regression. Our goal is to learn a feature embedding in an unsupervised manner that maps an unlabeled input $x_i$ to a low-dimensional embedding $f_\theta(x_i) \in \mathbb{R}^{d}$, with $d$ being the feature dimension and $f_\theta$ denoting the neural network parameterized by $\theta$.
Given a set of $N$ training samples $\{x_i\}_{i=1}^{N}$, the self-supervised training aims to optimize the following objective,
$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), P_i\big), \qquad (1)$$
where $P_i$ are the pseudo labels generated automatically from the self-supervised pretext tasks. In this paper, we investigate several popular self-supervised pretraining paradigms for histopathology, including the generative-based Variational Autoencoder (VAE), contrastive-based Momentum Contrastive Coding (MoCo), and finally, the proposed context-based Resolution Sequence Prediction (RSP) framework. The details are presented next.
Our self-supervised design choice for the “Resolution sequence prediction (RSP)” task is inspired by how a pathologist examines a WSI during diagnosis for potential abnormalities. Typically, a pathologist switches multiple times between lower magnification levels for context and higher magnification levels for detail. Such multi-resolution multi-field-of-view (FOV) analysis is possible due to the WSI’s pyramidal nature, where the multiple downsampled versions of the original image are stored in a pyramidal structure.
In this work, we exploit this multi-resolution nature of WSIs by proposing a novel self-supervised pretext task - which learns image representations by training convNets to predict the order of all possible sequences of resolution that can be created from the input multi-resolution patches. We argue that solving this resolution prediction task will allow a CNN to learn useful visual representations that inherently capture both contextual information (at lower magnification) and fine-grained details (at higher magnification levels).
Specifically, we create six tuples of randomly shuffled multi-resolution patches sampled from the input WSI. We formulate our resolution sequence prediction task as a multi-class classification problem. Formally, we construct a tuple of three concentric multi-resolution RGB image patches $(x_1, x_2, x_3)$ extracted at three different magnification levels, such that their spatial resolutions satisfy $s_1 < s_2 < s_3$ (measured in µm/pixel). By extracting such multiple concentric same-size patches, we ensure that the FOV of one image patch ($x_1$) lies inside the central square region of the other two ($x_2$, $x_3$) lower-magnification patches. A sample set of multi-resolution concentric patches is shown in Fig. 2. These sets of patches form an input tuple to our self-supervised RSP framework. For brevity, we only consider a tuple of three input patches from a given WSI, for which 3! = 6 possible permutations can be constructed (referred to as the resolution sequence ordering), as illustrated in Fig. 2.
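The construction of the six resolution orderings can be sketched as follows. This is a minimal illustration and not the authors' released code: the function name `make_rsp_sample`, the string placeholders standing in for image patches, and the convention that a permutation's position in the list is its class label are all assumptions made for the example.

```python
import itertools
import random

# Index 0 is the highest-magnification patch, index 2 the lowest (assumed convention).
NUM_PATCHES = 3

# All 3! = 6 orderings of the patch indices; a permutation's position in this
# list serves as its class label (an assumed labeling scheme).
PERMUTATIONS = list(itertools.permutations(range(NUM_PATCHES)))


def make_rsp_sample(patches, rng):
    """Shuffle three concentric patches; return (shuffled_patches, class_label)."""
    label = rng.randrange(len(PERMUTATIONS))
    order = PERMUTATIONS[label]
    shuffled = tuple(patches[i] for i in order)
    return shuffled, label


# Strings stand in for image arrays in this illustration.
patches = ("patch_high", "patch_mid", "patch_low")
shuffled, label = make_rsp_sample(patches, random.Random(0))
print(label, shuffled)
```

The network only ever sees the shuffled tuple, so the label is obtained for free from the sampling procedure, which is what makes the task self-supervised.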
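To achieve our goal, given an input multi-resolution sequence $(x_{\pi(1)}, x_{\pi(2)}, x_{\pi(3)})$, i.e., the three patches arranged according to one permutation $\pi$ among the $3! = 6$ possible permutations, we aim to train a siamese convNet model (Koch et al., 2015) to predict the label $y$ (i.e., the order of the resolution sequence over $K = 6$ possible classes), which is given by,
$$p\big(y \mid x_{\pi(1)}, x_{\pi(2)}, x_{\pi(3)}; \theta\big) = f_\theta\big(x_{\pi(1)}, x_{\pi(2)}, x_{\pi(3)}\big),$$
where $p(\cdot)$ is the predicted class probability for the input sequence, with label $y$, and $\theta$ being the learnable parameter of the model $f_\theta$. Therefore, given a set of $N$ training samples from the unlabeled pretraining set, the convNet model learns to solve the objective function defined in Eq. 1, by minimizing the categorical cross-entropy (CE) loss defined by,
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \, \log p\big(y = k \mid x^{(i)}_{\pi(1)}, x^{(i)}_{\pi(2)}, x^{(i)}_{\pi(3)}; \theta\big).$$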
The proposed RSP framework has three main stages: i) feature extraction; ii) pair-wise feature extraction; and iii) resolution sequence prediction. In the first stage, we adopt a siamese-based architecture to obtain features for each of the input multi-resolution patches, where all three network branches share the same parameters. In our work, we adopt the commonly used ResNet-18 model to obtain the features $h_k = f_\theta(x_k)$, $k \in \{1, 2, 3\}$, after the global average pooling layer, where $h_k$ is a latent vector of dimension 512. An additional crucial part of self-supervised pretraining is preparing the training data. To prevent the model from picking up on low-level cues and learning only trivial features, we make the sequence prediction task more difficult by applying various geometric transformations to the input data. The details of these geometric transformations are discussed thoroughly in Section 4.2. In the second stage, we perform pair-wise feature extraction on the extracted feature vectors $h_k$, to capture the intrinsic relationship between the multi-resolution frames. Specifically, we concatenate the features of each pair of input patches (i.e., $(h_1, h_2)$, $(h_1, h_3)$ and $(h_2, h_3)$) to obtain a 1024-dimensional feature vector per pair. Next, we use a multi-layer perceptron (MLP) with one hidden layer to obtain $g_{jk} = W_2 \, \sigma\big(W_1 [h_j ; h_k]\big)$, where $\sigma$ denotes ReLU and the bias is ignored for simplicity. Finally, in the third stage, the pair-wise features ($g_{jk}$’s) are concatenated, resulting in a single feature vector. This feature vector is finally fed to another MLP with one hidden layer and a softmax function to predict the order of the resolution sequence (i.e., one of 6 possible permutations), as illustrated in Fig. 2.
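The second and third stages can be sketched in plain Python to make the data flow explicit. This is an illustrative sketch only: the toy dimensions (`D`, `H`, `P`), the random weights, the sharing of one pair-wise MLP across all three pairs, and the helper names are assumptions, and the three input vectors stand in for the 512-dimensional ResNet-18 features.

```python
import math
import random
from itertools import combinations


def linear(x, w):
    """Bias-free linear layer; w is a list of rows, x a list of floats."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def rsp_head(features, params):
    """Map three per-patch feature vectors to probabilities over 6 orderings."""
    pair_feats = []
    for j, k in combinations(range(3), 2):  # pairs (0,1), (0,2), (1,2)
        pair = features[j] + features[k]  # concatenate the two feature vectors
        hidden = relu(linear(pair, params["pair_w1"]))
        pair_feats.extend(linear(hidden, params["pair_w2"]))
    hidden = relu(linear(pair_feats, params["cls_w1"]))
    return softmax(linear(hidden, params["cls_w2"]))


def cross_entropy(probs, label):
    return -math.log(probs[label])


rng = random.Random(0)
D, H, P = 8, 16, 4  # toy sizes (assumed); the text uses 512-d encoder features
params = {
    "pair_w1": rand_matrix(H, 2 * D, rng),
    "pair_w2": rand_matrix(P, H, rng),
    "cls_w1": rand_matrix(H, 3 * P, rng),
    "cls_w2": rand_matrix(6, H, rng),
}
features = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(3)]
probs = rsp_head(features, params)
loss = cross_entropy(probs, label=2)
print(len(probs), round(sum(probs), 6), loss)
```

In a real implementation the lists would be tensors and the weights would be trained by backpropagating the cross-entropy loss through the head and the shared encoder.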
Momentum Contrast model (MoCo) (He et al., 2020) is one of the most popular self-supervised models that even outperforms supervised baseline models. Given a data point $x$ in a dataset, MoCo samples a positive pair $x^{+}$ and $K$ negative pairs $\{x_j^{-}\}_{j=1}^{K}$. MoCo is trained with the infoNCE loss (Oord et al., 2018), defined as
$$\mathcal{L}_{NCE} = -\log \frac{\exp\big(f_q(x)^{\top} f_k(x^{+}) / \tau\big)}{\exp\big(f_q(x)^{\top} f_k(x^{+}) / \tau\big) + \sum_{j=1}^{K} \exp\big(f_q(x)^{\top} f_k(x_j^{-}) / \tau\big)},$$
where $f_q$ and $f_k$ are neural networks, and $\tau$ is a hyperparameter for temperature. This is the log loss of a softmax classifier, which minimizes the difference between the representation of $x$ and its positive pair $x^{+}$ while maximizing the differences between $x$ and the negative pairs $x_j^{-}$. Note that minimizing $\mathcal{L}_{NCE}$ maximizes the lower bound for mutual information between $x$ and $x^{+}$ (Oord et al., 2018). However, the bound is not tight for a small number of negatives $K$; therefore, in practice, we need to use a large number of negative samples for each iteration. However, as this is not practical for computational efficiency, MoCo maintains a large queue of encoded data. At each training iteration, the entire mini-batch consisting of a positive sample and negative samples is inserted into the queue. Therefore, we use the entire queue (except the positive sample) as the set of negatives for the infoNCE loss. One of the key observations made by MoCo is that this can be problematic if the encoder changes too quickly, as this would cause the distribution of the samples in the queue to differ considerably from that of the new samples, and the classifier can easily decrease the loss. To solve this problem, MoCo uses two networks: the encoder $f_q$ with parameters $\theta_q$ and the momentum encoder $f_k$ with parameters $\theta_k$. $\theta_k$ is not trained with the infoNCE loss but is updated with momentum parameter $m$:
$$\theta_k \leftarrow m \, \theta_k + (1 - m) \, \theta_q$$
after each training iteration. We use a queue size of 8192 and $m$ of 0.999, and adopt multiple augmentation schemes. In each training iteration, for each data point $x$, we randomly i) jitter the brightness, contrast, saturation, and hue by 0.6 to 1.4, ii) rotate it by 0 to 360 degrees, iii) flip vertically and horizontally, and iv) crop with an area in the range 0.7 to 1 and stretch to the original size.
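The three ingredients described above (the infoNCE loss, the fixed-size queue of encoded keys, and the momentum update) can be illustrated with a small, dependency-free sketch. It is not the MoCo reference implementation: the function names, the 2-D toy embeddings, the queue length of 4, and the temperature of 0.07 are assumptions chosen for the example; the text itself specifies only the queue size of 8192 and momentum 0.999.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature):
    """infoNCE: cross-entropy of a softmax over one positive and K negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Fixed-size queue of encoded keys; the oldest entries are dropped automatically.
queue = deque(maxlen=4)  # toy size; the text uses 8192
for key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.6, 0.8]):
    queue.append(key)

q = [0.6, 0.8]
loss_close = info_nce(q, [0.6, 0.8], [[-0.6, -0.8]], temperature=0.07)
loss_far = info_nce(q, [-0.6, -0.8], [[0.6, 0.8]], temperature=0.07)
new_key_params = momentum_update([0.0, 0.0], [1.0, 1.0], m=0.999)
print(loss_close, loss_far, new_key_params)
```

With `m` close to 1, the key encoder drifts only slightly per iteration, which keeps the keys already stored in the queue comparable to newly encoded ones.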
Variational autoencoder (VAE) (Kingma and Welling, 2013) is an unsupervised machine learning model that is often used for dimensionality reduction and image generation. The model contains an encoder and a decoder, with a latent space that has a dimension smaller than the input data. The reduction in dimension in the latent space helps extract the prominent information in the original data. Unlike vanilla autoencoders, VAE assumes that the input data comes from some latent distribution. The encoder estimates the mean ($\mu$) and variance ($\sigma^2$) of the data in the latent space, and the decoder samples a point from the distribution for data reconstruction. The assumption of the latent vector $z$ following a normal distribution and the stochastic property of the latent vector force the model to create a continuous latent space with similar data closer in the space. This resolves the model overfitting due to irregularities in the latent space often observed in the conventional autoencoder. The learning rule of VAE is to maximize the evidence lower bound (ELBO),
$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big),$$
where $q(z \mid x)$ is the approximate posterior distribution of $z$. The first term describes the reconstruction loss of the autoencoder model. The second term can be seen as a regularizer that forces the approximate latent distribution to be close to the prior $p(z) = \mathcal{N}(0, I)$. Standard stochastic gradient descent methods cannot be directly applied to the model because of the stochastic property of the latent vector. The solution, called the reparameterization trick, is to introduce a new random variable $\epsilon \sim \mathcal{N}(0, I)$ as the model input and set the latent vector to $z = \mu + \sigma \odot \epsilon$. This allows all model parameters to be deterministic.
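The reparameterization trick and the regularization term can be written out in a few lines. The sketch below is a generic illustration, not the authors' model: it assumes a diagonal Gaussian posterior parameterized by the mean and the log-variance (a common implementation convention, whereas the text speaks of the variance directly), a standard normal prior, and a squared-error reconstruction term, which corresponds to a Gaussian likelihood up to constants.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def negative_elbo(x, x_recon, mu, log_var):
    """Reconstruction error plus KL regularizer (squared error is an assumed choice)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5, -0.5], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
loss = negative_elbo([1.0, 2.0], [1.0, 2.0], mu, log_var)
print(z, loss)
```

Because the randomness enters only through `eps`, the sample `z` is a deterministic, differentiable function of `mu` and `log_var`, which is what allows gradient-based training of the encoder.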
For our VAE model, we use a ResNet-18 model to encode the input image to a latent vector of size 512. Then, we use the generator from the BigGAN model (Brock et al., 2018) to reconstruct the original image from the latent vector.
The unsupervised learned representations are now transferred to the downstream task using limited labeled data in a task-specific way. It is a common practice to fine-tune the entire pretrained network when the downstream data is large and similar to the original pretraining data. Hence, we choose to fine-tune all layers in the pretrained network by initializing with the pretrained weights to obtain task-specific representations, together with the weights of an additional task-specific linear layer. Specifically, we fine-tune the entire network (all layers) with limited labels, along with a linear classifier or a regressor trained on top of the learned representations to obtain task-specific predictions.
The goal of consistency training (CR) is to obtain similar model predictions for differently augmented versions of the same input image (Laine and Aila, 2016; Sajjadi et al., 2016). We leverage this idea to further improve the task-specific (downstream) performance by using the second set of unlabeled data in a task-specific semi-supervised manner. In general, most existing SSL approaches utilize the entire task-specific training set to fine-tune the pretrained model on the downstream tasks. The main objective of SSL is to develop universal feature representations that can solve a wide variety of tasks on many datasets. Although many recent pretraining approaches (Chen et al., 2020; He et al., 2020; Chen et al., 2019; Chaitanya et al., 2020) have shown tremendous success in both computer vision and medical imaging, they still fail to adapt to new target tasks. A recent study by Zoph et al. (2020) reveals that the value of pretraining diminishes with stronger data augmentation and with the use of more task-specific labeled data. Further, the authors have shown that self-supervised pretraining helps only when fine-tuned with a limited amount of labeled data, whereas the model performance deteriorates with the use of a more extensive label set. This raises an important question: “to what degree does SSL work, and how much labeled data do we need to fine-tune the pretrained SSL model”.
In our work, we focus on answering the above question by performing a set of control experiments by varying the amount of labeled data in both low-data and high-data regimes on three different histology datasets. To this end, we provide an elegant solution based on teacher-student consistency training to improve the downstream performance by exploiting the unlabeled data in a task-specific semi-supervised manner.
Our teacher-student consistency training (shown in Fig. 1) has three main steps:
i) We initialize the fine-tuned model as both teacher and student network. The teacher model weights are frozen across all layers (entire network) except the last linear layer (classifier/regressor), while the student model weights are frozen only up to the output of the global average pooling layer, with an MLP (one hidden layer followed by a linear classifier/regressor) trained on top of the learned task-specific feature representations.
ii) We use the teacher network to generate pseudo labels on the deliberately noised unlabeled data. Next, a student network is trained both via the standard supervised loss (on labeled data) and the consistency loss (on unlabeled data), i.e., the supervised loss is evaluated by comparing against the ground-truth labels (cross-entropy (CE) for classification / mean squared error (MSE) for regression task), while the consistency loss (CE for classification / MSE for regression task) is obtained by comparing against the pseudo labels (i.e., logits for regression / one-hot labels for classification) of the teacher model.
iii) We update the weights of only the student model and iterate these steps by treating the student as a new teacher after every epoch to relabel the unlabeled data and train a new student. In this way, our teacher-student consistency approach propagates the label information to the unlabeled data by constraining the model predictions to be consistent with the unlabeled data under different data augmentations.
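The control flow of these three steps can be sketched with a deliberately tiny stand-in model. Everything specific here is an assumption for illustration: the "model" is a single decision threshold on scalar inputs, `fit_threshold` replaces gradient-based training of the student, and the weak augmentation is the identity. Only the loop structure (teacher pseudo-labels, student update, student becomes the new teacher after each epoch) follows the description above.

```python
import copy


def predict(model, x):
    """Toy classifier: class 1 if x exceeds the threshold, else class 0."""
    return 1 if x > model["threshold"] else 0


def fit_threshold(student, labeled, pseudo_labeled):
    """Stand-in for one epoch of student training on labeled + pseudo-labeled data."""
    data = list(labeled) + list(pseudo_labeled)
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    if not zeros or not ones:
        return student
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return {"threshold": midpoint}


def consistency_training(model, labeled, unlabeled, epochs, weak_aug):
    teacher = copy.deepcopy(model)
    student = copy.deepcopy(model)
    for _ in range(epochs):
        # The teacher is held fixed within an epoch and labels weakly augmented inputs.
        pseudo = [(x, predict(teacher, weak_aug(x))) for x in unlabeled]
        # Only the student is updated (strong augmentation would be applied here).
        student = fit_threshold(student, labeled, pseudo)
        # The student becomes the new teacher for the next epoch.
        teacher = copy.deepcopy(student)
    return student


fine_tuned = {"threshold": 3.0}
labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [1.0, 2.0, 8.0, 9.0, 4.0]
final = consistency_training(fine_tuned, labeled, unlabeled, epochs=3, weak_aug=lambda x: x)
print(final)
```

In this toy run the ambiguous point `4.0` is first pseudo-labeled as class 1 and then relabeled as class 0 once the refreshed teacher takes over, which illustrates how relabeling after each epoch can correct earlier pseudo labels.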
We start by describing our method in the context of the semi-supervised learning (SmSL) paradigm for the downstream task. Consider the training data (fine-tuning set), which consists of a set of labeled inputs with their ground-truth labels and a set of unlabeled inputs. A hyperparameter determines the relative ratio of unlabeled to labeled data. In practice, when constructing the unlabeled set, we include all labeled instances as part of it, without using their labels. Note further that we use different batch sizes for the labeled and unlabeled data, with the unlabeled batch size tied to the labeled batch size through this ratio. Formally, we aim to minimize a total loss given by the sum of the supervised loss and the weighted consistency loss.
Here, the supervised loss is measured on the labeled inputs, and the consistency loss is evaluated between the predictions for the same unlabeled inputs under different data augmentations. The weighting factor, which is set empirically, controls the trade-off between the supervised and consistency loss. The ConvNet model is parameterized by its weights, with separate weights for the teacher and the student network; weak and strong data augmentations are applied to the inputs of the teacher and student model, respectively.
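The objective described above can be written compactly as the following sketch. The symbols are assumptions chosen here for illustration and may not match the paper's own notation: $\mathcal{L}_s$ and $\mathcal{L}_c$ stand for the supervised and consistency losses, $\lambda$ for the weighting factor, $f_\theta$ for the ConvNet, $\theta_t$ and $\theta_s$ for the teacher and student weights, $\alpha$ and $\beta$ for the weak and strong augmentations, and $(x_l, y_l)$ and $x_u$ for a labeled pair and an unlabeled input.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{s}\bigl(f_{\theta_s}(x_l),\, y_l\bigr)
  + \lambda \, \mathcal{L}_{c}\bigl(f_{\theta_t}(\alpha(x_u)),\; f_{\theta_s}(\beta(x_u))\bigr)
```

In this reading, only $\theta_s$ receives gradient updates, while $\theta_t$ is treated as a constant when the consistency term is differentiated.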
Earlier works on consistency training (Sohn et al., 2020; Xie et al., 2019; Tarvainen and Valpola, 2017; Liu et al., 2020; Li et al., 2020b) mainly focused on improving the quality of consistency targets (pseudo labels) by using either of two strategies: i) careful selection of domain-specific data augmentations; or ii) selection of a better teacher model rather than a simple replication of the student network. However, these approaches have some limitations. First, the predicted pseudo labels for the unlabeled data may be incorrect since the model itself is used to generate them. If a high weight is assigned to these pseudo labels, the quality of learning may be hampered by misclassification, and the model may suffer from confirmation bias (Arazo et al., 2020). Second, instead of using a converged model (such as a pretrained one) to generate pseudo labels with high confidence scores, the models are trained from scratch, leading to lower accuracy and high entropy.
In this work, we aim to overcome these limitations with a solution that combines the advantages of both strategies in a simple and efficient manner. The key differences between our approach and existing consistency training methods are twofold: i) we make use of the task-specific fine-tuned model to generate high-confidence (i.e., low-entropy) consistency targets instead of relying on the model being trained; ii) we experimentally show that by aggressively injecting noise through various domain-specific data augmentations, the student model is forced to work harder to maintain consistency with the pseudo-label produced by the teacher model. This ensures that the student network doesn’t merely replicate the teacher’s knowledge.
More formally, for the regression task, we define the consistency loss as the distance between the prediction of the teacher network (with the teacher weights and weak-augmentation noise) and the prediction of the student model (with the student weights and strong-augmentation noise), computed for each unlabeled training sample. In contrast, for the classification task, the consistency loss is calculated via the standard cross-entropy (CE) loss. Here, the teacher network predicts class probabilities for an unlabeled input under weak augmentation, and the student network predicts class probabilities for the same input under strong augmentation. The CE, denoted H(.), is taken between the pseudo-label produced by the teacher network on the weakly augmented unlabeled input image and the class distribution predicted by the student.
In this work, we leverage two kinds of augmentations: weak and strong. The weak augmentation includes simple horizontal flip and cropping, while for strong augmentation we use the RandAugment technique (Cubuk et al., 2020). The complete list of data augmentations and their parameter settings is given in Section 4.2.
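The two consistency losses described above can be illustrated with a minimal, dependency-free sketch. It assumes MSE for regression and CE against a one-hot teacher pseudo-label for classification, as stated in the text; the function names and the plain-list inputs are hypothetical and are not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse_consistency(teacher_preds, student_preds):
    """Regression: mean squared error between teacher and student outputs."""
    n = len(teacher_preds)
    return sum((t - s) ** 2 for t, s in zip(teacher_preds, student_preds)) / n


def ce_consistency(teacher_logits, student_logits):
    """Classification: CE between the teacher's one-hot pseudo-label
    (argmax of its prediction on the weak view) and the student's
    predicted distribution (on the strong view)."""
    teacher_probs = softmax(teacher_logits)
    pseudo_label = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    student_probs = softmax(student_logits)
    return -math.log(student_probs[pseudo_label])
```

With a one-hot target, the cross-entropy reduces to the negative log-probability that the student assigns to the teacher's predicted class, which is why only one term appears in `ce_consistency`.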
During training, we only update the weights of the student network while keeping the teacher network weights frozen. The weights are updated by learning an MLP (with one hidden layer) and a task-specific linear classifier/regressor on the output of the global average pooling layer, with the remaining layer weights of the student network frozen. Fine-tuning these last layers (i.e., the one-hidden-layer MLP and the linear classifier/regressor) of the student model improves the task-specific performance by using both labeled and unlabeled data in a task-specific way. This is because most feature re-use from pretraining happens in the lowest layers of the network, while fine-tuning the higher layers changes the representations so that they are well adapted to the downstream tasks, an observation that is consistent with a recent study by Raghu et al. (2019). After every epoch, we make the student the new teacher and iterate this process until the model converges. The pseudocode for our proposed consistency training is given in Algorithm 1.
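As a rough illustration of this loop, the toy sketch below trains a one-dimensional linear head on top of a frozen identity "backbone". It is not the paper's Algorithm 1: the model, the additive-noise stand-ins for weak and strong augmentation, and all hyperparameter values (`lam`, `lr`, `epochs`) are assumptions made only to keep the example self-contained.

```python
import copy
import random


def predict(head, x):
    # The frozen backbone is taken to be the identity; only the head is trainable.
    return head["w"] * x + head["b"]


def train_consistency(head, labeled, unlabeled, epochs=5, lam=1.0, lr=0.05, seed=0):
    rng = random.Random(seed)
    teacher = copy.deepcopy(head)  # held fixed within an epoch
    student = copy.deepcopy(head)  # the only weights that are updated
    for _ in range(epochs):
        for (x_l, y_l), x_u in zip(labeled, unlabeled):
            weak = x_u + rng.uniform(-0.01, 0.01)   # weak noise for the teacher
            strong = x_u + rng.uniform(-0.1, 0.1)   # strong noise for the student
            pseudo = predict(teacher, weak)         # pseudo label, treated as a constant
            err_s = predict(student, x_l) - y_l         # supervised residual
            err_c = predict(student, strong) - pseudo   # consistency residual
            # Gradient of err_s**2 + lam * err_c**2 with respect to the student head.
            grad_w = 2 * err_s * x_l + lam * 2 * err_c * strong
            grad_b = 2 * err_s + lam * 2 * err_c
            student["w"] -= lr * grad_w
            student["b"] -= lr * grad_b
        teacher = copy.deepcopy(student)  # the student becomes the new teacher
    return student
```

The per-epoch `deepcopy` is the step that mirrors "make the student the new teacher"; everything else is ordinary gradient descent on the combined loss.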
We evaluate the efficacy of our method on one regression and two classification tasks on histopathology benchmark datasets, including BreastPathQ (Martel et al., 2019), Camelyon16 (Bejnordi et al., 2017) and Kather multi-class (Kather et al., 2019). Further, we also show extensive ablation experiments and compare them with state-of-the-art SSL methods by varying different percentages of labeled data.
For baselines, we compare our SSL approach (i.e., RSP) with two other popular SSL methods, VAE (Kingma and Welling, 2013) and MoCo (He et al., 2020), as well as with a supervised baseline that uses random weight initialization. To further evaluate our approach on task-specific consistency training, we fine-tune the same pretrained models a second time using different percentages of task-specific labeled data. In our experiments, we first initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data: 10%, 25%, 50%, and 100% (depicted as “self-supervised pretraining and supervised fine-tuning” in Tables 3, 4, and 5). Next, we train each of these fine-tuned models a second time using labeled and unlabeled samples, again varying the percentage of labeled data, and report the final results (depicted as “consistency training (CR)” in Tables 3, 4, and 5). Note that this experimental setting is kept the same across all three datasets for a fair evaluation.
The distribution of the number of WSI’s and their corresponding patches in all three datasets used for our experiments is shown in Table 1. In this section, we briefly describe all three publicly available datasets, whereas the data-specific implementations such as pretraining, fine-tuning, and test splits adopted in our experiments are explained in their respective subsections.
BreastPathQ dataset: This is a publicly available dataset consisting of hematoxylin and eosin (H&E) stained 96 WSI’s of post-NAT-BRCA specimens (Martel et al., 2019; Peikari et al., 2017), which are scanned at magnification level (). A set of 2579 patches each with dimension () are extracted from 69 WSI’s for training, and the remaining 1121 patches are extracted from 25 WSI’s, which are reserved for testing. Two expert pathologists label the images in this dataset according to the percentage of cancer cellularity in each image patch.
Camelyon16 dataset: We performed classification of breast cancer metastases at image level on the dataset from Camelyon16 challenge (Bejnordi et al., 2017). This dataset contains 399 H&E stained WSI’s of lymph nodes in the breast, which is split into 270 for training and 129 for testing. The images are acquired from two different centers scanned at a magnification of () and () magnification levels and are exhaustively annotated by pathologists.
Kather multiclass dataset: This dataset contains two subsets of patches containing nine tissue classes: adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa, cancer-associated stroma, and colorectal cancer epithelium (Kather et al., 2019). Out of the two subsets, the training set consists of 100K image patches of H&E stained colorectal cancer images of pixels scanned at /pixel spatial resolution. In contrast, the test set contains 7180 image patches. In this dataset, only patches are made available without access to WSIs.
We perform all our experiments by selecting ResNet-18 as the base feature embedding network, with the methods outlined in Section 3 on all three datasets. All the experiments are performed on 4 NVIDIA Tesla V100 GPUs, and the entire framework is implemented in PyTorch. We first specify the implementation details common to all datasets; data-specific implementations are provided in Table 2.
For self-supervised pretraining: The model is trained for 250 epochs with a batch size of 64. We employ (SGD with Nesterov momentum + Lookahead) optimizer (Zhang et al., 2019), with a momentum of 0.9, weight decay of and a constant learning rate of 0.01. For Lookahead, we set and slow weights step size . The best pretrained model is chosen based on the lowest validation loss across BreastPathQ, Camelyon16, and Kather datasets.
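For readers unfamiliar with Lookahead (Zhang et al., 2019), the sketch below shows its update rule in isolation: an inner optimizer advances a set of "fast" weights, and every `k` steps the "slow" weights move a fraction `alpha` of the way toward them. Plain SGD is used here as a stand-in for SGD with Nesterov momentum, and the values of `k`, `alpha`, and `lr` are placeholders, not the settings used in the paper.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=20):
    """Minimize a function given its gradient, using SGD wrapped in Lookahead."""
    slow = list(w0)
    fast = list(w0)
    for step in range(1, steps + 1):
        g = grad_fn(fast)
        fast = [w - lr * gi for w, gi in zip(fast, g)]  # inner (fast) update
        if step % k == 0:
            # Slow weights interpolate toward the fast weights, then the
            # fast weights are reset to the new slow weights.
            slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
            fast = list(slow)
    return slow
```

For example, `lookahead_sgd(lambda w: [2 * w[0]], [1.0])` minimizes the function w squared starting from 1.0 and returns slow weights that have moved toward 0.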
We use domain-specific data augmentations recommended by Tellez et al. (2019a), including rotations, horizontal flips, scaling, additive Gaussian noise, brightness, and contrast perturbations, shifting hue and saturation values in HSV color space, perturbations in H&E color space. We also add random resized crops, blur and affine transformations to the previous list. Specifically, we use a rotation factor between , scaling factor between [0.8, 1.2], additive Gaussian noise with [ and ], affine transformation with translation, scale and rotation limit of , respectively, hue and saturation intensity ratio between [-0.1, 0.1] and [-1, 1], respectively, brightness and contrast intensity ratios between [-0.2, 0.2], blurring the input image using a random-sized kernel in the range [3, 7], and randomly resizing and cropping the image patch to its original image size. Finally, we perturb the intensity of hematoxylin and eosin (HED) color channels with a factor of [-0.035, 0.035]. We apply all these transformations in sequence by randomly selecting them in each mini-batch to obtain a diverse set of training images.
For supervised fine-tuning: We fine-tune the entire pretrained SSL model (all layers) with a linear classifier or a regressor trained on top of the learned representations, with limited labels (10%, 25%, 50% and 100% of labeled examples), to directly evaluate the performance of RSP, VAE, and MoCo models. In particular, for RSP, we discard entirely the last MLP with one hidden layer after pretraining and fine-tune with a linear layer on the top of dimensional () embedding followed by softmax to obtain task-specific predictions. However, for VAE and MoCo, we fine-tune with a linear layer on the feature vector.
For fine-tuning, we use different sets of hyperparameters for all three datasets, which are provided in Table 2. Further, we include a simple set of augmentations ( as depicted in Algorithm 1), such as rotation, scaling, and random resized crops. For rotation and scaling, we use a factor of  and [0.8, 1.2], respectively, and we randomly resize and crop the image patch to its original image size.
For consistency training: We use a semi-supervised approach for consistency training by using labeled and unlabeled examples in a task-specific manner. We adopt the same task-specific fine-tuned model as both teacher and student network, with teacher network weights frozen (all layers); while training a student network with one hidden layer MLP and a task-specific linear layer (classifier/regressor) on the output of global average pooling layer (with rest of the layer weights frozen). All the hyperparameters related to consistency training are shown in Table 2. In our experiments, we initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data (10%, 25%, 50%, and 100%). Next, we train each of these fine-tuned models again for the second time using labeled and unlabeled samples, again by varying percentages of labeled data, and report the final results.
In our work, we use two kinds of augmentations for consistency training: “weak” and “strong” augmentation for the teacher and student network, respectively. For the teacher network, we employ simple transformations such as horizontal flip and random cropping to the original image size as weak augmentations. For the student network, we adopt a set of transformations similar to the pretraining stage, but with different hyperparameters that increase the augmentation severity; we refer to these as strong augmentations. The augmentations whose parameters differ from the pretraining stage are: an affine transformation with translation limit of [0.01, 0.1], scale limit of [0.51, 0.60] and rotation of , HSV intensity ratio between [-1, 1] and blurring the input image using a random-sized kernel in the range [5, 7]. We apply these augmentations sequentially by randomly selecting them in each mini-batch using the RandAugment technique (Cubuk et al., 2020). In our experiments, we use in RandAugment; where, denotes the number of augmentations to apply sequentially in each mini-batch, and is the magnitude that is sampled within a pre-defined range [1, 10] that controls the severity of distortion in each mini-batch.
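The selection step of such a RandAugment-style policy can be sketched as follows. The operation names are hypothetical labels for the transformations listed in the text, sampling with replacement is an assumption, and the actual image transforms are omitted; only the random choice of operations and the magnitude drawn from [1, 10] are shown.

```python
import random

# Hypothetical names for the strong augmentations described in the text.
STRONG_OPS = [
    "affine",
    "hsv_shift",
    "blur",
    "brightness_contrast",
    "gaussian_noise",
    "hed_jitter",
    "rotate",
]


def sample_policy(n_ops, rng, magnitude_range=(1, 10)):
    """Pick n_ops operations at random (with replacement) and one shared
    magnitude for the current mini-batch."""
    ops = rng.choices(STRONG_OPS, k=n_ops)
    magnitude = rng.randint(*magnitude_range)  # inclusive on both ends
    return [(op, magnitude) for op in ops]
```

A call such as `sample_policy(3, random.Random(0))` returns a list of three `(operation, magnitude)` pairs that a data loader could then apply in order to every image in the mini-batch.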
In this experiment, we train our approach to automatically quantify tumor cellularity (TC) scores in digitized slides of breast cancer images for tumor burden assessment. TC score is defined as the percentage of the total area occupied by the malignant tumor cells in a given image patch (Peikari et al., 2017). For pretraining the SSL approach, we adopted 69 WSI’s of the training set, from which we randomly extract patches of size () at , , and magnification for RSP, while for VAE and MoCo, patches are extracted at magnification. We perform fine-tuning by resizing the image patches to () on 2579 training image patches (out of which 80% (2063) are reserved for training and 20% (516) for validation), and testing is done on 1121 image patches (as shown in Table 1). To experiment with limited data on the downstream task, we divide the fine-tuning set (i.e., 2063 patches) into four incremental training subsets: 10%, 25%, 50%, and 100%. Two pathologists annotated each image patch in the test set according to the percentage of cancer cellularity in each image patch. We report the intra-class correlation coefficient (ICC) values between the proposed methods and the two pathologists A and B.
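One simple way to build such incremental training subsets is to shuffle the fine-tuning set once and take growing prefixes, so that each smaller subset is contained in the larger ones. This is a sketch under that assumption; the paper does not state whether its subsets are nested, and its exact subset sizes may follow a different rounding rule than the `round` used here.

```python
import random


def incremental_subsets(items, fractions=(0.10, 0.25, 0.50, 1.00), seed=0):
    """Return a dict mapping each fraction to a nested subset of items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # one fixed shuffle keeps subsets nested
    return {f: shuffled[: round(len(shuffled) * f)] for f in fractions}
```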
|Methods % Training Data||ICC Coefficient (95% CI)|
|10% (206 labels)||25% (516 labels)||50% (1031 labels)||100% (2063 labels)|
|Pathologist A||Pathologist B||Pathologist A||Pathologist B||Pathologist A||Pathologist B||Pathologist A||Pathologist B|
|Self-supervised pretraining + Supervised fine-tuning|
|Random||0.697 [0.67, 0.73]||0.637 [0.60, 0.67]||0.786 [0.76, 0.81]||0.727 [0.70, 0.75]||0.812 [0.79, 0.83]||0.797 [0.77, 0.82]||0.863 [0.85, 0.88]||0.843 [0.83, 0.86]|
|VAE||0.733 [0.70, 0.76]||0.693 [0.66, 0.72]||0.767 [0.74, 0.79]||0.756 [0.73, 0.78]||0.790 [0.77, 0.81]||0.775 [0.75, 0.80]||0.853 [0.84, 0.87]||0.824 [0.80, 0.84]|
|MoCo||0.675 [0.64, 0.71]||0.648 [0.61, 0.68]||0.718 [0.69, 0.75]||0.651 [0.62, 0.68]||0.746 [0.72, 0.77]||0.711 [0.68, 0.74]||0.757 [0.73, 0.78]||0.718 [0.69, 0.75]|
|RSP (ours)||0.701 [0.67, 0.73]||0.667 [0.63, 0.70]||0.796 [0.77, 0.82]||0.734 [0.71, 0.76]||0.842 [0.82, 0.86]||0.834 [0.82, 0.85]||0.884 [0.87, 0.90]||0.872 [0.86, 0.89]|
|Consistency training (CR)|
|Random + CR||0.658 [0.62, 0.69]||0.630 [0.59, 0.66]||0.818 [0.80, 0.84]||0.802 [0.78, 0.82]||0.847 [0.83, 0.86]||0.839 [0.82, 0.86]||0.891 [0.88, 0.90]||0.891 [0.88, 0.90]|
|VAE + CR||0.771 [0.75, 0.79]||0.727 [0.70, 0.75]||0.842 [0.82, 0.86]||0.826 [0.81, 0.84]||0.866 [0.85, 0.88]||0.857 [0.84, 0.87]||0.884 [0.87, 0.90]||0.864 [0.85, 0.88]|
|MoCo + CR||0.808 [0.79, 0.83]||0.803 [0.78, 0.82]||0.872 [0.86, 0.89]||0.863 [0.85, 0.88]||0.848 [0.83, 0.86]||0.850 [0.83, 0.87]||0.895 [0.88, 0.91]||0.902 [0.89, 0.91]|
|RSP + CR (ours)||0.876 [0.86, 0.89]||0.846 [0.83, 0.86]||0.873 [0.86, 0.89]||0.854 [0.84, 0.87]||0.870 [0.86, 0.88]||0.861 [0.84, 0.88]||0.910 [0.90, 0.92]||0.907 [0.90, 0.92]|
Table 3: Results on BreastPathQ dataset. Predicting the percentage of tumor cellularity (TC) at patch-level (intra-class correlation (ICC) coefficients between two pathologists A and B). The 95% confidence intervals (CI) are shown in square brackets. We bold the best results.
Table 3 presents the ICC values for different methodologies, and the corresponding TC scores produced by each method on sample WSIs of the BreastPathQ test set are shown in the BreastPathQ heat-map figure (see Appendix). The consistency training (CR) improved the results of self-supervised pretrained models (VAE, MoCo, and RSP) by a 3% increase in ICC values. Further, all SSL and CR methods (VAE, MoCo, and RSP) exhibit competitive performance, which is close to or even exceeds that of the supervised baseline (Random) on all training subsets. Among all the methods, the RSP+CR achieves the best score of greater than , which even surpassed the intra-rater agreement score of (Akbar et al., 2019). Besides, our obtained TC score of on the BreastPathQ test set is superior to state-of-the-art (SOTA) methods (Akbar et al., 2019; Rakhlin et al., 2019), with a maximum score of 0.883. Specifically, our RSP+CR approach achieves a minimum of greater ICC value than VAE+CR and MoCo+CR, and at least 17% improvement in ICC value over the supervised baseline, when trained on the 10% labeled set ( image patches). In contrast, on the complete training set, all CR methods exhibit similar performance. This indicates that consistency training improves upon self-supervised pretraining predominantly in the low-data regime.
This experiment is a slide-based binary classification task to identify the presence of lymph node metastasis in WSIs using only slide-level labels. To experiment with the limited annotations, we first perform self-supervised pretraining on 60 WSI’s (35 normal and 25 tumor), which are set aside from the original training set. For pretraining, we randomly extract patches of size () at , , and magnification for RSP, while for VAE and MoCo, patches are extracted at magnification. Further, the downstream fine-tuning is performed on the randomly extracted patches of size () from the remaining 210 WSI’s (125 normal and 85 tumor) of the training set, out of which 80% (306.3K patches - 150K tumor + 156.3K normal) are reserved for training and 20% (40K patches - 20K tumor + 20K normal) for validation. We finally evaluate the methods on 129 WSI’s of the test set (as shown in Table 1). We divide the fine-tuning set containing 306.3K patches into four incremental subsets of [10%, 25%, 50%, 100%] containing [30.6K, 76.5K, 153.1K, 306.3K] image patches, respectively.
We follow the same post-processing steps as Wang et al. (2016) to obtain slide-level predictions. We first train our proposed models to discriminate patch-level tumor vs. normal patches. We then aggregate these patch-level predictions to create a heat map of tumor probability over the slide. Next, we extract several features similar to Wang et al. (2016) from the heat map and train a slide-level support vector machine (SVM) classifier to make the slide-level prediction. We compare and evaluate all three SSL pretrained and CR methods with the corresponding supervised baseline. Each method’s performance is evaluated in terms of the area under the receiver operating characteristic curve (AUC) on a test set containing 129 WSIs. In addition, we evaluate the binary classification performance (accuracy (Acc)) on the patch-level data containing 40K patches (20K tumor + 20K normal) of the validation set. Further, we test for statistical significance by comparing pairs of AUCs between consistency training and SSL methods using the two-tailed DeLong’s test (Sun and Xu, 2014). Differences in AUC whose p-value fell below the chosen significance threshold were considered significant.
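The aggregation step can be pictured with the small sketch below, which turns patch-level tumor probabilities into a few summary features for one slide. The three features are hypothetical stand-ins chosen for brevity; they are not the feature set of Wang et al. (2016), and the downstream SVM is omitted.

```python
def heatmap_features(patch_probs, threshold=0.5):
    """Summarize a slide heat map.

    patch_probs maps a (row, col) patch position to its predicted
    tumor probability; threshold is an assumed cut-off for 'tumor'.
    """
    probs = list(patch_probs.values())
    above = [p for p in probs if p >= threshold]
    return {
        "max_prob": max(probs),
        "mean_prob": sum(probs) / len(probs),
        "tumor_fraction": len(above) / len(probs),
    }
```

A feature vector of this kind, computed per slide, is what a slide-level classifier would then take as input.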
|Methods % Training Data|
|10% (30630 labels) (4000 labels)||25% (76576 labels) (10000 labels)||50% (153151 labels) (20000 labels)||100% (306303 labels) (40000 labels)|
|AUC||Acc||AUC||Acc||AUC||Acc||AUC||Acc|
|Self-supervised pretraining + Supervised fine-tuning|
|Random||0.804 [0.72 - 0.89]||0.904||0.861 [0.79 - 0.93]||0.936||0.847 [0.77 - 0.92]||0.946||0.865 [0.79 - 0.93]||0.968|
|VAE||0.737 [0.64 - 0.83]||0.827||0.814 [0.73 - 0.89]||0.864||0.830 [0.75 - 0.91]||0.906||0.818 [0.73 - 0.90]||0.907|
|MoCo||0.895 [0.84 - 0.95]||0.837||0.867 [0.80 - 0.93]||0.895||0.877 [0.81 - 0.94]||0.904||0.857 [0.78 - 0.93]||0.921|
|RSP (ours)||0.836 [0.76 - 0.91]||0.898||0.886 [0.83 - 0.94]||0.928||0.861 [0.79 - 0.93]||0.946||0.878 [0.81 - 0.95]||0.953|
|Consistency training (CR)|
|Random + CR||0.659 [0.54 - 0.77]||0.911||0.782 [0.69 - 0.87]||0.948||0.783 [0.69 - 0.87]||0.955||0.870 [0.80 - 0.94]||0.964|
|VAE + CR||0.633 [0.55 - 0.72]||0.828||0.719 [0.63 - 0.81]||0.863||0.741 [0.64 - 0.84]||0.918||0.779 [0.69 - 0.87]||0.928|
|MoCo + CR||0.728 [0.63 - 0.82]||0.835||0.742 [0.64 - 0.84]||0.902||0.766 [0.67 - 0.86]||0.929||0.825 [0.75 - 0.90]||0.946|
|RSP + CR (ours)||0.855 [0.78 - 0.92]||0.907||0.917 [0.86 - 0.97]||0.935||0.848 [0.77 - 0.93]||0.949||0.882 [0.80 - 0.96]||0.959|
Table 4 presents the AUC scores for predicting slide-level tumor metastasis using different methodologies. In the 10% label regime, the RSP and MoCo methods outperform the supervised baseline, whereas the performance of VAE is markedly lower than that of the other methods. Further, the RSP+CR approach outperforms RSP by a margin of 2% on the 10% and 25% labeled sets. The proposed RSP+CR achieves its best score of 0.917 using the 25% labeled set (K patches), compared to the winning method in the Camelyon16 challenge (Wang et al., 2016), which obtained an AUC of 0.925 using a fully supervised model trained on millions of image patches. Compared with the unsupervised representation learning methods proposed in Tellez et al. (2019b), our RSP+CR approach trained on 10% labels (K patches) outperforms their top-performing BiGAN method, trained on 50K labeled samples, by 13% in AUC. Additionally, we also evaluated our methods’ performance on the validation set containing 40K patches (20K tumor + 20K normal). Surprisingly, the supervised baseline (Random, Random+CR) outperformed the RSP and RSP+CR methods by a slight margin of 0.5% Acc on all percentages of training subsets.
Most importantly, from our experiments on the Camelyon16 dataset, we draw several insights on the generality of our approach in low- and high-label training scenarios. In the low-label data regime, i.e., the patch-wise classification task on the validation set, which has training labels ranging from 4K to 40K, we observe that adding consistency training improves the SSL model performance by up to 2% in Acc. On the 10% and 25% labeled sets, the AUCs of consistency trained models are significantly higher than the AUCs of SSL pretrained models according to the significance test described above. As we increase the number of labeled samples (50% to 100%), adding consistency training to the Random, VAE, and MoCo SSL pretrained models results in a noticeable drop in AUC values. The results for the RSP model still improve after consistency training in the high-label data regime, but these differences are not statistically significant. Thus, in general, our approach works well in a limited annotation setting, which is highly beneficial in the histopathology domain.
Further, we also observe that pretraining performance slightly diminishes with an increase in the amount of labeled data (from 10% (30K) to 100% (306K) labels), which essentially deteriorates the value of pretrained features and is consistent with the recent study by Zoph et al. (2020). Overall, our consistency training approach continues to improve the task-specific performance only when trained with low-label data, and it is additive to pretraining.
The Camelyon16 heat-map figure (see Appendix) highlights the tumor probability heat-maps produced by different methodologies. Visually, all self-supervised pretrained methods (VAE, MoCo, and RSP) focus on tumor areas with high probability, while the supervised baseline exhibits slightly lower probability values for the same tumor regions. We observe that most methods successfully identify the macro-metastases (Row 1-3), with a tumor diameter larger than 2mm, in excellent agreement with the ground truth annotation. However, the same methods struggle to precisely identify the micro-metastases (Row 4), with tumor diameter smaller than 2mm, which is generally challenging even for fully-supervised models.
Since WSIs are not available for this dataset, we could not perform self-supervised pretraining on it. Instead, we used the SSL pretrained model of Camelyon16 to fine-tune and evaluate the patch-level performance, testing feature transferability between datasets with different tissue types/organs and resolution protocols. In our experiments, the downstream fine-tuning is performed on 100K image patches of the training set and tested on 7180 images of the test set by resizing the patches to pixels.
|Methods % Training Data||Acc||F1||Acc||F1||Acc||F1||Acc||F1|
|10% (8000 labels)||25% (20000 labels)||50% (40000 labels)||100% (80000 labels)|
|Self-supervised pretraining + Supervised fine-tuning|
|Consistency training (CR)|
|Random + CR||0.938||0.670||0.943||0.735||0.941||0.723||0.939||0.707|
|VAE + CR||0.972||0.876||0.979||0.906||0.978||0.903||0.982||0.915|
|MoCo + CR||0.987||0.939||0.990||0.953||0.987||0.944||0.983||0.921|
|RSP + CR (ours)||0.982||0.918||0.982||0.913||0.985||0.930||0.986||0.934|
Table 5 presents the overall accuracy (Acc) and weighted F1 score (F1) for classification of 9 colorectal tissue classes using different methodologies. On this dataset, the MoCo+CR approach obtains a new state-of-the-art result with an Acc of 0.990, weighted F1 score of 0.953, and a macro AUC of 0.997, compared to the previous method (Kather et al., 2019), which obtained an Acc of 0.943. This underscores that our pretrained approaches generalize to unseen domains with different organs, tissue types, staining, and resolution protocols. All the consistency trained methods marginally outperform the SSL pretrained models on all subsets of the labeled set. Further, the CR methods (RSP+CR, MoCo+CR, VAE+CR) outperform the supervised baseline by 3% and 17% in Acc and F1 score, respectively. Compared to the previous representation learning method (Pati et al., 2020), which obtained an Acc of 0.951 when trained using 100% labels, our approach obtains a 3% improvement in Acc by training on just 10% labels. Thus, in general, our approach can be applied to other domain-adaptation problems in histopathology, where target annotations are often limited or sometimes unavailable.
In this section, we perform the ablation experiments to study the importance of three components of our method: (i) ratio of unlabeled data; (ii) impact of strong augmentations on student network; (iii) convergence behavior of consistency training. We choose to perform these ablation studies on 10% labeled data on BreastPathQ and Camelyon16 datasets due to time constraints. Further, we exclude the Kather Multiclass dataset, as it was used to evaluate the feature transferability between datasets, thus making it less suitable for this extensive study.
|Ratio of Unlabeled Data ()||BreastPathQ||Camelyon16|
|ICC ()||ICC ()||AUC||Acc|
The success of consistency training is mainly attributed to the amount of unlabeled data. From Table 6, we observe a marginal to noticeable improvement in performance as we increase the ratio of unlabeled to labeled batch size (). This is consistent with the recent studies in Xie et al. (2019) and Sohn et al. (2020). For each fold increase in the ratio between unlabeled and labeled samples, the performance improves by at least 2% on both BreastPathQ and Camelyon16. However, the improvement on BreastPathQ is comparatively small since the number of training samples ( patches) is substantially lower than in Camelyon16 (K patches). In addition, increasing the ratio of unlabeled data while fine-tuning the pretrained model tends to converge faster than training the model from scratch. In essence, a large amount of unlabeled data is beneficial for obtaining better performance during consistency training.
|No of Possible Transformations||BreastPathQ||Camelyon16|
The success of teacher-student consistency training is crucially dependent on the different strong augmentation policies applied to the student network. Table 7 analyzes the impact of augmentation policies on final performance. In our experiments, we apply each of these augmentations sequentially by randomly selecting them in each mini-batch using the RandAugment (Cubuk et al., 2020) technique. We vary the total number of augmentations () from value 1 to 7 and examine the effect of strong augmentation policies (applied to the student network) during consistency training. From Table 7, we observe that as we gradually increase the severity of augmentation policies in the student model, there are marginal to noticeable gains in performance. This improvement is mainly visible when training on large amounts of unlabeled data (such as Camelyon16), where there is a minimum improvement in AUC as we increase the augmentation strength. This suggests that adding strong augmentations to the student network is essential to prevent the model from merely replicating the teacher’s knowledge and to gain further improvements in task-specific performance.
With the advancements in deep learning techniques, current histopathology image analysis methods have shown excellent human-level performance on various tasks such as tumor detection (Campanella et al., 2019), cancer grading (Bulten et al., 2020), and survival prediction (Wulczyn et al., 2020). However, to achieve these results, these methods require a large amount of labeled data for training. Acquiring such massive annotations is laborious and tedious in practice. Thus, there is great potential in exploring self/semi-supervised approaches that can alleviate the annotation burden by effectively exploiting the unlabeled data. In this spirit, we propose a self-supervised driven consistency training method for histology image analysis that leverages the unlabeled data in both a task-agnostic and a task-specific manner. We first formulate the self-supervised pretraining as a resolution sequence prediction task that learns meaningful visual representations across multiple resolutions in WSIs. Next, a teacher-student consistency training is employed to improve the task-specific performance based on prediction consistency with the unlabeled data. Our method is validated on three public histology datasets, i.e., BreastPathQ, Camelyon16, and Kather Multiclass, on which it consistently outperforms other self-supervised methods as well as the supervised baseline under a limited-label regime. Our method has also shown its efficacy in transferring pretrained features across different datasets with different tissue types/organs and resolution protocols.
Despite the excellent performance of our method, there is one main limitation: if the pseudo labels produced by the teacher network are inaccurate, then the student network is forced to learn from incorrect labels, leading to confirmation bias (Arazo et al., 2020). As a result, the student may not become better than the teacher during consistency training. We addressed this issue with RandAugment (Cubuk et al., 2020), a strong data augmentation technique, which we combine with label smoothing (soft pseudo labels) to deal with confirmation bias. This is also consistent with the recent study (Arazo et al., 2020) that showed soft pseudo labels outperform hard pseudo labels when dealing with label noise. However, the bias issue still persists with soft pseudo labels in our application. This is prominently visible in our method: compared to self-supervised pretraining (see the Camelyon16 heat-map figure, columns (c) - (f), and the BreastPathQ heat-map figure, columns (b) - (e)), the consistency trained approaches (see the Camelyon16 heat-map figure, columns (g) - (j), and the BreastPathQ heat-map figure, columns (f) - (i)) exhibit some low-probability spurious pixels outside the malignant cell boundaries. This happens because of the naive pseudo labeling produced by the teacher network, which sometimes overfits to incorrect pseudo labels. Further, this issue is reinforced when we attempt to train the student network on unlabeled samples with incorrect pseudo labels, leading to confirmation bias. One solution to mitigate this issue is to make the teacher network constantly adapt to the feedback of the student model instead of keeping the teacher model fixed. This has been shown to work well in the recent meta pseudo label technique (Pham et al., 2020), where both teacher and student are trained in parallel by making the teacher learn from the reward signal of the student’s performance on a labeled set.
Exploring this idea is beyond the scope of this work, and we will leave this to the practitioner to explore more along these lines.
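The difference between hard and soft pseudo labels discussed above can be illustrated with a toy classification example. This is only a sketch under stated assumptions: soft pseudo labels are taken to be the teacher's softmax probabilities, the consistency loss is taken to be a cross-entropy between the pseudo label and the student's prediction, and all function names and numbers are hypothetical rather than drawn from our implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_pseudo_label(teacher_probs):
    """One-hot vector at the teacher's most probable class."""
    k = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(teacher_probs))]


def cross_entropy(target, student_probs, eps=1e-12):
    """Cross-entropy H(target, student), used here as the consistency loss."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, student_probs))


# An uncertain teacher: class 0 is only slightly preferred (assumed logits).
teacher_probs = softmax([0.2, 0.0])
# A student that disagrees with the teacher (assumed prediction).
student_probs = [0.3, 0.7]

loss_hard = cross_entropy(hard_pseudo_label(teacher_probs), student_probs)
loss_soft = cross_entropy(teacher_probs, student_probs)
print(f"hard-label loss: {loss_hard:.3f}, soft-label loss: {loss_soft:.3f}")
```

When the teacher is uncertain, the hard label commits fully to its top class and penalizes a disagreeing student more heavily than the soft label does, which is one way to see why soft pseudo labels reduce, but do not remove, the effect of teacher errors.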
In general, our proposed self-supervised driven consistency training framework has great potential for the majority of classification and regression tasks in computational histopathology, where annotation scarcity is a significant issue. Further, our pretrained representations are generic and can be extended to other downstream tasks, such as segmentation and survival prediction. It is worth investigating further how to develop a universal feature encoder for histopathology that can solve many tasks without the need for excessive labeled annotations.
In this paper, we present an annotation-efficient framework by introducing a novel self-supervised driven consistency training paradigm for histopathology image analysis. The proposed framework utilizes unlabeled data in both a task-agnostic and a task-specific manner to significantly advance the accuracy and robustness of state-of-the-art self-supervised learning (SSL) methods. To this end, we first propose a novel task-agnostic self-supervised pretext task that efficiently harnesses the multi-resolution contextual cues present in histology whole-slide images. We further develop a task-specific teacher-student semi-supervised consistency method to effectively distill the SSL pretrained representations to downstream tasks. This synergistic use of unlabeled data has been shown to improve the SSL pretrained performance over its supervised baseline under a limited-label regime. Extensive experiments on three public benchmark datasets across two classification tasks and one regression task in histopathology, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification, demonstrate the effectiveness of our proposed approach. Our experiments also show that our method’s performance is comparable to, or even significantly better than, that of the supervised baseline when trained under limited annotation settings. Furthermore, our approach is generic and has been shown to generate universal pretrained representations that can be easily adapted to other histopathology tasks and also to other domains without any modifications.
Conflict of interest
ALM is co-founder and CSO of Pathcore. CS, SK and FC have no financial or non-financial conflicts of interest.
This research is funded by: Canadian Cancer Society (grant number 705772); National Cancer Institute of the National Institutes of Health [grant number U24CA199374-01]; Canadian Institutes of Health Research.
Fig. LABEL:Fig:Heat-maps_on_BreastPathQ Tumor cellularity scores produced on WSIs of the BreastPathQ test set for 10% labeled data.
Fig. LABEL:Fig:Heat-maps_on_Camelyon16 Tumor probability heat-maps overlaid on original WSIs from the Camelyon16 test set, predicted from 10% labeled data.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.
Transfusion: understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pp. 3347–3357.