Self-distillation Augmented Masked Autoencoders for Histopathological Image Classification

by   Yang Luo, et al.

Self-supervised learning (SSL) has drawn increasing attention in pathological image analysis in recent years. However, the prevalent contrastive SSL is suboptimal in feature representation under this scenario due to the homogeneous visual appearance of pathological images. Alternatively, masked autoencoders (MAE) build SSL from a generative paradigm and are more friendly to pathological image modeling. In this paper, we introduce MAE to pathological image analysis for the first time. A novel SD-MAE model is proposed to enable a self-distillation augmented SSL on top of the raw MAE. Besides the reconstruction loss on masked image patches, SD-MAE further imposes a self-distillation loss on visible patches, which guides the encoder to perceive high-level semantics that benefit downstream tasks. We apply SD-MAE to image classification on two pathological image datasets and one natural image dataset. Experiments demonstrate that SD-MAE is highly competitive with leading contrastive SSL methods. Its results, obtained by pre-training on a moderate number of pathological images, are also comparable to those of a method pre-trained with two orders of magnitude more images. Our code will be released soon.


1 Introduction

With the advancement of deep learning, self-supervised learning (SSL) has received increasing research attention [11, 8, 14, 7]. SSL is a special form of unsupervised learning that learns image representations by using the input signals themselves to automatically label the training data, i.e., through a pretext task [17]. A number of studies [13, 6, 2] show that such a paradigm can establish effective feature pre-training for downstream tasks. Furthermore, SSL is particularly suitable for medical image analysis, as supervised learning remains the dominant technique for many medical image analysis tasks but relies heavily on time-consuming manual annotations. Typical pretext tasks for medical and histopathological images include predicting the cube rotation of 3D medical images [29], leveraging nuclei size and quantity to extract instance-aware features [24], and solving jigsaw puzzles [20].

Self-supervised contrastive learning has perhaps been the most prevalent SSL paradigm in the past few years [6, 8, 7]. It aims to learn representations that maximize the agreement between different views while minimizing the similarity between unrelated instances. Generally, the different views are augmentations of the same instance. Although several studies [27, 22, 9] have successfully applied self-supervised contrastive learning to different pathological image analysis tasks, we argue that this paradigm might be suboptimal, for two reasons. First, different pathological image regions are somewhat homogeneous, i.e., their visual appearances are similar, although they are labeled as different categories such as normal and tumor, or normal, in situ, invasive, etc., for different clinical purposes. The performance of contrastive learning, however, depends significantly on the diversity of the sampled images, so it is not easy to extract discriminative features. Second, the number of categories in pathological image datasets is typically small, e.g., 2 or 4 [19, 26, 3]. This is limited compared to natural image understanding tasks and restricts the learning effectiveness. More specifically, the learning procedure would wrongly treat images sampled from different regions as negative instances and push them apart, even though they belong to the same category, e.g., normal.

Recently, He et al. [14] proposed a new SSL paradigm termed masked autoencoders (MAE). It falls into the transformer-based encoder-decoder framework, but the two sides are asymmetric. Image reconstruction [18] is used as the pretext task, which formulates the learning in a generative way. Concretely, the task uses a few visible image patches to reconstruct the remaining masked patches. When training is finished, the decoder is discarded while the learned encoder serves as a pre-trained backbone for downstream tasks. This paradigm has the advantage of not being affected by the number of categories. It is thus more friendly to pathological image analysis. However, to our knowledge, no existing work uses MAE to analyze pathological images.

Figure 1: Illustrative framework of the proposed SD-MAE. Besides the raw MAE, we make use of visible patches after decoding and apply them as the teacher to transfer knowledge to their counterparts after encoding.

Motivated by the analysis above, in this paper we develop SD-MAE, a self-distillation augmented MAE. It not only introduces MAE to pathological image analysis for the first time, but also enables a self-distillation augmented SSL on top of the raw MAE. We observe that MAE only imposes constraints on the reconstructed masked patches, i.e., it requires them to be as similar as possible to the raw ones, and ignores the changes on visible patches. We argue that these patches also convey important clues for SSL. Therefore, the self-distillation mechanism is leveraged to directly propagate high-level semantics to the encoder. Specifically, the features obtained from the encoder are treated as the student while their counterparts from the decoder act as the teacher, since the latter are closer to the output layer and contain more high-level information. Imposing additional constraints on the two kinds of features is therefore conducive to improving the abstraction of the encoder features and, subsequently, leads to a more powerful feature representation. We implement this idea and carry out image classification experiments on two pathological image datasets and one natural image dataset. SD-MAE achieves highly competitive results when compared with leading contrastive SSL methods. Moreover, its results, which are obtained using only a moderate number of pathological images, are comparable to those of a model trained with two orders of magnitude more images. Our contributions can be summarized as follows:

  • We point out the limitation of popular self-supervised contrastive learning in pathological image analysis and, for the first time, introduce MAE to this task, as it is more suitable for pathological image modeling.

  • We propose SD-MAE, a self-distillation augmented MAE. It enhances the feature representation on top of MAE, thus further benefiting downstream tasks.

  • We carry out experiments on both pathological and natural image datasets, which verify that SD-MAE generates a more effective feature representation and improves image classification accuracy.

2 Methodology

An illustrative framework of the proposed SD-MAE is shown in Fig.1. It has two modules, i.e., masked image modeling and visible image modeling. The former enables a generative SSL on unlabeled data by constraining the masked image patches, while the latter further imposes self-distillation constraints on visible patches to guide a more effective encoder learning. It is an elegant complement to the raw MAE, especially for pathological images, which have a homogeneous visual appearance and few categories. We introduce the two modules below.

2.1 Masked Image Modeling

MAE [14] is leveraged as our masked image modeling block. Its modeling process is independent of the number of categories and is thus more friendly to pathological image analysis. Generally, it consists of four components:

2.1.1 Patchifying and Masking

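This component decides how the input images are masked and at what masking ratio. For patchifying, the input image $x \in \mathbb{R}^{H \times W \times C}$ is first split by convolutions with kernel size ($P \times P$) and stride $P$ into $N$ patches, where $N = HW/P^2$. The size of each patch is $P \times P \times C$. Then, each patch is flattened to a token, i.e., a 1-dimensional vector of visual features with length $L = P^2 C$. The representation of all patches is formulated as $x_p \in \mathbb{R}^{N \times L}$. For masking, we randomly divide the patches into two sets according to a masking ratio $r$, namely $x_p = \{x_v, x_m\}$, where $x_v$ contains the $(1-r)N$ visible patches and $x_m$ the $rN$ masked patches. $x_v$ will be used as the input of the encoder and $x_m$ as the labels.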
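The patchifying and masking step can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names `patchify` and `random_masking`, the nested-list image layout, and the choice to keep `int(N * (1 - ratio))` visible patches are all assumptions made for illustration.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append the C channel values
            tokens.append(token)
    return tokens

def random_masking(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible_ids, masked_ids)."""
    num_visible = int(num_patches * (1 - mask_ratio))  # assumed rounding rule
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_visible]), sorted(order[num_visible:])

# Toy example: an 8 x 8 image with 3 channels and 4 x 4 patches.
image = [[[r, c, 0] for c in range(8)] for r in range(8)]
tokens = patchify(image, patch=4)
visible_ids, masked_ids = random_masking(len(tokens), 0.5, random.Random(0))
```

Only the tokens indexed by `visible_ids` would be passed to the encoder; the tokens indexed by `masked_ids` serve as reconstruction labels.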

2.1.2 Encoder

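The encoder takes $x_v$ as input and extracts latent features from the visible patches. Unlike convolutional networks, ViT is suitable for MAE due to its "isotropic" architecture [21], in which the size and shape of the output equal those of the input. The encoder first maps the token dimension $L$ to $D$ with a trainable linear projection and adds holistic positional embeddings to the visible tokens, that is, $z_0 = x_v E + E_{pos}^{v}$, where $E \in \mathbb{R}^{L \times D}$ is the projection matrix and $E_{pos}^{v}$ denotes the positional embeddings at the visible positions. It then processes $z_0$ via a series of Transformer blocks, yielding the latent representation vectors of the visible patches, $z_v$.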

2.1.3 Decoder

The decoder impels the masked tokens to learn low-level representations from the visible patches for subsequent image reconstruction. Initially, the decoder concatenates the encoded visible tokens $z_v$ and the mask tokens $z_m$ into one matrix $z = [z_v; z_m]$. To keep the positional relations of the patches, the decoder also adds positional embeddings to all tokens, that is, $z' = z + E_{pos}'$, where $E_{pos}'$ covers all $N$ patch positions. The full set of tokens is then processed by the decoder. The decoder output $\hat{z}$ is divided into $\hat{z}_v$ and $\hat{z}_m$, indicating the visible and masked tokens, respectively.

2.1.4 Prediction target

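The prediction target defines how the original signals are predicted. We take the original masked tokens after normalization, $\bar{x}_m$, as our prediction target. The decoder uses a linear layer to align the dimension of $\hat{z}_m$ with that of $\bar{x}_m$, namely $y_m = \mathrm{Linear}(\hat{z}_m)$. We compute the mean squared error (MSE) loss between the predicted $y_m$ and the labels $\bar{x}_m$:

$$\mathcal{L}_{mse} = \frac{1}{N_m} \sum_{i=1}^{N_m} \left\| y_m^{(i)} - \bar{x}_m^{(i)} \right\|_2^2 \qquad (1)$$

where $N_m$ is the number of masked tokens.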
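As a rough illustration of the reconstruction objective, the sketch below computes an MSE loss restricted to masked patches against per-patch normalized targets. The helper names, the per-patch mean and variance normalization, the `eps` value, and the averaging over elements are assumptions for illustration; the paper only states that normalized masked tokens are the target and that MSE is used.

```python
import math

def normalize_patch(patch, eps=1e-6):
    """Normalize one flattened patch to zero mean and unit variance (assumed scheme)."""
    mean = sum(patch) / len(patch)
    var = sum((v - mean) ** 2 for v in patch) / len(patch)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in patch]

def masked_mse(predictions, patches, masked_ids):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for idx in masked_ids:
        target = normalize_patch(patches[idx])
        total += sum((p - t) ** 2 for p, t in zip(predictions[idx], target))
        count += len(target)
    return total / count

# Toy data: three flattened patches, of which patches 1 and 2 are masked.
patches = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 4.0, 4.0], [5.0, 1.0, 1.0, 5.0]]
masked_ids = [1, 2]
perfect = [normalize_patch(p) for p in patches]
loss_perfect = masked_mse(perfect, patches, masked_ids)
zeros = [[0.0] * 4 for _ in patches]
loss_zeros = masked_mse(zeros, patches, masked_ids)
```

Because the loop runs only over `masked_ids`, predictions at visible positions never contribute to this loss, which is the gap that the visible image modeling module targets.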


2.2 Visible Image Modeling

Knowledge distillation is a widely recognized method that reinforces the learning capability of a student model by transferring knowledge from a teacher model [15]. Usually, the teacher is a large-capacity network while the student is a small one. Self-distillation is a special form of knowledge distillation. Zhang et al. [28] first applied it to features at different depths within the same neural network: knowledge is distilled from deeper layers to shallower layers, enhancing the feature representation of the shallow layers.

As seen in Equ.(1), MAE updates the encoder parameters only through the MSE loss on masked tokens, while the authors of [14, 25] reported that also imposing the MSE loss on visible tokens degrades performance. As a result, visible tokens after decoding are not evaluated at all during training. We argue that these tokens also convey valuable knowledge, as they undergo the same additional decoding as the masked tokens and thus learn a better representation related to the prediction target, i.e., patch-based image reconstruction. There might be other means of exploiting the knowledge perceived in this process. With this idea in mind, we try the following two schemes to exploit the visible tokens. Specifically, there are two kinds of latent representation vectors for visible tokens in MAE, namely $z_v$ after encoding and $\hat{z}_v$ after decoding. We treat them either as shallow and deep features in the self-distillation framework [28], or as two views in self-supervised contrastive learning. Consequently, a self-distillation scheme or a self-contrastive scheme can be built on top of MAE as follows.

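We first introduce the self-distillation scheme, in which the decoded visible tokens $\hat{z}_v$ and the encoded visible tokens $z_v$ represent the teacher and student networks, respectively. After passing each through three simple MLP layers, we obtain two high-dimensional vectors $p_t$ and $p_s$, respectively. We further regard $p_s$ as the probability distribution of discrete visible patches in the hidden space, and $p_t$ as the labels of these visible patches. Therefore, we learn to match these two distributions by minimizing the cross-entropy loss, which can be described as follows:

$$\mathcal{L}_{sd} = -\sum_{k} p_t^{(k)} \log p_s^{(k)} \qquad (2)$$

The total loss is formulated as follows:

$$\mathcal{L} = \mathcal{L}_{mse} + \lambda \mathcal{L}_{sd} \qquad (3)$$

where $\mathcal{L}_{mse}$ is the MSE reconstruction loss and $\lambda$ is an empirically determined scaling factor.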
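The self-distillation objective can be sketched as a cross-entropy between teacher and student distributions over the visible tokens. This is a minimal illustration under stated assumptions: applying a softmax to both projected vectors, averaging over tokens, and all function names are hypothetical choices, and the `scale` argument stands in for the scaling factor, whose value is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """H(teacher, student) = -sum_k teacher_k * log(student_k)."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))

def self_distillation_loss(teacher_logits, student_logits):
    """Average cross-entropy over visible tokens (teacher: decoder side, student: encoder side)."""
    losses = [
        cross_entropy(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

def total_loss(reconstruction_loss, distillation_loss, scale):
    """Reconstruction loss plus the scaled self-distillation term."""
    return reconstruction_loss + scale * distillation_loss

# Toy check: a student that agrees with the teacher gets a lower loss.
teacher = [[2.0, 0.0, 0.0]]
agree = self_distillation_loss(teacher, [[2.0, 0.0, 0.0]])
disagree = self_distillation_loss(teacher, [[0.0, 2.0, 0.0]])
```

In a real training loop the teacher branch would typically be excluded from gradient computation so that only the encoder side is pulled toward the decoder side; whether the authors do so is not stated in the text.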

Next, we explain the self-contrastive scheme. As $z_v$ and $\hat{z}_v$ come from different stages of the same network, we can regard these two views as a positive pair and those of other images from the same batch as negative instances. Thus, we can perform contrastive learning on them. Mathematically, we use the mean values of $z_v$ and $\hat{z}_v$ as the positive pair. To calculate the contrastive loss, the model minimizes the distance between the positive pair, $u_i$ and $u_j$ from samples $i$ and $j$ respectively, and maximizes the distance between negative pairs, i.e., $u_k$ coming from other patches in the same batch. We employ the normalized temperature-scaled softmax similarity [1], defined as follows:

$$\mathcal{L}_{sc} = -\log \frac{\exp(\mathrm{sim}(u_i, u_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(u_i, u_k)/\tau)} \qquad (4)$$

where sim(.,.) is the cosine similarity between two vectors and $\tau$ denotes a constant temperature parameter, which is set to 0.5. The total loss is formulated as:

$$\mathcal{L} = \mathcal{L}_{mse} + \lambda \mathcal{L}_{sc} \qquad (5)$$

We use the same scaling factor $\lambda$ as in the self-distillation scheme. Intuitively, the self-distillation scheme is expected to generate a more effective representation than its self-contrastive counterpart, as it is better tailored to pathological images. We verify this in the experiments.
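For the self-contrastive scheme, a temperature-scaled softmax over cosine similarities can be sketched as follows. The function names and the toy vectors are hypothetical; the temperature of 0.5 is the value given in the text, and treating the encoder-side and decoder-side vectors of one image as the positive pair follows the description above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nt_xent(anchor, positive, negatives, temperature=0.5):
    """Temperature-scaled softmax loss for one anchor with one positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy example: the anchor is the encoder-side vector of one image, the positive
# is its decoder-side vector, and the negative comes from another image.
aligned = nt_xent([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = nt_xent([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the two views of the same image agree and large when the anchor is closer to a negative, which is exactly the behavior that becomes problematic when negatives share the anchor's category.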

3 Experiments

3.1 Datasets

To evaluate its effectiveness, we train SD-MAE on two public pathological image datasets (i.e., PatchCamelyon, NCT-CRC-HE) and one natural image dataset (i.e., ImageNet-100). Their details are as follows.

PatchCamelyon [4] consists of 327,680 color images of size 96 x 96 extracted from the Camelyon16 dataset [5]. Each image is labeled as either normal or tumor. Following [22], there are 245,760 and 40,960 images in the training and test sets, respectively.

NCT-CRC-HE [16] is a dataset manually extracted from H&E stained human colorectal cancer images. Its tissues are divided into eight classes of colorectal cancer and one class of normal tissue. Following [27], we exclude images belonging to the background in both the training and test sets. Finally, we have 89,343 training and 6,333 test images of size 224 x 224.


ImageNet-100 is a subset of ImageNet-1000 [10]. It contains 130,000 training and 5,000 test images randomly sampled from 100 pre-selected classes.

Methods PatchCamelyon NCT-CRC-HE ImageNet-100
Supervised ViT-S [12] 81.36
CS-CO [27] (OI)
TransPath [22]
MAE [14]
Table 1: Accuracy of different methods on the three datasets. OI denotes our implementation. is the result given by [22], which is pre-trained on 15 million pathological images.

3.2 Experimental Setup

We follow almost the same protocol as MAE [14] to train SD-MAE. The input images are resized to 224 x 224 and the batch size is set to 1024 in both the pre-training and fine-tuning steps. Each image is split into 14 x 14 patches of size 16 x 16. As in most generative methods, RandomResizedCrop is the only augmentation strategy in pre-training. As in [8, 25, 14, 23], both pre-training and fine-tuning are carried out on the same dataset. For each experiment, we pre-train the model once and then fine-tune it three times. Since the objectives of generative and contrastive learning are different, we use their respective optimal evaluation methods, namely end-to-end fine-tuning for generative methods and linear probing for contrastive methods.

We use ViT-S (12 Transformer blocks with dimension 384) as the encoder and employ a lightweight decoder (4 Transformer blocks with dimension 192 and a linear projection for patch recovery). In pre-training, the masking ratio is 0.6 on the two pathological image datasets and 0.75 on ImageNet-100. SD-MAE is trained for 100 epochs with a 5-epoch warm-up. We apply an L2-normalization bottleneck (dimensions 256 and 4096 for the bottleneck and the hidden layer, respectively) as the projection head in the self-distillation or self-contrastive scheme. All experiments are carried out in PyTorch with 4 Nvidia 3090 GPUs.
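To make the token budget in this setup concrete, the following sketch works out how many tokens the encoder sees per image at the two masking ratios. The helper `token_counts` is hypothetical, and flooring the visible count with `int(N * (1 - ratio))` is an assumed rounding rule.

```python
def token_counts(image_size, patch_size, mask_ratio):
    """Return (total, visible, masked) token counts for a square image."""
    per_side = image_size // patch_size
    total = per_side * per_side
    visible = int(total * (1 - mask_ratio))  # assumed rounding rule
    return total, visible, total - visible

pathology = token_counts(224, 16, 0.6)   # (196, 78, 118)
natural = token_counts(224, 16, 0.75)    # (196, 49, 147)
```

With a masking ratio of 0.6 the encoder processes 78 of the 196 tokens, compared with 49 at a ratio of 0.75, so the pathological setting gives the encoder noticeably more visible context per image.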

Figure 2: The accuracy of different masking ratios on NCT-CRC-HE.

3.3 Results and Comparisons

Tab.1 lists the accuracy of different methods on the three datasets. As a baseline, ViT-S is trained directly on each dataset without SSL-based pre-training, and thus serves as a reference for whether SSL takes effect. Regarding the two leading self-supervised contrastive learning methods, CS-CO reports a result even worse than ViT-S. This can be explained by the difference in network capacity, as ResNet-34 is employed as the backbone in CS-CO. In contrast, TransPath shows that pre-training is useful, exhibiting nearly 3% improvements on both datasets. The result indicates that self-supervised contrastive learning is still a feasible approach for pathological image analysis. However, TransPath is pre-trained using 15 million pathological images, which are not easy to acquire. By comparison, the three MAE-based methods are pre-trained only on the training set of the dataset under evaluation. They all show improvements over ViT-S. Moreover, MAE+self-contrastive performs worse than MAE, indicating that MAE is a powerful model that is not easy to improve further. In contrast, our SD-MAE shows steady improvements over MAE on all three datasets. This implies that self-distillation extracts features complementary to the reconstruction task in MAE. It is also worth noting that TransPath reports the best accuracy among the compared methods on NCT-CRC-HE. We argue that this is largely attributable to the availability of nearly two orders of magnitude more images. We will verify this hypothesis in future work.

We are also curious about the appropriate masking ratio for pathological images in MAE, which is set to 0.75 in [14] for natural images. As seen in Fig.2, a relatively small masking ratio (i.e., 0.6) reports the best accuracy on NCT-CRC-HE. This again demonstrates the specificity of pathological image analysis. In Fig.3, we present two reconstruction examples obtained with SD-MAE. SD-MAE yields slightly better reconstruction than MAE, which suggests that self-distillation reinforces the learning capability of the MAE encoder.

Figure 3: Images reconstructed by MAE and SD-MAE on NCT-CRC-HE: (a) inputs, (b) masking, (c) MAE, (d) SD-MAE. The color boxes highlight their details.

4 Conclusion

Noting that self-supervised contrastive learning might be suboptimal in histopathological image analysis, we introduce MAE to this field for the first time and propose SD-MAE. A novel self-distillation scheme is developed to make full use of the information conveyed by MAE. It transfers abstract and complementary clues from the decoder to the encoder, guiding a more effective visual pre-training. Experimental results on two pathological image datasets and one natural image dataset demonstrate the effectiveness of SD-MAE, which could be a promising paradigm for pathological image analysis. In the future, we plan to evaluate SD-MAE on other downstream tasks such as nuclei segmentation.