
Self-supervised learning methods and applications in medical imaging analysis: A survey

by   Saeed Shurrab, et al.

The scarcity of high-quality annotated medical imaging datasets is a major problem that hampers machine learning applications in medical imaging analysis and impedes their advancement. Self-supervised learning is a recent training paradigm that enables learning robust representations without the need for human annotation, which makes it an effective solution to the scarcity of annotated medical data. This article reviews the state-of-the-art research directions in self-supervised learning approaches for image data, with a focus on their applications in medical imaging analysis. The article covers a set of the most recent self-supervised learning methods from the computer vision field, as they are applicable to medical imaging analysis, and categorizes them as predictive, generative and contrastive approaches. Moreover, the article covers (40) of the most recent research works on self-supervised learning in medical imaging analysis, aiming to shed light on recent innovation in the field. Finally, the article concludes with possible future research directions in the field.



1 Introduction

Medical image analysis is the field of science concerned with processing and analysing medical images from different modalities to extract useful information that helps in making precise diagnostic decisions (anwar2018medical). The bulk of medical image analysis falls into four main tasks derived from the main computer vision tasks and tailored to the medical field: classification, detection and localization, segmentation, and registration (altaf2019going). Each of these tasks has its own methods and algorithms that help in understanding and extracting useful information from medical images.

The recent advancements in the artificial intelligence (AI) field brought significant improvements to medical image analysis by transforming it from a heuristic-based into a learning-based approach. To elaborate, learning-based analysis approaches aim at extracting useful information (features) that represents the input images in a way that fits the target medical image analysis task. Feature extraction can be accomplished manually (engineered) or automatically (learned) from the data. While manual feature extraction is the main concern of the statistical machine learning field, the deep learning field is mainly concerned with the automatic extraction of features, which is generally preferred.

A Convolutional Neural Network (CNN) is a class of deep learning model that deals with grid-based data such as images and learns latent features in a hierarchical fashion, from the fine level (lines and edges) to the complex level (objects) (yamashita2018convolutional). The basic CNN structure is composed of three building blocks, namely the convolutional layer, the pooling layer and the fully connected layer. Each of these layers has its own function in the learning process according to the target task. The learnable layers of a CNN are optimized through the gradient descent algorithm and its variants, which aim at minimizing a loss function that measures the difference between the network output and the ground truth labels.

CNNs have been a popular choice in the field of medical image analysis and have brought tremendous progress to its various tasks, owing to their ability to deal with image data in raw format and to a performance that can be comparable to human performance at a faster rate.

However, CNNs are known to have an enormous number of trainable parameters to be estimated, usually in the millions, to capture the underlying distribution of the input data. As a result, a relatively large amount of data is required to achieve a good estimation of these parameters. Furthermore, the input data need to be annotated by humans to enable supervised training using the gradient descent algorithm.

Despite the remarkable success that CNNs have achieved in the medical image analysis field, some obstacles hamper their advancement. Building a sufficiently large, high-quality, human-annotated medical dataset is expensive and time-consuming. In addition, unlike natural scene image data, which may be annotated by less skilled personnel, medical datasets require expert personnel to accomplish the annotation process. Moreover, the annotation process raises patient privacy issues, especially when working with a specific disorder (taleb20203d). Thus, data scarcity in terms of both annotation and volume acts as a major obstacle for machine learning applications in the medical field.

As an alternative solution, the concept of transfer learning came to the fore for situations where the amount of annotated data is relatively small. Transfer learning is the process of employing the knowledge gained in a source task in another target task from the same domain to improve the performance in the target task (goodfellow2016deep; torrey2010transfer). The most common form of transfer learning in the machine learning community is built upon pre-trained state-of-the-art models such as VGG (simonyan2014very), GoogleNet (szegedy2015going), ResNet (he2016deep) and DenseNet (huang2017densely), which are trained on giant labeled image datasets such as ImageNet. ImageNet includes about (15) million natural images that belong to about (22,000) visual categories, of which a (1,000)-label subset is commonly used for pre-training.

However, employing models pre-trained on ImageNet for medical applications is a controversial issue for two reasons. Firstly, the features extracted from the natural images domain may not be a good representation in the medical field due to the remarkable difference in feature distribution, resolution and output labels between both domains. Secondly, ImageNet pre-trained models are built for predicting (1000) labels, which makes them over-parameterized in the medical images context (holmberg2020self; raghu2019transfusion). Despite that, a set of guidelines exists for applying ImageNet pre-trained models to different domains, which mainly depends on the target dataset size and the domain similarity (karpathy2016cs231n). Other approaches have also been proposed to overcome such problems, and self-supervised learning is one of them.

Self-supervised learning is a hybrid learning approach that combines both supervised and unsupervised learning schemes in a pretraining-fine-tuning fashion (xie2021self). More precisely, self-supervised learning aims at learning semantically useful features for a certain task by generating a supervisory signal from a pool of unlabeled data without the need for human annotation; these features are then used for subsequent tasks where the amount of annotated data is limited (chen2019self). Thus, from the unsupervised perspective, self-supervised learning removes the need for manually annotated data. From the supervised perspective, the model is still trained with labels, but these labels are generated from the data itself.

In practice, the self-supervised learning scheme is divided into two separate tasks called the pretext task and the down-stream task (liu2021graph). In the pretext task, a model is trained in a supervised fashion on the unlabeled data by creating labels from the data in a way that enables the model to learn the useful concepts within it. The concepts learned in the pretext task are then transferred as initial weights to the down-stream task, which accomplishes its intended goal by fine-tuning or continued training (holmberg2020self). Figure 1 depicts the main workflow of the self-supervised learning approach.

Figure 1: Self-supervised learning main workflow

Self-supervised learning has become a popular choice in the field of medical image analysis, where the amount of available annotated data is relatively small while the unlabeled data is comparatively large. Several research works have demonstrated the effectiveness of the self-supervised learning approach across various medical image analysis tasks such as detection and classification (lu2020@semi; li2021rotation; sriram2021covid), detection and localization (chen2019self; sowrirajan2020moco; nguyen2020self), and segmentation (taleb20203d; xie2020pgl; chaitanya2020contrastive).

This paper aims at reviewing the state-of-the-art research directions in self-supervised learning approaches for image data, with a focus on their applications in medical imaging analysis. Self-supervised learning can act as an effective solution to the problem of annotated data scarcity in the medical field. Thus, our main goal is to shed light on the recent innovations in self-supervised learning for medical imaging analysis by providing a high-level overview of each method developed in the medical field.

Various research works in the literature have concentrated on self-supervised learning in computer vision per se, such as (jing2020self; liu2021self; ohri2021review; jaiswal2021survey), while other works briefly reviewed the role of self-supervised learning in medical image analysis as a part of deep learning applications in that field, such as (tajbakhsh2020embracing; chen2021recent). However, to the best of our knowledge, this is the first attempt to shed light on the application of the self-supervised learning approach in medical imaging analysis in a way that bridges the gap between both fields and acts as a comprehensive reference for researchers in the field. The key contributions of this paper can be summarized as follows:

  • We provide a high-level overview of the state-of-the-art self-supervised learning methods in the computer vision field, as they are general-purpose methods that can be used in the medical context. Further, we categorize these methods as predictive, generative and contrastive self-supervised methods.

  • We cover and provide a high-level overview of a list of the (40) most recent and impactful research works on self-supervised learning in medical imaging analysis.

  • We categorize these works in a similar fashion to the categorization used for computer vision methods. Further, we include an additional category called multiple-tasks/multi-tasking to fit those works that utilize multiple tasks simultaneously.

  • We developed a GitHub repository called Awesome Self-Supervised Learning in Medical Imaging that serves as a resource for the literature in the field and will be updated continuously.

The rest of this survey is organized as follows: Section 2 summarizes the literature selection methodology. Section 3 provides an in-depth overview of the self-supervised learning approach and its methods. Section 4 reviews the recent self-supervised learning methods in medical imaging analysis, while Section 5 discusses the major insights derived from the presented methods. Section 6 highlights some possible future research directions in the field, while Section 7 concludes the paper. Finally, Appendix A lists the available implementation codes of the works discussed throughout this paper.

2 Survey methodology

This section summarizes the methodology followed by the authors to search for literature relevant to self-supervised learning applications in medical imaging analysis. This methodology includes determining the literature sources and search keywords, setting the inclusion/exclusion criteria, and selecting the papers.

2.1 Sources and keywords

The first step in our methodology is to select the main sources that will be used to search for literature. As a result, we considered three bibliographic databases as primary sources of literature.

We focused our literature search on these resources as they include reputable journals and conferences that are mainly concerned with machine learning applications in medical imaging. In addition, we considered two further sources of literature as secondary sources.

As search keywords, we opted for the terms "self-supervised learning in medical imaging", "pretext tasks in medical imaging", "representation learning in medical imaging" and "contrastive learning in medical imaging" to query the selected resources.

2.2 Inclusion/exclusion criteria

Initially, we explored the literature in the field over the period (2017-2021), as this is the period in which self-supervised learning started to enter medical imaging analysis, with more concentration on research works from the period (2019-2021), and excluded any works outside this period. Further, we examined the titles and abstracts of the research articles returned by querying the selected resources to judge the relevance of the search results. As a result, we considered only research works that either adopted a self-supervised learning approach directly to solve medical imaging tasks or presented a self-supervised learning approach in medical imaging that was novel to our knowledge, and excluded any other works of less relevance to our target. For self-supervised learning approaches from the computer vision field, we first explored the selected self-supervised learning literature in medical imaging analysis and selected those methods that have been frequently used in the medical field, even if they fall outside the predefined period. We further added some state-of-the-art methods that have not been explored directly in the medical context and excluded any other methods. In addition, we refined our search results by selecting research articles published in journals or conferences with an impact factor of (3) or greater and excluded works published in venues below this threshold. For arXiv preprints, we considered only those works cited in the selected published papers and excluded any other works. We further examined the affiliation as well as the research portfolio of the authors of these preprints before including their works. We also considered research works from outside the selected sources that are directly relevant to our target, gathered by exploring the related work sections of the selected papers.

2.3 Papers selection

As a result of the predefined inclusion/exclusion criteria, we settled on (15) self-supervised learning approaches that were developed on natural images and have been exploited in the medical context. For self-supervised learning in medical imaging, we settled on (40) papers that relate directly to self-supervised learning applications in medical imaging analysis. Each of the selected papers has been reviewed thoroughly, and a high-level overview that focuses on the innovation in the self-supervised learning approach has been developed and is presented throughout this survey.

3 Self-supervised learning approaches

The very beginnings of self-supervised learning can be traced back to the early efforts of bengio2007greedy in training deep neural networks in an unsupervised greedy layer-wise fashion by training a single-layer auto-encoder for each layer, one at a time (pretraining). After training each layer in the network separately, the resulting weights of each layer are used as initial weights to train the whole network on the target task (fine-tuning). One of the prominent downsides of the greedy layer-wise approach is that combining sub-optimal solutions does not guarantee a globally optimal solution (goodfellow2016deep). Further, the greedy layer-wise approach has been made obsolete by the emergence of end-to-end deep neural models that can be trained in a single run (mao2020survey).

Despite that, the greedy layer-wise methodology formed the nucleus of what is nowadays called the self-supervised learning approach and opened the door for its applications in computer vision, natural language processing, robotics and other fields. Pretext tasks play a central role in the self-supervised learning approach and act as its backbone. While the down-stream task may differ according to the researchers' needs and targets, the pretext task can be common among different down-stream tasks. For example, the same pretext task, e.g. a convolutional auto-encoder, could be used to learn visual features for two different down-stream tasks with different data. This property makes it helpful to categorize self-supervised learning approaches according to the nature of the pretext task.

In the context of visual data, and more specifically image data, we categorize self-supervised learning pretext tasks into three main categories: predictive, generative and contrastive tasks. Predictive pretext tasks learn the latent features in the input data by treating the pretext task as a classification problem, while generative pretext tasks aim at learning the latent features through reconstructing the input data (weng2019selfsup). Contrastive tasks, on the other hand, aim at developing robust representations of the input data by learning to differentiate between similar (positive) example pairs and dissimilar (negative) example pairs (oord2018representation). The upcoming sections introduce the reader to the most prominent methods in each category.

3.1 Predictive self-supervised learning

3.1.1 Exemplar CNN

Exemplar CNN is one of the earliest predictive self-supervised pretext designs and was proposed by dosovitskiy2015discriminative. The method hypothesises that a good representation of the input data is one that is robust to the transformations applied to it. To achieve this, a synthesized training dataset is created. This dataset consists of image patches of size (32 x 32) pixels containing objects or parts of objects, which are cropped from the original images and are called the seed patches. Following that, a set of predefined transformations including translation, scaling, rotation, contrast and color adjustment is applied randomly to each seed patch as shown in figure 2. Consequently, each seed patch along with its transformed versions forms a surrogate class in the training dataset, and a convolutional neural network is trained to discriminate between the different surrogate classes. The authors designed three convolutional architectures that differ in depth and configuration while sharing the same loss function, which is the cross-entropy loss. Finally, the authors demonstrated by experimentation that the optimum number of surrogate classes is 8000, while the optimum number of instances per class is 100.

Figure 2: Self-supervised features learning by Exemplar CNN (dosovitskiy2015discriminative).
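The surrogate-class construction described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the helper names (`random_transform`, `build_surrogate_dataset`), the toy 4 x 4 patches, and the reduced transformation set (horizontal flip, circular shift and contrast scaling) are all assumptions made for brevity.

```python
import random

def random_transform(patch, rng):
    """Apply a random flip, horizontal shift and contrast change to a 2D patch."""
    out = [row[:] for row in patch]
    if rng.random() < 0.5:                      # horizontal flip
        out = [row[::-1] for row in out]
    shift = rng.randrange(len(out[0]))          # circular translation
    out = [row[shift:] + row[:shift] for row in out]
    gain = rng.uniform(0.5, 1.5)                # contrast adjustment
    return [[v * gain for v in row] for row in out]

def build_surrogate_dataset(seed_patches, instances_per_class, seed=0):
    """Each seed patch defines one surrogate class; its label is the patch index."""
    rng = random.Random(seed)
    samples, labels = [], []
    for class_id, patch in enumerate(seed_patches):
        for _ in range(instances_per_class):
            samples.append(random_transform(patch, rng))
            labels.append(class_id)
    return samples, labels

# Three toy seed patches (hypothetical data), five transformed instances each.
seeds = [[[float(r * 4 + c + k) for c in range(4)] for r in range(4)] for k in range(3)]
samples, labels = build_surrogate_dataset(seeds, instances_per_class=5)
```

A classifier trained on `samples` with `labels` as targets would then play the role of the network that discriminates between surrogate classes.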

3.1.2 Relative position prediction

Relative position prediction is another predictive pretext task, proposed by doersch2015unsupervised and inspired by the Skip-Gram word embedding model (mikolov2013distributed) from the natural language processing field. The main hypothesis that underlies relative position prediction is that understanding the spatial context of the objects in the input image is required to predict the relative position of its parts. In the implementation, the input image is divided into a (3 x 3) grid of patches as shown in figure 3. To increase the difficulty and reduce the chance of learning shortcuts such as texture continuity and boundary patterns, a set of solutions was introduced: gaps and jitter are added between patches, and color channels are processed, either by shifting certain channels towards gray-scale or by partially dropping channels, to avoid the chromatic aberration effect. Consequently, a late-fusion convolutional model with an AlexNet-like architecture (krizhevsky2012imagenet) is trained on randomly sampled pairs of patches (a central patch and a neighbouring patch) to predict the position of the neighbouring patch relative to the central patch.

Figure 3: Self-supervised features learning by relative position prediction (doersch2015unsupervised).
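The pair sampling step can be illustrated with a short sketch. The function name `sample_patch_pair`, the label ordering and the numeric values used in the usage line are assumptions chosen for illustration; only the ideas of a 3 x 3 grid, a gap and random jitter come from the description above.

```python
import random

# Offsets (row, col) of the 8 neighbours around the central cell of a 3 x 3 grid;
# the index in this list is the classification label.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]

def sample_patch_pair(patch_size, gap, max_jitter, rng):
    """Return top-left corners of the central and neighbour patches plus the label.

    Coordinates are relative to the nominal position of the central patch.
    """
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    step = patch_size + gap                      # the gap prevents boundary matching
    jitter_r = rng.randint(-max_jitter, max_jitter)
    jitter_c = rng.randint(-max_jitter, max_jitter)
    centre = (0, 0)
    neighbour = (d_row * step + jitter_r, d_col * step + jitter_c)
    return centre, neighbour, label

centre, neighbour, label = sample_patch_pair(96, 48, 7, random.Random(0))
```

As long as the jitter does not exceed the gap, the two patches never overlap, so the network cannot solve the task by matching shared pixels.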

3.1.3 Jigsaw puzzle

Solving a Jigsaw puzzle is another pretext task, proposed by noroozi2016unsupervised and inspired by the earlier work of doersch2015unsupervised on relative position prediction. To solve a Jigsaw puzzle, a convolutional model is required to learn to restore a set of jumbled patches, e.g. 9 patches, to their original spatial arrangement. For this purpose, the authors proposed a special convolutional model called Context-Free Network (CFN) with a Siamese architecture and shared weights, as shown in figure 4. To train the network, a shuffled image with a random permutation of the 9 patches is fed to the network. However, for 9 patches there are 9! = 362,880 possible permutations. Hence, to avoid such a large solution space, the authors opted to limit the permutations to a predefined set in which each permutation is assigned an index. The role of the network is then to produce a likelihood vector over the predefined index set that maximizes the probability of the input permutation.

Figure 4: Self-supervised features learning by Jigsaw puzzle solving (noroozi2016unsupervised).
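The size of the solution space and the use of an indexed permutation set can be checked with a small sketch. The helper names and the uniform random choice of the permutation set are assumptions; the text above does not specify how the predefined set is selected.

```python
import math
import random

NUM_TILES = 9
print(math.factorial(NUM_TILES))  # 362880 possible orderings of 9 tiles

def sample_permutation_set(set_size, seed=0):
    """Draw `set_size` distinct permutations of the 9 tile indices (assumed: at random)."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < set_size:
        perm = list(range(NUM_TILES))
        rng.shuffle(perm)
        chosen.add(tuple(perm))
    return sorted(chosen)

def make_puzzle(tiles, permutation_set, rng):
    """Shuffle the tiles with a permutation from the set; the label is its index."""
    label = rng.randrange(len(permutation_set))
    perm = permutation_set[label]
    shuffled = [tiles[i] for i in perm]
    return shuffled, label

perms = sample_permutation_set(100)
shuffled, label = make_puzzle(list("abcdefghi"), perms, random.Random(1))
```

With this setup the pretext task becomes a classification problem over `len(perms)` classes instead of 362,880.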

3.1.4 Rotation prediction

Rotation prediction was first proposed by gidaris2018unsupervised to learn visual representations in an unsupervised fashion. The main idea behind the rotation prediction task is to train a convolutional model to recognize the geometric transformation applied to the input image, as shown in figure 5, which is a simple classification problem. The geometric transformations are rotations of the input image by multiples of 90°, so that each rotated image falls into one of four categories: [0°, 90°, 180°, 270°]. The main intuition behind the rotation prediction task is that enabling the convolutional model to recognize the rotation applied to the input image is directly linked to learning the prominent high-level objects in that image, their orientations and their types in relation to the dominant geometric transformation. Thus, it enables learning representative semantic features about the distribution of the input data.

Figure 5: Self-supervised features learning by rotation prediction (gidaris2018unsupervised).
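Generating the training pairs for this task requires no annotation, as the following minimal sketch shows. It assumes rotations of 0, 90, 180 and 270 degrees encoded as labels 0 to 3, and uses nested lists in place of image tensors; the function names are illustrative.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_batch(image):
    """Return the four rotated copies with labels 0..3 (0, 90, 180, 270 degrees)."""
    batch, current = [], [list(row) for row in image]
    for label in range(4):
        batch.append((current, label))
        current = rotate90(current)
    return batch

image = [[1, 2],
         [3, 4]]
for rotated, label in make_rotation_batch(image):
    print(label * 90, rotated)
```

Each input image therefore yields four labeled training examples, and the label is known by construction.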

3.2 Generative self-supervised learning

3.2.1 Denoising auto-encoders

Auto-encoders are special neural models whose main task is to reconstruct their input (goodfellow2016deep). The basic auto-encoder consists of two parts, namely the encoder network and the decoder network. The encoder compresses the network input into a latent space, while the decoder reconstructs the input from the latent space (tschannen2018recent). Upon training completion, the decoder is discarded while the encoder is kept for further processing. Denoising auto-encoders are a special type of auto-encoder proposed by vincent2008extracting for representation learning through learning to reconstruct a noise-free output from a noisy input. As shown in figure 6, a noisy version of the original image is created by adding a certain type of noise, such as Gaussian noise or salt-and-pepper noise, and is then passed to the auto-encoder, which reconstructs the original image rather than the noisy image by minimizing the reconstruction loss. The authors argued that the quality of the learned representation is directly related to the model's ability to reproduce the original input from a partially corrupted one.

Figure 6: Self-supervised features learning by denoising auto-encoders.
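The corruption step and the training target can be sketched as follows. The noise level, the toy 8 x 8 image and the function names are assumptions for illustration; the key point taken from the text is that the loss compares the reconstruction with the clean image, not with the noisy input.

```python
import random

def salt_and_pepper(image, amount, rng):
    """Set a fraction `amount` of pixels (values in [0, 1]) to 0 or 1 at random."""
    noisy = [row[:] for row in image]
    for r in range(len(noisy)):
        for c in range(len(noisy[r])):
            if rng.random() < amount:
                noisy[r][c] = rng.choice([0.0, 1.0])
    return noisy

def mse(prediction, target):
    """Mean squared reconstruction error between two images of equal shape."""
    diffs = [(p - t) ** 2 for pr, tr in zip(prediction, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
clean = [[0.5] * 8 for _ in range(8)]
noisy = salt_and_pepper(clean, amount=0.2, rng=rng)
# The auto-encoder receives `noisy` but is scored against `clean`.
identity_loss = mse(noisy, clean)   # loss of a model that simply copies its input
```

Because a model that merely copies its input is penalised, the encoder has to capture structure in the data that allows the corrupted pixels to be restored.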

3.2.2 Image inpainting

Image inpainting, or context encoder, is a generative self-supervised pretext task proposed by pathak2016context that aims at learning a rich representation through a fill-in-the-blank strategy. More precisely, part of the input image is cropped or masked, rather than corrupted with noise, and the role of the network is to complete the missing part. Three forms of masking are proposed: central block, random blocks and random region. An auto-encoder network with an AlexNet architecture and a channel-wise fully connected latent space is employed for this task as shown in figure 7. In addition, a combined loss function that integrates both a reconstruction loss and an adversarial loss (goodfellow2014generative) is optimized throughout the training. The reconstruction loss (L2) is meant to capture the overall structure of the masked part in relation to its context, while the role of the adversarial loss is to improve the appearance of the predicted masked part. Ultimately, the network must capture a good semantic representation to be able to perform its task.

Figure 7: Self-supervised features learning by image inpainting (pathak2016context).
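A central-block mask and the combined objective can be sketched as below. The function names, the normalisation by the number of masked pixels, and the weights `w_rec` and `w_adv` are assumptions for illustration; the adversarial term is passed in as a number because computing it requires a discriminator network that is outside the scope of this sketch.

```python
def central_block_mask(height, width, block):
    """Binary mask with 1 inside a centred `block` x `block` square, 0 elsewhere."""
    top, left = (height - block) // 2, (width - block) // 2
    return [[1 if top <= r < top + block and left <= c < left + block else 0
             for c in range(width)] for r in range(height)]

def apply_mask(image, mask):
    """Blank out the masked region; the result is the context the encoder sees."""
    return [[0.0 if m else v for v, m in zip(img_row, m_row)]
            for img_row, m_row in zip(image, mask)]

def masked_l2(prediction, target, mask):
    """L2 reconstruction loss computed only over the masked (missing) region."""
    total = sum(m * (p - t) ** 2
                for pr, tr, mr in zip(prediction, target, mask)
                for p, t, m in zip(pr, tr, mr))
    return total / max(1, sum(map(sum, mask)))

def joint_loss(rec_loss, adv_loss, w_rec, w_adv):
    """Weighted sum of the reconstruction and adversarial terms."""
    return w_rec * rec_loss + w_adv * adv_loss

mask = central_block_mask(8, 8, 4)
image = [[1.0] * 8 for _ in range(8)]
context = apply_mask(image, mask)
```

The random-block and random-region variants mentioned above would differ only in how `mask` is generated.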

3.2.3 Image colorization

Generating a colorized image from a gray-scale one was proposed by zhang2016colorful as both a solution to the automatic image colorization problem and a self-supervised pretext task. The Lab color space is employed in this task rather than the RGB color space as it better reflects human color perception; the L channel represents the gray-scale information and the ab channels represent the color information. Consequently, a convolutional network is trained by taking the L channel as input and the ab channels as the supervisory signal, where the role of the network is to produce the input image in the Lab color space as shown in figure 8. Nonetheless, image colorization is multi-modal in nature, which means that the same object may have several valid colors, e.g. an apple may be yellow, red or green but not other colors. To account for this, the network is designed to predict a probability distribution over the possible colors for each pixel. In addition, a weighted cross-entropy loss function is utilized to compensate for rare colors. Finally, the annealed mean of the probability distribution is computed to produce the final colorization. It is worth noting that understanding the coloring scheme of the objects in the input images helps in developing a rich representation of them.

Figure 8: Self-supervised features learning by colorization (zhang2016colorful).
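The annealed mean mentioned above can be illustrated for a single pixel. This is a sketch under simplifying assumptions: only one color channel with three quantised bins is used, the bin values are made up, and the function name and the temperature values are illustrative.

```python
import math

def annealed_mean(probabilities, bin_values, temperature):
    """Re-sharpen a per-pixel colour distribution, then take its expectation.

    probabilities: predicted probability of each quantised colour bin.
    bin_values:    representative value of each bin (one channel, for brevity).
    temperature:   1.0 gives the plain mean; values near 0 approach the mode.
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probabilities]
    peak = max(logits)                                  # for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, bin_values))

probs = [0.6, 0.3, 0.1]
bins = [-50.0, 0.0, 50.0]
print(annealed_mean(probs, bins, temperature=1.0))   # plain mean of the distribution
print(annealed_mean(probs, bins, temperature=0.1))   # close to the most likely bin
```

Lowering the temperature trades the washed-out colors of the plain mean for the more vivid result of picking the most likely bin.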

3.2.4 Split-brain auto-encoder

Split-brain auto-encoder is another pretext task, proposed by zhang2017split, that extends their earlier work on image colorization. The main idea behind the split-brain auto-encoder is to obtain a useful data representation by learning to generate one portion of the data from the remaining portion. Translating this idea to image data in the Lab color space, the gray-scale channel (L) can be generated from the color channels (ab) and vice versa. This is accomplished by modifying the traditional auto-encoder architecture so that the network is split into two disjoint sub-networks, as shown in figure 9, where each sub-network learns to predict one part of the input data from the other. Eventually, the outputs of both splits are concatenated to produce the final output of the network. The authors claim that learning from both the gray-scale and the color channels simultaneously, rather than from a single channel as in colorization, enables a better representation.

Figure 9: Self-supervised features learning by split-brain auto-encoder (zhang2017split).

3.2.5 Deep Convolutional GAN

Generative adversarial networks (GAN) are a class of deep generative models that transform random noise input into new data that mimics the real training data. Typically, a GAN architecture consists of two networks, namely the generator network and the discriminator network. The role of the generator is to convert the random noise input into an imitation of the real data, while the role of the discriminator is to distinguish whether its input is real or generated (fake). Both networks are trained in a competitive, zero-sum game until the discriminator is no longer able to recognize fake generations as fake (goodfellow2014generative).

Deep convolutional GAN, or DCGAN for short, is an extension of GAN proposed by radford2015unsupervised as an unsupervised representation learning architecture for image data. DCGAN is considered the first successful attempt to scale GANs with convolutional neural networks, as opposed to the earlier work of goodfellow2014generative, which is based on a multi-layer perceptron architecture. Further, the authors provided architectural guidelines for designing stable DCGANs: replacing pooling layers with strided convolutions in the discriminator and fractionally strided convolutions in the generator; employing batch normalization in both the generator and the discriminator; removing fully connected layers; using ReLU activation (nair2010rectified) for all generator layers except the output layer, which uses Tanh activation; and using LeakyReLU activation (maas2013rectifier) for all layers in the discriminator network. Figure 10 depicts the generator network architecture as designed by the authors. Finally, the authors evaluated the quality of the features learned by the DCGAN discriminator on an image classification task, where it showed superior performance in comparison to other unsupervised methods and opened the door for exploiting GAN-based models as a pretext task.

Figure 10: Self-supervised features learning by deep convolutional GAN (radford2015unsupervised).
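To make the guideline on replacing pooling with (fractionally) strided convolutions more concrete, the sketch below computes how the spatial size of a feature map changes through a stack of such layers. The kernel size of 4, stride of 2, padding of 1 and the 4 x 4 starting resolution are assumptions chosen for illustration and are not taken from the text above.

```python
def transposed_conv_output(size, kernel, stride, padding):
    """Spatial output size of a fractionally strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel

def strided_conv_output(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (used instead of pooling)."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: upsample a 4 x 4 feature map with four stride-2 layers.
size, generator_sizes = 4, [4]
for _ in range(4):
    size = transposed_conv_output(size, kernel=4, stride=2, padding=1)
    generator_sizes.append(size)
print(generator_sizes)        # [4, 8, 16, 32, 64]

# Discriminator: the mirror image, downsampling with strided convolutions.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(
        strided_conv_output(discriminator_sizes[-1], kernel=4, stride=2, padding=1))
print(discriminator_sizes)    # [64, 32, 16, 8, 4]
```

Under these assumed settings each layer doubles or halves the resolution, so the network learns its own up- and down-sampling instead of relying on fixed pooling operations.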

3.2.6 Bi-directional GAN

Bi-directional GAN (BiGAN) is another generative unsupervised learning architecture, proposed by donahue2016adversarial, that extends the earlier work of radford2015unsupervised and enables the inverse mapping from the data to the latent space. As opposed to GAN and DCGAN, which are built from generator and discriminator networks only, the authors of BiGAN added an encoder network E to the architecture that maps the input data x into a latent representation E(x). On the other side, the generator G decodes a latent sample z to produce a fake output G(z). Consequently, the role of the discriminator D is to recognize that the pair (x, E(x)) is real whereas the pair (G(z), z) is fake. However, the authors stated that E and G are completely separate modules that do not communicate with each other. Thus, both modules must learn to invert each other in order to beat D. Upon training completion, the representation learned by E can be transferred to the down-stream tasks. Figure 11 depicts the architecture of BiGAN. It is worth noting that the same idea was presented simultaneously in the research paper Adversarially Learned Inference (ALI) by dumoulin2016adversarially.

Figure 11: Self-supervised features learning by Bi-directional GAN (donahue2016adversarial).
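The adversarial game described above is commonly written as a minimax objective. In the sketch below, E denotes the encoder, G the generator, D the discriminator, x a real data sample and z a latent sample; the symbols p_X and p_Z for the data distribution and the latent prior are notation introduced here for illustration.

```latex
\min_{G,\,E}\;\max_{D}\; V(D, E, G) =
  \mathbb{E}_{x \sim p_X}\big[\log D(x, E(x))\big]
  + \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - D(G(z), z)\big)\big]
```

The discriminator scores joint pairs of data and latent code rather than data alone, which is what pushes the encoder and the generator towards inverting each other.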

3.3 Contrastive self-supervised learning

3.3.1 Contrastive predictive coding

Contrastive predictive coding (CPC) is a contrastive unsupervised representation learning approach proposed by oord2018representation that can fit not only image data but also text, audio and reinforcement learning in 3D environments. The main intuition behind CPC is to develop a compact representation that maximizes the mutual information between the context c and the target x rather than predicting x directly from c as in generative models. Such an approach enables learning a representation that is rich in high-level shared information while ignoring low-level information about the objects in the input data. To achieve this, three components constitute the CPC model: an encoder network, which converts the input x into a compact latent variable z; an auto-regressive network, which produces the context c out of the encoded latent variables and is used to generate future predictions; and a contrastive loss function called InfoNCE, which is formulated on the Noise-Contrastive Estimation (NCE) loss function (gutmann2010noise) and is used to optimize the CPC model.
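As a hedged illustration of the InfoNCE objective described above, the following minimal sketch scores one positive target against a set of negatives and takes the cross-entropy of picking the positive. The dot-product scoring, the function name `info_nce`, the temperature parameter and the toy vectors are assumptions made for illustration only; they are not the scoring function or the values used by the CPC authors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(context, positive, negatives, temperature=1.0):
    """InfoNCE loss for a single context vector.

    The loss is the cross-entropy of identifying the positive target
    among {positive} + negatives, using dot-product scores (an assumption).
    """
    scores = [dot(context, positive) / temperature]
    scores += [dot(context, n) / temperature for n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[0]

context = [1.0, 0.0]
# Positive aligned with the context: low loss.
loss_aligned = info_nce(context, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive pointing away from the context: high loss.
loss_misaligned = info_nce(context, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

When the positive is indistinguishable from K negatives, the loss equals log(K + 1), which is the chance level of the underlying classification problem.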

For CPC implementation on visual data (images), an input image of size (256 x 256) pixels is cropped into patches of size (64 x 64) pixels with an overlap of 32 pixels between neighbouring patches along both height and width, which results in a (7 x 7) grid of patches. Each patch is then encoded via a ResNet-101-v2 (he2016identity) encoder into a vector of size 1024, so that the whole image forms an array of size (7 x 7 x 1024) as shown in figure 12. Following that, a PixelCNN architecture (oord2016conditional) is employed as an auto-regressor that produces a context vector used to generate future predictions in a top-down fashion and in a way that maximizes the mutual information between the context and the predictions. Ultimately, the InfoNCE loss comes into play to contrast the predicted patch against all other negative patches, which may come from other locations in the input image or from other images in the same mini-batch. It is worth noting that a second version of CPC, developed by henaff2020data, introduced significant improvements to the original CPC.
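The patch arithmetic above can be checked with a short sketch. It assumes that an overlap of 32 pixels between 64-pixel patches corresponds to a stride of 32 pixels; the helper name `patch_offsets` is hypothetical.

```python
def patch_offsets(image_size, patch_size, stride):
    """Top-left offsets of overlapping patches along one image axis."""
    return list(range(0, image_size - patch_size + 1, stride))

# 64-pixel patches overlapping by 32 pixels -> stride of 32 pixels (assumption).
offsets = patch_offsets(image_size=256, patch_size=64, stride=32)
grid_side = len(offsets)                       # patches per row and per column
feature_shape = (grid_side, grid_side, 1024)   # one 1024-d vector per patch
```

The last patch starts at offset 192 and ends exactly at pixel 256, so the grid covers the whole image without padding.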

Figure 12: Self-supervised features learning by contrastive predictive coding (oord2018representation).

3.3.2 Momentum contrast

Momentum contrast (MoCo) is another self-supervised contrastive learning approach, proposed by he2020momentum, that is mainly based on the ideas of dynamic dictionary look-up and queues. As presented in figure 13, the MoCo architecture consists of two branches, namely, the query-encoder and the momentum-encoder. The role of the query-encoder is to generate a feature vector q from the query image. On the other side, the momentum-encoder encodes the data samples (whole images or patches) that make up the dictionary into key feature vectors k. The authors stated that a good dynamic dictionary should be both large and consistent. For the size property, the dictionary is designed as a queue of encoded mini-batches, where the current mini-batch enters the queue while the oldest mini-batch leaves it in First-In-First-Out fashion. As a result, the dictionary size is not restricted to the mini-batch size but can be larger. For the consistency property, as the keys of the dictionary are derived from a group of previous mini-batches, the encoder that produces them must change slowly. The authors therefore proposed a momentum update in which only the query-encoder parameters are updated by back-propagation, while the momentum-encoder parameters follow them according to a moving average formula, which allows the momentum-encoder to be updated more slowly and smoothly than the query-encoder. With the complete MoCo mechanism in place, the architecture is optimised via the InfoNCE contrastive loss (oord2018representation). A second version of MoCo was developed by chen2020improved, which added a projection head and strong data augmentation to the original architecture.
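The two mechanisms described above, the First-In-First-Out key queue and the momentum update, can be sketched as follows. This is a toy illustration rather than the authors' implementation: parameters are plain lists of floats, keys are placeholder strings, and the names `KeyQueue` and `momentum_update` as well as the momentum value are assumptions.

```python
from collections import deque

def momentum_update(query_params, key_params, m=0.999):
    """Moving-average update: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-capacity FIFO dictionary of encoded keys."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest entries automatically.
        self.keys = deque(maxlen=capacity)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

queue = KeyQueue(capacity=4)
queue.enqueue(["k0", "k1", "k2"])   # first mini-batch of keys
queue.enqueue(["k3", "k4", "k5"])   # second mini-batch pushes out k0 and k1

# Only the query encoder is trained; the key encoder trails it slowly.
key_params = momentum_update(query_params=[1.0, 1.0], key_params=[0.0, 0.0], m=0.9)
```

A momentum close to 1 keeps the key encoder nearly constant across the mini-batches stored in the queue, which is what keeps the keys comparable with each other.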

Figure 13: Self-supervised features learning by momentum contrast (he2020momentum).

3.3.3 Simple framework for contrastive learning of visual representations

Another contrastive learning approach is the simple framework for contrastive learning of visual representations, or SimCLR for short, which was proposed by chen2020simple. As its name implies, SimCLR depends mainly on two simple ideas: heavy data augmentation that results in correlated views of the same input, and a large batch size that provides a large set of negative examples. Furthermore, SimCLR omits the need for the extra logic seen in both CPC (oord2018representation) and MoCo (he2020momentum). To elaborate on the SimCLR approach, a set of random transformations including cropping and resizing, flipping, rotation, color distortion and Gaussian blur is applied to the input image, which results in a pair of positively correlated views as shown in figure 14. Both views are then passed into a pair of convolutional encoders f(·), ResNet50 (he2016deep) in the SimCLR case, to obtain their representations, denoted h_i and h_j. Following that, the generated representations are passed to a pair of projection heads g(·), each consisting of two Dense layers with ReLU activation (nair2010rectified) for the first layer and linear activation for the second layer, which results in a pair of feature vectors z_i and z_j. Ultimately, the InfoNCE contrastive loss (oord2018representation; he2020momentum), termed Normalized Temperature-Scaled Cross-Entropy Loss (NT-Xent) by the authors, is employed to optimize the whole architecture based on the generated embeddings by maximizing the agreement between the positive pair of augmented images while minimizing it for the other images in the same batch, which are considered negative samples. Upon training completion, the Dense layers are discarded while the convolutional encoders are kept to be utilized in down-stream tasks. A second version of SimCLR, developed by chen2020big, combined unsupervised, supervised and semi-supervised learning to learn task-specific models.
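For reference, the NT-Xent loss for a positive pair (i, j) in a batch of N images, that is 2N augmented views, can be written as below. Here z denotes a projection-head output, sim is the cosine similarity and τ is the temperature. This is a restatement of the loss as it is commonly given for SimCLR, and the symbol names are chosen here for illustration.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

The denominator runs over every other view in the batch, which is why a larger batch directly supplies more negative samples.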

Figure 14: Self-supervised features learning by SimCLR (chen2020simple).

3.3.4 Bootstrap your own latent

Bootstrap Your Own Latent (BYOL) is an implicit contrastive learning approach proposed by grill2020bootstrap that omits the need for negative samples during training. The BYOL architecture consists of two networks, as shown in figure 15. The first is a trainable network called the Online Network, with parameters θ, which consists of a representation head f_θ, a projection head g_θ and a prediction head q_θ. The second is a randomly initialized network called the Target Network, with parameters ξ, which is not trained by back-propagation and has the same architecture as the Online Network except for the prediction head. The Target Network acts as a slow-moving average of the Online Network: after each gradient update of the Online Network, its parameters are updated as a moving average of the Online Network parameters. To train the BYOL architecture, two augmented views v and v' are generated from the input image by applying two different augmentation operations t and t'. Both augmented views pass through the two networks for encoding and projection, while the Online Network output additionally passes through the prediction head to produce the prediction used in the subsequent computation. Following that, the Online Network prediction and the Target Network projection are both normalized via the (L2) norm and fed into a mean squared error (MSE) loss function for optimization rather than a contrastive loss. It is worth noting that the gradients flow back only through the Online Network and are stopped for the Target Network, as indicated in figure 15 by the stop-gradient term sg; the Target Network parameters ξ are instead updated with the momentum equation as a function of the Online Network parameters θ. This way, BYOL learns semantic features by training the Online Network on one augmented view to predict the Target Network representation of another augmented view of the same image. Thus, both networks learn interactively from each other from the same image while omitting the need for negative samples.
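The loss and the Target Network update described above can be sketched in a few lines. This is a simplified illustration on plain Python lists, not the authors' implementation; the function names and the decay value `tau=0.99` are assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(online_prediction, target_projection):
    """Squared error between L2-normalised vectors.

    For unit vectors this equals 2 - 2 * cosine similarity, so it ranges
    from 0 (same direction) to 4 (opposite directions).
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

def ema_update(online_params, target_params, tau=0.99):
    """Target parameters follow the online parameters as a slow moving average."""
    return [tau * t + (1.0 - tau) * o for o, t in zip(online_params, target_params)]

same_direction = byol_loss([3.0, 4.0], [0.6, 0.8])
opposite_direction = byol_loss([1.0, 0.0], [-2.0, 0.0])
new_target = ema_update(online_params=[1.0], target_params=[0.0], tau=0.9)
```

Because the target projection is treated as a constant during back-propagation, only the online parameters receive gradients from this loss; the target parameters change solely through `ema_update`.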

Figure 15: Self-supervised features learning by BYOL (grill2020bootstrap).

3.3.5 Swapping assignments between multiple views

While the previous contrastive methods are instance-discrimination-based methods, Swapping Assignments between multiple Views (SwAV) is a cluster-discrimination-based method proposed by caron2020unsupervised. Two major elements form the core of the SwAV method: the multi-crop augmentation strategy and the online clustering assignment. The multi-crop strategy aims at generating multiple views of the same image without increasing the memory and compute requirements by generating two global views with standard-resolution crops (e.g., 224 x 224 pixels) as well as local views with smaller-resolution crops (e.g., 96 x 96 pixels), which enables producing multiple views rather than just pairs. Besides, each generated view undergoes additional random transformations such as those implemented in SimCLR (chen2020simple). On the other side, offline clustering assignment methods require a complete pass over the dataset to compute the cluster assignments, which becomes computationally intensive for large datasets. Online clustering, in contrast, allows computing the cluster assignments by mapping the encoded views to prototype vectors using the current batch only, treating the assignment as an optimal transport problem solved with the Sinkhorn-Knopp algorithm (cuturi2013sinkhorn).

Figure 16 depicts the complete SwAV architecture. Given an input image x, multiple views of the same image are generated by applying a set of random transformations according to the multi-crop augmentation strategy. For simplicity, we will consider one global view and one local view. The generated views are passed into convolutional encoders, ResNet50 (he2016deep) in the SwAV case as well, followed by two Dense layers with ReLU activation (nair2010rectified) to generate feature vectors z. In fact, the initial steps in SwAV do not differ significantly from those of SimCLR (chen2020simple) except in the augmentation strategy. Following that, the feature vectors are passed through a Dense layer with linear activation, called the prototype layer, which is responsible for mapping the feature vectors onto K learnable prototype (cluster) vectors grouped in a matrix C = [c_1, ..., c_K]. It is worth noting that the value of K is not inferred but user-defined, while the entries of C are the weights of the prototype layer. To compute the cluster assignments online, only the features of the current batch are used, and the Sinkhorn-Knopp algorithm is employed to generate the cluster assignments (codes) Q, which map the feature vectors to clusters in a way that maximizes the similarity between them. Further, Sinkhorn-Knopp enforces the equipartition constraint, which prevents assigning all features to a single cluster. Once the codes are generated, a swapped prediction problem is solved. Intuitively, two different views of the same image should carry similar information, so it is possible to predict the code of one view from the feature vector of the other. This is achieved by minimizing the cross-entropy loss between the code of one view and the softmax of the similarities of the other view's feature vector to all clusters. This way, SwAV takes advantage of contrasting clusters of data with similar features rather than performing pair-wise comparisons over the whole training set as in the previous methods.
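The code computation and the swapped prediction described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the SwAV implementation: `scores` stands for the similarities between feature vectors and prototypes, and the function names, `epsilon`, the number of iterations and the temperature are hypothetical choices.

```python
import math

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a (samples x prototypes) score matrix into soft codes.

    Columns and rows are rescaled alternately so that prototypes share
    the assignment mass (approximately) equally: the equipartition constraint.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every prototype receives total mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every sample carries total mass 1/n.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    return [[v * n for v in r] for r in q]  # rescale so each row sums to 1

def swapped_loss(code, scores, temperature=0.1):
    """Cross-entropy between the code of one view and the softmax of the other view's scores."""
    logits = [s / temperature for s in scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(c * (l - log_z) for c, l in zip(code, logits))

scores_view_a = [[1.0, 0.0], [0.0, 1.0]]   # toy scores, 2 samples x 2 prototypes
scores_view_b = [[0.8, 0.1], [0.2, 0.9]]
codes_a = sinkhorn(scores_view_a)
loss = sum(swapped_loss(c, s) for c, s in zip(codes_a, scores_view_b)) / 2
```

The sketch computes the loss in one direction only (codes of view a predicted from view b); the symmetric term with the roles of the views exchanged would be added in the same way. If every sample in a batch prefers the same prototype, the column step spreads the mass across prototypes instead of letting the codes collapse onto one cluster.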

Figure 16: Self-supervised features learning by SwAV (caron2020unsupervised).

To sum up, we opted to provide a high-level overview of each of the previously discussed methods because this article is intended for self-supervised applications in medical imaging and is therefore likely to be read by nonspecialist readers from the medical field. One more point to mention is that, although these methods were developed on natural images, they can be transferred to the medical imaging field, as we will see in the next section. This property encouraged us to discuss them briefly before proceeding towards the application of self-supervised learning in medical imaging. Table 1 summarizes the discussed pretext tasks according to their categories.

Authors | Category | Method
dosovitskiy2015discriminative | Predictive | Exemplar CNN
doersch2015unsupervised | Predictive | Relative position prediction
noroozi2016unsupervised | Predictive | Jigsaw puzzle
gidaris2018unsupervised | Predictive | Rotation prediction
vincent2008extracting | Generative | Denoising auto-encoder
pathak2016context | Generative | Image inpainting
zhang2016colorful | Generative | Image colorization
zhang2017split | Generative | Split-brain auto-encoder
radford2015unsupervised | Generative | Deep Convolutional GAN
donahue2016adversarial | Generative | Bi-directional GAN
oord2018representation | Contrastive | CPC
he2020momentum | Contrastive | MoCo
chen2020simple | Contrastive | SimCLR
grill2020bootstrap | Contrastive | BYOL
caron2020unsupervised | Contrastive | SwAV
Table 1: Summary of self-supervised learning pretext tasks.

3.4 Resources in self-supervised learning

We provided a curated list of pretext tasks that acted as milestones in the history of self-supervised learning in the computer vision field; however, the efforts in this research area are not limited to those methods. As a result, we compiled a list of self-supervised learning resources that includes review articles, surveys and paper lists, as shown in Table 2, for readers who wish to deepen their understanding of the field. For in-depth reviews of self-supervised learning, we highly recommend that readers refer to one of the following articles: jing2020self provided an extensive review of self-supervised learning methods for visual feature learning from image and video data, and ohri2021review provided a comprehensive review and performance comparison for a large list of the most recent self-supervised learning approaches developed for image data. Further, schmarje2021survey reviewed various deep learning methods for image classification with fewer labels, where self-supervised learning is one of the dimensions of their work. For contrastive learning, both le2020contrastive and jaiswal2021survey provided comprehensive surveys on contrastive self-supervised methods for different research areas such as computer vision and natural language processing. Ultimately, liu2021self summarized a set of generative and contrastive self-supervised learning approaches from computer vision, natural language processing and graph learning. To access lists of papers, readers may visit the following two repositories: Awesome-self-supervised-learning covers a curated list of research articles on self-supervised learning from different research areas, and Awesome-contrastive-learning is a curated list of papers mainly dedicated to contrastive learning methods.

Authors | Type | Title | Venue
jing2020self | Survey | Self-supervised visual feature learning with deep neural networks: A survey | IEEE Transactions on Pattern Analysis and Machine Intelligence
ohri2021review | Review | Review on self-supervised image recognition using deep neural | Knowledge-Based Systems
schmarje2021survey | Survey | A survey on semi-, self- and unsupervised learning in image | IEEE Access
le2020contrastive | Review | Contrastive representation learning: A framework and review | IEEE Access
liu2021self | Review | Self-supervised learning: Generative or contrastive | IEEE Transactions on Knowledge and Data Engineering
jaiswal2021survey | Survey | Survey on contrastive self-supervised |
Jason Ren | Papers list | Awesome self-supervised |
Ashish Jaiswal | Papers list | Awesome contrastive learning |

Table 2: A summary of Self-supervised learning resources

4 Self-supervised methods in medical imaging

Mainly, there are two paths to follow when employing self-supervised learning in medical images analysis (chen2021recent). The first path is to directly adopt one of the pre-designed pretext tasks from the computer vision field as given in section 3 or alternatively develop modified versions of these tasks and employ them in the medical context. On the other side, the second path needs to harness knowledge from the medical domain and computer vision to design a novel pretext task for target medical tasks. However, we prefer to preserve the same categorization of self-supervised learning methods in the medical field as given in section 3 in order to retain a unified categorization terminology throughout the paper and group all related works together based on their category of the pretext task. Further, after exploring self-supervised learning literature in medical imaging, we discovered that some researchers tend to utilize multiple methods separately or collectively in a multi-tasking fashion. Thus, we added an additional category called multiple-tasks/multi-tasking to fit such works.

4.1 Predictive methods in medical imaging

Inspired by the relative position prediction task (doersch2015unsupervised), zhang2017self introduced a slice ordering pretext task. As 3D medical images such as CT and MRI can be represented as a stack of successive 2D slices, the spatial order of the slices can be used as an auxiliary supervision signal to learn a good representation. The authors treated slice ordering as a binary classification problem by developing a Siamese convolutional architecture called Paired-CNN that receives two successive slices and predicts their spatial order as below or above. The authors tested their proposed task on fine-grained body part recognition (regression) as a down-stream task.
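To make the supervision signal concrete, the following sketch shows how labelled training pairs for a slice ordering task could be generated from slice indices alone. It is an illustration under assumptions, not the authors' pipeline: the function name is hypothetical, and a higher slice index is assumed to mean a higher anatomical position.

```python
import random

def make_slice_pairs(num_slices, num_pairs, seed=0):
    """Sample (first, second, label) triples of neighbouring slices.

    label is 1 when the first slice has the higher index ("above",
    by assumption) and 0 otherwise. No manual annotation is required.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(num_slices - 1)   # neighbours i and i + 1
        first, second = i, i + 1
        if rng.random() < 0.5:              # present the pair in random order
            first, second = second, first
        pairs.append((first, second, 1 if first > second else 0))
    return pairs

pairs = make_slice_pairs(num_slices=40, num_pairs=5)
```

Each triple would then index two 2D slices that are fed to the two branches of a Siamese network, with the label as the binary classification target.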

spitzer2018improving proposed to predict the geodesic distance between two patches located on the brain surface in order to learn a rich representation of the human brain. To accomplish this task, they trained a Siamese architecture with two identical, weight-sharing branches. The defined distance between two patches is the Euclidean distance, while the ground truth distance is computed manually from the input data. Besides the distance prediction, the authors added the prediction of the 3D location coordinates of the input patches to the same task, which improved the accuracy and convergence of the predicted distances. Ultimately, their approach was evaluated on cytoarchitectonic segmentation as a down-stream task.

bai2019self proposed an anatomical position prediction pretext task on cardiac MRI scans for segmentation purposes. Cardiac MRI scans provide several cardiac views from different orientations, e.g., short-axis, 2CH long-axis and 4CH long-axis, and different cardiac anatomical regions, e.g., the left and right atria and ventricles, can be expressed using these views. These properties motivated the authors to define a set of anatomical positions with respect to a certain view as bounding boxes and to force the network to predict these anatomical positions through segmentation. For the down-stream task, a private dataset of 200 annotated cardiac MRI scans was used for evaluation purposes.

li2020self employed self-supervised learning to improve the uncertainty estimation of pseudo-labels in semi-supervised medical image segmentation by proposing a novel methodology called self-loop uncertainty. They adopted the Jigsaw puzzle pretext task (noroozi2016unsupervised) in their approach and introduced random rotation of the patches by one of four angles to secure learning translation and rotation invariant features. Further, they omitted the need for the Siamese architecture of the original Jigsaw puzzle by combining the input patches into a single image for subsequent permutation classification. Besides the labeled data, they leveraged unlabeled data for uncertainty estimation in semi-supervised settings. Two segmentation tasks were considered as down-stream tasks for validating the methodology: nuclei segmentation and skin lesion segmentation.

taleb2021multimodal presented another work inspired by Jigsaw puzzle solving (noroozi2016unsupervised), called multi-modal Jigsaw puzzle, that exploits information spanning multiple medical imaging modalities (e.g., T1 and T2 scans) rather than a single modality. Besides the multi-modal setting, a significant improvement over the original Jigsaw puzzle is the employment of a Sinkhorn network (mena2018learning) for Jigsaw puzzle solving. The Sinkhorn network utilizes the Sinkhorn function, which is analogous to the Softmax function, and enables learning a permutation task rather than a classification task. They also introduced cross-modal synthetic data generation using the CycleGAN architecture (zhu2017unpaired) to increase the amount of data available for self-supervision. On the down-stream side, four tasks were utilized for method validation: brain tumor segmentation, prostate segmentation, liver segmentation and survival days prediction (regression).

zhuang2019self proposed a novel pretext task for 3D medical data called Rubik cube recovery, inspired by the early work of noroozi2016unsupervised on Jigsaw puzzle solving. Two operations constitute the Rubik cube recovery pretext task: cube rearrangement and cube rotation. For the rearrangement process, the same logic as in the original Jigsaw puzzle task is adopted, but with a 3D input in the form of a second-order Rubik cube (2 x 2 x 2) rather than a 2D input. To introduce additional complexity, the authors introduced a cube rotation process and limited it to vertical and horizontal rotations only. This way, the authors secured learning translation and rotation invariant features, as opposed to the original Jigsaw puzzle task, which secures learning translation invariant features only. Ultimately, two down-stream tasks were used for evaluation purposes, brain hemorrhage classification and brain tumor segmentation, on which the method showed competitive performance.

As an extension of the previous work, zhu2020rubik introduced the Rubik cube+ pretext task, which adds a further level of complexity to the Rubik cube recovery problem in the form of cube masking identification on top of both cube rearrangement and cube rotation. The masking identification operation can be viewed as randomly blocking part of the information in a certain cube by masking. The intuition behind masking identification is that robust feature learning can be achieved by solving harder tasks. Rubik cube+ was evaluated on the same down-stream tasks as the previous work and showed a slight improvement.

nguyen2020self proposed a spatial awareness pretext task that is able to learn semantic and spatial representations from volumetric medical images. Spatial awareness is inspired by the context restoration framework (chen2019self) but is treated as a classification problem. For a certain 3D image, a single slice is selected together with a neighbouring slice within a given range, where this range represents the spatial index. Following that, two patches of predefined dimensions are selected randomly and swapped between the two slices (T) times. Ultimately, a classification network is trained to predict whether the input slice is corrupted or not in order to learn a semantic representation. Further, the network is trained to predict the spatial index, which enables learning spatial features.

Table 3 summarizes the predictive self-supervised learning methods in medical imaging.

Authors | Pretext task | Down-stream task
zhang2017self | Slices ordering | Body parts recognition
spitzer2018improving | Geodesic distance prediction | Brain area segmentation
bai2019self | Anatomical position prediction | Short-axis cardiac MRI segmentation; long-axis cardiac MRI segmentation
li2020self | Jigsaw puzzle | Nuclei segmentation; skin lesions segmentation
taleb2021multimodal | Jigsaw puzzle | Brain tumor segmentation; liver segmentation; prostate segmentation
zhuang2019self | Rubik cube | Brain tumor segmentation; brain hemorrhage classification
zhu2020rubik | Rubik cube+ | Brain tumor segmentation; brain hemorrhage classification
nguyen2020self | Spatial awareness | Organ at risk segmentation; intracranial hemorrhage detection

Table 3: Summary of predictive self-supervised learning methods in medical imaging

4.2 Generative methods in medical imaging

ross2018exploiting adopted the image colorization pretext task (zhang2016colorful) for segmenting medical instruments in endoscopic video data. However, they did not utilize the original architecture of the colorization task; rather, a conditional GAN architecture was employed to encourage the generation of more realistic colored images. Six datasets from the medical and natural domains were used in the evaluation of down-stream tasks.

chen2019self proposed a novel generative pretext task called context restoration that is inspired by the early works on relative position prediction (doersch2015unsupervised) and context encoder (pathak2016context). The authors described the context restoration task as a simple and straightforward method in which two isolated patches are selected randomly and their positions are swapped. The swapping process is repeated iteratively to produce a corrupted version of the input image that preserves the overall intensity distribution of the input image. Following that, a generative model is employed to restore the corrupted image to its original version. Three down-stream tasks were used to test the feasibility of context restoration: fetal standard scan plane classification, abdominal multi-organ localization and brain tumour segmentation.
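The corruption step described above can be illustrated with a short sketch on a toy 2D image stored as a list of rows. The function name, the patch size and the number of swaps are assumptions for illustration; the restoration model itself is not shown.

```python
import copy
import random

def swap_patches(image, patch, n_swaps, seed=0):
    """Corrupt a 2D image by swapping random pairs of non-overlapping patches.

    Pixel values are only moved, never changed, so the intensity
    distribution of the image is preserved. The image must be large
    enough to contain two non-overlapping patches.
    """
    rng = random.Random(seed)
    out = copy.deepcopy(image)
    h, w = len(out), len(out[0])
    for _ in range(n_swaps):
        while True:
            r1, c1 = rng.randrange(h - patch + 1), rng.randrange(w - patch + 1)
            r2, c2 = rng.randrange(h - patch + 1), rng.randrange(w - patch + 1)
            # Accept only isolated (non-overlapping) patch pairs.
            if abs(r1 - r2) >= patch or abs(c1 - c2) >= patch:
                break
        for dr in range(patch):
            for dc in range(patch):
                a, b = out[r1 + dr][c1 + dc], out[r2 + dr][c2 + dc]
                out[r1 + dr][c1 + dc], out[r2 + dr][c2 + dc] = b, a
    return out

image = [[r * 8 + c for c in range(8)] for r in range(8)]   # toy 8 x 8 "scan"
corrupted = swap_patches(image, patch=2, n_swaps=5)
```

A restoration network would then be trained with `corrupted` as input and `image` as the reconstruction target.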

Another work built on the same idea as context restoration is Models Genesis, proposed by zhou2019models for 3D medical images. As opposed to the context restoration pretext task, Models Genesis introduced four distortion operations, namely, non-linear transformation using a Bézier transformation function, local pixel shuffling, which is similar to the swapping operation in context restoration but in 3D settings, in-painting, which is similar to the context encoder method, and out-painting, which is the inverse operation of in-painting. It is worth noting that each input volume undergoes the first two operations and only one of the remaining operations. Consequently, a generative model is built to restore the distorted image to its original context. Seven down-stream tasks were used to evaluate their method on segmentation and classification problems.

matzkin2020self designed a self-supervised approach for reconstructing the bone flap that results from decompressive craniectomy (DC) operations using normal CT scans rather than annotated post-operative DC CT scans. DC is the surgical procedure of removing part of the skull due to different causes such as stroke and traumatic brain injury. To achieve this, the authors designed a virtual craniectomy approach that simulates DC on normal CT scans and generates post-operative-like CT scans in which bone flaps are removed from different parts of the upper head; these scans in turn serve as input for the reconstruction model. Two strategies were proposed to reconstruct the bone flap: direct estimation, and reconstruct and subtract. Further, two architectures were employed: U-Net (ronneberger2015u) and denoising auto-encoder (vincent2008extracting).

hervella2020learning proposed a multi-modal reconstruction task as a self-supervised approach for learning retinal anatomy. The main assumption is that different modalities of the same organ can provide complementary information, which enables learning useful representations for subsequent tasks. The authors proposed to reconstruct fundus fluorescein angiography photos from color fundus photos using aligned pairs of both modalities for the same patient's eye. A U-Net architecture (ronneberger2015u) is employed for the reconstruction task, with the structural similarity index (SSIM) (wang2004image) as a loss function. In subsequent research, the same authors evaluated their approach on different ophthalmic down-stream tasks such as retinal vascular segmentation (morano2020multimodal), joint optic disc and cup segmentation (hervella2020multi) and retinal disease diagnosis (hervella2021self).

holmberg2020self suggested that an effective pretext task for medical domains must accurately extract disease-related features, which are typically present in a small part of the medical image. This condition makes traditional pretext tasks, which are dominated by the presence of larger objects in natural images, ineffective in the medical context. As a result, they developed a novel pretext task for ophthalmic disease diagnosis called cross-modal self-supervised retinal thickness prediction that employs two different modalities: optical coherence tomography (OCT) scans and infrared fundus images. Initially, retinal thickness maps are extracted from OCT scans by a segmentation model developed on a small annotated dataset; these maps then serve as ground-truth labels for the actual pretext task. Following that, the extracted thickness maps are predicted from infrared fundus images with a U-Net-like architecture (ronneberger2015u). The learning of disease-related features was validated by three experienced ophthalmologists. Further, the quality of their task was assessed on diabetic retinopathy grading from color fundus images as a down-stream task.

prakash2020leveraging adopted an image denoising approach as a pretext task for nuclei image segmentation. A special denoising architecture called Noise2Void (krull2019noise2void) was employed as a self-supervised pretraining method. Four scenarios were evaluated for segmenting nuclei images: random initialization with noisy images, random initialization with denoised images, fine-tuning with noisy images and fine-tuning with denoised images. The results showed the superiority of self-supervised denoising over random initialization.

hu2020self adopted the context encoder framework (pathak2016context) along with DICOM meta-data as a weak supervision method to learn robust representations from ultrasound imaging. On top of the context encoder, the authors introduced an additional projection discriminator network (miyato2018cgans; luvcic2019high) that produces a feature vector of the in-painted image, which is fed into a classification head and a projection head. The classification head classifies the context encoder output as real or fake, while the projection head acts as a conditional classifier that incorporates the DICOM meta-data as weak labels. For the DICOM meta-data, two tags were employed, the probe type and the study description, as they directly relate to the ultrasound semantic context.

Another extension of the Rubik cube pretext tasks is Rubik cube++, proposed by tao2020revisiting, which introduced two substantial changes to the original Rubik cube problem. First, they introduced the concept of volume-wise transformation, which binds the rotation of a sub-cube to its neighboring sub-cubes as in playing a real Rubik cube, in contrast to zhuang2019self where the sub-cubes are rotated individually. Second, rather than treating the Rubik cube as a classification problem, they treated it as a generative problem using a GAN-based architecture where the role of the generator is to restore the original arrangement of the Rubik cube, while the role of the discriminator is to discriminate between correct and wrong arrangements of the generated cubes. As a down-stream evaluation, Rubik cube++ was tested on two segmentation tasks: pancreas segmentation and brain tissue segmentation.

Table 4 summarizes the generative self-supervised learning methods in medical imaging.

| Authors | Pretext task | Down-stream task |
| --- | --- | --- |
| ross2018exploiting | Image colorization | Surgical instruments segmentation |
| chen2019self | Context restoration | Fetal image classification; Abdominal multi-organ localization; Brain tumour segmentation |
| zhou2019models | Models Genesis | Lung nodule segmentation; FPR for nodule detection; FPR for pulmonary embolism; Liver segmentation; Pulmonary diseases classification; RoI, bulb, and background classification; Brain tumor segmentation |
| matzkin2020self | Skull reconstruction | Bone flap volume estimation |
| hervella2020learning | Multi-modal reconstruction | Fovea localization; Optic disc localization; Vasculature segmentation; Optic disc segmentation |
| holmberg2020self | Cross modal retinal thickness prediction | Diabetic retinopathy grading |
| prakash2020leveraging | Image denoising | Nuclei images segmentation |
| hu2020self | Context encoder | Quality score classification; Thyroid nodule segmentation; Liver and kidney segmentation |
| tao2020revisiting | Rubik cube++ | Pancreas segmentation; Brain tissue segmentation |

Table 4: Summary of generative self-supervised learning methods in medical imaging.

4.3 Contrastive learning in medical imaging

jamaludin2017self harnessed longitudinal spinal MRI scans as a self-supervised contrastive learning task, based on the assumption that time-separated scans from the same patient share similar representations. They trained a Siamese convolutional model that takes two vertebral body (VB) MRI scans separated by a period of time and contrasts them according to whether or not the two images belong to the same patient, using the contrastive loss function of chopra2005learning. Along with the contrastive loss, they employed a categorical cross-entropy loss to classify the VB scans into seven classes (T1-S1). For the down-stream task, they tested the pretrained model on disc degeneration grading, where it showed superior performance in comparison to random initialization.
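To make the pairwise objective concrete, the following is a minimal sketch of a margin-based contrastive loss of the kind used with Siamese networks. It assumes the widely used margin formulation, which may differ in detail from the exact loss in chopra2005learning; the function names, the `margin` value and the toy embeddings are illustrative assumptions and are not taken from jamaludin2017self.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(emb_a, emb_b, same_patient, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Pairs from the same patient are pulled together (loss grows with
    distance); pairs from different patients are pushed apart until they
    are at least `margin` away from each other.
    """
    d = euclidean(emb_a, emb_b)
    if same_patient:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy usage with hypothetical 2D embeddings of two scans.
loss_same = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], same_patient=True)
loss_diff = pairwise_contrastive_loss([0.0, 0.0], [0.5, 0.0], same_patient=False)
```

In this sketch a different-patient pair that is already farther apart than the margin contributes no loss, so training effort concentrates on pairs that are still confusable.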

lu2020@semi adopted contrastive predictive coding (CPC) (oord2018representation) along with multiple instance learning (MIL) (ilse2018attention) for the classification of breast cancer histology images. In the first stage, CPC is employed to learn rich representations from breast cancer histopathological images, so that the MIL network does not have to learn features from scratch. The results showed superior performance to both learning from scratch and using a model pretrained on ImageNet (deng2009imagenet).

Contrastive predictive coding (oord2018representation) was originally designed for 2D data. zhu2020embedding extended it to handle 3D data by developing a new method called Task-related CPC (TCPC). Initially, supervoxels are generated from the input volume using the simple linear iterative clustering (SLIC) method (achanta2012slic) to detect potential lesion areas. The sub-volumes that surround the generated supervoxels are then cropped to act as the input of the TCPC encoder network. To better characterize the lesion, a U-shaped path around the center of the generated supervoxel is employed, whereas the original CPC employs a straight path. Further, a recurrent neural network acts as the auto-regressor that generates the future predictions, while the whole architecture is optimized using the InfoNCE loss (oord2018representation). Brain hemorrhage classification and lung nodule classification were utilized as down-stream tasks.
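The InfoNCE objective mentioned above can be sketched as a softmax cross-entropy over one positive candidate and several negatives. The sketch below assumes that the similarity scores between the context vector and each candidate have already been computed (CPC uses a log-bilinear score for this); the function name `info_nce` and the example scores are hypothetical.

```python
import math


def info_nce(scores, positive_index=0):
    """InfoNCE loss for a single prediction.

    `scores` holds one similarity score (a logit) per candidate; exactly
    one candidate, at `positive_index`, is the true future sample and the
    rest are negatives. The loss is the negative log-probability that a
    softmax over the scores assigns to the positive.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[positive_index]


# With uninformative scores the loss equals log(number of candidates).
uninformed = info_nce([0.0, 0.0, 0.0, 0.0])
# A higher score for the positive lowers the loss.
informed = info_nce([5.0, 0.0, 0.0, 0.0])
```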

Self-supervised approaches in general, and contrastive approaches in particular, tend to consider the global consistency of the input data while ignoring the local consistency. xie2020pgl introduced the Prior-Guided Local (PGL) algorithm for 3D medical image segmentation, which extends the BYOL method (grill2020bootstrap) to consider the local consistency between different views of the same region. To achieve this, an additional block called the prior-guided aligner is added on top of the projection head of both the online and target networks of the original BYOL architecture. The role of the prior-guided aligner is to exploit the augmentation information applied to the input image as a prior that guides the alignment of features extracted from different views of the same region. Finally, a local consistency loss function is employed to minimize the difference between the aligned local features. Four down-stream segmentation tasks were employed for evaluation: liver tumor, kidney tumor, spleen and abdominal organs.

li2020self-sup proposed a patient feature-based softmax embedding loss function to learn modality-invariant and transformation-invariant features, as well as patient similarity features, from ophthalmic data in a contrastive setting. Modality invariance is achieved by pairing each color fundus photo with a synthesized fundus fluorescein angiography image of the same photo, while transformation invariance is obtained by applying ordinary augmentation techniques to the color fundus photo. Such a triplet of images is assumed to share similar features for the same patient. Consequently, to learn patient similarity features, the triplet of each patient is used as the contrasting basis: features of the same patient are pulled together while features from other patients are pushed apart using the proposed loss function.

sowrirajan2020moco adopted the MoCo approach (he2020momentum) to build self-supervised pretrained models for chest X-ray classification. They initialized the self-supervised training with weights pretrained on ImageNet (deng2009imagenet) in a supervised fashion to speed up convergence. Further, they argued that not all augmentation strategies used in the original MoCo paper are suitable for gray-scale images, and restricted the augmentations to random partial rotation and horizontal flipping. In addition, they evaluated their models on an external chest X-ray dataset to examine generalizability to tasks from the same domain, which showed that the self-supervised learned knowledge can be transferred to other related tasks.
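For readers unfamiliar with MoCo, its two characteristic mechanisms are a key encoder that is updated as a moving average of the query encoder and a fixed-size queue of past key embeddings that serve as negatives. The following is a simplified sketch of both; representing the encoder parameters as a flat list of floats, the class name `NegativeQueue` and the chosen values are assumptions made for illustration only.

```python
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query encoder.

    With `m` close to 1 the key encoder changes only slightly per step,
    which keeps the embeddings stored in the queue consistent over time.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


class NegativeQueue:
    """Fixed-size first-in-first-out store of key embeddings."""

    def __init__(self, max_size):
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # A full deque silently drops its oldest entries when extended.
        self._keys.extend(batch_keys)

    def negatives(self):
        return list(self._keys)


# Toy usage: one update step and a queue that holds three keys.
key_params = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
queue = NegativeQueue(max_size=3)
queue.enqueue(["key_1", "key_2"])
queue.enqueue(["key_3", "key_4"])  # "key_1" is evicted here
```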

vu2021medaug proposed MedAug, an augmentation strategy that benefits from patient meta-data when training the MoCo framework (he2020momentum), as an extension of the earlier work by sowrirajan2020moco. More specifically, MedAug requires that the different views come from the same patient for a certain pathology, as such images are expected to have similar representations. In addition, MedAug considers study number and laterality as two additional conditions derived from the patient meta-data. For the same patient, study number distinguishes images taken in different sessions, while laterality denotes the orientation as frontal or lateral. This way, MedAug brings medical knowledge into the learning algorithm rather than depending merely on the transformations obtained by ordinary augmentation techniques to generate positive views. MedAug was tested on pleural effusion classification from chest X-rays as a down-stream task.
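The idea of selecting positive pairs from meta-data can be sketched as a simple filter over image records. The record fields (`patient`, `study`, `laterality`, `image`) and the two boolean switches are hypothetical; which combination of study and laterality constraints works best is an empirical question that this sketch does not settle.

```python
from itertools import combinations


def metadata_positive_pairs(records, same_laterality=True, different_study=True):
    """Return image pairs of the same patient that satisfy the constraints.

    Each record is a dict with the keys 'patient', 'study', 'laterality'
    and 'image'. Pairs are only ever formed within a single patient.
    """
    by_patient = {}
    for record in records:
        by_patient.setdefault(record["patient"], []).append(record)

    pairs = []
    for images in by_patient.values():
        for a, b in combinations(images, 2):
            if same_laterality and a["laterality"] != b["laterality"]:
                continue
            if different_study and a["study"] == b["study"]:
                continue
            pairs.append((a["image"], b["image"]))
    return pairs


# Toy usage with made-up records.
records = [
    {"patient": "p1", "study": 1, "laterality": "frontal", "image": "a"},
    {"patient": "p1", "study": 2, "laterality": "frontal", "image": "b"},
    {"patient": "p1", "study": 2, "laterality": "lateral", "image": "c"},
    {"patient": "p2", "study": 1, "laterality": "frontal", "image": "d"},
]
strict_pairs = metadata_positive_pairs(records)
```

Patients with a single image yield no pairs in this sketch; a practical pipeline would presumably fall back to ordinary augmentation for them.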

sriram2021covid adopted MoCo (he2020momentum) without modification for COVID patient deterioration prediction. They used non-COVID chest X-ray images from different public datasets to train MoCo for the subsequent tasks. As down-stream tasks, the authors defined three prediction tasks that indicate COVID patient deterioration: single image prediction, oxygen requirement prediction and multiple image prediction. The first two tasks are ordinary classification problems on a single image, while the third requires multiple time-indexed radiographs. A continuous positional embedding module was employed to obtain a representation from a set of time-indexed radiographs.

In a similar work, chen2021momentum adopted MoCo as a pretraining method on chest CT scans for COVID diagnosis, using a few-shot prototypical network (snell2017prototypical) as the down-stream task. Similarly, public non-COVID chest CT data was utilized for MoCo training, and two public COVID datasets were utilized for evaluation.

chaitanya2020contrastive provided two significant improvements to the SimCLR (chen2020simple) contrastive learning approach for 3D image segmentation by exploiting domain-specific and problem-specific knowledge simultaneously. Regarding domain-specific knowledge, the original contrastive loss, NT-Xent, maximizes the similarity between a pair of transformed versions of the input image, obtained by augmentation alone, to learn a global representation. However, 3D medical images consist of a sequence of 2D images that depict similar anatomical regions, so such sequences can be exploited as positive pairs to learn a global representation. Regarding problem-specific knowledge, segmentation is a pixel-wise prediction problem that requires a local representation. The authors therefore introduced a local contrastive loss function that helps learn a local representation based on the similarity between local regions within the input volume. It is worth noting that the proposed approach employs an encoder-decoder architecture where the encoder is optimized with the global loss while the decoder is optimized with the local loss. Cardiac segmentation and prostate segmentation were employed as down-stream tasks.
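As a reference point for the global loss, the NT-Xent objective of SimCLR can be sketched as follows. The sketch assumes a batch of 2N non-zero embeddings in which entries i and i + N form a positive pair; how those pairs are chosen (augmentation alone, or domain knowledge about volumes as described above) is outside the sketch, and the function names and temperature value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def nt_xent(embeddings, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over a batch.

    `embeddings` has length 2N; the positive of index i is (i + N) % 2N
    and every other embedding except i itself acts as a negative.
    """
    total = len(embeddings)
    half = total // 2
    loss = 0.0
    for i in range(total):
        positive = (i + half) % total
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(total) if k != i]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - cosine(embeddings[i], embeddings[positive]) / temperature
    return loss / total


# Toy batch: positives coincide, negatives are orthogonal.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
# Same vectors, but paired with the wrong partner.
misaligned = nt_xent([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
```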

azizi2021big adopted a self-supervised contrastive learning approach in the medical context that learns features from unlabelled natural images and unlabelled medical images in a sequential fashion. Specifically, they adopted SimCLR (chen2020simple) and introduced a novel contrastive learning method called Multi-Instance Contrastive Learning (MICLe), which is built on the same logic as SimCLR with a minor modification. The main idea behind MICLe is to leverage the availability of multiple views of a certain pathology from the same patient as the foundation for contrastive learning. Such correlated views of the same patient are treated as positive pairs, rather than generating multiple views from the same image as in SimCLR. In their experiments, the authors tested SimCLR on a chest X-ray dataset with fourteen classes, while MICLe was tested on a dermatology dataset with twenty-seven classes as a down-stream task.

Table 5 summarizes the contrastive self-supervised learning methods in medical imaging.

| Authors | Pretext task | Down-stream task |
| --- | --- | --- |
| jamaludin2017self | Longitudinal spinal MRI | Disc degeneration grading |
| lu2020@semi | CPC | Breast cancer classification |
| zhu2020embedding | CPC | Brain hemorrhage classification; Lung nodule classification |
| xie2020pgl | BYOL | Liver segmentation; Spleen segmentation; Kidney tumour segmentation; Abdominal organs segmentation |
| li2020self-sup | Feature-based softmax embedding | PM classification; AMD classification; Diabetic retinopathy detection |
| sowrirajan2020moco | MoCo | Tuberculosis detection; Pleural effusion classification |
| vu2021medaug | MoCo | Pleural effusion classification |
| sriram2021covid | MoCo | COVID patient prognosis |
| chen2021momentum | MoCo | COVID few-shot classification |
| chaitanya2020contrastive | SimCLR | Cardiac segmentation; Prostate segmentation |
| azizi2021big | SimCLR | Chest X-ray classification; Skin lesions classification |

Table 5: Summary of contrastive self-supervised learning methods in medical imaging.

4.4 Multiple-tasks/Multi-tasking in medical imaging

tajbakhsh2019surrogate experimented with three pretext tasks, namely rotation prediction (gidaris2018unsupervised), image colorization (larsson2017colorization) and 3D patch reconstruction (arjovsky2017wasserstein), on four different medical image analysis tasks: false positive reduction (FPR) for nodule detection in chest CT scans, diabetic retinopathy severity classification, lung lobe segmentation and skin segmentation. Due to the substantial differences among the imaging modalities, each down-stream task was assigned a specific pretext task: image rotation was employed for both lung lobe segmentation and diabetic retinopathy classification, colorization for skin segmentation, and 3D patch reconstruction for nodule detection.
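Rotation prediction, the simplest of these pretext tasks, derives its labels directly from the transformation applied to the image. The following sketch assumes rotations in multiples of 90 degrees on a 2D image stored as a list of rows; the helper names are hypothetical, and plain lists are used only to keep the example dependency-free.

```python
import random


def rot90(image):
    """Rotate a 2D image (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_sample(image, rng):
    """Return a rotated copy of `image` and the rotation class in {0, 1, 2, 3}.

    The class index is the number of 90-degree turns and serves as the
    free label that the network is trained to predict.
    """
    turns = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(turns):
        rotated = rot90(rotated)
    return rotated, turns


# Toy usage on a 2x2 "image" with a seeded generator.
image = [[1, 2], [3, 4]]
rotated, label = make_rotation_sample(image, random.Random(0))
```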

jiao2020self proposed temporal order correction and spatio-temporal transformation prediction as pretext tasks to learn good representations from fetal ultrasound videos. In the first task, the order of the ultrasound video frames is shuffled and the network has to predict the correct order of the shuffled video. In the second task, certain affine transformations are applied to the input video and the network has to predict the applied transformation. To train both tasks jointly, the authors proposed two strategies. The first is a Siamese network with partial weight sharing that learns the two tasks simultaneously, with one branch for each task. The second, called objective disentanglement, applies both pretext tasks to the same input video and trains the network to recognize both of them.
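The data preparation for a temporal order task can be sketched in a few lines: a short clip is shuffled and the permutation that was applied becomes the prediction target. The clip length, the function names and the use of the permutation itself as the label are assumptions for illustration and may differ from the formulation in jiao2020self.

```python
import random


def shuffle_clip(frames, rng):
    """Shuffle a clip and return it with the source index of each frame.

    `order[pos]` is the original index of the frame now at position
    `pos`, so `order` is the target a network would have to recover.
    """
    order = list(range(len(frames)))
    rng.shuffle(order)
    return [frames[i] for i in order], order


def restore_order(shuffled, order):
    """Invert `shuffle_clip` given the (predicted) order."""
    restored = [None] * len(shuffled)
    for position, source_index in enumerate(order):
        restored[source_index] = shuffled[position]
    return restored


# Toy usage: four "frames" identified by name.
clip = ["f0", "f1", "f2", "f3"]
shuffled, order = shuffle_clip(clip, random.Random(0))
```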

li2020multi combined two colorization-based pretext tasks into a single multi-tasking framework called ColorMe to learn useful representations from scopy images. In a similar way to the original colorization task (zhang2016colorful), the authors proposed to predict the red and blue channels from the green channel of an RGB scopy image to obtain local features. In addition, the authors proposed estimating the color distribution of the red and blue channels to force the learning of global features. Both tasks are trained jointly and evaluated on two down-stream tasks, namely cervix type classification and skin lesion segmentation.
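The inputs and targets of the two colorization tasks can be derived from a single RGB image as sketched below. The sketch assumes 8-bit pixels stored as (red, green, blue) tuples and models the color distribution as a normalized histogram; the number of bins and the function name are hypothetical choices, not details from li2020multi.

```python
def colorization_targets(rgb_image, bins=4):
    """Split an RGB image into the network input and two kinds of targets.

    Returns the green channel (input), the per-pixel red and blue values
    (local target) and normalized red and blue histograms (global target).
    """
    green = [[pixel[1] for pixel in row] for row in rgb_image]
    red_blue = [[(pixel[0], pixel[2]) for pixel in row] for row in rgb_image]

    def histogram(values):
        counts = [0] * bins
        for value in values:
            counts[value * bins // 256] += 1  # map 0..255 to a bin index
        return [count / len(values) for count in counts]

    pixels = [pixel for row in rgb_image for pixel in row]
    red_hist = histogram([pixel[0] for pixel in pixels])
    blue_hist = histogram([pixel[2] for pixel in pixels])
    return green, red_blue, (red_hist, blue_hist)


# Toy usage on a 1x2 image.
green, red_blue, (red_hist, blue_hist) = colorization_targets(
    [[(255, 10, 0), (0, 20, 0)]]
)
```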

taleb20203d suggested that richer representations can be learned from medical images by exploiting their 3D nature instead of treating them as 2D images. For this reason, they adapted five existing pretext tasks, namely CPC (oord2018representation), exemplar CNN (dosovitskiy2015discriminative), rotation prediction (gidaris2018unsupervised), relative position prediction (doersch2015unsupervised) and Jigsaw puzzle (noroozi2016unsupervised), to 3D medical images. Their methods were tested on two 3D down-stream tasks: brain tumor segmentation and pancreas tumor segmentation.

luo2020retinal proposed a self-supervised fuzzy clustering network as a pretext task for color fundus photo classification. In the first stage, an auto-encoder architecture learns initial features from the input data. In the second stage, a clustering module guides the self-supervision process: after the initial representation is obtained, the Fuzzy C-means algorithm (bezdek1984fcm) is applied on top of the encoder to cluster similar inputs into a predefined number of clusters and update the encoder weights accordingly. Finally, the weights learned after the clustering phase are transferred to the down-stream task.
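The two alternating updates of standard Fuzzy C-means are shown below as a minimal sketch. It treats the points as already-extracted feature vectors (in the approach above these would come from the encoder) and does not model how the clustering result is used to update the encoder weights; the fuzzifier value `m = 2` and the small `eps` guard are conventional assumptions.

```python
import math


def fcm_memberships(points, centers, m=2.0, eps=1e-12):
    """Soft assignment of every point to every cluster center.

    Memberships of one point are inversely related to its distances to
    the centers and always sum to one.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for point in points:
        dists = [math.dist(point, center) + eps for center in centers]
        memberships.append(
            [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
        )
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Recompute each center as a membership-weighted mean of the points."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([
            sum(w * point[d] for w, point in zip(weights, points)) / total
            for d in range(len(points[0]))
        ])
    return centers


# Toy usage: three 1D feature vectors and two centers.
points = [[0.0], [1.0], [0.5]]
memberships = fcm_memberships(points, [[0.0], [1.0]])
new_centers = fcm_centers(points, memberships)
```

Iterating the two functions until the centers stop moving gives the usual Fuzzy C-means procedure.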

haghighi2020learning introduced Semantic Genesis as an extension of the Models Genesis framework (zhou2019models). Besides feature learning by restoration, the authors introduced two additional components called self-discovery and self-classification. Self-discovery is the first stage of the Semantic Genesis framework, in which an auto-encoder is trained to reconstruct the input images. This step helps discover a set of semantically similar patients who share similar anatomical patterns by comparing their encoding vectors. A random number of patches with fixed coordinates are then cropped from those patients and assigned numerical labels that denote their positions. In the self-classification stage, a classification head on top of the framework's encoder classifies the extracted patches according to their assigned labels. In addition, the self-restoration stage follows the same intuition as Models Genesis but is applied to the extracted patches rather than the whole image. This way, Semantic Genesis learns semantically rich representations from similar anatomical patterns. Seven down-stream classification and segmentation tasks were utilized for evaluation.

zhang2020universal introduced a scale-aware restoration pretext task for 3D medical image segmentation as an extension of the Models Genesis framework (zhou2019models). In addition to transformation restoration as in Models Genesis, the authors introduced scale discrimination, based on the fact that objects of interest, e.g. tumors, appear with different sizes across patients. Hence, cubes of predefined small, medium and large sizes are cropped, resized to a unified size and labeled according to their original cropping size. A classification head is placed on top of the encoder part of the encoder-decoder architecture to perform the scale classification task, while the whole architecture is responsible for the transformation restoration task. Brain tumor segmentation and pancreas organ and tumor segmentation were used as down-stream tasks.
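The sample generation for the scale classification branch can be sketched as follows. The cube sizes, the output size and the nearest-neighbour resizing are placeholder assumptions chosen to keep the example small; the volume is a nested list indexed as `volume[z][y][x]`, and all names are hypothetical.

```python
SCALE_NAMES = ("small", "medium", "large")
SCALE_SIZES = {"small": 2, "medium": 4, "large": 8}  # placeholder edge lengths


def crop_cube(volume, origin, size):
    """Crop a cube with edge length `size` starting at `origin` = (z, y, x)."""
    z0, y0, x0 = origin
    return [[row[x0:x0 + size] for row in plane[y0:y0 + size]]
            for plane in volume[z0:z0 + size]]


def resize_nearest(cube, out_size):
    """Resize a cube to `out_size` per axis by nearest-neighbour sampling."""
    n = len(cube)
    index = [i * n // out_size for i in range(out_size)]
    return [[[cube[z][y][x] for x in index] for y in index] for z in index]


def scale_sample(volume, origin, scale_name, out_size=4):
    """Return a unified-size cube and the class index of its original scale."""
    cube = crop_cube(volume, origin, SCALE_SIZES[scale_name])
    return resize_nearest(cube, out_size), SCALE_NAMES.index(scale_name)


# Toy usage on an 8x8x8 volume whose voxel value encodes its position.
volume = [[[z * 64 + y * 8 + x for x in range(8)] for y in range(8)]
          for z in range(8)]
large_cube, large_label = scale_sample(volume, (0, 0, 0), "large")
small_cube, small_label = scale_sample(volume, (0, 0, 0), "small")
```

After resizing, both cubes have the same shape, so the network can only solve the classification by recognising how much anatomy each cube contains.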


dong2021self developed a multi-task self-supervised learning approach that combines generative modeling and instance discrimination using sequential medical data. Given a sequence of medical images of the same patient, e.g. CT slices, an auto-encoder architecture with a single encoder and two decoders learns representations by predicting the slices that lie (T) steps before and after the input slice, which in turn enables learning the anatomical structural similarity between different slices. In addition, an instance discrimination task is included to avoid learning trivial features through generative modelling. To achieve this, an additional encoder is introduced that takes another slice from the same patient and contrasts it with the input of the generative model by minimizing the negative cosine similarity between the representations of both inputs. It is worth noting that the second encoder shares its weights with the encoder of the generative model but is excluded from back-propagation.

koohbanani2021self proposed the Self-Path framework for histopathology images, which comprises three pathology-specific tasks in a multi-tasking setting: magnification prediction, magnification Jigsaw puzzle (JigMag) and Hematoxylin channel prediction. In the first task, patches are extracted at different predefined magnification levels and the network has to predict the magnification of the input image. In JigMag, the puzzles generated for training consist of patches of the same image at different magnification levels, and the network has to predict the correct order of the puzzle. In the last task, given a histopathological image stained with Hematoxylin and Eosin, the network has to predict the Hematoxylin channel from the stained image. Finally, all proposed tasks are trained jointly with the down-stream tasks in a multi-tasking fashion.

zhang2021twin developed a semi-supervised multi-tasking approach that combines rotation prediction (gidaris2018unsupervised), Jigsaw puzzle (noroozi2016unsupervised) and SimCLR (chen2020simple) in a unified framework called twin self-supervision based semi-supervised learning (TS-SSL) for spectral-domain optical coherence tomography (SD-OCT) classification. For the Jigsaw puzzle, the authors introduced patch rotation as in (li2020self), while for SimCLR they introduced a supervised category-wise contrastive loss that treats all samples with the same label as positive examples. The proposed approach is trained end-to-end in a semi-supervised multi-tasking setting to learn representations by performing rotation prediction, Jigsaw puzzle solving, contrastive learning and supervised contrastive learning, and is evaluated on multi-class and binary OCT classification tasks.

li2021rotation suggested that rotation-oriented collaborative feature learning would provide a potent representation for fundus disorders. Thus, they combined rotation prediction (gidaris2018unsupervised) with multi-view instance discrimination (wu2018unsupervised) to simultaneously learn rotation-related and rotation-invariant features from color fundus photographs in an end-to-end fashion. Their approach was tested on two ophthalmic diseases, namely pathological myopia (PM) and age-related macular degeneration (AMD), as binary classification down-stream tasks. Their results showed that the collaborative approach outperformed using a single pretext task at a time.

lu2021volumetric designed two domain-specific pretext tasks for white matter tract segmentation from diffusion MRI scans. The first task predicts the fiber streamline density map of the white matter in the human brain, which represents the number of streamlines that pass through each voxel. The second task imitates registration-based white matter tract segmentation by registering the input data to a predefined white matter tract atlas. Both tasks are employed sequentially rather than independently, as each focuses on part of the white matter properties and integrating them may therefore provide complementary information.

Table 6 summarizes the multiple-tasks/multi-tasking self-supervised learning methods in medical imaging.

| Authors | Pretext task | Down-stream task |
| --- | --- | --- |
| tajbakhsh2019surrogate | Rotation prediction; 3D patch reconstruction | Lung lobe segmentation; FPR for nodule detection; Skin lesions segmentation; Diabetic retinopathy grading |
| jiao2020self | Temporal order correction; Transformation prediction | Standard plane detection; Saliency prediction |
| li2020multi | ColorMe | Cervix type classification; Skin lesion segmentation |
| taleb20203d | Jigsaw puzzle; Exemplar CNN; Rotation prediction; Relative position prediction | Brain tumors segmentation; Pancreas tumor segmentation |
| luo2020retinal | Self-supervised fuzzy clustering | Color fundus classification; Diabetic retinopathy classification |
| haghighi2020learning | Semantic Genesis | Lung nodule segmentation; FPR for nodule detection; Liver segmentation; Chest diseases classification; Brain tumor segmentation; Pneumothorax segmentation |
| zhang2020universal | Scale-aware restoration | Brain tumor segmentation; Pancreas segmentation |
| dong2021self | Multi-task self-supervised learning | Whole heart segmentation |
| koohbanani2021self | Self-path | Histopathology image classification |
| zhang2021twin | Jigsaw puzzle | Binary OCT classification; Multi-class OCT classification |
| li2021rotation | Rotation prediction; Multi-view instance discrimination | PM classification; AMD classification |
| lu2021volumetric | Fiber streamlines density map prediction; Registration imitation | White matter tract segmentation |

Table 6: Summary of multiple-tasks/multitasking self-supervised learning methods in medical imaging.

5 Discussion

In the previous section, we covered the most recent self-supervised learning methods and applications in medical imaging analysis and categorized them into four categories according to their working mechanisms: predictive, generative, contrastive and multi-tasking. This section highlights the most prominent insights that can be derived from these works, which can be summarized as follows:

  • The literature on self-supervised learning in medical imaging shows great interest in contrastive learning approaches, either as a standalone approach or in multi-tasking settings. This trend can be explained by the strong performance of this category of self-supervised learning, which is comparable to that of supervised methods and in some cases surpasses it.

  • Another insight is the trend toward multi-tasking, especially when adapting pretext tasks from the computer vision field to medical imaging. This can be explained by the substantial differences between natural images and medical images with respect to the object of interest, texture, intra-class similarity, and global and local features. Thus, multiple tasks are required to account for such variations and to extract robust representations from medical images for down-stream tasks.

  • In addition, there is a common consensus that directly adopting pretext tasks developed on natural images is not sufficient to learn robust representations from medical images. Such methods therefore need to be modified and improved to suit the nature of medical images.

  • Most of the presented works that proposed novel pretext tasks are based on manipulating the input image or exploiting image properties, while fewer works incorporate medical knowledge into their approaches. This may be because incorporating medical knowledge such as patient meta-data, cross-modal images and disease-specific knowledge may narrow the applicability of a self-supervised learning approach to a certain task and limit its transferability to other tasks unless the core of the approach is modified. By contrast, using image properties and image manipulation as the basis for pretext design allows a wider range of applications across imaging modalities that share common attributes.

6 Future research direction

The following research directions are currently less explored and need attention from the research community:

  • Most of the presented works utilize either public or private datasets for the pretext task training phase. Public datasets are known to be small, except for some modalities such as X-ray (irvin2019chexpert; wang2017chestx), color fundus photography and optical coherence tomography (kermany2018large), which are available with a considerable number of images. Private datasets, on the other hand, are not available to the research community and are not easy to obtain. Hence, there is a pressing need for large unlabeled data pools covering different imaging modalities and medical conditions to be made available to the research community to accelerate the application of self-supervised learning in the field.

  • Another line of future research pertains to combining transfer learning from ImageNet with self-supervised learning. The common practice concentrates on comparing self-supervised learning against transfer learning (lu2020@semi; tajbakhsh2019surrogate), while fewer works discuss combining both approaches (azizi2021big; sowrirajan2020moco). Thus, additional research is needed to explore this direction.

  • Incorporating medical knowledge into pretext design is another research direction that has been explored by few works (hu2020self; lu2021volumetric; hervella2020multi; vu2021medaug). It needs to be explored further to benefit from such available knowledge in designing self-supervised learning approaches that provide better modelling for down-stream tasks.

  • Transferring knowledge learnt from one imaging modality and task to another modality and task, whether related to or different from the first, is a less explored research direction. It needs to be addressed and compared with, for example, transfer learning from natural images to judge the suitability of both approaches for medical images.


7 Conclusion

The limited availability of high-quality annotated medical imaging datasets is a major problem that researchers in the field of machine learning encounter and that hampers its advancement. Self-supervised learning can be considered an effective solution to this problem, as it relies entirely on learning robust representations from unlabeled data. In this paper, we covered a set of self-supervised learning methods from the computer vision field that were developed on natural images, as they are general-purpose methods that can be utilized in medical image analysis. Further, we covered a set of the most recent works on self-supervised learning in medical image analysis for different imaging modalities and medical conditions, concentrating on either the direct adoption of existing methods or the development of new methods, to expose the reader to recent innovations in the field. Self-supervised learning applications in medical image analysis are growing rapidly, as the approach provides an effective solution for annotated data scarcity in the medical field. To keep pace with this growth, we have developed a GitHub repository that serves as a resource for self-supervised learning applications in the field of medical imaging analysis and will be updated continuously.

Declaration of competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.


Appendix A

Table 7 lists the implementations of the previously discussed works, from both computer vision and medical image analysis, whose code is publicly available. Starred implementations represent the authors’ official code.

| Authors | Pretext task | Implementation |
| --- | --- | --- |
| dosovitskiy2015discriminative | Exemplar CNN | caffe |
| doersch2015unsupervised | Relative position prediction | caffe, pytorch |
| noroozi2016unsupervised | Jigsaw puzzle | pytorch |
| gidaris2018unsupervised | Rotation prediction | pytorch |
| vincent2008extracting | Denoising auto-encoder | theano, pytorch |
| pathak2016context | Image inpainting | caffe, tensorflow |
| zhang2016colorful | Image colorization | pytorch |
| zhang2017split | Split-brain auto-encoder | caffe, tensorflow |
| radford2015unsupervised | Deep Convolutional GAN | pytorch, tensorflow |
| donahue2016adversarial | Bi-directional GAN | theano, tensorflow |
| oord2018representation | CPC | pytorch, tensorflow |
| he2020momentum | MoCo | pytorch, tensorflow |
| chen2020simple | SimCLR | tensorflow, pytorch |
| grill2020bootstrap | BYOL | tensorflow, pytorch |
| caron2020unsupervised | SwAV | pytorch, tensorflow |
| taleb20203d | Jigsaw puzzle; Exemplar CNN; Rotation prediction; Relative position prediction | |
| li2021rotation | Rotation prediction; Multi-view instance discrimination | |
| sriram2021covid | MoCo | pytorch |
| sowrirajan2020moco | MoCo | pytorch |
| xie2020pgl | BYOL | pytorch |
| chaitanya2020contrastive | SimCLR | tensorflow |
| zhou2019models | Models Genesis | tensorflow |
| haghighi2020learning | Semantic Genesis | tensorflow |
| li2020self-sup | Feature-based softmax embedding | pytorch |
| holmberg2020self | Cross modal retinal thickness prediction | tensorflow |
| matzkin2020self | Skull reconstruction | pytorch |
| zhang2021twin | TS-SSL | tensorflow |
| prakash2020leveraging | Image denoising | tensorflow |

Table 7: Implementation code list