3D Self-Supervised Methods for Medical Imaging

06/06/2020 ∙ by Aiham Taleb, et al.

Self-supervised learning methods have witnessed a recent surge of interest after proving successful in multiple application fields. In this work, we leverage these techniques, and we propose 3D versions of five different self-supervised methods, in the form of proxy tasks. Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation. The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks. Our experiments show that pretraining models with our 3D tasks yields more powerful semantic representations, and enables solving downstream tasks more accurately and efficiently, compared to training the models from scratch and to pretraining them on 2D slices. We demonstrate the effectiveness of our methods on three downstream tasks from the medical imaging domain: i) Brain Tumor Segmentation from 3D MRI, ii) Pancreas Tumor Segmentation from 3D CT, and iii) Diabetic Retinopathy Detection from 2D Fundus images. In each task, we assess the gains in data-efficiency, performance, and speed of convergence. We achieve results competitive with state-of-the-art solutions at a fraction of the computational expense. We also publish the implementations for the 3D and 2D versions of our algorithms as an open-source library, in an effort to allow other researchers to apply and extend our methods on their datasets.




1 Introduction

Due to technological advancements in 3D sensing, the need for machine learning-based algorithms that perform analysis tasks on 3D imaging data has grown rapidly in the past few years Griffiths and Boehm (2019); Ioannidou et al. (2017); Su et al. (2017). 3D imaging has numerous applications, such as in robotic navigation, CAD imaging, geology, and medical imaging. While we focus on medical imaging as a test-bed for our proposed 3D algorithms in this work, we ensure their applicability to other 3D domains. Medical imaging plays a vital role in patient healthcare, as it aids in disease prevention, early detection, diagnosis, and treatment. Yet efforts to utilize advancements in machine learning algorithms are often hampered by the sheer expense of the expert annotation required Grünberg et al. (2017). Generating expert annotations of 3D medical images at scale is non-trivial, expensive, and time-consuming. Another related challenge in medical imaging is the relatively small sample sizes, which becomes more obvious when studying a particular disease, for instance. Also, gaining access to large-scale datasets is often difficult due to privacy concerns. Hence, scarcity of data and annotations are some of the main constraints for machine learning applications in medical imaging.

Several efforts have attempted to address these challenges, as they are common to other application fields of deep learning. A widely used technique is transfer learning, which aims to reuse the features of already trained neural networks on different, but related, target tasks. A common example is adapting the features from networks trained on ImageNet, which can be reused for other visual tasks, e.g. semantic segmentation. To some extent, transfer learning has made it easier to solve tasks with a limited number of samples. However, as mentioned before, the medical domain is supervision-starved. Despite attempts to leverage ImageNet Deng et al. (2009) features in the medical context Wang et al. (2017); Rajpurkar et al. (2017); Sahlsten et al. (2019); Islam et al. (2018), the difference in the distributions of natural and medical images is significant, so generalizing across these domains is questionable and can suffer from dataset bias Torralba and Efros (2011). Recent analysis Raghu et al. (2019) has also found that such transfer learning offers limited performance gains, relative to the computational costs it incurs. Consequently, it is necessary to find better solutions for the aforementioned challenges.

A viable alternative is to employ self-supervised (unsupervised) methods, which have recently proved successful in multiple domains. In these approaches, the supervisory signals are derived from the data. In general, we withhold some part of the data, and train the network to predict it. This prediction task defines a proxy loss, which encourages the model to learn semantic representations about the concepts in the data. Subsequently, this facilitates data-efficient fine-tuning on supervised downstream tasks, significantly reducing the burden of manual annotation. Despite the surge of interest in the machine learning community in self-supervised methods, little work has been done to adopt these methods in the medical imaging domain. We believe that self-supervised learning is directly applicable in the medical context, and can offer cheaper solutions for the challenges faced by conventional supervised methods. Unlabelled medical images carry valuable information about organ structures, and self-supervision enables the models to derive notions about these structures with no additional annotation cost.

A particular aspect of most medical images, which has received little attention from previous self-supervised methods, is their 3D nature Eisenberg and Margulis (2011). The common paradigm is to cast 3D imaging tasks in 2D, by extracting slices along an arbitrary axis, e.g. the axial dimension. However, such tasks can substantially benefit from the full 3D spatial context, thus capturing rich anatomical information. We believe that relying on the 2D context to derive data representations from 3D images, in general, is a suboptimal solution, which compromises the performance on downstream tasks.
Our contributions. As a result, in this work, we propose five self-supervised tasks that utilize the full 3D spatial context, aiming to better adopt self-supervision in 3D imaging. The proposed tasks are: 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks. These algorithms are inspired by their successful 2D counterparts, and to the best of our knowledge, except for Jigsaw, none of these methods have actually been extended to the 3D context, let alone applied to the medical domain. Several computational and methodological challenges arise when designing self-supervised tasks in 3D, due to the increased data dimensionality, which we address in our methods to ensure their efficiency. We perform extensive experiments using four datasets, and we show that our 3D tasks result in rich data representations that improve data-efficiency and performance on three different downstream tasks. Finally, we publish the implementations of our 3D tasks, and also of their 2D versions, in order to allow other researchers to evaluate these methods on other imaging datasets.

2 Related work

In general, unsupervised representation learning can be formulated as learning an embedding space, in which data samples that are semantically similar are closer, and those that are different are far apart. The self-supervised family constructs such a representation space by creating a supervised proxy task from the data itself. Then, the embeddings that solve the proxy task will also be useful for other real-world downstream tasks. Several methods in this line of research have been developed recently, and they found applications in numerous fields Jing and Tian (2019). In this work, we focus on methods that operate on images only.

Self-supervised methods differ in their core building block, i.e. the proxy task used to learn representations from unlabelled input data. A commonly used supervision source for proxy tasks is the spatial context from images, which was first inspired by the skip-gram Word2Vec Mikolov et al. (2013) algorithm. This idea was generalized to images in Doersch et al. (2015), in which a visual representation is learned by predicting the position of an image patch relative to another. A similar work extended this patch-based approach to solve Jigsaw Puzzles Noroozi and Favaro (2016). Other works have used different supervision sources, such as image colors Zhang et al. (2016), clustering Caron et al. (2018), image rotation prediction Gidaris et al. (2018), object saliency Wang et al. (2019), and image reconstruction Pathak et al. (2016). In recent works, Contrastive Predictive Coding (CPC) approaches van den Oord et al. (2018); Hénaff et al. (2019) advanced the results of self-supervised methods on multiple imaging benchmarks Chen et al. (2020). These methods utilize the idea of contrastive learning in the latent space, similar to Noise Contrastive Estimation Gutmann and Hyvärinen (2010). In 2D images, the model has to predict the latent representation for next (adjacent) image patches. Our work follows the line of research in the above works; however, our methods utilize the full 3D context.

While videos are rich with more types of supervisory signals Wang and Gupta (2015b); Vondrick et al. (2015); Walker et al. (2015); Purushwalkam and Gupta (2016), we discuss here a subset of these works that utilize 3D-CNNs to process input videos. In this context, 3D-CNNs are employed to simultaneously extract spatial features from each frame, and temporal features across multiple frames, which are typically stacked along the 3rd (depth) dimension. The idea of exploiting 3D convolutions for videos was proposed in Ji et al. (2013) for human action recognition, and was later extended to other applications Jing and Tian (2019). In self-supervised learning, however, the number of pretext tasks that exploit this technique is still limited. Kim et al. Dahun et al. (2019) proposed a task that extracts cubic puzzles of , meaning that the 3rd dimension is not actually utilized in puzzle creation. Jing et al. Jing and Tian (2018) extended the rotation prediction task Gidaris et al. (2018) to videos, by simply stacking video frames along the depth dimension; however, this dimension is not employed in the design of their task, as only spatial rotations are considered. On the other hand, in our more general versions of 3D Jigsaw puzzles and 3D Rotation prediction, respectively, we exploit the depth (3rd) dimension in the design of our tasks. For instance, we solve larger 3D puzzles up to , and we also predict more rotations along all axes in the 3D space. In general, we believe the different nature of the data, 3D volumetric scans vs. stacked video frames, influences the design of proxy tasks, i.e. the depth dimension has an actual semantic meaning in volumetric scans. Hence, we consider the whole 3D context when designing all of our methods, aiming to learn valuable anatomical information from unlabeled 3D volumetric scans.

In the medical context, self-supervision has found use-cases in diverse applications such as depth estimation in monocular endoscopy Liu et al. (2018), robotic surgery Ye et al. (2017), medical image registration Li and Fan (2018), body part recognition Zhang et al. (2017), in disc degeneration using spinal MRIs Jamaludin et al. (2017), in cardiac image segmentation Bai et al. (2019), body part regression for slice ordering Yan et al. (2019), and medical instrument segmentation Roß et al. (2017). There are multiple other examples of self-supervised methods for medical imaging, such as Chen et al. (2019); Jiao et al. (2020); Taleb et al. (2019); Blendowski et al. (2019). While these attempts are a step forward for self-supervised learning in medical imaging, they have some limitations. First, as opposed to our work, many of these works make assumptions about input data, resulting in engineered solutions that hardly generalize to other target tasks. Second, none of the above works capture the complete spatial context available in 3-dimensional scans, i.e. they only operate on 2D/2.5D spatial context. In a more related work, Zhuang et al. Zhuang et al. (2019) developed a proxy task that solves small 3D jigsaw puzzles. Their proposed puzzles were only limited to of puzzle complexity. Our version of 3D Jigsaw puzzles is able to efficiently solve larger puzzles, e.g. , and outperforms their method’s results on the downstream task of Brain tumor segmentation. In this paper, we continue this line of work, and develop five different algorithms for 3D data, whose nature and performance can accommodate more types of target medical applications.

3 Self-Supervised Methods

In this section, we discuss the formulations of our 3D self-supervised pretext tasks, all of which learn data representations from unlabeled samples (3D images), hence requiring no manual annotation effort in the self-supervised pretraining stage. Each task results in a pretrained encoder model that can be fine-tuned in various downstream tasks, subsequently.

Figure 1: (a) 3D-CPC: each input image is split into 3D patches, and the latent representations of next patches (shown in green) are predicted using the context vector. The considered context is the current patch (shown in orange), plus the above patches that form an inverted pyramid (shown in blue).
(b) 3D-RPL: assuming a 3D grid of 27 patches (3×3×3), the model is trained to predict the location of the query patch (shown in red), relative to the central patch (whose location is 13). (c) 3D-Jig: by predicting the permutation applied to the 3D image when creating a puzzle, we are able to reconstruct the scrambled input. (d) 3D-Rot: the network is trained to predict the rotation degree (out of the 10 possible degrees) applied on input scans. (e) 3D-Exe: the network is trained with a triplet loss, which drives positive samples closer in the embedding space, and the negative samples farther apart.

3.1 3D Contrastive Predictive Coding (3D-CPC)

Following the contrastive learning idea, first proposed in Gutmann and Hyvärinen (2010), this universal unsupervised technique predicts the latent space for future (next or adjacent) samples. Recently, CPC found success in multiple application fields, e.g. its 1D version in audio signals van den Oord et al. (2018), and its 2D versions in images van den Oord et al. (2018); Hénaff et al. (2019), and was able to bridge the gap between unsupervised and fully-supervised methods Chen et al. (2020). Our proposed CPC version generalizes this technique to 3D inputs, and defines a proxy task by cropping equally-sized and overlapping 3D patches from each input scan. Then, the encoder model $g_{enc}$ maps each input patch $x_{i,j,k}$ to its latent representation $z_{i,j,k} = g_{enc}(x_{i,j,k})$. Next, another model called the context network $g_{cxt}$ is used to summarize the latent vectors of the patches in the context of $x_{i,j,k}$, and produce its context vector $c_{i,j,k} = g_{cxt}(\{z_{u,v,w}\}_{u \le i, v, w})$, where $\{z\}$ denotes a set of latent vectors. Finally, because $c_{i,j,k}$ captures the high level content of the context that corresponds to $x_{i,j,k}$, it allows for predicting the latent representations of next (adjacent) patches $z_{i+l,j,k}$, where $l \ge 0$. We write $\hat{z}_{i+l,j,k}$ for the prediction of $z_{i+l,j,k}$ obtained from $c_{i,j,k}$. This prediction task is cast as an $N$-way classification problem by utilizing the InfoNCE loss van den Oord et al. (2018), which takes its name from its ability to maximize the mutual information between $c_{i,j,k}$ and $z_{i+l,j,k}$. Here, the classes are the latent representations $\{z\}$ of the patches, among which is one positive representation, and the remaining $N-1$ are negative. Formally, the CPC loss can be written as follows:

$$\mathcal{L}_{CPC} = -\sum_{i,j,k,l} \log p\left(z_{i+l,j,k} \mid \hat{z}_{i+l,j,k}, \{z_n\}\right) = -\sum_{i,j,k,l} \log \frac{\exp\left(\hat{z}_{i+l,j,k}^{\top} z_{i+l,j,k}\right)}{\exp\left(\hat{z}_{i+l,j,k}^{\top} z_{i+l,j,k}\right) + \sum_{n} \exp\left(\hat{z}_{i+l,j,k}^{\top} z_n\right)}$$

This loss corresponds to the categorical cross-entropy loss, which trains the model to recognize the correct representation $z_{i+l,j,k}$ among the list of negative representations $\{z_n\}$. These negative samples (3D patches) are chosen randomly from other locations in the input image. In practice, similar to the original NCE Gutmann and Hyvärinen (2010), this task is solved as a binary pairwise classification task.
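
As a rough illustration of the loss described above, the following NumPy sketch computes an InfoNCE-style categorical cross-entropy for a batch of predicted latents. It is not the authors' implementation: the function name `info_nce_loss`, the dot-product similarity, and the array layout (one positive and M negatives per prediction) are assumptions made for this example.

```python
import numpy as np

def info_nce_loss(z_pred, z_pos, z_neg):
    """InfoNCE-style loss (illustrative sketch, hypothetical helper).

    z_pred: (B, D) latents predicted from the context vectors
    z_pos:  (B, D) true latents of the target (next) patches
    z_neg:  (B, M, D) latents of M negative patches per prediction
    """
    # Similarity of each prediction with its positive and its negatives.
    pos_logit = np.sum(z_pred * z_pos, axis=-1, keepdims=True)   # (B, 1)
    neg_logits = np.einsum("bd,bmd->bm", z_pred, z_neg)          # (B, M)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)     # (B, 1 + M)
    # Numerically stable log-softmax; the positive is always class 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

With uninformative predictions (all logits equal) the loss equals the logarithm of the number of classes, and it decreases as the prediction aligns with the positive latent more than with the negatives.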

It is noteworthy that the proposed 3D-CPC task, illustrated in Fig. 1 (a), allows employing any network architecture in the encoder and the context networks. In our experiments, we follow van den Oord et al. (2018) in using an autoregressive network with GRUs Cho et al. (2014) for the context network; however, masked convolutions can be a valid alternative van den Oord et al. (2016). In terms of what the 3D context of each patch includes, we follow the idea of an inverted pyramid neighborhood, which is inspired by Stollenga et al. (2015); Van Den Oord et al. (2016).

3.2 Relative 3D patch location (3D-RPL)

In this task, the spatial context in images is leveraged as a rich source of supervision, in order to learn semantic representations of the data. First proposed by Doersch et al. (2015) for 2D images, this task inspired several works in self-supervision. In our 3D version of this task, shown in Fig. 1 (b), we leverage the full 3D spatial context in the design of our task. From each input 3D image, a 3D grid of $N$ non-overlapping patches $\{x_i\}_{i \in \{1,..,N\}}$ is sampled at random locations. Then, the patch $x_c$ in the center of the grid is used as a reference, and a query patch $x_q$ is selected from the surrounding patches. Next, the location of $x_q$ relative to $x_c$ is used as the positive label $y_q$. This casts the task as an $(N-1)$-way classification problem, in which the locations of the remaining grid patches are used as the negative samples $\{y_n\}$. Formally, the cross-entropy loss in this task is written as:

$$\mathcal{L}_{RPL} = -\sum_{k=1}^{K} \log p\left(y_q \mid \hat{y}_q, \{y_n\}\right)$$

where $K$ is the number of queries extracted from all samples. In order to prevent the model from solving this task quickly by finding shortcut solutions, e.g. edge continuity, we follow Doersch et al. (2015) in leaving random gaps (jitter) between neighboring 3D patches. More details are given in the Appendix.
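
To make the sampling procedure more concrete, here is a minimal NumPy sketch that draws one (center, query, label) training example from a 3×3×3 grid with random jitter. The function name `sample_rpl_pair`, the patch size, and the jitter amount are hypothetical choices for illustration, not values from the paper; only the 27-patch grid and the central index 13 follow Fig. 1 (b).

```python
import numpy as np

def sample_rpl_pair(volume, patch_size=8, jitter=2, rng=None):
    """Sample (center_patch, query_patch, label) from a 3x3x3 patch grid.

    Illustrative sketch: patch_size and jitter are assumed values.
    """
    rng = np.random.default_rng() if rng is None else rng
    cell = patch_size + 2 * jitter        # each cell leaves room for jitter
    grid = 3 * cell
    assert all(s >= grid for s in volume.shape), "volume too small for grid"
    # Random placement of the whole grid inside the volume.
    origin = [int(rng.integers(0, s - grid + 1)) for s in volume.shape]

    def crop(index):
        # Flat grid index (0..26) -> 3D cell coordinates.
        coords = np.unravel_index(index, (3, 3, 3))
        # Random offset inside the cell leaves gaps between neighbors.
        offsets = rng.integers(0, 2 * jitter + 1, size=3)
        start = [o + int(c) * cell + int(d)
                 for o, c, d in zip(origin, coords, offsets)]
        return volume[tuple(slice(s, s + patch_size) for s in start)]

    center_index = 13                      # middle of the 27-cell grid
    candidates = [i for i in range(27) if i != center_index]
    query_index = int(rng.choice(candidates))
    label = candidates.index(query_index)  # one of 26 relative locations
    return crop(center_index), crop(query_index), label
```

A classifier would then receive the two patches and be trained with cross-entropy to predict `label`.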

3.3 3D Jigsaw puzzle Solving (3D-Jig)

Deriving a Jigsaw puzzle grid from an input image, be it in 2D or 3D, and solving it can be viewed as an extension to the above patch-based RPL task. In our 3D Jigsaw puzzle task, which is inspired by its 2D counterpart Noroozi and Favaro (2016) and illustrated in Fig. 1 (c), the puzzles are formed by sampling an $n \times n \times n$ grid of 3D patches. Then, these patches are shuffled according to an arbitrary permutation, selected from a set of predefined permutations. This set of permutations with size $P$ is chosen out of the $n^3!$ possible permutations, by following the Hamming distance based algorithm in Noroozi and Favaro (2016) (details in the Appendix), and each permutation is assigned an index $y_p \in \{1,..,P\}$. Therefore, the problem is now cast as a $P$-way classification task, i.e. the model is trained to simply recognize the applied permutation index $y_p$, allowing us to solve the 3D puzzles in an efficient manner. Formally, we minimize the cross-entropy loss $\mathcal{L}_{Jig}(y_p^k \mid \hat{y}_p^k)$, where $k \in \{1,..,K\}$ is an arbitrary 3D puzzle from the list of $K$ extracted puzzles. Similar to 3D-RPL, we use the trick of adding random jitter in 3D-Jig.
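
The permutation-set construction can be sketched as follows. This is a simplified greedy variant under stated assumptions: since enumerating all permutations of 27 patches is infeasible, the sketch draws a random candidate pool (the `n_candidates` parameter is an assumption of this example) and repeatedly adds the candidate with the largest total Hamming distance to the permutations already chosen. The helper names `select_permutations` and `make_puzzle` are hypothetical.

```python
import numpy as np

def select_permutations(n_patches, n_perms, n_candidates=1000, seed=0):
    """Greedily pick permutations that are far apart in Hamming distance.

    Sketch only: candidates are a random pool, not the full permutation set.
    """
    rng = np.random.default_rng(seed)
    pool = np.array([rng.permutation(n_patches) for _ in range(n_candidates)])
    pool = np.unique(pool, axis=0)                 # drop duplicate candidates
    chosen = [int(rng.integers(len(pool)))]
    total = np.zeros(len(pool))                    # summed distance to chosen
    while len(chosen) < n_perms:
        last = pool[chosen[-1]]
        total += (pool != last).sum(axis=1)        # Hamming distance to last pick
        total[chosen[-1]] = -np.inf                # never pick a candidate twice
        chosen.append(int(np.argmax(total)))
    return pool[chosen]

def make_puzzle(patches, permutations, rng):
    """Shuffle patches with a random predefined permutation; return its index."""
    label = int(rng.integers(len(permutations)))
    return patches[permutations[label]], label
```

The returned `label` is the classification target: the network sees the shuffled patches and predicts which predefined permutation was applied.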

3.4 3D Rotation prediction (3D-Rot)

Originally proposed by Gidaris et al. (2018), the rotation prediction task encourages the model to learn visual representations by simply predicting the angle by which the input image is rotated. The intuition behind this task is that for a model to successfully predict the angle of rotation, it needs to capture sufficient semantic information about the object in the input image. In our 3D Rotation prediction task, 3D input images are rotated by a random degree $r \in \{1,..,R\}$ out of the $R$ considered degrees. In this task, for simplicity, we consider the multiples of 90 degrees (0°, 90°, 180°, 270°) along each axis of the 3D coordinate system $(x, y, z)$. There are 4 possible rotations per axis, amounting to 12 possible rotations. However, rotating input scans by 0° along the 3 axes produces 3 identical versions of the original scan; hence, we consider 10 rotation degrees instead. Therefore, in this setting, this proxy task can be solved as a 10-way classification problem, in which the model is tasked to predict the rotation degree (class), as shown in Fig. 1 (d). Formally, we minimize the cross-entropy loss $\mathcal{L}_{Rot}(r^k \mid \hat{r}^k)$, where $k \in \{1,..,K\}$ is an arbitrary rotated 3D image from the list of $K$ rotated images. It is noteworthy that we create multiple rotated versions for each 3D image.
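
The count of 10 classes follows from 3 axes × 3 non-zero angles + 1 shared identity. A minimal NumPy sketch of this label space is given below; the ordering of the classes and the names `ROTATIONS` and `rotate_volume` are assumptions of this example, not taken from the paper's code.

```python
import numpy as np

# Class 0 is the unrotated scan; the other 9 classes are rotations by
# 90, 180, 270 degrees (k = 1, 2, 3) in each of the three axis planes.
# The class ordering here is an arbitrary choice for illustration.
ROTATIONS = [None] + [(plane, k)
                      for plane in [(1, 2), (0, 2), (0, 1)]
                      for k in (1, 2, 3)]

def rotate_volume(volume, label):
    """Return the 3D volume rotated according to rotation class `label`."""
    rotation = ROTATIONS[label]
    if rotation is None:
        return volume.copy()
    plane, k = rotation
    return np.rot90(volume, k=k, axes=plane)
```

Training pairs are then `(rotate_volume(scan, r), r)` for randomly drawn `r` in `range(10)`.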

3.5 3D Exemplar networks (3D-Exe)

The task of Exemplar networks, proposed by Dosovitskiy et al. (2014), is one of the earliest methods in the self-supervised family. To derive supervision labels, it relies on image augmentation techniques, i.e. transformations. Assuming a training set $X = \{x_1, ..., x_N\}$, and a set of $K$ image transformations $\mathcal{T} = \{T_1, ..., T_K\}$, a new surrogate class $S_{x_i}$ is created by transforming each training sample $x_i \in X$, where $S_{x_i} = \{T x_i \mid T \in \mathcal{T}\}$. Therefore, the task becomes a regular classification task with a cross-entropy loss. However, this classification task becomes prohibitively expensive as the dataset size grows larger, as the number of classes grows accordingly. Thus, in our proposed 3D version of Exemplar networks, shown in Fig. 1 (e), we employ a different mechanism that relies on the triplet loss instead Wang and Gupta (2015a). Formally, assume $x_i$ is a random training sample and $z_i$ is its corresponding embedding vector, $x_i^+$ is a transformed version of $x_i$ (seen as a positive example) with an embedding $z_i^+$, and $x_i^-$ is a different sample from the dataset (seen as negative) with an embedding $z_i^-$. The triplet loss is written as follows:

$$\mathcal{L}_{Exe} = \frac{1}{N_T} \sum_{i=1}^{N_T} \max\left\{0,\; D(z_i, z_i^+) - D(z_i, z_i^-) + \alpha\right\}$$

where $N_T$ is the number of triplets and $D(\cdot)$ is a pairwise distance function, for which we use the $L_2$ distance, following Schroff et al. (2015). $\alpha$ is a margin (gap) that is enforced between positive and negative pairs, which we set to . The triplet loss enforces $D(z_i, z_i^-) > D(z_i, z_i^+)$, i.e. the transformed versions of the same sample (positive samples) come closer to each other in the learned embedding space, and move farther away from other (negative) samples. It is noteworthy that we apply the following 3D transformations: random flipping along an arbitrary axis, random translation along an arbitrary axis, random brightness, and random contrast.
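
A triplet loss of this kind can be sketched in a few lines of NumPy. The Euclidean distance used here is an assumption in the spirit of Schroff et al. (2015), and the default `margin=1.0` is only a placeholder value for this example, not the setting used in the paper.

```python
import numpy as np

def triplet_loss(z, z_pos, z_neg, margin=1.0):
    """Mean triplet loss over a batch of embeddings, each of shape (B, D).

    Sketch only: Euclidean distance and margin=1.0 are assumed choices.
    """
    d_pos = np.linalg.norm(z - z_pos, axis=1)   # anchor to transformed copy
    d_neg = np.linalg.norm(z - z_neg, axis=1)   # anchor to a different sample
    # Zero loss once the negative is farther than the positive by the margin.
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())
```

The loss vanishes for triplets that already satisfy the margin, so only violating triplets contribute gradients during training.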

4 Experimental Results

In this section, we present the evaluation results of our methods. We assess the quality of their learned representations by fine-tuning them on three downstream tasks. In each task, we analyze the obtained gains in data-efficiency, performance, and speed of convergence. In addition, each task aims to demonstrate a certain use-case for our methods. We follow the commonly used evaluation protocols for self-supervised methods in each of these tasks. The chosen tasks are:

  • Brain Tumor Segmentation from 3D MRI (Subsection 4.1): in which we study the possibility for transfer learning from a different unlabeled 3D corpus, following Goyal et al. (2019).

  • Pancreas Tumor Segmentation from 3D CT (Subsection 4.2): to demonstrate how to use the same unlabeled dataset, following the data-efficient evaluation protocol in Hénaff et al. (2019).

  • Diabetic Retinopathy Detection from 2D Fundus Images (Subsection 4.3): to showcase our implementations for the 2D versions of our methods, following Hénaff et al. (2019).

We provide additional architecture and training details in the Appendix.

4.1 Brain Tumor Segmentation Results

In this task, we evaluate our methods by fine-tuning the learned representations on the Multimodal Brain Tumor Segmentation (BraTS) 2018 Menze et al. (2015); Bakas et al. (2017) benchmark. Before that, we pretrain our models on brain MRI data from the UK Biobank Sudlow et al. (2015) (UKB) corpus, which contains roughly 3D scans. Due to this large number of unlabeled scans, UKB is suitable for unsupervised pretraining. The BraTS dataset contains annotated MRI scans for training and validation cases. We fine-tune on BraTS’ training set, and evaluate on its validation set. Following the official BraTS challenge, we report Dice scores for the Whole Tumor (WT), Tumor Core (TC), and Enhanced Tumor (ET) tasks. The Dice score (F1-Score) is twice the area of overlap between two segmentation masks divided by the total number of pixels in both. In order to assess the quality of the learned representations by our 3D proxy tasks, we compare to the following baselines:

  • Training from scratch: the first sensible baseline for any self-supervised method, in general, is the same model trained on the downstream task when initialized from random weights. Comparing to this baseline provides insights about the benefits of self-supervised pretraining.

  • Training on 2D slices: this baseline aims to quantitatively show how our proposal to operate on the 3D context benefits the learned representations, compared to 2D methods.

  • Baselines from the BraTS challenge: we compare to the methods Isensee et al. (2018); Popli et al. (2018); Baid et al. (2018); Chandra et al. (2018), which all use a single model with an architecture similar to ours, i.e. 3D U-Net Ronneberger et al. (2015).
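
For reference, the Dice score used throughout this comparison can be computed for a pair of binary masks as in the sketch below. The function name `dice_score` is hypothetical, and returning 1.0 when both masks are empty is a convention assumed for this example.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary segmentation masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0          # assumed convention: two empty masks agree
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)
```

For multi-region evaluation such as WT, TC, and ET, the same function would be applied once per region mask.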

Discussion. We first assess the gains in data-efficiency in this task. To quantify these gains, we measure the segmentation performance at different sample sizes. We randomly select subsets of patients at 10%, 25%, 50%, and 100% of the full dataset size, and we fine-tune our models on these subsets. Here, we compare to the baselines listed above. As shown in Fig. 2, our 3D methods outperform the baseline model trained from scratch by a large margin when using few training samples, and behave similarly as the number of labeled samples increases. The low-data regime case at 5% suggests the potential for generic unsupervised features, and highlights the huge gains in data-efficiency. Also, the proposed 3D versions considerably outperform their 2D counterparts, which are trained on slices extracted from the 3D images. We also measure how our methods affect the final brain tumor segmentation performance, in Table 1. All our methods outperform the baseline trained from scratch as well as their 2D counterparts, confirming the benefits of pretraining with our 3D tasks on downstream performance. We also achieve comparable results to baselines from the BraTS challenge, and we outperform these baselines in some cases, e.g. our 3D-RPL method outperforms all baselines in terms of ET and TC dice scores. Also, our model pretrained with 3D-Exemplar, with fewer downstream training epochs, matches the result of Isensee et al. (2018) in terms of WT dice score, which is one of the top results on the BraTS 2018 challenge. Our results in this downstream task also demonstrate the generalization ability of our 3D tasks across different domains.

Figure 2: Data-efficient segmentation results in BraTS. With less labeled data, the supervised baseline (brown) fails to generalize, as opposed to our methods. Also, the proposed 3D methods outperform all 2D counterparts.
Table 1: BraTS segmentation results

Model                      ET      WT      TC
3D-From scratch            76.38   87.82   83.11
2D-CPC                     76.60   86.27   82.41
2D-RPL                     77.53   87.91   82.56
2D-Jigsaw                  76.12   86.28   83.26
2D-Rotation                76.60   88.78   82.41
2D-Exemplar                75.22   84.82   81.87
Popli et al. (2018)        74.39   89.41   82.48
Baid et al. (2018)         74.80   87.80   82.66
Chandra et al. (2018)      74.06   87.19   79.89
Isensee et al. (2018)      80.36   90.80   84.32
3D-CPC                     80.83   89.88   85.11
3D-RPL                     81.28   90.71   86.12
3D-Jigsaw                  79.66   89.20   82.52
3D-Rotation                80.21   89.63   84.75
3D-Exemplar                79.46   90.80   83.87

4.2 Pancreas Tumor Segmentation Results

In this downstream task, we evaluate our models on 3D CT scans of Pancreas tumor from the medical decathlon benchmarks Simpson et al. (2019). The Pancreas dataset contains annotated CT scans for cases. Each scan in this dataset contains three different classes: pancreas (class 1), tumor (class 2), and background (class 0). To measure the performance on this benchmark, two dice scores are computed for classes 1 and 2. In this task, we pretrain using our proposed 3D tasks on pancreas scans without their annotation masks. Then, we fine-tune the obtained models on subsets of annotated data to assess the gains in both data-efficiency and performance. Finally, we also compare to the baseline model trained from scratch and to 2D models, similar to the previous downstream task. Fig. 3 demonstrates the gains when fine-tuning our models on 5%, 10%, 50%, and 100% of the full data size. The results obtained by our 3D methods also outperform the baselines in this task by a margin when using only few training samples, e.g. the 5% and 10% cases. Another significant benefit offered by pretraining with our methods is the speed of convergence on downstream tasks. As demonstrated in Fig. 5, when training on the full pancreas dataset, within the first 20 epochs only, our models achieve much higher performance compared to the "from scratch" baseline. We should note that we evaluate this task on a held-out labeled subset of the Pancreas dataset that was used for neither pretraining nor fine-tuning. We provide the full list of experimental results for this task in the Appendix.

4.3 Diabetic Retinopathy Results

As part of our work, we also provide implementations for the 2D versions of the developed self-supervised methods. We showcase these implementations on the Diabetic Retinopathy 2019 Kaggle challenge. This dataset contains roughly Fundus 2D images, each of which was rated by a clinician on a severity scale of to . These levels define a classification task. In order to evaluate our tasks on this benchmark, we pretrain all the 2D versions of our methods on all fundus images, with no labels. Then, following the data-efficient evaluation protocol in Hénaff et al. (2019), we fine-tune the obtained models on subsets of labelled data to assess the gains in both data-efficiency, shown in Fig. 4, and speed of convergence, shown in Fig. 6. In this 2D task, we achieve results consistent with the other downstream tasks presented before. We should point out that we evaluate with 5-fold cross validation on this 2D dataset. The metric used in this task, as in the Kaggle challenge, is the Quadratic Weighted Kappa, which measures the agreement between two ratings. Its values vary from random (0) to complete (1) agreement, and if there is less agreement than chance it may become negative.
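
As an illustration of this metric, the following NumPy sketch computes the Quadratic Weighted Kappa from two integer rating vectors. The function name and the `n_classes` parameter are assumptions of this example; ratings are assumed to be integers in `range(n_classes)` with at least two distinct values present.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic Weighted Kappa between two integer rating vectors (sketch)."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    # Observed agreement matrix: counts of (rating_a, rating_b) pairs.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (a, b), 1)
    # Quadratic penalty grows with the squared distance between ratings.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under independent raters with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(1.0 - (weights * observed).sum() / (weights * expected).sum())
```

Identical ratings give 1, while systematically opposite ratings yield a negative value, matching the behavior described above.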

Figure 3: Data-efficient segmentation results in Pancreas. With less labeled data, the supervised baseline (brown) fails to generalize, as opposed to our methods. Also, the proposed 3D methods outperform all 2D counterparts.
Figure 4: Data-efficient classification in Diabetic Retinopathy. With fewer labels, the supervised baseline (brown) fails to generalize, as opposed to pretrained models. This result is consistent with the other downstream tasks.
Figure 5: Speed of convergence in Pancreas segmentation. Our models converge faster than the baseline (brown).
Figure 6: Speed of convergence in Retinopathy classification. Our models also converge faster in this task.

5 Conclusion

In this work, we asked whether designing 3D self-supervised tasks could benefit the learned representations from unlabeled 3D images, and found that it indeed greatly improves their downstream performance, especially when fine-tuned on only small amounts of labeled 3D data. We demonstrate the obtained gains by our proposed 3D algorithms in data-efficiency, performance, and speed of convergence on three different downstream tasks. Our 3D tasks outperform their 2D counterparts, hence supporting our proposal of utilizing the 3D spatial context in the design of self-supervised tasks, when operating on 3D domains. What is more, our results, particularly in the low-data regime, demonstrate the possibility to reduce the manual annotation effort required in the medical imaging domain, where data and annotation scarcity is an obstacle. Finally, we open source our implementations for all 3D methods (and also their 2D versions), and we publish them to help other researchers apply our methods on other medical imaging tasks. This work is only a first step toward creating a set of methods that facilitate self-supervised learning research for 3D data, e.g. medical scans. We believe there is room for improvement along this line, such as designing new 3D proxy tasks, evaluating different architectural options, and including other data modalities (e.g. text) besides images.

Broader Impact

Due to technological advancements in 3D data sensing, and to the growing number of its applications, the attention to machine learning algorithms that perform analysis tasks on such data has grown rapidly in the past few years. As mentioned before, 3D imaging has a multitude of applications Ioannidou et al. (2017), such as in Robotics, in CAD imaging, in Geology, and in Medical Imaging. In this work, we developed multiple 3D Deep Learning algorithms, and evaluated them on multiple 3D medical imaging benchmarks. Our focus on medical imaging is motivated by the pressing demand for automatic (and instant) analysis systems that may aid the medical community.

Medical imaging plays an important role in patient healthcare, as it aids in disease prevention, early detection, diagnosis, and treatment. With the continuous digitization of medical images, the hope that physicians and radiologists will be able to instantly analyze them with Machine Learning algorithms is slowly becoming a reality. Achieving this has become more critical recently, as the number of patients who have contracted the disease caused by a novel Coronavirus, called COVID-19, has reached record highs. Radiography images provide a rich and quick diagnostic tool, because other types of tests, e.g. RT-PCR, which is an RNA/DNA based test, have low sensitivity and may require hours or days of processing Manna et al. (2020). Therefore, as imaging allows such instant insights into human body organs, it receives growing attention from both the machine learning and the medical communities.

Yet efforts to leverage advancements in machine learning, particularly the supervised algorithms, are often hampered by the sheer expense of the expert annotation required Grünberg et al. (2017). Generating expert annotations of patient data at scale is non-trivial, expensive, and time-consuming, especially for 3D medical scans. Even current semi-automatic software tools fail to sufficiently address this challenge. Consequently, it is necessary to rely on annotation-efficient machine learning algorithms, such as self-supervised (unsupervised) approaches for representation learning from unlabelled data. Our work aims to provide the necessary tools for 3D image analysis, in general, and to aid physicians and radiologists in their diagnostic tasks from 3D scans, in particular. And as the main consequence of this work, the developed methods can help reduce the effort and cost of annotation required by these practitioners. In the larger goal of leveraging Machine Learning for good, our work is only a small step toward achieving this goal for patient healthcare.

References

  • W. Bai, C. Chen, G. Tarroni, J. Duan, F. Guitton, S. E. Petersen, Y. Guo, P. M. Matthews, and D. Rueckert (2019) Self-supervised learning for cardiac mr image segmentation by anatomical position prediction. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 541–549. External Links: ISBN 978-3-030-32245-8 Cited by: §2.
  • U. Baid, A. Mahajan, S. Talbar, S. Rane, S. Thakur, A. Moiyadi, M. Thakur, and S. Gupta (2018) GBM segmentation with 3d u-net and survival prediction with radiomics. In International MICCAI Brainlesion Workshop, pp. 28–35. Cited by: Figure 2, 3rd item.
  • S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos (2017) Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific Data 4, pp. 170117. Cited by: Appendix A, §4.1.
  • M. Blendowski, H. Nickisch, and M. P. Heinrich (2019) How to learn from unlabeled volume data: self-supervised 3d context feature learning. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 649–657. External Links: ISBN 978-3-030-32226-7 Cited by: §2.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), Munich, Germany, pp. . Cited by: §2.
  • S. Chandra, M. Vakalopoulou, L. Fidon, E. Battistella, T. Estienne, R. Sun, C. Robert, E. Deutch, and N. Paragios (2018) Context aware 3-d residual networks for brain tumor segmentation. In International MICCAI Brainlesion Workshop, pp. 74–82. Cited by: Figure 2, 3rd item.
  • L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert (2019) Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, pp. 101539. External Links: ISSN 1361-8415, Document, Link Cited by: §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: Appendix A, §2, §3.1.
  • K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. External Links: Link, Document Cited by: §3.1.
  • K. Dahun, D. Cho, and S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 8545–8552. External Links: Document Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR09, Miami, FL, USA, pp. . Cited by: §1.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, USA, pp. 1422–1430. External Links: ISBN 9781467383912, Link, Document Cited by: §2, §3.2.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27 (NIPS), External Links: Link Cited by: §3.5.
  • R. Eisenberg and A. Margulis (2011) A patient’s guide to medical imaging. New York: Oxford University Press, NY, USA. Cited by: §1.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. CoRR abs/1803.07728, pp. . External Links: Link, 1803.07728 Cited by: §2, §2, §3.4.
  • P. Goyal, D. Mahajan, H. Mulam, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. In The IEEE International Conference on Computer Vision (ICCV), pp. 6390–6399. External Links: Document Cited by: 1st item.
  • D. Griffiths and J. Boehm (2019) A review on deep learning techniques for 3d sensed data classification. CoRR abs/1907.04444. External Links: Link, 1907.04444 Cited by: §1.
  • K. Grünberg, O. Jimenez-del-Toro, A. Jakab, G. Langs, T. Salas Fernandez, M. Winterstein, M. Weber, and M. Krenn (2017) Annotating medical image data. In Cloud-Based Benchmarking of Medical Image Analysis, pp. 45–67. Cited by: §1, Broader Impact.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Proceedings of Machine Learning Research, Vol. 9, Chia Laguna Resort, Sardinia, Italy, pp. 297–304. External Links: Link Cited by: §2, §3.1.
  • O. J. Hénaff, A. Srinivas, J. D. Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord (2019) Data-efficient image recognition with contrastive predictive coding. CoRR abs/1905.09272, pp. . External Links: Link, 1905.09272 Cited by: §2, §3.1, 2nd item, 3rd item, §4.3.
  • G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2261–2269. Cited by: Appendix A.
  • A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris (2017) Deep learning advances in computer vision with 3d data: a survey. ACM Computing Surveys 50, pp. . External Links: Document Cited by: §1, Broader Impact.
  • F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein (2018) No new-net. In International MICCAI Brainlesion Workshop, Granada, Spain, pp. 234–244. Cited by: Figure 2, 3rd item, §4.1.
  • S. M. S. Islam, M. M. Hasan, and S. Abdullah (2018) Deep learning based early detection and grading of diabetic retinopathy using retinal fundus images. CoRR abs/1812.10595. External Links: Link, 1812.10595 Cited by: §1.
  • A. Jamaludin, T. Kadir, and A. Zisserman (2017) Self-supervised learning for spinal mris. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Cham, pp. 294–302. External Links: ISBN 978-3-319-67557-2, Document Cited by: §2.
  • S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §2.
  • J. Jiao, R. Droste, L. Drukker, A. T. Papageorghiou, and J. A. Noble (2020) Self-supervised representation learning for ultrasound video. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Vol. , pp. 1847–1850. Cited by: §2.
  • L. Jing and Y. Tian (2018) Self-supervised spatiotemporal feature learning by video geometric transformations. CoRR abs/1811.11387. External Links: Link, 1811.11387 Cited by: §2.
  • L. Jing and Y. Tian (2019) Self-supervised visual feature learning with deep neural networks: A survey. CoRR abs/1902.06162. External Links: Link, 1902.06162 Cited by: §2, §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015. External Links: Link Cited by: Appendix A.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix A.
  • H. Li and Y. Fan (2018) Non-rigid image registration using self-supervised fully convolutional networks without training data. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Vol. , Washington, DC, USA, pp. 1075–1078. External Links: ISSN 1945-8452 Cited by: §2.
  • X. Liu, A. Sinha, M. Unberath, M. Ishii, G. D. Hager, R. H. Taylor, and A. Reiter (2018) Self-supervised learning for dense depth estimation in monocular endoscopy. CoRR abs/1806.09521, pp. . External Links: Link, 1806.09521 Cited by: §2.
  • S. Manna, J. Wruble, S. Z. Maron, D. Toussie, N. Voutsinas, M. Finkelstein, M. A. Cedillo, J. Diamond, C. Eber, A. Jacobi, M. Chung, and A. Bernheim (2020) COVID-19: a multimodality review of radiologic techniques, clinical utility, and imaging features. Radiology: Cardiothoracic Imaging 2 (3), pp. e200210. External Links: Document, Link, https://doi.org/10.1148/ryct.2020200210 Cited by: Broader Impact.
  • B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, and et al. (2015) The multimodal brain tumor image segmentation benchmark (brats). IEEE Transactions on Medical Imaging 34 (10), pp. 1993–2024. Cited by: Appendix A, §4.1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Scottsdale, Arizona, USA, pp. . External Links: Link Cited by: §2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 69–84. External Links: ISBN 978-3-319-46466-4 Cited by: 5th item, §2, §3.3.
  • D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • A. Popli, M. Agarwal, and G.N. Pillai (2018) Automatic brain tumor segmentation using u-net based 3d fully convolutional network. In Pre-Conference Proceedings of the 7th MICCAI BraTS Challenge, pp. 374–382. Cited by: Figure 2, 3rd item.
  • S. Purushwalkam and A. Gupta (2016) Pose from action: unsupervised learning of pose features based on motion. CoRR abs/1609.05420. External Links: Link, 1609.05420 Cited by: §2.
  • M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio (2019) Transfusion: understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems 32, pp. 3347–3357. External Links: Link Cited by: §1.
  • P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Y. Ding, A. Bagul, C. Langlotz, K. S. Shpanskaya, M. P. Lungren, and A. Y. Ng (2017) CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR abs/1711.05225. External Links: Link, 1711.05225 Cited by: §1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. External Links: ISBN 978-3-319-24574-4 Cited by: Appendix A, 3rd item.
  • T. Roß, D. Zimmerer, A. Vemuri, F. Isensee, S. Bodenstedt, F. Both, P. Kessler, M. Wagner, B. Müller, H. Kenngott, S. Speidel, K. Maier-Hein, and L. Maier-Hein (2017) Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. International Journal of Computer Assisted Radiology and Surgery 13, pp. . External Links: Document Cited by: §2.
  • J. Sahlsten, J. Jaskari, J. Kivinen, L. Turunen, E. Jaanio, K. Hietala, and K. Kaski (2019) Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. Scientific Reports 9, pp. . External Links: Document Cited by: §1.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 815–823. Cited by: §3.5.
  • A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. J. S. Litjens, B. H. Menze, O. Ronneberger, R. M. Summers, P. Bilic, P. F. Christ, R. K. G. Do, M. Gollub, J. Golia-Pernicka, S. Heckers, W. R. Jarnagin, M. McHugo, S. Napel, E. Vorontsov, L. Maier-Hein, and M. J. Cardoso (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. CoRR abs/1902.09063. External Links: Link, 1902.09063 Cited by: §4.2.
  • C. G. M. Snoek, M. Worring, and A. W. M. Smeulders (2005) Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA ’05, New York, NY, USA, pp. 399–402. External Links: ISBN 1-59593-044-2, Link, Document Cited by: Appendix A.
  • M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber (2015) Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation. CoRR abs/1506.07452. External Links: Link, 1506.07452 Cited by: §3.1.
  • H. Su, L. Guibas, M. Bronstein, E. Kalogerakis, J. Yang, C. Qi, and Q. Huang (2017) 3D deep learning. External Links: Link Cited by: §1.
  • C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, B. Liu, P. Matthews, G. Ong, J. Pell, A. Silman, A. Young, T. Sprosen, T. Peakman, and R. Collins (2015) UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine 12 (3), pp. 1–10. External Links: Link, Document Cited by: Appendix A, §4.1.
  • A. Taleb, C. Lippert, T. Klein, and M. Nabi (2019) Multimodal self-supervised learning for medical image analysis. External Links: 1912.05396 Cited by: §2.
  • tensorflow.org (2020) Tensorflow v2.1. External Links: Link Cited by: Appendix A.
  • A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In CVPR 2011, Vol. , pp. 1521–1528. Cited by: §1.
  • A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4790–4798. External Links: Link Cited by: §3.1.
  • A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1747–1756. Cited by: §3.1.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748, pp. . External Links: Link, 1807.03748 Cited by: §2, §3.1, §3.1.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2015) Anticipating the future by watching unlabeled video. CoRR abs/1504.08023. External Links: Link, 1504.08023 Cited by: §2.
  • J. Walker, A. Gupta, and M. Hebert (2015) Dense optical flow prediction from a static image. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2443–2451. Cited by: §2.
  • J. Wang, S. Zhu, J. Xu, and D. Cao (2019) The retrieval of the beautiful: self-supervised salient object detection for beauty product retrieval. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, New York, NY, USA, pp. 2548–2552. External Links: ISBN 9781450368896, Link, Document Cited by: §2.
  • X. Wang and A. Gupta (2015a) Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2794–2802. Cited by: §3.5.
  • X. Wang and A. Gupta (2015b) Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2794–2802. Cited by: §2.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3462–3471. Cited by: §1.
  • K. Yan, X. Wang, L. Lu, L. Zhang, A. P. Harrison, M. Bagheri, and R. M. Summers (2019) Deep lesion graph in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics, pp. 413–435. External Links: ISBN 978-3-030-13969-8, Document, Link Cited by: §2.
  • M. Ye, E. Johns, A. Handa, L. Zhang, P. Pratt, and G. Yang (2017) Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. In The Hamlyn Symposium on Medical Robotics, pp. 27–28. External Links: Document Cited by: §2.
  • P. Zhang, F. Wang, and Y. Zheng (2017) Self supervised deep representation learning for fine-grained body part recognition. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), Vol. , Melbourne, Australia, pp. 578–582. Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 649–666. External Links: ISBN 978-3-319-46487-9 Cited by: §2.
  • X. Zhuang, Y. Li, Y. Hu, K. Ma, Y. Yang, and Y. Zheng (2019) Self-supervised feature learning for 3d medical images by playing a rubik’s cube. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 420–428. External Links: ISBN 978-3-030-32251-9 Cited by: §2.

Appendix A Implementation and training details for all tasks

It is noteworthy that our attached implementations are flexible enough to allow for evaluating several types of network architectures for encoders, decoders, and classifiers. We also provide implementations for multiple losses, augmentation techniques, and evaluation metrics. More information can be found in the README.md file in our attached code-base. We rely on TensorFlow v2.1 tensorflow.org [2020] with the Keras API in our implementations. Below, we provide the training details we used in implementing our 3D self-supervised tasks (and their 2D counterparts), and when fine-tuning them in subsequent downstream tasks.

Architecture details.

For all 3D encoders, which are pretrained with our 3D self-supervised tasks and later fine-tuned on 3D segmentation tasks, we use a 3D U-Net Ronneberger et al. [2015]-based encoder (the downward path), which consists of five levels of residual convolutional blocks. The numbers of filters in these blocks are , respectively. The U-Net decoder (the upward path) is added in the downstream fine-tuning stage, and it includes five levels of deconvolutional blocks with skip connections from the U-Net encoder blocks. For the 2D encoders, we use a standard Densenet-121 Huang et al. [2017] architecture, which is fine-tuned later on 2D classification tasks. When training our 3D self-supervised tasks, we follow Chen et al. [2020] in adding nonlinear transformations (a hidden layer with ReLU activation) before the final classification layers. These classification layers are removed when fine-tuning the resulting encoders in downstream tasks.

Optimization details.

In all self-supervised and downstream tasks, we use the Adam optimizer Kingma and Ba [2014] to train the models. The initial learning rate we use is in 3D self-supervised tasks, in 3D segmentation tasks, in 2D self-supervised tasks, and in 2D classification tasks. When we fine-tune our pretrained encoders in subsequent downstream tasks, we follow a warm-up procedure inspired by Kolesnikov et al. [2019]: we keep the encoder weights frozen for a number of initial warm-up epochs while the network decoders or classifiers are trained. We use 5 warm-up epochs in 2D classification tasks, and 25 in 3D segmentation tasks. In total, we evaluated three options: 1) fine-tuning the encoder directly with a randomly initialized decoder, 2) keeping the encoder frozen throughout the training procedure, and 3) the hybrid approach of warm-up epochs described above. We followed the third option in the end, as it provided a performance boost over the other alternatives.
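The warm-up schedule can be sketched in a framework-agnostic way. The function names below (`encoder_is_trainable`, `finetune_schedule`) are hypothetical and are not taken from the released code. The sketch only illustrates the freezing logic, using the 25 warm-up epochs reported for 3D segmentation.

```python
from typing import List


def encoder_is_trainable(epoch: int, warmup_epochs: int) -> bool:
    """Return True once the warm-up phase is over (epochs are 0-indexed).

    During warm-up only the decoder or classifier is trained, so the
    pretrained encoder weights stay frozen.
    """
    return epoch >= warmup_epochs


def finetune_schedule(total_epochs: int, warmup_epochs: int) -> List[bool]:
    """List, per epoch, whether the encoder weights should be updated."""
    return [encoder_is_trainable(e, warmup_epochs) for e in range(total_epochs)]


# Illustrative run: 25 warm-up epochs (the 3D segmentation setting) out of 30.
schedule = finetune_schedule(total_epochs=30, warmup_epochs=25)
print(schedule.count(False), "frozen epochs,", schedule.count(True), "trainable epochs")
```

In Keras, one way to apply such a flag would be to set `encoder.trainable` accordingly and recompile the model when the flag changes, since a change to `trainable` takes effect only after compilation.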

Input preprocessing.

For all input scans, we perform the following preprocessing steps:

  • In self-supervised pretraining using 3D scans, we find the boundaries of the brain or the pancreas along each axis, and then we crop the remaining empty parts from the scan. This step reduces the amount of empty background voxels, as they might confuse patch-based self-supervised methods with no additional semantic information. This step is not performed when fine-tuning on 3D downstream tasks.

  • Then, we resize each 3D image from BraTS or Pancreas to a unified resolution of , and to the resolution for 2D images from Diabetic Retinopathy.

  • Then, each image’s intensity values are normalized by scaling them to the range .
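The cropping and normalization steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the released implementation: it treats voxels equal to zero as empty background, it assumes a target intensity range of [0, 1], and it omits the resizing step. The helper names are hypothetical.

```python
import numpy as np


def crop_to_foreground(volume: np.ndarray, background: float = 0.0) -> np.ndarray:
    """Crop a volume to the tightest box that contains non-background voxels."""
    mask = volume != background
    if not mask.any():
        return volume  # fully empty scan: nothing to crop
    slices = []
    for axis in range(volume.ndim):
        other_axes = tuple(a for a in range(volume.ndim) if a != axis)
        # Indices along this axis where at least one foreground voxel exists.
        occupied = np.flatnonzero(mask.any(axis=other_axes))
        slices.append(slice(occupied[0], occupied[-1] + 1))
    return volume[tuple(slices)]


def min_max_scale(volume: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale intensities to [0, 1] (assumed range; eps avoids division by zero)."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    return (volume - lo) / (hi - lo + eps)


# Toy scan: an 8x8x8 volume with a small non-empty region.
scan = np.zeros((8, 8, 8))
scan[2:5, 1:7, 3:4] = 5.0
scan[3, 3, 3] = 10.0
cropped = crop_to_foreground(scan)
normalized = min_max_scale(cropped)
print(cropped.shape)
```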

Processing multimodal inputs.

In the first downstream task of brain tumor segmentation with 3D multimodal MRI, we pretrain using the UK Biobank Sudlow et al. [2015] corpus, as mentioned earlier. Brain scans obtained from UKB contain 2 MRI modalities (T1 and T2-Flair), which are co-registered. This allows us to stack these 2 modalities as color channels in each input sample, similar to RGB channels. This form of early fusion Snoek et al. [2005] of MRI modalities is common when they are registered, and is a practical solution for combining all information that exists in these modalities. However, as mentioned earlier, we use the BraTS Menze et al. [2015], Bakas et al. [2017] dataset for fine-tuning, and each scan consists of 4 different MRI modalities, as opposed to only 2 in UKB, which is used for pretraining. This difference only affects the input layer of the pretrained encoder, because fine-tuning with an incompatible number of input channels fails. We resolve this issue by duplicating (copying) the weights of only the pretrained input layer. This minor modification only adds a few additional parameters to the input layer, but allows us to leverage its weights. The alternative would have been to discard the weights of this input layer, and initialize the rest of the model layers from pretrained models normally. But we believe our solution takes advantage of any useful information encoded in these weights. This multimodal inputs problem does not occur in the other downstream tasks, as the inputs include only one modality/channel.
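Duplicating the input-layer weights can be sketched as follows. The sketch assumes a convolution kernel stored as (depth, height, width, input channels, filters), which is the layout Keras uses for `Conv3D` kernels. The function name `expand_input_kernel` is hypothetical. The text does not say how the copied channels are ordered or whether they are rescaled, so plain tiling without rescaling is an assumption here.

```python
import numpy as np


def expand_input_kernel(kernel: np.ndarray, new_in_channels: int) -> np.ndarray:
    """Tile a conv kernel along its input-channel axis (second to last).

    kernel: array of shape (depth, height, width, in_channels, filters).
    The pretrained channels are repeated until new_in_channels is reached.
    """
    old_in = kernel.shape[-2]
    repeats = -(-new_in_channels // old_in)  # ceiling division
    tiled = np.concatenate([kernel] * repeats, axis=-2)
    return tiled[..., :new_in_channels, :]


# Pretrained on 2 modalities (UKB), fine-tuned on 4 modalities (BraTS).
# The kernel size and filter count here are illustrative only.
pretrained = np.random.default_rng(0).normal(size=(3, 3, 3, 2, 8))
expanded = expand_input_kernel(pretrained, new_in_channels=4)
print(pretrained.shape, "->", expanded.shape)
```

All layers after the input layer keep their pretrained weights unchanged, because only the number of input channels differs between the two datasets.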

Task specific details.

  • 3D-CPC and 3D-Exe: we use a latent representation code size of 1024 in these tasks.

  • 3D-Jig and 3D-RPL: we split the input 3D images into patches in these tasks. We apply a random jitter of 3 pixels per side (axis).

  • Patch-based tasks (3D-CPC, 3D-RPL, 3D-Jig): each extracted patch is represented using an embedding vector of size 64.

  • 3D-Exe: the value used for the triplet loss is .

  • 3D-Jig: the complexity of the Jigsaw puzzle solving task relies on the number of permutations used in generating the puzzles, i.e. the more permutations used, the harder the task is to solve. We follow the Hamming distance-based algorithm from Noroozi and Favaro [2016] in sampling the permutations for this task. However, in our 3D puzzles task, we sample permutations that are more complex with 27 different entries. This algorithm works as follows: we sample a subset of 1000 permutations which are selected based on their Hamming distance, i.e., the number of different tile locations between 2 permutations. When generating permutations, we ensure that the average Hamming distance across permutations is kept as high as possible. This results in a set of permutations (classes) that are as far as possible from each other.
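The Hamming distance-based selection for 3D-Jig can be sketched as a greedy procedure. This is a minimal illustration, not the released implementation. Enumerating every ordering of 27 tiles is infeasible, so the sketch draws a random candidate pool first, which is an assumption. It then repeatedly adds the candidate with the largest total Hamming distance to the permutations already chosen. The function name and the `pool_size` parameter are hypothetical.

```python
import numpy as np


def sample_permutations(n_tiles: int = 27, n_perms: int = 1000,
                        pool_size: int = 5000, seed: int = 0) -> np.ndarray:
    """Greedily select permutations that are far apart in Hamming distance."""
    if pool_size < n_perms:
        raise ValueError("pool_size must be at least n_perms")
    rng = np.random.default_rng(seed)
    # Candidate pool of random permutations (assumption: no full enumeration).
    pool = np.stack([rng.permutation(n_tiles) for _ in range(pool_size)])
    available = np.ones(pool_size, dtype=bool)
    total_dist = np.zeros(pool_size, dtype=np.int64)
    chosen = []
    idx = int(rng.integers(pool_size))  # random starting permutation
    for _ in range(n_perms):
        chosen.append(idx)
        available[idx] = False
        # Hamming distance: number of tile positions that differ.
        total_dist += (pool != pool[idx]).sum(axis=1)
        # Next pick: the unused candidate farthest (in total) from the chosen set.
        idx = int(np.argmax(np.where(available, total_dist, -1)))
    return pool[chosen]


perms = sample_permutations(n_tiles=27, n_perms=100, pool_size=500)
print(perms.shape)
```

Each row of the result is one puzzle class: entry `i` gives the tile placed at position `i`.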

Appendix B Detailed experimental results

Figure 7: Pancreas segmentation: Detailed data-efficiency results per method (blue) vs. the supervised baseline (orange). Our methods consistently outperform the baseline in low-data cases. Panels: (a) CPC 3D, (b) RPL 3D, (c) Jigsaw 3D, (d) Rotation 3D, (e) Exemplar 3D, each vs. baseline
Figure 8: Pancreas segmentation: Detailed speed of convergence results per method (blue) vs. the supervised baseline (orange). This benefit of our methods helps achieve high results using only few epochs. Panels: (a) CPC 3D, (b) RPL 3D, (c) Jigsaw 3D, (d) Rotation 3D, (e) Exemplar 3D, each vs. baseline
Figure 9: Retinopathy detection: Detailed data-efficiency results per method (blue) vs. the supervised baseline (orange). Our methods consistently outperform the baseline in low-data cases. Panels: (a) CPC 2D, (b) RPL 2D, (c) Jigsaw 2D, (d) Rotation 2D, (e) Exemplar 2D, each vs. baseline
Figure 10: Retinopathy detection: Detailed speed of convergence results per method (blue) vs. the supervised baseline (orange). This benefit of our methods helps achieve high results using only few epochs. Panels: (a) CPC 2D, (b) RPL 2D, (c) Jigsaw 2D, (d) Rotation 2D, (e) Exemplar 2D, each vs. baseline