Self-supervised learning methods have witnessed a recent surge of interest after proving successful in multiple application fields. In this work, we leverage these techniques, and we propose 3D versions for five different self-supervised methods, in the form of proxy tasks. Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation. The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks. Our experiments show that pretraining models with our 3D tasks yields more powerful semantic representations, and enables solving downstream tasks more accurately and efficiently, compared to training the models from scratch and to pretraining them on 2D slices. We demonstrate the effectiveness of our methods on three downstream tasks from the medical imaging domain: i) Brain Tumor Segmentation from 3D MRI, ii) Pancreas Tumor Segmentation from 3D CT, and iii) Diabetic Retinopathy Detection from 2D Fundus images. In each task, we assess the gains in data-efficiency, performance, and speed of convergence. We achieve results competitive to state-of-the-art solutions at a fraction of the computational expense. We also publish the implementations for the 3D and 2D versions of our algorithms as an open-source library, in an effort to allow other researchers to apply and extend our methods on their datasets.
Due to technological advancements in 3D sensing, the need for machine learning-based algorithms that perform analysis tasks on 3D imaging data has grown rapidly in the past few years Griffiths and Boehm (2019); Ioannidou et al. (2017); Su et al. (2017). 3D imaging has numerous applications, such as in Robotic navigation, in CAD imaging, in Geology, and in Medical Imaging. While we focus on medical imaging as a test-bed for our proposed 3D algorithms in this work, we ensure their applicability to other 3D domains. Medical imaging plays a vital role in patient healthcare, as it aids in disease prevention, early detection, diagnosis, and treatment. Yet efforts to utilize advancements in machine learning algorithms are often hampered by the sheer expense of the expert annotation required Grünberg et al. (2017). Generating expert annotations of 3D medical images at scale is non-trivial, expensive, and time-consuming. Another related challenge in medical imaging is the relatively small sample sizes. This becomes more obvious when studying a particular disease, for instance. Also, gaining access to large-scale datasets is often difficult due to privacy concerns. Hence, scarcity of data and annotations are some of the main constraints for machine learning applications in medical imaging.
Several efforts have attempted to address these challenges, as they are common to other application fields of deep learning. A widely used technique is transfer learning, which aims to reuse the features of already trained neural networks on different, but related, target tasks. A common example is adapting the features from networks trained on ImageNet, which can be reused for other visual tasks, e.g. semantic segmentation. To some extent, transfer learning has made it easier to solve tasks with a limited number of samples. However, as mentioned before, the medical domain is supervision-starved. Despite attempts to leverage ImageNet Deng et al. (2009) features in the medical context Wang et al. (2017); Rajpurkar et al. (2017); Sahlsten et al. (2019); Islam et al. (2018), the difference in the distributions of natural and medical images is significant, i.e. generalizing across these domains is questionable and can suffer from dataset bias Torralba and Efros (2011). Recent analysis Raghu et al. (2019) has also found that such transfer learning offers limited performance gains, relative to the computational costs it incurs. Consequently, it is necessary to find better solutions for the aforementioned challenges.
A viable alternative is to employ self-supervised (unsupervised) methods, which proved successful in multiple domains recently. In these approaches, the supervisory signals are derived from the data. In general, we withhold some part of the data, and train the network to predict it. This prediction task defines a proxy loss, which encourages the model to learn semantic representations about the concepts in the data. Subsequently, this facilitates data-efficient fine-tuning on supervised downstream tasks, significantly reducing the burden of manual annotation. Despite the surge of interest in the machine learning community in self-supervised methods, little work has been done to adopt these methods in the medical imaging domain. We believe that self-supervised learning is directly applicable in the medical context, and can offer cheaper solutions for the challenges faced by conventional supervised methods. Unlabelled medical images carry valuable information about organ structures, and self-supervision enables the models to derive notions about these structures with no additional annotation cost.
A particular aspect of most medical images, which has received little attention from previous self-supervised methods, is their 3D nature Eisenberg and Margulis (2011). The common paradigm is to cast 3D imaging tasks in 2D, by extracting slices along an arbitrary axis, e.g. the axial dimension. However, such tasks can substantially benefit from the full 3D spatial context, thus capturing rich anatomical information. We believe that relying on the 2D context to derive data representations from 3D images is, in general, a suboptimal solution, which compromises the performance on downstream tasks.
Our contributions. As a result, in this work, we propose five self-supervised tasks that utilize the full 3D spatial context, aiming to better adopt self-supervision in 3D imaging. The proposed tasks are: 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks. These algorithms are inspired by their successful 2D counterparts, and to the best of our knowledge, except for Jigsaw, none of these methods have actually been extended to the 3D context, let alone applied to the medical domain. Several computational and methodological challenges arise when designing self-supervised tasks in 3D, due to the increased data dimensionality, which we address in our methods to ensure their efficiency. We perform extensive experiments using four datasets, and we show that our 3D tasks result in rich data representations that improve data-efficiency and performance on three different downstream tasks. Finally, we publish the implementations of our 3D tasks, and also of their 2D versions, in order to allow other researchers to evaluate these methods on other imaging datasets.
In general, unsupervised representation learning can be formulated as learning an embedding space, in which data samples that are semantically similar are closer, and those that are different are far apart. The self-supervised family constructs such a representation space by creating a supervised proxy task from the data itself. Then, the embeddings that solve the proxy task will also be useful for other real-world downstream tasks. Several methods in this line of research have been developed recently, and they found applications in numerous fields Jing and Tian (2019). In this work, we focus on methods that operate on images only.
Self-supervised methods differ in their core building block, i.e. the proxy task used to learn representations from unlabelled input data. A commonly used supervision source for proxy tasks is the spatial context from images, which was first inspired by the skip-gram Word2Vec Mikolov et al. (2013) algorithm. This idea was generalized to images in Doersch et al. (2015), in which a visual representation is learned by predicting the position of an image patch relative to another. A similar work extended this patch-based approach to solve Jigsaw Puzzles Noroozi and Favaro (2016). Other works have used different supervision sources, such as image colors Zhang et al. (2016), clustering Caron et al. (2018), image rotation prediction Gidaris et al. (2018), object saliency Wang et al. (2019), and image reconstruction Pathak et al. (2016). In recent works, Contrastive Predictive Coding (CPC) approaches van den Oord et al. (2018); Hénaff et al. (2019) advanced the results of self-supervised methods on multiple imaging benchmarks Chen et al. (2020). These methods utilize the idea of contrastive learning in the latent space, similar to Noise Contrastive Estimation Gutmann and Hyvärinen (2010). In 2D images, the model has to predict the latent representation for next (adjacent) image patches. Our work follows this line of research; however, our methods utilize the full 3D context.
While videos are rich with more types of supervisory signals Wang and Gupta (2015b); Vondrick et al. (2015); Walker et al. (2015); Purushwalkam and Gupta (2016), we discuss here a subset of these works that utilize 3D-CNNs to process input videos. In this context, 3D-CNNs are employed to simultaneously extract spatial features from each frame, and temporal features across multiple frames, which are typically stacked along the third (depth) dimension. The idea of exploiting 3D convolutions for videos was proposed in Ji et al. (2013) for human action recognition, and was later extended to other applications Jing and Tian (2019). In self-supervised learning, however, the number of pretext tasks that exploit this technique is still limited. Kim et al. Dahun et al. (2019) proposed a task that extracts cubic puzzles of , meaning that the third dimension is not actually utilized in puzzle creation. Jing et al. Jing and Tian (2018) extended the rotation prediction task Gidaris et al. (2018) to videos, by simply stacking video frames along the depth dimension; however, this dimension is not employed in the design of their task, as only spatial rotations are considered. On the other hand, in our more general versions of 3D Jigsaw puzzles and 3D Rotation prediction, respectively, we exploit the depth (third) dimension in the design of our tasks. For instance, we solve larger 3D puzzles up to , and we also predict more rotations along all axes in the 3D space. In general, we believe the different nature of the data, 3D volumetric scans vs. stacked video frames, influences the design of proxy tasks, i.e. the depth dimension has an actual semantic meaning in volumetric scans. Hence, we consider the whole 3D context when designing all of our methods, aiming to learn valuable anatomical information from unlabeled 3D volumetric scans.
In the medical context, self-supervision has found use-cases in diverse applications such as depth estimation in monocular endoscopy Liu et al. (2018), robotic surgery Ye et al. (2017), medical image registration Li and Fan (2018), body part recognition Zhang et al. (2017), in disc degeneration using spinal MRIs Jamaludin et al. (2017), in cardiac image segmentation Bai et al. (2019), body part regression for slice ordering Yan et al. (2019), and medical instrument segmentation Roß et al. (2017). There are multiple other examples of self-supervised methods for medical imaging, such as Chen et al. (2019); Jiao et al. (2020); Taleb et al. (2019); Blendowski et al. (2019). While these attempts are a step forward for self-supervised learning in medical imaging, they have some limitations. First, as opposed to our work, many of these works make assumptions about input data, resulting in engineered solutions that hardly generalize to other target tasks. Second, none of the above works capture the complete spatial context available in 3-dimensional scans, i.e. they only operate on 2D/2.5D spatial context. In a more related work, Zhuang et al. Zhuang et al. (2019) developed a proxy task that solves small 3D jigsaw puzzles. Their proposed puzzles were only limited to of puzzle complexity. Our version of 3D Jigsaw puzzles is able to efficiently solve larger puzzles, e.g. , and outperforms their method’s results on the downstream task of Brain tumor segmentation. In this paper, we continue this line of work, and develop five different algorithms for 3D data, whose nature and performance can accommodate more types of target medical applications.
In this section, we discuss the formulations of our 3D self-supervised pretext tasks, all of which learn data representations from unlabeled samples (3D images), hence requiring no manual annotation effort in the self-supervised pretraining stage. Each task results in a pretrained encoder model that can be fine-tuned in various downstream tasks, subsequently.
Following the contrastive learning idea, first proposed in Gutmann and Hyvärinen (2010), this universal unsupervised technique predicts the latent space for future (next or adjacent) samples. Recently, CPC found success in multiple application fields, e.g. its 1D version in audio signals van den Oord et al. (2018), and its 2D versions in images van den Oord et al. (2018); Hénaff et al. (2019), and was able to bridge the gap between unsupervised and fully-supervised methods Chen et al. (2020). Our proposed CPC version generalizes this technique to 3D inputs, and defines a proxy task by cropping equally-sized and overlapping 3D patches from each input scan. Then, the encoder model maps each input patch to its latent representation . Next, another model called the context network is used to summarize the latent vectors of the patches in the context of , and produce its context vector , where denotes a set of latent vectors. Finally, because captures the high level content of the context that corresponds to , it allows for predicting the latent representations of next (adjacent) patches , where . This prediction task is cast as an -way classification problem by utilizing the InfoNCE loss van den Oord et al. (2018), which takes its name from its ability to maximize the mutual information between and . Here, the classes are the latent representations of the patches, among which is one positive representation, and the rest are negative. Formally, the CPC loss can be written as follows:
This loss corresponds to the categorical cross-entropy loss, which trains the model to recognize the correct representation among the list of negative representations . These negative samples (3D patches) are chosen randomly from other locations in the input image. In practice, similar to the original NCE Gutmann and Hyvärinen (2010), this task is solved as a binary pairwise classification task.
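To make the classification view of the InfoNCE objective concrete, the following is a minimal NumPy sketch for a single prediction. It is illustrative only: the function name `info_nce_loss`, the dot-product scoring, and the argument layout are assumptions made for this example and are not taken from the released implementation. The positive representation is treated as class 0 among all candidates, and the loss is the categorical cross-entropy of selecting it.

```python
import numpy as np

def info_nce_loss(pred, positive, negatives):
    """InfoNCE loss for one prediction (hypothetical helper).

    pred:      (d,)   latent vector predicted from the context vector
    positive:  (d,)   encoder latent of the true adjacent patch
    negatives: (n, d) encoder latents of patches from other locations
    """
    # Stack candidates so that the positive representation is class 0.
    candidates = np.vstack([positive[None, :], negatives])
    logits = candidates @ pred            # dot-product similarity scores (assumed)
    logits = logits - logits.max()        # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                  # cross-entropy of picking class 0
```

When all candidates are indistinguishable, the loss equals the logarithm of the number of candidates, and it approaches zero as the prediction aligns with the positive representation only.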
It is noteworthy that the proposed 3D-CPC task, illustrated in Fig. 1 (a), allows employing any network architecture in the encoder and the context networks. In our experiments, we follow van den Oord et al. (2018) in using an autoregressive network using GRUs Cho et al. (2014) for the context network; however, masked convolutions can be a valid alternative van den Oord et al. (2016). In terms of what the 3D context of each patch includes, we follow the idea of an inverted pyramid neighborhood, which is inspired by Stollenga et al. (2015); Van Den Oord et al. (2016).
In this task, the spatial context in images is leveraged as a rich source of supervision, in order to learn semantic representations of the data. First proposed by Doersch et al. Doersch et al. (2015) for 2D images, this task inspired several works in self-supervision. In our 3D version of this task, shown in Fig. 1 (b), we leverage the full 3D spatial context in the design of our task. From each input 3D image, a 3D grid of non-overlapping patches is sampled at random locations. Then, the patch in the center of the grid is used as a reference, and a query patch is selected from the surrounding patches. Next, the location of relative to is used as the positive label . This casts the task as an -way classification problem, in which the locations of the remaining grid patches are used as the negative samples . Formally, the cross-entropy loss in this task is written as:
Where is the number of queries extracted from all samples. In order to prevent the model from solving this task quickly by finding shortcut solutions, e.g. edge continuity, we follow Doersch et al. (2015) in leaving random gaps (jitter) between neighbor 3D patches. More details in Appendix.
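The sampling procedure for relative patch location can be sketched as follows. This is a hedged illustration rather than the released code: the function name `sample_rpl_example`, the 3×3×3 grid, the patch size, and the jitter width are all assumed values chosen for the example. Each patch is cropped at a random offset inside a slightly larger grid cell, which produces the random gaps between neighbouring patches.

```python
import numpy as np

def sample_rpl_example(volume, patch=8, jitter=2, grid=3, rng=None):
    """Return (reference_patch, query_patch, label) from a 3D volume.

    All default sizes are illustrative assumptions. The volume must be at
    least grid * (patch + jitter) voxels along every axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    cell = patch + jitter                 # cell side, leaves room for jitter
    span = cell * grid
    # Random origin of the whole grid inside the volume.
    origin = [rng.integers(0, s - span + 1) for s in volume.shape]
    patches = []
    for idx in np.ndindex(grid, grid, grid):
        off = rng.integers(0, jitter + 1, size=3)   # random gap per patch
        start = [o + i * cell + d for o, i, d in zip(origin, idx, off)]
        patches.append(volume[tuple(slice(s, s + patch) for s in start)])
    center = len(patches) // 2            # central patch is the reference
    others = [i for i in range(len(patches)) if i != center]
    label = int(rng.integers(0, len(others)))   # relative location class
    return patches[center], patches[others[label]], label
```

With a 3×3×3 grid there are 26 candidate locations around the reference patch, so the classifier in this sketch has 26 output classes.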
Deriving a Jigsaw puzzle grid from an input image, be it in 2D or 3D, and solving it can be viewed as an extension to the above patch-based RPL task. In our 3D Jigsaw puzzle task, which is inspired by its 2D counterpart Noroozi and Favaro (2016) and illustrated in Fig. 1 (c), the puzzles are formed by sampling an grid of 3D patches. Then, these patches are shuffled according to an arbitrary permutation, selected from a set of predefined permutations. This set of permutations with size is chosen out of the possible permutations, by following the Hamming distance based algorithm in Noroozi and Favaro (2016) (details in Appendix), and each permutation is assigned an index . Therefore, the problem is now cast as a -way classification task, i.e., the model is trained to simply recognize the applied permutation index , allowing us to solve the 3D puzzles in an efficient manner. Formally, we minimize the cross-entropy loss of , where is an arbitrary 3D puzzle from the list of extracted puzzles. Similar to 3D-RPL, we use the trick of adding random jitter in 3D-Jig.
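The idea of choosing a permutation set with large pairwise Hamming distances can be sketched as below. Because the number of possible permutations is too large to enumerate for 3D grids, this example draws a random candidate pool and greedily keeps the candidate that is farthest from the permutations already chosen. The pool-based approximation, the function names, and all sizes are assumptions for illustration; the exact procedure used in our implementation is described in the Appendix.

```python
import numpy as np

def select_permutations(n_patches, n_perms, pool_size=1000, seed=0):
    """Greedily pick permutations that are far apart in Hamming distance."""
    rng = np.random.default_rng(seed)
    pool = np.array([rng.permutation(n_patches) for _ in range(pool_size)])
    chosen = [pool[0]]
    # Minimum Hamming distance from each candidate to the chosen set.
    min_dist = (pool != chosen[0]).sum(axis=1)
    while len(chosen) < n_perms:
        best = int(min_dist.argmax())     # farthest candidate so far
        chosen.append(pool[best])
        min_dist = np.minimum(min_dist, (pool != pool[best]).sum(axis=1))
    return np.array(chosen)

def make_puzzle(patches, perms, rng):
    """Shuffle patches with a random predefined permutation; return its index."""
    label = int(rng.integers(len(perms)))
    return [patches[i] for i in perms[label]], label
```

The returned index is the classification target, so the network only needs one output per predefined permutation rather than one per possible ordering.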
Originally proposed by Gidaris et al. Gidaris et al. (2018), the rotation prediction task encourages the model to learn visual representations by simply predicting the angle by which the input image is rotated. The intuition behind this task is that for a model to successfully predict the angle of rotation, it needs to capture sufficient semantic information about the object in the input image. In our 3D Rotation prediction task, each 3D input image is rotated by a degree chosen at random from the set of considered degrees. For simplicity, we consider the multiples of 90 degrees (0°, 90°, 180°, 270°) along each axis of the 3D coordinate system. There are 4 possible rotations per axis, amounting to 12 possible rotations. However, rotating input scans by 0° along the 3 axes produces 3 identical versions of the original scan; hence, we consider 10 rotation degrees instead. In this setting, the proxy task can be solved as a 10-way classification problem, in which the model is tasked to predict the rotation degree (class), as shown in Fig. 1 (d). Formally, we minimize the cross-entropy loss of predicting the rotation class of each rotated 3D image in the list of rotated images. It is noteworthy that we create multiple rotated versions for each 3D image.
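The ten rotation classes can be enumerated with `np.rot90`, as in the following sketch. The class ordering, the `ROTATIONS` table, and the function names are assumptions made for this example. Class 0 is the unrotated volume, and classes 1 to 9 cover 90°, 180°, and 270° about each of the three axes.

```python
import numpy as np

# Each entry is (k, plane): k quarter-turns in the plane of two array axes.
# Rotating about axis 0 means rotating in the plane of axes (1, 2), and so on.
ROTATIONS = [(0, (0, 1))] + [(k, plane)
                             for plane in [(1, 2), (0, 2), (0, 1)]
                             for k in (1, 2, 3)]

def rotate_volume(volume, label):
    """Apply the rotation that corresponds to a class label."""
    k, plane = ROTATIONS[label]
    return np.rot90(volume, k=k, axes=plane)

def make_rotation_example(volume, rng):
    """Return a randomly rotated copy of the volume and its class label."""
    label = int(rng.integers(len(ROTATIONS)))
    return rotate_volume(volume, label), label
```

Note that 90° and 270° rotations change the array shape unless the two rotated axes have equal length, so this sketch is most convenient with cubic crops.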
The task of Exemplar networks, proposed by Dosovitskiy et al. Dosovitskiy et al. (2014), is one of the earliest methods in the self-supervised family. To derive supervision labels, it relies on image augmentation techniques, i.e. transformations. Assuming a training set , and a set of image transformations , a new surrogate class is created by transforming each training sample , where . Therefore, the task becomes a regular classification task with a cross-entropy loss. However, this classification task becomes prohibitively expensive as the dataset size grows larger, as the number of classes grows accordingly. Thus, in our proposed 3D version of Exemplar networks, shown in Fig. 1 (e), we employ a different mechanism that relies on the triplet loss instead Wang and Gupta (2015a). Formally, assuming is a random training sample and is its corresponding embedding vector, is a transformed version of (seen as a positive example) with an embedding , and is a different sample from the dataset (seen as negative) with an embedding . The triplet loss is written as follows:
where is a pairwise distance function, for which we use the distance, following Schroff et al. (2015). is a margin (gap) that is enforced between positive and negative pairs, which we set to . The triplet loss enforces , i.e. the transformed versions of the same sample (positive samples) to come closer to each other in the learned embedding space, and farther away from other (negative) samples. It is noteworthy that we apply the following 3D transformations: random flipping along an arbitrary axis, random translation along an arbitrary axis, random brightness, and random contrast.
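A minimal NumPy sketch of a triplet loss of this form is given below. The Euclidean distance and the default margin of 1.0 are placeholder assumptions for the example, and the function name `triplet_loss` is hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss over embeddings of shape (batch, d).

    The margin default is an illustrative placeholder value.
    """
    # Distance between each sample and its transformed version.
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    # Distance between each sample and a different sample.
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # Penalise triplets where the negative is not farther by the margin.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

The loss is zero for a triplet once the negative embedding is farther from the anchor than the positive embedding by at least the margin, so only violating triplets contribute gradients.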
In this section, we present the evaluation results of our methods. We assess the quality of their learned representations by fine-tuning them on three downstream tasks. In each task, we analyze the obtained gains in data-efficiency, performance, and speed of convergence. In addition, each task aims to demonstrate a certain use-case for our methods. We follow the commonly used evaluation protocols for self-supervised methods in each of these tasks. The chosen tasks are brain tumor segmentation from 3D MRI, pancreas tumor segmentation from 3D CT, and diabetic retinopathy detection from 2D fundus images.
We provide additional architecture and training details in Appendix.
In this task, we evaluate our methods by fine-tuning the learned representations on the Multimodal Brain Tumor Segmentation (BraTS) 2018 Menze et al. (2015); Bakas et al. (2017) benchmark. Before that, we pretrain our models on brain MRI data from the UK Biobank Sudlow et al. (2015) (UKB) corpus, which contains roughly 3D scans. Due to this large number of unlabeled scans, UKB is suitable for unsupervised pretraining. The BraTS dataset contains annotated MRI scans for training and validation cases. We fine-tune on BraTS’ training set, and evaluate on its validation set. Following the official BraTS challenge, we report Dice scores for the Whole Tumor (WT), Tumor Core (TC), and Enhanced Tumor (ET) tasks. The Dice score (F1-Score) is twice the area of overlap between two segmentation masks divided by the total number of pixels in both. In order to assess the quality of the learned representations by our 3D proxy tasks, we compare to the following baselines:
Training from scratch: the first sensible baseline for any self-supervised method, in general, is the same model trained on the downstream task when initialized from random weights. Comparing to this baseline provides insights about the benefits of self-supervised pretraining.
Training on 2D slices: this baseline aims to quantitatively show how our proposal to operate on the 3D context benefits the learned representations, compared to 2D methods.
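For reference, the Dice score reported in this task can be computed for a pair of binary masks as in the sketch below. The function name `dice_score` and the small smoothing constant `eps` are assumptions for the example; the constant only avoids division by zero when both masks are empty.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice score between two binary segmentation masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # Twice the overlap divided by the total number of foreground voxels.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

For multi-class labels such as WT, TC, and ET, the same function would be applied once per class after converting the label map to a binary mask for that class.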
Discussion. We first assess the gains in data-efficiency in this task. To quantify these gains, we measure the segmentation performance at different sample sizes. We randomly select subsets of patients at 10%, 25%, 50%, and 100% of the full dataset size, and we fine-tune our models on these subsets. Here, we compare to the baselines listed above. As shown in Fig. 2, our 3D methods outperform the baseline model trained from scratch by a large margin when using few training samples, and behave similarly as the number of labeled samples increases. The low-data regime case at 5% suggests the potential for generic unsupervised features, and highlights the huge gains in data-efficiency. Also, the proposed 3D versions considerably outperform their 2D counterparts, which are trained on slices extracted from the 3D images. We also measure how our methods affect the final brain tumor segmentation performance, in Table 1. All our methods outperform the baseline trained from scratch as well as their 2D counterparts, confirming the benefits of pretraining with our 3D tasks on downstream performance. We also achieve comparable results to baselines from the BraTS challenge, and we outperform these baselines in some cases, e.g. our 3D-RPL method outperforms all baselines in terms of ET and TC dice scores. Also, our model pretrained with 3D-Exemplar, with fewer downstream training epochs, matches the result of Isensee et al. Isensee et al. (2018) in terms of WT dice score, which is one of the top results on the BraTS 2018 challenge. Our results in this downstream task also demonstrate the generalization ability of our 3D tasks across different domains.
In this downstream task, we evaluate our models on 3D CT scans of Pancreas tumor from the medical decathlon benchmarks Simpson et al. (2019). The Pancreas dataset contains annotated CT scans for cases. Each scan in this dataset contains three different classes: pancreas (class 1), tumor (class 2), and background (class 0). To measure the performance on this benchmark, two dice scores are computed for classes 1 and 2. In this task, we pretrain using our proposed 3D tasks on pancreas scans without their annotation masks. Then, we fine-tune the obtained models on subsets of annotated data to assess the gains in both data-efficiency and performance. Finally, we also compare to the baseline model trained from scratch and to 2D models, similar to the previous downstream task. Fig. 6 demonstrates the gains when fine-tuning our models on 5%, 10%, 50%, and 100% of the full data size. The results obtained by our 3D methods also outperform the baselines in this task by a margin when using only few training samples, e.g. the 5% and 10% cases. Another significant benefit offered by pretraining with our methods is the speed of convergence on downstream tasks. As demonstrated in Fig. 6, when training on the full pancreas dataset, within the first 20 epochs only, our models achieve much higher performance compared to the "from scratch" baseline. We should note that we evaluate this task on a held-out labeled subset of the Pancreas dataset that was used for neither pretraining nor fine-tuning. We provide the full list of experimental results for this task in Appendix.
As part of our work, we also provide implementations for the 2D versions of the developed self-supervised methods. We showcase these implementations on the Diabetic Retinopathy 2019 Kaggle challenge. This dataset contains roughly Fundus 2D images, each of which was rated by a clinician on a severity scale of to . These levels define a classification task. In order to evaluate our tasks on this benchmark, we pretrain all the 2D versions of our methods on all fundus images, with no labels. Then, following the data-efficient evaluation protocol in Hénaff et al. (2019), we fine-tune the obtained models on subsets of labelled data to assess the gains in both data-efficiency, shown in Fig. 6, and speed of convergence in Fig. 6. In this 2D task, we achieve results consistent with the other downstream tasks presented before. We should point out that we evaluate with 5-fold cross validation on this 2D dataset. The metric used in this task, as in the Kaggle challenge, is the Quadratic Weighted Kappa, which measures the agreement between two ratings. Its values vary from random (0) to complete (1) agreement, and if there is less agreement than chance it may become negative.
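The Quadratic Weighted Kappa can be computed from the confusion matrix of two integer ratings, as in the following sketch. The function name and the explicit `n_classes` argument are assumptions for the example; ratings are expected to be integers in the range 0 to `n_classes - 1`.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Agreement between two integer ratings with quadratic penalties."""
    a = np.asarray(rater_a)
    b = np.asarray(rater_b)
    # Observed confusion matrix between the two ratings.
    observed = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Quadratic disagreement weights, zero on the diagonal.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under chance agreement, from the rating histograms.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

The score is undefined when both raters assign a single identical class to every sample, because the expected weighted disagreement is then zero.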
In this work, we asked whether designing 3D self-supervised tasks could benefit the learned representations from unlabeled 3D images, and found that it indeed greatly improves their downstream performance, especially when fine-tuned on only small amounts of labeled 3D data. We demonstrate the obtained gains by our proposed 3D algorithms in data-efficiency, performance, and speed of convergence on three different downstream tasks. Our 3D tasks outperform their 2D counterparts, hence supporting our proposal of utilizing the 3D spatial context in the design of self-supervised tasks, when operating on 3D domains. What is more, our results, particularly in the low-data regime, demonstrate the possibility to reduce the manual annotation effort required in the medical imaging domain, where data and annotation scarcity is an obstacle. Finally, we open source our implementations for all 3D methods (and also their 2D versions), and we publish them to help other researchers apply our methods on other medical imaging tasks. This work is only a first step toward creating a set of methods that facilitate self-supervised learning research for 3D data, e.g. medical scans. We believe there is room for improvement along this line, such as designing new 3D proxy tasks, evaluating different architectural options, and including other data modalities (e.g. text) besides images.
Due to technological advancements in 3D data sensing, and to the growing number of its applications, the attention to machine learning algorithms that perform analysis tasks on such data has grown rapidly in the past few years. As mentioned before, 3D imaging has a multitude of applications Ioannidou et al. (2017), such as in Robotics, in CAD imaging, in Geology, and in Medical Imaging. In this work, we developed multiple 3D Deep Learning algorithms, and evaluated them on multiple 3D medical imaging benchmarks. Our focus on medical imaging is motivated by the pressing demand for automatic (and instant) analysis systems that may aid the medical community.
Medical imaging plays an important role in patient healthcare, as it aids in disease prevention, early detection, diagnosis, and treatment. With the continuous digitization of medical images, the hope that physicians and radiologists will be able to instantly analyze them with Machine Learning algorithms is slowly becoming a reality. Achieving this has become more critical recently, as the number of patients who contracted the novel coronavirus disease, COVID-19, reached a record high. Radiography images provide a rich and quick diagnostic tool, because other types of tests, e.g. RT-PCR which is an RNA/DNA based test, have low sensitivity and may require hours/days of processing Manna et al. (2020). Therefore, as imaging allows such instant insights into human body organs, it receives growing attention from both the machine learning and the medical communities.
Yet efforts to leverage advancements in machine learning, particularly the supervised algorithms, are often hampered by the sheer expense of the expert annotation required Grünberg et al. (2017). Generating expert annotations of patient data at scale is non-trivial, expensive, and time-consuming, especially for 3D medical scans. Even current semi-automatic software tools fail to sufficiently address this challenge. Consequently, it is necessary to rely on annotation-efficient machine learning algorithms, such as self-supervised (unsupervised) approaches for representation learning from unlabelled data. Our work aims to provide the necessary tools for 3D image analysis, in general, and to aid physicians and radiologists in their diagnostic tasks from 3D scans, in particular. And as the main consequence of this work, the developed methods can help reduce the effort and cost of annotation required by these practitioners. In the larger goal of leveraging Machine Learning for good, our work is only a small step toward achieving this goal for patient healthcare.
Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), Munich, Germany.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734.
Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 8545–8552.
Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27 (NIPS).
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269.
FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823.
Colorful image colorization. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 649–666.
Our attached implementations are flexible enough to allow for evaluating several types of network architectures for encoders, decoders, and classifiers. We also provide implementations for multiple losses, augmentation techniques, and evaluation metrics. More information can be found in the README.md file in our attached code-base. Our implementations rely on TensorFlow v2.1 (tensorflow.org) with the Keras API. Below, we provide the training details we used in implementing our 3D self-supervised tasks (and their 2D counterparts), and when fine-tuning them in subsequent downstream tasks.
For all 3D encoders, which are pretrained with our 3D self-supervised tasks and later fine-tuned on 3D segmentation tasks, we use an encoder based on the 3D U-Net Ronneberger et al. (the downward path), which consists of five levels of residual convolutional blocks. The numbers of filters in these blocks are , respectively. The U-Net decoder (the upward path) is added in the downstream fine-tuning stage, and it includes five levels of deconvolutional blocks with skip connections from the U-Net encoder blocks.
For the 2D encoders, we use a standard DenseNet-121 Huang et al. architecture, which is later fine-tuned on 2D classification tasks. When training our 3D self-supervised tasks, we follow Chen et al. in adding nonlinear transformations (a hidden layer with ReLU activation) before the final classification layers. These classification layers are removed when fine-tuning the resulting encoders in downstream tasks.
In all self-supervised and downstream tasks, we use the Adam Kingma and Ba optimizer to train the models. The initial learning rate we use is in 3D self-supervised tasks, in 3D segmentation tasks, in 2D self-supervised tasks, and in 2D classification tasks. When we fine-tune our pretrained encoders in subsequent downstream tasks, we follow a warm-up procedure inspired by Kolesnikov et al.: the encoder weights are kept frozen for a number of initial warm-up epochs while the network decoders or classifiers are trained. We use 5 warm-up epochs in 2D classification tasks, and 25 in 3D segmentation tasks. We evaluated three options: 1) fine-tuning the encoder directly with a randomly initialized decoder, 2) keeping the encoder frozen throughout the training procedure, and 3) the hybrid warm-up approach described above. We followed the third option, as it provided a performance boost over the other two.
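The warm-up procedure can be sketched as a per-epoch freezing rule. This is a minimal illustration rather than the code from our code-base: the helper name `apply_warmup` is hypothetical, epochs are assumed to be counted from zero, and the function works on any objects exposing a boolean `trainable` attribute, as Keras layers do.

```python
def apply_warmup(encoder_layers, epoch, warmup_epochs):
    """Freeze the encoder during warm-up, unfreeze it afterwards.

    encoder_layers: iterable of objects with a `trainable` attribute
                    (for example the layers of a Keras encoder model).
    epoch:          zero-based index of the epoch about to start (assumption).
    warmup_epochs:  e.g. 5 for 2D classification, 25 for 3D segmentation.
    Returns True if the encoder is trainable in this epoch.
    """
    trainable = epoch >= warmup_epochs
    for layer in encoder_layers:
        layer.trainable = trainable
    return trainable
```

With Keras, a change to `trainable` typically only takes effect once the model is compiled again, so a training loop built on this helper would recompile at the epoch where the encoder is unfrozen.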
For all input scans, we perform the following preprocessing steps:
In self-supervised pretraining using 3D scans, we find the boundaries of the brain or the pancreas along each axis, and then we crop the remaining empty parts from the scan. This step reduces the number of empty background voxels, which carry no semantic information and might confuse patch-based self-supervised methods. This step is not performed when fine-tuning on 3D downstream tasks.
Then, we resize each 3D image from BraTS or Pancreas to a unified resolution of , and to the resolution for 2D images from Diabetic Retinopathy.
Then, each image’s intensity values are normalized by scaling them to the range .
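The three preprocessing steps above could look roughly as follows in NumPy. This is a sketch under stated assumptions, not our exact pipeline: the function names are hypothetical, the organ boundaries are approximated by the bounding box of non-background voxels, resizing uses simple nearest-neighbour sampling, and the target range is assumed to be [0, 1].

```python
import numpy as np

def crop_to_foreground(volume, background=0):
    """Crop a volume to the bounding box of its non-background voxels."""
    mask = volume != background
    if not mask.any():
        return volume  # nothing to crop
    coords = np.argwhere(mask)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    return volume[tuple(slice(a, b) for a, b in zip(lo, hi))]

def resize_nearest(volume, target_shape):
    """Resize to target_shape by nearest-neighbour index sampling."""
    idx = [
        np.minimum((np.arange(t) * s / t).astype(int), s - 1)
        for s, t in zip(volume.shape, target_shape)
    ]
    return volume[np.ix_(*idx)]

def min_max_scale(volume, eps=1e-8):
    """Scale intensities to [0, 1] (assumed target range)."""
    v = volume.astype(np.float32)
    return (v - v.min()) / (v.max() - v.min() + eps)

def preprocess(volume, target_shape, crop=True):
    """Crop (pretraining only), resize, then normalize."""
    if crop:
        volume = crop_to_foreground(volume)
    return min_max_scale(resize_nearest(volume, target_shape))
```

Passing `crop=False` mirrors the fine-tuning setting, where the cropping step is skipped.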
In the first downstream task of brain tumor segmentation with 3D multimodal MRI, we pretrain using the UK Biobank Sudlow et al. corpus, as mentioned earlier. Brain scans obtained from UKB contain 2 co-registered MRI modalities (T1 and T2-Flair). This allows us to stack these 2 modalities as channels in each input sample, similar to RGB color channels. This form of early fusion Snoek et al. of MRI modalities is common when they are registered, and is a practical solution for combining all the information that exists in these modalities. However, as mentioned earlier, we use the BraTS Menze et al., Bakas et al. dataset for fine-tuning, and each of its scans consists of 4 different MRI modalities, as opposed to only 2 in UKB, which is used for pretraining. This difference only affects the input layer of the pretrained encoder, but fine-tuning with an incompatible number of input channels would otherwise fail. We resolve this issue by duplicating (copying) the weights of only the pretrained input layer. This minor modification adds only a few parameters to the input layer, but allows us to leverage its weights. The alternative would have been to discard the weights of this input layer, and initialize the remaining model layers from the pretrained model as usual. We believe, however, that our solution takes advantage of any useful information encoded in these weights. This multimodal input problem does not occur in the other downstream tasks, as their inputs include only one modality/channel.
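As an illustration of the weight duplication, the sketch below tiles a pretrained input kernel along its input-channel axis. It assumes the Keras convolution kernel layout `(..., in_channels, out_channels)`; the function name `expand_input_kernel` is hypothetical, and the copied weights are not rescaled, which is a choice of this sketch rather than a documented detail.

```python
import numpy as np

def expand_input_kernel(kernel, new_in_channels):
    """Duplicate a conv kernel along the input-channel axis.

    kernel: array of shape (..., in_channels, out_channels), e.g.
            (3, 3, 3, 2, F) for a 3D input layer pretrained on 2 modalities.
    Returns an array of shape (..., new_in_channels, out_channels) in which
    the pretrained channels are repeated until the new count is reached.
    """
    old_in = kernel.shape[-2]
    repeats = -(-new_in_channels // old_in)  # ceiling division
    tiled = np.concatenate([kernel] * repeats, axis=-2)
    return tiled[..., :new_in_channels, :]
```

For example, a kernel pretrained on the 2 UKB modalities with shape `(3, 3, 3, 2, F)` becomes `(3, 3, 3, 4, F)` for the 4 BraTS modalities, with channels 3 and 4 starting as copies of channels 1 and 2.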
3D-CPC and 3D-Exe: we use latent representation code size of 1024 in these tasks.
3D-Jig and 3D-RPL: We split the input 3D images into patches in this task. We apply a random jitter of 3 pixels per side (axis).
Patch-based tasks (3D-CPC, 3D-RPL, 3D-Jig): each extracted patch is represented using an embedding vector of size 64.
3D-Exe: the value used for the triplet loss is .
3D-Jig: the complexity of the Jigsaw puzzle solving task depends on the number of permutations used in generating the puzzles, i.e. the more permutations used, the harder the task is to solve. We follow the Hamming distance-based algorithm from Noroozi and Favaro in sampling the permutations for this task. However, in our 3D puzzles task, the permutations are more complex, with 27 entries each. The algorithm works as follows: we sample a subset of 1000 permutations, selected based on their Hamming distance, i.e., the number of tile locations that differ between 2 permutations. When generating permutations, we ensure that the average Hamming distance across permutations is kept as high as possible. This results in a set of permutations (classes) that are as far as possible from each other.
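The permutation sampling can be approximated with a greedy procedure, sketched below. Since the 27! permutations cannot be enumerated, this sketch scores a small random candidate pool at each step and keeps the candidate with the largest total Hamming distance to the permutations already chosen. The function names, the `pool_size` parameter, and the pool-based approximation are assumptions of this illustration, not the exact procedure of Noroozi and Favaro or of our code-base.

```python
import math
import random

def hamming(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def sample_permutations(n_tiles=27, n_perms=1000, pool_size=20, seed=0):
    """Greedily pick n_perms distinct permutations that are far apart."""
    if n_perms > math.factorial(n_tiles):
        raise ValueError("not enough distinct permutations")
    rng = random.Random(seed)
    tiles = list(range(n_tiles))
    chosen, seen = [], set()
    # counts[i][v]: number of chosen permutations with tile v at position i
    counts = [[0] * n_tiles for _ in range(n_tiles)]

    def total_distance(p):
        # Sum of Hamming distances from p to all chosen permutations
        return sum(len(chosen) - counts[i][v] for i, v in enumerate(p))

    while len(chosen) < n_perms:
        pool = [tuple(rng.sample(tiles, n_tiles)) for _ in range(pool_size)]
        pool = [p for p in pool if p not in seen]
        if not pool:
            continue
        best = max(pool, key=total_distance)
        chosen.append(best)
        seen.add(best)
        for i, v in enumerate(best):
            counts[i][v] += 1
    return chosen
```

Each returned permutation then serves as one class label for the puzzle-solving classifier; the per-position counts keep the scoring linear in the number of tiles instead of comparing against every chosen permutation.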