Self-supervised Learning from 100 Million Medical Images

by   Florin C. Ghesu, et al.
Siemens Healthineers

Building accurate and robust artificial intelligence systems for medical image assessment requires not only the research and design of advanced deep learning models but also the creation of large and curated sets of annotated training examples. Constructing such datasets, however, is often very costly due to the complex nature of annotation tasks and the high level of expertise required for the interpretation of medical images (e.g., expert radiologists). To counter this limitation, we propose a method for self-supervised learning of rich image features based on contrastive learning and online feature clustering. For this purpose we leverage large training datasets of over 100,000,000 medical images of various modalities, including radiography, computed tomography (CT), magnetic resonance (MR) imaging and ultrasonography. We propose to use these features to guide model training in supervised and hybrid self-supervised/supervised regimes on various downstream tasks. We highlight a number of advantages of this strategy on challenging image assessment problems in radiography, CT and MR: 1) Significant increase in accuracy compared to the state-of-the-art (e.g., AUC boost of 3-7% for detection of abnormalities from chest radiography scans and hemorrhage detection on brain CT); 2) Acceleration of model convergence during training by up to 85% (e.g., for detection of brain metastases in MR scans); 3) Increase in robustness to various image augmentations, such as intensity variations, rotations or scaling, reflective of data variation seen in the field.




1 Introduction

Self-supervised learning has enjoyed much attention in recent years in the vision research community, with methods powered by large amounts of data nearing the accuracy level of state-of-the-art supervised learning strategies on well known benchmarks such as ImageNet [13, 16, 19]. Moreover, they demonstrate that one can use visual representations derived through self-supervised learning to guide regular downstream supervised learning and achieve increased performance (e.g., via transfer learning).

Only a few studies have investigated the impact of self-supervised learning in the medical image analysis domain (e.g., [57, 14, 42]), a field where the development of AI technologies is impacted by the high cost of annotations (often requiring the precision of expert radiologists) and the scarcity of medical imaging data. These solutions are generally limited in their design to architectures for segmentation (i.e., encoder-decoder) and do not support deep architectures often used for classification or detection [30, 51]. In addition, these methods do not exploit truly large datasets and are at best trained with thousands or hundreds of thousands of cases, the same range as many systems trained with supervised learning [25]. In this work we overcome these limitations by proposing a method for self-supervised learning from medical image data which enables the training of classification-optimized architectures. In particular, we make a first step towards truly big-data training and break the barrier of 100,000,000 training images.

The contributions of the paper are the following:

  • We propose a method for self-supervised learning based on contrastive learning [27] and online feature clustering [13]. The method enables hybrid self-supervised/supervised learning from multi-modality data, and is applicable to 2D and 3D image data. As a core part of the system, we propose a new set of image transformation operations optimized for medical image data. Closest to our work is the contribution of Caron et al. [13].

  • We conduct large-scale self-supervised training experiments, including a dataset of over 1,300,000 X-rays and a dataset of over 105,000,000 multi-modality images (including X-ray, CT, MR, US). To the best of our knowledge, this represents the largest machine learning experiment focused on medical image data that has been reported in the literature to date.

  • We perform a rigorous validation of the method on three medical computer aided diagnosis (CAD) problems: 1) Chest radiography abnormality assessment; 2) Brain metastasis detection in MR; and 3) Brain hemorrhage detection in CT data. For this purpose, we use challenging test datasets that are reflective of real clinical practice and with highly curated annotations derived by consensus of multiple expert radiologists. This is an essential step in obtaining an accurate assessment of performance. We intentionally avoid public datasets such as ChestX-ray8 [52] with reported suboptimal image quality and label error rates of 65 - 85% in terms of sensitivity [25].

  • We demonstrate that by using the proposed method one can achieve a considerable performance increase on all the previously enumerated tasks, i.e., significant accuracy increase (average of 6-8% AUC), robustness gain, and acceleration of model training convergence (up to 85%).

The paper is organized as follows: Section 2 provides an overview of related work, with the last subsection focusing on recent developments for self-supervised learning in the medical imaging domain; Section 3 describes the proposed method followed by Section 4 in which we present the experiments on various abnormality detection problems based on different 2D/3D image modalities. Finally, Section 5 concludes the paper with a summary and outlook on future work.

2 Background and Motivation

2.1 Self-Supervised Learning by Contrastive Learning

Proposed as a principled approach for dimensionality reduction [27], contrastive learning based on invariant input transformations has become a key optimization strategy for self-supervised feature learning. Using various transformations of the input data which determine a series of surrogate classes, Dosovitskiy et al. [21] propose a supervised discriminative learning approach as a means to learn robust features from unlabeled data. In contrast, Bojanowski et al. [9] learn a supervised mapping to a set of target deep representations sampled from an uninformative distribution, referred to as noise-as-targets. Using this strategy, they argue that one can avoid learning trivial feature sets or the effects of feature collapse. One limitation of instance learning, discussed also in [21], is the intractable number of classes, which is proportional to the number of instances. Wu et al. [53] address this limitation using a non-parametric approach which constructs a memory bank to store the target instance representations and applies noise-contrastive estimation (NCE) [26] to compare instances. A memory bank is also used by Zhuang et al. [58] for their local aggregation scheme, designed to optimize the instance representation such that similar data samples are clustered, while dissimilar ones become separated in the target manifold. Recently, He et al. [29] proposed to replace the memory bank with a momentum encoder coupled with a queue to generate and store representations for contrastive learning. In contrast, Hjelm et al. [32] propose to use mutual information maximization based on NCE [26] for unsupervised feature learning, applying adversarial learning to constrain the representation according to a given prior. Bachman et al. [6] extend the approach to optimize the mutual information on multiple feature scales based on so-called multiple views, i.e., different augmentations of the input. Tian et al. [50] further extend the method proposed by Hjelm et al. [32] to support more than two views for improved performance. Similar principles are applied by Henaff et al. [31] using contrastive predictive coding to learn deep representations from a spatial decomposition of the input image.

2.2 Self-Supervised Learning by Clustering

Unsupervised representation learning using clustering [8, 12, 58, 13] is a common alternative to instance learning and contrastive learning. Caron et al. [12] propose DeepCluster, an end-to-end unsupervised feature learning approach using the k-means clustering algorithm as optimization criterion. Coupled with the self-supervised learning method presented in [23], the method is further enhanced in [11] to effectively scale to large uncurated datasets. In their approach, Yan et al. [54] also rely on the k-means algorithm, but in a two-stage approach: first, cluster assignments are computed from a pretrained model; these are then used as pseudo-labels in the second stage for feature learning. In contrast, Huang et al. [33] introduce anchor neighborhood discovery, a divide-and-conquer strategy coupled with curriculum learning for effective sample clustering. Using this optimization criterion they demonstrate that one can learn representative deep features in an end-to-end manner. Different from this, Asano et al. [4] propose an effective algorithm for simultaneous feature learning and label inference by maximizing the mutual information between data indices and labels in the form of an optimal transport problem.

2.3 Learning from Pretext Task

Another formulation for self-supervised learning reduces the problem to learning from a supervised signal that is artificially constructed to model a pretext task, e.g., solving a Jigsaw puzzle [43, 36]. Agrawal et al. [1] propose to use egomotion as supervision, demonstrating that the features learned from movement prediction are superior to features learned from traditional image labels. Similarly, Misra et al. [40] learn feature representations by estimating the correct temporal order of frames in video captures. Inspired by early approaches for landmark detection via coordinate regression [56], Doersch et al. [20] propose to use visual context as supervised signal, learning to estimate the relative position of pairs of patches extracted from given unlabeled images. Noroozi et al. [43] propose as pretext task an artificial Jigsaw puzzle of image tiles. They demonstrate that one can train a deep learning model (in the form of a context-free network) to solve the puzzle and thereby learn rich semantic features. An alternative strategy is feature learning by inpainting using context encoders trained with an adversarial optimization criterion [44]. Finally, Larsson et al. [39] use colorization as pretext task, learning to estimate a per-pixel color histogram.

2.4 Self-Supervised Learning in the Medical Domain

Similar principles for self-supervised feature learning are applied in medical image analysis to improve the accuracy and robustness of downstream tasks, e.g., abnormality classification or anatomy segmentation [57, 41]. For instance, Chen et al. [15] propose a commonly known restoration strategy for feature learning from images with artificially swapped local patches. In contrast, Zhou et al. [57] apply various image manipulation steps (nonlinear intensity transformation, local pixel shuffling or in/out-painting) and train an encoder-decoder architecture to reconstruct the original image information thereby learning rich semantic features in an unsupervised way. With focus on volumetric anatomy segmentation, Chaitanya et al. [14] introduce a framework for self-supervised learning based on a hybrid contrastive loss, that learns both global and local image representations. For the same application, Nguyen et al. [42] propose to use spatial awareness as signal for self-supervised learning – learning to predict the displacement of different image slices after random swaps of image patches between slices. Finally, Azizi et al. [5] rely on the contrastive learning based method proposed in [16] to pretrain features and improve the accuracy of various downstream classification tasks from radiography or dermatology images.

3 Proposed Method

We assume that a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ of $N$ signal samples is given, e.g., 2D or 3D images, or images accompanied by non-imaging information such as text, audio, etc. A subset of $\mathcal{D}$ consists of samples $x$ which are paired with labels $y$, such as binary image labels, masks, etc. While the extension to support non-imaging information as input can be realized, e.g., by using robust feature fusion [37], we focus here on learning only from the image signal. We propose to use this dataset for hybrid self-supervised / supervised model pretraining, to learn rich, representative features that can be transferred to downstream use-cases, i.e., used as initialization in a supervised training routine. Figure 1 provides an overview.

Figure 1: Schematic overview of the training methodology. A given input image $x$ is randomly transformed based on the set of augmentation operations $\mathcal{T}$ to $x_1$ and $x_2$. These are processed by the learning model $f_\theta$ to the features $z_1$ and $z_2$. In turn, these are mapped to their cluster assignments $q_1$ and $q_2$ over the set of so-called prototypes $\{c_1, \dots, c_K\}$ and used for optimization in a swapped prediction setting. Existing labels can be leveraged during training.

3.1 Online Clustering - Swapped Prediction Optimization

Following [13], we use an online clustering strategy coupled with principles of contrastive learning to learn image features in a self-supervised way. Given a family of image augmentation operations $\mathcal{T}$ (described later in Section 3.3), the goal is to estimate the visual features parametrized by $\theta$ in the model / projector $f_\theta$ (as shown in Figure 1) via assignment to cluster codes. In particular, this assignment is optimized to be invariant to various hierarchies of augmentation operations sampled from $\mathcal{T}$ and applied to any given image $x$. The workflow is as follows:

  1. Based on an arbitrary image $x \in \mathcal{D}$, two transformed images $x_1, x_2$ are computed using random augmentation operations sampled from $\mathcal{T}$ and applied hierarchically;

  2. The nonlinear model / projector $f_\theta$ (i.e., the model that we attempt to pretrain) is applied on $x_1, x_2$ to compute features $z_1, z_2$, which in turn are assigned to cluster codes $q_1, q_2$ from a set of $K$ prototypes $\{c_1, \dots, c_K\}$ ($K$ is a parameter set by the user);

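Given the pair of features $(z_1, z_2)$ and code pair $(q_1, q_2)$, the self-supervised optimization criterion is formulated using a swapped prediction strategy based on the cross-entropy loss function:

$$\mathcal{L}_{ss}(z_1, z_2) = \ell(z_1, q_2) + \ell(z_2, q_1), \qquad \ell(z_1, q_2) = -\sum_{k=1}^{K} q_2^{(k)} \log \frac{\exp(z_1^\top c_k / \tau)}{\sum_{k'=1}^{K} \exp(z_1^\top c_{k'} / \tau)} \tag{1}$$

where $\tau$ is a temperature parameter, and $ss$ refers to self-supervised loss. Without loss of generality, the cross-entropy term is written for $\ell(z_1, q_2)$ (similar derivation also for $\ell(z_2, q_1)$) and all prototype vectors $c_1, \dots, c_K$ are trainable. Following the notation proposed in [13], let $C$ denote a matrix with column vectors defined by the prototypes. The optimization described in Equation 1 is performed using stochastic batch-wise sampling of cases $\mathcal{B} \subset \mathcal{D}$ from the training set $\mathcal{D}$:

$$\min_{\theta, C} \; \sum_{x \in \mathcal{B}} \mathcal{L}_{ss}(z_1, z_2) \tag{2}$$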
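The swapped prediction criterion described above can be illustrated with a minimal sketch in plain Python. The symbol names (`z1`, `z2` for the features of the two views, `q1`, `q2` for their codes, `prototypes` for the prototype vectors, `temperature` for the temperature parameter) follow the SwAV formulation [13] and are assumptions made for illustration, not the authors' implementation; real training would operate on batched tensors.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(z, q, prototypes, temperature=0.1):
    """Cross-entropy between a code q and the softmax of z scored against all prototypes."""
    p = softmax([dot(z, c) for c in prototypes], temperature)
    # Terms with zero code weight contribute nothing and are skipped.
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0.0)

def swapped_prediction_loss(z1, z2, q1, q2, prototypes, temperature=0.1):
    """Predict the code of one view from the features of the other view, and vice versa."""
    return (cross_entropy(z1, q2, prototypes, temperature)
            + cross_entropy(z2, q1, prototypes, temperature))
```

The loss is small when both views score highest against the prototype that the other view's code selects, and large when the two views disagree.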


In the following sections we describe in detail the online clustering algorithm, based on two different scenarios: 1) $\mathcal{D}$ consists only of images of one modality (e.g., radiography); and 2) the dataset contains images of multiple modalities (e.g., radiography, ultrasonography, computed tomography, magnetic resonance imaging, etc.).

3.1.1 Single-Modality Clustering

Assume a batch-size of $B$ samples is used for training. In the following, we focus on one branch of the processing steps depicted in Figure 1; all projectors are shared on the other branch (without loss of generality, let that be the branch computing $z_1$). The set of computed output features is captured by matrix $Z \in \mathbb{R}^{d \times B}$, where $d$ denotes the size of any given feature (column vector). The prototypes are captured by matrix $C \in \mathbb{R}^{d \times K}$, where $K$ denotes the number of prototypes. Finally, the codes that enable the mapping of projected features to prototypes are captured by matrix $Q$ of size $K \times B$. In order to prevent a trivial solution that would map all images in one batch to the same code, an equipartition constraint is enforced based on the entropy measure:

$$\max_{Q \in \mathcal{Q}} \; \mathrm{Tr}(Q^\top C^\top Z) + \varepsilon H(Q) \tag{3}$$

where $H$ denotes the entropy and $\varepsilon$ the regularization weight [13]. Inspired by the work of Asano et al. [4], we follow [13] in constraining the solution space $\mathcal{Q}$ to ensure that each prototype is selected at least $B/K$ times in one batch. In addition, empirical evidence indicates that using continuous codes is more effective in the online training setting, compared to discretizing the solution. Following the derivation of [13] and optimal transport theory, the solution to Equation 3 can be determined as a normalized exponential matrix using the Sinkhorn-Knopp algorithm [17].
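As an illustration of how continuous codes can be obtained with the Sinkhorn-Knopp algorithm, the following sketch alternates column and row normalizations of the exponentiated score matrix. It is a plain-Python approximation of the procedure described in [13], not the authors' code; the function name `sinkhorn_codes` and the layout `scores[b][k]` (similarity of sample `b` to prototype `k`) are assumptions. The defaults mirror the values of epsilon and the number of Sinkhorn iterations listed in Table 1.

```python
import math

def sinkhorn_codes(scores, epsilon=0.03, n_iters=3):
    """Return soft codes (one row per sample) that spread samples across prototypes.

    scores[b][k] is the similarity between sample b and prototype k.
    Assumes every prototype keeps non-zero mass after exponentiation.
    """
    B, K = len(scores), len(scores[0])
    # Shifting by the global maximum avoids overflow; the shift cancels in the normalizations.
    m = max(max(row) for row in scores)
    Q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every prototype receives a total mass of 1/K.
        col_sums = [sum(Q[b][k] for b in range(B)) for k in range(K)]
        Q = [[Q[b][k] / (col_sums[k] * K) for k in range(K)] for b in range(B)]
        # Row step: every sample receives a total mass of 1/B.
        row_sums = [sum(row) for row in Q]
        Q = [[v / (row_sums[b] * B) for v in Q[b]] for b in range(B)]
    # Rescale so that the code of each sample sums to one.
    return [[v * B for v in row] for row in Q]
```

For example, with two samples that both score highest on the first prototype (`[[1.0, 0.9], [1.0, 0.8]]`), a plain arg-max would assign both to that prototype, whereas the balanced codes move the first sample towards the second prototype.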

3.1.2 Multi-Modality Clustering

Training with images from multiple modalities is more challenging. While in theory it may enable the pretraining of robust, modality-invariant rich features that would generalize to a variety of downstream tasks, in practice simply mixing images of multiple modalities in one single batch impacts the training stability. We hypothesize that a modality-specific clustering can alleviate this issue. Let $\mathcal{D}$ contain images of different modalities. In this case we propose to partition $\mathcal{D} = \bigcup_{m=1}^{M} \mathcal{D}_m$, where $\mathcal{D}_m$ denotes the set of all images of the modality indexed by $m$ (e.g., radiography); and $M$ denotes the number of different modalities captured in $\mathcal{D}$. Without loss of generality, let us assume the batch-size $B$ is a multiple of the number of modalities $M$. We propose to partition each batch of $B$ samples into $M$ subsets of equal size, each subset containing only images of one modality sampled randomly from the corresponding $\mathcal{D}_m$. In this case, Equation 3 can be adapted to:

$$\max_{Q_m \in \mathcal{Q}_m} \; \mathrm{Tr}(Q_m^\top C^\top Z_m) + \varepsilon H(Q_m), \qquad 1 \le m \le M \tag{4}$$

with $Q_m$ conditioned on modality $m$; and $Z_m$ denoting the aggregate of vectors associated with modality $m$ (the remaining variables follow the same definition as in Equation 3). The same logic can be applied to adapt Equation 1.
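The batch partitioning described above can be sketched as follows. This is a hypothetical helper (`sample_modality_balanced_batch` and the dictionary layout are assumptions for illustration): it draws the same number of samples from each modality, so that the cluster codes can afterwards be computed separately on each single-modality subset.

```python
import random

def sample_modality_balanced_batch(datasets_by_modality, batch_size, rng=None):
    """Draw a batch split into equally sized single-modality subsets.

    datasets_by_modality maps a modality name to a list of samples.
    Returns a dict mapping each modality to its subset of the batch.
    """
    rng = rng or random.Random()
    modalities = sorted(datasets_by_modality)
    n_modalities = len(modalities)
    if batch_size % n_modalities != 0:
        raise ValueError("batch_size must be a multiple of the number of modalities")
    per_modality = batch_size // n_modalities
    # Sampling without replacement inside each modality-specific set.
    return {m: rng.sample(datasets_by_modality[m], per_modality) for m in modalities}
```

A usage example: with four modalities and a batch size of 32, each subset holds 8 samples, and the clustering step is run once per subset rather than once on the mixed batch.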

3.2 Hybrid Self-Supervised – Supervised Learning

As defined for dataset $\mathcal{D}$, a subset of cases is associated with labels, e.g., provided by human (expert) annotators, extracted via natural language processing, or other automatic means from the image, associated clinical reports, or other corresponding non-imaging data. Recall, for any arbitrary training sample $x$, we denote the corresponding label as $y$. We propose to dynamically learn from these labels in a joint self-supervised / supervised strategy. Let $g_\phi$ be a deep neural network projector parametrized by $\phi$, mapping for any such sample from features of model $f_\theta$ (output features and/or intermediate layer features) to an output $\hat{y}$:

$$\hat{y} = g_\phi(f_\theta(x)) \tag{5}$$

In this case, the goal is similar to any supervised learning problem, i.e., minimize the distance of $\hat{y}$ to $y$ according to a loss function $\mathcal{L}_{sup}$ (sup: supervised):

$$\min_{\theta, \phi} \; \mathcal{L}_{sup}(\hat{y}, y) \tag{6}$$

We combine Equations 2 and 6 to a single global optimization criterion, re-balancing the contribution of each using factors $\alpha, \beta$:

$$\min_{\theta, \phi, C} \; \alpha \, \mathcal{L}_{ss} + \beta \, \mathcal{L}_{sup} \tag{7}$$
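A minimal sketch of such a combined criterion is given below. The weighting names `alpha` and `beta`, the averaging over the batch, and the use of `None` to mark unlabeled samples are assumptions made for illustration; the text only states that the two loss terms are re-balanced with factors.

```python
def hybrid_loss(ss_losses, sup_losses, alpha=1.0, beta=1.0):
    """Weighted sum of a self-supervised and a supervised loss term.

    ss_losses:  per-sample self-supervised losses (available for every sample).
    sup_losses: per-sample supervised losses, or None where no label exists.
    """
    ss_term = sum(ss_losses) / len(ss_losses)
    labeled = [loss for loss in sup_losses if loss is not None]
    # The supervised term only averages over the labeled subset of the batch.
    sup_term = sum(labeled) / len(labeled) if labeled else 0.0
    return alpha * ss_term + beta * sup_term
```

With this formulation a batch without any labeled sample falls back to purely self-supervised training, which matches the idea of leveraging labels only where they exist.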


3.3 Augmentation Strategies

In order to determine the input to the model during training, random augmentation operations are applied from a set of augmentation operations $\mathcal{T}$. These operations are listed below.

Image rescaling is applied to a fraction of the size of the original image $x$ as $r(x, s)$, with $r$ denoting the rescaling function and the fraction $s$ sampled uniformly at random from a predefined interval. The rescaling factor is sampled from a large range of possible values to encourage the learning of robust, scale-invariant features.

Energy-based augmentation is performed based on the image normalization algorithm proposed by Philipsen et al. [45]. Following their methodology, an image $x$ is decomposed into $n$ energy bands $x_i$ using Gaussian filtering. For each band $i$, the energy value $e_i$ is computed as the brightness dispersion in a predefined image region $\Omega$ (in our case the entire image). Following [45], the normalized image is computed as:

$$\tilde{x} = \sum_{i=1}^{n} \frac{e_i^{ref}}{e_i} \, x_i \tag{8}$$

where $e_i^{ref}$ denotes the reference energy value on band $i$, derived from a set of pre-selected reference images. We propose to augment the image using a variable reference energy for any given band around the mean value. Concretely, on each band $i$ we propose to model the distribution of the reference values using a Gaussian distribution and sample the value of the reference energy from a range around the mean energy value.

Intensity rescaling is applied in two different ways: 1) nonlinear rescaling using a gamma transform with the exponent $\gamma$ sampled uniformly at random from a predefined interval; and 2) linear rescaling of the intensity as $a \cdot x + b$, with $a$ a random uniform sample and $b$ restricted to a fraction of the intensity range of $x$.

Cropping from random image locations (sampled uniformly at random) is the final augmentation applied.
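The intensity and cropping operations listed above can be sketched in plain Python as follows. Images are represented as nested lists with intensities normalized to [0, 1]. All sampling ranges (`gamma_range`, `scale_range`, `max_shift`) are hypothetical placeholders, since the exact intervals are not given here, and the clipping to [0, 1] is likewise an assumption. Rescaling and energy-based augmentation are omitted for brevity.

```python
import random

def gamma_transform(image, gamma):
    """Nonlinear intensity rescaling; assumes intensities in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]

def linear_rescale(image, scale, shift):
    """Linear intensity rescaling scale * x + shift, clipped back to [0, 1]."""
    return [[min(1.0, max(0.0, scale * p + shift)) for p in row] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Crop a window whose top-left corner is sampled uniformly at random."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, crop_h, crop_w, rng=None,
            gamma_range=(0.5, 2.0), scale_range=(0.8, 1.2), max_shift=0.1):
    """Apply gamma transform, linear rescaling and a final random crop."""
    rng = rng or random.Random()
    out = gamma_transform(image, rng.uniform(*gamma_range))
    out = linear_rescale(out, rng.uniform(*scale_range),
                         rng.uniform(-max_shift, max_shift))
    # Cropping is applied last, as stated in the text.
    return random_crop(out, crop_h, crop_w, rng)
```

Calling `augment` twice on the same image with independent random states yields the two views that enter the swapped prediction setting.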

4 Experiments and Results

4.1 Datasets for Self-Supervised Training

We constructed several datasets for self-supervised training (based on 2D and 3D image modalities). The weights of models trained on these datasets using self-supervision were then transferred to initialize models that were optimized using supervision on downstream tasks. The datasets are the following:

  • 2D X-ray dataset containing 1,297,699 X-ray images capturing various anatomies, including chest, spine, back, arm, leg, and more. The data is acquired from both public [35, 24, 47, 18, 10, 3, 34, 28, 46, 52] and internal sources.

  • 2D mixed modality dataset containing 105,006,320 images/slices of various anatomies (head, abdomen, chest, legs, etc.) and from various imaging modalities, including the X-ray dataset described in the previous item. Except for the public data contained in that X-ray dataset, this dataset contains only internal data. The proportions per modality are: 72% computed tomography (CT) slices, 25% magnetic resonance imaging (MRI) slices, around 1% X-ray and the rest ultrasonography (US) images.

  • 3D computed tomography (CT) dataset containing 24,440 3D CT volumes coming from 1,345,040 DICOM slices of non-contrast CT head scans acquired from internal sources.

4.2 Training Hyper-Parameters, Infrastructure and Scaling

Different architectures have been investigated as part of our experiments. In the 2D context, these are all variants of residual networks [30] (including ResNet-152 and ResNet-50 with several variants denoted as ResNet-50w2/w4 as described in [13]). Training hyper-parameters are defined in Table 1. Several parameters are inherited from SwAV [13]; for model details we refer the reader to that reference. The training infrastructure consists of 4 nodes (each with 8 Volta GPUs with 16 GB GPU memory, 80 cores and 512 GB main memory). All nodes are connected via InfiniBand. The system uses the Quobyte file system for parallel and distributed IO operations, and PyTorch distributed functionality is applied to scale the training to multiple nodes.

| Parameter | 2D-Experiments | 3D-Experiments |
|---|---|---|
| Number of crops | [2,6] | [1,4] |
| Size of crops | [224,92] | [(224,192),(112,96)] |
| Image size range | [600,1200] | [(112,96),(250,200)] |
| Cropped image border | [0.1,0.1] | [0.1,0.1] |
| Assigned crops | [0,1] | [0,1] |
| Temperature | 0.1 | 0.1 |
| Epsilon | 0.03 | 0.03 |
| # Sinkhorn iterations | 3 | 3 |
| Feature dimension | 128 | 128 |
| # Prototypes | 3000 | 1500 |
| Queue length | 3840 | 1920 |
| Epoch queue starts | 10 | 3 |
| Epochs | 100 | 100 |
| Batch size | 32 | 4 |
| Start Learning Rate | 0.6 | 0.6 |
| Final Learning Rate | 0.0006 | 0.0006 |
| Freeze prototypes | 10000 | 4000 |
| Weight Decay | 1e-06 | 1e-06 |

Table 1: Hyper-parameters for self-supervised training in 2D/3D.

4.3 Abnormality Detection in Chest Radiography

We focused on the detection of lung lesions (i.e., nodules/masses) and pneumothorax from frontal chest radiographs. Both are critical findings: lesions have potential long-term relevance (e.g., pulmonary malignancy, cancer), while pneumothorax as an acute finding often immediately endangers the life of the patient. As proposed in our previous work [48, 49], we approach the problem as a multi-class detection problem using bounding boxes to isolate the abnormalities. The model architecture is fully-convolutional and multi-scale, and is inspired by [50]. The entire content of the image is processed in one single forward pass and labeled bounding boxes (with an associated probability) around relevant abnormalities are predicted (see Figure 2). A total of 11,730 chest radiographs were used for training. The data was acquired from internal data sources. Various abnormalities (including lesion and pneumothorax) were annotated by expert radiologists on each image using bounding boxes (see Figures 3 and 4). Further details related to the training data and training routine can be found in our previous work [48, 49].

Figure 2: Architecture used for classification and detection of lesion and pneumothorax in chest radiographs. The backbone (ResNet-50w2) is pretrained using self-supervision.

For testing the pneumothorax detection feature we use a test set of 321 chest radiographs, all acquired at bedside in anterior-posterior view, the majority capturing severely ill patients covered by tubes and/or wires, with 34 cases acquired in the ICU. The ground-truth is determined by a consensus of 3 expert radiologists. Each image is first read independently by each reader, followed by a joint discussion to determine the final annotation. As pneumothoraces can be very small (down to a few millimeters in sectional width) or obscured by other structures, e.g., the ribs, they can easily be overlooked. By using three readers, we intend to minimize that risk. Of the 321 radiographs, 125 are identified as pneumothorax positive using 148 bounding boxes to isolate the abnormal anatomy. Figure 3 shows an example.

Figure 3: Example case from the pneumothorax test set. Left: Chest radiograph with right upper zone showing a lucent area towards the apex with vaguely defined pleural line suggestive of pneumothorax. An ICD tube along with numerous lines makes the distinction of the pneumothorax difficult; Middle: Image sub-region capturing the pneumothorax; Right: Curve highlighting the dehiscent visceral pleura separated from the thoracic wall, indicating the pneumothorax.

For testing the lesion detection feature we use a test set of 288 radiographs from the LIDC dataset [3]. Each radiograph is paired with a CT scan acquired in close time-proximity. The information from this additional modality is used to improve the ground-truth quality. As not all lesions captured in CT are also visible in chest radiography [7], we propose two protocols to derive two versions of the ground-truth:

  1. Synchronous reading of chest radiograph and corresponding CT by an expert radiologist, marking on the radiograph only lesions that are visible. We denote this version of the ground-truth as LIDC-synch.

  2. A staged approach is applied. In the first stage, 3 expert radiologists read independently all chest radiographs and mark lesions (without looking at CT). Subsequently all marks are aggregated by an additional radiologist, removing any duplicate marks of the same lesion. These candidate marks are then assessed in a second stage using CT as point of reference - for each mark on the radiograph, if the CT displays a lesion at that location that mark is positive; if not, the mark is removed. We denote this version of the ground-truth as LIDC-staged.

Figure 4: Example cases from lesion test set. Left: Chest radiographs with lesions highlighted using a red bounding box; Middle: Image sub-regions with arrows indicating the lesions. Right: Axial slices of the corresponding CT scan with arrows indicating the same lesions. The first row shows an image of a patient who has sustained a right thoracotomy. In the resection area there is a very subtle lesion that was marked positive only in LIDC-synch. In the generation of LIDC-staged it was missed by all three readers when reading the radiograph to generate candidates for CT-confirmation, and as such the case is marked as negative in LIDC-staged. In contrast, the lesion captured in the second row is much larger and easier to see. It is marked as positive in both LIDC-synch and LIDC-staged.

With 146 (out of 288) positive radiographs and 187 lesions marked, LIDC-synch is a significantly more sensitive ground-truth than LIDC-staged (111 positive radiographs with 133 lesions marked). However, LIDC-staged has the benefit of removing any bias related to the assessment of visibility on radiographs (a step performed in the derivation of LIDC-synch). Example images and annotations are shown in Figure 4.

ROC-AUC is used to assess the classification performance, while the accuracy of the bounding box detection is assessed using fROC. In particular, we report the sensitivity at instance level (captured on the y-axis in fROC) averaged over a fixed set of average numbers of false positives per image (captured on the x-axis in fROC). We compare our solution with 4 alternative approaches:

  1. Supervised learning [30] on the ImageNet dataset;

  2. Self-supervised learning [13] on the ImageNet dataset (denoted as SwAV);

  3. SimCLR method for self-supervised learning [16] on the internal dataset;

  4. Using no pretraining, relying on a random initialization of the network weights (in our experience, often used for medical image analysis applications).
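The fROC-based detection metric mentioned above (instance-level sensitivity averaged over several false-positive rates per image) could be computed along the following lines. This is a sketch under stated assumptions: detections are given as `(score, lesion_id)` pairs with `lesion_id` set to `None` for false positives, scores are distinct, and the operating points passed in `fp_rates` are chosen by the caller, since the exact values used in the evaluation are not listed here.

```python
def froc_average_sensitivity(detections, n_lesions, n_images, fp_rates):
    """Average instance-level sensitivity at the given false positives per image.

    detections: list of (score, lesion_id); lesion_id is None for a false positive.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    # Each curve point is (false positives per image, sensitivity) at one threshold.
    curve = [(0.0, 0.0)]
    hit_lesions, n_false_positives = set(), 0
    for _score, lesion_id in ordered:
        if lesion_id is None:
            n_false_positives += 1
        else:
            hit_lesions.add(lesion_id)  # repeated hits on one lesion count once
        curve.append((n_false_positives / n_images, len(hit_lesions) / n_lesions))
    # For each operating point, take the best sensitivity not exceeding the FP budget.
    sensitivities = [max(s for fp, s in curve if fp <= rate) for rate in fp_rates]
    return sum(sensitivities) / len(sensitivities)
```

For instance, with two images, two lesions and the ranked detections hit, miss, hit, miss, the sensitivity is 0.5 at 0.25 false positives per image and 1.0 at 0.5, giving an average of 0.75.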

To simplify the interpretation of the results, we often focus the comparison of our method on method 1) proposed by He et al. [30], the previously best strategy for this task. Table 2 gives an overview of the performance of all reference methods.

AUC Performance (LIDC-staged)

| Method | 100% | 50% | 25% | 10% |
|---|---|---|---|---|
| No pretraining | 0.77 | 0.73 | 0.65 | 0.53 |
| SimCLR [16] | 0.90 | 0.88 | 0.82 | 0.79 |
| SwAV [13] | 0.90 | 0.89 | 0.85 | 0.80 |
| Supervised NI [30] | 0.91 | 0.89 | 0.82 | 0.80 |
| Ours | 0.94 | 0.91 | 0.85 | 0.85 |

Notes: Average of 5 models; selected with highest AUC on validation set. Over 5 training rounds, standard deviation of AUC: 0.003 for each, indicating the high stability of the performance measurement.

Table 2: AUC performance for lesion detection on LIDC-staged when using 100%, 50%, 25% or 10% of the training data. Selection of the subsets is done randomly.

Figure 5: Performance evolution of lesion and pneumothorax detection in chest radiographs during training (left: lesion on LIDC-staged, middle: lesion on LIDC-synch and right: pneumothorax). Our solution (initializing the model with self-supervised pretrained weights on the X-ray dataset) significantly outperforms, both in terms of AUC and average instance detection sensitivity, the previously best pretraining strategy, i.e., supervised pretraining on ImageNet [30]. The difference is significant, ranging between 3 - 5%. The difference is much larger when compared to using no pretraining, ranging between 20 - 25% on the lesion test data and almost 30% on the pneumothorax test data.

Figure 5 shows the performance evolution for lesion and pneumothorax detection on the test and validation sets during training. We highlight the AUC and average instance level fROC sensitivity as a function of the epochs. A significant increase in average performance is achieved for the test set.

Figure 6 shows the acceleration in convergence speed when using our method compared to conventional supervised pretraining or no pretraining. The speedup is at least 50% (also on both lesion test sets), both when analyzing the evolution of performance on test data and on validation data. Finally, Figure 7 highlights the increased robustness of self-supervised pretrained models after downstream fine-tuning with respect to certain image variations which are typically observed in practice (e.g., image rotations/scaling or intensity variations).

Figure 6: Visualization of the speed-up in model training for pneumothorax detection. We denote the convergence point as the earliest epoch during training where 98% of the final best performance is achieved (denoted as HP-thresh, i.e., high performance threshold). Using our method, one can achieve convergence 72% faster than using no pretraining. The transparent area along each curve denotes the standard deviation.
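The convergence point defined in the Figure 6 caption can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper. The function names are hypothetical, epochs are counted from 1, and the relative speed-up is assumed to be the fraction of epochs saved with respect to the baseline.

```python
def convergence_epoch(scores, fraction=0.98):
    """Return the earliest epoch (1-based) whose score reaches
    `fraction` of the best score observed during training."""
    if not scores:
        raise ValueError("scores must be non-empty")
    threshold = fraction * max(scores)  # the HP-thresh level
    for epoch, score in enumerate(scores, start=1):
        if score >= threshold:
            return epoch
    # Only reachable for negative scores, which do not occur for AUC.
    raise ValueError("no epoch reaches the threshold")


def relative_speedup(epoch_method, epoch_baseline):
    """Fraction of epochs saved relative to the baseline.
    Assumption: this is how a 'percent faster' figure is obtained."""
    if epoch_baseline <= 0:
        raise ValueError("epoch_baseline must be positive")
    return 1.0 - epoch_method / epoch_baseline


# Hypothetical validation AUC curves, for illustration only.
pretrained = [0.80, 0.90, 0.93, 0.94, 0.94]
scratch = [0.50, 0.60, 0.70, 0.78, 0.84, 0.88, 0.90, 0.91, 0.92, 0.92]

e_pre = convergence_epoch(pretrained)
e_scr = convergence_epoch(scratch)
print(e_pre, e_scr, relative_speedup(e_pre, e_scr))
```

Note that each curve is thresholded against its own best score, so the convergence point measures how quickly a model approaches its own plateau rather than a shared target value.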
Figure 7: Using self-supervised pretraining leads to an increase in robustness. The box plot shows the relative deviation in probability for lesion (at case level) when applying various augmentations, such as gamma transform and/or random image rotation/scaling. The distribution is shown for a random set of 50 radiographs with 50 random transformations per image (i.e., 2500 data points). The reference method refers to [30].

4.4 Brain Metastases Detection in MRI

Automated detection and segmentation of brain metastases in 3D MRI scans could support therapy workflows. In this study, we conducted another experiment on slice-wise brain metastasis detection on contrast-enhanced magnetization-prepared rapid acquisition with gradient echo (MPRAGE) scans, which can be used for treatment selection, planning and longitudinal monitoring by guiding radiosurgery protocols and other treatment decisions. However, this task remains challenging due, in part, to the scarcity of training data containing metastatic tissue in an MRI volume, which makes it difficult to learn clinically meaningful image features from scratch. A reliable pretrained model may mitigate this limited data problem. Thus we focused on analyzing the impact of self-supervised pretraining on classifying 2D slices with metastases in MPRAGE volumes. We utilized a 2.5D encoder-decoder network to first obtain a segmentation mask showing potential areas of suspected metastases [22]. The output segmentation mask was subsequently used as an attention channel along with the 5 input slices to train a 2.5D classification network to perform slice-wise classification. The architecture of the classification network followed the concept of ResNet-50w2. The six-channel input was compressed to three channels by a convolution. Our dataset included a training set (341 cases), a validation set (36 cases) and a test set (43 cases). The details about the data and preprocessing can be found in [55].
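The six-to-three channel compression described above can be illustrated with a small pure-Python sketch. It assumes a 1x1 convolution (the kernel size is not stated in the text) and uses fixed, hand-picked weights, whereas in a real network the weights would be learned. The names `stack_inputs` and `conv1x1` are hypothetical.

```python
def stack_inputs(slices, attention_mask):
    """Stack 5 neighbouring slices and the segmentation mask
    into a 6-channel input (each channel is a 2D list)."""
    if len(slices) != 5:
        raise ValueError("expected exactly 5 input slices")
    return list(slices) + [attention_mask]


def conv1x1(channels, weights, bias):
    """Apply a 1x1 convolution: every output pixel is a weighted
    sum of the input channels at the same pixel, plus a bias."""
    height, width = len(channels[0]), len(channels[0][0])
    output = []
    for weight_row, b in zip(weights, bias):
        plane = [
            [b + sum(w * ch[y][x] for w, ch in zip(weight_row, channels))
             for x in range(width)]
            for y in range(height)
        ]
        output.append(plane)
    return output


# Toy 2x2 example: slice k is filled with the value k + 1, the mask with 1.
slices = [[[float(k + 1)] * 2 for _ in range(2)] for k in range(5)]
mask = [[1.0, 1.0], [1.0, 1.0]]
six_channels = stack_inputs(slices, mask)

# Hypothetical weights (3 outputs x 6 inputs); learned in practice.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
three_channels = conv1x1(six_channels, weights, bias=[0.0, 0.0, 0.0])
print(len(three_channels), three_channels[1][0][0])
```

Reducing the input to three channels keeps the first layer compatible with backbones that expect three-channel images, which is presumably why the compression step is used before the ResNet-style classifier.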

The metastatic slice detection performance was evaluated by ROC-AUC and mean average precision. Figure 8 shows the training evolution using detection AUC measured on the validation dataset for each epoch. Training the model from scratch reached an AUC of 90% after 300 epochs. In contrast, training the model initialized with the pretrained ResNet-50 achieved an AUC of 92-93% within 10 epochs. The pretrained model also consistently outperformed the model without pretraining by 2-3% AUC. The testing AUC with our pretraining method was 93.2%, which was higher than both the model without pretraining and the model pretrained with SwAV [13] by 2%, as shown in Table 3. Mean average precision with our pretraining method was 85.6%, which was higher than the model without pretraining by 5.6% and than the model pretrained with SwAV [13] by 1.2%. Both pretrained models produced an accuracy of 94.3%, which is higher than the model without pretraining by 3.3%. Figure 9 shows examples of metastatic slice detection in a patient with brain metastasis.

Figure 8: Validation AUC evolution over training epochs for brain metastatic slice detection. Using our self-supervised pretraining leads to an increase in training convergence rate and validation AUC.
Figure 9: Example post-contrast MPRAGE slices from a patient with brain metastases. The images show metastatic slices and their detection scores by the model pretrained with our self-supervised learning. Image courtesy: University of Michigan.
Method AUC mAP Accuracy
No pretraining 0.913 0.799 0.910
SwAV [13] 0.913 0.844 0.943
Ours 0.932 0.856 0.943
Table 3: Performance comparison for slice-wise brain metastasis detection on 8145 testing MPRAGE slice images (1336 metastatic images). AUC: area under the ROC-curve, mAP: mean average precision. Accuracy was computed at an operating point of 0.6.
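As a reminder of how two of the metrics in Table 3 are defined, the following sketch computes ROC-AUC through its rank interpretation and accuracy at a fixed operating point. It is an illustration under stated assumptions, not the evaluation code behind the table: the function names are hypothetical, the data are made up, and treating a score equal to the threshold as positive is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive sample
    scores higher than a random negative one (ties count as 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy_at(labels, scores, threshold=0.6):
    """Accuracy after thresholding scores at the operating point.
    Assumption: score >= threshold is classified as positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)


# Made-up slice labels (1 = metastatic) and detection scores.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.65, 0.4, 0.3, 0.2, 0.1]
print(roc_auc(labels, scores), accuracy_at(labels, scores))
```

AUC is independent of the operating point, while accuracy depends on it and on the class balance, which is why Table 3 reports both.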

4.5 Brain Hemorrhage Detection in CT

The current standard of care for evaluation of patients presenting with stroke symptoms or following head trauma involves assessment of non-contrast CT (NCCT) scans for presence of hemorrhage. Accurate and timely detection of head bleeding is critical to start the appropriate treatment as soon as possible, such as administration of thrombolytics for stroke patients or surgical intervention for trauma patients. Automated detection of hemorrhage in NCCT scans [2, 38] has the potential to minimize the time it takes for a patient to receive the appropriate treatment.

In this work we investigate the gains of automated AI-based detection of hemorrhage in 3D NCCT scans by using both auxiliary tasks and self-supervised pretraining of a 3D network. The input DICOM images are stacked together into a 3D volume. We then reformat the input volume to axial orientation, and a set of head landmarks is used to crop the brain region and scale it to a box of dimensions 40x224x192. The Hounsfield units are normalized to (0,1) by using a transformation with (center, window) = (55, 200). For feature extraction we employ a 3D densely connected network with variable input size and 5 dense blocks with (1,2,3,3,3) units, each unit having 3D convolution, 3D batch normalization and LeakyReLU activation layers, with 16 initial features and a growth rate of 7. The first two blocks process data only in-plane with (1,3,3) kernels and downsample only in the (x,y) plane, while the last three process data with full (3,3,3) 3D kernels. The final features of fixed dimension 1024 are computed through adaptive pooling (both max and average pooling). A set of 3017 3D volumes is used for validation and model selection and a set of 2945 volumes is used for testing (both datasets coming from patients not included in pretraining). The main task is hemorrhage detection for each 3D volume; as auxiliary tasks we train with hemorrhage type labels (subarachnoid, subdural, epidural, intraventricular, intraparenchymal hemorrhages) and with presence/absence of hemorrhage for each slice.

Figure 10 illustrates the AUC and accuracy of the base network and the improvement obtained by training with auxiliary tasks as well as with self-supervised pretraining. The network trained with only presence/absence of hemorrhage for the whole 3D volume achieves an AUC of 0.88/0.87 for the validation/test sets, while the network trained also with the auxiliary tasks and the self-supervision achieves an AUC of 0.95/0.94 respectively, that is an 8% increase in performance. Figure 11 illustrates different types of hemorrhage successfully detected by the system. Table 4 shows the performance gains by sequentially using each of the auxiliary tasks and self-supervision; the best performance is obtained when all are used with self-supervision.
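The intensity normalization with (center, window) = (55, 200) mentioned above can be sketched as follows. The exact transformation is not specified in the text, so this example assumes the common linear windowing with clipping, which maps the range from -45 HU to 155 HU onto the unit interval. The function names are hypothetical.

```python
def window_normalize(hu, center=55.0, width=200.0):
    """Map a Hounsfield value to [0, 1] with a linear window.
    Assumption: linear scaling with clipping at the window edges."""
    lower = center - width / 2.0          # -45 HU for (55, 200)
    value = (hu - lower) / width
    return min(1.0, max(0.0, value))


def normalize_volume(volume, center=55.0, width=200.0):
    """Apply the window to a volume stored as nested lists
    indexed as volume[slice][row][column]."""
    return [
        [[window_normalize(v, center, width) for v in row] for row in plane]
        for plane in volume
    ]


# Toy volume with air, the window center, and bone-like values.
toy = [[[-1000.0, 55.0], [155.0, 1000.0]]]
print(normalize_volume(toy))
```

A narrow window of this kind spends the available intensity range on soft tissue and blood, while air and bone saturate at the two ends of the interval.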

Figure 10: Performance of hemorrhage detection on 3D NCCT: Using auxiliary tasks and self-supervised pretraining of a 3D network leads to an increase in performance. The auxiliary tasks include the use of hemorrhage types and hemorrhage labels on each slice. The left figure illustrates AUC and accuracy for the base model trained only on 3D hemorrhage labels for training, validation and testing data splits. The right figure illustrates the performance for the model trained using auxiliary tasks and self-supervision. It shows a significant increase of the AUC from 0.88/0.87 (validation/test) to 0.95/0.94.
Figure 11: Examples of hemorrhage detection on 3D NCCT on various types of hemorrhage. The left side shows the hemorrhage region as a color overlay and the right side shows the NCCT image slice.
Labels | 3D | 3D, types | 3D, types, slice | 3D, types, slice, self-supervision
AUC | 0.87/0.88 | 0.89/0.88 | 0.94/0.93 | 0.95/0.94
Table 4: Ablation performance of hemorrhage detection on 3D NCCT showing the AUC for validation/test when only 3D labels are used and adding types, slice labels and the self-supervision pretraining of the same 3D network.
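One common way to combine a main task with auxiliary tasks such as those listed in Table 4 is a weighted sum of per-task losses. The sketch below illustrates that idea only. The paper does not state its loss formulation, so the use of binary cross-entropy, the equal task weights, and all function names are assumptions made for this example. The 40 slice labels mirror the 40x224x192 crop box, assuming its first dimension is the slice axis.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def mean_bce(probs, targets):
    """Average binary cross-entropy over a list of predictions."""
    return sum(bce(p, t) for p, t in zip(probs, targets)) / len(probs)


def multitask_loss(volume_prob, volume_label,
                   type_probs, type_labels,
                   slice_probs, slice_labels,
                   w_types=1.0, w_slices=1.0):
    """Volume-level loss plus weighted auxiliary losses.
    The weights are hypothetical; 0.0 disables an auxiliary task."""
    main = bce(volume_prob, volume_label)
    aux_types = mean_bce(type_probs, type_labels)     # 5 hemorrhage types
    aux_slices = mean_bce(slice_probs, slice_labels)  # one label per slice
    return main + w_types * aux_types + w_slices * aux_slices


# Uninformative predictions (0.5 everywhere) on a made-up case.
loss = multitask_loss(
    0.5, 1,
    [0.5] * 5, [1, 0, 0, 0, 0],
    [0.5] * 40, [0] * 40,
)
print(loss)
```

Setting `w_types` and `w_slices` to zero recovers training with 3D labels only, which corresponds to the first column of the ablation.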

4.6 Directions of Future Research

Further optimization and research are required in different directions: 1) With a training time of 6.5 - 14 days (depending on the training dataset), further optimization and better scalability of the training are required to execute more training rounds and perform more ablative analysis. This has limited the number of experiments and the extent of the analysis; 2) Once the previous point is addressed, more work is needed to investigate the effectiveness of the proposed training technique on more diverse models; and 3) More dedicated focus is needed to investigate the utility of self-supervised learning in tracking and registration tasks, in which models are often very small and shallow to ensure high efficiency.

5 Conclusion

In conclusion, we propose an effective technique for self-supervised learning based on contrastive learning and online clustering, with support for hybrid self-supervised / supervised learning and multi-modality training data (2D and 3D). We demonstrate the scalability of the method on a large dataset of over 105,000,000 images, highlighting the impact of the learned image representations in improving the accuracy (average of 6-8% AUC), robustness and training speed (up to 85%) on various downstream tasks.

The concepts and information presented in this paper are based on research results that are not commercially available.

Acknowledgements

The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study.


  • [1] P. Agrawal, J. Carreira, and J. Malik (2015) Learning to see by moving. In IEEE International Conference on Computer Vision, pp. 37–45. Cited by: §2.3.
  • [2] M. Arbabshirani, B. Fornwalt, G. Mongelluzzo, J. Suever, B. Geise, A. Patel, et al. (2018) Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. In NPJ digital medicine, Vol. 1, pp. 2398–6352. Cited by: §4.5.
  • [3] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman, et al. (2011) The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics 38 (2), pp. 915–931. Cited by: item –, §4.3.
  • [4] YM. Asano, C. Rupprecht, and A. Vedaldi (2020) Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations, Cited by: §2.2, §3.1.1.
  • [5] S. Azizi, B. Mustafa, F. Ryan, Z. Beaver, J. Freyberg, J. Deaton, A. Loh, A. Karthikesalingam, S. Kornblith, T. Chen, V. Natarajan, and M. Norouzi (2021) Big self-supervised models advance medical image classification. External Links: 2101.05224 Cited by: §2.4.
  • [6] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §2.1.
  • [7] E. J. M. Barbosa Jr, W. B. Gefter, F. C. Ghesu, S. Liu, B. Mailhe, A. Mansoor, S. Grbic, and S. Vogt (2021) Automated detection and quantification of COVID-19 airspace disease on chest radiographs: a novel approach achieving expert radiologist-level performance using a deep convolutional neural network trained on digital reconstructed radiographs from computed tomography–derived ground truth. Investigative Radiology. Cited by: §4.3.
  • [8] M. A. Bautista, A. Sanakoyeu, E. Tikhoncheva, and B. Ommer (2016) CliqueCNN: deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, Vol. 29. Cited by: §2.2.
  • [9] P. Bojanowski and A. Joulin (2017) Unsupervised learning by predicting noise. In International Conference on Machine Learning, pp. 517–526. Cited by: §2.1.
  • [10] A. Bustos, A. Pertusa, J. Salinas, and M. de la Iglesia-Vayá (2020) PadChest: a large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis 66, pp. 101797. External Links: ISSN 1361-8415 Cited by: item –.
  • [11] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In IEEE International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.2.
  • [12] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision, Cham, pp. 139–156. Cited by: §2.2.
  • [13] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: 1st item, §1, §2.2, §3.1.1, §3.1, §3.1, item 2, §4.2, §4.4, Table 2, Table 3.
  • [14] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu (2020) Contrastive learning of global and local features for medical image segmentation with limited annotations. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: §1, §2.4.
  • [15] L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert (2019) Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, pp. 101539. External Links: ISSN 1361-8415 Cited by: §2.4.
  • [16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 1597–1607. Cited by: §1, §2.4, item 3, Table 2.
  • [17] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Vol. 26. Cited by: §3.1.1.
  • [18] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald (2016) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310. Cited by: item –.
  • [19] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, item 1.
  • [20] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §2.3.
  • [21] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 27. Cited by: §2.1.
  • [22] F. C. Ghesu, B. Georgescu, A. Mansoor, Y. Yoo, E. Gibson, R. Vishwanath, A. Balachandran, J. M. Balter, Y. Cao, R. Singh, et al. (2021) Quantifying and leveraging predictive uncertainty for medical image assessment. Medical Image Analysis 68, pp. 101855. Cited by: §4.4.
  • [23] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §2.2.
  • [24] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. Peng, and H. E. Stanley (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Cited by: item –.
  • [25] S. Gündel, A. A.A. Setio, F. C. Ghesu, S. Grbic, B. Georgescu, A. Maier, and D. Comaniciu (2021) Robust classification from noisy labels: integrating additional knowledge for chest radiography abnormality assessment. Medical Image Analysis 72, pp. 102087. Cited by: 3rd item, §1.
  • [26] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Vol. 9, pp. 297–304. Cited by: §2.1.
  • [27] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742. Cited by: 1st item, §2.1.
  • [28] S. S. Halabi, L. M. Prevedello, J. Kalpathy-Cramer, A. B. Mamonov, A. Bilbily, M. Cicero, I. Pan, L. A. Pereira, R. T. Sousa, N. Abdala, F. C. Kitamura, H. H. Thodberg, L. Chen, G. Shih, K. Andriole, M. D. Kohli, B. J. Erickson, and A. E. Flanders (2019) The RSNA pediatric bone age machine learning challenge. Radiology 290 (2), pp. 498–503. Cited by: item –.
  • [29] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9726–9735. Cited by: §2.1.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, Figure 5, Figure 7, item 1, §4.2, §4.3, Table 2.
  • [31] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 4182–4192. Cited by: §2.1.
  • [32] D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, Cited by: §2.1.
  • [33] J. Huang, Q. Dong, S. Gong, and X. Zhu (2019) Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 2849–2858. Cited by: §2.2.
  • [34] S. Jaeger, S. Candemir, S. Antani, Y. J. Wáng, P. Lu, and G. Thoma (2014) Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery 4 (6), pp. 475. Cited by: item –.
  • [35] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6 (1), pp. 1–8. Cited by: item –.
  • [36] D. Kim, D. Cho, D. Yoo, and I. Kweon (2018) Learning image representations by completing damaged jigsaw puzzles. In IEEE Winter Conference on Applications of Computer Vision, pp. 793–802. Cited by: §2.3.
  • [37] T. Kim and J. Ghosh (2019) On single source robustness in deep fusion models. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §3.
  • [38] W. Kuo, C. Häne, P. Mukherjee, J. Malik, and E. Yuh (2019) Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning. In Proceedings of the National Academy of Sciences of the United States of America, Vol. 116, pp. 22737–22745. Cited by: §4.5.
  • [39] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, Cham, pp. 577–593. Cited by: §2.3.
  • [40] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, Cham, pp. 527–544. Cited by: §2.3.
  • [41] F. Navarro, C. Watanabe, S. Shit, A. Sekuboyina, J. C. Peeken, S. E. Combs, and B. H. Menze (2021) Evaluating the robustness of self-supervised learning in medical imaging. External Links: 2105.06986 Cited by: §2.4.
  • [42] X. Nguyen, G. S. Lee, S. H. Kim, and H. J. Yang (2020) Self-supervised learning based on spatial awareness for medical image analysis. IEEE Access 8, pp. 162973–162981. Cited by: §1, §2.4.
  • [43] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, Cham, pp. 69–84. Cited by: §2.3.
  • [44] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §2.3.
  • [45] R. H. H. M. Philipsen, P. Maduskar, L. Hogeweg, J. Melendez, C. I. Sánchez, and B. van Ginneken (2015) Localized energy-based normalization of medical images: application to chest radiography. IEEE Transactions on Medical Imaging 34 (9), pp. 1965–1975. Cited by: §3.3.
  • [46] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R. L. Ball, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng (2018) MURA: large dataset for abnormality detection in musculoskeletal radiographs. External Links: 1712.06957 Cited by: item –.
  • [47] A. Rosenthal, A. Gabrielian, E. Engle, D. E. Hurt, S. Alexandru, V. Crudu, et al. (2017) The TB portals: an open-access, web-based platform for global drug-resistant-tuberculosis data sharing and analysis. Journal of Clinical Microbiology 55 (11), pp. 3267–3282. Cited by: item –.
  • [48] J. Rudolph, C. Huemmer, F. Ghesu, A. Mansoor, A. Preuhs, A. Fieselmann, N. Fink, J. Dinkel, V. Koliogiannis, V. Schwarze, et al. (2021) Artificial intelligence in chest radiography reporting accuracy: added clinical value in the emergency unit setting without 24/7 radiology coverage. Investigative Radiology. Cited by: §4.3.
  • [49] J. Rueckel, C. Huemmer, A. Fieselmann, F. Ghesu, A. Mansoor, B. Schachtner, P. Wesp, L. Trappmann, B. Munawwar, J. Ricke, et al. (2021) Pneumothorax detection in chest radiographs: optimizing artificial intelligence system for accuracy and confounding bias reduction using in-image annotations in algorithm training. European Radiology, pp. 1–13. Cited by: §4.3.
  • [50] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European Conference on Computer Vision, Cham, pp. 776–794. Cited by: §2.1, §4.3.
  • [51] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In IEEE International Conference on Computer Vision, pp. 9627–9636. Cited by: §1.
  • [52] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) ChestX-Ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3462–3471. Cited by: 3rd item, item –.
  • [53] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §2.1.
  • [54] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan (2020) ClusterFit: improving generalization of visual representations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6508–6517. Cited by: §2.2.
  • [55] Y. Yoo, P. Ceccaldi, S. Liu, T. J. Re, Y. Cao, J. M. Balter, and E. Gibson (2021) Evaluating deep learning methods in detecting and segmenting different sizes of brain metastases on 3D post-contrast T1-weighted images. Journal of Medical Imaging 8 (3). Cited by: §4.4.
  • [56] S. K. Zhou, J. Zhou, and D. Comaniciu (2007) A boosting regression approach to medical anatomy detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.3.
  • [57] Z. Zhou, V. Sodha, M. M. Rahman Siddiquee, R. Feng, N. Tajbakhsh, M. B. Gotway, and J. Liang (2019) Models genesis: generic autodidactic models for 3D medical image analysis. In Medical Image Computing and Computer Assisted Intervention, Cham, pp. 384–393. Cited by: §1, §2.4.
  • [58] C. Zhuang, A. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In IEEE International Conference on Computer Vision, pp. 6001–6011. Cited by: §2.1, §2.2.