Contrastive learning of global and local features for medical image segmentation with limited annotations

06/18/2020 ∙ by Krishna Chaitanya, et al. ∙ ETH Zurich

A key requirement for the success of supervised deep learning is a large labeled dataset - a condition that is difficult to meet in medical image analysis. Self-supervised learning (SSL) can help in this regard by providing a strategy to pre-train a neural network with unlabeled data, followed by fine-tuning for a downstream task with limited annotations. Contrastive learning, a particular variant of SSL, is a powerful technique for learning image-level representations. In this work, we propose strategies for extending the contrastive learning framework for segmentation of volumetric medical images in the semi-supervised setting with limited annotations, by leveraging domain-specific and problem-specific cues. Specifically, we propose (1) novel contrasting strategies that leverage structural similarity across volumetric medical images (domain-specific cue) and (2) a local version of the contrastive loss to learn distinctive representations of local regions that are useful for per-pixel segmentation (problem-specific cue). We carry out an extensive evaluation on three Magnetic Resonance Imaging (MRI) datasets. In the limited annotation setting, the proposed method yields substantial improvements compared to other self-supervision and semi-supervised learning techniques. When combined with a simple data augmentation technique, the proposed method reaches within 8% of benchmark performance using only two labeled MRI volumes for training, corresponding to only 4% of the training data used to train the benchmark.







1 Introduction

Supervised deep learning provides state-of-the-art medical image segmentation Ronneberger et al. (2015); Milletari et al. (2016); Kamnitsas et al. (2016, 2017), when large labeled datasets are available. However, assembling such large annotated datasets is challenging, thus methods that can alleviate this requirement are highly desirable. Self-supervised learning (SSL) is a promising direction to this end: it provides a pre-training strategy that relies only on unlabeled data to obtain a suitable initialization for training downstream tasks with limited annotations. In recent years, SSL methods Doersch et al. (2015); Pathak et al. (2016); Noroozi and Favaro (2016); Gidaris et al. (2018) have been highly successful for downstream analysis of not only natural images Russakovsky et al. (2015); Krizhevsky et al. (2009); Everingham et al. (2010), but also medical images Zhuang et al. (2019); Bai et al. (2019); Zhang et al. (2017b); Jamaludin et al. (2017); Tajbakhsh et al. (2019); Chen et al. (2019).

In this work, we focus on contrastive learning Chen et al. (2020); Misra and van der Maaten (2019); Bachman et al. (2019); Hjelm et al. (2019); Wu et al. (2018); Oord et al. (2018), a successful variant of SSL. The intuition of this approach is that different transformations of an image should have similar representations and that these representations should be dissimilar from those of a different image. In practice, a suitable contrastive loss Hadsell et al. (2006); Chen et al. (2020) is formulated to express this intuition and a neural network (NN) is trained with unlabeled data to minimize this loss. The resulting NN extracts image representations that are useful for downstream tasks, such as classification or object detection, and constitutes a good initialization that can be fine-tuned into an accurate model, even with limited labeled examples.

Despite its success, we believe that two important aspects have been largely unexplored in the existing contrastive learning literature that can improve the current state-of-the-art with respect to medical image segmentation. Firstly, most works focus on extracting global representations and do not explicitly learn distinctive local representations, which we believe will be useful for per-pixel prediction tasks such as image segmentation. Secondly, the contrasting strategy is often devised based on the transformations used in data augmentation, and does not necessarily utilize a notion of similarity that may be present across different images in a dataset. We believe that a domain-specific contrasting strategy that leverages such inherent structure in the data may lead to additional gains by providing the network with more complex similarity cues than what augmentation can offer.

In this work, we aim to fill these gaps in the contrastive learning literature in the context of segmentation of volumetric medical images and make the following contributions. 1) We propose new domain-specific contrasting strategies for volumetric medical images, such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). 2) We propose a local version of the contrastive loss, which encourages representations of local regions in an image to be similar under different transformations, and dissimilar to those of other local regions in the same image. 3) We evaluate the proposed strategies on three MRI datasets and find that combining the proposed global and local strategies consistently leads to substantial performance improvements compared to no pre-training, pre-training with pretext tasks, pre-training with the global contrastive loss shown to yield state-of-the-art accuracy in Chen et al. (2020), as well as semi-supervised learning methods. 4) We investigate if pre-training with the proposed strategies has complementary benefits to other methods for learning with limited annotations, such as data augmentation and semi-supervised training.

2 Related works

Recent works have shown that SSL Doersch et al. (2015); Pathak et al. (2016); Noroozi and Favaro (2016); Gidaris et al. (2018)

can learn useful representations from unlabeled data by minimizing an appropriate unsupervised loss during training. The resulting network constitutes a good initialization for downstream tasks. For brevity, we discuss only the SSL literature relevant to our work, which can be coarsely classified into two categories.

Pretext task-based methods employ a task whose labels can be freely acquired from unlabeled images, to learn useful representations. Examples of such pretext tasks include predicting image orientation Gidaris et al. (2018), inpainting Pathak et al. (2016), context restoration Chen et al. (2019), among many others Dosovitskiy et al. (2014); Doersch et al. (2015); Noroozi and Favaro (2016); Zhang et al. (2016, 2017c); Doersch and Zisserman (2017).

Contrastive learning methods employ a contrastive loss Hadsell et al. (2006) to enforce representations to be similar for similar pairs and dissimilar for dissimilar pairs Wu et al. (2018); He et al. (2019); Misra and van der Maaten (2019); Chen et al. (2020); Tian et al. (2019). Similarity is defined in an unsupervised way, mostly by using different transformations of an image as similar examples, as was proposed in Dosovitskiy et al. (2014). Oord et al. (2018); Hénaff et al. (2019); Hjelm et al. (2019); Bachman et al. (2019) maximize mutual information (MI), which is very similar in implementation to the contrastive loss, as pointed out in Tschannen et al. (2019). In Hjelm et al. (2019); Bachman et al. (2019); Hénaff et al. (2019), MI between global and local features from one or more levels of an encoder network is maximized. The authors in Tschannen et al. (2020) used domain-specific knowledge of videos while modeling the global contrastive loss. Others leverage a memory bank Wu et al. (2018) or momentum contrast He et al. (2019); Misra and van der Maaten (2019) to use more negative samples per batch.

The proposed work differs from existing contrastive learning approaches in multiple ways. Firstly, previous works focus on encoder-type architectures, used in image-wise prediction tasks, while we focus on encoder-decoder architectures used for pixel-wise predictions. Secondly, existing works that consider local representations learn global and local representations simultaneously, with the aim of facilitating better image-wide representations. In our work, we aim to learn local representations that complement image-wide representations, giving a decoder the ability to distinguish different areas in an image. As such, the minimization objective is different. Thirdly, while learning global representations, we integrate domain knowledge from medical imaging when defining the set of similar image pairs, going beyond different transformations of the same image.

Other relevant directions that leverage unlabeled data to address the limited-annotation problem are semi-supervised learning and data augmentation. Semi-supervised learning methods make use of unlabeled data alongside labeled data Chapelle et al. (2009); Zhu (2005); Grandvalet and Bengio (2005); Zhu and Goldberg (2009); Lee (2013); Kingma et al. (2014); Rasmus et al. (2015); Laine and Aila (2017); Sajjadi et al. (2016); Tarvainen and Valpola (2017); Miyato et al. (2018), and successful approaches for medical image analysis include (i) self-training Yarowsky (1995); McClosky et al. (2006); Bai et al. (2017) and (ii) adversarial training Zhang et al. (2017d). Extensive data augmentation has also been shown to be beneficial for medical image analysis in limited-label settings. Augmentation through random affine Cireşan et al. (2011), elastic Ronneberger et al. (2015) and contrast transformations Hong et al. (2017); Perez et al. (2018), Mixup Zhang et al. (2017a); Eaton-Rosen et al. (2018), and synthetic data generation via GANs Goodfellow et al. (2014); Costa et al. (2018); Shin et al. (2018); Bowles et al. (2018) have been explored. Recent works also leverage unlabeled data for augmentation, either through image registration Zhao et al. (2019) or, in a more general way, by optimizing the augmentation procedures for a given task Chaitanya et al. (2019).

3 Methods

Our investigation is based on the contrastive loss shown to achieve state-of-the-art performance in Chen et al. (2020). We first briefly present this loss, the “global contrastive loss”, before describing our contributions.

3.1 Global contrastive loss

For a given encoder network $e(\cdot)$, the contrastive loss for a pair of similar images $(\tilde{x}, \hat{x})$ is defined as:

$$ l(\tilde{x}, \hat{x}) = -\log \frac{\exp\left(\mathrm{sim}(\tilde{z}, \hat{z})/\tau\right)}{\exp\left(\mathrm{sim}(\tilde{z}, \hat{z})/\tau\right) + \sum_{\bar{x} \in \Lambda^{-}} \exp\left(\mathrm{sim}(\tilde{z}, \bar{z})/\tau\right)} \qquad (1) $$

Here, $\tilde{x}$ and $\hat{x}$ are two differently transformed versions of the same image $x$, i.e. $\tilde{x} = t(x)$ and $\hat{x} = t'(x)$, where $t, t' \in \mathcal{T}$ are simple transformations as used in Hadsell et al. (2006); Chen et al. (2020); Bachman et al. (2019); Dosovitskiy et al. (2014); Tian et al. (2019), such as crop followed by color transformations, and $\mathcal{T}$ is the set of such transformations. These two images are treated as similar and their representations are encouraged to be similar. In contrast, the set $\Lambda^{-}$ consists of images that are dissimilar to $x$ and its transformed versions. This set may include all images other than $x$, including their possible transformations. Minimizing the loss increases the similarity between the representations of $\tilde{x}$ and $\hat{x}$, denoted by $\tilde{z}$ and $\hat{z}$, while increasing the dissimilarity between the representation of $\tilde{x}$ and those of dissimilar images, $\bar{z}$. Note that the representations used in the loss are extracted after appending $e$ with $g_{1}$, a shallow fully-connected network with limited capacity, also referred to as a “projection head” in Chen et al. (2020), i.e. $z = g_{1}(e(x))$. The presence of $g_{1}$ allows some flexibility for $e$ to also retain information regarding the transformations, as was empirically shown in Chen et al. (2020). Lastly, similarity in the representation space is defined via the cosine similarity between two vectors, i.e. $\mathrm{sim}(u, v) = u^{\top} v / (\lVert u \rVert \lVert v \rVert)$, and $\tau$ is a temperature scaling parameter. Equation 1 only defines the loss for a given pair of similar images. Using this loss, the global contrastive loss is defined as:

$$ \mathcal{L}^{g} = \frac{1}{|\Lambda^{+}|} \sum_{(\tilde{x}, \hat{x}) \in \Lambda^{+}} \left[\, l(\tilde{x}, \hat{x}) + l(\hat{x}, \tilde{x}) \,\right] \qquad (2) $$

where $\Lambda^{+}$ is the set of all similar pairs of images that can be constructed from a given set of images $X$. The authors in Chen et al. (2020) construct similar pairs by randomly sampling $t$ and $t'$ from $\mathcal{T}$ and applying them to any given image $x \in X$. When the global contrastive loss is optimized using mini-batches, each batch is composed of a number of similar image pairs. While computing $l$ for each pair, the images in the other pairs form $\Lambda^{-}$.

The specific definitions of the sets of similar image pairs ($\Lambda^{+}$) and dissimilar images ($\Lambda^{-}$) are the guiding components of the global contrastive loss. These sets can be defined in a number of ways. Integrating domain- and problem-specific information into these definitions has the potential to improve the effectiveness of the resulting self-supervised learning process and the performance gains on downstream tasks Hoffer and Ailon (2015). With this motivation, we investigate different definitions of these sets, focusing on image segmentation for medical applications. Particularly, we leverage domain-specific knowledge to provide global cues for learning with volumetric medical images, and problem-specific knowledge to provide local cues for image segmentation. Below, we describe the notions of being similar and dissimilar that we investigated and the associated definitions of the sets. In the following discussion, we focus on 2D encoder-decoder architectures that are used for image segmentation, such as the UNet architecture Ronneberger et al. (2015). In such architectures, we use the encoder to extract the global representation and the decoder layers to extract the complementary local representation.
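As a concrete illustration, the per-pair contrastive loss described above can be sketched in a few lines of NumPy. The function and variable names here are our own, and a real implementation would operate on batched network outputs rather than individual vectors:

```python
import numpy as np

def contrastive_pair_loss(z_tilde, z_hat, z_negs, tau=0.1):
    """Loss for one similar pair (z_tilde, z_hat) against a list of
    negative representations z_negs (one per dissimilar image)."""
    def sim(u, v):  # cosine similarity between two vectors
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    pos = np.exp(sim(z_tilde, z_hat) / tau)
    neg = sum(np.exp(sim(z_tilde, z) / tau) for z in z_negs)
    return -np.log(pos / (pos + neg))
```

Minimizing this pulls z_tilde toward z_hat while pushing it away from every representation in z_negs; the temperature tau controls how sharply near-duplicate negatives are penalized.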

3.2 Leveraging structure within medical volumes for global contrastive loss

A distinctive aspect of medical imaging, in particular of MRI and CT, is that volumetric images of the same anatomical region for different subjects have similar content. Moreover, such images, especially when they are acquired with the same modality and capture the same field-of-view, can be roughly aligned relatively easily using linear registration, for instance as implemented in Elastix Klein et al. (2009). Post-alignment, corresponding 2D slices in different volumes capture similar anatomical areas and as such, information in such slices can be considered to be similar.

We leverage the similarity across corresponding slices in different volumes as an additional cue for defining $\Lambda^{+}$ and $\Lambda^{-}$ in the global contrastive loss. In the rest of the paper, we refer to a volumetric image as a volume and to a 2D slice as an image, unless otherwise stated. Suppose we have $N$ volumes, each composed of $M$ images and roughly aligned rigidly. We group the images of each volume into $S$ partitions, each containing consecutive images, and denote an image from the $s$-th partition of the $i$-th volume by $x^{i}_{s}$ (see Fig. 1.i.b). We use the same grouping in all volumes and, based on the assumed alignment, corresponding partitions in different volumes can be considered to capture similar anatomical areas. Accordingly, our main hypothesis is that $x^{i}_{s}$ and $x^{j}_{s}$ can be considered to be similar while training for contrastive learning. Similar to Sec. 3.1, we denote different transformations of an image with $\tilde{x}$ and $\hat{x}$. Representations are extracted with the encoder $e$ followed by $g_{1}$, and denoted with $z$, e.g. $z^{i}_{s} = g_{1}(e(x^{i}_{s}))$. In our explanations, we focus on mini-batch optimization and describe how each mini-batch is constructed.

Figure 1: (i) Sketch of the global contrastive loss used for pre-training the encoder $e$ with dense layers $g_{1}$. Here, the number of partitions per volume is 4. $x^{i}_{s}$ denotes the image from partition $s$ of volume $i$, and $z^{i}_{s}$ the corresponding global representation. (ii) Sketch of the local contrastive loss used for pre-training the decoder with convolutional layers $g_{2}$, with the weights of the encoder $e$ frozen after the previous training stage. $\tilde{f}$ is the feature map corresponding to the image $\tilde{x}$.

Random strategy ($G^{R}$): We refer to the application of the original idea (Sec. 3.1) as the “random strategy”. As in Chen et al. (2020); He et al. (2019), we form a batch by randomly sampling images across all volumes and applying a pair of random transformations, $t$ and $t'$, to each image. $\Lambda^{+}$ is composed of $(\tilde{x}, \hat{x})$ pairs and, for each such pair, the dissimilar image set $\Lambda^{-}$ consists of all the remaining images in the batch.

Proposed strategies: We build upon $G^{R}$ with two contrasting strategies that leverage the similarity of corresponding partitions in different volumes. For both strategies, we form batches by first randomly sampling volumes. Then, for each sampled volume, we sample one image per partition, resulting in $S$ images per volume. Next, we apply a pair of random transformations to each sampled image $x^{i}_{s}$ to get the transformed images $(\tilde{x}^{i}_{s}, \hat{x}^{i}_{s})$ and add them to the batch.
Strategy $G^{D-}$: Here, we refrain from contrasting images from corresponding partitions in different volumes, as such images likely contain similar global information and treating them as dissimilar may be detrimental. For a given similar pair from partition $s$, we enforce $\Lambda^{-}$ to contain images only from partitions other than $s$. The set of similar pairs, $\Lambda^{+}$, contains three pairs for each partition and sampled volume.

Strategy $G^{D}$: Here, we further incentivize similar representations for images coming from corresponding partitions of different volumes, thus encouraging partition-wise representation clusters (as illustrated in Fig. 1.i.c). Compared to $G^{D-}$, $G^{D}$ complements random transformations with real images from other volumes for which representations should be similar. Accordingly, $\Lambda^{+}$ contains pairs of images from corresponding partitions across volumes, $(x^{i}_{s}, x^{j}_{s})$, as well as their transformed versions, in addition to the pairs described in $G^{D-}$. For a given pair, $\Lambda^{-}$ remains the same as in $G^{D-}$.
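The batch construction and the restriction of the negative set described above can be sketched as follows. The data layout (`volumes[i]` as a list of $S$ partitions, each a list of 2D slices) and the function names are illustrative assumptions, not the authors' implementation:

```python
import random

def sample_batch(volumes, n_vols, rng):
    """Sample n_vols volumes, then one slice per partition from each;
    entries are (volume_index, partition_index, slice)."""
    batch = []
    for i in rng.sample(range(len(volumes)), n_vols):
        for s, partition in enumerate(volumes[i]):
            batch.append((i, s, rng.choice(partition)))
    return batch

def restricted_negatives(batch, s):
    """Restricted-negatives strategy: for a pair from partition s, the
    dissimilar set contains only samples from *other* partitions, so
    corresponding partitions are never treated as dissimilar."""
    return [x for (_, part, x) in batch if part != s]
```

With the domain-guided positive strategy, samples drawn from the same partition index of different volumes would additionally be paired as positives.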

3.3 Local contrastive loss

The global contrastive loss incentivizes image-level representations that are similar for similar images and dissimilar for dissimilar images. This strategy is useful for downstream tasks such as object classification, that require distinctive image-level representations. Downstream tasks that involve pixel-level predictions, for instance image segmentation, may additionally require distinctive local representations to distinguish neighbouring regions. A suitable local strategy, in combination with the global strategy, may be more apt for such downstream tasks.

With this motivation, we propose a self-supervised learning strategy that encourages the decoder of an encoder-decoder network (Fig. 1.ii) to extract distinctive local representations that complement the global representations extracted by the encoder. Specifically, we train the first $l$ decoder blocks using a local contrastive loss, $\mathcal{L}^{l}$. For a given image $x$, this loss incentivizes the representations to be such that different local regions within $x$ are dissimilar, while each local region in $x$ remains similar across intensity transformations of $x$.

$\mathcal{L}^{l}$ is defined similarly to the global loss, but using appropriate sets of similar and dissimilar local regions, here denoted as $\Omega^{+}$ and $\Omega^{-}$ to distinguish them from the sets used in the definition of $\mathcal{L}^{g}$. For computing $\mathcal{L}^{l}$, we feed a pair of similar images $(\tilde{x}, \hat{x})$ through the encoder $e$, the first $l$ decoder blocks, and a shallow network $g_{2}$ to obtain the feature maps $\tilde{f}$ and $\hat{f}$, as illustrated in Fig. 1.ii. We divide each feature map into $K \times K$ local regions. Now, corresponding local regions in $\tilde{f}$ and $\hat{f}$ form the similar pair set $\Omega^{+}$ and, for each similar pair, the dissimilar set $\Omega^{-}$ consists of all other local regions in both feature maps, $\tilde{f}$ and $\hat{f}$. These pairs are illustrated in different colors in Fig. 1.ii. Denoting the representation of region $(u, v)$ of a feature map $f$ by $f(u, v)$, the local contrastive loss for a given similar pair is defined as:

$$ l\big(\tilde{f}(u,v), \hat{f}(u,v)\big) = -\log \frac{\exp\big(\mathrm{sim}(\tilde{f}(u,v), \hat{f}(u,v))/\tau\big)}{\exp\big(\mathrm{sim}(\tilde{f}(u,v), \hat{f}(u,v))/\tau\big) + \sum_{f' \in \Omega^{-}} \exp\big(\mathrm{sim}(\tilde{f}(u,v), f')/\tau\big)} \qquad (3) $$

where $\mathrm{sim}$ is the cosine similarity as defined before and $(u, v)$ indexes local regions in the feature maps $\tilde{f}$ and $\hat{f}$. The total local contrastive loss for a set of images $X$ can then be defined as:

$$ \mathcal{L}^{l} = \frac{1}{|X| \, |\Omega^{+}|} \sum_{x \in X} \sum_{(u,v) \in \Omega^{+}} \Big[\, l\big(\tilde{f}(u,v), \hat{f}(u,v)\big) + l\big(\hat{f}(u,v), \tilde{f}(u,v)\big) \,\Big] \qquad (4) $$

where the transformed images $\tilde{x} = t(x)$ and $\hat{x} = t'(x)$ are obtained with transformations from the set of intensity transformations used to compute the local contrastive loss. For notational simplicity, we define the similar pair set using only the region indices in the summation. $\mathcal{L}^{l}$, as defined above, extends Chen et al. (2020) to pixel-level prediction tasks. Additionally, we may introduce domain-specific knowledge into $\mathcal{L}^{l}$, as we did for the global loss in Sec. 3.2. With and without domain knowledge, we propose two strategies for constructing the sets $\Omega^{+}$ and $\Omega^{-}$.

Random sampling ($L^{R}$): This is a direct application of the local contrastive loss without integrating domain-specific knowledge. A mini-batch is formed by randomly sampling N 2D images and applying a random pair of intensity transformations to each image. Using the notation including volume and partition indices, the random sampling strategy chooses images randomly and uses pairs $(\tilde{x}^{i}_{s}, \hat{x}^{i}_{s})$ for multiple $i$ and $s$ indices to form $\Omega^{+}$. For each similar pair, the dissimilar set $\Omega^{-}$ consists of the representations of all other local regions, with indices $(u', v') \neq (u, v)$, within the same feature maps $\tilde{f}$ and $\hat{f}$.

Strategy $L^{D}$: While computing $\mathcal{L}^{g}$ with strategy $G^{D}$, we assumed rough alignment between different volumes, defined corresponding partitions accordingly and encouraged similar global representations for such partitions. A similar strategy can be used for computing $\mathcal{L}^{l}$ by assuming correspondence between local regions within images. To this end, we include additional similar pairs in $\Omega^{+}$ by taking corresponding local regions from different volumes, i.e. $(\tilde{f}^{i}(u,v), \tilde{f}^{j}(u,v))$, as well as their transformed versions. In the dissimilar set $\Omega^{-}$, for each such pair, we consider the remaining local regions, $(u', v') \neq (u, v)$, of both feature maps, as well as their transformed versions. Note that the local representations extracted for both of the proposed strategies ($L^{R}$, $L^{D}$) are conceptually different from those extracted by an encoder as in Hjelm et al. (2019). Here, the local representations are designed to be distinctive across local regions.
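The region-level representations used by the local loss can be illustrated with a minimal sketch, assuming a square feature map and average pooling within each region; the paper does not prescribe this exact pooling, so treat it as one plausible choice:

```python
import numpy as np

def region_representations(fmap, k):
    """Split a (W, W, C) feature map into a k x k grid of local regions
    and average-pool each cell, yielding one vector per region."""
    w = fmap.shape[0] // k
    return np.array([[fmap[i*w:(i+1)*w, j*w:(j+1)*w].mean(axis=(0, 1))
                      for j in range(k)] for i in range(k)])
```

For a pair of feature maps from two transformed views, region (i, j) of one map is matched with region (i, j) of the other as a similar pair, while all remaining regions of both maps act as negatives.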

3.4 Pre-training using global and local contrastive losses

We carry out the overall pre-training with the global and local contrastive losses as follows. First, we pre-train the encoder $e$ along with the shallow dense network $g_{1}$ using the global contrastive loss $\mathcal{L}^{g}$. Next, we discard $g_{1}$ and use $e$ for further processing, following the observations in Chen et al. (2020). Now, we freeze $e$, add $l$ blocks of the decoder, and place a shallow network $g_{2}$ on top. We then pre-train the decoder blocks and $g_{2}$ using the local contrastive loss $\mathcal{L}^{l}$. As before, we discard $g_{2}$ after pre-training with the local loss. After both stages, the pre-trained encoder and decoder blocks are expected to extract representations that capture useful information at both the global and the local level. Now, we add the remaining decoder blocks (so that the output of the network has the same dimensions as the input) with random weights and fine-tune the entire network for the downstream segmentation task using a small labeled dataset. Note that it would be possible to train with $\mathcal{L}^{g}$, $\mathcal{L}^{l}$, and the segmentation loss jointly, but we choose stage-wise training to avoid the potentially cumbersome hyper-parameter tuning needed to weight each loss.
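The stage-wise schedule can be summarized programmatically. Here `train` is a placeholder supplied by the caller that stands in for a full optimization loop, and the part names are illustrative:

```python
def stagewise_schedule(train):
    """Run the three training stages in order; train(parts, loss, data)
    is a stand-in for one optimization loop over the named parts."""
    train(parts=("encoder e", "head g1"),
          loss="global contrastive", data="unlabeled")
    # stage 2: the encoder is kept frozen while the first decoder
    # blocks and the local head are trained
    train(parts=("decoder blocks d1..dl", "head g2"),
          loss="local contrastive", data="unlabeled")
    # stage 3: remaining decoder blocks added with random weights,
    # then the whole network is fine-tuned
    train(parts=("encoder e", "full decoder"),
          loss="segmentation", data="small labeled set")
```

The heads g1 and g2 are discarded after their respective stages; only the encoder and the pre-trained decoder blocks carry over to fine-tuning.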

4 Experiments and Results

Datasets: For the evaluation of the proposed approach, we use three publicly available MRI datasets. [I] The ACDC dataset was hosted in the MICCAI 2017 ACDC challenge Bernard et al. (2018). It comprises 100 3D short-axis cardiac cine-MRIs, captured using 1.5T and 3T scanners, with expert annotations for three structures: left ventricle, myocardium and right ventricle. [II] The MMWHS dataset was hosted in the STACOM and MICCAI 2017 challenge Zhuang et al. (2010, 2013, 2015, 2016). It consists of 20 3D cardiac MRIs with expert annotations for seven structures: left ventricle, left atrium, right ventricle, right atrium, myocardium, ascending aorta, and pulmonary artery. [III] The Prostate dataset was hosted in the MICCAI 2018 medical segmentation decathlon challenge. It consists of 48 3D T2-weighted MRIs of the prostate region with expert annotations for two structures: peripheral zone and central gland.

Pre-processing: We apply the following pre-processing steps: (i) intensity normalization of each 3D volume $x$ using min-max normalization, $(x - x_{min}) / (x_{max} - x_{min})$, where $x_{min}$ and $x_{max}$ denote intensity percentiles of $x$, and (ii) re-sampling of all 2D images and corresponding labels to a fixed pixel size using bi-linear and nearest-neighbour interpolation, respectively, followed by cropping or padding the images with zeros to a fixed image size. The fixed resolutions and dimensions were set separately for each of the ACDC, MMWHS and Prostate datasets. We did not have to use an external tool to align the volumes in any of the datasets; they were already roughly aligned as acquired.
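The two pre-processing steps might look as follows in code. The 1st/99th percentile cutoffs are an assumption for illustration (the paper does not state the exact percentiles here), and re-sampling to a fixed pixel size is omitted:

```python
import numpy as np

def normalize_volume(vol, lo_pct=1, hi_pct=99):
    """Min-max intensity normalization of a 3D volume using percentile
    cutoffs (percentile values are illustrative assumptions)."""
    lo, hi = np.percentile(vol, [lo_pct, hi_pct])
    return np.clip((vol - lo) / (hi - lo), 0.0, 1.0)

def crop_or_pad(img, target):
    """Center-crop or zero-pad a 2D image to (target, target)."""
    out = np.zeros((target, target), dtype=img.dtype)
    def span(n):  # source and destination slices for one axis
        if n >= target:
            start = (n - target) // 2
            return slice(start, start + target), slice(0, target)
        start = (target - n) // 2
        return slice(0, n), slice(start, start + n)
    sh, dh = span(img.shape[0])
    sw, dw = span(img.shape[1])
    out[dh, dw] = img[sh, sw]
    return out
```

Percentile-based cutoffs make the normalization robust to a few extreme voxel intensities, which are common in MRI.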

Details of network architectures and optimization are provided in the Appendix.

Experimental setup: We split each dataset into a pre-training set $X_{pre}$ and a test set $X_{test}$, each consisting of volumetric images and corresponding segmentation labels. We pre-train a UNet architecture using only the images from $X_{pre}$ without their labels, then fine-tune the pre-trained network with a small number of labeled examples chosen from $X_{pre}$. The fine-tuned model's segmentation performance is used to assess the pre-training procedure. The test set is not used in any stage of pre-training or fine-tuning; it is used only in the final evaluation. The sizes of these subsets for the different datasets are: (a) 52 pre-training and 20 test volumes for ACDC, (b) 10 and 10 for MMWHS, and (c) 22 and 15 for Prostate. For the fine-tuning stage, we form a training set $X_{tr}$ and a validation set $X_{vl}$, both of which are subsets of $X_{pre}$. We experiment with different training-set sizes of $N$ volumes, whereas $X_{vl}$ is fixed to 2 volumes.

Evaluation: The Dice similarity coefficient (DSC) is used to evaluate the segmentation performance. For all fine-tuning experiments, we report the mean score over all structures on the test set, averaged over 6 runs. For each run, $X_{tr}$ and $X_{vl}$ were constructed by randomly sampling the required number of volumes from $X_{pre}$.
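The evaluation metric can be computed per structure as follows; this is a standard DSC implementation, not taken from the authors' code:

```python
import numpy as np

def dice(pred, gt, label):
    """Dice similarity coefficient (DSC) for one structure label,
    computed on integer label maps of identical shape."""
    p = (pred == label)
    g = (gt == label)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0
```

The reported scores are means of such per-structure DSC values over all structures and test volumes.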

Summary of experiments: We conducted four sets of experiments, investigating: (a) the benefits of the proposed contrasting strategies in the global contrastive loss, (b) the benefits of the local contrastive loss and its two contrasting strategies, (c) the performance of the overall pre-training method (global + local contrastive losses + contrasting strategies) compared with other pre-training, data augmentation and semi-supervised learning methods, and (d) whether the proposed pre-training method also improves the performance of other techniques compared to random initialization.


Initialization (Encoder / Decoder) | ACDC (N=1, 2, 8) | Prostate (N=1, 2, 8) | MMWHS (N=1, 2, 8)

random / random: 0.614, 0.702, 0.844 | 0.489, 0.550, 0.636 | 0.451, 0.637, 0.787

Global contrasting strategies (decoder random):
$G^{R}$ / random: 0.631, 0.729, 0.847 | 0.521, 0.580, 0.654 | 0.500, 0.659, 0.785
$G^{D-}$ / random: 0.683, 0.774, 0.864 | 0.553, 0.616, 0.681 | 0.529, 0.684, 0.796
$G^{D}$ / random: 0.691, 0.784, 0.870 | 0.579, 0.600, 0.677 | 0.553, 0.686, 0.793

Local contrasting strategies (encoder $G^{R}$):
$G^{R}$ / random: 0.631, 0.729, 0.847 | 0.521, 0.580, 0.654 | 0.500, 0.659, 0.785
$G^{R}$ / $L^{R}$: 0.668, 0.760, 0.850 | 0.557, 0.601, 0.663 | 0.528, 0.687, 0.791
$G^{R}$ / $L^{D}$: 0.638, 0.740, 0.855 | 0.542, 0.605, 0.672 | 0.520, 0.664, 0.779

Proposed method:
$G^{D}$ / $L^{R}$: 0.725, 0.789, 0.872 | 0.579, 0.619, 0.684 | 0.569, 0.694, 0.794

Table 1: Comparison of contrasting strategies (CSs) for the global and local losses. (1) For the global loss, both $G^{D-}$ and $G^{D}$ are better than $G^{R}$ Chen et al. (2020). (2) For the local loss, both $L^{R}$ and $L^{D}$ are better than random decoder initialization. (3) We combine the best CS for each loss in the proposed pre-training. Within the global and local loss results, underlines indicate the best performing CS, while the best results in each column are in bold.

I. Contrasting strategies for global contrastive loss: To begin with, we investigated different contrasting strategies ($G^{R}$, $G^{D-}$, $G^{D}$) for pre-training the encoder using the global contrastive loss (Eqn. 1, Sec. 3.1). After pre-training, we fine-tuned the pre-trained encoder along with a randomly initialized decoder for the segmentation task using a small number of annotated volumes ($N$). The two baselines for this set of experiments are (1) no pre-training and (2) pre-training the encoder with the random contrasting strategy $G^{R}$ (as in Chen et al. (2020)). The results of this set of experiments are shown in the top part of Table 1. Firstly, we note that $G^{R}$ Chen et al. (2020); He et al. (2019) can be directly applied to medical images to achieve performance gains as compared to random initialization. Secondly, exploiting domain-specific knowledge of the naturally occurring structure within the data with strategies $G^{D-}$ and $G^{D}$ provides substantial further improvements, across all datasets and values of $N$. These additional gains show that leveraging slice correspondence across volumes allows the network to model more complex similarity cues than random augmentations, as in $G^{R}$. Finally, the performance of the two proposed contrasting strategies is similar, but $G^{D}$ is slightly better than $G^{D-}$ for 6 out of the 9 settings. This indicates the benefit of leveraging domain-specific structure in the data for defining both the similar and the dissimilar sets. We used $G^{D}$ as the contrasting strategy for the global loss in further experiments.

II. Contrasting strategies for local contrastive loss: Next, we investigated the effect of pre-training the decoder with the local contrastive loss, $\mathcal{L}^{l}$, with two contrasting strategies, $L^{R}$ and $L^{D}$ (see Sec. 3.3). In order to study this independently of the global contrasting strategies, we fixed the encoder, pre-trained with $\mathcal{L}^{g}$ using the random strategy $G^{R}$. The results of this set of experiments are shown in the lower part of Table 1. We experimented with different values for the number of decoder blocks to be pre-trained, $l$ (results in the appendix), and the results shown here correspond to the value of $l$ that led to the best overall performance. It can be seen that pre-training the decoder with $\mathcal{L}^{l}$ provides an additional performance boost as compared to only pre-training the encoder and randomly initializing the decoder. Importantly, such pre-training with the random strategy $L^{R}$ can also be used in applications where there is no obvious domain-specific clustering in the data. Further, between $L^{R}$ and $L^{D}$, it can be seen that $L^{R}$ fares better for 6 out of the 9 settings across datasets and values of $N$. Hence, we used $L^{R}$ for further experiments and inferred that the rough alignment that is easy to obtain across volumes may not necessarily provide correspondence between local regions of different volumes; encouraging similarity between assumed-to-be corresponding regions may therefore adversely affect the learning.

III. Comparison with other methods: Next, we compared the proposed pre-training strategy with several relevant methods, using the same architecture for all methods. As a baseline, we trained a randomly initialized network with extensive data augmentation: rotation, cropping, flipping, scaling Cireşan et al. (2011), elastic deformations Simard et al. (2003), and random contrast and brightness changes Hong et al. (2017); Perez et al. (2018). We found that these augmentations yield a strong baseline and used them in the fine-tuning stage of all subsequent experiments. Next, we compared with pretext task-based pre-training using three self-supervised tasks: rotation Gidaris et al. (2018), inpainting Pathak et al. (2016) and context restoration Chen et al. (2019), and with global contrastive loss based pre-training Chen et al. (2020), where the encoder was pre-trained according to Eq. 1 (Sec. 3.1) with the random contrasting strategy $G^{R}$, while the decoder was randomly initialized. Further, we compared with data augmentation Chaitanya et al. (2019); Zhang et al. (2017a) and semi-supervised Bai et al. (2017); Zhang et al. (2017d) learning methods that have shown promising results for medical image segmentation. Finally, we checked if the proposed initialization strategy can be combined with the other approaches for learning with limited annotations.


Method =1 =2 =8 =1 =2 =8 =1 =2 =8
Random init. 0.614 0.702 0.844 0.489 0.550 0.636 0.451 0.637 0.787
Contrastive loss pre-training
Global loss  Chen et al. (2020) 0.631 0.729 0.847 0.521 0.580 0.654 0.500 0.659 0.785
Proposed init. ( + ) 0.725 0.789 0.872 0.579 0.619 0.684 0.569 0.694 0.794
Pretext task pre-training
Rotation Gidaris et al. (2018) 0.599 0.699 0.849 0.502 0.558 0.650 0.433 0.637 0.785
Inpainting Pathak et al. (2016) 0.612 0.697 0.837 0.490 0.551 0.647 0.441 0.653 0.770
Context Restoration Chen et al. (2019) 0.625 0.714 0.851 0.552 0.570 0.651 0.482 0.654 0.783
Semi-supervised Methods
Self-train Bai et al. (2017) 0.690 0.749 0.860 0.551 0.598 0.680 0.563 0.691 0.801
Mixup Zhang et al. (2017a) 0.695 0.785 0.863 0.543 0.593 0.661 0.561 0.690 0.796
Data Aug. Chaitanya et al. (2019) 0.731 - - 0.585 - - - - -
Adversarial training Zhang et al. (2017d) 0.536 - - 0.487 - - - - -
Combination of Methods
Data Aug. Chaitanya et al. (2019) + Mixup Zhang et al. (2017a) 0.747 - - 0.577 - - - - -
Proposed init. + Self-train Bai et al. (2017) 0.745 0.802 0.881 0.607 0.634 0.698 0.647 0.727 0.806
Proposed init. + Mixup Zhang et al. (2017a) 0.757 0.826 0.886 0.588 0.626 0.684 0.617 0.710 0.794
Training with the full training set (benchmark) 0.907 0.697 0.787

Table 2: Comparison of the proposed method with other pre-training, data augmentation and semi-supervised learning methods. The proposed pre-training provides better results than the other methods for all datasets and training set sizes, with Chaitanya et al. (2019) also providing similarly good results. Further, the pre-training can be combined with other methods to obtain additional gains. In each column, the best values among individual methods are underlined and the best values overall are in bold.

The results of the comparative study are shown in Table 2. Let us first consider the individual methods, i.e. without combining a pre-training strategy with a semi-supervised or data augmentation method. Among these, the proposed method, which uses both the global contrasting strategy with domain knowledge and the local contrastive loss, performs substantially better than competing methods across all datasets and training set sizes. The performance improvements are especially high when very few training volumes are used. Comparisons with pretext-task-based pre-training approaches indicate that the initializations learned with the pretext tasks provide smaller gains when fine-tuned for the segmentation task in a semi-supervised setting than the proposed contrastive pre-training. The results of the data augmentation Chaitanya et al. (2019) and adversarial training Zhang et al. (2017d) methods are taken from Chaitanya et al. (2019) for the two matched datasets and the only matched setting of 1 training volume. We note that the data augmentation strategy suggested in Chaitanya et al. (2019) provides results similar to the proposed method. In the last row of Table 2, we show the fully supervised benchmark experiment, where for each dataset a network is trained with the maximum number of available training volumes. For all datasets, the proposed method reaches within ~0.1 DSC of the benchmark with just 2 training volumes.

Finally, we observed that applying data augmentation and semi-supervised methods after the proposed pre-training, rather than starting from random initialization, leads to further performance gains (Table 2). The proposed pre-training, combined with existing complementary augmentation and semi-supervised methods, takes a substantial step towards closing the performance gap with the benchmark.

5 Conclusion

The requirement of large numbers of annotated images for obtaining good results with deep learning methods remains a persistent challenge in medical image analysis. In this work, in order to alleviate the need for large annotated training sets, we proposed several extensions to contrastive-loss-based pre-training Chen et al. (2020). Specifically, we proposed (1) a local extension of the contrastive loss for learning local representations that are useful for dense prediction tasks such as image segmentation, and (2) a domain-specific contrasting strategy that leverages naturally occurring clusters within the data to define the similar and dissimilar image pairs used in the contrastive loss computation. Extensive experimentation showed that both proposed improvements lead to substantial performance gains in the limited annotation setting for medical image segmentation on three MRI datasets. Further, we showed that the benefits conferred by the proposed initialization are orthogonal to those obtained by other methods such as data augmentation and semi-supervised learning. Overall, we believe that the proposed initialization, combined with a data augmentation technique such as Mixup Zhang et al. (2017a), provides a simple toolbox for vastly improving performance in dense prediction tasks in medical imaging, especially in the clinically relevant low-annotation setting.


The presented work was partly funded by: 1. Clinical Research Priority Program Grant on Artificial Intelligence in Oncological Imaging Network, University of Zurich; 2. Swiss Data Science Center (DeepMicroIA); 3. Swiss Platform for Advanced Scientific Computing (PASC), coordinated by the Swiss National Supercomputing Centre (CSCS); 4. Personalized Health and Related Technologies (PHRT), project number 222, ETH domain. We also thank Nvidia for their GPU donation.


  • [1] Automated cardiac diagnosis challenge (ACDC). Note: accessed 2020-04-31. Cited by: §4.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519. Cited by: §1, §2, §3.1.
  • W. Bai, C. Chen, G. Tarroni, J. Duan, F. Guitton, S. E. Petersen, Y. Guo, P. M. Matthews, and D. Rueckert (2019) Self-supervised learning for cardiac mr image segmentation by anatomical position prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 541–549. Cited by: §1.
  • W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert (2017) Semi-supervised learning for network-based cardiac mr image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–260. Cited by: §2, Table 2, §4.
  • O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, et al. (2018) Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?. IEEE Transactions on Medical Imaging. Cited by: §4.
  • C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert (2018) GAN augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863. Cited by: §2.
  • K. Chaitanya, N. Karani, C. F. Baumgartner, A. Becker, O. Donati, and E. Konukoglu (2019) Semi-supervised and task-driven data augmentation. In Information Processing in Medical Imaging, A. C. S. Chung, J. C. Gee, P. A. Yushkevich, and S. Bao (Eds.), Cham, pp. 29–41. External Links: ISBN 978-3-030-20351-1 Cited by: §2, Table 2, §4, §4.
  • O. Chapelle, B. Scholkopf, and A. Zien (2009) Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §2.
  • L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert (2019) Self-supervised learning for medical image analysis using image context restoration. Medical image analysis 58, pp. 101539. Cited by: §1, §2, Table 2, §4.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §1, §2, §3.1, §3.2, §3.3, §3.4, §3, Table 1, Table 2, §4, §4, §5, §6.2, Table 7, §9.
  • D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber (2011) High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183. Cited by: §2, §4.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 7, §9.
  • P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abràmoff, A. M. Mendonça, and A. Campilho (2018) End-to-end adversarial retinal image synthesis. IEEE transactions on medical imaging 37 (3), pp. 781–791. Cited by: §2.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §1, §2, §2.
  • C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: §2.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp. 766–774. Cited by: §2, §2, §3.1.
  • Z. Eaton-Rosen, F. Bragman, S. Ourselin, and M. J. Cardoso (2018) Improving data augmentation for medical image segmentation. In International Conference on Medical Imaging with Deep Learning, Cited by: §2.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §1, §2, §2, Table 2, §4.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1, §2, §3.1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §2, §3.2, §4, §8.1.1.
  • O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §2.
  • D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR 2019, External Links: Link Cited by: §1, §2, §3.3.
  • E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §3.1.
  • J. Hong, B. Park, and H. Park (2017) Convolutional neural network classifier for distinguishing barrett’s esophagus and neoplasia endomicroscopy images. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2892–2895. Cited by: §2, §4.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §6.1.
  • A. Jamaludin, T. Kadir, and A. Zisserman (2017) Self-supervised learning for spinal mris. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 294–302. Cited by: §1.
  • K. Kamnitsas, E. Ferrante, S. Parisot, C. Ledig, A. V. Nori, A. Criminisi, D. Rueckert, and B. Glocker (2016) DeepMedic for brain tumor segmentation. In International workshop on Brainlesion: Glioma, multiple sclerosis, stroke and traumatic brain injuries, pp. 138–149. Cited by: §1.
  • K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker (2017) Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis 36, pp. 61–78. Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §6.2.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §2.
  • S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. Pluim (2009) Elastix: a toolbox for intensity-based medical image registration. IEEE transactions on medical imaging 29 (1), pp. 196–205. Cited by: §3.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. ICLR. Cited by: §2.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2. Cited by: §2.
  • D. McClosky, E. Charniak, and M. Johnson (2006) Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pp. 152–159. Cited by: §2.
  • [39] Medical segmentation decathlon challenge. Note: accessed 2020-04-31. Cited by: §4.
  • F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §1.
  • I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: §1, §2, §8.1.1.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §2.
  • [43] MM-WHS: multi-modality whole heart segmentation challenge. Note: accessed 2020-04-31. Cited by: §4.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §1, §2, §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §1, §2, §2, Table 2, §4.
  • F. Perez, C. Vasconcelos, S. Avila, and E. Valle (2018) Data augmentation for skin lesion analysis. In OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 303–311. Cited by: §2, §4.
  • A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Advances in neural information processing systems, pp. 3546–3554. Cited by: §2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §2, §3.1, §6.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems, pp. 1163–1171. Cited by: §2.
  • H. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski (2018) Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In International Workshop on Simulation and Synthesis in Medical Imaging, pp. 1–11. Cited by: §2.
  • P. Y. Simard, D. Steinkraus, and J. C. Platt (2003) Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, pp. 958. Cited by: §4.
  • N. Tajbakhsh, Y. Hu, J. Cao, X. Yan, Y. Xiao, Y. Lu, J. Liang, D. Terzopoulos, and X. Ding (2019) Surrogate supervision for medical image analysis: effective deep learning from limited quantities of labeled data. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1251–1255. Cited by: §1.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §2.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2, §3.1.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625. Cited by: §2.
  • M. T. Tschannen, J. Djolonga, M. Ritter, A. Mahendran, N. Houlsby, S. Gelly, and M. Lučić (2020) Self-supervised learning of video-induced visual invariances. In Conference on Computer Vision and Pattern Recognition, External Links: Link Cited by: §2.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2, §8.1.1.
  • D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pp. 189–196. Cited by: §2.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017a) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §2, Table 2, §4, §5, Table 7, §9.
  • P. Zhang, F. Wang, and Y. Zheng (2017b) Self supervised deep representation learning for fine-grained body part recognition. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 578–582. Cited by: §1.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2017c) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067. Cited by: §2.
  • Y. Zhang, L. Yang, J. Chen, M. Fredericksen, D. P. Hughes, and D. Z. Chen (2017d) Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 408–416. Cited by: §2, Table 2, §4, §4.
  • A. Zhao, G. Balakrishnan, F. Durand, J. V. Guttag, and A. V. Dalca (2019) Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8543–8553. Cited by: §2.
  • X. Zhu and A. B. Goldberg (2009) Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3 (1), pp. 1–130. Cited by: §2.
  • X. J. Zhu (2005) Semi-supervised learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.
  • X. Zhuang, W. Bai, J. Song, S. Zhan, X. Qian, W. Shi, Y. Lian, and D. Rueckert (2015) Multiatlas whole heart segmentation of ct data using conditional entropy for atlas ranking and selection. Medical physics 42 (7), pp. 3822–3833. Cited by: §4.
  • X. Zhuang, K. S. Rhode, R. S. Razavi, D. J. Hawkes, and S. Ourselin (2010) A registration-based propagation framework for automatic whole heart segmentation of cardiac mri. IEEE transactions on medical imaging 29 (9), pp. 1612–1625. Cited by: §4.
  • X. Zhuang and J. Shen (2016) Multi-scale patch and multi-modality atlases for whole heart segmentation of mri. Medical image analysis 31, pp. 77–87. Cited by: §4.
  • X. Zhuang (2013) Challenges and methodologies of fully automatic whole heart segmentation: a review. Journal of healthcare engineering 4 (3), pp. 371–407. Cited by: §4.
  • X. Zhuang, Y. Li, Y. Hu, K. Ma, Y. Yang, and Y. Zheng (2019) Self-supervised feature learning for 3d medical images by playing a rubik’s cube. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 420–428. Cited by: §1.

6 Network architecture and training details

6.1 Network Architecture

We use a UNet Ronneberger et al. (2015) based encoder-decoder architecture. The encoder consists of 6 convolutional blocks, each comprising two convolutions followed by a max-pooling layer with stride 2. While training with the global contrastive loss, we append a small projection network, consisting of two dense layers with output dimensions 3200 and 128, on top of the encoder. While training with the local contrastive loss, we add a partial decoder, with a chosen number of convolutional blocks, on top of the pre-trained encoder. Each decoder block consists of one upsampling layer with a factor of 2, followed by concatenation with the corresponding encoder level via a skip connection, followed by two convolutions. Analogously to the global case, a small projection network with two convolutions is added on top of the last decoder block to obtain the feature map used for the local loss computation. Lastly, we take the pre-trained encoder and partial decoder and append the remaining convolutional blocks of the decoder, along with skip connections, such that we have a sufficient number of layers to output a segmentation with the same dimensions as the input image. We use batch normalization Ioffe and Szegedy (2015) and ReLU activations after all layers except the last convolutional layers of the two projection networks.

For the computation of the local loss, the number of local regions chosen from each feature map was 13.
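To make the notion of region-level contrast concrete, the following is a minimal NumPy sketch under our own simplifying assumptions (function names ours; non-overlapping grid regions summarized by average pooling, whereas the method draws its local regions directly from decoder feature maps): corresponding regions of two views' feature maps act as positives and all other regions as negatives.

```python
import numpy as np

def local_region_embeddings(fmap, grid=3):
    """Average-pool a (C, H, W) feature map over a grid x grid set of
    non-overlapping regions, returning (grid*grid, C) embeddings.
    Trailing rows/columns beyond an even grid split are ignored."""
    c, h, w = fmap.shape
    hs, ws = h // grid, w // grid
    regs = []
    for i in range(grid):
        for j in range(grid):
            patch = fmap[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws]
            regs.append(patch.mean(axis=(1, 2)))
    return np.stack(regs)

def local_contrastive_loss(f1, f2, temperature=0.1, grid=3):
    """Contrast corresponding local regions of two views' feature maps:
    region k of f1 and region k of f2 form a positive pair; all other
    regions of both maps serve as negatives."""
    e = np.concatenate([local_region_embeddings(f1, grid),
                        local_region_embeddings(f2, grid)])
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T / temperature
    np.fill_diagonal(sim, -np.inf)        # exclude self-pairs
    k = grid * grid
    pos = np.concatenate([np.arange(k, 2*k), np.arange(k)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2*k), pos].mean()
```

The loss is low only when each local region is matched to its spatial counterpart in the other view, which is what pushes the decoder towards distinctive region-level representations.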

6.2 Training Details

For each training stage, we used the Adam optimizer Kingma and Ba (2015) for a fixed number of iterations, with a batch size of 40 and a fixed learning rate. For pre-training with both contrastive losses, we experimented with the two values of the temperature parameter, 0.1 and 0.5, that provided the best performance in Chen et al. (2020), and used the value that yielded the higher Dice score on the validation set in all experiments. A validation set was also used for model selection during the fine-tuning stage: that is, we chose the fine-tuned model that provided the highest Dice score on the validation set.
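Since model selection and all reported results rely on the Dice similarity coefficient (DSC), a minimal NumPy sketch of the metric may be useful (the function name and the convention of averaging over foreground classes present in either mask are our assumptions):

```python
import numpy as np

def dice_score(pred, target, num_classes):
    """Mean Dice similarity coefficient over foreground classes.

    pred, target: integer label maps of the same shape, with 0 as
    background. Classes absent from both maps are skipped.
    """
    scores = []
    for c in range(1, num_classes):  # skip background class 0
        p, t = pred == c, target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores)) if scores else 1.0
```

A perfect prediction yields 1.0 and a prediction with no overlap in any foreground class yields 0.0, matching the scale used in the tables.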

7 Illustration of slice correspondence in medical volumetric images

We illustrate 2D slices taken from each of the four partitions of three different volumes in Fig. 2. Even though the 2D slices of the same partition from different volumes show changes in shape and intensity characteristics, they contain the same global information about the cardiac anatomy.

Figure 2: A 2D slice is taken from each of the four partitions, across three different volumes, with the partition number indicated in the figure. Each row presents the four images from the four partitions of one selected volume.
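The partition-based construction of positive pairs illustrated above can be sketched as follows (helper names ours; a minimal version assuming contiguous, equally sized partitions along the slice axis):

```python
import numpy as np

def partition_labels(num_slices, num_partitions=4):
    """Split a volume's slice indices into num_partitions contiguous
    groups and return the partition index of every slice."""
    return np.minimum(
        np.arange(num_slices) * num_partitions // num_slices,
        num_partitions - 1)

def positive_pairs(volume_sizes, num_partitions=4):
    """Across a set of volumes, slices sharing a partition index are
    assumed to capture similar anatomy and form positive pairs."""
    labels = []
    for v, n in enumerate(volume_sizes):
        for s, p in enumerate(partition_labels(n, num_partitions)):
            labels.append((v, s, p))
    return [((va, sa), (vb, sb))
            for i, (va, sa, pa) in enumerate(labels)
            for (vb, sb, pb) in labels[i + 1:]
            if pa == pb and va != vb]
```

In the contrastive loss, such cross-volume pairs from the same partition are treated as positives, while slices from different partitions act as negatives.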

8 Ablation studies

Here, we present an ablation study to investigate the effect of some of the hyper-parameters involved in the proposed method. For all these experiments, we considered the proposed global loss contrasting strategy for pre-training the encoder as it yielded the best performance in earlier experiments. We used the ACDC dataset for these experiments.

8.1 Global Contrastive Loss

Firstly, we investigated the effect of batch size and the number of partitions per volume used in the pre-training of the encoder with global contrastive loss.

8.1.1 Batch Size

Here, we studied the effect of the batch size used during encoder pre-training. Previous works have suggested that large batch sizes are crucial for good performance, with some works leveraging a memory bank Wu et al. (2018) or momentum contrast He et al. (2019); Misra and van der Maaten (2019) to accommodate a higher number of negative samples per batch. To check whether the same holds for our datasets, we pre-trained the encoder with 3 batch sizes, 40, 250 and 450, with the number of partitions set to 4.

The results are presented in Table 3. We observed that larger batch sizes do not improve the results; rather, the performance deteriorated for the batch size of 450. This shows that, unlike for natural image datasets, large batch sizes are not required in the pre-training stage to obtain high performance on medical images. Moreover, medical datasets generally contain far fewer unlabeled images that can be leveraged for pre-training than natural image datasets. For instance, the largest of the 3 datasets used in our experiments is the ACDC dataset, with 100 volumes amounting to around 1000 2D images. Hence, evaluating batch sizes like the 2048 or 4096 used for pre-training on natural image datasets is not a realistic setting for medical images.

Batch Size =1 =2 =8
40 0.691 0.784 0.870
250 0.685 0.779 0.862
450 0.668 0.770 0.857
Table 3: Mean Dice score on the ACDC test set showing the effect of batch size, with the number of partitions set to 4, when pre-training the encoder with the proposed global strategy and then fine-tuning with each training set size (1, 2 or 8 volumes).

8.1.2 Number of partitions per 3D volume

Here, we evaluated 3 values, {3, 4, 6}, for the number of partitions per 3D volume, for a fixed batch size of 40. The number of partitions determines the number of clusters formed in the latent space by the proposed global loss strategy. The results are presented in Table 4. We observed that the Dice score degraded for higher numbers of partitions. As there are approximately 10 images per volume in the ACDC data, a high number of partitions forces the network to create a cluster for nearly each image in the volume. This results in an unrealistic composition of positive and negative pairs across volumes, where dissimilar images are inaccurately forced to act as similar pairs. Since the volumes are not perfectly aligned, this wrong association in the representation space leads to a drop in performance.

No. of partitions () per 3D volume =1 =2 =8
3 0.686 0.776 0.868
4 0.691 0.789 0.870
6 0.652 0.760 0.859
Table 4: Mean Dice score on the ACDC test set showing the effect of the number of partitions per 3D volume, for a fixed batch size of 40, when pre-training the encoder with the proposed global strategy and then fine-tuning with each training set size (1, 2 or 8 volumes).

8.2 Local Contrastive Loss

Secondly, we investigated the effect of the decoder size (the number of pre-trained decoder blocks) and the local region size used in the proposed local contrastive loss for pre-training the decoder network. We used the proposed local contrastive loss strategy to pre-train the decoder blocks with the encoder kept frozen; the encoder was pre-trained beforehand with the proposed global loss strategy.

8.2.1 Decoder size

We varied the number of decoder blocks that are pre-trained, investigating all five values from one block up to the entire decoder. From Table 5, we observe that performance increases with the number of pre-trained decoder blocks up to an intermediate value, and then decreases again. We hypothesize that this is because, at the largest decoder sizes, the local regions have a smaller receptive field than at the other values and thereby might not contain enough information to learn useful representations.

Region size =1 =2
0.738 0.704 0.705 0.702 0.683 0.780 0.777 0.770 0.774 0.764
0.703 0.737 0.725 0.726 0.690 0.783 0.793 0.787 0.776 0.777

Table 5: Mean Dice score on the ACDC test set for the proposed contrasting strategies used to pre-train the encoder, for different numbers of pre-trained decoder blocks and two local region sizes (one per row), after fine-tuning with training set sizes of 1 and 2 volumes.

8.2.2 Size of local regions

We experimented with two values for the size of the local regions used to obtain local representations from a given feature map, to study whether the region size influences performance after pre-training. From Table 5, we observe a small difference in performance of around 2% between the two region sizes, with the larger size yielding higher performance. This may be because a larger local region has a larger receptive field and thus contains more information, which is potentially more useful in the devised pre-training setting.

8.3 Combination of Local and Global Contrastive Losses

Here, we present the combinations of the local contrastive loss strategies with both global loss strategies for different numbers of pre-trained decoder blocks (2, 3, 4); these results were moved from Table 1 in the main article to Table 6.

Initialization of Dataset =1 =2 =8
Encoder Decoder =2 =3 =4 =2 =3 =4 =2 =3 =4
Local loss strategies on encoder pre-trained with random strategy
random init ACDC 0.631 0.729 0.847
0.642 0.668 0.655 0.754 0.760 0.732 0.860 0.850 0.860
0.614 0.638 0.642 0.744 0.740 0.744 0.854 0.855 0.852
random init Prostate 0.521 0.580 0.654
0.566 0.557 0.538 0.600 0.601 0.591 0.661 0.663 0.665
0.536 0.542 0.543 0.583 0.605 0.597 0.656 0.672 0.659
random init MMWHS 0.500 0.659 0.785
0.523 0.528 0.511 0.692 0.687 0.679 0.794 0.791 0.792
0.510 0.520 0.515 0.697 0.664 0.684 0.797 0.779 0.781
Local loss strategies on encoder pre-trained with proposed strategy
random init ACDC 0.691 0.784 0.870
0.708 0.725 0.720 0.784 0.789 0.785 0.868 0.872 0.871
0.737 0.725 0.726 0.793 0.787 0.776 0.865 0.874 0.868
random init PZ 0.579 0.600 0.677
0.577 0.579 0.581 0.617 0.619 0.620 0.683 0.684 0.685
0.562 0.567 0.564 0.608 0.607 0.599 0.675 0.686 0.680
random init MMWHS 0.553 0.686 0.793
0.556 0.569 0.572 0.671 0.694 0.693 0.796 0.794 0.796
0.545 0.574 0.551 0.677 0.681 0.689 0.803 0.791 0.789
Table 6: Mean Dice score over the test set for the proposed local contrastive loss on all datasets, for decoder lengths of 2, 3 and 4 blocks and all training set sizes of 1, 2 and 8 volumes.

9 Experiments with Natural Image Datasets

In order to check the generality of the proposed method beyond medical imaging, we evaluated the proposed local contrastive loss on the natural image dataset Cityscapes Cordts et al. (2016) for the segmentation task. We pre-trained the decoder using the proposed local contrastive loss and the encoder with the global contrastive loss (random strategy), as in Chen et al. (2020). We compared this to a baseline with no pre-training and to pre-training with only the global contrastive loss as in Chen et al. (2020). Additionally, we evaluated the combination of the proposed initialization with Mixup Zhang et al. (2017a), which yielded the best results on medical images.

For the evaluation, we split the provided training and validation data into a training set (2770 images) and a test set (705 images), as before, and used the test set only for the final evaluation. After pre-training with the unlabeled training images, we fine-tuned the network for the segmentation task with a set of labeled training images and validation images chosen from the training set. We performed the fine-tuning in a limited annotation setting for three (labeled, validation) set sizes: (100, 100), (200, 200) and (400, 200) images. Table 7 presents these results.

Implementing the global and local contrastive losses requires a reasonable batch size of around 40. Due to memory constraints, such a batch size was difficult to achieve with images at their original dimensions of (1024, 2048). Therefore, we down-scaled the images by a factor of 4, to (256, 512). Due to the downsampling, some of the smaller objects either vanished or were reduced to a negligible size. To avoid extreme class imbalance, we relabeled these small objects as background. Thus, we considered the following 12 foreground labels: road, sidewalk, building, wall, vegetation, terrain, sky, person+rider, car, motorbike+bicycle, truck, bus, with the remaining labels set as background.
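The label merging described above can be sketched as follows. Note that the source-to-target id mapping below is our illustrative assumption (the standard raw Cityscapes label ids), not necessarily the authors' exact mapping; the mechanics of merging classes and sending everything else to background are what the sketch demonstrates.

```python
import numpy as np

# Hypothetical mapping from raw Cityscapes label ids to the reduced
# 12-foreground-class scheme (background = 0). Person/rider and
# motorcycle/bicycle are merged, as described in the text.
MERGED = {7: 1, 8: 2, 11: 3, 12: 4, 21: 5, 22: 6, 23: 7,
          24: 8, 25: 8,    # person + rider merged
          26: 9,
          32: 10, 33: 10,  # motorbike + bicycle merged
          27: 11, 28: 12}  # truck, bus

def remap_labels(mask):
    """Map a raw label mask to the reduced scheme; any id not listed
    (small or ignored classes) becomes background (0)."""
    out = np.zeros_like(mask)
    for src, dst in MERGED.items():
        out[mask == src] = dst
    return out
```

Applying such a remap to every ground-truth mask before training keeps the class set small and avoids the extreme imbalance caused by objects shrunk to a few pixels after downsampling.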

Training details: For augmentation, we used random cropping followed by random color jitter (brightness, contrast, saturation, hue). The rest of the training details remain the same as described for the medical imaging datasets.

Method (100,100) (200,200) (400,200)
Baseline (Random init.) 0.451 0.495 0.524
Global loss  Chen et al. (2020) 0.457 0.496 0.535
Proposed init. ( + ) 0.469 0.517 0.549
Proposed init. + Mixup Zhang et al. (2017a) 0.475 0.526 0.569
Benchmark 0.652
Table 7: Mean Dice score over the test set for all selected labels on the Cityscapes Cordts et al. (2016) dataset, for the proposed pre-training compared to a baseline with random initialization and to pre-training with only the global loss.