
CUTS: A Fully Unsupervised Framework for Medical Image Segmentation

In this work, we introduce CUTS (Contrastive and Unsupervised Training for Segmentation), the first fully unsupervised deep learning framework for medical image segmentation, facilitating the use of the vast majority of imaging data that is not labeled or annotated. Segmenting medical images into regions of interest is a critical task for facilitating both patient diagnoses and quantitative research. A major limiting factor in this segmentation is the lack of labeled data, as getting expert annotations for each new set of imaging data or task can be expensive, labor intensive, and inconsistent across annotators. We therefore utilize self-supervision based on pixel-centered patches from the images themselves. Our unsupervised approach is based on a training objective with both contrastive learning and autoencoding aspects. Previous contrastive learning approaches for medical image segmentation have either focused on image-level contrastive training, rather than our intra-image patch-level approach, or have used it as a pre-training task after which the network needed further supervised training. By contrast, we build the first entirely unsupervised framework that operates at the pixel-centered-patch level. Specifically, we add novel augmentations and a patch reconstruction loss, and introduce a new pixel clustering and identification framework. Our model achieves improved results on several key medical imaging tasks, as verified by held-out expert annotations on the task of segmenting geographic atrophy (GA) regions in images of the retina.



1 Introduction

While supervised deep learning approaches have been highly successful in segmenting images, there are several issues in applying them to biomedical images, particularly in order to make clinical inferences. First, these networks are reliant on expert-driven annotations, which cover a smaller and smaller subset of generated images as the volume of image generation in various contexts explodes. Second, supervised classification networks trained on one set of annotated images can fail to generalize to similar image sets collected in only slightly different contexts (such as different hospitals or different instruments). A major reason for this is that supervised networks can easily overfit to the classification, memorizing idiosyncratic pixel patterns or artifacts in the images rather than learning generally meaningful representations. To remedy this, we propose an entirely unsupervised neural network that leverages recent advances in representation learning and is built to automatically segment biomedical images.

Our framework, which we denote Contrastive and Unsupervised Training for Segmentation (CUTS), was named as an homage to the renowned painter Henri Matisse, who famously used a “cut-up” method he called “drawing with scissors” to assemble an image based on patches of material from different sources. This method motivates our foundational view that meaningful image segmentations are comprised of generally contiguous regions of similar color and texture being grouped together within an image.

In order to induce a neural network to naturally group these “cut-up” parts of images at fine resolution, we make two design choices. First, our network produces pixel-level embeddings but operates on patches centered around each pixel, finding a latent space embedding of the pixel space. We refer to this approach as learning functions of pixel-centered patches. Second, since we want our network to be unsupervised, we wish to create an embedding in a hidden layer of a neural network such that pixels in the same parts of the image are grouped together, while pixels in different areas are separated. To induce this, we combine two losses that are motivated by two distinct strands of research in the community: contrastive learning and autoencoding.

Figure 1: The framework of our unsupervised image segmentation network. It combines a contrastive loss that uses an SSIM-based pixel proximity heuristic and an autoencoder-inspired reconstruction loss that together guide the learning of a meaningful embedding space.

In the contrastive learning aspect of the loss, each pixel-centered patch is processed along with another patch, and the network produces embeddings that are close to each other if the patches are similar or distant from each other if the patches are dissimilar. This contrastive loss, when properly tuned with carefully chosen sets of similar and dissimilar patches obtained with a heuristic-guided search of each image, provides a framework for a latent space embedding that is similar to word embeddings based on context. We utilize this to ensure that nearby regions, and thus natural segments of the image, are embedded nearby in the learned latent space, while still remaining separated along natural boundaries of different segments.

In the autoencoding aspect of the loss, the network must be able to reconstruct the patch around each pixel from just its embedding representation. This ensures that fine-grained detail of the local patch is well represented in the embedding and no information is lost in the layers of the network. For example, consider the embedding of two pixel-centered patches that each contain a boundary in the image, but the first is centered slightly on one side of the boundary and the second is centered slightly on the other side of the boundary. These two patches are largely similar to each other in aggregate, but to segment the boundary precisely, we would need to retain the information of exactly where the boundary is located in the patch. This reconstruction loss ensures this information is retained as necessary.

Finally, once we have a sufficiently structured latent space, we find regions of interest by applying a modified spectral clustering to the embedding space. We demonstrate that mapping the clusters obtained from the embedding back onto the corresponding pixels produces a meaningful segmentation in the original image space.

Our primary motivation comes from applications in medical imaging specifically. We demonstrate that our method is widely generalizable to a variety of different contexts, and could be used on anything from natural images to tumors. General applicability, rather than being narrowly tailored to one particular application area, is an important criterion for widespread adoption in the medical imaging community, as imaging is a broad and fast-growing field consisting of an increasing number of technologies that produce data with various statistical properties. We first apply our method on natural images as an initial performance check in an easily interpretable setting. We then show that our network performs unsupervised segmentation of difficult-to-partition regions of geographic atrophy in a specially curated set of retinal images. We finally demonstrate that our method can segment regions of the brain in MRI data.

Our main contributions include:

  • Specification of a pixel-centered, patch-level convolutional architecture for producing contextualized representations of the pixels.

  • Introduction of a contrastive loss (including pixel proximity heuristics supplemented with SSIM-based checks).

  • A framework that combines contrastive learning with a reconstruction loss for learning a meaningful embedding space.

  • A spectral clustering method for zeroing in on contiguous regions of potential interest to clinicians.

The remainder of this paper is organized as follows. We first discuss related work in the field to provide context. We then introduce our model framework, including detailing the novel losses and neural network architecture. Finally, we demonstrate our model on a series of natural image and biological image datasets, coupled with qualitative and quantitative evaluations of its performance and comparison to baseline alternatives.

2 Related Work

Contrastive learning has been used for image classification to address the issue of limited annotated images. However, commonly used contrastive learning methods have often focused on extracting image-level representations with a global contrastive loss [3]. The use of image-level contrastive learning is fundamentally different from our patch-level approach. Image-level contrastive learning yields no information about intra-image variation or features, and thus is inapplicable to the common setting of analyzing different areas within an image. Instead, image-level contrastive learning generally tries to solve the problem of whether two images both have the same property, like whether they are of the same person or not. Recent efforts have moved towards developing local pixel/patch-level representations that can be used for downstream tasks. In this study we focus on the latter: learning pixel-level representations for segmentation tasks. However, an important difference is that our work uses unsupervised training end-to-end in obtaining its representations.

Previous work has used patch-level contrastive training but limited its application to pre-training of a supervised model that was then fine tuned with labeled data [2]. In that work, they combined a global contrastive loss and a local contrastive loss to learn both image representations and patch representations. On the other hand, other work has used a similar U-net structure but combined image-level and pixel-level representations [21]. In both cases, notably, this was a pre-training step prior to supervised training rather than a standalone representation-learning method.

While these are supervised models that are pre-trained with unsupervised contrastive learning, an alternative setting in the previous literature is supervised contrastive learning. In this setting, labeled data is utilized (which we do not have in our applications) to construct positive and negative pairs for learning pixel-level representations [24]. In contrast, we directly use convolutional neural networks to learn pixel embeddings with a pixel-level contrastive loss without labels. The segmentation is achieved through clustering, which makes our model totally unsupervised, unlike previous methods that incorporate labels at some point in the segmentation.

We note that the many U-net-based models still target the task of supervised learning: with a U-net, the output is directly in pixel space, whereas our model outputs representations in a learned embedding space. Thus, this work is easily differentiated from all works focused on minor variations on the U-net architecture, such as adding skip connections, attention layers, or adaptive kernels [25, 17, 10, 4, 11].

3 Methods

3.1 The CUTS Framework

In this section, we detail the framework for our model (Figure 1). The framework is composed of a convolutional neural network (CNN) that maps from the pixel-centered patch space to an embedding space, where two losses are applied: a contrastive loss and a patch reconstruction loss. We will discuss each of these parts in detail next.

Convolutional neural network

We start by processing the image with a CNN to produce a representation of each pixel in an embedding space which we learn during training. Our goal is to also include local information about the area around the pixel in the embedding, and thus we process the image with convolutional layers that are exposed to a patch centered at the pixel of interest. We initially use a convolutional layer with a wide filter width to incorporate the broadest contextual information about the patch, and then progressively shrink the width of the kernel as we go deeper into the network to concentrate on the area closest to the pixel itself. While this pyramid shape is general, we utilize filter widths of 15, 11, and 7, respectively, for this progressively shrinking set of filter widths. After processing with convolutional layers, each pixel representation has a fully-connected layer applied to it to map it into the $d$-dimensional embedding space.
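As a rough illustration of how much context each pixel embedding sees, the following sketch computes the receptive field of the three stacked convolutions. It assumes stride 1 and no pooling (as stated later for the experiments) and zero "same" padding, which the text does not specify; the helper names are hypothetical.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (pixels per axis) of a stack of convolutional layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)  # assumption: no striding
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # a stride multiplies the spacing between outputs
    return rf


def same_padding(kernel_size):
    """Zero padding per side that preserves spatial size (odd kernel, stride 1)."""
    return (kernel_size - 1) // 2


widths = [15, 11, 7]
print(receptive_field(widths))            # 31
print([same_padding(k) for k in widths])  # [7, 5, 3]
```

Under these assumptions, each pixel's embedding depends on a 31x31 patch centered at that pixel, and the output keeps one embedding per input pixel.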

Contrastive loss

To ensure the network places patches that are similar to each other nearby and patches that are dissimilar far away in the embedding, we employ a contrastive loss. A contrastive loss takes in pairs of so-called “positive examples” and “negative examples” where positive examples are pairs of points that should be embedded as close to each other as possible while the negative examples are pairs of points that should be embedded as far from each other as possible.

As our method is completely unsupervised, we utilize a heuristic to help us identify pairs of patches to use as positive and negative examples. Our heuristic initially uses the location within the image to generate candidates. We identify potential positive pairs by starting with a patch centered at pixel $i$, which we denote $p_i$, and scanning patches centered at nearby pixels (within 5 pixels of the given pixel). We then use the structural similarity index (SSIM) to verify that these patches are similar to each other [9]. If they meet or exceed a threshold of similarity set as a hyperparameter for our model, then they are used as a positive pair. Negative pairs for $p_i$ are selected in a similar fashion: patches centered at pixels that are not nearby are the initial candidates, subject to the same SSIM verification except they must be below the threshold for similarity. In this way, we can generate (equal numbers of) positive and negative pairs for our contrastive loss without needing supervised labels in any way and build invariances into our embedding. Note each pixel-centered patch $p_i$ has its own set of positive patches (which we will denote $P_i^+$) and set of negative patches (which we denote $P_i^-$).
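The pair-selection heuristic described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses a single-window SSIM over the whole patch (the usual SSIM averages over sliding windows), measures "within 5 pixels" as Chebyshev distance, and uses a placeholder threshold of 0.5 and a 3x3 patch. All of these choices and all function names are assumptions.

```python
def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM between two equal-length flattened patches."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def extract_patch(image, row, col, half):
    """Flattened (2*half+1)^2 patch centered at (row, col); must be in bounds."""
    return [image[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def label_pair(image, a, b, half=1, radius=5, threshold=0.5):
    """Return 'positive', 'negative', or None for patches centered at a and b."""
    near = max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= radius
    score = ssim_global(extract_patch(image, a[0], a[1], half),
                        extract_patch(image, b[0], b[1], half))
    if near and score >= threshold:
        return "positive"   # nearby and structurally similar
    if not near and score < threshold:
        return "negative"   # distant and structurally dissimilar
    return None             # candidate rejected by the SSIM check


# Toy image: intensity rises over the left half and falls over the right half.
image = [[c / 20 if c < 10 else (20 - c) / 20 for c in range(20)]
         for _ in range(3)]
print(label_pair(image, (1, 2), (1, 4)))
print(label_pair(image, (1, 2), (1, 15)))
```

Candidates that are nearby but fail the SSIM check (for example, two patches straddling the intensity peak in the toy image) are discarded rather than being relabeled as negatives.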

Figure 2: Example segmentations produced by our model on the Berkeley natural image dataset. Despite not using any labels in the training, our method is able to segment regions of interest and similarity in the natural image dataset.

Let $f$ be the convolutional neural network mapping. The patch-level contrastive loss is defined for each positive patch pair in the positive set $P_i^+$ of a patch $p_i$.

Note that this loss will encourage the network to embed pairs of patches in the similarity set close to each other, with closeness scaled by how close dissimilar pairs of patches in the negative set $P_i^-$ are. No additional term is needed to encourage dissimilar pairs to become more distant, as this term accomplishes the goal of both encouraging similar pairs to be embedded more closely and dissimilar pairs to be embedded more distantly.

The temperature parameter $\tau$ controls how the distances used in the contrastive loss are scaled and was held fixed for all experiments here. $\mathrm{sim}(\cdot, \cdot)$ is the similarity function, which we chose to be cosine similarity for our implementations.
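One common form consistent with this description is an InfoNCE-style loss with cosine similarity and a temperature. The sketch below is an assumption about the functional form, not a transcription of the paper's equation: in particular, including the positive term in the denominator follows the convention of [3], and the temperature of 0.5 is only a placeholder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """-log( e^{sim(a,p)/t} / (e^{sim(a,p)/t} + sum_n e^{sim(a,n)/t}) ).

    anchor, positive: embeddings of a positive patch pair.
    negatives: embeddings of the anchor's negative patches.
    temperature: placeholder value, not taken from the source.
    """
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature)
              for n in negatives)
    return -math.log(pos / (pos + neg))


# The loss is small when the positive is aligned with the anchor and the
# negatives point elsewhere, and large in the reverse situation.
good = patch_contrastive_loss([1.0, 0.0], [1.0, 0.1], [[-1.0, 0.0], [0.0, -1.0]])
bad = patch_contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]])
print(good < bad)
```

Because the negatives appear only in the normalizing denominator, a single term both pulls similar pairs together and pushes dissimilar pairs apart, which matches the remark that no separate repulsion term is needed.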

Patch reconstruction loss

In addition to the contrastive loss, we ensure our embedding of each pixel retains meaningful information about the patch around it by utilizing a reconstruction loss. Specifically, for a patch $p$, denote its embedding by $z$. The loss penalizes the difference between $p$ and its reconstruction $g(z)$, where $g$ is a single fully-connected linear layer learned during training for reconstruction. By forcing the network to store all of the information about the patch necessary to reconstruct it from just the embedding, it ensures that the embedding has not lost any information during the convolutional processing.
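A minimal sketch of such a reconstruction term is shown below, assuming a mean squared error between the flattened patch and the output of the linear layer. The choice of squared error and the names `linear_decode` and `reconstruction_loss` are assumptions for illustration.

```python
def linear_decode(z, weights, bias):
    """Single fully-connected linear layer: one output per row of `weights`."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, bias)]


def reconstruction_loss(patch, z, weights, bias):
    """Mean squared error between a flattened patch and its reconstruction."""
    recon = linear_decode(z, weights, bias)
    return sum((p - r) ** 2 for p, r in zip(patch, recon)) / len(patch)


z = [1.0, 2.0]                       # toy 2-dimensional embedding
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
print(reconstruction_loss([1.0, 2.0, 3.0, 0.5], z, weights, bias))  # 0.0
print(reconstruction_loss([1.0, 2.0, 3.0, 1.5], z, weights, bias))  # 0.25
```

Keeping the decoder to a single linear layer means the embedding itself, rather than a deep decoder, has to carry the patch detail.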

Total loss

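The total objective function for our network is then the sum of the contrastive loss and the reconstruction loss, balanced with hyperparameters controlling the magnitude of each:

$\mathcal{L}_{\text{total}} = \lambda_{\text{contrast}} \, \mathcal{L}_{\text{contrast}} + \lambda_{\text{recon}} \, \mathcal{L}_{\text{recon}}$

We find our network to be strongly robust to the choice of these hyperparameters and hold $\lambda_{\text{contrast}}$ and $\lambda_{\text{recon}}$ fixed at a single setting that works effectively in all datasets considered here.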


Once we have our trained latent space, we then identify distinct segments of the image by clustering the embedding vectors. Because the pixels have been mapped to the embedding space such that they now form a locally Euclidean data manifold, we are able to use spectral k-means clustering to capture similar pixels [22]. Notably, this relatively simple clustering strategy would not be sufficient on the ambient pixel data as is, but the learned embedding has convenient properties, like being locally Euclidean, that the ambient data lacks. The choice of k for this clustering is naturally related to the complexity of the image and the desired segmentation, but heuristically we found that simply using a fixed value of 10 achieved satisfactory results on all of the experiments considered in this work.
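The step of clustering pixel embeddings and mapping the labels back onto the image grid can be sketched as follows. To stay dependency-free, this sketch substitutes plain Lloyd's k-means with a deterministic farthest-point initialization for the spectral k-means used in the paper; it illustrates only the embedding-to-label-map bookkeeping, and the function names are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means (a stand-in for spectral k-means)."""
    centers = [points[0]]
    while len(centers) < k:  # farthest-point initialization, deterministic
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def segment(embeddings, k):
    """embeddings[r][c] is the vector for pixel (r, c); returns a label map."""
    height, width = len(embeddings), len(embeddings[0])
    flat = [embeddings[r][c] for r in range(height) for c in range(width)]
    labels = kmeans(flat, k)
    return [labels[r * width:(r + 1) * width] for r in range(height)]


# Toy 2x4 embedding grid: left columns near (0, 0), right columns near (5, 5).
embeddings = [
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]],
    [[0.0, 0.1], [0.1, 0.1], [5.0, 5.1], [5.1, 5.1]],
]
label_map = segment(embeddings, k=2)
print(label_map)
```

The paper fixes k at 10; the toy example uses k=2 only to keep the output readable.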

3.2 Experiments

In our experiments, we choose three datasets designed to demonstrate the breadth of applications in which our model can achieve successful segmentations. These datasets are a natural image dataset followed by two biomedical imaging datasets: retinal images and MRI images. These datasets contain images with different characteristics and the regions of interest are distinguished in different ways, and thus a model would need to be flexible to capture the meaningful axes of variation in each of these contexts. As previous deep learning segmentation methods are supervised, we compare to unsupervised methods that are not deep learning frameworks: an Otsu’s method watershed segmentation and Felzenszwalb segmentation [16, 7]. We utilize several different metrics to add robustness to our analysis, including those based on intrinsic similarities of the patches and those based on human annotations. In addition, we demonstrate that all aspects of our framework individually add value to the overall model by performing ablation studies of the different components.

3.2.1 Natural images

Our first experiment consists of natural images taken from the Berkeley segmentation benchmark [13]. This dataset provides an important initial check of our model: although natural images have different characteristics from the medical images that are the primary focus of our work, we seek a model that is flexible enough to handle arbitrary image structures as well. Furthermore, natural images are often more intuitively analyzed by humans and thus serve as an important initial sanity check.

The dataset contains 200 natural images, processed at a resolution of 128x128. Our model contains three convolutional layers with progressively shrinking filter widths of 15, 11, and 7 respectively, filter depths of 32, 64, and 128 respectively, ReLU activation, and the standard batch norm between layers. Because our method yields pixel-level embeddings, no striding or pooling is used to reduce dimensionality. The final layer then maps the output of the convolutional layers into the embedding space with a fully-connected layer of size 128.

Method         SSIM            ERGAS           RMSE
Ours           0.348 ± 0.002   24563 ± 173     0.198 ± 0.002
Random         0.303 ± 0.001   39069 ± 1229    0.249 ± 0.003
Watershed      0.312 ± 0.002   36178 ± 981     0.242 ± 0.002
Felzenszwalb   0.335 ± 0.002   33715 ± 879     0.224 ± 0.001

Table 1: Similarity and distance scores across all models on the Berkeley natural image dataset. Our model produces segments with the highest similarity (SSIM) and lowest within-cluster distances (ERGAS, RMSE).

Figure 3: Example segmentations produced by our model on the retina GA region dataset. As can be seen, our model is able to identify the central region of atrophy that is of interest without being perturbed by the presence of blood vessels, a similarly colored retinal region, or the variability of color throughout the dataset of the regions of atrophy.

We analyze the segmentation produced by our model with a variety of metrics. To ensure our segments contain pixels which center patches that are similar to each other, we calculate several measures of similarity and compare these values for pixels in the same segment to pixels in a different segment. We capture different notions of similarity, from structural and textural image-specific metrics to a more general distance-based metric: the previously discussed SSIM, as well as the Relative Dimensionless Global Error (ERGAS), and the root mean squared error (RMSE) [9, 23, 20].

Example output from our model on this dataset can be seen in Figure 2. The contours of the different segmentations are seen to match those in the natural image it corresponds to. For example, the outlines of the different colors in the horse are highlighted with the automatically identified segmentation, as are the different grass shades. In the image of the flower pot, the flower is a clear segment while the background and the table are separate segments. These results provide an initial sanity check that the segmentations produced by our method are reasonable on natural images.

We next verify the segmentation via a series of quantitative experiments, whose results are reported in Table 1. We compare our method to baselines starting with random segmentation assignment of pixels using the same number of segments as our method (to give an uninformed baseline for each metric). Unsurprisingly, this random method underperforms our method at ensuring segments are relatively homogeneous within and heterogeneous between segments. We then compare to two additional segmentation methods that, like ours, are unsupervised: an Otsu’s method-thresholded, peak-finding watershed algorithm [1] and a graph-based image segmentation method we refer to as Felzenszwalb [7]. Results are reported as a mean and standard deviation across three independent trials.

3.2.2 Geographic Atrophy Detection in Color Fundus Photographs

Our next experiment considers an application in biological imaging, our main target motivation in developing our method. The dataset consists of images of retinas, with the goal of identifying and segmenting regions of geographic atrophy (GA). GA is a chronic progressive degeneration of the macula, as part of late-stage age-related macular degeneration (AMD). Automatically segmenting these regions computationally offers the ability to save time and resources of human ophthalmologists who would have to perform the painstaking task of hand-labeling these images. Moreover, as atrophy can take many forms and degrees of severity, two doctors may not agree on the segmentation. Quantitative and objective methods offer the opportunity for a more disciplined and rigorous approach to GA region segmentation.

Data Generation

Specifically, we extracted digitized color fundus photographs (CFPs) from all visits of eyes that had GA in at least 1 visit based on the original gradings by the age-related eye disease study (AREDS) group. To ensure the accuracy of GA segmentations, we excluded images with poor quality (i.e., the border of atrophic lesions could not be identified reliably by our graders), GA lesions extending beyond the imaging field, or GA lesions contiguous with peripapillary atrophy. ImageJ software [19] was employed by graders to manually segment GA lesions, outline the optic disc, and mark the foveal center. The gradings of each CFP were first performed by an independent trained non-expert grader, and were then reviewed and adjusted by an independent expert grader (senior ophthalmology residents) through standard ophthalmological processes [5, 14, 8, 12, 6].

Method         SSIM            ERGAS           RMSE
Ours           0.564           35339 ± 882     0.152 ± 0.001
Random         0.399           76756 ± 1021    0.255 ± 0.001
Watershed      0.404 ± 0.002   76253 ± 1387    0.256 ± 0.002
Felzenszwalb   0.448 ± 0.003   59191 ± 14243   0.226 ± 0.001

Table 2: Similarity and distance scores across all models on the retina GA region dataset. Our model produces segments with the highest similarity (SSIM) and lowest within-cluster distances (ERGAS, RMSE).

Figure 4: A comparison of our unsupervised model’s dice coefficient on the retina GA region dataset versus a completely supervised U-Net given increasing numbers of labels. Our model, which uses no labels, outperforms the supervised model when it is given less than 60% of the labels.

Experimental Results

We applied our unsupervised segmentation method to this dataset with the same architecture and hyperparameters as in the previous experiment. Likewise, we compare to the same baselines and use the same metrics.

We add one extra step to confirm that one of our segments corresponds accurately to the region of interest (the GA region). We use the expert labels to identify a single pixel located near the center of the region, calculated by finding the closest pixel to the median horizontal and vertical pixel value in the region. We then select the segment that this one pixel belongs to, analogous to how a practitioner would use our model: identifying which of the model-created segments to extract by selecting just a single pixel.
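The single-pixel selection step can be sketched as below. The tie-breaking rules (upper median for even counts, first pixel in row-major order among equally close candidates) are assumptions, as are the function names.

```python
def median_pixel(mask):
    """Region pixel closest to the (median row, median column) of a binary mask."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    rows = sorted(r for r, _ in coords)
    cols = sorted(c for _, c in coords)
    # Upper median for even counts (an arbitrary but deterministic choice).
    mr, mc = rows[len(rows) // 2], cols[len(cols) // 2]
    return min(coords, key=lambda rc: (rc[0] - mr) ** 2 + (rc[1] - mc) ** 2)


def select_segment(label_map, seed):
    """Binary mask of the model segment that contains the seed pixel."""
    target = label_map[seed[0]][seed[1]]
    return [[1 if v == target else 0 for v in row] for row in label_map]


expert_mask = [[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]]
label_map = [[0, 0, 0, 0],
             [0, 3, 3, 0],
             [0, 3, 3, 1],
             [1, 1, 1, 1]]
seed = median_pixel(expert_mask)
predicted = select_segment(label_map, seed)
print(seed)       # (2, 2)
print(predicted)
```

Only the seed location comes from the expert labels; the extent of the predicted region is determined entirely by the unsupervised label map.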

Example output can be seen in Figure 3. In each case, our method accurately picks out the region of atrophy, despite there being significant noise in some cases. For example, in each of the images shown the optic nerve presents as a similarly colored, but separate, area to the side of the region of atrophy. Our model successfully avoids including this in the same segment as the GA region. Moreover, the presence of blood vessels as well as varying degrees of atrophy (varying levels of color differentiation) exemplify the challenges in this task.

Beyond the qualitative visualization of results, quantitative experiments confirm the accuracy of our segmentation, which we show in Table 2. As in the previous experiment, our model creates segmentations with higher similarity (SSIM) and lower within-segment distances (ERGAS, RMSE) compared to the baselines.

In this dataset, we are able to add an additional quantitative measure of accuracy that uses the expert-curated ground truth identification of GA regions. We calculate the dice coefficient between the labels for each pixel belonging to the GA region or not and the predicted labels for each method [15]. In all methods, we select one segment as the prediction for the GA region via the median pixel method discussed above. As we can see in Table 3, our method produces a vastly superior dice coefficient on this dataset as compared to the baselines.
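For reference, the dice coefficient between two binary masks A and B is 2|A∩B| / (|A| + |B|). A minimal sketch follows; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the text.

```python
def dice_coefficient(pred, truth):
    """2|A∩B| / (|A| + |B|) for two binary masks of the same shape."""
    intersection = sum(1 for pred_row, truth_row in zip(pred, truth)
                       for p, t in zip(pred_row, truth_row) if p and t)
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
print(dice_coefficient(pred, truth))   # 2 * 1 / (2 + 1)
print(dice_coefficient(truth, truth))  # 1.0
```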

Method         Dice Coefficient
Ours           0.665 ± 0.004
Random         0.084 ± 0.002
Watershed      0.144 ± 0.003
Felzenszwalb   0.213 ± 0.004

Table 3: Dice coefficients between predicted and expert-annotated labels for all models on the retina GA region dataset. Our model produces labels that have the highest agreement with the ground truth labels, which are not used during training as ours is an unsupervised model.

Model                    Dice    SSIM    ERGAS   RMSE
Full model               0.665   0.564   35339   0.152
No contrastive loss      0.658   0.561   35423   0.153
No reconstruction loss   0.597   0.546   45831   0.166

Table 4: Ablation tests on our model with and without each aspect of the loss on the retina GA region dataset. The model with both contrastive and reconstruction losses performs the best across all metrics.

Figure 5: Example segmentations of ventricles on brain MR images. As can be seen, our model can identify the ventricles despite the variety of shapes, sizes, and pixel intensities of the ventricles and the surrounding brain tissues. Notably, the segmentation target can be non-contiguous on the image, though the disconnected portions are proximate within the MR image.

Comparison to supervised accuracy

In this section, we compare the results of our unsupervised method to a standard supervised method with varying amounts of supervision. We demonstrate here that while a supervised model given 100% of the labels outperforms our model, we obtain a similar accuracy to a model with 50% of the labels despite not using any labels ourselves. Most notably, our model significantly outperforms the supervised model given only a small amount (e.g. one pixel) of the labels.

To perform this experiment, we divide the dataset into a standard 90%/10% training/evaluation split and then train a U-Net with pixel-level supervision to match the ground truth labels [18]. We start by giving it 100% of the labels, and then gradually introduce more and more noise by multiplying the training labels by a salt-and-pepper vector of 0s and 1s drawn from a binomial distribution with decreasing probability of a 1 coming up. As shown in Figure 4, the accuracy of the supervised model drastically decreases as fewer un-noised supervised labels pass through.
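The label-noising procedure can be sketched as an element-wise product of the label mask with independent Bernoulli draws. The per-pixel independence, the seeding, and the name `noise_labels` are assumptions made for illustration.

```python
import random


def noise_labels(labels, keep_prob, seed=0):
    """Multiply each label by an independent 0/1 draw with P(1) = keep_prob."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [[v * (1 if rng.random() < keep_prob else 0) for v in row]
            for row in labels]


labels = [[0, 1, 1, 0],
          [1, 1, 1, 1],
          [0, 1, 1, 0]]
print(noise_labels(labels, keep_prob=1.0) == labels)  # True: nothing is dropped
print(noise_labels(labels, keep_prob=0.5))
```

Because the mask only multiplies, background pixels stay at 0 and the noise can only erase foreground labels, so lowering `keep_prob` reduces the amount of supervision that reaches the network.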

Ablation test

We also motivate our use of both the contrastive and reconstruction losses together on this dataset by performing ablation tests. In each test, we set the coefficient for the corresponding loss to zero ($\lambda_{\text{contrast}} = 0$ and $\lambda_{\text{recon}} = 0$, respectively, for the “no contrastive loss” and “no reconstruction loss” models) and run the model with the same parameters as before. We find that the model with both losses performs the best across all the metrics we have considered in this experiment (Table 4). This is understandable, as each provides a different necessary component to producing a meaningful embedding, and eventually, a meaningful segmentation of the original image.

3.2.3 MRI images

In our next experiment, we analyze the performance of our network on a dataset of brain magnetic resonance (MR) images obtained from clinical patients. This dataset includes 100 patients with Alzheimer’s dementia or mild cognitive impairment. We attempt to automatically segment the brain ventricles on the MR images of these patients. Segmenting ventricles is clinically important because the volume of brain ventricles can predict the progression of dementia. The ground truth has been identified from human annotated labels and is used in evaluating performance exclusively, as in the other cases, as our method is entirely unsupervised in training.

Challenges in this dataset include an increased complexity in shape and size of the segmentation target. Since the MR imaging plane (i.e. orientation of image acquisition relative to the brain) can vary between patients, the shape and size of the ventricles can significantly vary between patients. Additionally, the pixel intensity alone cannot segment brain ventricles since both the background and other brain regions can have similar pixel intensity as the ventricles.

We show examples of the output of our model in Figure 5. Our model is able to identify the area of the ventricles in a wide variety of settings. The ventricles can assume different shapes on each MR image, as well as different sizes and different pixel intensities. Also, it is notable that the segments are not contiguous in some cases. Despite this challenge, our model is able to keep the similar regions of ventricles in the same segmentation unit.

We look at our results quantitatively using the same criteria as in the previous datasets. In Table 5, we see the same pattern occur as compared to the alternative methods using the similarity and distance based metrics of SSIM, ERGAS, and RMSE. Our method produces segments that have the highest similarity within each of the produced segments. In Table 6, we see the results are also upheld when calculating the dice coefficient with the human annotated labels.

Method          SSIM             ERGAS           RMSE
Ours            0.131            40434 ± 732     0.392 ± 0.007
Random          0.097            44913 ± 1103    0.413 ± 0.002
Watershed       0.098 ± 0.002    42911 ± 1217    0.410 ± 0.004
Felzenszwalb    0.118 ± 0.004    42530 ± 1068    0.394 ± 0.006
Table 5: Similarity and distance scores across all models on the MRI ventricles dataset. Our model produces segments with the highest similarity (SSIM) and lowest within-cluster distances (ERGAS, RMSE).
Method          Dice coefficient
Ours            0.562 ± 0.017
Random          0.091 ± 0.002
Watershed       0.203 ± 0.021
Felzenszwalb    0.204 ± 0.008
Table 6: Dice coefficients between predicted and expert-annotated labels for all models on the MRI ventricles dataset. Our model produces labels that have the highest agreement with the ground truth labels, which are not used during training as ours is an unsupervised model.
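For readers unfamiliar with the metric reported in Table 6, the following is a minimal sketch of the standard Dice coefficient between two binary masks, Dice = 2|A ∩ B| / (|A| + |B|). It is an illustration only and not the evaluation code used for the tables: the function name `dice_coefficient`, the representation of masks as lists of rows, and the convention that two empty masks score 1.0 are all assumptions made here. It also assumes the predicted segment has already been reduced to a single binary mask for the ventricles.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal size.

    `pred` and `target` are 2D masks given as lists of rows; any truthy
    value counts as foreground. (Hypothetical helper for illustration.)
    """
    # Flatten both masks into 1D lists of booleans.
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")

    # Count pixels that are foreground in both masks.
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)

    # Assumed convention: two empty masks are in perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: one overlapping pixel, three foreground pixels in total,
# so the score is 2 * 1 / 3.
predicted_mask = [[1, 1],
                  [0, 0]]
expert_mask = [[1, 0],
               [0, 0]]
score = dice_coefficient(predicted_mask, expert_mask)
```

A score of 1 means the predicted and expert masks coincide exactly, and a score of 0 means they share no foreground pixels.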

4 Discussion

Our model represents a step forward in automatic image segmentation, using a combination of contrastive learning and autoencoder-like reconstruction to segment meaningful regions of an image without the use of any labels. Because this offers the opportunity to increase both efficiency and objectivity in the image segmentation task, it has exciting applications in the field of medical imaging, where these features are both vital. Since general applicability is a major goal of this work, namely a model that can work in a wide variety of medical imaging applications, we view future work as exploring the performance of our model on even more modalities and technologies. As new imaging technologies are being developed at a rapid pace, applicability to as many of them as possible is an important quality for widespread adoption.


  • [1] S. L. Bangare, A. Dubal, P. S. Bangare, and S. Patil (2015) Reviewing otsu’s method for image thresholding. International Journal of Applied Engineering Research 10 (9), pp. 21777–21783. Cited by: §3.2.1.
  • [2] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu (2020) Contrastive learning of global and local features for medical image segmentation with limited annotations. External Links: 2006.10511 Cited by: §2.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: §2.
  • [4] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Cited by: §2.
  • [5] R. P. Danis, A. Domalpally, E. Y. Chew, T. E. Clemons, J. Armstrong, J. P. SanGiovanni, and F. L. Ferris (2013) Methods and reproducibility of grading optimized digital color fundus photographs in the age-related eye disease study 2 (areds2 report number 2). Investigative ophthalmology & visual science 54 (7), pp. 4548–4554. Cited by: §3.2.2.
  • [6] M. D. Davis, R. E. Gangnon, L. Y. Lee, L. D. Hubbard, B. Klein, R. Klein, F. L. Ferris, S. B. Bressler, R. C. Milton, et al. (2005) The age-related eye disease study severity scale for age-related macular degeneration: areds report no. 17. Archives of ophthalmology (Chicago, Ill.: 1960) 123 (11), pp. 1484–1498. Cited by: §3.2.2.
  • [7] P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. International journal of computer vision 59 (2), pp. 167–181. Cited by: §3.2.1, §3.2.
  • [8] M. Fleckenstein, S. Schmitz-Valckenberg, C. Martens, S. Kosanetzky, C. K. Brinkmann, G. S. Hageman, and F. G. Holz (2011) Fundus autofluorescence and spectral-domain optical coherence tomography characteristics in a rapidly progressing form of geographic atrophy. Investigative ophthalmology & visual science 52 (6), pp. 3761–3766. Cited by: §3.2.2.
  • [9] A. Hore and D. Ziou (2010) Image quality metrics: psnr vs. ssim. In 2010 20th international conference on pattern recognition, pp. 2366–2369. Cited by: §3.1, §3.2.1.
  • [10] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert, et al. (2018) Nnu-net: self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486. Cited by: §2.
  • [11] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. Eslami, D. Jimenez Rezende, and O. Ronneberger (2018) A probabilistic u-net for segmentation of ambiguous images. Advances in neural information processing systems 31. Cited by: §2.
  • [12] M. Marsiglia, S. Boddu, S. Bearelly, L. Xu, B. E. Breaux, K. B. Freund, L. A. Yannuzzi, and R. T. Smith (2013) Association between geographic atrophy progression and reticular pseudodrusen in eyes with dry age-related macular degeneration. Investigative ophthalmology & visual science 54 (12), pp. 7362–7369. Cited by: §3.2.2.
  • [13] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001-07) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, Vol. 2, pp. 416–423. Cited by: §3.2.1.
  • [14] M. M. Mauschitz, S. Fonseca, P. Chang, A. P. Göbel, M. Fleckenstein, G. J. Jaffe, F. G. Holz, S. Schmitz-Valckenberg, G. S. Group, et al. (2012) Topography of geographic atrophy in age-related macular degeneration. Investigative ophthalmology & visual science 53 (8), pp. 4932–4939. Cited by: §3.2.2.
  • [15] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pp. 565–571. Cited by: §3.2.2.
  • [16] H. Oh, K. Lim, and S. Chien (2005) An improved binarization algorithm based on a water flow model for document image with inhomogeneous backgrounds. Pattern Recognition 38 (12), pp. 2612–2625. Cited by: §3.2.
  • [17] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. (2018) Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. Cited by: §2.
  • [18] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.2.2.
  • [19] C. T. Rueden and K. W. Eliceiri (2019) ImageJ for the next generation of scientific image data. Microscopy and Microanalysis 25 (S2), pp. 142–143. Cited by: §3.2.2.
  • [20] L. Wald (2000) Quality of high resolution synthesised images: is there a simple criterion? In Third conference "Fusion of Earth data: merging point measurements, raster maps and remotely sensed images", pp. 99–103. Cited by: §3.2.1.
  • [21] K. Yan, J. Cai, D. Jin, S. Miao, A. P. Harrison, D. Guo, Y. Tang, J. Xiao, J. Lu, and L. Lu (2020) Self-supervised learning of pixel-wise anatomical embeddings in radiological images. External Links: 2012.02383 Cited by: §2.
  • [22] H. Zha, X. He, C. Ding, M. Gu, and H. Simon (2001) Spectral relaxation for k-means clustering. Advances in neural information processing systems 14. Cited by: §3.1.
  • [23] Y. Zhang (2008) Methods for image fusion quality assessment-a review, comparison and analysis. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (PART B7), pp. 1101–1109. Cited by: §3.2.1.
  • [24] X. Zhao, R. Vemulapalli, P. Mansfield, B. Gong, B. Green, L. Shapira, and Y. Wu (2021) Contrastive learning for label-efficient semantic segmentation. External Links: 2012.06985 Cited by: §2.
  • [25] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11. Cited by: §2.