While supervised deep learning approaches have been highly successful in segmenting images, several issues arise in applying them to biomedical images, particularly when the goal is to make clinical inferences. First, these networks rely on expert-driven annotations, which cover a smaller and smaller fraction of the images being generated as the volume of imaging in various contexts explodes. Second, supervised classification networks trained on one set of annotated images can fail to generalize to similar image sets collected in even slightly different contexts (for example, in different hospitals or on different instruments). A major reason for this is that supervised networks can easily overfit to the classification task, memorizing idiosyncratic pixel patterns or artifacts in the images rather than learning generally meaningful representations. To remedy this, we propose an entirely unsupervised neural network that leverages recent advances in representation learning and is built to automatically segment biomedical images.
Our framework, which we denote Contrastive and Unsupervised Training for Segmentation (CUTS), was named as an homage to the renowned painter Henri Matisse, who famously used a “cut-up” method he called “drawing with scissors” to assemble an image from patches of material taken from different sources. This method motivates our foundational view that meaningful image segmentations are composed of generally contiguous regions of similar color and texture grouped together within an image.
To induce a neural network to naturally group these “cut-up” parts of images at fine resolution, we rely on two design choices. First, our network produces pixel-level embeddings but operates on patches centered around each pixel, yielding a latent-space embedding of each pixel that reflects its surrounding patch. We refer to this approach as learning functions of pixel-centered patches. Second, since we want our network to be unsupervised, we wish to create an embedding in a hidden layer of the network such that pixels in the same part of the image are grouped together, while pixels in different areas are separated. To induce this, we combine two losses motivated by two distinct strands of research in the community: contrastive learning and autoencoding.
In the contrastive learning aspect of the loss, each pixel-centered patch is processed along with another patch, and the network produces embeddings that are close to each other if the patches are similar and distant from each other if the patches are dissimilar. This contrastive loss, when properly tuned with carefully chosen sets of similar and dissimilar patches obtained with a heuristic-guided search of each image, provides a framework for a latent space embedding that is similar to word embeddings based on context. We use this to ensure that nearby regions, and thus natural segments of the image, are embedded nearby in the learned latent space, while still remaining separated along the natural boundaries between segments.
In the autoencoding aspect of the loss, the network must be able to reconstruct the patch around each pixel from just its embedding representation. This ensures that fine-grained detail of the local patch is well represented in the embedding and no information is lost in the layers of the network. For example, consider the embedding of two pixel-centered patches that each contain a boundary in the image, but the first is centered slightly on one side of the boundary and the second is centered slightly on the other side of the boundary. These two patches are largely similar to each other in aggregate, but to segment the boundary precisely, we would need to retain the information of exactly where the boundary is located in the patch. This reconstruction loss ensures this information is retained as necessary.
Finally, once we have a sufficiently structured latent space, we find regions of interest by applying a modified spectral clustering to the embedding space. We demonstrate that mapping the clusters obtained from the embedding back to their corresponding pixels produces a meaningful segmentation in the original image space.
Our primary motivation comes from applications in medical imaging specifically. We demonstrate that our method generalizes to a variety of different contexts and could be used on anything from natural images to tumors. This is an important criterion for widespread adoption in the medical imaging community, as opposed to being narrowly tailored to work in only one particular application area, because imaging is a broad and fast-growing field consisting of an increasing number of technologies that produce data with various statistical properties. We first apply our method to natural images as an initial performance check in an easily interpretable setting. We then show that our network performs unsupervised segmentation of difficult-to-partition regions of geographic atrophy in a specially curated set of retinal images. We finally demonstrate that our method can segment regions of the brain in MRI data.
Our main contributions include:
- Specification of a pixel-centered, patch-level convolutional architecture for producing contextualized representations of the pixels.
- Introduction of a contrastive loss (including pixel proximity heuristics supplemented with SSIM-based checks).
- A framework that combines contrastive learning with a reconstruction loss for learning a meaningful embedding space.
- A spectral clustering method for zeroing in on contiguous regions of potential interest to clinicians.
The remainder of this paper is organized as follows. We first discuss related work in the field to provide context. We then introduce our model framework, including detailing the novel losses and neural network architecture. Finally, we demonstrate our model on a series of natural image and biological image datasets, coupled with qualitative and quantitative evaluations of its performance and comparison to baseline alternatives.
2 Related Work
Contrastive learning has been used for image classification to address the issue of limited annotated images. However, commonly used contrastive learning methods have often focused on extracting image-level representations with a global contrastive loss. The use of image-level contrastive learning is fundamentally different from our patch-level approach. Image-level contrastive learning yields no information about intra-image variation or features, and thus is inapplicable to the common setting of analyzing different areas within an image. Instead, it generally tries to solve the problem of whether two images share the same property, such as whether they depict the same person. Recent efforts have moved towards developing local pixel- or patch-level representations that can be used for downstream tasks. In this study we focus on the latter: learning pixel-level representations for segmentation tasks. However, an important difference is that our work uses unsupervised training end-to-end in obtaining its representations.
Previous work has used patch-level contrastive training but limited its application to pre-training a supervised model that was then fine-tuned with labeled data. That work combined a global contrastive loss and a local contrastive loss to learn both image representations and patch representations. Other work has used a similar U-net structure but combined image-level and pixel-level representations. In both cases, notably, this was a pre-training step prior to supervised training rather than a standalone representation-learning method.
While these are supervised models that are pre-trained with unsupervised contrastive learning, an alternative setting in the previous literature is supervised contrastive learning. In this setting, labeled data (which we do not have in our applications) is used to construct positive and negative pairs for learning pixel-level representations. In contrast, we directly use convolutional neural networks to learn pixel embeddings with a pixel-level contrastive loss and without labels. The segmentation is achieved through clustering, which makes our model fully unsupervised, unlike previous methods that incorporate labels at some point in the segmentation.
We note that the proliferation of U-net-based models still targets the task of supervised learning: with a U-net, the output is directly in pixel space, whereas our model outputs representations in a learned embedding space. Thus, this work is easily differentiated from works focused on minor variations of the U-net architecture, such as adding skip connections, attention layers, or adaptive kernels [25, 17, 10, 4, 11].
3.1 The CUTS Framework
In this section, we detail the framework for our model (Figure 1). The framework is composed of a convolutional neural network (CNN) that maps from the pixel-centered patch space to an embedding space, where two losses are applied: a contrastive loss and a patch reconstruction loss. We will discuss each of these parts in detail next.
Convolutional neural network
We start by processing the image with a CNN to produce a representation of each pixel in an embedding space that we learn during training. Our goal is to also include local information about the area around the pixel in the embedding, and thus we process the image with convolutional layers that are exposed to a patch centered at the pixel of interest. We initially use a convolutional layer with a wide filter to incorporate the broadest contextual information about the patch, and then progressively shrink the width of the kernel as we go deeper into the network to concentrate on the area closest to the pixel itself. While this pyramid shape is general, we use filter widths of 15, 11, and 7, respectively, for this progressively shrinking set of filters. After processing with the convolutional layers, each pixel representation has a fully-connected layer applied to it to map it into the embedding space.
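As a rough check on how much context each pixel embedding sees, the sketch below computes the receptive field of the three stacked filters. It assumes stride 1 and no dilation, and the helper name `receptive_field` is ours, introduced only for illustration.

```python
def receptive_field(kernel_widths, strides=None):
    """Width of the input patch seen by one output pixel of stacked convolutions."""
    strides = strides or [1] * len(kernel_widths)
    field, jump = 1, 1
    for k, s in zip(kernel_widths, strides):
        field += (k - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= s                 # stride multiplies the step between neighbouring outputs
    return field


widths = [15, 11, 7]                            # filter widths stated in the text
patch_width = receptive_field(widths)           # assumes stride 1 throughout
same_padding = [(k - 1) // 2 for k in widths]   # padding that preserves spatial size
print(patch_width, same_padding)                # prints: 31 [7, 5, 3]
```

Under these assumptions each pixel embedding is a function of a 31 x 31 patch centered on that pixel, and padding of 7, 5, and 3 pixels per layer keeps one embedding per input pixel.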
To ensure the network places patches that are similar to each other nearby and patches that are dissimilar far away in the embedding, we employ a contrastive loss. A contrastive loss takes in pairs of so-called “positive examples” and “negative examples” where positive examples are pairs of points that should be embedded as close to each other as possible while the negative examples are pairs of points that should be embedded as far from each other as possible.
As our method is completely unsupervised, we use a heuristic to identify pairs of patches to serve as positive and negative examples. Our heuristic initially uses location within the image to generate candidates. We identify potential positive pairs by starting with a patch centered at a given pixel, which we call the anchor patch, and scanning patches centered at nearby pixels (within 5 pixels of the given pixel).
We then use the structural similarity index (SSIM) to verify that these patches are similar to each other. If they meet or exceed a similarity threshold, set as a hyperparameter of our model, they are used as a positive pair. Negative pairs for the anchor patch are selected in a similar fashion: patches centered at pixels that are not nearby are the initial candidates, subject to the same SSIM verification, except that their similarity must fall below the threshold. In this way, we can generate equal numbers of positive and negative pairs for our contrastive loss without needing supervised labels in any way, and build invariances into our embedding. Note that each pixel-centered patch has its own set of positive patches and its own set of negative patches.
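The pair-selection heuristic can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `global_ssim` computes SSIM from whole-patch statistics instead of the usual sliding window, "within 5 pixels" is read as a Chebyshev distance, and the patch half-width, threshold, and function names are hypothetical placeholders.

```python
import numpy as np


def global_ssim(a, b, data_range=1.0):
    """SSIM from whole-patch statistics (a simplification of windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))


def patch_at(image, i, j, half):
    """Square patch of width 2*half+1 centered at (i, j); must lie inside the image."""
    return image[i - half:i + half + 1, j - half:j + half + 1]


def select_pairs(image, i, j, half=3, radius=5, threshold=0.5, n_far=200, seed=0):
    """Return centers of positive and negative partner patches for anchor (i, j)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    anchor = patch_at(image, i, j, half)
    # Positive candidates: every in-bounds center within `radius` of the anchor.
    rows = range(max(half, i - radius), min(h - half - 1, i + radius) + 1)
    cols = range(max(half, j - radius), min(w - half - 1, j + radius) + 1)
    near = [(r, c) for r in rows for c in cols if (r, c) != (i, j)]
    # Negative candidates: random in-bounds centers farther than `radius`.
    far = []
    for _ in range(n_far):
        r, c = int(rng.integers(half, h - half)), int(rng.integers(half, w - half))
        if max(abs(r - i), abs(c - j)) > radius:
            far.append((r, c))
    # SSIM verification: positives at or above the threshold, negatives below it.
    positives = [p for p in near if global_ssim(anchor, patch_at(image, *p, half)) >= threshold]
    negatives = [p for p in far if global_ssim(anchor, patch_at(image, *p, half)) < threshold]
    n = min(len(positives), len(negatives))   # keep equal numbers of each
    return positives[:n], negatives[:n]
```

Calling `select_pairs(image, i, j)` on a 2-D grayscale array with values in [0, 1] returns two equally sized lists of patch centers for one anchor.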
Let $f$ be the convolutional neural network mapping. The patch-level contrastive loss for each positive patch pair is defined as:
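One form of this loss that is consistent with the surrounding description is sketched here, assuming an InfoNCE-style normalization over the negative set. The symbols are notation assumed for this sketch: $p$ is the anchor patch, $p^{+}$ one of its positive partners, $P^{-}$ its set of negative patches, $f$ the network mapping, $\mathrm{sim}$ the similarity function, and $\tau$ the temperature.

```latex
\ell_{\text{contrast}}(p, p^{+})
  = -\log
    \frac{\exp\!\big(\mathrm{sim}(f(p), f(p^{+})) / \tau\big)}
         {\sum_{p^{-} \in P^{-}} \exp\!\big(\mathrm{sim}(f(p), f(p^{-})) / \tau\big)}
```

Some contrastive formulations also include the positive term in the denominator; which variant is intended here is not something the text settles.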
Note that this loss encourages the network to embed pairs of patches in the positive set close to each other, with closeness scaled by how close the dissimilar pairs in the negative set are. No additional term is needed to push dissimilar pairs apart, as this single term both encourages similar pairs to be embedded more closely and dissimilar pairs to be embedded more distantly.
The temperature parameter $\tau$ controls how the distances used in the contrastive loss are scaled and was fixed to a single value for all experiments here. The similarity function was chosen to be cosine similarity in our implementation.
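A numerical sketch of such a loss for one positive pair is given below. It assumes the negatives-only denominator described in the text; the function names are ours, and the default temperature of 0.5 is a placeholder, not the value used in the experiments.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between vector u and each row of matrix v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return v @ u


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """-log( exp(sim(a, p)/t) / sum_n exp(sim(a, n)/t) ) for one positive pair."""
    pos = cosine_similarity(anchor, positive[None, :])[0] / temperature
    neg = cosine_similarity(anchor, negatives) / temperature
    # Log-sum-exp over the negatives, shifted by the max for numerical stability.
    log_denominator = neg.max() + np.log(np.exp(neg - neg.max()).sum())
    return float(log_denominator - pos)


anchor = np.array([1.0, 0.0])
negatives = np.array([[-1.0, 0.0], [0.0, 1.0]])
close = contrastive_loss(anchor, np.array([1.0, 0.1]), negatives)
far = contrastive_loss(anchor, np.array([0.1, 1.0]), negatives)
print(close < far)   # a well-aligned positive pair yields the lower loss
```

Because the denominator contains only negatives, the value can be negative; only its relative size matters during optimization.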
Patch reconstruction loss
In addition to the contrastive loss, we ensure that our embedding of each pixel retains meaningful information about the patch around it by using a reconstruction loss. Specifically, for each patch we take its embedding and penalize the error in reconstructing the patch from that embedding. The loss is:
Here the reconstruction is computed by a single fully-connected linear layer that is learned during training. Forcing the network to store all of the information about the patch necessary to reconstruct it from just the embedding ensures that the embedding has not lost information during the convolutional processing.
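A form consistent with this description is sketched here, assuming a squared-error penalty (the norm actually used is not stated in the text). The notation is assumed for this sketch: $p$ is a patch, $f$ the network mapping, $z = f(p)$ its embedding, and $g$ the reconstruction layer.

```latex
\ell_{\text{recon}}(p) = \big\lVert\, p - g(z) \,\big\rVert_2^2 , \qquad z = f(p)
```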
The total objective function for our network is then the sum of the contrastive loss and the reconstruction loss, balanced with hyperparameters controlling the magnitude of each:
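Written out, a weighted sum of this kind takes the following shape. The names $\lambda_{\text{contrast}}$ and $\lambda_{\text{recon}}$ for the two balancing hyperparameters are assumed here for illustration.

```latex
\mathcal{L}
  = \lambda_{\text{contrast}} \, \ell_{\text{contrast}}
  + \lambda_{\text{recon}} \, \ell_{\text{recon}}
```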
We find our network to be robust to the choice of these hyperparameters and keep their values fixed across all datasets considered here, where they work effectively.
Once we have our trained latent space, we then identify distinct segments of the image by clustering the embedding vectors. Because the pixels have been mapped to the embedding space such that they now form a locally Euclidean data manifold, we are able to use spectral k-means clustering to capture similar pixels. Notably, this relatively simple clustering strategy would not be sufficient on the ambient pixel data as is, but the learned embedding has these convenient properties like being locally Euclidean that the ambient data lacks. The choice of k for this clustering is naturally related to the complexity of the image and the desired segmentation, but heuristically we found simply using a fixed value of 10 achieved satisfactory results on all of the experiments considered in this work.
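The clustering step can be illustrated with a textbook normalized spectral clustering followed by k-means. This is a generic sketch and not necessarily the modified variant used in this work; the Gaussian affinity, the bandwidth `sigma`, and the farthest-point initialization are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


def spectral_kmeans(embeddings, k, sigma=1.0):
    """Cluster the rows of `embeddings` using a Gaussian affinity graph and k-means."""
    sq = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-sq / (2 * sigma ** 2))
    degree = affinity.sum(axis=1)
    normalized = affinity / np.sqrt(np.outer(degree, degree))   # D^{-1/2} W D^{-1/2}
    _, vectors = np.linalg.eigh(normalized)                     # eigenvalues ascending
    top = vectors[:, -k:]                                       # k leading eigenvectors
    top = top / np.linalg.norm(top, axis=1, keepdims=True)      # row-normalize
    return kmeans(top, k)
```

For an image of height `h` and width `w` with one embedding per pixel, `spectral_kmeans(z.reshape(h * w, -1), k=10).reshape(h, w)` would map cluster labels back to pixel space. The dense affinity matrix used here grows quadratically with the number of pixels, so this sketch suits only small inputs.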
In our experiments, we choose three datasets to demonstrate the breadth of applications in which our model can achieve successful segmentations. These datasets are a natural image dataset followed by two biomedical imaging datasets: retinal images and MR images. These datasets contain images with different characteristics, and their regions of interest are distinguished in different ways; a model therefore needs to be flexible to capture the meaningful axes of variation in each of these contexts. As previous deep learning segmentation methods are supervised, we compare to unsupervised methods that are not deep learning frameworks: an Otsu’s-method watershed segmentation and Felzenszwalb segmentation [16, 7]. We use several different metrics to add robustness to our analysis, including those based on intrinsic similarities of the patches and those based on human annotations. In addition, we demonstrate through ablation studies that each component of our framework individually adds value to the overall model.
3.2.1 Natural images
Our first experiment consists of natural images taken from the Berkeley segmentation benchmark. This dataset provides an important initial check of our model: although natural images have different characteristics from the medical images that are the primary focus of our work, we seek a model that is flexible enough to handle arbitrary image structures as well. Furthermore, natural images are often more intuitively analyzed by humans and thus serve as an important initial sanity check.
The dataset contains 200 natural images, processed at a resolution of 128x128. Our model contains three convolutional layers with progressively shrinking filter widths of 15, 11, and 7 respectively, filter depths of 32, 64, and 128 respectively, ReLU activation, and the standard batch norm between layers. Because our method yields pixel-level embeddings, no striding or pooling is used to reduce dimensionality. The final layer then maps the output of the convolutional layers into the embedding space with a fully-connected layer of size 128.
| Method | SSIM | ERGAS | RMSE |
|---|---|---|---|
| Ours | 0.348 ± 0.002 | 24563 ± 173 | 0.198 ± 0.002 |
| Random | 0.303 ± 0.001 | 39069 ± 1229 | 0.249 ± 0.003 |
| Watershed | 0.312 ± 0.002 | 36178 ± 981 | 0.242 ± 0.002 |
| Felzenszwalb | 0.335 ± 0.002 | 33715 ± 879 | 0.224 ± 0.001 |
We analyze the segmentation produced by our model with a variety of metrics. To ensure our segments contain pixels which center patches that are similar to each other, we calculate several measures of similarity and compare these values for pixels in the same segment to pixels in a different segment. We capture different notions of similarity, from structural and textural image-specific metrics to a more general distance-based metric: the previously discussed SSIM, as well as the Relative Dimensionless Global Error (ERGAS), and the root mean squared error (RMSE) [9, 23, 20].
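For reference, the two distance-based metrics can be written compactly as below. The ERGAS expression follows the commonly cited definition, 100 times a resolution ratio times the root of the band-averaged squared relative RMSE. The `ratio` default of 1.0 and the convention that bands lie on the last axis are assumptions of this sketch, and how the metrics are aggregated over patch pairs within and between segments is not specified here.

```python
import numpy as np


def rmse(a, b):
    """Root mean squared error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


def ergas(reference, other, ratio=1.0):
    """100 * ratio * sqrt(mean over bands of RMSE_k^2 / mean_k^2); bands on last axis."""
    reference = reference.reshape(-1, reference.shape[-1])
    other = other.reshape(-1, other.shape[-1])
    band_mse = np.mean((reference - other) ** 2, axis=0)   # squared RMSE per band
    band_mean = reference.mean(axis=0)                     # mean of each reference band
    return float(100.0 * ratio * np.sqrt(np.mean(band_mse / band_mean ** 2)))
```

Both functions return 0 for identical inputs, and lower values indicate more similar patches.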
Example output from our model on this dataset can be seen in Figure 2. The contours of the different segmentations are seen to match those in the natural image it corresponds to. For example, the outlines of the different colors in the horse are highlighted with the automatically identified segmentation, as are the different grass shades. In the image of the flower pot, the flower is a clear segment while the background and the table are separate segments. These results provide an initial sanity check that the segmentations produced by our method are reasonable on natural images.
We next verify the segmentation via a series of quantitative experiments, whose results are reported in Table 1. We compare our method to baselines, starting with a random assignment of pixels to segments using the same number of segments as our method (to give an uninformed baseline for each metric). Unsurprisingly, this random method underperforms our method at ensuring that segments are relatively homogeneous within and heterogeneous between. We then compare to two additional segmentation methods that, like ours, are unsupervised: an Otsu’s method-thresholded, peak-finding watershed algorithm and a graph-based image segmentation method we refer to as Felzenszwalb. Results are reported as a mean and standard deviation across three independent trials.
3.2.2 Geographic Atrophy Detection in Color Fundus Photographs
Our next experiment considers an application in biological imaging, our main target motivation in developing our method. The dataset consists of images of retinas, with the goal of identifying and segmenting regions of geographic atrophy (GA). GA is a chronic progressive degeneration of the macula, as part of late-stage age-related macular degeneration (AMD). Automatically segmenting these regions computationally offers the ability to save time and resources of human ophthalmologists who would have to perform the painstaking task of hand-labeling these images. Moreover, as atrophy can take many forms and degrees of severity, two doctors may not agree on the segmentation. Quantitative and objective methods offer the opportunity for a more disciplined and rigorous approach to GA region segmentation.
Specifically, we extracted digitized color fundus photographs (CFPs) from all visits of eyes that had GA in at least one visit based on the original gradings by the age-related eye disease study (AREDS) group. To ensure the accuracy of GA segmentations, we excluded images with poor quality (i.e., the border of atrophic lesions could not be identified reliably by our graders), GA lesions extending beyond the imaging field, or GA lesions contiguous with peripapillary atrophy. ImageJ software was employed by graders to manually segment GA lesions, outline the optic disc, and mark the foveal center. The gradings of each CFP were first performed by an independent trained non-expert grader, and were then reviewed and adjusted by an independent expert grader (a senior ophthalmology resident) through standard ophthalmological processes [5, 14, 8, 12, 6].
| Method | SSIM | ERGAS | RMSE |
|---|---|---|---|
| Ours | 0.564 | 35339 ± 882 | 0.152 ± 0.001 |
| Random | 0.399 | 76756 ± 1021 | 0.255 ± 0.001 |
| Watershed | 0.404 ± 0.002 | 76253 ± 1387 | 0.256 ± 0.002 |
| Felzenszwalb | 0.448 ± 0.003 | 59191 ± 14243 | 0.226 ± 0.001 |
We applied our unsupervised segmentation method to this dataset with the same architecture and hyperparameters as in the previous experiment. Likewise, we compare against the same baselines using the same metrics.
We add one extra step to confirm that one of our segments corresponds accurately to the region of interest (the GA region). We use the expert labels to identify a single pixel located near the center of the region, calculated by finding the pixel closest to the median horizontal and vertical pixel coordinates of the region. We then select the segment to which this pixel belongs, analogous to how a practitioner would use our model: choosing which model-created segment to extract by selecting just a single pixel.
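The single-pixel selection step might look like the following sketch. It assumes the seed is restricted to pixels inside the labeled region and that distance to the median coordinates is Euclidean; the function names are hypothetical.

```python
import numpy as np


def seed_pixel(mask):
    """Pixel of the labeled region closest to its median row and median column."""
    rows, cols = np.nonzero(mask)
    target = np.array([np.median(rows), np.median(cols)])
    coords = np.stack([rows, cols], axis=1)
    nearest = np.argmin(((coords - target) ** 2).sum(axis=1))
    return tuple(int(v) for v in coords[nearest])


def select_segment(segmentation, mask):
    """Binary map of the predicted segment that contains the seed pixel."""
    r, c = seed_pixel(mask)
    return segmentation == segmentation[r, c]
```

Here `mask` is the expert label map and `segmentation` is an integer map of segment ids; in practical use the seed would come from a clinician's click instead of from labels.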
Example output can be seen in Figure 3. In each case, our method accurately picks out the region of atrophy, despite there being significant noise in some cases. For example, in each of the images shown the optic nerve presents as a similarly colored, but separate, area to the side of the region of atrophy. Our model successfully avoids including this in the same segment as the GA region. Moreover, the presence of blood vessels as well as varying degrees of atrophy (varying levels of color differentiation) exemplify the challenges in this task.
Beyond the qualitative visualization of results, quantitative experiments confirm the accuracy of our segmentation, which we show in Table 2. As in the previous experiment, our model creates segmentations with higher similarity (SSIM) and lower within-segment distances (ERGAS, RMSE) compared to the baselines.
In this dataset, we are able to add an additional quantitative measure of accuracy that uses the expert-curated ground truth identification of GA regions. We calculate the Dice coefficient between the ground-truth labels, which mark each pixel as belonging to the GA region or not, and the predicted labels for each method. For all methods, we select one segment as the prediction for the GA region via the median pixel method discussed above. As we can see in Table 3, our method produces a substantially higher Dice coefficient on this dataset than the baselines.
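For concreteness, the Dice coefficient between two binary masks is twice the size of their overlap divided by the sum of their sizes. The sketch below treats two empty masks as perfect agreement, which is a convention chosen for this example.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0   # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / total)


print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))   # prints: 0.5
```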
| Model | Dice | SSIM | ERGAS | RMSE |
|---|---|---|---|---|
| No contrastive loss | 0.658 | 0.561 | 35423 | 0.153 |
| No reconstruction loss | 0.597 | 0.546 | 45831 | 0.166 |
Comparison to supervised accuracy
In this section, we compare the results of our unsupervised method to a standard supervised method with varying amounts of supervision. We demonstrate that while a supervised model given 100% of the labels outperforms our model, we obtain accuracy similar to that of a model given 50% of the labels, despite not using any labels ourselves. Most notably, our model significantly outperforms the supervised model given only a small amount of the labels (e.g., one pixel).
To perform this experiment, we divide the dataset into a standard 90%/10% training/evaluation split and then train a U-Net with pixel-level supervision to match the ground truth labels. We start by giving it 100% of the labels, and then gradually introduce more and more noise by multiplying the training labels by a salt-and-pepper vector of 0s and 1s drawn from a binomial distribution with decreasing probability of a 1. As shown in Figure 4, the accuracy of the supervised model decreases drastically as fewer un-noised labels pass through.
We also motivate our use of both the contrastive and reconstruction losses together on this dataset by performing ablation tests. In each test, we set the coefficient of the corresponding loss to zero (for the “no contrastive loss” and “no reconstruction loss” models, respectively) and run the model with the same parameters as before. We find that the model with both losses performs the best across all the metrics we have considered in this experiment (Table 4). This is understandable, as each loss provides a different necessary component for producing a meaningful embedding and, eventually, a meaningful segmentation of the original image.
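The label-degradation procedure described above amounts to an elementwise product with a random 0/1 mask. A minimal sketch follows; the function name and the fixed seed are illustrative choices.

```python
import numpy as np


def mask_labels(labels, keep_probability, seed=0):
    """Multiply labels by a 0/1 mask whose entries are 1 with `keep_probability`."""
    rng = np.random.default_rng(seed)
    mask = rng.binomial(1, keep_probability, size=labels.shape)
    return labels * mask


labels = np.ones((100, 100), dtype=int)
for p in (1.0, 0.5, 0.1):
    kept = mask_labels(labels, p).mean()
    print(p, round(float(kept), 2))   # fraction of positive labels that survive
```

Because the mask multiplies the labels, only positive pixels can be altered: a masked positive pixel becomes background, while background pixels are unchanged.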
3.2.3 MRI images
In our next experiment, we analyze the performance of our network on a dataset of brain magnetic resonance (MR) images obtained from clinical patients. This dataset includes 100 patients with Alzheimer’s dementia or mild cognitive impairment. We attempt to automatically segment the brain ventricles on the MR images of these patients. Segmenting ventricles is clinically important because the volume of the brain ventricles can predict the progression of dementia. The ground truth has been identified from human-annotated labels and, as in the other cases, is used exclusively for evaluating performance, since our method is entirely unsupervised in training.
Challenges in this dataset include an increased complexity in the shape and size of the segmentation target. Since the MR imaging plane (i.e., the orientation of image acquisition relative to the brain) can vary between patients, the shape and size of the ventricles can vary significantly between patients. Additionally, pixel intensity alone cannot segment the brain ventricles, since both the background and other brain regions can have pixel intensities similar to those of the ventricles.
We show examples of the output of our model in Figure 5. Our model is able to identify the area of the ventricles in a wide variety of settings. The ventricles can assume different shapes on each MR image, as well as different sizes and different pixel intensities. Also, it is notable that the segments are not contiguous in some cases. Despite this challenge, our model is able to keep the similar regions of ventricles in the same segmentation unit.
We look at our results quantitatively using the same criteria as in the previous datasets. In Table 5, we see the same pattern relative to the alternative methods on the similarity- and distance-based metrics of SSIM, ERGAS, and RMSE. Our method produces segments that have the highest similarity within each of the produced segments. In Table 6, we see that the results are also upheld when calculating the Dice coefficient with the human-annotated labels.
| Method | SSIM | ERGAS | RMSE |
|---|---|---|---|
| Ours | 0.131 | 40434 ± 732 | 0.392 ± 0.007 |
| Random | 0.097 | 44913 ± 1103 | 0.413 ± 0.002 |
| Watershed | 0.098 ± 0.002 | 42911 ± 1217 | 0.410 ± 0.004 |
| Felzenszwalb | 0.118 ± 0.004 | 42530 ± 1068 | 0.394 ± 0.006 |
Our model represents a step forward in automatic image segmentation, using a combination of contrastive learning and autoencoder-like reconstruction to segment meaningful regions of an image without the use of any labels. As this offers the opportunity to increase both efficiency and objectivity in the image segmentation task, it opens up exciting applications in the field of medical imaging, where these features are both vital. Since general applicability is a major goal of this work, namely a model that can work in a wide variety of applications in medical imaging, we view future work as exploring the performance of our model on even more modalities and technologies. As new technologies are being developed at a rapid pace, applicability to as many of them as possible is an important quality for widespread adoption.
References
- [1] (2015) Reviewing Otsu’s method for image thresholding. International Journal of Applied Engineering Research 10 (9), pp. 21777–21783.
- [2] (2020) Contrastive learning of global and local features for medical image segmentation with limited annotations.
- [3] (2020) A simple framework for contrastive learning of visual representations.
- [4] (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
- [5] (2013) Methods and reproducibility of grading optimized digital color fundus photographs in the Age-Related Eye Disease Study 2 (AREDS2 report number 2). Investigative Ophthalmology & Visual Science 54 (7), pp. 4548–4554.
- [6] (2005) The Age-Related Eye Disease Study severity scale for age-related macular degeneration: AREDS report no. 17. Archives of Ophthalmology 123 (11), pp. 1484–1498.
- [7] Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2), pp. 167–181.
- [8] (2011) Fundus autofluorescence and spectral-domain optical coherence tomography characteristics in a rapidly progressing form of geographic atrophy. Investigative Ophthalmology & Visual Science 52 (6), pp. 3761–3766.
- [9] Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition, pp. 2366–2369.
- [10] (2018) nnU-Net: self-adapting framework for U-Net-based medical image segmentation. arXiv preprint arXiv:1809.10486.
- [11] (2018) A probabilistic U-Net for segmentation of ambiguous images. Advances in Neural Information Processing Systems 31.
- [12] (2013) Association between geographic atrophy progression and reticular pseudodrusen in eyes with dry age-related macular degeneration. Investigative Ophthalmology & Visual Science 54 (12), pp. 7362–7369.
- [13] (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, Vol. 2, pp. 416–423.
- [14] (2012) Topography of geographic atrophy in age-related macular degeneration. Investigative Ophthalmology & Visual Science 53 (8), pp. 4932–4939.
- [15] (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- [16] An improved binarization algorithm based on a water flow model for document image with inhomogeneous backgrounds. Pattern Recognition 38 (12), pp. 2612–2625.
- [17] (2018) Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
- [18] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- [19] (2019) ImageJ for the next generation of scientific image data. Microscopy and Microanalysis 25 (S2), pp. 142–143.
- [20] (2000) Quality of high resolution synthesised images: is there a simple criterion? In Third Conference “Fusion of Earth Data: Merging Point Measurements, Raster Maps and Remotely Sensed Images”, pp. 99–103.
- [21] (2020) Self-supervised learning of pixel-wise anatomical embeddings in radiological images.
- [22] (2001) Spectral relaxation for k-means clustering. Advances in Neural Information Processing Systems 14.
- [23] (2008) Methods for image fusion quality assessment: a review, comparison and analysis. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (Part B7), pp. 1101–1109.
- [24] (2021) Contrastive learning for label-efficient semantic segmentation.
- [25] (2018) UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11.